Thursday, September 15th


09:30-10:30 Session 8: Keynote II

Chair:

09:30

Design Is Not What You Think It Is
SPEAKER: Peter Biľak

ABSTRACT. In this talk, Peter Biľak will examine the ways that current publishing practices are rooted in the 19th century, and how in order to move forward, we may have to go back to the roots and reconnect with readers. He will also talk about his recent project, Works That Work magazine, which set out to rethink publishing paradigms, starting with its financing, distribution and production. Works That Work aims to discuss design outside of the traditional design discourse, and Biľak will argue for widening the understanding of the design discipline.

10:30-11:00 Coffee Break

11:00-12:15 Session 9: Text Analysis II: Classification

11:00

SEL: a Unified Algorithm for Entity Linking and Saliency Detection

ABSTRACT. The Entity Linking task consists in automatically identifying and linking the entities mentioned in a text to their URIs in a given Knowledge Base, e.g., Wikipedia. Entity Linking has a large impact on several text analysis and information retrieval tasks, and it is very challenging due to natural language ambiguity. However, not all the entities mentioned in a document have the same relevance and utility for understanding the topics being discussed. Thus, the related problem of identifying the most relevant entities present in a document, also known as Salient Entities, is attracting increasing interest. In this paper we propose SEL, a novel supervised two-step algorithm that comprehensively addresses both entity linking and saliency detection. The first step is based on a classifier aimed at identifying a set of candidate entities that are likely to be mentioned in the document, thus maximizing the precision of the method without hindering its recall. The second step is also based on machine learning and aims at choosing, from the previous set, the entities that actually occur in the document. We tested two different versions of the second step: one aimed at solving only the entity linking task, and another that, besides detecting linked entities, also scores them according to their saliency. Experiments conducted on two different datasets show that the proposed algorithm outperforms state-of-the-art competitors and is able to detect salient entities with high accuracy.
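
The abstract does not give implementation details, so the following is only a rough, hypothetical sketch of such a two-step supervised pipeline: a first classifier filters candidate entities, and a second model scores the survivors for saliency. The features, models and data below are placeholders, not the SEL system.

    # Hypothetical sketch of a two-step entity pipeline: step 1 filters candidate
    # mentions with a binary classifier, step 2 scores the survivors for saliency.
    # Feature extraction is faked with random numbers; only the structure matters.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

    rng = np.random.default_rng(0)

    # Step 1: candidate filtering (binary: is this spot really an entity mention?)
    X_cand = rng.random((500, 10))          # per-candidate features (placeholder)
    y_cand = rng.integers(0, 2, 500)        # gold labels: mention / not a mention
    step1 = GradientBoostingClassifier().fit(X_cand, y_cand)

    # Step 2: saliency scoring, trained only on candidates that pass step 1
    keep = step1.predict(X_cand) == 1
    X_sal = X_cand[keep]
    y_sal = rng.random(keep.sum())          # gold saliency scores (placeholder)
    step2 = GradientBoostingRegressor().fit(X_sal, y_sal)

    # At prediction time: filter, then rank surviving entities by saliency
    X_new = rng.random((20, 10))
    linked = X_new[step1.predict(X_new) == 1]
    saliency = step2.predict(linked)
    print(sorted(saliency, reverse=True)[:5])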

11:30

Automated Intrinsic Text Classification for Component Content Management Applications in Technical Communication
SPEAKER: Jan Oevermann

ABSTRACT. Classification models are used in component content management to identify content components for retrieval, reuse and distribution. Intrinsic metadata, such as the assigned information class, play an important role in these tasks. With the increasing demand for efficient classification of content components, the technical documentation sector needs mechanisms that allow such tasks to be automated. Approaches based on the vector space model can produce sufficient results while maintaining good performance, but they must be adapted to the peculiarities of modular technical documents.

In this paper we present the domain-specific differences and characteristics particular to the field of technical documentation and derive methods for adapting widespread classification and retrieval techniques to these tasks. We verify our approach with data provided by companies in the plant and mechanical engineering sector and use it for supervised learning and automated classification.
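
As a rough, generic illustration of vector-space-model classification of content components (and not the adapted, domain-specific method the paper develops), the sketch below pairs TF-IDF features with a linear classifier; the example components and information classes are invented.

    # Minimal vector-space-model classification sketch for content components.
    # Texts and information classes below are invented placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    components = [
        "Remove the cover and disconnect the power supply.",    # task
        "The pump unit consists of a motor and an impeller.",   # description
        "Risk of electric shock. Disconnect before servicing.", # warning
    ]
    info_classes = ["task", "description", "warning"]

    clf = make_pipeline(TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
                        LinearSVC())
    clf.fit(components, info_classes)

    print(clf.predict(["Disconnect the hose and drain the tank."]))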

11:45

Centroid Terms as Text Representatives

ABSTRACT. Algorithms that topically cluster and classify texts rely on information about their semantic distances and similarities. Standard methods based on the bag-of-words model return only rough estimates of the relatedness of texts and are inherently unable to find generalising terms or abstractions describing the textual contents. A new method to determine centroid terms of texts and to evaluate text similarity using these representative terms is introduced, and its results and advantages are discussed in first experiments.
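
The abstract does not spell out the algorithm; one plausible reading of a "centroid term" is the term whose average distance to all other terms of the text is smallest under some term-distance measure. The toy sketch below works under that assumption, using inverse sentence co-occurrence as a stand-in distance; it is not the paper's actual method.

    # Toy sketch: pick the "centroid term" of a text as the term with the smallest
    # average distance to all other terms, using inverse sentence co-occurrence as
    # a stand-in distance measure. This reading is an assumption, not the paper's method.
    from itertools import combinations
    from collections import Counter

    text = [
        ["solar", "panels", "convert", "sunlight", "energy"],
        ["wind", "turbines", "convert", "wind", "energy"],
        ["energy", "storage", "balances", "solar", "wind"],
    ]

    cooc = Counter()
    for sentence in text:
        for a, b in combinations(set(sentence), 2):
            cooc[frozenset((a, b))] += 1

    terms = sorted({t for s in text for t in s})

    def dist(a, b):
        return 1.0 / (1 + cooc[frozenset((a, b))])

    centroid = min(terms, key=lambda t: sum(dist(t, u) for u in terms if u != t))
    print(centroid)   # prints the most central term here, "energy"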

12:00

Frequent Multi-Byte Character Substring Extraction using a Succinct Data Structure

ABSTRACT. Frequent string mining is widely used in text processing to extract text features. Most research has focused on text in single-byte characters; however, such methods run into problems when applied to text represented with multi-byte characters, such as Japanese and Chinese. The main drawback is the huge memory usage needed to handle multi-byte character strings. To solve this problem, we apply a wavelet tree-based compressed suffix array instead of the normal suffix array to reduce memory usage. We then propose a novel technique that utilizes the rank operation to improve run-time efficiency. The experimental evaluation shows that the proposed method reduces processing time by 45% compared to a method using only the compressed suffix array, and that it reduces memory usage by 75%.
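
A wavelet-tree-based compressed suffix array is too involved for a short sketch, but the basic operation it accelerates, counting how often a (multi-byte) substring occurs by locating its contiguous range of suffixes, can be illustrated with a plain, uncompressed approach; the example text and pattern are arbitrary.

    # Counting substring frequency via sorted suffixes. The proposed method
    # replaces this memory-hungry explicit list with a wavelet-tree-based
    # compressed suffix array and rank queries for the same range computation.
    from bisect import bisect_left, bisect_right

    def count_occurrences(text, pattern):
        # All suffixes that start with `pattern` are adjacent in the sorted list.
        suffixes = sorted(text[i:] for i in range(len(text)))
        lo = bisect_left(suffixes, pattern)
        hi = bisect_right(suffixes, pattern + "\uffff")  # just past every match
        return hi - lo

    text = "東京都と京都、東京と京都"       # multi-byte (Japanese) example text
    print(count_occurrences(text, "京都"))  # -> 3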

12:45-14:15 Lunch Break

14:45-15:05 Session 12: Poster Lightning Talks

14:45

Multilingual News Article Summarization in Mobile Devices – Demo

ABSTRACT. Mobile devices such as smart phones and tablets are omnipresent in modern societies and allow browsing the Internet. This demo paper briefly describes two tools for mobile devices that attempt to automatically collect the most important information from news articles in web pages.

14:47

Rendering Mathematical Formulas for the Web in Madoko
SPEAKER: Daan Leijen

ABSTRACT. Madoko is a novel authoring system for writing complex documents. It is especially well suited for complex academic or industrial documents, like scientific articles, reference manuals, or math-heavy presentations. This application note describes progress made over the last year and details how math formulas are rendered in HTML. Moreover we show how other mechanisms, like replacement rules, help with creating mini domain-specific extensions to cleanly express complex math.

14:49

A PDF Wrapper for Table Processing

ABSTRACT. We propose a PDF document wrapper system that is specifically targeted at table processing applications. We (i) review the PDF specifications and identify particular challenges from the table processing point of view, and (ii) specify a table-oriented document model containing the required atomic elements for table extraction and understanding applications. Our evaluation shows that the wrapper is able to detect important features such as page columns, bullets and numbering with over 90% accuracy on all measures, leading to better table locating and segmenting.

14:51

Configurable Table Structure Recognition in Untagged PDF documents

ABSTRACT. Today, PDF is one of the most popular document formats on the web. Many PDF documents are not scanned images, yet they remain untagged: they have no tags identifying the logical reading order, paragraphs, figures and tables. One of the challenges with these documents is how to extract tables from them. The paper discusses a new system for table structure recognition in untagged PDF documents, formulated as a set of configurable parameters and ad-hoc heuristics for recovering table cells. We consider two different configurations of the system and demonstrate experimental results for both on an existing competition dataset.

14:53

Extending data models by declaratively specifying contextual knowledge

ABSTRACT. The research data landscape of the arts and humanities is characterized by a high degree of heterogeneity. To improve interoperability, recent initiatives and research infrastructures are encouraging the use of standards and best practices. However, custom data models are often considered necessary to exactly reflect the requirements of a particular collection or research project.

To address the needs of scholars in the arts and humanities for a composition of research data irrespective of the degree of structuredness and standardization, we propose a concept on the basis of formal languages, which facilitates declarative data modeling by respective domain experts. By identifying and defining grammatical patterns and deriving transformation functions, the structure of data is generated or extended in accordance with the particular context and needs of the domain.

14:55

Using Convolutional Neural Networks for Content Extraction from Online Flyers

ABSTRACT. The rise of online shopping has hurt physical retailers, which struggle to persuade customers to buy products in physical stores rather than online. Marketing flyers are a great means of increasing the visibility of physical retailers, but the unstructured offers appearing in those documents cannot easily be compared with similar online deals, making it hard for a customer to understand whether it is more convenient to order a product online or to buy it in the physical shop. In this work we tackle this problem, introducing a content extraction algorithm that automatically extracts structured data from flyers. Unlike competing approaches that mainly focus on textual content or simply analyze font type, color and text positioning, we propose a new approach that uses Convolutional Neural Networks to classify words extracted from flyers, the kind of marketing material typically designed to attract the attention of readers towards specific deals. We obtained good results and a high degree of language and genre independence.
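
Purely as a hypothetical illustration of classifying individual words with a convolutional network (the paper's actual features, classes and architecture may differ), a minimal character-level CNN in PyTorch could look like this:

    # Hypothetical sketch of a small 1D character-CNN that classifies single words
    # (e.g. into "product", "price", "other"). Classes and architecture are assumptions.
    import torch
    import torch.nn as nn

    NUM_CHARS, MAX_LEN, NUM_CLASSES = 128, 16, 3

    class WordCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(NUM_CHARS, 32)
            self.conv = nn.Conv1d(32, 64, kernel_size=3, padding=1)
            self.fc = nn.Linear(64, NUM_CLASSES)

        def forward(self, char_ids):                  # (batch, MAX_LEN)
            x = self.embed(char_ids).transpose(1, 2)  # (batch, 32, MAX_LEN)
            x = torch.relu(self.conv(x))              # (batch, 64, MAX_LEN)
            x = x.max(dim=2).values                   # max-pool over characters
            return self.fc(x)                         # (batch, NUM_CLASSES)

    def encode(word):
        ids = [min(ord(c), NUM_CHARS - 1) for c in word[:MAX_LEN]]
        return torch.tensor(ids + [0] * (MAX_LEN - len(ids)))

    model = WordCNN()
    batch = torch.stack([encode("tomatoes"), encode("$2.99")])
    print(model(batch).argmax(dim=1))   # untrained: arbitrary class indices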

14:57

Combining Taxonomies using Word2Vec

ABSTRACT. Taxonomies have gained a broad usage in a variety of fields due to their extensibility, as well as their use for classification and knowledge organization. Of particular interest is the digital document management domain in which their hierarchical structure can be effectively employed in order to organize documents into content-specific categories. Common or standard taxonomies (e.g., the ACM Computing Classification System) contain concepts that are too general for conceptualizing specific knowledge domains. In this paper we introduce a novel automated approach that combines sub-trees from general taxonomies with specialized seed taxonomies by using specific Natural Language Processing techniques. We provide an extensible and generalizable model for combining taxonomies in the practical context of two very large European research projects. Because the manual combination of taxonomies by domain experts is a highly time consuming task, our model measures the semantic relatedness between concept labels in CBOW or skip-gram Word2vec vector spaces. A preliminary quantitative evaluation of the resulting taxonomies is performed after applying a greedy algorithm with incremental thresholds used for matching and combining topic labels.
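
A toy sketch of the core matching step, comparing concept labels by cosine similarity of averaged word vectors and attaching a label greedily once a threshold is exceeded, is shown below; the three-dimensional vectors and the label sets stand in for real CBOW/skip-gram Word2vec embeddings and real taxonomies.

    # Toy sketch of matching topic labels between two taxonomies by cosine
    # similarity of (averaged) word vectors with a greedy threshold. The toy
    # 3-dimensional vectors stand in for real Word2vec embeddings.
    import numpy as np

    vectors = {                       # placeholder embeddings
        "machine":  np.array([0.9, 0.1, 0.0]),
        "learning": np.array([0.8, 0.2, 0.1]),
        "neural":   np.array([0.7, 0.3, 0.0]),
        "networks": np.array([0.6, 0.4, 0.1]),
        "poetry":   np.array([0.0, 0.1, 0.9]),
    }

    def label_vector(label):
        words = [w for w in label.lower().split() if w in vectors]
        return np.mean([vectors[w] for w in words], axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    general = ["machine learning", "poetry"]   # sub-tree of a general taxonomy
    seed = ["neural networks"]                 # specialized seed taxonomy
    threshold = 0.9

    for s in seed:
        best = max(general, key=lambda g: cosine(label_vector(s), label_vector(g)))
        if cosine(label_vector(s), label_vector(best)) >= threshold:
            print(f"attach '{s}' under '{best}'")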

14:59

Important Word Organization for Support of Browsing Scholarly Papers Using Author Keywords

ABSTRACT. When beginning researchers read scholarly papers, they often encounter technical terms they do not know, which can take considerable time. We have therefore been developing an interface to support browsing scholarly papers that gives users useful links for technical terms. The interface displays important words extracted from papers on top of the page images. In this study, we propose organizing such important words using the papers' author keywords. We first identify important words in papers and then associate the important words with author keywords using word2vec. Experiments showed that our method improved the classification accuracy of important words compared to a simple baseline and associated each author keyword with about 2.5 relevant important words.

15:01

Selecting Features with Class Based and Importance Weighted Document Frequency in Text Classification
SPEAKER: Baoli Li

ABSTRACT. Document Frequency (DF) is reported to be a simple yet quite effective measure for feature selection in text classification, which is a key step in processing big textual data collections. It is calculated from how many documents in a collection contain a feature, which can be a word, a phrase, an n-gram, or a specially derived attribute; it is an unsupervised, class-independent metric. However, features with the same DF value may have quite different distributions across categories, and thus different discriminative power. Moreover, different features play different roles in delivering the content of a document: the chosen features are expected to be the important ones that carry the main information of the collection, yet the traditional DF metric considers all features equally important. To overcome these weaknesses, we propose a class-based and importance-weighted document frequency measure that further refines the original DF. Extensive experiments on two text classification datasets demonstrate the effectiveness of the proposed metric.
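
The abstract does not give the exact formula, but the contrast between plain DF and a class-based, importance-weighted variant can be sketched as follows; the per-document weighting used here (relative term frequency) is an assumption for illustration only.

    # Rough sketch: plain DF vs. a class-based, importance-weighted variant.
    from collections import defaultdict

    # Toy corpus: (term -> frequency, class label) per document.
    docs = [
        ({"goal": 3, "match": 2, "the": 9}, "sports"),
        ({"match": 1, "election": 4, "the": 8}, "politics"),
        ({"goal": 1, "team": 2, "the": 7}, "sports"),
    ]

    def plain_df(feature):
        # Classic DF: number of documents containing the feature, class-blind.
        return sum(1 for counts, _ in docs if feature in counts)

    def class_weighted_df(feature):
        # Per-class DF, weighting each document by the feature's relative
        # frequency in it (a stand-in for the paper's importance weighting).
        per_class = defaultdict(float)
        for counts, label in docs:
            if feature in counts:
                per_class[label] += counts[feature] / sum(counts.values())
        return dict(per_class)

    for f in ["goal", "match"]:
        print(f, plain_df(f), class_weighted_df(f))
    # "goal" and "match" share the same plain DF (2) but have very different
    # class distributions, which is the information the refined metric keeps.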

15:03

Bayesian mixture models on connected components for Newspaper article segmentation

ABSTRACT. In this paper we propose a new method for automated segmentation of scanned newspaper pages into articles. Article regions are produced by merging sub-article-level content and title regions. We use a Bayesian Gaussian mixture model to model page connected-component information and to cluster the input into sub-article components. The Bayesian model is conditioned on a prior distribution over region features, aiding classification into titles and content. Using a Dirichlet prior, we are able to automatically and correctly estimate the number of title and article regions. The method is tested on a dataset of digitized historical newspapers, where the visual experimental results are very promising.
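
A rough sketch of the central modelling step, fitting a Dirichlet-process Bayesian Gaussian mixture so that the effective number of clusters is inferred from the data, is given below using scikit-learn and synthetic connected-component features (x position, y position, height) in place of real page data.

    # Sketch: cluster connected-component features (synthetic x, y, height) with a
    # Dirichlet-process Bayesian Gaussian mixture; the effective number of
    # clusters is inferred from the data rather than fixed in advance.
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(0)
    titles = rng.normal([300, 80, 40], [80, 10, 4], size=(30, 3))       # large, near top
    content = rng.normal([300, 600, 12], [150, 250, 2], size=(300, 3))  # small body text
    X = np.vstack([titles, content])

    bgm = BayesianGaussianMixture(
        n_components=10,                                  # upper bound only
        weight_concentration_prior_type="dirichlet_process",
        random_state=0,
    ).fit(X)

    labels = bgm.predict(X)
    print("components actually used:", len(np.unique(labels)))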

15:05-16:00 Coffee & Poster Session

16:00-16:45 Session 13: Text Analysis III: Summarization

16:00

Applying Link Target Identification and Content Extraction to improve Web News Summarization

ABSTRACT. Existing automatic text summarization systems perform poorly when applied to web pages of news articles, because the article text is encapsulated within an HTML page. This paper takes advantage of link target identification and content extraction techniques. The results show the validity of such a strategy.

16:15

Towards Cohesive Extractive Summarization through Anaphoric Expression Resolution

ABSTRACT. This paper presents a new method for improving the cohesiveness of summaries generated by extractive summarization systems. The solution attempts to improve the legibility and cohesion of the generated summaries through coreference resolution: a heuristic post-processing step binds dangling coreferences to the most important entity in the corresponding coreference chain. The proposed solution was evaluated on the CNN corpus of 3,000 news articles, using four state-of-the-art summarization systems and seventeen sentence-scoring techniques proposed in the literature. The results are encouraging, as the final summaries reached better ROUGE scores while also being more cohesive.
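
As a toy illustration of the post-processing idea (not the paper's implementation), the sketch below substitutes a dangling pronoun at the start of a summary sentence with the most important mention of its coreference chain; the chain, the importance proxy and the example sentence are all invented.

    # Toy sketch: if a summary sentence opens with a "dangling" pronoun from a
    # coreference chain, substitute the chain's most important (here: most
    # frequent non-pronoun) mention. A real system would get chains from a resolver.
    summary = ["She announced the new policy on Monday."]
    chains = [
        {"mentions": {"Angela Merkel": 4, "the chancellor": 2, "She": 3},
         "pronouns": {"She", "she", "her"}},
    ]

    def most_important(chain):
        candidates = {m: n for m, n in chain["mentions"].items()
                      if m not in chain["pronouns"]}
        return max(candidates, key=candidates.get)

    resolved = []
    for sentence in summary:
        for chain in chains:
            for pronoun in chain["pronouns"]:
                if sentence.startswith(pronoun + " "):   # dangling: no antecedent yet
                    sentence = most_important(chain) + sentence[len(pronoun):]
        resolved.append(sentence)

    print(resolved)   # ['Angela Merkel announced the new policy on Monday.']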

16:30

Assessing Concept Weighting in Integer Linear Programming based Single-document Summarization

ABSTRACT. Some of the recent state-of-the-art systems for Automatic Text Summarization rely on a concept-based approach using Integer Linear Programming (ILP), mainly for multi-document summarization. A study of the suitability of such an approach for single-document summarization is still missing, however. This work presents an assessment of several methods of concept weighting for a concept-based ILP approach in the single-document summarization scenario. The unigram and bigram representations of concepts are also investigated. The experimental results obtained on the DUC 2001-2002 and CNN corpora show that bigrams are more suitable than unigrams for representing concepts. Among the concept scoring methods investigated, the sentence position method achieved the best performance on all evaluation corpora.
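
For context, a minimal concept-based ILP in the style this line of work builds on (Gillick and Favre) maximizes the summed weight of covered concepts, e.g. bigrams, under a length budget. The sketch below, using PuLP, is a generic illustration with toy weights, not the exact model assessed in the paper.

    # Minimal concept-based ILP summarization sketch: choose sentences so the
    # total weight of covered concepts is maximal under a length budget.
    # Sentences, concepts and weights below are toy placeholders.
    from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary, PULP_CBC_CMD

    sentences = {            # sentence id -> (length in words, set of concepts)
        "s1": (8, {"heavy rain", "rain caused", "caused floods"}),
        "s2": (6, {"heavy rain", "rain expected"}),
        "s3": (9, {"caused floods", "floods damaged", "damaged roads"}),
    }
    weights = {"heavy rain": 3, "caused floods": 3, "floods damaged": 2,
               "damaged roads": 2, "rain caused": 1, "rain expected": 1}
    budget = 18              # maximum summary length in words

    s = {i: LpVariable(f"s_{i}", cat=LpBinary) for i in sentences}
    c = {j: LpVariable(f"c_{k}", cat=LpBinary) for k, j in enumerate(weights)}

    prob = LpProblem("summarization", LpMaximize)
    prob += lpSum(weights[j] * c[j] for j in weights)                  # objective
    prob += lpSum(sentences[i][0] * s[i] for i in sentences) <= budget
    for j in weights:       # a concept counts only if a selected sentence covers it
        prob += c[j] <= lpSum(s[i] for i in sentences if j in sentences[i][1])

    prob.solve(PULP_CBC_CMD(msg=False))
    print([i for i in sentences if s[i].value() == 1])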