Efficient Framework for Understanding Short Texts in Large-Scale Data Collection

G.T. Vineela, D.Gousiya Begum


Short texts differ from long documents: they have unique characteristics that make them difficult to understand and handle. Every day, billions of short texts are generated in the form of search queries, news titles, tags, chatbot messages, social media posts, and so on. Most of these short texts contain fewer than 5 words. They do not always follow the syntax of written language, so traditional NLP methods do not always apply to them. Many applications, including search engines, question answering systems, and online advertising, rely on short texts. Because they lack context, short texts usually suffer from data sparsity and ambiguity in their representations. Understanding, retrieving, classifying, and processing short texts therefore becomes a very difficult task.

In this paper, we propose a neural network based approach for understanding short texts, in which we represent texts as vectors with Recurrent Neural Networks (RNN) and use a semantic network to determine the intention behind them for clustering and understanding. The task of short text understanding, or conceptualization, can be divided into three steps: text segmentation, type detection, and concept labeling. In text segmentation, the input text is first pre-processed to remove any stop words and is then divided into a sequence of terms. Type detection is incorporated into the framework to help with disambiguation based on the various types of contextual information present in the text. Finally, concept labeling is performed to discover the hidden semantics of the natural language text. Various online applications, such as automatic question answering, recommendation systems, online advertising, and search engines, can benefit from conceptualization. All these applications require an information extraction phase whose first step is to extract the concepts from the input text.
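
The three steps named above can be illustrated with a minimal Python sketch. It is not the method of the paper: the stop-word list, the tiny `LEXICON` standing in for a semantic network, the type names, and every function name are hypothetical, and the RNN-based vector representation is left out entirely. Segmentation here is a greedy longest match against the lexicon, and concept labeling picks the concept whose supporting terms overlap most with the other terms in the text.

```python
# Illustrative sketch only: all data and names below are assumptions,
# not taken from the paper.
STOP_WORDS = {"the", "a", "an", "of", "in", "for", "to", "and"}

# term -> (type, {concept: context terms that support this concept})
LEXICON = {
    "apple": ("entity", {"company": {"iphone", "price"},
                         "fruit": {"pie", "recipe"}}),
    "apple pie": ("entity", {"dessert": set()}),
    "iphone": ("entity", {"smartphone": set()}),
    "recipe": ("concept", {"cooking instructions": set()}),
    "price": ("attribute", {"cost": set()}),
}
MAX_TERM_WORDS = 2  # longest multi-word term in the toy lexicon


def segment(text):
    """Step 1: drop stop words, then greedily match the longest known term."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    terms, i = [], 0
    while i < len(words):
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            # A single word is always accepted, even if it is unknown.
            if n == 1 or candidate in LEXICON:
                terms.append(candidate)
                i += n
                break
    return terms


def detect_type(term):
    """Step 2: look up the type of a term; unknown terms get 'unknown'."""
    return LEXICON[term][0] if term in LEXICON else "unknown"


def label_concept(term, context):
    """Step 3: choose the concept best supported by the other terms."""
    if term not in LEXICON:
        return None
    concepts = LEXICON[term][1]
    # Ties fall back to the first concept listed for the term.
    return max(concepts, key=lambda c: len(concepts[c] & context))


def understand(text):
    """Run the three steps and return (term, type, concept) triples."""
    terms = segment(text)
    return [(t, detect_type(t), label_concept(t, set(terms) - {t}))
            for t in terms]


print(understand("price of the apple iphone"))
print(understand("recipe for an apple"))
```

In this toy setup, "apple" is labeled as a company when it appears next to "iphone" and "price", and as a fruit when it appears next to "recipe", which is the kind of context-based disambiguation the abstract describes. The sketch splits on whitespace only, so punctuation handling would need to be added for real input.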

Copyright (c) 2018 Edupedia Publications Pvt Ltd

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


All published Articles are Open Access at  https://journals.pen2print.org/index.php/ijr/ 