Influence of Functional Words, Term Weighting Measures and Classifiers on Text Classification

Dr. P. Vijaya Pal Reddy

Abstract


Automated text classification is a supervised learning task that assigns a category label to a new document based on a model generated by a classifier from a labeled training set of documents. The training and test documents need to be preprocessed to reduce the influence of non-content (functional) words on the model derived from the training set. This paper examines the influence of functional words on classifier performance. After preprocessing, the documents are represented in a machine-understandable format, the vector space model. The terms in each document are weighted using measures such as Term Frequency-Inverse Document Frequency (TF-IDF), Residual IDF (RIDF), the xI metric, Odds Ratio (OR(t)), Information Gain (IG(t)), Chi-squared (χ2(t, c)) and Mutual Information (MI(t)). The influence of these term weighting measures on the classification of news documents is also examined. Classification models are generated from the vector space representation of the training documents using the Naive Bayes (NB), K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) classifiers. The performance of the resulting models is measured with precision, recall, F1 and macro-F1 for the various combinations of term weighting measures, with and without functional words.
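The pipeline described above (functional word removal, term weighting, classifier training and evaluation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn, uses TF-IDF as the single weighting measure, treats scikit-learn's built-in English stop word list as the set of functional words, and uses the 20 Newsgroups corpus as a stand-in for the paper's news documents.

# Minimal sketch: compare NB, KNN and SVM with and without functional
# (stop) words under TF-IDF weighting, reporting macro-averaged scores.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

classifiers = {
    "NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": LinearSVC(),
}

# stop_words=None keeps all words; "english" removes functional words.
for stop_words in (None, "english"):
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(test.data)
    for name, clf in classifiers.items():
        clf.fit(X_train, train.target)
        predicted = clf.predict(X_test)
        # Macro-averaged precision, recall and F1 over all categories.
        p, r, f1, _ = precision_recall_fscore_support(
            test.target, predicted, average="macro")
        label = "no functional words" if stop_words else "all words"
        print(f"{name} ({label}): P={p:.3f} R={r:.3f} macro-F1={f1:.3f}")

A fuller study along the lines of the abstract would swap the TF-IDF vectorizer for the other weighting measures listed above and repeat the same train/evaluate loop for each combination.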
