Optimized short text embedding for bilingual similarity using Probase and BabelNet

Natasha J.; Vijayarani J.

doi:XX.XXX/IJARIIT-V5I3-1327

This paper is published in Volume-5, Issue-3, 2019

Area: Big Data and Text Mining
Author: Natasha J., Vijayarani J.
Org/Univ: Anna University, CEG Campus, Chennai, Tamil Nadu, India
Pub. Date: 17 May, 2019
Paper ID: V5I3-1327
Publisher: IJARIIT
Edition: Volume-5, Issue-3, 2019
Keywords: Short text, Conceptualization, Probase, BabelNet, Skipgram, Word2Vec, Concept2Vec

Citations

IEEE
Natasha J., Vijayarani J. Optimized short text embedding for bilingual similarity using Probase and BabelNet, International Journal of Advance Research, Ideas and Innovations in Technology, www.IJARIIT.com.

APA
Natasha J., Vijayarani J. (2019). Optimized short text embedding for bilingual similarity using Probase and BabelNet. International Journal of Advance Research, Ideas and Innovations in Technology, 5(3). www.IJARIIT.com.

MLA
Natasha J., Vijayarani J. "Optimized short text embedding for bilingual similarity using Probase and BabelNet." International Journal of Advance Research, Ideas and Innovations in Technology 5.3 (2019). www.IJARIIT.com.

Abstract

Most existing methods for text classification represent text as vectors of words, that is, as a "bag of words." This representation results in a high-dimensional feature space and often suffers from surface-form mismatch. For short texts, these problems become more severe because the texts are brief and sparse, and measuring similarity across two languages makes the task harder still. This paper proposes an approach that addresses both the sparsity and the computational complexity of bilingual short-text similarity. English short text is mapped to Probase, and Hindi short text is mapped to BabelNet, a knowledge base covering words and concepts in 248 languages. A semantic network is built to model word-to-word and concept-to-concept correlations. Unlike earlier embedding approaches, words and concepts from both the English and the Hindi short texts are treated separately, yielding a word embedding (Word2Vec) and a concept embedding (Concept2Vec), respectively. The similarity between bilingual short texts is computed from the skip-gram-based word and concept embeddings. When evaluated on the Pilot and STSS 131 short-text benchmark datasets, the proposed optimized bilingual short-text embedding gives better similarity scores.
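
The idea of scoring two short texts from separate word and concept embeddings can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two embeddings are combined, so the averaging of vectors, the cosine measure, the mixing weight `alpha`, the function names, and the tiny hand-written vectors (standing in for trained skip-gram vectors in a shared English-Hindi space) are all assumptions made for illustration.

```python
import math

# Toy 2-dimensional vectors, invented for illustration only.
# In practice these would come from trained skip-gram models.
WORD_VECS = {"apple": [0.9, 0.1], "सेब": [0.8, 0.2], "car": [0.1, 0.9]}
CONCEPT_VECS = {"fruit": [1.0, 0.0], "vehicle": [0.0, 1.0]}


def mean_vector(tokens, table):
    """Average the vectors of the tokens present in `table` (None if none are)."""
    rows = [table[t] for t in tokens if t in table]
    if not rows:
        return None
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def short_text_similarity(words_a, concepts_a, words_b, concepts_b, alpha=0.5):
    """Mix word-level and concept-level cosine similarity (alpha is hypothetical)."""
    word_sim = cosine(mean_vector(words_a, WORD_VECS),
                      mean_vector(words_b, WORD_VECS))
    concept_sim = cosine(mean_vector(concepts_a, CONCEPT_VECS),
                         mean_vector(concepts_b, CONCEPT_VECS))
    return alpha * word_sim + (1 - alpha) * concept_sim


# An English text and a Hindi text sharing the concept "fruit",
# versus an unrelated pair.
related = short_text_similarity(["apple"], ["fruit"], ["सेब"], ["fruit"])
unrelated = short_text_similarity(["apple"], ["fruit"], ["car"], ["vehicle"])
print(related > unrelated)
```

In this sketch the concepts are passed in directly; in the approach described above they would be obtained by mapping the English text to Probase and the Hindi text to BabelNet.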