9th World Conference on Information Systems and Technologies


GOT: Generalization Over Taxonomies, A Software Toolkit For Content Analysis With Taxonomies

GOT is a Python 3 software toolkit for content analysis of text collections using domain taxonomies. The toolkit's structure follows a hybrid methodology developed in recent research. The effectiveness of this methodology was illustrated in an analysis of research tendencies in Data Science, where the findings led to insights that could not be derived using more conventional techniques. The methodology takes a collection of texts and a domain taxonomy as input and comprises three steps: (1) computing matrices of relevance between texts and taxonomy leaf concepts using a purely structural string-to-text relevance measure, based on suffix trees that represent the texts and are annotated with string frequencies; (2) finding fuzzy clusters of taxonomy leaf topics using an in-house method involving both additive and spectral properties; and (3) finding the most specific generalizations of the fuzzy clusters in the rooted tree of the taxonomy. Such a generalization parsimoniously lifts a cluster to its 'head subject' in the higher ranks of the taxonomy so as to cover the cluster tightly, up to a few errors ('gaps' and/or 'offshoots'). Users may run the implementation of the whole methodology or use its individual modules, including a visualization module. The GOT toolkit supports two usage scenarios: (a) console mode, for use from the command line, and (b) import mode, for use in Python 3 source code.
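To make the idea of lifting in step (3) concrete, the following minimal Python sketch lifts a crisp (non-fuzzy) cluster of leaf topics to its lowest common ancestor and lists the resulting gaps. It is a deliberately simplified stand-in, not the GOT algorithm or its API: the toolkit's method works with fuzzy clusters and chooses head subjects parsimoniously, whereas this sketch always takes the lowest common ancestor and ignores offshoots. The taxonomy `PARENT` and the function names are hypothetical and serve only as illustration.

```python
from collections import defaultdict

# Hypothetical toy taxonomy, given as child -> parent; "root" has no parent.
PARENT = {
    "ml": "root",
    "db": "root",
    "clustering": "ml",
    "classification": "ml",
    "regression": "ml",
    "sql": "db",
    "indexing": "db",
}


def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def leaves_under(node, parent):
    """Return the set of leaf concepts in the subtree rooted at `node`."""
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    stack, leaves = [node], set()
    while stack:
        current = stack.pop()
        if children[current]:
            stack.extend(children[current])
        else:
            leaves.add(current)
    return leaves


def lift_cluster(cluster, parent):
    """Lift a crisp leaf cluster to its lowest common ancestor and list the gaps.

    Simplification (an assumption of this sketch): the head subject is always
    the lowest common ancestor, and a gap is any leaf under the head subject
    that does not belong to the cluster.
    """
    paths = [path_to_root(leaf, parent) for leaf in cluster]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from a leaf, the first node shared by all paths is the deepest one.
    head = next(node for node in paths[0] if node in common)
    gaps = leaves_under(head, parent) - set(cluster)
    return head, gaps


head, gaps = lift_cluster(["clustering", "classification"], PARENT)
print(head, sorted(gaps))  # ml ['regression']
```

Here the cluster {clustering, classification} is lifted to the head subject `ml` at the cost of one gap, `regression`. A parsimonious method would weigh such gaps against the alternative of keeping the two leaves as separate head subjects.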

Dmitry Frolov
HSE University

Boris Mirkin
HSE University and Birkbeck University of London

