Considerations about corpus-dependency of topic modelling with Mallet

By Sara Garzone and Nicola Ruschena

In the context of text mining, topic modelling analyses co-occurrence patterns in textual data in order to isolate clusters from the set of expressions occurring in a corpus. It aims at extracting the topics that occur in a corpus and at categorising documents on the basis of their semantic content. Topic modelling often represents an appealing approach for data-driven analysis in short-run projects, since it is an unsupervised method: no training on labelled data, whose production is quite a demanding task, is required. Moreover, software packages executable from the command line or through a user interface have been developed to perform topic modelling, providing friendlier environments for researchers who are less acquainted with programming.

Mallet is a tool for topic modelling: it is a Java-based package for statistical natural language processing, initially developed by Andrew McCallum at the University of Massachusetts (McCallum 2002). It allows topic modelling on textual corpora without requiring advanced technical knowledge of statistics or programming.

Mallet’s topic modelling is based on the Latent Dirichlet Allocation (LDA) model, a Bayesian probabilistic generative model first applied to text classification tasks by David Blei et al. (2003), which has since become the standard for probabilistic text categorisation under latent semantic hypotheses. Like many other techniques in the field of natural language processing, topic modelling relies on the so-called distributional hypothesis (Harris 1954), according to which words occurring in the same contexts tend to have similar meanings.

From co-occurrence analysis and clustering, one can therefore expect clusters to reflect relations of semantic proximity, i.e., topics. With advanced applications of probabilistic models, a categorisation of documents can then be obtained on the basis of the probability of their belonging to the detected topics. The underlying assumption is that each document can be described by a probability distribution over topics. With LDA-based topic modelling, one can thus estimate which of the topics detected in the corpus are likely to be present in each document, given the terms that occur in it.
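In the notation of Blei et al. (2003), each document is generated by drawing topic proportions \theta from a Dirichlet prior with parameter \alpha and then drawing, for each word position n, a topic assignment z_n from \theta and a word w_n from the corresponding topic–word distribution \beta; the joint probability of a document of N words is

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)

Inference inverts this generative story: given the observed words w, it estimates the per-document topic proportions \theta.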

Yet clusters do not correspond immediately to topics in the sense in which a generic reader would understand them. Such common-sense topics depend on human interests, and this is especially true when automated analyses are implemented for research purposes. It should be noted that Mallet produces a number k of topics that researchers must set in advance, and that it would return k topics even if it were “fed” phone directories or meaningless data. Some legwork is therefore required to experiment with different values of k and to evaluate which setting returns the most consistent clusters, according both to the researchers’ prior specialist knowledge and to Mallet’s Dirichlet parameter (see the output excerpts later in this post).
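A minimal sketch of this trial-and-error workflow (assuming a Unix shell, a Mallet installation run from its own directory, and a folder of plain-text documents named corpus/; all file names here are illustrative):

bin/mallet import-dir --input corpus/ --output corpus.mallet --keep-sequence --remove-stopwords
bin/mallet train-topics --input corpus.mallet --num-topics 10 --output-topic-keys tutorial.keys --output-doc-topics tutorial.composition
bin/mallet train-topics --input corpus.mallet --num-topics 20 --output-topic-keys tutorial_20.keys --output-doc-topics tutorial_20.composition

The corpus is imported once; train-topics is then re-run with different values of --num-topics, and the resulting .keys files are compared for consistency.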

Moreover, in order to obtain reliable results, researchers have to be aware in advance of certain features of the corpora under investigation, such as size, sparsity and degree of specialisation, which may condition the effectiveness of topic identification.

In this brief post we report three topic modelling experiments conducted with Mallet, aimed at extracting topics from three corpora that differ in content, size and composition. It is worth noting that all three features make a difference in topic retrieval. For a thorough tutorial on Mallet, see “Getting Started with Topic Modeling and MALLET” by Graham, Weingart and Milligan (2012), available on The Programming Historian website.

Case 1

The first corpus includes about 67,000 articles citing the philosopher Michel Foucault, published in the field of the humanities from 1980 to 2019. This corpus covers a wide variety of subjects, since it includes articles on philosophy, history, the social sciences, gender studies, literature, etc. Moreover, the articles are written not only by Foucault scholars but also by journalists, and they address a wider audience that is not necessarily competent in Foucault’s philosophy. The language is therefore extremely heterogeneous.

Case 2

The second corpus includes about 200 papers published in the journal Foucault Studies from 2004 to 2019. Unlike the first corpus, this one employs a considerably more specialised terminology, because the articles have been collected from a single journal with a well-defined editorial line. Moreover, most of the authors belong to Foucauldian scholarship. For this reason, specific philosophical terms are more frequent in these articles, and the variety of topics is quite limited compared to case 1. This is the smallest of the three corpora considered.

Case 3

The third corpus includes about 700 articles citing the philosopher Baruch Spinoza, collected from French journals in the social sciences from 1980 to 2014. As in case 1, the disciplines are various (economics, politics, sociology, etc.) and the language is heterogeneous. The journals selected to build the corpus are the best-known scientific journals of each discipline: therefore, even if the articles cover several topics, the terminology is nonetheless more technical than in corpus 1, which also included generalist journals. As in case 2, the corpus is small.

As expected, the results of the execution of Mallet substantially depend on the size and composition of the analysed corpora. 

The output files are composed of clusters of 20 words, listed in descending order of weight within each topic. In these files, the thematic heterogeneity of a corpus may result in extremely variegated clusters, which can contain associations of terms and themes that are not immediately intelligible for research purposes. This heterogeneity calls for a differentiated approach to the results, since some clusters require significant interpretive effort. Indeed, on a vast and varied corpus like the first one, Mallet has an enormous amount of information to process while searching for frequent topics and patterns. The links found among words should thus be more consistent and reliable than in the case of a small corpus containing more limited information.
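The number of words printed per cluster is itself a parameter: in a sketch reusing the hypothetical corpus.mallet file above, it can be set explicitly with the --num-top-words option of train-topics:

bin/mallet train-topics --input corpus.mallet --num-topics 10 --num-top-words 20 --output-topic-keys tutorial.keys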

Examining the first corpus, the first result is that the clusters mainly include terms that are clearly associated (or consistent) with one another, and in these cases we can recognise subjects with clear boundaries. After some trials, it was possible to establish a number of topics for Mallet such that each topic corresponds to a specific subject. For example, in corpus 1:


0 0,45455  political politics war history cultural society power nation culture colonial identity religious modern global india public rights europe discourse post
1 0,45455 development policy power political economic global public security government local society management human planning environmental governance change politics urban economy
2 0,45455 religion religious god church jewish ancient medieval spiritual ritual modern christianity theology catholic divine tradition biblical christ classical theological history


Here it is possible to associate each cluster with a disciplinary field. 

The output configuration is similar for corpus 2, which contains a small number of articles but a much more homogeneous vocabulary:


0 0,45455  truth subject freedom ethics practice ethical practices subjectivity care existence essential ancient relationship parrhesia hermeneutics life critical aesthetics rabinow process
1 0,45455 law men police women system legal justice group laws black war lives rights punishment public panopticon prison order criminal groups
2 0,45455 power relations disciplinary sovereign resistance discourse techniques biopower knowledge biopolitical war practices modern body political production racism mechanisms discursive effects


The articles from the Foucault Studies journal, from which corpus 2 was built, have an internal consistency: the authors ponder the applications of Foucauldian philosophy to current political and social problems. So even though Mallet has less data to process when extracting topics, the occurrence of specific terms and themes familiar to Foucault scholars makes the analysis more straightforward and effective. In this case, the interpretation of the results is a simpler task.

The same is not true in the case of the third corpus, where clusters sometimes contain terms from various disciplines:


0 0.625  jaspers moral kant language hegel knowledge hume frege rights justice husserl individuals descartes psychologie smith morality heidegger cognitive natural money
1 0.625 hegel politique conscience diderot éthique mouvement dieu hobbes pouvoir kant judaïsme travail peuple freud loi métaphysique puissance subjectivité rationalité guerre
2 0.625 politique sociale heidegger mouvements pouvoir utopie choix action nietzsche travail expérience droits guerre entreprise courant individu mouvement collective classe scientifiques


In each of these clusters there are philosophical terms relating to the most disparate themes, and references to very diverse figures. On a corpus of reduced dimensions with a wide thematic variety, it is preferable to set a reasonably low number of topics, but thematic divergence cannot be eliminated entirely. This variety requires more effort to interpret each cluster, but at the same time it produces original results. Indeed, an advantage of topic modelling on a corpus such as the third one is the prominence given to unusual associations. Links between words of the same cluster that seem inconsistent or bizarre at first glance may actually derive from articles presenting original analyses of the subject, which in the analysis of a large corpus would be overwhelmed by more frequent patterns. In small corpora, indeed, every term carries greater weight in the elaboration of topics.

We also have to admit that complex clusters require complementary investigations of the corpus, such as manually checking text strings in order to verify why apparently unrelated terms have been associated. However, the need for such an approach does not indicate weak reliability of the results; it simply suggests that the interpretation may have to be guided by an expert in the field.

Further confirmation that small corpora do not necessarily yield unreliable results comes from the Dirichlet parameter obtained for the third corpus. In the output file tutorial.keys, Mallet prints before each topic a number lower than 1: this is the topic’s Dirichlet (alpha) parameter, which governs how prevalent that topic is expected to be across the documents of the corpus. By default the parameter is symmetrical for all topics generated by Mallet, so its value reflects the chosen number of topics (the k parameter mentioned above) rather than being estimated from the data. After several trials with different values of k on this third corpus, we decided to work with 8 topics, the configuration that proved best for interpretation: with this number of topics, which corresponds to the value of 0.625 visible in the output above, we obtained both fair interpretability and consistent clusters.
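For comparison, Mallet can also re-estimate the Dirichlet parameters from the data, making them asymmetric so that larger values mark topics that are actually more prevalent in the corpus. This is done with the --optimize-interval option of train-topics (a sketch, reusing the hypothetical file names above):

bin/mallet train-topics --input corpus.mallet --num-topics 8 --optimize-interval 20 --output-topic-keys tutorial.keys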

For wider corpora it is harder to settle on a good configuration, due to the enormous amount of data to process. In any case, these parameters are only partial indicators of the reliability of the outputs: the evaluation of field experts remains the most relevant feedback, even for the choice of the number k of topics. Mallet does, however, make it possible to check which number of topics generates more internally consistent clusters, through its topic coherence diagnostics: the coherence score measures how often the top terms of a cluster are actually found together, as co-occurrences, in the corpus (in Mallet’s diagnostics it is computed from log co-occurrence probabilities, so values closer to zero indicate more coherent topics). In principle, the k with the best coherence score should be the one to work with, but in practice a compromise between a good score and good interpretability is reached after several trials.
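These coherence diagnostics can be requested at training time through the --diagnostics-file option, available in recent Mallet releases (a sketch with illustrative file names):

bin/mallet train-topics --input corpus.mallet --num-topics 8 --output-topic-keys tutorial.keys --diagnostics-file diagnostics.xml

The resulting XML file lists, for each topic, a coherence value alongside other per-topic statistics.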

To make a long story short, Mallet can be executed on different types of corpora, but the most straightforward and consistent results are obtained with larger amounts of textual documents. We have to recognise, however, that the complex composition of clusters is not a hard limit for research, since relevant and original results can hide behind this heterogeneity. Nevertheless, it sometimes requires more work to reach a reliable interpretation, such as additional checks of unexpected co-occurrences within the topics.


References:

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). “Latent Dirichlet Allocation”. Journal of Machine Learning Research, 3, 993–1022. https://dl.acm.org/doi/10.5555/944919.944937

Graham, S., Weingart, S., and Milligan, I. (2012). Getting Started with Topic Modeling and MALLET. The Programming Historian, 1. https://doi.org/10.46430/phen0017

Harris, Z. S. (1954). Distributional Structure. WORD, 10(2–3), 146–162. https://doi.org/10.1080/00437956.1954.11659520

McCallum, A. K. (2002). MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.


2 Responses to Considerations about corpus-dependency of topic modelling with Mallet

  1. Eugenio Petrovich says:

    Interesting work. A question out of curiosity: does the fact that the third corpus is partly in French affect in some way Mallet’s performance? In other words, is topic modeling language-dependent?

    • Sara Garzone says:

      I am sorry to be late with this reply. Actually, in the third corpus most of the articles were in French, so Mallet isolated the few English articles it found and produced a few topics for them. The total number of topics we obtained from the corpus was 32, and only 3 of these contained English terms (the proportion seems reasonable). In general, the program is language-dependent, but this can be a merit rather than a defect: it makes it possible to work on multilingual corpora, choosing between outputs with separate topics for each language or mixed ones. On the other hand, you can decide to remove foreign words if they are irrelevant to your topic modeling: in this case, you need to add them to your extra-stopword list.
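      For instance (a sketch with illustrative file names), the foreign words can be collected, whitespace-separated, in a plain-text file and passed to Mallet at import time via the --extra-stopwords option:

      bin/mallet import-dir --input corpus/ --output corpus.mallet --keep-sequence --remove-stopwords --extra-stopwords foreign_words.txt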
