In text analysis, topic models are a prominent approach for extracting overall themes from large collections of documents. Perhaps the most widely used model in this domain is Latent Dirichlet Allocation (LDA).
LDA is a probabilistic topic model that exists in many variations. Today we will look at one of these: the seeded LDA model.
A seeded topic model lets the researcher pass the model a collection of keywords that outline the themes of interest before estimation. This seeding gives more control over the estimation process and tends to make the resulting topics easier to interpret.
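To make the idea of seeding concrete, here is a minimal Python sketch of one common way to implement it: giving each topic's seed keywords a larger weight in the topic-word prior, so the estimation is nudged toward the themes we are looking for. This is an illustration under assumptions, not necessarily how any particular seeded LDA library works internally. The topic names, keywords, toy vocabulary, the helper `build_seeded_prior`, and the weight values are all hypothetical.

```python
# Hypothetical seed dictionary: topic name -> seed keywords (illustrative only).
seed_words = {
    "conflict": ["war", "army", "invasion"],
    "trade": ["tariff", "export", "import"],
}

# Hypothetical toy vocabulary; in practice it is built from the corpus.
vocabulary = ["war", "army", "invasion", "tariff", "export", "import", "budget", "school"]


def build_seeded_prior(seed_words, vocabulary, base_weight=0.01, seed_weight=1.0):
    """Return one row of topic-word prior weights per seeded topic.

    Every word starts at base_weight; a topic's seed words are raised to
    seed_weight, so the model is nudged to place them in that topic.
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    prior = {}
    for topic, words in seed_words.items():
        row = [base_weight] * len(vocabulary)
        for word in words:
            if word in index:  # skip seed words that do not occur in the vocabulary
                row[index[word]] = seed_weight
        prior[topic] = row
    return prior


prior = build_seeded_prior(seed_words, vocabulary)
for topic, row in prior.items():
    print(topic, row)
```

A matrix like this can then be handed to an LDA implementation that accepts a custom topic-word prior. How strongly the seeds steer the result depends on the ratio between the two weights, which is a tuning choice.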
In my recent master’s thesis, I explored how to measure the abstract concept of geopolitical risk based on text data. For this, I applied a seeded topic model to transcripts of debates in the UK’s House of Commons from the last 200 years.
Data science can help provide new measurement tools and enable us to incorporate new or more precise variables into econometric models.
Without getting into the details, I would like to give you an overview of seeded LDA models and provide some code snippets for a basic setup.