doc2date: A Study in Document Regression

Document classification is a common application of machine learning. Examples include sentiment analysis, in which texts are classified into a (typically small) number of moods, such as positive and negative, and authorship attribution in stylometry, in which texts are grouped according to their original author.

Unsupervised learning methods have also been applied to the analysis of documents. For instance, doc2vec is a dimensionality reduction technique that extends word embeddings to documents.

But what about document regression? In this notebook, we investigate the problem of learning the date of a publication from the text contained therein. Since the target space, a range of years, can be viewed as a continuum, this problem presents a natural test case for applying regression techniques to document analysis.
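To make the setup concrete, here is a minimal sketch of document regression on a toy scale. It is not the method used in this notebook: it assumes a bag-of-words representation and swaps in a plain k-nearest-neighbors regressor, and the corpus, the years, and the names `bag_of_words`, `cosine`, `predict_year`, and `TOY_CORPUS` are all invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_year(text, corpus, k=2):
    """Predict a year as the mean year of the k most similar documents."""
    query = bag_of_words(text)
    ranked = sorted(
        corpus,
        key=lambda item: cosine(query, bag_of_words(item[0])),
        reverse=True,
    )
    nearest = ranked[:k]
    return sum(year for _, year in nearest) / len(nearest)


# Invented toy corpus of (text, publication year) pairs; not real data.
TOY_CORPUS = [
    ("the telegraph office received the dispatch by steamship", 1880),
    ("the steamship carried the telegraph cable across the ocean", 1870),
    ("the website was updated and the email server restarted", 2005),
    ("download the app and check your email on the smartphone", 2012),
]

print(predict_year("a telegraph message arrived by steamship", TOY_CORPUS))
```

On this toy corpus the query shares vocabulary only with the two nineteenth-century documents, so the prediction is their mean year, 1875.0. Because the output is an average rather than a class label, predictions can fall between the years seen in training, which is what distinguishes the regression view from classification.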
