Personal blog of Matthias M. Fischer


A small NLP data science project from start to finish

Posted: 21st November 2021

Introduction

In this post, we will look at a little data science project I recently finished. I did this mostly for fun, but also to learn the basics of natural language processing (NLP) and to try my hand at some web scraping.

Thematically, we will be dealing with the analysis of a small corpus of ~1000 letters exchanged between two famous German poets of the 18th and 19th centuries: J. W. v. Goethe and F. v. Schiller. Admittedly, the exact results are probably only interesting to a smaller group of people. Nonetheless, I think the general analytic approach and some general 'lessons learned' might be interesting for a wider audience. I also suspect that this post might even serve as a little tutorial or inspiration for people who are just starting out with web scraping and NLP.

Getting started

We will use the following Python libraries:

Apart from those, we will only need some standard Python libraries and a way to interact with Jupyter notebooks.

All source code is available in the following GitHub repository: Ma-Fi-94/Letters

Obtaining the raw data

(Check out the file scrape.py if you want to see the complete code for this section.)

Naturally, we first have to obtain our data. Here, this turned out to be pretty simple. The correspondence between Goethe and Schiller has already been published as a book multiple times, and luckily, some of these editions are no longer under copyright and can thus be freely found online. We will use the following set of thirteen chapters from the 1881 edition published by J. C. Cotta in Stuttgart, contained in two volumes, obtainable via Projekt Gutenberg here and here.

Using the `requests` library for Python, downloading the raw files becomes a matter of a few lines of code:

import requests

# Download one raw HTML file, guessing the text encoding from the content.
r = requests.get('https://www.example.com')
r.encoding = r.apparent_encoding
with open('destinationfilename.html', 'w') as f:
    f.write(r.text)

Preprocessing the data

(Check out the file preprocess.py if you want to see the complete code for this section.)

Next, we need to preprocess the raw HTML files to extract the individual letters. The following section describes in detail what exactly I did; you may safely skip it if you are not interested in the details.

We loop over the downloaded HTML files, extract the individual letters together with their respective author, and finally store everything in a single CSV file for the subsequent analyses.
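The exact parsing steps depend on the structure of the Gutenberg HTML files; as a rough illustration only, such a loop could be sketched as follows. This is not the actual preprocess.py – the file names, the selector and the author-detection rule are pure placeholders, and BeautifulSoup is assumed as the HTML parser:

# Hypothetical sketch, not the actual preprocess.py.
import csv
import glob

from bs4 import BeautifulSoup

letters = []
for filename in sorted(glob.glob("raw/*.html")):
    with open(filename, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Placeholder: assume each letter sits in its own block-level element.
    for block in soup.find_all("div"):
        text = block.get_text(" ", strip=True)
        # Placeholder rule -- how the author is recognised depends on the HTML.
        author = "Goethe" if "Goethe" in text[:100] else "Schiller"
        letters.append((author, text))

# Store everything in a single CSV file with the columns used below.
with open("letters.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Author", "Content"])
    writer.writerows(letters)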

Data analysis – general principles

Now that we have stored all letters in a single CSV file, we can start working on the actual analyses. Such 'actual analyses' usually take the least amount of time in a project like this, compared to all the preparatory work. As the saying goes: data science is mostly obtaining, organising and cleaning data.

To keep everything organised, we will store our analyses in different Jupyter notebooks, split by the general type of question we want to answer. To avoid boilerplate code for e.g. importing the data, we will write some additional Python scripts (with filenames lib_*.py) in which we store code we need multiple times. This also allows for easy testing of these external helper functions, which of course is a good thing in general to avoid errors.
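As an illustration, such a helper module might look roughly like the following minimal sketch (the file name lib_data.py, the function name and the CSV file name are assumptions for this post; only the columns Author and Content are taken from the analyses below):

# lib_data.py -- minimal sketch of a reusable helper; names are assumptions.
import pandas as pd

def load_letters(path: str = "letters.csv") -> pd.DataFrame:
    """Load the preprocessed corpus, one letter per row."""
    df = pd.read_csv(path)
    # Basic sanity checks, so that every notebook starts from the same guarantees.
    assert {"Author", "Content"}.issubset(df.columns)
    assert df.Author.isin(["Goethe", "Schiller"]).all()
    return df

A notebook then only needs a single call like df = load_letters() instead of repeating the import and checking logic everywhere.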

First analysis: Letter counts and lengths by author and by time

(Check out the jupyter notebook Descriptive.ipynb if you want to see the complete code for this section.)

An excellent first step in any data analysis is simply looking at the data. Here, we will examine the numbers and lengths of the letters by author and over time. Such a simple inspection serves multiple purposes:

Let's start by simply counting how many letters our two authors have written by calling df.Author.value_counts(). We get:

Author      Nb. Letters
Goethe      521
Schiller    442

This appears realistic. However, the difference in numbers is interesting – is this a general trend? Let's examine cumulative letter counts over time. We count letters written by Goethe and by Schiller in two new columns, by adding a 1 to a cumulative count whenever a letter is written by the respective author:

df["cum_counts_G"] = np.cumsum([1 if author == "Goethe" else 0 for author in df.Author])
df["cum_counts_S"] = np.cumsum([1 if author == "Schiller" else 0 for author in df.Author])

Plotting these columns yields:

Interesting! There indeed seems to be a trend across the whole corpus, with Goethe generally writing more often than his colleague.

Let's now look at the average letter lengths. We calculate the length of every letter (in characters) and then simply take the mean per author:

df["length"] = [len(l) for l in df.Content]
print(df.groupby("Author").mean().length)

We get (after some rounding):

Author      Avg. Letter Length [characters]
Goethe      1545
Schiller    2026

So on average, Schiller writes roughly a third more than his colleague? No wonder he doesn't need to write as often ;-).

But wait, let's not get carried away too quickly and also inspect the raw data instead of only looking at summary statistics like the mean. After all, the mean is just a very rough summary of a dataset, easily affected by outliers, and not completely representative in the case of skewed or bimodal distributions. We thus look at the raw distribution of letter lengths by author by simply generating a histogram.
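A minimal sketch of such a histogram, again assuming matplotlib (the number of bins is an arbitrary choice):

import matplotlib.pyplot as plt

# Overlaid histograms of letter lengths (in characters), one per author.
for author in ["Goethe", "Schiller"]:
    plt.hist(df[df.Author == author].length, bins=50, alpha=0.5, label=author)
plt.xlabel("Letter length [characters]")
plt.ylabel("Number of letters")
plt.legend()
plt.show()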

Interesting again! While Goethe mostly seems to write short letters, his colleague shows a bimodal distribution: Schiller writes either rather short or rather long letters. On average, this leads to a larger mean letter length; however, we would never have noticed this by looking only at summary statistics.

Finally, let us also check again whether this trend persists over time. We calculate the cumulative sum of characters written by the two authors, as well as the cumulative sum of characters of all letters:

df["cum_words"] = np.cumsum(df.length)
df["cum_words_G"] = np.cumsum([length if author == "Goethe" else 0 for (length, author) in zip(df.length, df.Author)])
df["cum_words_S"] = np.cumsum([length if author == "Schiller" else 0 for (length, author) in zip(df.length, df.Author)])

We get:

The difference between the two authors seems to persist over time. Interestingly, however, the writing output per letter (in characters) of both writers seems to decline over time (an effect of old age, maybe ;-) ?). There are also some little spikes in the curves, hinting at short periods of higher productivity. Of course, since we only plot the index of the letters on the x-axis instead of the actual time, this might be confounded by differences in the time passed between subsequent letters. A more thorough analysis could try to extract the date of every letter from the corpus to correct for that.

Second analysis: Which words change their frequency over time?

(Check out the jupyter notebook Word_Frequencies_Over_Time.ipynb if you want to see the complete code for this section.)

Let's dig more deeply into the actual contents of the letters. A natural question we may want to answer is whether certain words change their frequency over time. Such words could represent topics that the authors care about more or less at different points in time. In contrast, words with a constant frequency over time are more likely to be non-informative 'everyday words' we don't really care about.

Let us first examine the distribution of word frequencies across the whole corpus:

Nb. Occurrences in Corpus    Number of Words    Fraction of Distinct Words
1                            9381               50.53%
2                            2788               15.02%
3                            1378                7.42%
4                             826                4.45%
5                             580                3.12%
6                             410                2.21%
7                             321                1.73%
8                             267                1.44%
9                             222                1.20%

We notice that out of 18'567 different words, 9'381 (more than 50%!) appear only once throughout the entire corpus. A further 15%, 7.4%, 4.5%, and 3.1% of words appear only twice, thrice, four times, and five times, respectively. In other words, 80% of all words appear no more than five times altogether. Rare words like these do not really lend themselves to a direct analysis, so we apply some filtering: we remove all words that do not appear at least fifty times across the complete corpus. This leaves us with 580 words – way easier to work with!
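One way to obtain these corpus-wide word counts and to apply the filter is a simple Counter over a naive tokenisation (only a sketch – the tokenisation rule here is an assumption, and the actual notebook may differ):

import re
from collections import Counter

# Naive tokenisation: lowercased runs of (German) letters.
tokens = [word
          for letter in df.Content
          for word in re.findall(r"[a-zäöüß]+", letter.lower())]
counts = Counter(tokens)
print(len(counts))            # number of distinct words

# Keep only words appearing at least fifty times in the whole corpus.
frequent_words = [word for word, count in counts.items() if count >= 50]
print(len(frequent_words))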

Now, we need to quantify the variability of the occurrences of a word across the letters. Calculating the standard deviation of the counts is a good start; however, since the standard deviation describes the 'absolute variability', it will be correlated with the average number of occurrences. Hence, we calculate the coefficient of variation by dividing the standard deviation by the mean, which gives us the 'relative variability' of each word. In other words, we compute:

C.V. = σ(x) / μ(x)

Words with a high coefficient of variation (C.V.) are potentially interesting candidates to examine further.
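Building on the counting sketch above, the coefficients of variation can then be computed per word, for example like this (again only a sketch):

import re
from collections import Counter
import pandas as pd

# Count every frequent word in every letter (same naive tokenisation as above).
rows = []
for letter in df.Content:
    letter_counts = Counter(re.findall(r"[a-zäöüß]+", letter.lower()))
    rows.append({word: letter_counts[word] for word in frequent_words})
per_letter_counts = pd.DataFrame(rows)

# Coefficient of variation = standard deviation / mean, computed per word.
cv = per_letter_counts.std() / per_letter_counts.mean()
print(cv.sort_values(ascending=False).head(20))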

Among the top twenty variable words, we find the following interesting candidates:

Word         Sum Counts    Mean Counts    Stddev Counts    C.V. Counts
handlung     52            0.053          0.385            7.141
frankfurt    51            0.062          0.373            7.051
meister      62            0.064          0.384            5.964
faust        55            0.057          0.331            5.805
poesie       62            0.064          0.370            5.750
leser        73            0.075          0.421            5.566
almanachs    52            0.053          0.293            5.443
freiheit     64            0.066          0.343            5.171

We notice some things that give rise to the analyses presented below:

Some other potential ideas, which we will not follow up on for the sake of brevity, could include:

Towns

We start by looking at the word frequencies of certain town names. Plotting the raw count of a given word over time generally yields rather ugly curves: because of the many zero values, the graph is dominated by flat horizontal lines and sharp peaks. Such a curve is hard to make sense of:

Hence, we will instead plot the _cumulative_ counts of a word over time:
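A rough sketch of how such cumulative counts can be computed and plotted (simple lowercase substring counting is an assumption here; the actual notebook may count word tokens instead):

import numpy as np
import matplotlib.pyplot as plt

for town in ["weimar", "jena", "frankfurt", "berlin"]:
    counts_per_letter = [letter.lower().count(town) for letter in df.Content]
    plt.plot(np.cumsum(counts_per_letter), label=town)
plt.xlabel("Letter index")
plt.ylabel("Cumulative number of mentions")
plt.legend()
plt.show()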

Way better! Some insights immediately become apparent: The town of Weimar seems to be relevant throughout the whole corpus, whereas the town of Jena at some point starts to lose some of its relevance. This coincides perfectly with the point in time when Schiller moved from Jena to Weimar (between the 2nd and 4th of December 1799, letters 673 and 674, which have indices 638 and 639 in our dataset; but see the caveat later in this article). In contrast, the towns of Frankfurt and Berlin do not appear all too often in the letters and do not seem to be really relevant.

Books

Next, we look at how often certain books are mentioned.

There are a number of interesting things to see here.

All in all, these trends nicely reflect what was happening in real life.

Third analysis: Do letters become more similar over time?

(Check out the jupyter notebook Letter_Similarities.ipynb if you want to see the complete code for this section.)

Finally, we move on to something slightly more involved. We want to answer the question of whether the exchanged letters become more similar over time. To this end, we first need to transform the corpus into a structure that is easier to work with statistically. This can be achieved using a number of different approaches; here, we will use a rather simplistic one based on the Bag of Words model. In a bag of words model, for every document we count the raw occurrences of all words of the corpus. We thus end up with a table with dimensionality n x m, where n is the number of documents and m the number of distinct words in the corpus.

There are multiple different ways to construct this table:

We will experiment with the first three choices later in order to see how they might affect our results.

A typical bag of words table might thus look as follows:

            Word 1    Word 2    ...    Word m
Letter 1    0         1         ...    4
Letter 2    1         0         ...    0
...         ...       ...       ...    ...
Letter n    1         2         ...    0

In Python, such a table is easy to construct using scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# Lengths of words we want to include
minimum_word_length = 3
maximum_word_length = 1000

# Which n-grams to consider
n_min = 1
n_max = 1

# Binary presence/absence data, or counts?
binary = False

# Only keep tokens made of letters, within the desired length range
token_pattern = r"[a-zA-Z]{" + str(minimum_word_length) + "," + str(maximum_word_length) + "}"

# Fit the model
model = CountVectorizer(ngram_range=(n_min, n_max), token_pattern=token_pattern, binary=binary)

# Get results
table = model.fit_transform(corpus)
word_names = model.get_feature_names()

After having constructed our bag-of-words table, we need to come up with a metric to quantify the similarity of two documents, which here means quantifying the similarity between two rows of the table. Different distance and similarity metrics are available; here, we will use the cosine similarity, which is commonly used for tasks like this.

Intuitively, the cosine similarity quantifies the angle between two vectors in our m-dimensional space. More precisely, the cosine similarity S corresponds to the cosine of this angle θ and can be calculated as follows:

S(A, B) := cos(θ) = A · B / (||A|| · ||B||)

In Python, we can simply use the function sklearn.metrics.pairwise.cosine_similarity(X), where X contains the vectors to compare.
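Applied to the bag-of-words table constructed above, a single call yields the full matrix of pair-wise letter similarities (a minimal sketch):

from sklearn.metrics.pairwise import cosine_similarity

# 'table' is the (sparse) bag-of-words matrix from the CountVectorizer above.
# The result is an n x n matrix of pair-wise cosine similarities.
sim = cosine_similarity(table)
print(sim.shape)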

Note that other, more familiar metrics, such as the Euclidean distance, are likely not a good choice, because they tend to fail in high dimensions. As Pedro Domingos has put it in this paper:

[O]ur intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant “shell” around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere.

Fitting the bag of words model and computing all pair-wise cosine similarities (for 1-grams with minimum word length of 3 characters, unnormalised and non-binarised counts) between the letters gives us the following similarity matrix:

We now slide a window of length 100 over the sequence of letters and compute the average pair-wise letter similarity over time. We get:
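For reference, a sketch of such a sliding-window average over the similarity matrix computed above (whether the diagonal is excluded and how the window borders are handled are assumptions here):

import numpy as np

window = 100
avg_similarity = []
for start in range(len(sim) - window + 1):
    block = sim[start:start + window, start:start + window]
    # Average over the off-diagonal entries of the window
    # (the diagonal is always exactly 1 and therefore excluded).
    avg_similarity.append(block[~np.eye(window, dtype=bool)].mean())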

Interesting – apparently, the pair-wise letter similarity on average actually decreased over time. In order to check the robustness of this finding, we re-run the analysis with different window sizes, getting similar results:

Thus, over time the two writers apparently did not "find a common style", but instead seem to have drifted apart stylistically.

We now want to check how robust this result is to changes in the analytic approach. If we change the minimum word length to 1, 5, or 10 characters instead of 3, we get extremely similar results (not shown here for brevity; they can be reproduced from the Jupyter notebook). The result also stays the same if we switch to binary presence/absence data instead of using the raw word counts (again not shown here). However, a completely different result emerges when we normalise the word counts across letters:

Using normalised word counts, we get the impression that around letter 600-ish there is a huge spike in average letter similarity, after which a second spike emerges. Thus, qualitatively completely different results can emerge if we change the analytic approach. Judging which approach is the "right" one requires a more precise definition of our question:

When we normalise the counts of a word across letters, we limit the (otherwise rather large) relative influence of the more frequent words on the vectors, whereas rarer words will now influence the vectors more strongly. Thus, in the normalised scenario, all words contribute equally to the letter vectors, regardless of whether they are used frequently or infrequently throughout the corpus. In contrast, in the non-normalised scenario, more frequent words exert a bigger influence on the data. For answering the question of whether the two authors develop a "common style" of writing over time, I think the non-normalised approach is the more appropriate one.
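One simple way to implement this normalisation is to scale every column of the bag-of-words table by its total count, so that each word's counts sum to one across the corpus (a sketch; the notebook may normalise differently):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Dense copy of the bag-of-words table; divide every column (word) by its
# total count, so that all words contribute equally overall.
X = table.toarray().astype(float)
X_normalised = X / X.sum(axis=0, keepdims=True)

sim_normalised = cosine_similarity(X_normalised)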

Conclusions, caveats and a little outlook

This was a very fun 'end-to-end' project that has taught me quite a lot! From writing a simple scraper to obtain the raw data, through dealing with the nitty-gritty details of data cleaning and preprocessing, to developing questions to ask, all the way to actually performing the analyses and interpreting their results – this project consisted of all the major steps one needs to perform in a 'real-world' project, too.

Some ideas (in no particular order) for future tinkering:

Some take-home messages: