A swift exploration of world literature

Dr. Philipp Warmer

With the advent of large language models, generating cohesive text has become trivial. As a result, the available text corpus will increasingly consist of AI-generated text. Before that happens, we wanted to take a step back and explore a part of the world’s literature in an attempt to quantify its magic.

Before we start — why does written content even matter?

Written texts represent the key element of development and transmission of human culture. Currently, we have access to millions of books written over thousands of years. The recent advances in natural language processing (NLP) offer an unprecedented toolbox to systematically analyze literature at scale.

Here, we used a set of NLP approaches to characterize over 90 of the most downloaded books from Project Gutenberg, a collection offering a vast number of free eBooks (https://www.gutenberg.org/). Taking an unsupervised approach, we wanted to learn:

  1. How do writing styles differ between books?
  2. How does the sentiment change throughout books?
  3. How are certain writing styles associated with the overall sentiment of a book?
  4. How do transformers map a large corpus of literature and do they expose the underlying genre composition?

1. Books have many unique characteristics and it’s not only their number of pages

Literature is often analyzed either by investigating the story arc of an individual book or by contrasting two books across different dimensions. A systematic assessment of the commonalities and differences across a comprehensive collection of popular literature is less common. Various approaches have been developed to describe a book purely by the composition of its sentences and underlying words. Among those characteristics are the dependency distance of words, the proportion of unique words and the readability of a sentence. We were curious how these features vary across different books. To this end, we studied more than 90 prominent books by 80 authors from Project Gutenberg, spanning a wide range of publishing dates (Figure 1A). We subjected the books to minimal preprocessing and computed a set of text descriptives using the TextDescriptives package by HLasse (Figure 1B). For an in-depth explanation, please check out https://github.com/HLasse/TextDescriptives. Afterwards, we calculated the variance of each feature across the books (Figure 1C). The features can be loosely grouped into two categories: (1) general text attributes such as the number of unique words and (2) readability characteristics such as ease of reading. The two categories are distinguished by lighter and darker colors in Figure 1.

As expected, the total number of characters (letters) per book, corresponding to the book length, showed the highest variance. Two other highly variable features were the proportion of unique words per book, which signifies a large difference in vocabulary size, and the readability of books. The average word length and the average number of syllables per word varied the least. Together, these observations highlight that authors have distinct writing styles, resulting in a diverse set of text characteristics beyond the length of a book.
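To make the idea of comparing feature variability concrete, here is a minimal sketch using only the Python standard library. It is not the TextDescriptives pipeline used for the figures: the three feature definitions, the use of the coefficient of variation (standard deviation divided by mean) to put features on a comparable scale, and the toy texts are all assumptions made for illustration.

```python
import re
from statistics import mean, pstdev


def describe(text: str) -> dict[str, float]:
    """Compute a few simple, illustrative text descriptives."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "n_characters": float(len(text)),
        "unique_word_ratio": len(set(words)) / len(words),
        "mean_word_length": mean(len(w) for w in words),
    }


def relative_spread(books: dict[str, str]) -> dict[str, float]:
    """Coefficient of variation (std / mean) of each feature across books."""
    rows = [describe(text) for text in books.values()]
    spread = {}
    for feature in rows[0]:
        values = [row[feature] for row in rows]
        spread[feature] = pstdev(values) / mean(values)
    return spread


# Toy stand-ins for full book texts (hypothetical data).
books = {
    "short": "The whale rose. The whale fell.",
    "medium": "It was a bright cold day in April, and the clocks were striking thirteen.",
    "long": "Call me Ishmael. " * 20,
}

spread = relative_spread(books)
# Print features from most to least variable.
for feature, value in sorted(spread.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {value:.2f}")
```

Even on this toy input, text length varies far more between books than the mean word length does, which mirrors the pattern described above.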

Figure 1: Overview of book characteristics. A) Distribution of analyzed books with regard to their publishing date. B) Median-sorted text characteristics per book. C) Associated variance. Characteristics related to readability are labeled in darker colors, descriptive book characteristics in lighter colors.

2. A happy beginning doesn’t protect from an unhappy end

In addition to the text characteristics, we determined the sentiment of each book based on its word composition using NRCLex (https://github.com/metalcorebear/NRCLex). NRCLex calculated a sentiment score for each book, which we normalized so that books with a more positive sentiment received a positive score and less positive books a negative score. As expected, the “Simple Sabotage Field Manual” and “The War of the Worlds” were among the books with the most negative sentiment (Figure 2A). Among the most neutral books were “Thus Spoke Zarathustra” and “Don Quixote”. “The Happy Prince” was one of the most positive books.
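The scoring and normalization step can be sketched as follows. This is a simplified stand-in rather than the NRCLex workflow itself: the tiny word lists are hypothetical, and centering the scores on the corpus mean is an assumption about how the normalization might be done.

```python
import re
from statistics import mean

# Tiny hypothetical lexicon; the NRC lexicon behind NRCLex is far larger.
POSITIVE = {"happy", "love", "joy", "gift", "kind"}
NEGATIVE = {"war", "death", "fear", "destroy", "sabotage"}


def positive_fraction(text: str) -> float:
    """Share of positive words among all sentiment-bearing words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Texts without any lexicon hits are treated as neutral.
    return pos / (pos + neg) if pos + neg else 0.5


def mean_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Center scores on the corpus mean so the sign is relative to the average book."""
    center = mean(scores.values())
    return {title: score - center for title, score in scores.items()}


# Toy stand-ins for full book texts (hypothetical data).
books = {
    "cheerful": "A happy prince gave a gift with love and joy.",
    "grim": "War brought fear and death; sabotage would destroy the rest.",
    "mixed": "Love and war, joy and fear.",
}

raw = {title: positive_fraction(text) for title, text in books.items()}
normalized = mean_normalize(raw)
for title, score in sorted(normalized.items(), key=lambda kv: kv[1]):
    print(f"{title}: {score:+.2f}")
```

With this kind of centering, a negative score does not mean a book is negative in absolute terms, only that it is less positive than the average book in the collection.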

To get a more fine-grained understanding of how sentiment changes throughout a book, we split the content of each book into the five parts of Freytag’s pyramid, introduced by the novelist Gustav Freytag: exposition, rising action, climax, falling action and resolution (Figure 2B). For each of the five parts of each book, we determined the sentiment and subjected the resulting profiles to a K-means-based inertia drop analysis to identify the number of meaningful clusters. We found that three clusters separate the data best (Figure 2C). Looking at the progression (Figure 2D), we saw that two of the three sentiment groups had high sentiment scores that stayed relatively unchanged throughout the book. The third group, corresponding to the least positive books, started out relatively positive in the exposition; once the rising action set in, its sentiment dropped by approximately 40% and remained low.
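The following sketch illustrates the two steps described above: splitting a book into five parts and looking at how the K-means inertia drops as clusters are added. Several things here are assumptions for the sake of a runnable example: the five parts are taken to be of equal size, K-means is a small hand-written Lloyd's algorithm with a deterministic farthest-point initialization instead of a library implementation, and the sentiment profiles are made-up numbers.

```python
def split_into_parts(items, n_parts=5):
    """Split a sequence into n_parts contiguous slices of near-equal size."""
    bounds = [i * len(items) // n_parts for i in range(n_parts + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(n_parts)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_inertia(points, k, n_iter=20):
    """Lloyd's algorithm with deterministic farthest-point initialization.

    Returns the inertia: the summed squared distance of every point
    to its nearest cluster center.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ended up empty.
        centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[j]
            for j, members in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Hypothetical sentiment profiles: one row per book, one value per Freytag part.
profiles = [
    [0.80, 0.80, 0.80, 0.80, 0.80],
    [0.82, 0.80, 0.78, 0.80, 0.80],
    [0.60, 0.60, 0.60, 0.60, 0.60],
    [0.62, 0.60, 0.58, 0.60, 0.60],
    [0.60, 0.36, 0.36, 0.36, 0.36],
    [0.62, 0.36, 0.34, 0.36, 0.36],
]

inertias = {k: kmeans_inertia(profiles, k) for k in range(1, 5)}
for k in range(2, 5):
    print(f"k={k}: inertia drop {inertias[k - 1] - inertias[k]:.4f}")
```

In this toy data the inertia falls sharply up to three clusters and barely changes afterwards, which is the kind of pattern an inertia drop analysis looks for when picking the number of clusters.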

Overall, we see that the books analyzed here exhibit a broad distribution of sentiment scores and can be grouped into three groups with specific sentiment changes throughout the books’ progression.

Figure 2: Positional analysis of sentiment. A) Mean-normalized sentiment score across all books. B) Scheme depicting how books are split into parts according to Freytag’s pyramid. C) Sentiment clusters identified by K-means inertia drop analysis, projected onto t-SNE. D) Sentiment score of the three book clusters along the relative book position as defined by Freytag’s pyramid (B); on the right, the cumulative density plot of the positive sentiment fraction across the book.

3. Sentiment and writing style are connected?!

To identify whether text characteristics are associated with text sentiment, we correlated them across books (Figure 3A). We found that the mean dependency distance (Figure 3B) and the Flesch reading ease metric (Figure 3C) were most strongly correlated with the sentiment score. Mean dependency distance describes how far apart syntactically related words in a sentence are, whereas Flesch reading ease is a commonly used readability metric based on the average number of words per sentence and the average number of syllables per word, where a higher score corresponds to a more readable text.
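Both metrics are simple enough to write down directly. The sketch below uses the standard Flesch reading ease formula and one common definition of mean dependency distance (the average absolute distance between a token and its syntactic head, skipping the root). The exact definition used by the TextDescriptives package may differ in details, and the head indices in the example are written by hand rather than produced by a parser.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch reading ease formula; a higher score means easier to read."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def mean_dependency_distance(heads: list[int]) -> float:
    """Average |token position - head position| over all non-root tokens.

    heads[i] is the index of the token that token i depends on.
    The root is assumed to point to itself, which is the convention spaCy uses.
    """
    distances = [abs(i - h) for i, h in enumerate(heads) if i != h]
    return sum(distances) / len(distances)


# Hypothetical counts for an easy and a hard text of 100 words each.
easy = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=130)
hard = flesch_reading_ease(n_words=100, n_sentences=2, n_syllables=170)
print(f"easy text: {easy:.1f}, hard text: {hard:.1f}")

# Hand-annotated heads for "The cat sat on the mat":
# The->cat, cat->sat, sat (root), on->sat, the->mat, mat->on
heads = [1, 2, 2, 2, 5, 3]
mdd = mean_dependency_distance(heads)
print(f"mean dependency distance: {mdd:.2f}")
```

Longer sentences and more syllables per word both push the Flesch score down, while long-range links between words push the dependency distance up.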

Although the correlations are weak, these results indicate that “hard-to-read” books tend to have a more positive sentiment, whereas negative sentiment is delivered in an easier-to-read fashion.
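For readers who want to reproduce this kind of association, a Spearman correlation is just a Pearson correlation computed on ranks. The sketch below implements it with the standard library; the reading ease and sentiment values are hypothetical numbers chosen to show a negative relationship, not data from our analysis.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-book values.
reading_ease = [85.0, 70.0, 60.0, 45.0, 30.0]
sentiment = [-0.3, 0.0, -0.1, 0.4, 0.1]

rho = spearman(reading_ease, sentiment)
print(f"Spearman rho: {rho:.2f}")
```

Because it works on ranks, the coefficient only captures whether the two quantities move together monotonically, not how strongly one changes per unit of the other.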

Figure 3: Associating text characteristics and sentiment. A) Sorted Spearman correlation coefficients from associating the positive sentiment with the given text characteristics. B) Correlation of the sentiment score with mean dependency distance and C) with Flesch reading ease.

4. On the edges of the book landscape distinct genres can be found

Next, we moved beyond descriptive text characteristics and focused on the content of each book. To get a finer resolution, we chunked each book into pages of 100 lines each, resulting in a total of approximately 10,000 pages across all books. For each page, we computed text embeddings using Bidirectional Encoder Representations from Transformers (BERT) (https://github.com/explosion/s...), resulting in a numerical representation of the text content of a given page. We then projected the 768-dimensional text representations onto two dimensions using t-SNE. While the 2D distribution was overall homogeneous, we observed larger, distinct accumulations of pages in the fringe regions. Hypothesizing that very similar books in particular would co-locate there, we highlighted books of similar genres (Figure 4). In line with our expectation, we found distinct book genres at the extremes. In the North-East, Adventure Fiction such as “Moby Dick” and “Treasure Island” was located. Below it, in the South-East, we found the category of Epic Poetry, home to “The Iliad”, “The Odyssey” and “The History of the Peloponnesian War”. Next to it, a particularly distinct point cloud is inhabited by the Russian Classics such as “Crime and Punishment” and “War and Peace”. Above it, in the North-West, Philosophical Fiction can be found, with books such as “Thus Spoke Zarathustra”, “The Prophet” and “The Christmas Story”.
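The chunking step is the simplest part of this pipeline and can be sketched in a few lines. The function name and the handling of the final, shorter page are assumptions; the embedding and t-SNE steps are deliberately left out because they depend on a transformer model and a t-SNE implementation that are not reproduced here.

```python
def chunk_into_pages(text: str, lines_per_page: int = 100) -> list[str]:
    """Split a text into consecutive pages of at most lines_per_page lines."""
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_page])
        for i in range(0, len(lines), lines_per_page)
    ]


# Toy stand-in for a full book text with 250 lines (hypothetical data).
book = "\n".join(f"line {i}" for i in range(250))
pages = chunk_into_pages(book)
print(f"{len(pages)} pages, last page has {len(pages[-1].splitlines())} lines")
```

Each resulting page would then be passed to the embedding model, so that one book contributes many points to the map instead of a single one.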

Given that the Russian Classics are among the most esteemed books in world literature and that our analysis grouped them on a distinct patch of the book map, we were tempted to capture some of their essence of greatness using a numerical approach. We philosophized that the most central page in the group of Russian Classics would contain some of the group’s characteristic spirit. To this end, we determined the centroid of the Russian Classics and the page closest to it. It was one of the last pages of “War and Peace” by Leo Tolstoy. It contained the scene where Pierre and his wife are in the drawing room of the Countess and present her with a gift, to which she responds “It’s not the gift that’s precious, my dear, but that you give it to me”.
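Finding the most central page boils down to averaging the page vectors of a group and picking the page nearest to that average. The sketch below shows this with plain Python lists. Whether the centroid is best computed on the original embeddings or on the 2D projection is left open here; the code works the same way for either, and the example vectors are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def most_central_index(vectors):
    """Index of the vector closest to the centroid of the group."""
    center = centroid(vectors)
    return min(range(len(vectors)), key=lambda i: sq_dist(vectors[i], center))


# Hypothetical 2D coordinates of five pages belonging to one group.
page_vectors = [
    [0.0, 0.0],
    [2.0, 0.0],
    [0.0, 2.0],
    [2.0, 2.0],
    [1.1, 0.9],
]

central = most_central_index(page_vectors)
print(f"most central page: {central}")
```

The centroid itself is usually not an actual page, which is why the nearest real page is taken as the representative of the group.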

We conclude that BERT together with t-SNE dimensionality reduction not only groups books by their content, and presumably their overarching genre, but can also help identify characteristic content of each group. This could enable unsupervised genre classification of books and the identification of key passages of literature at large.

Figure 4: Connecting books beyond text characteristics. Books in the extremes are highlighted and grouped. The density kernels of the three largest and most exposed books are shown per group.

So what’s the tl;dr?

Here we analyzed over 90 of the most downloaded books, spanning different time periods and cultures, through different lenses. Generally, we found that books vary a lot not only in length but also in the size of the vocabulary used. Ranking the books by their overall sentiment showed that among the most positive books were “The Problems of Philosophy” and “The Happy Prince”, whereas the most negative book was the “Simple Sabotage Field Manual”.

We determined three sentiment groups among the analyzed books. While each of the three groups had a different average sentiment value, only one group showed a change in sentiment throughout the story progression. As a next step, one could break the groups down into their underlying emotions such as fear, anger, anticipation, trust, surprise, sadness, disgust and joy, and establish further subgroups. In addition, it would be exciting to see whether the categorization of these groups holds true when increasing the number of books and when using orthogonal means to determine the sentiment. If so, this would hint at three surprisingly simple sentiment archetypes in books.

Another intriguing yet less clear finding was that books that are easier to read tend to be more negative in sentiment, the inverse of what Leo Tolstoy writes in his novel “Anna Karenina”: “Happy families are all alike; every unhappy family is unhappy in its own way”.

Finally, the BERT-based literature map showed that distinct corners are inhabited by the genres of Adventure Fiction, Epic Poetry, Russian Classics and Philosophical Fiction. We also used this book map to determine core pieces of literature by selecting the centroid page of each group. Next steps could focus on (1) expanding the book collection of the literature landscape, e.g. with regard to the authors’ origin, or (2) mapping different levels of metadata onto the landscape to chart differences within genres.

Thanks to DONE for supporting this exploration — a special thanks to Heiko Kromer.

If you enjoyed this analysis, have ideas for follow-ups or any other random idea, please feel free to reach out on LinkedIn: https://www.linkedin.com/in/philippwarmer/
