Natural Language Processing in Python: Part 3 -- Generating Word Clouds
Welcome to Natural Language Processing in Python (Part 3)
If you have not seen Part 2 of this tutorial, please refer to the following link:
The companion video to this post on NLP can be viewed here:
The primary goal of this post will be to:
- Have some fun generating word clouds.
Table of Contents of this tutorial:
The wordcloud Module
Let us start off by importing the NLTK module.
import nltk
In this tutorial, we shall take a break from the core natural language processing content, and do something primarily just for kicks.
We shall make use of what we have learned thus far in NLTK to generate a word cloud (also known as tag cloud). This is a fun and interesting way in which to visually represent how prominent certain words are in a text resource.
In order to follow along with this tutorial, you will require the Python modules “matplotlib” and “wordcloud”. Both of these can be installed on your machine by navigating to a terminal and typing:
pip install wordcloud
pip install matplotlib
Generating Simple Word Clouds
Let us start off by importing the modules we just installed:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
Next, let us generate a really basic and simple word cloud, based on just a single Python string.
text = "all your base are belong to us"
Generating a word cloud with no optional parameters based on the above string:
wordcloud = WordCloud().generate(text)
Finally, use matplotlib to render the word cloud:
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
This will generate the following word cloud:
Since we will be using these three lines quite frequently, let us wrap them in a function for easier access.
def plot_wordcloud(wordcloud):
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
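As an optional refinement (a common suggestion in the wordcloud documentation rather than part of the original post), the helper could pass interpolation="bilinear" to imshow so the rendered cloud looks a little smoother:
def plot_wordcloud(wordcloud):
    # Bilinear interpolation smooths the rendered word cloud image slightly.
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()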
You will notice that the words “base”, “us” and “belong” are present in the word cloud, but the remaining words of “your”, “are”, “to”, and “all” are absent.
This is because the wordcloud module ignores stopwords by default. Refer to Part 1 of the NLTK tutorial if the concept of stopwords is new to you.
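If you would like to verify this, the wordcloud module exposes its default stopword list as STOPWORDS; a quick check:
from wordcloud import STOPWORDS

# Confirm which of the absent words belong to the default stopword set.
for word in ["all", "your", "are", "to"]:
    print(word, word in STOPWORDS)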
If we wish, we can specify our own set of stopwords, instead of the stopwords provided by default.
wordcloud = WordCloud(stopwords={'to', 'of'}).generate(text)
plot_wordcloud(wordcloud)
This generates the word cloud of the same phrase, where the stopwords excluded are the ones we selected.
Another optional parameter for WordCloud is that of relative_scaling, which corresponds to how the size of the text in the word cloud scales based on the content. With relative_scaling=0, only the ranks of the words are considered. If we alter this to relative_scaling=1.0, then a word that appears twice as frequently will appear twice the size. By default, relative_scaling=0.5.
wordcloud = WordCloud(relative_scaling=1.0,
stopwords={'to', 'of'}).generate(text)
plot_wordcloud(wordcloud)
Add in a few more occurrences of the word “base” to illustrate the effect of relative scaling.
text_base = "all your base are belong to us base base base base"
wordcloud = WordCloud(relative_scaling=1.0,
stopwords={'to', 'of'}).generate(text_base)
plot_wordcloud(wordcloud)
Note that the word “base” in the word cloud is relatively much larger than the other words.
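If you are curious about the raw counts that drive these relative sizes, a WordCloud object also provides a process_text method that returns the word frequencies it extracted from the string; a brief sketch:
# Inspect the word counts underlying the relative sizes in the cloud.
counts = WordCloud(stopwords={'to', 'of'}).process_text(text_base)
print(counts)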
Generating Word Clouds from NLTK’s Corpus
Recall from Part 2 of this series where we accessed the Inaugural Address corpus provided by NLTK.
Let us read in the raw content of the 1789 inaugural address of Washington and the 2009 address of Obama.
washington = nltk.corpus.inaugural.raw('1789-Washington.txt')
obama = nltk.corpus.inaugural.raw('2009-Obama.txt')
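If the Inaugural Address corpus has not yet been downloaded to your machine, the raw() calls above may raise a LookupError; a one-time download resolves this:
# Fetch the inaugural corpus if it is not already available locally.
nltk.download('inaugural')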
Using relative_scaling=1.0 as a parameter for the word cloud, let us generate respective word clouds for both Washington and Obama.
Generating the word cloud for Washington:
wordcloud = WordCloud(relative_scaling=1.0).generate(washington)
plot_wordcloud(wordcloud)
This yields the following word cloud:
By default, if the WordCloud function is not provided a set of stopwords, it will use the built-in ones. This is fine, but notice in the word cloud generated above that words such as “every” and “will” are present, yet they are not particularly useful for understanding what makes Washington’s address unique compared to the others.
What we can do then is to add in the words “every” and “will” into the set of stopwords that the WordCloud function considers.
from wordcloud import STOPWORDS
stopwords = set(STOPWORDS)
stopwords.add("every")
stopwords.add("will")
wordcloud = WordCloud(stopwords=stopwords, relative_scaling=1.0).generate(washington)
plot_wordcloud(wordcloud)
For visual comparative purposes, the word cloud for Obama is generated in a similar manner:
wordcloud = WordCloud(relative_scaling=1.0).generate(obama)
plot_wordcloud(wordcloud)
The Obama word cloud is shown here:
![Wordcloud from Barack Obama inaugural address.](https://i.imgur.com/kUvwmYt.png)
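Since we are comparing the two addresses visually, it can also be convenient to place both clouds in a single figure. Here is a small sketch using matplotlib subplots (not from the original post; it simply regenerates the two clouds side by side):
# Display the Washington and Obama word clouds side by side for comparison.
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, (title, source) in zip(axes, [("Washington, 1789", washington),
                                      ("Obama, 2009", obama)]):
    cloud = WordCloud(relative_scaling=1.0).generate(source)
    ax.imshow(cloud)
    ax.set_title(title)
    ax.axis("off")
plt.show()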
Word Clouds in Different Shapes
For this next example, you will need two other Python modules installed, if you do not already have them (note that the PIL module is nowadays provided by the Pillow package):
pip install Pillow
pip install numpy
The PIL module is used primarily for image processing, while numpy is used for various mathematical applications. The following example is inspired by the examples page on the wordcloud website.
First off, we will import all of the necessary modules required for this example.
import numpy as np
from PIL import Image
from os import path
import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud
Notice that we already imported some of these modules in an earlier part of this tutorial. We are placing them here redundantly so as to make the following example as self-contained as possible.
Now, read in the George Washington image as a numpy array. The source of the image can be found here. The entire example is provided below:
img = Image.open(path.join(path.dirname(__file__),
"supplementary_files",
"washington.jpg"))
mask = np.array(img)
text = nltk.corpus.inaugural.raw('1789-Washington.txt')
wc = WordCloud(max_words=1000, mask=mask).generate(text)
wc.to_file("washington_word_cloud.png")
plt.title("Default colors")
plt.imshow(wc)
plt.axis("off")
plt.show()
In essence, we first read the George Washington image into the img variable by telling Python where the image is located on our machine. Next, we create a mask variable, that is, a numerical representation of the image. We pass this numeric representation to the WordCloud function and plot it as we have done in the earlier portion of this tutorial. The resulting word cloud should look like this:
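The plt.title("Default colors") line above hints that other colorings are possible. As an optional extra (a sketch using the wordcloud package's ImageColorGenerator, not part of the original walkthrough), the cloud can be recolored with the colors of the source image:
from wordcloud import ImageColorGenerator

# Recolor the existing word cloud using the colors taken from the mask image.
image_colors = ImageColorGenerator(mask)
plt.title("Image-based colors")
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis("off")
plt.show()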
Conclusion
That wraps up this tutorial on natural language processing in Python. In the next tutorial, we will go over the WordNet resource.
Part 4 of Natural Language Processing in Python