An Introduction to Natural Language Processing with Python for SEOs

Natural language processing (NLP) is becoming more important than ever for SEO professionals.

It is crucial to start building the skills that will prepare you for all the amazing changes happening around us.

Hopefully, this column will motivate you to get started!

We are going to learn practical NLP while building a simple knowledge graph from scratch.

As Google, Bing, and other search engines use Knowledge Graphs to encode knowledge and enrich search results, what better way to learn about them than to build one?

Specifically, we are going to extract useful facts automatically from XML sitemaps.

To keep things simple and fast, we will pull article headlines from the URLs in the XML sitemaps.

We will extract named entities and their relationships from the headlines.

Finally, we will build a powerful knowledge graph and visualize the most popular relationships.

In the example below, the relationship is “launches.”

The way to read the graph is to follow the direction of the arrows: subject “launches” object.



For example:

  • “Bing launches 19 tracking map”, which is likely “Bing launches covid-19 tracking map.”
  • Another is “Snapchat launches ads certification program.”

These facts and over a thousand more were extracted and grouped automatically!

Let’s get in on the fun.

Here is the technical plan:

  • We will fetch all XML sitemaps.
  • We will parse the URLs to extract the headlines from the slugs.
  • We will extract entity pairs from the headlines.
  • We will extract the corresponding relationships.
  • We will build a knowledge graph and create a simple form in Colab to visualize the relationships we are interested in.

Fetching All XML Sitemaps

I recently had an enlightening conversation with Elias Dabbas from The Media Supermarket and learned about his wonderful Python library for marketers: advertools.

Some of my old articles no longer work with newer library versions. He gave me a good idea.



If I print the versions of the third-party libraries now, it will be easy to get the code working in the future.

If the code fails, I would just need to install the versions that worked. 🤓

%%capture
!pip install advertools

import advertools as adv
print(adv.__version__)
#0.10.6

We are going to download all sitemaps to a pandas data frame with two lines of code.

sitemap_url = "/?s=sitemap_index.xml"
df = adv.sitemap_to_df(sitemap_url)

One cool feature of the package is that it downloads all the sitemaps linked in the index, and we get back a single, tidy data frame.

Look how simple it is to filter articles/pages from this year. We have 1,550 articles.

new_df = df[df["lastmod"] > '2020-01-01']

Extract Headlines From the URLs

The advertools library has a function to break down URLs within the data frame, but let’s do it manually to get familiar with the process.

from urllib.parse import urlparse
import re

example_url = "/?s=google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/"
u = urlparse(example_url)
print(u)
#output -> ParseResult(scheme='https', netloc='', path='/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/', params='', query='', fragment='')

Here we get a named tuple, ParseResult, with a breakdown of the URL components.

We are interested in the path.



We are going to use a simple regex to split it by the / and - characters.

slug = re.split("[/-]", u.path)
print(slug)
#output
#['', 'google', 'be', 'careful', 'relying', 'on', '3rd', 'parties', 'to', 'render', 'website', 'content', '376547', '']

Next, we can convert it back to a string.

headline = " ".join(slug)
print(headline)
#output
#' google be careful relying on 3rd parties to render website content 376547 '

The slugs contain a page identifier that is useless for us. We will remove it with a regex.

headline = re.sub(r"\d{6}", "", headline)
print(headline)
#output
#' google be careful relying on 3rd parties to render website content '

#Strip whitespace at the borders
headline = headline.strip()
print(headline)
#output
#'google be careful relying on 3rd parties to render website content'

Now that we have tested this, we can convert the code to a function and create a new column in our data frame.

def get_headline(url):
    u = urlparse(url)
    if len(u.path) > 1:
        slug = re.split("[/-]", u.path)
        new_headline = re.sub(r"\d{6}", "", " ".join(slug)).strip()
        #skip author and category pages
        if not re.match("author|category", new_headline):
            return new_headline
    return ""

Let’s create a new column named headline.

new_df["headline"] = new_df["url"].apply(lambda x: get_headline(x))
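As a quick sanity check, here is the helper applied to the example URL (the function is restated so this snippet runs on its own; note that it also filters out author and category pages):

```python
import re
from urllib.parse import urlparse

def get_headline(url):
    u = urlparse(url)
    if len(u.path) > 1:
        slug = re.split("[/-]", u.path)
        # join the slug words and strip the six-digit page identifier
        new_headline = re.sub(r"\d{6}", "", " ".join(slug)).strip()
        # skip author and category pages
        if not re.match("author|category", new_headline):
            return new_headline
    return ""

print(get_headline("/google-be-careful-relying-on-3rd-parties-to-render-website-content/376547/"))
# google be careful relying on 3rd parties to render website content

print(repr(get_headline("/category/news/")))
# ''
```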

Extracting Named Entities

Let’s explore and visualize the entities in our headlines corpus.



First, we combine them into a single text document.

import spacy
from spacy import displacy

text = "\n".join([x for x in new_df["headline"].tolist() if len(x) > 0])
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)
We can see some entities correctly labeled and some incorrectly labeled, like Hulu as a person.

There are also several misses, like Facebook and Google Display Network.

spaCy’s out-of-the-box NER is not perfect and generally needs training with custom data to improve detection, but this is good enough to illustrate the concepts in this tutorial.



Building a Knowledge Graph

Now, we get to the exciting part.

Let’s start by evaluating the grammatical relationships between the words in each sentence.

We do this by printing the syntactic dependency of the entities.

for tok in doc[:100]:
    print(tok.text, "...", tok.dep_)

We are looking for subjects and objects connected by a relationship.

We will use spaCy’s rule-based parser to extract subjects and objects from the headlines.

The rule can be something like this:



Extract the subject/object along with its modifiers and compound words, and also extract the punctuation marks between them.

Let’s first import the libraries that we will need.

from spacy.matcher import Matcher
from spacy.tokens import Span
import networkx as nx
import matplotlib.pyplot as plt
from tqdm import tqdm

To build a knowledge graph, the most important things are the nodes and the edges between them.

The main idea is to go through each sentence and build two lists: one with the entity pairs and another with the corresponding relationships.
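With toy values (made up for illustration, not real extractions), pairing the two lists into subject–relation–object triples looks like this:

```python
# sample entity pairs and relations (illustrative only)
entity_pairs = [["bing", "covid tracking map"], ["snapchat", "ads certification program"]]
relations = ["launches", "launches"]

# zip the two lists into (subject, relation, object) triples
triples = [(subj, rel, obj) for (subj, obj), rel in zip(entity_pairs, relations)]
print(triples)
# [('bing', 'launches', 'covid tracking map'), ('snapchat', 'launches', 'ads certification program')]
```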

We are going to borrow a couple of functions created by data scientist Prateek Joshi.

  • The first one, get_entities, extracts the main entities and associated attributes.
  • The second one, get_relation, extracts the corresponding relationship between entities.
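The two functions themselves are not listed in this article (you will find them in Prateek Joshi’s knowledge graph tutorial and in the Colab notebook). As a simplified sketch of their core logic, the version below walks over any sequence of token-like objects with .text and .dep_ attributes, so you can see the mechanics without loading a spaCy model; the real versions call nlp(sent) first, and get_relation additionally uses spaCy’s Matcher to pick up trailing prepositions:

```python
from collections import namedtuple

# stand-in for spaCy tokens: just text plus a dependency label
Tok = namedtuple("Tok", ["text", "dep_"])

def get_entities(tokens):
    ent1 = ent2 = ""         # subject and object
    prefix = modifier = ""   # compound words and modifiers seen so far
    prv_dep = prv_text = ""  # previous token, used to chain compounds
    for tok in tokens:
        if tok.dep_ == "punct":
            continue
        if tok.dep_ == "compound":
            prefix = f"{prv_text} {tok.text}" if prv_dep == "compound" else tok.text
        if tok.dep_.endswith("mod"):
            modifier = f"{prv_text} {tok.text}" if prv_dep == "compound" else tok.text
        if "subj" in tok.dep_:
            ent1 = f"{modifier} {prefix} {tok.text}"
            prefix = modifier = ""
        if "obj" in tok.dep_:
            ent2 = f"{modifier} {prefix} {tok.text}"
        prv_dep, prv_text = tok.dep_, tok.text
    return [" ".join(ent1.split()), " ".join(ent2.split())]

def get_relation(tokens):
    # simplified: the predicate is the ROOT (main verb) of the sentence
    for tok in tokens:
        if tok.dep_ == "ROOT":
            return tok.text
    return ""

# "bing launches covid tracking map", with a hand-written parse
toks = [Tok("bing", "nsubj"), Tok("launches", "ROOT"),
        Tok("covid", "compound"), Tok("tracking", "compound"), Tok("map", "dobj")]
print(get_entities(toks))  # ['bing', 'covid tracking map']
print(get_relation(toks))  # launches
```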

Let’s test them on 100 sentences and see what the output looks like. I added len(x) > 0 to skip empty lines.

for t in [x for x in new_df["headline"].tolist() if len(x) > 0][:100]:
    print(get_entities(t))

Many extractions are missing elements or are not great, but as we have so many headlines, we should still be able to extract useful facts.



Now, let’s build the graph.

entity_pairs = []

for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0]):
    entity_pairs.append(get_entities(i))

Here are some example pairs.

entity_pairs[10:20]
#output
#[['chrome', ''], ['google assistant', '500 million 500 users'], ['', ''], ['seo metrics', 'how them'], ['google optimization', ''], ['twitter', 'new explore tab'], ['b2b', 'greg finn podcast'], ['instagram user growth', 'lower levels'], ['', ''], ['', 'advertiser']]

Next, let’s build the corresponding relationships. Our hypothesis is that the predicate is actually the main verb in a sentence.

relations = [get_relation(i) for i in tqdm([x for x in new_df["headline"].tolist() if len(x) > 0])]

print(relations[10:20])
#output
#['blocker', 'has', 'conversions', 'reports', 'ppc', 'rolls', 'paid', 'drops to lower', 'marketers', 'facebook']

Next, let’s rank the relationships.
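The ranking code itself isn’t shown here; a minimal way to do it is to count how often each relationship appears, shown below with the standard library’s Counter on sample values (with pandas, pd.Series(relations).value_counts() would give the same ranking):

```python
from collections import Counter

# sample relations (illustrative; in the notebook this would be the full list built above)
relations = ["launches", "has", "launches", "rolls", "launches", "has"]

top_relations = Counter(relations).most_common()
print(top_relations)
# [('launches', 3), ('has', 2), ('rolls', 1)]
```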


Finally, let’s build the knowledge graph.

import pandas as pd

# extract subject
source = [i[0] for i in entity_pairs]

# extract object
target = [i[1] for i in entity_pairs]

kg_df = pd.DataFrame({'source': source, 'target': target, 'edge': relations})

# create a directed graph from a dataframe
G = nx.from_pandas_edgelist(kg_df, "source", "target",
                            edge_attr=True, create_using=nx.MultiDiGraph())

plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G)
nx.draw(G, with_labels=True, node_color='skyblue', pos=pos)

This plots a monster graph, which, while impressive, is not particularly useful.


Let’s try again, but take only one relationship at a time.



To do this, we will create a function that takes the relationship text as input.

def display_graph(relation):
    G = nx.from_pandas_edgelist(kg_df[kg_df['edge'] == relation], "source", "target",
                                edge_attr=True, create_using=nx.MultiDiGraph())
    plt.figure(figsize=(12, 12))
    pos = nx.spring_layout(G, k=0.5) # k regulates the distance between nodes
    nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, pos=pos)

Now, when I run display_graph("launches"), I get the graph at the beginning of the article.

Here are a few more relationships that I plotted.


I created a Colab notebook with all the steps in this article, and at the end you will find a nice form with many more relationships to check out.



Just run all the code, click on the pulldown selector, and click on the play button to see the graph.


Resources to Learn More & Community Projects

Here are some resources that I found useful while putting this tutorial together.

I asked my followers to share their Python projects, and I was excited to see how many creative ideas are coming to life from the community!

More Resources:

Image Credits



All screenshots taken by author, August 2020