PROFESSIONALLY OPTIMIZED WEBSITES STARTING AT $995
Our team of SEO Web Design gurus are standing by to assist you achieve your online marketing goals.

+1-971-599-3330

REQUEST QUOTE
SEO Web Design, LLC aims to improve business by delivering effective solutions based on innovative technologies and professional designs. Discover the variety of services we offer and convince yourself on the basis of the latest works that we've done. We love building fresh, unique and usable websites optimized specifically for your niche.

Responsive Web Design

SEO / SEM / Social Media

Conversion Rate Optimization

Email Marketing

Online Presence Analysis

Web Hosting
Top
SEO Web Design, LLC / SEO  / How to Generate Quality FAQs & FAQPage Schemas Automatically with Python via @hamletbatista

How to Generate Quality FAQs & FAQPage Schemas Automatically with Python via @hamletbatista

During my SEJ eSummit presentation[1], I introduced the idea of generating quality content at scale[2] by answering questions automatically.

I introduced a process where we researched questions using popular tools.

But, what if we don’t even need to research the questions?

In this column, we are taking that process a significant step further.

We are going to learn how to generate high-quality question/answer pairs (and their corresponding schema) automatically.

Here is the technical plan:

  • We will fetch content from an example URL.
  • We will feed that content into a T5-based[3] question/answer generator.
  • We will generate a FAQPage schema object[4] with the questions and answers.
  • We will validate the generated schema and produce a preview to confirm it works as expected.
  • We will go over the concepts that make this possible.

Generating FAQs from Existing Text

Let’s start with an example.

Make a copy of this Colab notebook[5] I created to show this technique.

  • Change the runtime to GPU and click connect.
  • Feel free to change the URL in the form. For illustration purposes, I’m going to focus on this recent article about Google Ads hiding keyword data[6].
  • The second input in the form is a CSS selector[7] we are using to aggregate text paragraphs. You might need to change it depending on the page you use to test.
  • After you hit Runtime > Run all, you should get a block of HTML that you can copy and paste into the Rich Results Test tool[8].

Advertisement

Continue Reading Below

Here is what the rich results preview looks for that example page:

How to Generate Quality FAQs & FAQPage Schemas Automatically with Python

How to Generate Quality FAQs & FAQPage Schemas Automatically with Python

Completely automated.

How cool is that, right?

Now, let’s step through the Python code to understand how the magic is happening.

Fetching Article Content

We can use the reliable Requests-HTML[9] library to pull content from any page, even if the content is rendered using JavaScript.

We just need a URL and a CSS selector to extract only the content that we need.

Advertisement

Continue Reading Below

We need to install the library with:

!pip install requests-html

Now, we can proceed to use it as follows.

from requests_html import HTMLSession
session = HTMLSession()

with session.get(url) as r:

    paragraph = r.html.find(selector, first=False)

    text = " ".join([ p.text for p in paragraph])

I made some minor changes from what I’ve used in the past.

I request a list of DOM nodes when I specify first=False, and then I join the list by combining each paragraph by spaces.

I’m using a simple selector, p, which will return all paragraphs with text.

This works well for Search Engine Journal, but you might need to use a different selector and text extraction strategy for other sites.

After I print the extracted text, I get what I expected.

How to Generate Quality FAQs & FAQPage Schemas Automatically with Python

How to Generate Quality FAQs & FAQPage Schemas Automatically with Python

The text is clean of HTML tags and scripts.

Now, let’s get to the most exciting part.

We are going to build a deep learning model that can take this text and turn it into FAQs! 🤓

Google T5 for Question & Answer Generation

I introduced Google’s T5 (Text-to-Text Transfer Transformer) in my article about quality title and meta description generation[10].

T5 is a natural language processing model that can perform any type of task, as long as it takes input as text and output as text, provided you have the right dataset.

I also covered it during my SEJ eSummit talk[11], when I mentioned that the designers of the algorithm actually lost on a trivia contest!

How to Generate Quality FAQs & FAQPage Schemas Automatically with Python

How to Generate Quality FAQs & FAQPage Schemas Automatically with Python

Now, we are going to leverage the amazing work of researcher Suraj Patil.[12]

Advertisement

Continue Reading Below

He put together a high-quality GitHub repository[13] with T5 fine-tuned for question generation using the SQuAD dataset[14].

The repo includes instructions on how to train the model, but as he already did that, we will leverage his pre-trained models.

This will saves us significant time and expense.

Let’s review the code and steps to set up the FAQ generation model.

First, we need to download the T5 weights.

!python -m nltk.downloader punkt

This will cause the python library nltk[15] to download some files.

[nltk_data] Downloading package punkt to /root/nltk_data... [nltk_data] Unzipping tokenizers/punkt.zip.

Clone the repository.

!git clone https://github.com/patil-suraj/question_generation.git
%cd question_generation

At the time of writing this, I faced a bug[16] and in the notebook I applied a temporary patch[17]. See if the issue has been closed and you can skip it.

Next, let’s install the transformers library.

!pip install transformers

We are going to import a module that mimics the transformers pipelines[18] to keep things super simple.

from pipelines import pipeline

Now we get to the exciting part that takes just two lines of code!

nlp = pipeline("multitask-qa-qg")
faqs = nlp(text)

Here are the generated questions and answers for the article we scraped.

How to Generate Quality FAQs & FAQPage Schemas Automatically with Python
Look at the incredible quality of the questions and the answers.

How to Generate Quality FAQs & FAQPage Schemas Automatically with Python

They are diverse and comprehensive.

Advertisement

Continue Reading Below

And we didn’t even have to read the article to do this!

I was able to do this quickly by leveraging open source code that is freely available.

Let’s generate a FAQPage JSON-LD schema using these generated questions.

Generating FAQPage Schema

In order to generate the JSON-LD schema easily, we are going to borrow an idea from one of my early articles.

We used a Jinja2 template[19] to generate XML sitemaps[20] and we can use the same trick to generate JSON-LD and HTML.

We first need to install jinja2.

!pip install jinja2

This is the jinja2 template that we will use to do the generation.

faqpage_template="""<script type="application/ld+json">

{

"@context": "https://schema.org",

"@type": "FAQPage",

"mainEntity": [

{% for faq in faqs %}

{

"@type": "Question",

"name": {{faq.question|tojson}},

"acceptedAnswer": {

"@type": "Answer",

"text": {{faq.answer|tojson}}

}

}{{ "," if not loop.last }}

{% endfor %}

]

}

</script>"""

I want to highlight a couple of tricks I needed to use to make it work.

The first challenge with our questions is that they include quotes (“), for example:

Who announced that search queries without a "significant" amount of data will no longer show in query reports?

This is a problem because the quote is a separator in JSON.

Instead of quoting the values manually, I used a jinja2 filter, tojson to do the quoting for me and also escape any quotes.

Advertisement

Continue Reading Below

It converts the example above to:

"Who announced that search queries without a "significant" amount of data will no longer show in query reports?"

The other challenge was that adding the comma after each question/answer pair works well for all but the last one, where we are left with a dangling comma.

I found another StackOverflow thread[21] with an elegant solution for this.

{{ "," if not loop.last }}

It only adds the comma if it is not the last loop iteration.

Once you have the template and the list of unique FAQs, the rest is straightforward.

from jinja2 import Template
template=Template(faqpage_template)

faqpage_output=template.render(faqs=new_faqs)

That is all we need to generate our JSON-LD output.

You can find it here[22].

Finally, we can copy and paste it into the Rich Results Test tool, validate that it works and preview how it would look in the SERPs[23].

How to Generate Quality FAQs &#038; FAQPage Schemas Automatically with Python

How to Generate Quality FAQs &#038; FAQPage Schemas Automatically with Python

Awesome.

Advertisement

Continue Reading Below

Deploying the Changes to Cloudflare with RankSense

Finally, if your site uses the Cloudflare CDN, you could use the RankSense app content rules to add the FAQs to the site without involving developers. (Disclosure: I am the CEO and founder of RankSense.)[24]

Before we can add FAQPage schema to the pages, we need to create the corresponding FAQs to avoid any penalties.

According to Google’s general structured data guidelines[25]:

Don’t mark up content that is not visible to readers of the page. For example, if the JSON-LD markup describes a performer, the HTML body should describe that same performer.”

We can simply adapt our jinja2 template so it outputs HTML.

faqpage_template=""" <div id="FAQTab">

{% for faq in faqs %}

<div id="Question"> {{faq.question}} </div>

<div id="Answer"> <strong>{{faq.answer}} </strong></div>

{% endfor %}

</div> """

Here is what the HTML output looks like.

How to Generate Quality FAQs &#038; FAQPage Schemas Automatically with Python

How to Generate Quality FAQs &#038; FAQPage Schemas Automatically with Python

Now, that we can generate FAQs and FAQPage schemas for any URLs, we can simply populate a Google Sheet with the changes.

Advertisement

Continue Reading Below

My team shared a tutorial with code you can use to automatically populate sheets here[26].

Your homework is to adapt it to populate the FAQs HTML and JSON-LD we generated.

In order to update the pages, we need to provide Cloudflare-supported CSS selectors[27] to specify where in the DOM we want to make the insertions.

We can insert the JSON-LD in the HTML head, and the FAQ content at the bottom of the Search Engine Journal article.

As you are making HTML changes and can potentially break the page content, it would be a good idea to preview the changes[28] using the RankSense Chrome extension.

In case you are wondering, RankSense makes these changes directly in the HTML without using client-side JavaScript.

They happen in the Cloudflare CDN and are visible to both users and search engines.

Now, let’s go over the concepts that make this work so well.

How Does This T5-Based Model Generate These Quality FAQs?

The researcher is using an answer-aware, neural question generation[29] approach.

Advertisement

Continue Reading Below

This approach generally requires three models:

  • One to extract potential answers from the text.
  • Another to generate questions given answers and the text.
  • Finally, a model to take the questions and the context and produce the answers.

Here’s a useful explanation from Suraj Patil[30].

How to Generate Quality FAQs &#038; FAQPage Schemas Automatically with Python

How to Generate Quality FAQs &#038; FAQPage Schemas Automatically with Python

One simple approach to extract answers from text is to use Name Entity Recognition (NER) or facts from a custom knowledge graph like we built in my last column[31].

A question generation model is basically a question-answering model but with the input and target reversed.

A question-answering model takes questions+context and outputs answers, while a question-generation model takes answers+context and outputs questions.

Advertisement

Continue Reading Below

Both types of models can be trained using the same dataset – in our case, the readily available SQuAD dataset.

Make sure to read my post[32] for the Bing Webmaster Tools Blog.

I provided a fairly in-depth explanation of how transformer-based, question-answering models work.

It includes simple analogies and some minimal Python code.

One of the core strengths of the T5 model is that it can perform multiple tasks, as long as they take text and output text.

This enabled the researcher to train just one model to perform all tasks instead of three models.

Resources & Community Projects

The Python SEO community keeps growing with more rising stars appearing every month. 🐍🔥

Here are some exciting projects and new faces I learned about on Twitter.

Advertisement

Continue Reading Below

Keyword Density and Entity Calculator (Python + Knowledge Base API)[38]

Get the most out of PageSpeed Insights API with Python[39]

More Resources:

Advertisement

Continue Reading Below


Image Credits

All screenshots taken by author, September 2020

References

  1. ^ SEJ eSummit presentation (www.youtube.com)
  2. ^ generating quality content at scale (www.searchenginejournal.com)
  3. ^ T5-based (ai.googleblog.com)
  4. ^ FAQPage schema object (developers.google.com)
  5. ^ Colab notebook (github.com)
  6. ^ Google Ads hiding keyword data (www.searchenginejournal.com)
  7. ^ CSS selector (www.w3schools.com)
  8. ^ Rich Results Test tool (search.google.com)
  9. ^ Requests-HTML (requests.readthedocs.io)
  10. ^ quality title and meta description generation (www.searchenginejournal.com)
  11. ^ SEJ eSummit talk (www.youtube.com)
  12. ^ Suraj Patil. (github.com)
  13. ^ GitHub repository (github.com)
  14. ^ SQuAD dataset (rajpurkar.github.io)
  15. ^ nltk (www.nltk.org)
  16. ^ a bug (github.com)
  17. ^ a temporary patch (github.com)
  18. ^ transformers pipelines (huggingface.co)
  19. ^ Jinja2 template (jinja.palletsprojects.com)
  20. ^ generate XML sitemaps (www.searchenginejournal.com)
  21. ^ StackOverflow thread (stackoverflow.com)
  22. ^ here (gist.github.com)
  23. ^ SERPs (www.searchenginejournal.com)
  24. ^ RankSense app (www.cloudflare.com)
  25. ^ general structured data guidelines (developers.google.com)
  26. ^ here (github.com)
  27. ^ CSS selectors (developers.cloudflare.com)
  28. ^ preview the changes (help.ranksense.com)
  29. ^ neural question generation (paperswithcode.com)
  30. ^ Suraj Patil (github.com)
  31. ^ last column (www.searchenginejournal.com)
  32. ^ post (blogs.bing.com)
  33. ^ https://t.co/c1mSilB1Eh (t.co)
  34. ^ September 9, 2020 (twitter.com)
  35. ^ https://t.co/YD8pAexVg8 (t.co)
  36. ^ https://t.co/WsuaPJmxUn (t.co)
  37. ^ September 9, 2020 (twitter.com)
  38. ^ Keyword Density and Entity Calculator (Python + Knowledge Base API) (www.jcchouinard.com)
  39. ^ Get the most out of PageSpeed Insights API with Python (www.danielherediamejias.com)
  40. ^ September 9, 2020 (twitter.com)
  41. ^ @GoogleTrends (twitter.com)
  42. ^ @streamlit (twitter.com)
  43. ^ September 9, 2020 (twitter.com)

Powered by WPeMatico

No Comments

Sorry, the comment form is closed at this time.