How to do text scraping using Python


To use machine learning or deep learning for any natural language processing task, the first and foremost requirement is DATA. A lot of open source data is available, but most of it is in unstructured form. In this blog post, you will learn how to perform text scraping using Python so that this ‘unstructured data’ can be put to use.

In my previous post, I showed how to scrape images from Google. You can refer to that post for the details.

“The goal should be to turn the data into information and information into insights”

– Carly Fiorina, ex CEO of HP

Table of contents:

  • What is web scraping
  • Why use web scraping
  • How to scrape text data using Python
  • Steps followed
  • Summary

What is web scraping?

Whenever you want to perform any task based on a machine learning or deep learning algorithm, the first thing required is “DATA”.

The more data you have, the better your chances of achieving higher accuracy.

So how do you collect it?

From the pool of information, i.e. the “INTERNET”.

“Pulling data out of the internet by programmatic means is known as web scraping.”

Why web scrape the data?

No matter which domain you want to apply machine learning in, almost every one of them uses text data.

If you look at statistics on how web scraping is used, you will see that whether it is research-based data, price comparison, or email marketing, each relies on web scraping to collect and process text data.


How to web scrape the text data?

First, decide what you need to scrape and why. It is essential to have a clear goal in mind, because it helps you work out which part of the HTML needs to be extracted.

Here, I am going to scrape data from “THE HINDU” newspaper. The goal is to collect information regarding COVID-19. (We will see in later posts how to use this information.)


For this post, I am going to use the “BeautifulSoup” Python package for web scraping. You can install it with the following command (the leading “!” runs the command from a notebook cell; drop it when typing in a terminal):

!pip install beautifulsoup4

What is “BeautifulSoup”?

It’s a Python package used to parse HTML or XML documents.
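
To make “parsing” concrete, here is a minimal sketch that runs BeautifulSoup on a tiny hand-written HTML string. The sample_html content is made up for illustration and is not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML document, written by hand for this illustration.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Headline</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# Build a parse tree using Python's built-in HTML parser.
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string               # text inside <title>
first_paragraph = soup.find("p").get_text()  # text of the first <p> tag
paragraph_count = len(soup.find_all("p"))    # number of <p> tags

print(page_title)
print(first_paragraph)
print(paragraph_count)
```

Once the HTML is turned into a tree like this, individual tags can be looked up by name instead of searching through raw text.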

Steps followed to scrape the data

Whenever we want to access the content of a page, we need to send a request to the server. This is the first step in scraping data. Each web page requires one request.

To send the request, call the get() function from the requests module.

import requests
url='https://www.thehindu.com/news/national/india-coronavirus-lockdown-april-1-2020-live-updates/article31223884.ece'
html = requests.get(url).content     ##get the content of the URL as raw bytes
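
A request can fail or hang, so in practice it may help to wrap the call in a small helper. The sketch below is one possible way to do that, not part of the original walkthrough: the function name fetch_html, the 10-second timeout, and the optional session parameter are all assumptions made for illustration.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the raw bytes of the page at `url`, raising on an HTTP error.

    `fetch_html`, `timeout=10`, and `session` are illustrative choices,
    not something the original post prescribes.
    """
    # Use the given session if there is one, otherwise the requests module itself.
    http = requests if session is None else session
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.content
```

It would be called as html = fetch_html(url), which gives the same bytes as the one-line version when the request succeeds.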


The request returns the whole HTML content of the page, which has not been parsed yet. For this case, suppose I just need the paragraph content. To extract paragraphs, parse the HTML with BeautifulSoup and pass the <p> tag name to the find_all method.

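from bs4 import BeautifulSoup

unicode_str = html.decode("utf8")    ##decode the raw bytes into a Unicode string
encoded_str = unicode_str.encode("ascii",'ignore')    ##drop non-ASCII characters
news_soup = BeautifulSoup(encoded_str, "html.parser")    ##parse the HTML
a_text = news_soup.find_all('p')    ##list of all <p> tags on the page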
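
The find_all method returns tag objects rather than plain strings. As a minimal sketch of one way to get the text out, the example below uses get_text() on a short hand-written HTML snippet; the sample content is hypothetical and stands in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded article page.
sample_html = (
    b"<html><body>"
    b"<p> First update of the day. </p>"
    b"<p></p>"
    b"<p>Second <b>important</b> update.</p>"
    b"</body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# get_text() joins all text inside a tag, including nested tags such as <b>.
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]

# Discard paragraphs that turned out to be empty.
paragraphs = [text for text in paragraphs if text]

for text in paragraphs:
    print(text)
```

The result is a plain Python list of strings, one per non-empty paragraph.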


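The “.encode("ascii",'ignore')” call used before parsing converts the Unicode string into bytes in the requested encoding (ASCII). The 'ignore' argument silently drops every character that cannot be represented in ASCII, so accented letters and special symbols are lost.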
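
The effect of 'ignore' is easiest to see on a short string. The sample text below is made up for illustration.

```python
# Hypothetical sample containing two non-ASCII characters: "é" and "₹".
text = "Café costs ₹50"

# Characters that have no ASCII representation are dropped without an error.
ascii_bytes = text.encode("ascii", "ignore")

# Decode back to a regular string to inspect the result.
ascii_text = ascii_bytes.decode("ascii")

print(ascii_bytes)  # b'Caf costs 50'
print(ascii_text)   # Caf costs 50
```

Because characters are removed silently, this step suits cases where only plain English text is needed; text in other scripts would be lost.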

Summary

In this post, you learned how to fetch and parse data from the web. Using BeautifulSoup, you can extract the data in a well-structured form.

In the next post, you will learn general cleaning steps for text data so that this information can be used.