by Ingroj Shrestha on July 9, 2017


The increasing amount of information being shared over the web makes it a huge source of data. To extract this data for analysis you need web scraping, a popular technique for getting data from a web page in whatever format suits your analysis. Also, make sure that you use the scraped data for research purposes only; for commercial use you have to contact the respective data sources about copyright.


In this blog post, we'll be scraping Nepali news from E-Kantipur using Python and Beautiful Soup, a popular Python library for navigating through and extracting data from web pages.


I assume that you have Python and the necessary packages installed; if not, the first step is to get them installed. You can follow these links to do so:

Python Download 

Beautiful Soup Documentation 

NOTE: We'll also be using two other packages, re and urllib, but you do not need to install them because they are part of the standard Python library.


Okay, we are ready to start coding.



Specify File Encoding


The first step in the code is to declare the UTF-8 encoding of the Python file, as we'll be scraping Nepali text. (Python 3 assumes UTF-8 source encoding by default, but the declaration is required on Python 2 and does no harm.)

#include the following as the first line in your code
# -*- coding: utf-8 -*-
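As a quick sanity check that the encoding is working, here is a minimal sketch (the label below is one of the category names from the site):

```python
# -*- coding: utf-8 -*-
# Nepali text can be used directly in string literals once the source
# encoding is UTF-8 (Python 3's default).
category_name = 'विचार/विश्लेषण'  # a category label from the site

# the literal survives an explicit UTF-8 round trip
assert category_name == category_name.encode('utf-8').decode('utf-8')
print(category_name)
```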

Import Libraries


The next step in the code is to import all the libraries we'll be using.

#to navigate, parse and scrape web contents
import bs4 as bs

For details on Beautiful Soup and its modules go to Beautiful Soup Documentation

#to fetch URLs
import urllib.request

For details on urllib and its modules go to URL handling modules

#to use regular expressions
import re  

For details on re and its modules go to Regular expression operations

Define URL Variable and Open URL


The next step is defining a URL variable — nothing new here, just a normal Python variable assignment. Then we'll use the urlopen() function from urllib.request to open the URL stored in the variable we just declared.

NEWS_LINK = 'http://kantipur.ekantipur.com'
SOURCE = 'kantipur'

r = urllib.request.urlopen(NEWS_LINK).read()


Create Beautiful Soup Object


We'll now create a Beautiful Soup object using the BeautifulSoup() constructor. The object will contain the parsed HTML of the web page pointed to by the URL variable.

soup = bs.BeautifulSoup(r, 'html.parser')

By calling the prettify() function we can view the properly indented HTML structure of the web page.

print(soup.prettify())




NOTE: The prettify() output is a long, indented HTML tree; only a part of it is relevant to us.
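To see what prettify() produces without fetching the live site, here is a minimal sketch on a made-up inline snippet (the markup is a stand-in, not E-Kantipur's actual page):

```python
import bs4 as bs

# a tiny stand-in for a fetched page
html = '<div class="wrap"><h1>Headline</h1></div>'
soup = bs.BeautifulSoup(html, 'html.parser')

# prettify() re-indents the parse tree, one tag per line
print(soup.prettify())
```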



Searching HTML Tree using Beautiful Soup Functions


As can be seen, not everything that the soup object contains is important to us. So we'll use the soup object and its functions to extract the data we need from specific locations in the HTML DOM. For this purpose we need to search the HTML DOM hierarchy. Beautiful Soup provides a number of functions for searching the DOM hierarchy, but here we'll be using two of them: find() and find_all().

#Returns the first <div> tag with class wrap
title_source = soup.find('div', {'class': ['wrap']})

#Returns a list of all <div> tags with class wrap
title_source = soup.find_all('div', {'class': ['wrap']})
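A minimal sketch of the difference between the two functions, using a small inline snippet rather than the live page:

```python
import bs4 as bs

# two matching divs in a made-up snippet
html = ('<div class="wrap">first</div>'
        '<div class="wrap">second</div>')
soup = bs.BeautifulSoup(html, 'html.parser')

# find() returns the first matching Tag (or None if nothing matches)
first = soup.find('div', {'class': ['wrap']})
print(first.text)      # first

# find_all() returns a list of every matching Tag
all_divs = soup.find_all('div', {'class': ['wrap']})
print(len(all_divs))   # 2
```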
      

NOTE: If you are not familiar with HTML and HTML tags then you can go through HTML Tutorial - W3 Schools.

#link for homepage
<a href="/"> गृह पृष्ठ </a>

#link for a category
<a href="/category/opinion"> विचार/विश्लेषण </a>

The above snippet, taken from E-Kantipur's web page, is the HTML which contains links for news categories and other pages like the homepage. We'll extract the categories using some Beautiful Soup and re functions: a Beautiful Soup function is used to search for the HTML tag, and a regex is used to match the pattern and filter in links for categories only. We'll then use key-value pairs to remove redundant categories from those originally extracted.


categoryOriginal = {}
#search all <a> tags
for paragraph in soup.find_all('a'):
    #obtain the value of the href attribute; for the above snippet link = /category/opinion
    link = paragraph.get('href')

    if link is not None:
        #regex to match and filter in links for news categories only
        match = re.match(r'/category/', link, re.M | re.I)
        if match:
            #add the category to the dictionary; for the above snippet the value stored is विचार/विश्लेषण
            categoryOriginal[link.replace("/category/", "")] = paragraph.text


#remove redundant categories
#for the above snippet, key = opinion and value = विचार/विश्लेषण
categories = {}
for key, value in categoryOriginal.items():
    if key not in categories:
        categories[key] = value
#hand the de-duplicated categories to the headline extraction step
extractHeadline(categories)
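Putting the two loops together on a stand-in snippet (the anchors below are made up, mimicking the structure shown earlier; a duplicate link is included to show the de-duplication):

```python
import re
import bs4 as bs

html = ('<a href="/"> गृह पृष्ठ </a>'
        '<a href="/category/opinion"> विचार/विश्लेषण </a>'
        '<a href="/category/opinion"> विचार/विश्लेषण </a>')
soup = bs.BeautifulSoup(html, 'html.parser')

categoryOriginal = {}
for anchor in soup.find_all('a'):
    link = anchor.get('href')
    if link is not None:
        # keep only links of the form /category/<name>
        if re.match(r'/category/', link, re.M | re.I):
            categoryOriginal[link.replace('/category/', '')] = anchor.text.strip()

# dictionary keys already de-duplicate repeated categories
categories = {}
for key, value in categoryOriginal.items():
    if key not in categories:
        categories[key] = value

print(categories)   # {'opinion': 'विचार/विश्लेषण'}
```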


You can go through Python Regular Expressions to learn more on regex in Python.


To get the headline and news body content we first need to open each category page. For this we need to join the category link to the base URL. We can achieve this using the string join() method.

#categoryList holds the keys of the categories dictionary
for index, category in enumerate(categoryList):
    url = ''
    #for the above snippet url = http://kantipur.ekantipur.com/category/opinion
    url = url.join((NEWS_LINK, '/category/', category))
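str.join() concatenates the items of the tuple using the (here empty) separator string, so the sketch below yields the full category URL:

```python
NEWS_LINK = 'http://kantipur.ekantipur.com'
category = 'opinion'

# ''.join(...) simply concatenates the tuple's items in order
url = ''.join((NEWS_LINK, '/category/', category))
print(url)   # http://kantipur.ekantipur.com/category/opinion
```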

We'll now open the newly joined URL for each category and instantiate a soup object for the headlines.

r = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(r, 'html.parser')

<div class="item-wrap">
    <h2>
        <a href="/news/2017-07-09/20170709082415.html"> राष्ट्र/राज्य : इतिहास र यथार्थ </a>
    </h2>
    ....
</div>

The above snippet of code is for the headlines and contains links to the actual news content pages. We'll locate all the headlines using the find_all() function, specifying the div tag and the item-wrap class. We'll then append each extracted headline link to the headlineList array by reading the href attribute of its <a> tag.

headlineList = []
for data in soup.find_all('div', {'class': ['item-wrap']}):
    if data.find('a') is not None:
        #verify that it links to a content page
        if 'html' in data.find('a').get('href'):
            #for the above snippet, /news/2017-07-09/20170709082415.html is appended
            headlineList.append(data.find('a').get('href'))
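The loop above can be exercised on a stand-in snippet mirroring the item-wrap structure (the second div is made up to show a non-article link being filtered out):

```python
import bs4 as bs

# stand-in markup mirroring the item-wrap structure shown above
html = ('<div class="item-wrap"><h2>'
        '<a href="/news/2017-07-09/20170709082415.html">headline</a>'
        '</h2></div>'
        '<div class="item-wrap"><a href="/category/opinion">not news</a></div>')
soup = bs.BeautifulSoup(html, 'html.parser')

headlineList = []
for data in soup.find_all('div', {'class': ['item-wrap']}):
    a = data.find('a')
    # only links ending in .html point at actual article pages
    if a is not None and 'html' in a.get('href'):
        headlineList.append(a.get('href'))

print(headlineList)   # ['/news/2017-07-09/20170709082415.html']
```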

To get to the page containing the news headline, body and other metadata such as date and author, we'll again call the string join() method.

for index, headline in enumerate(headlineList):
    url = ''
    #for the above snippet, url = http://kantipur.ekantipur.com/news/2017-07-09/20170709082415.html
    url = url.join((NEWS_LINK, headline))

We'll open the newly joined URL for each headline and instantiate a soup object for the headline and news body.

r = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(r, 'html.parser')

<div class="wrap">
    <h1> राष्ट्र/राज्य : इतिहास र यथार्थ  </h1>

</div>

We'll now extract the headline of each news item, which is placed inside an <h1> tag. As seen in the above snippet, the <h1> tag is within a <div> tag with class wrap.

title_source = soup.find('div', {'class': 'wrap'})
#for above snippet title = राष्ट्र/राज्य : इतिहास र यथार्थ 
title = title_source.find('h1').text
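The same two lines can be run against an inline copy of the snippet above (with a strip() added to drop the surrounding whitespace):

```python
import bs4 as bs

# inline copy of the wrap/h1 structure shown above
html = '<div class="wrap"><h1> राष्ट्र/राज्य : इतिहास र यथार्थ </h1></div>'
soup = bs.BeautifulSoup(html, 'html.parser')

title_source = soup.find('div', {'class': 'wrap'})
# strip() removes the leading/trailing whitespace around the headline
title = title_source.find('h1').text.strip()
print(title)
```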

<p>

युग पाठकले आफ्नो नयाँ कृति ‘माङ्गेना’ नाम दिइएको नेपाल मन्थन विषयक पुस्तकमा इतिहासको गहिरो उत्खनन गरेका छन् । उनको निचोड छ, नेपाल काठमाडौं उपत्यकामा विकसित एउटा सभ्यता हो । चार–पाँच दशक अघिसम्म यसको नाम र नक्सा मेल खाँदैनथ्यो । त्यस्तै, एउटा जातिको भाषा ‘खस कुरा’ लाई नेपाली भाषा बनाइयो । तसर्थ अहिलेसम्मका इतिहासले नेपाल र नेपालीबारे शासकीय हित र स्वार्थअनुकूल एउटा मिथक मात्र निर्माण गरेका छन् र यो यथार्थसँग मेल खाने वृतान्त होइन । यस्तै, राष्ट्रियताका नाममा विजेता गोर्खाली जातिको भाषा संस्कृतिलाई मात्र राष्ट्रियताको जामा लगाएर अरूको ‘पराईकरण’ भएको छ । विजेता शासकको जाति, भाषा, संस्कृति र मान्यतामा नेपाललाई ढाल्ने काम भयो । अरूलाई निषेध गरियो, पाखा लगाइयो । यसमा खासै विमति राख्ने ठाउँ छैन । वास्तवमा उनको यो पुस्तक युगीन महत्त्वको छ । अत्यन्तै सरल भाषा र शैलीमा लेखिएको छ । खासै शास्त्रीय बहस र बखान छैन । कुनै शास्त्रीय अवधारणाको जामापगरी पनि छैन । तर अध्ययन र अनुसन्धान भने व्यापक छ । प्रस्तुति बेजोडको छ । पृथक् चिन्तन र विनिर्माणको विधामा युगले अरूलाई उछिनेका छन् । 

</p>


We'll now extract the news body of each news item, which is placed inside <p> tags as shown in the snippet above.

#body_content is the <div> that wraps the article's paragraphs
news_body = ''
for body in body_content.find_all('p'):
    news_body += body.text
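On a made-up two-paragraph article, the concatenation looks like this:

```python
import bs4 as bs

# stand-in article body with two short paragraphs
html = '<div class="wrap"><p>पहिलो अनुच्छेद।</p><p>दोस्रो अनुच्छेद।</p></div>'
soup = bs.BeautifulSoup(html, 'html.parser')

body_content = soup.find('div', {'class': 'wrap'})
news_body = ''
for paragraph in body_content.find_all('p'):
    news_body += paragraph.text

print(news_body)   # पहिलो अनुच्छेद।दोस्रो अनुच्छेद।
```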

With that, we have learned to scrape Nepali news using the Beautiful Soup, re and urllib packages. You can use this method to scrape Nepali text from other web pages as well; just don't forget to specify the encoding type.


Tags: Data Mining