The increasing amount of information being shared over the web makes it a huge source of data. To extract this data for analysis, you need web scraping, a popular technique for getting data from a web page in whatever format suits your analysis. Also, make sure that you use the scraped data for research purposes only; for commercial use, you have to contact the respective data sources for copyright permission.
In this blog post, we’ll be scraping Nepali news from E-Kantipur using Python and BeautifulSoup, a popular Python library for navigating through web pages and extracting their contents.
I assume that you have Python and the necessary packages installed; if not, the first step is to get them installed. You can follow these links to do so:
Python Download
Beautiful Soup Documentation
NOTE: We’ll also be using two other packages, re and urllib, but you do not need to install them because they are standard Python libraries.
Okay, we are ready to start coding.
Specify File Encoding
The first step in the code is to declare the utf-8 encoding of the Python file, as we’ll be scraping Nepali text.
# include the following as the first line in your code
# -*- coding: utf-8 -*-
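As a quick sanity check, here is a minimal sketch of a file that declares the encoding and prints a Nepali string literal (the string is just an example):

# -*- coding: utf-8 -*-
# with the encoding declared, Nepali string literals can live in the source file
text = 'गृह पृष्ठ'
print text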
Import Libraries
The next step in the code is to import all the libraries we’ll be using.
# to navigate, parse and scrape web contents
import bs4 as bs
For details on Beautiful Soup and its modules, go to the Beautiful Soup Documentation.
# to fetch URLs
import urllib
For details on urllib and its modules, go to URL handling modules.
# to use regular expressions
import re
For details on re and its modules, go to Regular expression operations.
Define URL Variable and Open URL
The next step is defining the URL variable; nothing new here, just a normal Python variable declaration. Then we’ll use the urlopen() function of the urllib package to open the URL stored in the variable we just declared.
NEWS_LINK = 'http://kantipur.ekantipur.com'
SOURCE = 'kantipur'
r = urllib.urlopen(NEWS_LINK).read()
Create Beautiful Soup Object
We’ll now create a Beautiful Soup object using the BeautifulSoup() function. The object will contain the parsed HTML of the webpage pointed to by the URL variable.
# specify a parser explicitly; 'html.parser' is Python's built-in one
soup = bs.BeautifulSoup(r, 'html.parser')
By calling the prettify() function we can view the properly formatted HTML structure of the webpage.
print soup.prettify()
NOTE: prettify() prints the entire HTML of the page, so the output is long; we only need a few specific parts of it, which we’ll locate next.
Searching HTML Tree using Beautiful Soup Functions
As can be seen, not everything that the soup object contains is important to us. So we’ll use the soup object and its functions to extract the data we need from specific locations of the HTML DOM. For this purpose we need to search the HTML DOM hierarchy. Beautiful Soup provides a number of functions for searching the DOM hierarchy, but here we’ll be using two of them: find() and find_all().
# returns the first <div> tag with class wrap
title_source = soup.find('div', {'class': ['wrap']})
# returns all <div> tags with class wrap
title_sources = soup.find_all('div', {'class': ['wrap']})
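To make the difference concrete, here is a small sketch, assuming the soup object built earlier: find() returns the first matching Tag (or None if nothing matches), while find_all() returns a list-like ResultSet that can be iterated or counted.

first_wrap = soup.find('div', {'class': ['wrap']})
all_wraps = soup.find_all('div', {'class': ['wrap']})
print type(first_wrap)   # bs4.element.Tag, or NoneType if there is no match
print len(all_wraps)     # number of matching <div class="wrap"> elements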
NOTE: If you are not familiar with HTML and HTML tags then you can go through HTML Tutorial – W3 Schools.
# link for homepage
<a href="/"> गृह पृष्ठ </a>
# link for a category
<a href="/category/opinion"> विचार/विश्लेषण </a>
The above snippet, taken from E-Kantipur’s webpage, is the HTML that contains the links for news categories and other pages like the homepage. We’ll extract the categories using some Beautiful Soup and re functions: a Beautiful Soup function searches for the HTML tag, and a regex matches the pattern to filter in links for categories only. We’ll then use the key-value pairs to remove redundant categories from the originally extracted set.
# 'content' is the soup subtree containing the navigation links;
# searching the whole page also works, which we assume here
content = soup
categoryOriginal = {}
categories = {}
# searches all <a> tags
for paragraph in content.find_all('a'):
    # obtain the value of the href attribute; for the above snippet, link = /category/opinion
    link = paragraph.get('href')
    if link is not None:
        # regex to match and filter in links for news categories only
        match = re.match(r'/category/', link, re.M | re.I)
        if match:
            # store the category; for the above snippet,
            # key = opinion and value = विचार/विश्लेषण
            categoryOriginal[link.replace("/category/", "")] = paragraph.text
# remove redundant categories
for key, value in categoryOriginal.items():
    if key not in categories.keys():
        categories[key] = value
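As a quick check, assuming the loop above has populated categories, we can print the extracted key-value pairs; encoding the Nepali names as utf-8 avoids console errors in Python 2:

# for the above snippet this would print: opinion विचार/विश्लेषण
for key, value in categories.items():
    print key, value.encode('utf-8')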
You can go through Python Regular Expressions to learn more on regex in Python.
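Here is a minimal, standalone illustration of the same pattern used above, assuming a category link as input:

import re
link = '/category/opinion'
# matches only links that begin with /category/
match = re.match(r'/category/', link, re.M | re.I)
if match:
    print link.replace('/category/', '')   # prints: opinion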
To get the headline and news body content, we first need to open each category page. For this we need to join each category’s link to the original URL, which we can achieve with the string join() method.
# categoryList holds the extracted category keys
categoryList = categories.keys()
for index, category in enumerate(categoryList):
    url = ''
    # for the above snippet, url = http://kantipur.ekantipur.com/category/opinion
    url = url.join((NEWS_LINK, '/category/', category))
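If you prefer not to assemble URLs by hand, urljoin() from the standard urlparse module is a more forgiving alternative, since it handles stray or missing slashes between the base and the path; a small sketch using the same variables:

import urlparse
# equivalent result for the above snippet: http://kantipur.ekantipur.com/category/opinion
url = urlparse.urljoin(NEWS_LINK, '/category/' + category)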
We’ll now open the newly joined URL for each category and instantiate a soup object for the headlines.
r = urllib.urlopen(url).read()
soup = bs.BeautifulSoup(r, 'html.parser')
<div class="item-wrap">
  <h2>
    <a href="/news/2017-07-09/20170709082415.html"> राष्ट्र/राज्य : इतिहास र यथार्थ </a>
  </h2>
  ....
</div>
The above snippet is the HTML for the headlines, which contains links to the actual news content pages. We’ll locate all the headlines using the find_all() function, specifying the div tag and the item-wrap class, and then append each extracted headline link to the headlineList array by reading the href attribute of its <a> tag.
headlineList = []
for data in soup.find_all('div', {'class': ['item-wrap']}):
    if data.find('a') is not None:
        # verify that it is a content page
        if 'html' in data.find('a').get('href'):
            # for the above snippet, the appended link = /news/2017-07-09/20170709082415.html
            headlineList.append(data.find('a').get('href'))
To get to the page containing the news headline, body, and other metadata such as date and author, we’ll again use the string join() method.
for index, headline in enumerate(headlineList):
    url = ''
    # for the above snippet, url = http://kantipur.ekantipur.com/news/2017-07-09/20170709082415.html
    url = url.join((NEWS_LINK, headline))
We’ll open the newly joined URL for each headline and instantiate a soup object for the headline and news body.
r = urllib.urlopen(url).read()
soup = bs.BeautifulSoup(r, 'html.parser')
<div class="wrap">
  <h1> राष्ट्र/राज्य : इतिहास र यथार्थ </h1>
</div>
We’ll now extract the headline of each news item, which is placed inside the <h1> tag. As seen in the above snippet, the <h1> tag is within the <div> tag with class wrap.
title_source = soup.find('div', {'class': 'wrap'})
# for the above snippet, title = राष्ट्र/राज्य : इतिहास र यथार्थ
title = title_source.find('h1').text
<p> युग पाठकले आफ्नो नयाँ कृति ‘माङ्गेना’ नाम दिइएको नेपाल मन्थन विषयक पुस्तकमा इतिहासको गहिरो उत्खनन गरेका छन् । उनको निचोड छ, नेपाल काठमाडौं उपत्यकामा विकसित एउटा सभ्यता हो । चार–पाँच दशक अघिसम्म यसको नाम र नक्सा मेल खाँदैनथ्यो । त्यस्तै, एउटा जातिको भाषा ‘खस कुरा’ लाई नेपाली भाषा बनाइयो । तसर्थ अहिलेसम्मका इतिहासले नेपाल र नेपालीबारे शासकीय हित र स्वार्थअनुकूल एउटा मिथक मात्र निर्माण गरेका छन् र यो यथार्थसँग मेल खाने वृतान्त होइन । यस्तै, राष्ट्रियताका नाममा विजेता गोर्खाली जातिको भाषा संस्कृतिलाई मात्र राष्ट्रियताको जामा लगाएर अरूको ‘पराईकरण’ भएको छ । विजेता शासकको जाति, भाषा, संस्कृति र मान्यतामा नेपाललाई ढाल्ने काम भयो । अरूलाई निषेध गरियो, पाखा लगाइयो । यसमा खासै विमति राख्ने ठाउँ छैन । वास्तवमा उनको यो पुस्तक युगीन महत्त्वको छ । अत्यन्तै सरल भाषा र शैलीमा लेखिएको छ । खासै शास्त्रीय बहस र बखान छैन । कुनै शास्त्रीय अवधारणाको जामापगरी पनि छैन । तर अध्ययन र अनुसन्धान भने व्यापक छ । प्रस्तुति बेजोडको छ । पृथक् चिन्तन र विनिर्माणको विधामा युगले अरूलाई उछिनेका छन् । </p>
We’ll now extract the body of each news item, which is placed inside <p> tags, as shown in the snippet above.
# locate the container holding the news body; assumed here to be the same 'wrap' <div> as the title
body_content = soup.find('div', {'class': 'wrap'})
BODY = ''
for body in body_content.find_all('p'):
    BODY += body.text
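Finally, since the scraped text is Nepali, it is worth saving it with an explicit utf-8 encoding. Here is a minimal sketch using the standard codecs module (the filename is just an example):

import codecs
# write the scraped headline and body as utf-8 text
with codecs.open('news.txt', 'w', encoding='utf-8') as f:
    f.write(title + '\n')
    f.write(BODY)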
With that, we have learned to scrape Nepali news using the Beautiful Soup, re, and urllib packages. You can use this method to scrape Nepali text from other webpages as well; just don’t forget to specify the encoding type.
Tags: Data Mining