Data are key to any natural language processing and machine learning application. Machine learning algorithms learn from a predefined set of data, so it is important to feed them the right data. It is equally important that the data are in the right format and at the right scale. Since we are considering NLP in Nepali with a focus on text analysis, the kind of data we are considering in this blog post is written text in Nepali.
Data Acquisition
Text data can be survey responses, articles, social media posts, reviews, website content or news. With the variety of data available, the first step is to collect the kind of data you need for your analysis.
Social media platforms like Facebook, Twitter and YouTube contain a large amount of opinionated text. Each of these platforms provides an API, along with access credentials, that can be used to retrieve the text data.
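As a minimal sketch, assuming you have a Twitter developer account and a bearer token (the token value below is a placeholder), the tweepy library can be used to pull recent tweets matching a query:

```python
import tweepy

# Placeholder credential; obtain a real bearer token from the
# Twitter developer portal.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Search recent tweets written in Nepali ("lang:ne" filters by language).
response = client.search_recent_tweets(query="काठमाडौं lang:ne", max_results=10)

for tweet in response.data or []:
    print(tweet.text)
```

Note that each platform has its own API and rate limits, so the details vary; tweepy is just one convenient wrapper for Twitter.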
Blogs, online news sites and magazines are other great sources of text data. Web scraping can be used to collect textual content from these sources; some widely used web scraping libraries in Python include Beautiful Soup, Scrapy and lxml.
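For example, here is a minimal sketch using requests and Beautiful Soup to pull the paragraph text from an article page (the URL is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace with the article you want to scrape.
url = "https://example.com/nepali-news-article"

html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Collect the visible text from every paragraph tag on the page.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
article_text = "\n".join(paragraphs)
print(article_text)
```

In practice each site lays out its content differently, so the tags and classes you extract will vary from source to source.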
Text corpora contain large collections of written texts. The Nepali National Corpus (NNC) comprises two different text corpora, an annotated written corpus and a parallel corpus, collected by Madan Puraskar Pustakalaya under the Bhasa Sanchar project.
Data Preparation
Real-world data are noisy, incomplete and inconsistent, so after collecting the data, it should be cleaned before any further processing. Cleaning textual data involves fixing typos, stemming, and removing extra symbols, HTML tags and stop words.
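As an illustrative sketch, the snippet below strips HTML tags and extra symbols with regular expressions and then drops stop words. The stop-word set here is a tiny made-up sample, not a complete Nepali stop-word list, and stemming is left out since it is language specific:

```python
import re

# Tiny illustrative sample; a real application needs a full
# Nepali stop-word list.
STOP_WORDS = {"र", "छ", "को", "मा"}

def clean_text(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)              # remove HTML tags
    text = re.sub(r"[^\u0900-\u097F\s]", " ", text)  # keep only Devanagari characters and whitespace
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("<p>काठमाडौं र पोखरा राम्रा छन्!</p>"))
```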
Another important task in data preparation is formatting. It is important that the collected data is in the correct format. The data can be stored in a relational or NoSQL database, depending on the size of the data the application will use. Also, while it is tempting to use all the data that is available, a larger dataset does not always guarantee higher performance; in fact, the larger the dataset, the higher the computational cost. So, it is better to use a subset of the available data in the first run. If the smaller subset does not perform well in terms of precision and recall, there is always the option to use the whole dataset.
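For instance, assuming the collected texts sit in a pandas DataFrame loaded from a hypothetical file, a reproducible random subset for a first run could be drawn like this:

```python
import pandas as pd

# Hypothetical file of collected texts, one document per row.
df = pd.read_csv("nepali_texts.csv")

# Start with a 10% random sample; the fraction and seed are arbitrary.
subset = df.sample(frac=0.1, random_state=42)

# If precision and recall on the subset are unsatisfactory,
# rerun the experiment on the full DataFrame instead.
```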
Some other common tasks in preparing text data include tokenization, part-of-speech tagging, chunking, grouping and negation handling. Each of these tasks will be discussed in a separate blog post.