While looking for tech projects to work on, I came across an article on Analytics India Magazine that lists machine learning projects. The first project on the list is sentiment analysis, so I took up the challenge of building a sentiment analysis tool.
This post covers the first step I took toward building this software, along with the resources I used and the notes I took along the way. Note that I’ve only completed this first step so far and haven’t worked through the whole project yet.
Text Mining (also referred to as text analysis) is an Artificial Intelligence (AI) technology that uses natural language processing (NLP) to transform free, unstructured text in documents and databases into normalized, structured data suitable for analysis or for driving machine learning algorithms. Widely used in knowledge-driven organizations, text mining is the process of examining large collections of documents to discover new information or help answer specific research questions. Text mining identifies facts, relationships, and assertions that would otherwise remain buried in the mass of textual big data. Once extracted, this information is converted into a structured form.
Sentiment Analysis helps enterprises understand and learn from their clients and customers. It is used for social media monitoring, brand monitoring, voice of the customer (VoC), customer service, and market research. Sentiment analysis is a common NLP task that involves classifying texts, or parts of texts, into a pre-defined sentiment, using methods and algorithms that are rule-based, machine-learning-based, or hybrid. Analyzing large quantities of text data is a key way to understand what people are thinking, and well-trained machine learning models can classify statements as Positive, Negative, or Neutral.
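To make the rule-based idea concrete, here is a minimal sketch of a lexicon-based classifier. The word lists are a toy example I made up, not a real sentiment lexicon; production systems use much larger lexicons (or learned models) and handle negation, intensifiers, and context.

```python
# Toy rule-based sentiment classifier: count lexicon hits for each
# polarity and label the text by whichever score is higher.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def classify(text):
    # lowercase, split on whitespace, and strip trailing punctuation
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "Positive"
    if neg > pos:
        return "Negative"
    return "Neutral"

print(classify("I love this great product"))   # Positive
print(classify("What a terrible, awful day"))  # Negative
print(classify("The sky is blue"))             # Neutral
```

This captures the essence of the rule-based approach: no training data, just a dictionary of sentiment-bearing words and a tallying rule.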
Unstructured data requires processing to generate insights. Examples of unstructured data include news articles, social media posts, and search history. About 80% of available data is unstructured.
Natural Language Processing (NLP) is the process of analyzing natural language and making sense of it. To produce actionable insights from textual (unstructured) data, it is important to get acquainted with the techniques and principles of NLP. NLP is a part of computer science and artificial intelligence that deals with human languages. Using NLP and its components, one can organize massive chunks of text data and perform numerous automated tasks, solving a wide range of problems such as automatic summarization, machine translation, spell checking, keyword search, information extraction, advanced matching, sentiment analysis, speech recognition, chatbots, etc. NLP essentially has two subparts: Natural Language Understanding (NLU) and Natural Language Generation (NLG).
Natural Language Understanding (NLU) is the mapping of input to useful representations and the analysis of the different aspects of a language, whereas Natural Language Generation (NLG) comprises text planning, sentence planning, and text realization. NLU must resolve three types of ambiguity:
- Lexical Ambiguity: The presence of two or more possible meanings of a single word. For instance, in “The fisherman went to the bank,” the word bank can mean a) the bank where we deposit or withdraw money, or b) the river bank.
- Syntactic Ambiguity: The presence of two or more possible meanings of a single sentence. For instance, “The chicken is ready to eat” could mean that a) the chicken dish is ready to be eaten, or b) the chicken (the bird) is ready to eat something.
- Referential Ambiguity: The ambiguity that arises due to the presence of a pronoun. For instance, in “The boy told the father about the theft. He was very upset,” the pronoun “he” could refer to the boy, the father, or the thief.
Natural Language Toolkit (NLTK) is a commonly used NLP library in Python for analyzing textual data.
Tokenization is the process of breaking up a string (e.g., a sentence) into pieces such as words, keywords, phrases, symbols, and other elements, which are called tokens. For instance, the sentence “There are seven words in this sentence” would be split into seven tokens: “There”, “are”, “seven”, “words”, “in”, “this”, “sentence”.
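A minimal regex-based sketch of the idea (NLTK’s own `word_tokenize` does considerably more, handling clitics like “don’t” and language-specific punctuation rules, but the principle is the same):

```python
import re

# Split a string into word tokens and punctuation tokens:
# \w+ matches runs of word characters, [^\w\s] matches any
# single character that is neither a word character nor whitespace.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("There are seven words in this sentence"))
# ['There', 'are', 'seven', 'words', 'in', 'this', 'sentence']
print(tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
```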
Corpus (Latin for “body”) refers to a collection of texts. Such collections may be formed of texts in a single language or may span multiple languages. In linguistics, a corpus (plural corpora), or text corpus, is a large and structured set of texts. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory.
Part-of-speech (POS) tagging is the process of labeling each word in a text with its corresponding POS tag: noun, verb, adjective, adverb, etc.
The Averaged Perceptron tagger uses the perceptron algorithm to predict which POS tag is most likely for a given word.
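As a toy illustration of what a tagger produces (real taggers like NLTK’s use context and learned weights, not just the word itself), here is a simple lookup baseline with a hand-made dictionary. `TAG_DICT` and `lookup_tag` are names I invented for this sketch:

```python
# Assign each word a tag from a tiny hand-made dictionary,
# defaulting to 'NN' (noun) for unknown words.
TAG_DICT = {"the": "DT", "quick": "JJ", "runs": "VBZ", "quickly": "RB"}

def lookup_tag(words):
    # Returns a list of (word, tag) pairs, the same shape
    # that NLTK's taggers return.
    return [(w, TAG_DICT.get(w.lower(), "NN")) for w in words]

print(lookup_tag(["The", "quick", "fox", "runs"]))
# [('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('runs', 'VBZ')]
```

Defaulting unknown words to “noun” is a classic baseline, since nouns are the most common open-class tag; the perceptron tagger improves on this by also looking at surrounding words and word shape.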
In NLTK, there is a package called `twitter_samples`. We will use this Twitter corpus to count the number of nouns and adjectives present in the dataset.
You can find a detailed description from the original source of this project: https://www.digitalocean.com/community/tutorials/how-to-work-with-language-data-in-python-3-using-the-natural-language-toolkit-nltk
```python
# check that NLTK is working
from nltk.corpus import brown
brown.words()
```

```
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
```
```python
# import the data and the tagger
from nltk.corpus import twitter_samples
from nltk.tag import pos_tag_sents
```
```python
# check the files present in the sample set
twitter_samples.fileids()
```

```
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
```
```python
# tokenize the tweets
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
```
```python
# tag the tokenized tweets
tweets_tagged = pos_tag_sents(tweet_tokens)
```
```python
# set counters (accumulators)
# JJ - adjectives; NN - nouns
JJ_count = 0
NN_count = 0

# loop through the tagged tweets; each pair is a (word, tag) tuple,
# so the tag is the second element
for tweet in tweets_tagged:
    for pair in tweet:
        tag = pair[1]
        if tag == 'JJ':
            JJ_count += 1
        if tag == 'NN':
            NN_count += 1

print('Total number of Adjectives =', JJ_count)
print('Total number of Nouns =', NN_count)
```
```
Total number of Adjectives = 6092
Total number of Nouns = 13181
```
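As an aside, the same tally can be written more compactly with `collections.Counter`, which counts every tag at once. Since the Twitter corpus requires an NLTK download, `tagged` below is a small hand-made stand-in with the same shape as `tweets_tagged`: a list of sentences, each a list of (word, tag) pairs.

```python
from collections import Counter

# Stand-in for tweets_tagged: two short "sentences" of (word, tag) pairs.
tagged = [
    [("Nice", "JJ"), ("day", "NN")],
    [("The", "DT"), ("sun", "NN"), ("is", "VBZ"), ("bright", "JJ")],
]

# Count every tag across all sentences in one pass.
tag_counts = Counter(tag for sentence in tagged for _, tag in sentence)

print(tag_counts["JJ"], tag_counts["NN"])  # 2 2
```

With the real `tweets_tagged`, `tag_counts["JJ"]` and `tag_counts["NN"]` would reproduce the adjective and noun totals above while also giving counts for every other tag for free.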
Here is the complete code:
```python
# check that NLTK is working
from nltk.corpus import brown
brown.words()

# import the data and the tagger
from nltk.corpus import twitter_samples
from nltk.tag import pos_tag_sents

# tokenize the tweets
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

# tag the tokenized tweets
tweets_tagged = pos_tag_sents(tweet_tokens)

# set counters (accumulators)
# JJ - adjectives; NN - nouns
JJ_count = 0
NN_count = 0

# loop through the tagged tweets; each pair is a (word, tag) tuple,
# so the tag is the second element
for tweet in tweets_tagged:
    for pair in tweet:
        tag = pair[1]
        if tag == 'JJ':
            JJ_count += 1
        if tag == 'NN':
            NN_count += 1

print('Total number of Adjectives =', JJ_count)
print('Total number of Nouns =', NN_count)
```
- digitalocean.com- “How To Work with Language Data in Python 3 using the Natural Language Toolkit (NLTK)”
- linguamatics.com- “What is Text Mining, Text Analytics and Natural Language Processing?”
- w3resource.com- “NLTK Corpus Exercises with Solution”
- w3schools.com- “JSON tutorial”
- youtube.com- “NLTK Python Tutorial | Natural Language Processing (NLP) With Python Using NLTK | Simplilearn”