I am trying to produce a very simple Twitter sentiment analysis. I have so far been able to pre-process my tweets, but I am greatly struggling to lemmatize within my data frame. This is my code so far:
import nltk
import pandas as pd
from nltk.corpus import stopwords # Importing Natural Language Toolkit
from nltk.stem import WordNetLemmatizer
df = pd.read_csv(r'/Users/sarfrazkhan/Desktop/amazon.csv') # Loading Amazon data set into code
df = df['x'].str.replace(r'http\S+|www.\S+', '', case=False, regex=True) # Removing URLs from data set
df = df.str.replace(r'\<.*\>', '', regex=True) # Removing noise contained in '< >' angle brackets
df = df.str.replace('RT ', '', case=False, regex=True) # Removing the phrase 'RT' from all strings
df = df.str.replace(r'#[^\s]+', '', case=False, regex=True) # Removing '#' and the following hashtag from strings
df = df.str.replace(r'[^\w\s]', ' ', regex=True) # Removing any punctuation
df = df.str.replace(r'\r\n', ' ', regex=True) # Removing '\r\n' which is present in some strings
df = df.str.replace(r'\d+', '', regex=True).str.lower().str.strip() # Removing numbers, capitalisation and white space
df = df.apply(nltk.word_tokenize) # Tokenizing data set
nltk.download('stopwords') # Downloading stop words
stop = set(stopwords.words('english')) # Selecting English stop words
df = df.apply(lambda x: [item for item in x if item not in stop]) # Removing stop words from each string
lemmatizer = WordNetLemmatizer()
lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in df]
I am struggling to get my lemmatizer to work and am constantly met with errors, possibly because my dataset is in list form (which I am struggling to work around). The Excel file which I am trying to process is simply a long list of tweets under the heading 'x'. You can see on line 6 of my code that I focus primarily on this column, however I'm unsure if this is the correct way to do it!
My expected outcome would be a list of words which have been lemmatised correctly within their respective rows, on which I can then carry out a sentiment analysis.
These are the first few lines of my data frame before attempting the lemmatising process:
1 [swinging, pendulum, wall, clock, love, give, ...
2 [enter, via, gleam, l]
3 [screw, every, follow, gets, nude, dms, dm, pr...
4 [bishop, law, coming, soon, bishop, series, bo...
5 [adventures, bella, amp, emily, book, series, ...
6 [written, books, various, genres, amazon, kind...
7 [author, books, amwriting, fantasy, mystery, p...
8 [wonderful, mentor, recent, times, graham, kee...
9 [available, amazon, ebay, disabilities, hidden...
10 [screw, every, follow, gets, nude, dms, dm, pr...
Your code is trying to lemmatize an actual list hence the error.
... for w in df -> here, w is the list, rather than each element of each list.
To get around this you could use pandas apply to pass each row to a function (assuming df is a pd.DataFrame and not a pd.Series; if it's a Series and the code below doesn't work, try df = df.to_frame() first):
def df_lemmatize(row):
    lemmatizer = WordNetLemmatizer()
    row.at['lemma_words'] = [lemmatizer.lemmatize(w, pos='a') for w in row.x]
    return row

df = df.apply(df_lemmatize, axis=1)
df_lemmatize will iterate over each element in the list, lemmatize it and then add the new list to a new column lemma_words.
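Alternatively, if you leave df as a pandas Series of token lists (which is what the question's preprocessing produces), a minimal sketch along the same lines would be:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNetLemmatizer needs the WordNet corpus
lemmatizer = WordNetLemmatizer()
# apply over each row's token list instead of looping over the Series itself;
# pos='a' (adjective) is kept from the question's code
df = df.apply(lambda tokens: [lemmatizer.lemmatize(w, pos='a') for w in tokens])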
I have a text dataset (SMSCollection) in which every line is a document (with one or more sentences). Using the code below I got its bi-grams as a pandas Series (I actually got the code from the internet but unfortunately I have forgotten the link):
import re
import unicodedata
import nltk
import pandas as pd

ADDITIONAL_STOPWORDS = []  # extra stop words to add, if any

news = pd.read_csv('/content/drive/MyDrive/data.txt', encoding='utf-8', lineterminator='\n', names=['text'], index_col=False)

def basic_clean(text):
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore').lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

words = basic_clean(''.join(str(news['text'].tolist())))
bigram_seri = pd.Series(nltk.ngrams(words, 2)).value_counts()
The result is something like this:
(prize, guaranteed) 22
(customer, service) 20
(1000, cash) 20
(urgent, mobile) 18
(every, week) 18
(show, 800) 18
(send, stop) 17
(valid, 12hrs) 17
(account, statement) 16
(land, line) 16
(free, entry) 15
(identifier, code) 15
(dating, service) 15
....
Finally I want to plot "Top bi-grams found in the database" using the plotly module from this link, based on this project, which uses a vectorized bi-gram dataset like this. But I do not know how to construct such a bi-gram vectorized data frame (I then need to use the t-SNE algorithm). Actually I don't know what this dataset is and how to make it.
The GitHub project is available here. The dataset used in this GitHub project is different from mine.
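For reference, one common way to build a document-by-bigram count matrix is scikit-learn's CountVectorizer with ngram_range=(2, 2); the sketch below (reusing the news frame loaded above) only illustrates that idea, and I am not sure it matches the exact format the linked project expects:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = news['text'].astype(str).tolist()          # one document per line of the corpus
vect = CountVectorizer(ngram_range=(2, 2), stop_words='english')
X = vect.fit_transform(docs)                      # sparse matrix: rows = documents, columns = bi-grams

# dense frame with one column per bi-gram (use get_feature_names() on older scikit-learn)
bigram_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())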
I am trying to count the frequency of words in a dataframe column titled df['MESSAGETEXT'] (as shown below). The code (from Stack Overflow) that I am working with is below:
from collections import Counter
import pandas as pd
import nltk
import string
top_N = 50
stopwords = nltk.corpus.stopwords.words('english')
# RegEx for stopwords
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# replace '|'-->' ' and drop all stopwords
words = (df['MESSAGETEXT']
         .str.lower()
         .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
         .str.cat(sep=' ')
         .split()
         )

# generate DF out of Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(rslt)
# plot
rslt.plot.bar(rot=0, figsize=(16,10), width=0.8)
The word-frequency count result is below and still contains punctuation such as semicolons and full stops.
Frequency
Word
' 89217
# 22231
london 20404
. 18271
- 13356
like 13153
! 10752
get 10501
& 10073
love 9720
; 9422
good 9168
one 8630
? 7943
day 7781
time 6956
know 6818
see 6811
u 6786
new 6553
think 6545
got 6330
go 6329
#london 5888
back 5801
great 5736
would 5611
x 5566
thanks 5553
people 5534
going 5464
need 5381
happy 5338
today 5040
still 4984
much 4883
thank 4766
want 4680
last 4664
well 4479
really 4444
lol 4376
please 4275
... 4210
de 4207
come 4120
even 4117
man 4094
best 4076
night 4047
I need to eliminate the following => ?,x,-,... etc
Without knowing a bit more about the contents of the dataframe df, I suggest the following brute-force method:
Throw away any character that is not a letter with .replace([r'[^a-zA-Z]'], [' '], regex=True), and in is_valid check that words are not stopwords and have at least length 3.
from collections import Counter
import pandas as pd
import nltk

top_N = 50
stopwords = set(nltk.corpus.stopwords.words('english'))

def is_valid(w):
    return len(w) > 2 and w not in stopwords

if __name__ == '__main__':
    data = {'MESSAGETEXT': [open("sampletext.txt").read()]}
    df = pd.DataFrame(data, columns=['MESSAGETEXT'])
    words = (df['MESSAGETEXT']
             .str.lower()
             .replace([r'[^a-zA-Z]'], [' '], regex=True)
             .str.cat(sep=' ')
             .split()
             )
    filtered_words = [w for w in words if is_valid(w)]
    most_common = Counter(filtered_words).most_common(top_N)
    rslt = pd.DataFrame(most_common, columns=['Word', 'Frequency']).set_index('Word')
    print(rslt)
Without needing to modify your code, you could simply add two lines at the end to remove the unwanted rows from the resulting DataFrame (assumed to be rslt), either:
remove_list = '=>?x-'  # unwanted symbols
rslt = rslt[[w not in remove_list for w in rslt.index]]  # 'Word' is the index after set_index('Word')
or
remove_list = '=>?x-'  # unwanted symbols
rslt = rslt.drop([w for w in rslt.index if w in remove_list], axis=0)
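A broader option, if the goal is to drop every non-word token (including things like '...' and '#'), is to keep only the purely alphabetic entries of the index; a small sketch of that idea (note it would also drop tokens such as '#london'):
rslt = rslt[rslt.index.str.isalpha()]  # keep only rows whose 'Word' index is purely alphabetic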
Say I have the following text in a cell of a dataset (CSV file):
I want to extract the words/phrases that appear after the keywords Decision and reason. I can do it like so:
import pandas as pd

text = '''Decision: Postpone\n\nreason:- medical history - information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''
keywords = ['decision', 'reason']
new_df = pd.DataFrame(0, index=[0], columns=keywords)
a = text.split('\n')

for cell in a:
    for keyword in keywords:
        if keyword in cell.lower():
            if len(cell.split(':')) > 1:
                new_df[keyword][0] = cell.split(':')[1]
new_df
However, in some of the cells, the words/phrases appear in a new line after the keyword, in which case this program is unable to extract it:
import pandas as pd

text = '''Decision: Postpone\n\nreason: \n- medical history \n- information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''
keywords = ['decision', 'reason']
new_df = pd.DataFrame(0, index=[0], columns=keywords)
a = text.split('\n')

for cell in a:
    for keyword in keywords:
        if keyword in cell.lower():
            if len(cell.split(':')) > 1:
                new_df[keyword][0] = cell.split(':')[1]
new_df
How can I fix this?
Use a regular expression to split the data; this reduces the number of loops:
import re
import pandas as pd

text = '''Decision: Postpone\n\nreason: \n- medical history \n- information obtained from attending physician\n\nto review with current assessment from Dr Cynthia Dominguez regarding medical history, and current CBC showing actual number of platelet count\n\nmib: F\n'''
keywords = ['decision', 'reason']
new_df = pd.DataFrame(0, index=[0], columns=keywords)
text = text.lower()
tokens = re.findall(r"[\w']+", text)

for key in keywords:
    if key == 'decision':
        index = tokens.index(key)
        new_df[key][0] = ''.join(tokens[index+1:index+2])
    if key == 'reason':
        index = tokens.index(key)
        meta = tokens.index('review')
        new_df[key][0] = " ".join(tokens[index + 1:meta - 1])

print(new_df)
If the content is in another row, you definitely may not split the source string into rows and then look for all "tokens" in the current row.
Instead you should:
prepare a regex with 2 capturing groups (keyword and content),
look for matches, e.g. using finditer.
Example code can be as follows:
keywords = ['decision', 'reason']
df = pd.DataFrame(columns=keywords)

it = re.finditer(r'(?P<kwd>\w+):\n?(?P<cont>.+?(?=\n\w+:|$))',
                 text, flags=re.DOTALL)
row = dict.fromkeys(keywords, '')
for m in it:
    kwd = m.group('kwd').lower()
    cont = m.group('cont').strip()
    if kwd in keywords:
        row[kwd] = cont
# on pandas >= 2.0, use: df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
df = df.append(row, ignore_index=True)
Of course, you should start from import re.
And maybe you should also read a little about regular expressions.
I have a dataframe that consists of two columns: ID and TEXT. Pretend data is below:
ID TEXT
265 The farmer plants grain. The fisher catches tuna.
456 The sky is blue.
434 The sun is bright.
921 I own a phone. I own a book.
I know that not all nltk functions work directly on dataframes. How could sent_tokenize be applied to the above dataframe?
When I try:
df.TEXT.apply(nltk.sent_tokenize)
The output is unchanged from the original dataframe. My desired output is:
TEXT
The farmer plants grain.
The fisher catches tuna.
The sky is blue.
The sun is bright.
I own a phone.
I own a book.
In addition, I would like to tie back this new (desired) dataframe to the original ID numbers like this (following further text cleansing):
ID TEXT
265 'farmer', 'plants', 'grain'
265 'fisher', 'catches', 'tuna'
456 'sky', 'blue'
434 'sun', 'bright'
921 'I', 'own', 'phone'
921 'I', 'own', 'book'
This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!
Edit: as a result of warranted prodding by @alexis, here is a better response.
Sentence Tokenization
This should get you a DataFrame with one row for each ID & sentence:
sentences = []
for row in df.itertuples():
for sentence in row[2].split('.'):
if sentence != '':
sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
With the sample ID/TEXT data from the question, the output looks roughly like this:
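    ID                  SENTENCE
0  265   The farmer plants grain
1  265   The fisher catches tuna
2  456           The sky is blue
3  434         The sun is bright
4  921            I own a phone
5  921             I own a book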
split('.') will quickly break strings up into sentences if sentences are in fact separated by periods and periods are not being used for other things (e.g. denoting abbreviations), and will remove periods in the process. This will fail if there are multiple use cases for periods and/or not all sentence endings are denoted by periods. A slower but much more robust approach would be to use, as you had asked, sent_tokenize to split rows up by sentence:
sentences = []
for row in df.itertuples():
    for sentence in sent_tokenize(row[2]):
        sentences.append((row[1], sentence))

new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
With the same data, this produces roughly the following output (the sentence-ending periods are kept):
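    ID                   SENTENCE
0  265   The farmer plants grain.
1  265   The fisher catches tuna.
2  456            The sky is blue.
3  434          The sun is bright.
4  921             I own a phone.
5  921              I own a book.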
If you want to quickly remove periods from these lines you could do something like:
new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.'))
Which would yield the same table with an extra SENTENCE_noperiods column in which the trailing periods have been removed.
You can also take the apply -> map approach (df is your original table):
df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES'))
Yielding the original table with a new SENTENCES column holding the list of sentences for each row.
Continuing:
sentences = df.SENTENCES.apply(pandas.Series)
sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]
This yields one column per sentence ('sentence 1', 'sentence 2', ...), with NaN where a row has fewer sentences.
As our indices have not changed, we can join this back into our original table:
df = df.join(sentences)
Word Tokenization
Continuing with df from above, we can extract the tokens in a given sentence as follows:
df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)
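For reference, nltk.word_tokenize keeps the sentence-final period as its own token (this assumes the NLTK 'punkt' tokenizer data has been downloaded):
from nltk import word_tokenize

word_tokenize('The farmer plants grain.')
# ['The', 'farmer', 'plants', 'grain', '.']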
This is a little complicated. I apply sentence tokenization first, then go through each sentence, removing words that are in the remove_words list and stripping punctuation from each remaining word.
import pandas as pd
from nltk import sent_tokenize
from string import punctuation

remove_words = ['the', 'an', 'a']

def remove_punctuation(chars):
    return ''.join([c for c in chars if c not in punctuation])

# example dataframe
df = pd.DataFrame([[265, "The farmer plants grain. The fisher catches tuna."],
                   [456, "The sky is blue."],
                   [434, "The sun is bright."],
                   [921, "I own a phone. I own a book."]], columns=['sent_id', 'text'])

df.loc[:, 'text_split'] = df.text.map(sent_tokenize)

sentences = []
for _, r in df.iterrows():
    for s in r.text_split:
        filtered_words = [remove_punctuation(w) for w in s.split() if w.lower() not in remove_words]
        # or using nltk.word_tokenize
        # filtered_words = [w for w in word_tokenize(s) if w.lower() not in remove_words and w not in punctuation]
        sentences.append({'sent_id': r.sent_id,
                          'text': s.strip('.'),
                          'words': filtered_words})

df_words = pd.DataFrame(sentences)
Output
+-------+--------------------+--------------------+
|sent_id| text| words|
+-------+--------------------+--------------------+
| 265|The farmer plants...|[farmer, plants, ...|
| 265|The fisher catche...|[fisher, catches,...|
| 456| The sky is blue| [sky, is, blue]|
| 434| The sun is bright| [sun, is, bright]|
| 921| I own a phone| [I, own, phone]|
| 921| I own a book| [I, own, book]|
+-------+--------------------+--------------------+
I want to do some text processing on a dataset containing Twitter messages. So far I'm able to load the data (.CSV) in a Pandas dataframe and index that by a (custom) column 'timestamp'.
df = pandas.read_csv(f)
df.index = pandas.to_datetime(df.pop('timestamp'))
Looks a bit like this:
user_name user_handle
timestamp
2015-02-02 23:58:42 Netherlands Startups NLTechStartups
2015-02-02 23:58:42 shareNL share_NL
2015-02-02 23:58:42 BreakngAmsterdamNews iAmsterdamNews
[49570 rows x 8 columns]
I can create a new object (Series) containing just the text like so:
texts = pandas.Series(df['text'])
Which creates this:
2015-06-02 14:50:54 Business Update Meer cruiseschepen dan ooit in...
2015-06-02 14:50:53 RT #ProvincieNH: Provincie maakt Markermeerdij...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat: In ...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat http...
2015-06-02 14:50:53 Lugar secreto em Amsterdam: Begijnhof // Hidde...
Name: text, Length: 49570
1. Is this new object of the same sort of type (dataframe) as my initial df variable, just with different columns/rows?
Now, together with the nltk toolkit, I'm able to tokenize the strings using this:
for w in words:
    print(nltk.word_tokenize(w))
This iterates the array instead of mapping the 'text' column to a multiple-column 'words' array. 2. How would I do this and moreover how do I then count the occurrences of each word?
I know there is a unique() method which I could use to create a distinct list of words. But then again I'd need an extra column which is a count over the array which I'm unable to produce in the first place. :) 3. Or would the next step towards 'counting' occurrences of those words be grouping?
EDIT. 3: I seem to need "CountVectorizer", thanks EdChum
from sklearn.feature_extraction.text import CountVectorizer

documents = df['text'].values
vectorizer = CountVectorizer(min_df=0, stop_words=[])
X = vectorizer.fit_transform(documents)
print(X.toarray())
My main goal is to count the occurrences of each word and select the top X results. I feel I'm on the right track, but I can't get the final steps just right.
Building on EdChum's comments, here is a way to get the (I assume global) word counts from CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
df = pd.DataFrame({'text': ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
                            'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                   'class': ['a', 'a', 'a', 'a', 'c', 'c', 'b', 'e']})

X = vect.fit_transform(df['text'].values)
y = df['class'].values
Convert the sparse matrix returned by CountVectorizer to a dense matrix, and pass it and the feature names to the DataFrame constructor. Then transpose the frame and sum along axis=1 to get the total per word:
word_counts = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out()).T.sum(axis=1)  # use get_feature_names() on older scikit-learn
word_counts = word_counts.sort_values(ascending=False)  # Series.sort() no longer exists; sort_values returns the sorted Series
word_counts[:3]
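If you want those counts as a tidy two-column frame rather than a Series, one possible follow-up (top_N and top_words are just illustrative names):
top_N = 3  # however many of the most frequent words you want
top_words = word_counts.head(top_N).rename_axis('Word').reset_index(name='Frequency')
print(top_words)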
If all you are interested in is the frequency distribution of the words, consider using FreqDist from NLTK:
import nltk
import itertools
from nltk.probability import FreqDist
texts = ['cat on the cat','angel eyes has','blue red angel','one two blue','blue whales eat','hot tin roof','angel eyes has','have a cat']
texts = [nltk.word_tokenize(text) for text in texts]
# collapse into a single list
tokens = list(itertools.chain(*texts))
FD = FreqDist(tokens)
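FreqDist behaves like a collections.Counter, so the top words then come straight from most_common; a short usage sketch with the texts above:
top_N = 3
print(FD.most_common(top_N))  # e.g. [('cat', 3), ('angel', 3), ('blue', 3)]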