Python/Pandas aggregation combined with NLTK

I want to do some text processing on a dataset containing Twitter messages. So far I'm able to load the data (.CSV) in a Pandas dataframe and index that by a (custom) column 'timestamp'.
df = pandas.read_csv(f)
df.index = pandas.to_datetime(df.pop('timestamp'))
Looks a bit like this:
user_name user_handle
timestamp
2015-02-02 23:58:42 Netherlands Startups NLTechStartups
2015-02-02 23:58:42 shareNL share_NL
2015-02-02 23:58:42 BreakngAmsterdamNews iAmsterdamNews
[49570 rows x 8 columns]
I can create a new object (Series) containing just the text like so:
texts = pandas.Series(df['text'])
Which creates this:
2015-06-02 14:50:54 Business Update Meer cruiseschepen dan ooit in...
2015-06-02 14:50:53 RT #ProvincieNH: Provincie maakt Markermeerdij...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat: In ...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat http...
2015-06-02 14:50:53 Lugar secreto em Amsterdam: Begijnhof // Hidde...
Name: text, Length: 49570
1. Is this new object of the same sort of type (dataframe) as my initial df variable, just with different columns/rows?
Now together with the NLTK toolkit I'm able to tokenize the strings using this:
for w in words:
    print(nltk.word_tokenize(w))
This iterates the array instead of mapping the 'text' column to a multiple-column 'words' array. 2. How would I do this and moreover how do I then count the occurrences of each word?
I know there is a unique() method which I could use to create a distinct list of words. But then again I'd need an extra column which is a count over the array which I'm unable to produce in the first place. :) 3. Or would the next step towards 'counting' occurrences of those words be grouping?
EDIT 3: I seem to need CountVectorizer, thanks EdChum:
documents = df['text'].values
vectorizer = CountVectorizer(min_df=0, stop_words=[])
X = vectorizer.fit_transform(documents)
print(X.toarray())
My main goal is to count the occurrences of each word and select the top X results. I feel I'm on the right track, but I can't get the final steps quite right.

Building on EdChum's comments, here is a way to get the (I assume global) word counts from CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
df = pd.DataFrame({'text': ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
                            'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                   'class': ['a', 'a', 'a', 'a', 'c', 'c', 'b', 'e']})
X = vect.fit_transform(df['text'].values)
y = df['class'].values
Convert the sparse matrix returned by CountVectorizer to a dense matrix, pass it together with the feature names to the DataFrame constructor, then transpose the frame and sum along axis=1 to get the total per word:
word_counts = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out()).T.sum(axis=1)  # use get_feature_names() on older scikit-learn
word_counts = word_counts.sort_values(ascending=False)  # Series.sort() was removed in later pandas versions
word_counts[:3]
If all you are interested in is the frequency distribution of the words, consider using FreqDist from NLTK:
import nltk
import itertools
from nltk.probability import FreqDist
texts = ['cat on the cat','angel eyes has','blue red angel','one two blue','blue whales eat','hot tin roof','angel eyes has','have a cat']
texts = [nltk.word_tokenize(text) for text in texts]
# collapse into a single list
tokens = list(itertools.chain(*texts))
FD = FreqDist(tokens)
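FreqDist is a Counter subclass, so the most frequent words fall straight out of most_common; a quick follow-up sketch (3 is an arbitrary cut-off):
print(FD.most_common(3))  # for the toy texts above this yields cat, angel and blue, with 3 occurrences each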

Related

str.contains not working when there is not a space between the word and special character

I have a dataframe which includes the names of movie titles and TV series.
Based on specific keywords I want to classify each row as Movie or Series. However, because the brackets leave no space between the keyword and the preceding character, the keywords are not being picked up by the str.contains() function and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np
watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
                           ['James Bond'],
                           ['How I met your Mother (Avnsitt 3)'],
                           ['random name'],
                           ['Random movie 3 Episode 8383893']],
                          columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or Movies I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '')
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop('temporary_brackets_removed_title', 1)
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that does not require an exact match but just checks whether the string contains the given word, similar to the "LIKE" functionality in SQL?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
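The question's original attempt passed case=False; if case-insensitive matching is wanted here as well, the same pattern accepts it (a variant of the answer above, not part of it):
watched_df['Film_Type'] = (watched_df['Title']
                           .str.contains(r'(?:Episode|Avnsitt)', case=False)
                           .map({True: 'Series', False: 'Movie'}))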

Twitter Sentiment Analysis in Python, Lemmatization in Pandas

I am trying to produce a very simple Twitter sentiment analysis. I have so far been able to pre-process my tweets, however I am struggling to lemmatize within my data frame. This is my code so far:
import nltk
import pandas as pd
from nltk.corpus import stopwords # Importing Natural Language Toolkit
from nltk.stem import WordNetLemmatizer
df = pd.read_csv(r'/Users/sarfrazkhan/Desktop/amazon.csv') # Loading Amazon data set into code
df = df['x'].str.replace(r'http\S+|www.\S+', '', case=False, regex=True) # Removing URLs from data set
df = df.str.replace(r'\<.*\>', '', regex=True) # Removing noise contained in '< >' parentheses
df = df.str.replace('RT ', '', case=False, regex=True) # Removing the phrase 'RT' from all strings
df = df.str.replace(r'#[^\s]+', '', case=False, regex=True) # Removing '#' and the following Twitter handle from strings
df = df.str.replace(r'[^\w\s]', ' ', regex=True) # Removing any punctuation
df = df.str.replace(r'\r\n', ' ', regex=True) # Removing '\r\n' which is present in some strings
df = df.str.replace(r'\d+', '', regex=True).str.lower().str.strip() # Removing numbers, capitalisation and white space
df = df.apply(nltk.word_tokenize) # Tokenizing data set
stop = nltk.download('stopwords') # Downloading stop words
stop = set(stopwords.words('english')) # Selecting English stop words
df = df.apply(lambda x: [item for item in x if item not in stop]) # Removing stop words from each string
lemmatizer = WordNetLemmatizer()
lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in df]
I am struggling to get my lemmatizer to work and am constantly met with errors, possibly due to the fact that my dataset is in list form (which I am struggling to work around). The Excel file which I am trying to process is simply a long list of tweets under the heading 'x'. You can see on line 6 of my code that I focus primarily on this column, however I'm unsure if this is the correct way to do it!
My expected outcome would be a list of words which have been lemmatised correctly within their respective rows, to which I can then carry out a sentiment analysis.
These are the first few lines of my data frame before attempting the lemmatising process:
1 [swinging, pendulum, wall, clock, love, give, ...
2 [enter, via, gleam, l]
3 [screw, every, follow, gets, nude, dms, dm, pr...
4 [bishop, law, coming, soon, bishop, series, bo...
5 [adventures, bella, amp, emily, book, series, ...
6 [written, books, various, genres, amazon, kind...
7 [author, books, amwriting, fantasy, mystery, p...
8 [wonderful, mentor, recent, times, graham, kee...
9 [available, amazon, ebay, disabilities, hidden...
10 [screw, every, follow, gets, nude, dms, dm, pr...
Your code is trying to lemmatize an actual list, hence the error.
... for w in df -> here, w is the whole list, rather than each element of each list.
To get around this you could use pandas apply to pass each row to a function (assuming df is a pd.DataFrame and not a pd.Series; if it is a Series and the code below doesn't work, try df = df.to_frame() first):
def df_lemmatize(row):
    lemmatizer = WordNetLemmatizer()
    row.at['lemma_words'] = [lemmatizer.lemmatize(w, pos='a') for w in row.x]
    return row
df = df.apply(df_lemmatize, axis=1)
df_lemmatize will iterate over each element in the list, lemmatize it and then add the new list to a new column lemma_words.
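Since the preprocessing in the question actually reassigns df to a pandas Series of token lists (the result of df['x']...), here is a minimal Series-based sketch of the same idea, assuming that shape:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# each row is already a list of tokens, so lemmatize element-wise inside apply
lemmatized = df.apply(lambda tokens: [lemmatizer.lemmatize(w, pos='a') for w in tokens])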

Check if a string is present in multiple lists

I am trying to categorize a dataset based on the string that contains the name of each object in the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category']; the Category and Sub_Category columns are empty.
For each row I would like to check, against several lists of words, whether the name of the object contains at least one word from one of the lists. Based on this first check I would like to assign a value to the Category column. If it finds words from 2 different lists, I would like to assign 2 values to the object in the Category column.
Moreover, I would like to be able to identify which word was matched in which list, in order to assign a value to the Sub_Category column.
Until now, I have only been able to do this with a single list, I am not able to identify which word was matched, and the code is very slow to run.
Here is my code (where I added an example of names found in my dataset as df['Name']):
import pandas as pd
import numpy as np
df['Name'] = ['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
for idx, row in df.iterrows():
    for c in furniture_check:
        if c in row['Name']:
            df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name":['vitrine murale vintage','commode ancienne', 'lustre antique', 'solex', 'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet','table','commode','lit']
vehicle_check = ['solex','voiture','moto','scooter']
art_check = ['tableau','scuplture', 'tapisserie']
# put categories into a dataframe
dfcat = pd.DataFrame([{"category": "furniture", "values": furniture_check},
                      {"category": "vehicle", "values": vehicle_check},
                      {"category": "art", "values": art_check}])
# turn the space-delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             # explode the list so it can be used in a join; reset_index() keeps a copy of the original DF's index
             .explode("name").reset_index()
             # merge the exploded names on both sides
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # where there are multiple categories, make it a list
             .groupby("index", as_index=False).agg({"category": lambda s: list(s)})
             # put the original index back
             .set_index("index")
             )
# simple join and have names and list of associated categories
df.join(dfcatlist)
                     name       category
0  vitrine murale vintage            NaN
1        commode ancienne  ['furniture']
2          lustre antique            NaN
3                   solex    ['vehicle']
4     sculpture médievale            NaN
5           jante voiture    ['vehicle']
6          lit et matelas  ['furniture']
7          turbine moteur            NaN
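The question also asks which word matched, for the Sub_Category column; the same explode/merge chain can carry the matched keyword along by aggregating the "values" column too (a sketch built on top of the answer above, not part of the original):
dfsub = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
         .explode("name").reset_index()
         .merge(dfcat.explode("values"), left_on="name", right_on="values")
         # keep both the category and the keyword that triggered it
         .groupby("index", as_index=False).agg({"category": list, "values": list})
         .rename(columns={"values": "matched_words"})
         .set_index("index"))
df.join(dfsub)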

Python concatenate values in rows till empty cell and continue

I am struggling a little to do something like that:
to get this output:
The purpose of this is to separate a sentence into 3 parts in order to manipulate them afterwards.
Any help is welcome
Select from the dataframe only the second line of each pair, which is the line containing the separator, then use astype(str).apply(''.join, axis=1) to collapse the separator word (which can sit in any Value column of the original dataframe) into a single string.
Iterate over each row, splitting the title on the word[i] of the respective row; after the split, reinsert the separator into the list, and build the desired dataframe from the resulting lists.
Input used as data.csv
title,Value,Value,Value,Value,Value
Very nice blue car haha,Very,nice,,car,haha
Very nice blue car haha,,,blue,,
A beautiful green building,A,,green,building,lol
A beautiful green building,,beautiful,,,
import pandas as pd
df = pd.read_csv("data.csv")
# second line of each pair
d1 = df[1::2]
d1 = d1.fillna("").reset_index(drop=True)
# get separators
word = d1.iloc[:,1:].astype(str).apply(''.join, axis=1)
strings = []
for i in range(len(d1.index)):
    word_split = d1.iloc[i, 0].split(word[i])
    word_split.insert(1, word[i])
    strings.append(word_split)
dn = pd.DataFrame(strings)
dn.insert(0, "title", d1["title"])
print(dn)
Output from dn
                        title            0          1               2
0     Very nice blue car haha   Very nice        blue        car haha
1  A beautiful green building           A  beautiful  green building
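As a side note, str.partition splits on the first occurrence of a separator and keeps the separator itself, so the insert step can be skipped; a small alternative sketch using the same d1 and word objects from above (assuming one separator per title):
# partition returns (before, separator, after) as a 3-tuple
parts = [list(d1.iloc[i, 0].partition(word[i])) for i in range(len(d1.index))]
dn2 = pd.DataFrame(parts)
dn2.insert(0, "title", d1["title"])
print(dn2)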

How do you go through a list of strings using the series.str.contains function?

I have credit card charge data that has a column containing the description for the charge. I also created a dictionary that contains categories for different charges. For example, I have a category called grocery expenses (value) and regular expressions (Ralphs, Target). I combined my values in a string with the separator |.
I am using the Series.str.contains(pat,case=True,flags=0,na=nan,regex=True) function to see if the string in each index contains my regular expressions.
# libraries needed
# import pandas as pd
# import re
joined_string=['|'.join(value) for value in values]
the_list=joined_string
Example output: the_list=[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK"]
df['Description'] = 'FOOD4LESS 0508 0000FULLERTON CA'
The DataFrame contains a column of different charges on the credit card.
for character_sequence in the_list:
    boolean_output = df['Description'].str.contains(character_sequence, regex=True)
For some reason, the code is not going through each character sequence in my list. It only goes through one character sequence, but I need it to go through multiple character sequences.
Since there is no data to compare with, I will just present some dummy data.
import pandas as pd
names = ['Adam','Barry','Chuck','Dennis','Elon','Fridman','George','Harry']
df = pd.DataFrame(names, columns=['Names'])
# Apply regex and save to column: Regex
df['Regex'] = df.Names.str.contains('[ae]', regex=True)
df
Output:
Names Regex
0 Adam True
1 Barry True
2 Chuck False
3 Dennis True
4 Elon False
5 Fridman True
6 George True
7 Harry True
Solution with another Example akin to the Problem
First, your the_list variable is not correct. Assuming it is a typo, I present my solution here. Please note that a regex (regular expression), when applied to a column of data, essentially means that you are trying to find some patterns. How would you know, in the first place, whether your pattern recognition is working correctly? You would need a few data points to at least validate the regex results. Since you only provided one line of data, I will make some dummy data here and test whether the regex produces the expected results.
Note: Please check the Data Preparation section of the solution to see the data, so you can replicate and test it.
import pandas as pd
import re
# Make regex string from the list of target keywords
regex_expression = '|'.join(the_list)
# Make dataframe from the list of descriptions
# --> see under Data section of the solution.
df = pd.DataFrame(descriptions, columns=['Description'])
# Regex search results for a subset of
# target keywords: "Gas|Internet|Water|Electricity,VONS"
df['Regex_A'] = df.Description.str.contains("Gas|Internet|Water|Electricity,VONS", regex=True)
# Regex search result of all target keywords
df['Regex_B'] = df.Description.str.contains(regex_expression, regex=True)
df
Output:
Description Regex_A Regex_B
0 FOOD4LESS 0508 0000FULLERTON CA False True
1 Electricity,VONS 0777 0123FULLERTON NY True True
2 PAVILIONS 1248 9800Ralphs MA False True
3 SPROUTS 9823 0770MARKET#WORK WI False True
4 Internet 0333 1008Water NJ True True
5 Enternet 0444 1008Wager NJ False False
Data Preparation
In a practical scenario, I would assume that in case of the type of problem you presented in the question, you would have a list of words, that you would like to look for in the dataframe column.
So, I took the liberty to first convert your string into a list of strings.
the_list="[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK]"
the_list = the_list.replace("[","").replace("]","").split("|")
the_list
Output:
['Gas',
'Internet',
'Water',
'Electricity,VONS',
'RALPHS',
'Ralphs',
'PAVILIONS',
'FOOD4LESS',
"TRADER JOE'S",
'GROCERY OUTLET',
'FOOD 4 LESS',
'SPROUTS',
'MARKET#WORK']
Also, we make five rows of data containing the keywords we are looking for, and then add another row for which we expect a False result from the regex pattern search.
descriptions = [
    'FOOD4LESS 0508 0000FULLERTON CA',
    'Electricity,VONS 0777 0123FULLERTON NY',
    'PAVILIONS 1248 9800Ralphs MA',
    'SPROUTS 9823 0770MARKET#WORK WI',
    'Internet 0333 1008Water NJ',
    'Enternet 0444 1008Wager NJ',
]
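One caveat the answer does not cover: if any keyword contains regex metacharacters, joining the raw strings can change the pattern's meaning, so escaping each keyword first is safer. A small sketch along those lines (assuming the_list and df from above):
import re
# escape each keyword so it is matched literally, then OR them together
safe_pattern = '|'.join(re.escape(word) for word in the_list)
df['Regex_C'] = df['Description'].str.contains(safe_pattern, regex=True)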
