I have a dataframe and one column contains the lemmatized words of a paragraph. I wish to count the frequency of each word within the whole dataframe, not just within the record. There are over 40,000 records so the computation has to be quick and not reach the limit of my RAM.
For example, this basic input:
ID lemm
1 ['test','health']
2 ['complete','health','science']
would have this desired output:
'complete':1
'health':2
'science':1
'test':1
This is my current code:
from collections import Counter

cnt = Counter()
for entry in df.lemm:
    for word in entry:
        cnt[word] += 1
cnt
Which works when I manually enter a list of lists of strings (e.g. [['completing', 'dog', 'cat'], ['completing', 'degree', 'health', 'health']]), but not when it iterates through the df.
I have also tried this:
top_N = 20
word_dist = nltk.FreqDist(df_main.stem)
print('All frequencies')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
to return the top 20 terms, but the output lists the frequencies of terms within the entry, not the entire dataframe.
Any help would be appreciated!
You can try explode if you have Pandas 0.25+:
df.lemm.explode().value_counts()
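For example, a minimal sketch on the sample data from the question (assuming the list column is named lemm):
import pandas as pd

df = pd.DataFrame({"ID": [1, 2],
                   "lemm": [['test', 'health'], ['complete', 'health', 'science']]})
# explode turns every list element into its own row, value_counts then tallies them
print(df.lemm.explode().value_counts())
# health 2; complete, science and test 1 each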
Maybe you can change the Counter line to:
cnt = Counter(word for entry in df.lemm for word in entry)
Refer to: How to find the lemmas and frequency count of each word in list of sentences in a list?
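On the sample df above, a quick sketch of that change:
from collections import Counter

cnt = Counter(word for entry in df.lemm for word in entry)
print(cnt.most_common())   # [('health', 2), ('test', 1), ('complete', 1), ('science', 1)]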
Assuming your column names and input data:
data = {
"ID": [1, 2],
"lemm": [['test', 'health'], ['complete', 'health', 'science']]
}
df = pd.DataFrame(data)
freq = df.explode("lemm").groupby(["lemm"]).count().rename(columns={"ID": "Frequency"})
Output:
          Frequency
lemm
complete          1
health            2
science           1
test              1
from collections import Counter
cnt = df.apply(lambda x: Counter(x['lemm']), axis=1).sum()
will do it for you. That will make cnt a Counter object, so you can call most_common() on it or anything else Counter offers.
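For example, to mirror the top-20 listing from the question:
print(cnt.most_common(20))   # the 20 most frequent (word, count) pairs across the whole dataframe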
I've used POS-tagging (in german language, thus nouns have "NN" and "NE" as abbreviations) and now I am having trouble to extract the nouns into a new column of the pandas dataframe.
Example:
data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")], [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)
df
df["nouns"] = df["tagged"].apply(lambda x: [word for word, tag in x if tag in ["NN", "NE"]])
Results in the following error message: "ValueError: too many values to unpack (expected 2)"
I think the code would work if I was able to delete the first value of each tagged word but I cannot figure out how to do that.
Because the tuples have 3 values, unpack them into three variables, word1, word2 and tag:
df["nouns"] = df["tagged"].apply(lambda x: [word2 for word1, word2, tag in x
                                            if tag in ["NN", "NE"]])
Or use the same solution in a list comprehension:
df["nouns"] = [[word2 for word1, word2, tag in x if tag in ["NN", "NE"]]
               for x in df["tagged"]]
print (df)
tagged nouns
0 [(waffe, Waffe, NN), (haus, Haus, NN)] [Waffe, Haus]
1 [(groß, groß, ADJD), (bereich, Bereich, NN)] [Bereich]
I think it would be easier with a function call. This creates a list of NN or NE tags from each row. If you would like to deduplicate, you need to update the function.
data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")], [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)
# function
def getNoun(obj):
    ret = []                     # declare empty list as default value
    for l in obj:                # iterate over the word groups in the row
        for tag in l:            # iterate over the elements of each tuple
            if tag in ['NN', 'NE']:
                ret.append(tag)  # add to return list
    return ret

# call new column creation
df['noun'] = df['tagged'].apply(getNoun)

# result
print(df['noun'])
# output:
# 0    [NN, NN]
# 1        [NN]
# Name: noun, dtype: object
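If you want the nouns themselves rather than the tag labels, a small variation of the same function (a sketch) appends the word instead of the tag:
def getNounWords(obj):
    ret = []
    for word, word2, tag in obj:   # unpack the 3-tuples directly
        if tag in ['NN', 'NE']:
            ret.append(word2)      # keep the word, not the tag
    return ret

df['nouns'] = df['tagged'].apply(getNounWords)
# 0    [Waffe, Haus]
# 1        [Bereich]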
I have a dataframe df with a column "Content" that contains a list of articles extracted from the internet. I already have the code for constructing a dataframe with the expected output (two columns, one for the word and the other for its frequency). However, I would like to exclude some words (connectors, for instance) from the analysis. Below you will find my code; what should I add to it?
Is it possible to use get_stop_words('fr') to make this more efficient? (My articles are in French.)
Source Code
import csv
from collections import Counter
from collections import defaultdict
import pandas as pd

df = pd.read_excel('C:/.../df_clean.xlsx',
                   sheet_name='Articles Scraping')
df = df[df['Content'].notnull()]

d1 = dict()
for line in df[df.columns[6]]:
    words = line.split()
    # print(words)
    for word in words:
        if word in d1:
            d1[word] += 1
        else:
            d1[word] = 1

sort_words = sorted(d1.items(), key=lambda x: x[1], reverse=True)
There are a few ways you can achieve this. You can either use the isin() method,
data = {'test': ['x', 'NaN', 'y', 'z', 'gamma',]}
df = pd.DataFrame(data)
words = ['x', 'y', 'NaN']
df = df[~df.test.isin(words)]
Or you can negate str.contains combined with a join:
df = df[~df.test.str.contains('|'.join(words))]
If you want to utilize the stop words package for French, you can also do that, but you must preprocess all of your texts before you start doing any frequency analysis.
# assuming the stopwordsiso package, e.g. import stopwordsiso as stopwords
french_stopwords = set(stopwords.stopwords("fr"))
STOPWORDS = list(french_stopwords)
STOPWORDS.extend(['add', 'new', 'words', 'here'])
I think the extend() will help you tremendously.
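Putting it together, a minimal sketch of the counting loop with the stop words filtered out (assuming get_stop_words from the stop_words package mentioned in the question, and the same column index 6 as in your code):
from collections import Counter
from stop_words import get_stop_words

french_stopwords = set(get_stop_words('fr'))
french_stopwords.update(['add', 'new', 'words', 'here'])   # any extra words to exclude

cnt = Counter()
for line in df[df.columns[6]]:
    for word in line.lower().split():    # lowercase so stop words match regardless of case
        if word not in french_stopwords:
            cnt[word] += 1

sort_words = cnt.most_common()   # list of (word, count) pairs, most frequent first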
I have an excel file with many rows and columns. I want to do the following. First, I want to filter the rows based on a text match. Second, I want to choose a particular column and generate word frequency for ALL THE WORDS in that column. Third, I want to graph the word and frequency.
I have figured out the first part. My question is how to apply Counter() on a dataframe. If I just use Counter(df), it returns an error. So I used the following code to convert each row into a list and then applied Counter. When I do this, I get the word frequency for each row separately (with the Counter inside the for loop; if it is outside, I get the word frequency for just one row). However, I want a word count for all the rows put together. Appreciate any inputs. Thanks!
The following is an example data.
product review
a Great Product
a Delivery was fast
a Product received in good condition
a Fast delivery but useless product
b Dont recommend
b I love it
b Please dont buy
b Second purchase
My desired output is like this: for product a - (product,3), (delivery,2), (fast,2), etc. My current output is like (great,1), (product,1) for the first row.
This is the code I used.
strdata = column.values.tolist()
tokens = [tokenizer.tokenize(str(i)) for i in strdata]
cleaned_list = []
for m in tokens:
    stopped = [i for i in m if str(i).lower() not in stop_words]
    stemmed = [stemmer.stem(i) for i in stopped]
    cleaned_list.append(stopped)  # append stemmed words to list
    count = Counter(stemmed)
    print(count.most_common(10))
First, use groupby to concatenate the strings from the same group. Then apply Counter() on the joined strings.
import collections

joined = df.groupby('product', as_index=False).agg({'review': ' '.join})
joined['count'] = joined.apply(lambda x: collections.Counter(x['review'].split(' ')), axis=1)
# print(joined)
product review count
0 a Great Product Delivery was fast Product receiv... {'Great': 1, 'Product': 2, 'Delivery': 1, 'was...
1 b Dont recommend I love it Please dont buy Secon... {'Dont': 1, 'recommend': 1, 'I': 1, 'love': 1,...
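If you then want the per-product pairs from your desired output, e.g. (product, 3), (delivery, 2), (fast, 2), a small sketch on top of that; it lowercases first so that 'Product' and 'product' are counted together:
joined['count'] = joined.apply(lambda x: collections.Counter(x['review'].lower().split()), axis=1)
joined['top_words'] = joined['count'].apply(lambda c: c.most_common(3))
print(joined[['product', 'top_words']])
# product a -> [('product', 3), ('delivery', 2), ('fast', 2)]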
You can use the following function. The idea is to:
Group your data by byvar and combine all the words in yvar into a list.
Apply Counter and, if you want, keep only the most common words.
Explode to get a long-format dataframe (easier to analyze afterwards).
Keep only the relevant columns (word and count) in a new dataframe.
from collections import Counter
import pandas as pd

def count_words_by(data, yvar, byvar):
    cw = pd.DataFrame({'counter': data
                       .groupby(byvar)
                       .apply(lambda s: ' '.join(s[yvar]).split())
                       .apply(lambda s: Counter(s))
                       # .apply(lambda s: s.most_common(10))  # uncomment this line if you want the top 10 words
                       .explode()}
                      )
    cw[['word', 'count']] = pd.DataFrame(cw['counter'].tolist(), index=cw.index)
    cw_red = cw[['word', 'count']].reset_index()
    return cw_red
count_words_by(data = df, yvar = "review", byvar = "product")
where I assume you start from this data:
product review
a Great Product
a Delivery was fast
a Product received in good condition
a Fast delivery but useless product
b Dont recommend
b I love it
b Please dont buy
b Second purchase
I have raw data in a string, which are basically multiple keywords in the form-
Law, of, three, stages
Alienation
Social, Facts
Theory, of, Social, System
How do I import it into a dataframe such that it counts repetition and returns me a count of each word?
Edit: I've converted it into the following format
Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System
I want to convert it into a dataframe because I want to eventually predict which word has the highest probability of recurring.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Law', 'of', 'three', 'stages', 'Alienation', 'Social', 'Facts', 'Theory', 'of', 'Social', 'System']
})
df['name'] = df.name.str.split('[ ,]', expand=True)
print(df)

word_freq = pd.Series(np.concatenate([x.split() for x in df.name])).value_counts()
print(word_freq)
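If you start from the single comma-separated string in your edit, a shorter sketch gives the same counts directly:
raw = "Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System"
word_freq = pd.Series(raw.split(',')).value_counts()
print(word_freq)            # 'of' and 'Social' appear twice, the rest once
print(word_freq.index[0])   # one of the words with the highest count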
Use a dictionary
word_count_dict = {}
with open("Yourfile.txt") as file_stream:
    lines = file_stream.readlines()

for line in lines:
    line = line.strip()               # drop the trailing newline
    if "," in line:
        line = line.split(",")
    else:
        line = [line]
    for item in line:
        item = item.strip()           # drop spaces left by ", " separators
        if item in word_count_dict:
            word_count_dict[item] += 1
        else:
            word_count_dict[item] = 1
Now you have the count of every word. If you want a probability-based ordering, it is recommended to divide each value by the total count of occurrences:
total = sum(word_count_dict.values(), 0.0)
probability_words = {k: v / total for k, v in word_count_dict.items()}
Now probability_words maps each word to its chance of occurrence.
Sort in descending order of probability:
sorted_probability_words = sorted(probability_words.items(), key=lambda x: x[1], reverse=True)
Getting the element with the highest chance:
print(sorted_probability_words[0])      # the (word, probability) pair with the highest chance
print(sorted_probability_words[0][0])   # the word itself
print(sorted_probability_words[0][1])   # its probability
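The same thing can be done with collections.Counter, which handles the counting and the ordering for you (a sketch on the keywords from the question):
from collections import Counter

words = "Law,of,three,stages,Alienation,Social,Facts,Theory,of,Social,System".split(',')
counts = Counter(words)
total = sum(counts.values())
probability_words = {word: count / total for word, count in counts.items()}
print(counts.most_common(1))   # [('of', 2)] -- ties are broken by first appearance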
I have a column named word_count which contains the count of all the words in a review. How can I find the number of times the word awesome occurs in each row of that column, and use the .apply() method to put it into a new column, say awesome?
products['word_count'][1]
{'and': 3L,'bags': 1L,'came': 1L, 'disappointed.':1L,'does':1L,'early':1L,'highly': 1L,'holder.': 1L, 'awesome': 2L}
how can i get the output
products['awesome'][1]
2
What I understood is that you have a dictionary called products which holds a word counter for various texts, like this:
products = {'word_count' : [{'holder.': 2, 'awesome': 1}, {'and': 3,'bags': 1,'came': 1, 'disappointed.':1,'does':1,'early':1,'highly': 1,'holder.': 1, 'awesome': 2}] }
for instance, the first text contains "holder" 2 times and awesome 1 time.
To add another column, create the list that counts 'awesome' in each text as follows:
counter = []
for i in range(len(products['word_count'])):
    # .get returns 0 when 'awesome' is missing from a row, instead of raising a KeyError
    counter.append(products['word_count'][i].get('awesome', 0))
and then add it as a new column:
products['awesome'] = counter
and there you have it!
Here's the code for the python function counting_words:
def counting_words(x):
    if 'awesome' in products['word_count'][x]:
        return products['word_count'][x]['awesome']
    else:
        return 0
Here's the other part of the code
new_dict = {}
for x in range(len(products)):
    if x == 0:
        new_dict['awesome'] = [counting_words(x)]
    else:
        new_dict['awesome'].append(counting_words(x))

newframe = graphlab.SFrame(new_dict)
products.add_columns(newframe)
I assumed that you are using GraphLab, and the above code will work for the word 'awesome'. new_dict was created to store the count of 'awesome' in each row of your products['word_count'] column, so new_dict should end up like: new_dict = {'awesome': [0, 0, 1, ..., 2, 1]}.
However, if you plan to count other words, this method would be too slow.
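If the SArray apply is available in your GraphLab version, a more general sketch (not tested against your data) pulls the count for any word straight out of each dictionary:
def count_word(word_counts, word='awesome'):
    # return 0 when the word is absent instead of raising a KeyError
    return word_counts.get(word, 0)

products['awesome'] = products['word_count'].apply(count_word)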