Let's say I have a df like this:
ID  name_x  st   string
1   xx      us   Being unacquainted with the chief raccoon was harming his prospects for promotion
2   xy      us1  The overpass went under the highway and into a secret world
3   xz      us   He was 100% into fasting with her until he understood that meant he couldn't eat
4   xu      us2  Random words in front of other random words create a random sentence
5   xi      us1  All you need to do is pick up the pen and begin
Using Python and pandas, for each value of column st I want to count the name_x values and then extract the top 3 keywords from string.
For example, like this:
st   name_x_count  top1_word  top2_word  top3_word
us   2             word1      word2      word3
us1  2             word1      word2      word3
us2  1             word1      word2      word3
Is there any way to solve this task?
I would first groupby() to concatenate the strings as you show, then use collections.Counter with most_common(), and finally assign the result back to the dataframe. I am using x.lower() because otherwise "He" and "he" would be considered different words (but you can remove it if that is intended):
import collections
import pandas as pd

output = df.groupby('st').agg(
    name_x_count=pd.NamedAgg('name_x', 'count'),
    string=pd.NamedAgg('string', ' '.join))
After grouping, we create the top-word columns using collections.Counter():
output[['top1_word', 'top2_word', 'top3_word']] = output['string'].map(
    lambda s: [w for w, _ in collections.Counter(s.lower().split()).most_common(3)])
output = output.drop(columns='string')
Output:
name_x_count top1_word top2_word top3_word
st
us 2 he with was
us1 2 the and overpass
us2 1 random words in
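Note that with this data the most common words end up being function words such as "he" and "with". If you want more meaningful keywords, one option is to drop a small stopword set before counting. A minimal sketch of that idea, applied to the concatenated strings before the string column is dropped (the STOP set below is an illustrative assumption, not part of the answer above):
import collections
import pandas as pd

# illustrative stopword set -- extend or replace as needed
STOP = {'the', 'a', 'an', 'and', 'he', 'her', 'was', 'with', 'is', 'to', 'of', 'in', 'into'}

def top3_keywords(text):
    # lowercase, split on whitespace, drop stopwords, keep the 3 most common tokens
    tokens = [w for w in text.lower().split() if w not in STOP]
    return [w for w, _ in collections.Counter(tokens).most_common(3)]

grouped = df.groupby('st')['string'].agg(' '.join)
top3 = pd.DataFrame(grouped.map(top3_keywords).tolist(),
                    index=grouped.index,
                    columns=['top1_word', 'top2_word', 'top3_word'])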
First, I added a space at the end of each string, as we will combine sentences while grouping. Then I consolidated the sentences after grouping by the st column.
df['string'] = df['string'] + ' '  # we will use the sum function, so the sentences need a space between them when concatenated
dfx = df.groupby('st').agg({'name_x': 'count', 'string': 'sum'})  # group by st, count name_x and combine the strings
dfx = dfx.rename(columns={'name_x': 'name_x_count'})
Then split each combined string into words, compute their frequency distribution and take the first 3 values.
from collections import Counter
mask=dfx['string'].apply(lambda x: list(dict(Counter(x.split()).most_common()[:3]).keys()))
print(mask)
'''
st string
us ['with', 'was', 'he']
us1 ['the', 'and', 'The']
us2 ['words', 'random', 'Random']
'''
Finally, add these first 3 words as new columns.
dfx[['top1_word','top2_word','top3_word']]=pd.DataFrame(mask.tolist(), index= mask.index)
dfx
st name_x_count top1_word top2_word top3_word
us 2 with was he
us1 2 the and The
us2 1 words random Random
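Note that 'The'/'the' and 'Random'/'random' are counted as separate words in this output. If that is not intended, lowercasing before counting merges them; a minimal tweak of the line above:
mask=dfx['string'].apply(lambda x: list(dict(Counter(x.lower().split()).most_common()[:3]).keys()))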
I have a given text string:
text = """Alice has two apples and bananas. Apples are very healty."""
and a dataframe:
word
apples
bananas
company
I would like to add a column "frequency" which will count occurrences of each word in column "word" in the text.
So the output should be as below:
word      frequency
apples    2
bananas   1
company   0
import pandas as pd
df = pd.DataFrame(['apples', 'bananas', 'company'], columns=['word'])
para = "Alice has two apples and bananas. Apples are very healty.".lower()
df['frequency'] = df['word'].apply(lambda x : para.count(x.lower()))
word frequency
0 apples 2
1 bananas 1
2 company 0
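Keep in mind that str.count here matches substrings, so a word like 'apples' would also be counted inside 'pineapples'. If you need whole-word matches only, a hedged variant of the same idea using word-boundary regexes (same df and para as above):
import re

df['frequency'] = df['word'].apply(
    lambda w: len(re.findall(r'\b' + re.escape(w.lower()) + r'\b', para)))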
Convert the text to lowercase and then use regex to convert it to a list of words. You might check out this page for learning purposes.
Loop through each row in the dataframe and use a lambda function to count the occurrences of each word in the previously created list.
# Import and create the data
import pandas as pd
import re
text = """Alice has two apples and bananas. Apples are very healty."""
df = pd.DataFrame(data={'word':['apples','bananas','company']})
# Solution
words_list = re.findall(r'\w+', text.lower())
df['Frequency'] = df['word'].apply(lambda x: words_list.count(x))
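If the word list grows large, calling words_list.count() once per row rescans the whole list every time. A possible alternative (a sketch, not part of the answer above) is to build a Counter once and look words up in it:
from collections import Counter

word_counts = Counter(words_list)
# Counter returns 0 for words that never occur in the text
df['Frequency'] = df['word'].map(lambda w: word_counts[w])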
I have created a dataframe with just a column with the subject line.
df = activities.filter(['Subject'],axis=1)
df.shape
This returned this dataframe:
Subject
0 Call Out: Quadria Capital - May Lo, VP
1 Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2 Columbia Partners: WW Worked (Not Sure Will Ev...
3 Meeting, Sophie, CFO, CDC Investment
4 Prospecting
I then tried to analyse the text with this code:
import nltk
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
The error message I get is: 'Series' object has no attribute 'Subject'
The error is being thrown because you have converted df to a Series in this line:
df = activities.filter(['Subject'],axis=1)
So when you say:
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
df is the Series and does not have the attribute Subject. Try replacing it with:
txt = df.str.lower().str.replace(r'\|', ' ')
Or alternatively, don't filter your DataFrame down to a single column beforehand, and then
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
should work.
[UPDATE]
What I said above is incorrect; as pointed out, filter does not return a Series, but rather a DataFrame with a single column.
Data:
Subject
"Call Out: Quadria Capital - May Lo, VP"
Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
Columbia Partners: WW Worked (Not Sure Will Ev...
"Meeting, Sophie, CFO, CDC Investment"
Prospecting
# read in the data
df = pd.read_clipboard(sep=',')
Updated code:
Convert all words to lowercase and remove all non-alphanumeric characters
txt = df.Subject.str.lower().str.replace(r'\|', ' ') creates a pandas.core.series.Series and will be replaced.
words = nltk.tokenize.word_tokenize(txt) throws a TypeError because txt is a Series.
The following code tokenizes each row of the dataframe instead.
Tokenizing splits each string into a list of words; after this step, looking at df will show a tok column where each row is a list.
import nltk
import pandas as pd
top_N = 50
# replace all non-alphanumeric characters
df['sub_rep'] = df.Subject.str.lower().str.replace(r'\W', ' ', regex=True)
# tokenize
df['tok'] = df.sub_rep.apply(nltk.tokenize.word_tokenize)
To analyze all the words in the column, the individual row lists are combined into a single list called words.
# all tokenized words to a list
words = df.tok.tolist() # this is a list of lists
words = [word for list_ in words for word in list_]
# frequency distribution
word_dist = nltk.FreqDist(words)
# remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
# output the results
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
Output rslt:
Word Frequency
call 2
out 2
quadria 1
capital 1
may 1
lo 1
vp 1
revelstoke 1
anthony 1
hayes 1
sr 1
assoc 1
columbia 1
partners 1
ww 1
worked 1
not 1
sure 1
will 1
ev 1
meeting 1
sophie 1
cfo 1
cdc 1
investment 1
prospecting 1
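Note that rslt above is built from word_dist, so stopwords such as 'not' and 'will' still appear even though words_except_stop_dist was computed. If you want the stopword-free counts instead, you could build the result frame from that distribution:
rslt_no_stop = pd.DataFrame(words_except_stop_dist.most_common(top_N), columns=['Word', 'Frequency'])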
I'm wondering if there's a more general way to do the below: a way to create the st function so that it can search for a non-predefined number of strings?
So, for instance, being able to create a generalized st function and then call st('Governor', 'Virginia', 'Google').
Here's my current function, but it predefines the number of words you can use (df is a pandas DataFrame):
def search(word1, word2, word3, df):
    """
    allows you to search an intersection of three terms
    """
    return df[df.Name.str.contains(word1) & df.Name.str.contains(word2) & df.Name.str.contains(word3)]
st('Governor', 'Virginia', newauthdf)
You could use np.logical_and.reduce:
import pandas as pd
import numpy as np
def search(df, *words):  # 1
    """
    Return a sub-DataFrame of those rows whose Name column match all the words.
    """
    return df[np.logical_and.reduce([df['Name'].str.contains(word) for word in words])]  # 2

df = pd.DataFrame({'Name': ['Virginia Google Governor',
                            'Governor Virginia',
                            'Governor Virginia Google']})
print(search(df, 'Governor', 'Virginia', 'Google'))
prints
Name
0 Virginia Google Governor
2 Governor Virginia Google
The * in def search(df, *words) allows search to accept an unlimited number of positional arguments. It will collect all the arguments (after the first) and place them in a tuple called words.
np.logical_and.reduce([X,Y,Z]) is equivalent to X & Y & Z. It
allows you to handle an arbitrarily long list, however.
str.contains can take a regex, so you can use '|'.join(words) as the pattern to match any of the words; to be safe, map with re.escape as well:
>>> df
Name
0 Test
1 Virginia
2 Google
3 Google in Virginia
4 Apple
[5 rows x 1 columns]
>>> words = ['Governor', 'Virginia', 'Google']
'|'.join(map(re.escape, words)) would be the search pattern:
>>> import re
>>> pat = '|'.join(map(re.escape, words))
>>> df.Name.str.contains(pat)
0 False
1 True
2 True
3 True
4 False
Name: Name, dtype: bool
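Keep in mind that '|' makes this an OR: a row matches if it contains any of the words. If you want the original intersection behaviour (every word must be present) in a single pattern, one option (a sketch, not part of the answer above) is to chain regex lookaheads:
>>> pat_all = ''.join('(?=.*{})'.format(re.escape(w)) for w in words)
>>> pat_all
'(?=.*Governor)(?=.*Virginia)(?=.*Google)'
df.Name.str.contains(pat_all) is then True only for rows that contain every word (none of the rows in this small example do).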
I have a dataframe that consists of two columns: ID and TEXT. Pretend data is below:
ID TEXT
265 The farmer plants grain. The fisher catches tuna.
456 The sky is blue.
434 The sun is bright.
921 I own a phone. I own a book.
I know nltk functions do not work directly on dataframes. How could sent_tokenize be applied to the above dataframe?
When I try:
df.TEXT.apply(nltk.sent_tokenize)
The output is unchanged from the original dataframe. My desired output is:
TEXT
The farmer plants grain.
The fisher catches tuna.
The sky is blue.
The sun is bright.
I own a phone.
I own a book.
In addition, I would like to tie back this new (desired) dataframe to the original ID numbers like this (following further text cleansing):
ID TEXT
265 'farmer', 'plants', 'grain'
265 'fisher', 'catches', 'tuna'
456 'sky', 'blue'
434 'sun', 'bright'
921 'I', 'own', 'phone'
921 'I', 'own', 'book'
This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!
Edit: as a result of warranted prodding by @alexis, here is a better response.
Sentence Tokenization
This should get you a DataFrame with one row for each ID & sentence:
import pandas

sentences = []
for row in df.itertuples():
    for sentence in row[2].split('.'):
        if sentence != '':
            sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
This gives a DataFrame with an ID column and a SENTENCE column, one row per sentence.
split('.') will quickly break strings up into sentences if sentences are in fact separated by periods and periods are not being used for other things (e.g. denoting abbreviations), and will remove periods in the process. This will fail if there are multiple use cases for periods and/or not all sentence endings are denoted by periods. A slower but much more robust approach would be to use, as you had asked, sent_tokenize to split rows up by sentence:
from nltk import sent_tokenize

sentences = []
for row in df.itertuples():
    for sentence in sent_tokenize(row[2]):
        sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
This produces the same ID/SENTENCE layout, but with proper sentence splitting.
If you want to quickly remove periods from these lines you could do something like:
new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.'))
which strips the leading and trailing periods from each sentence.
You can also take the apply -> map approach (df is your original table):
df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES'))
This adds a SENTENCES column holding the list of sentences for each row.
Continuing:
sentences = df.SENTENCES.apply(pandas.Series)
sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]
This expands each list into separate 'sentence 1', 'sentence 2', ... columns.
As our indices have not changed, we can join this back into our original table:
df = df.join(sentences)
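On pandas 0.25 or newer you could also get the long one-row-per-sentence shape directly with explode, which keeps the ID alignment for you; a sketch using the SENTENCES column created above:
long_df = (df[['ID', 'SENTENCES']]
           .explode('SENTENCES')
           .rename(columns={'SENTENCES': 'SENTENCE'})
           .reset_index(drop=True))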
Word Tokenization
Continuing with df from above, we can extract the tokens in a given sentence as follows:
df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)
This is a little complicated. I apply sentence tokenization first, then go through each sentence, remove words that are in the remove_words list, and strip punctuation from each remaining word.
import pandas as pd
from nltk import sent_tokenize
from string import punctuation
remove_words = ['the', 'an', 'a']
def remove_punctuation(chars):
    return ''.join([c for c in chars if c not in punctuation])
# example dataframe
df = pd.DataFrame([[265, "The farmer plants grain. The fisher catches tuna."],
[456, "The sky is blue."],
[434, "The sun is bright."],
[921, "I own a phone. I own a book."]], columns=['sent_id', 'text'])
df.loc[:, 'text_split'] = df.text.map(sent_tokenize)
sentences = []
for _, r in df.iterrows():
    for s in r.text_split:
        filtered_words = [remove_punctuation(w) for w in s.split() if w.lower() not in remove_words]
        # or using nltk.word_tokenize
        # filtered_words = [w for w in word_tokenize(s) if w.lower() not in remove_words and w not in punctuation]
        sentences.append({'sent_id': r.sent_id,
                          'text': s.strip('.'),
                          'words': filtered_words})
df_words = pd.DataFrame(sentences)
Output
+-------+--------------------+--------------------+
|sent_id| text| words|
+-------+--------------------+--------------------+
| 265|The farmer plants...|[farmer, plants, ...|
| 265|The fisher catche...|[fisher, catches,...|
| 456| The sky is blue| [sky, is, blue]|
| 434| The sun is bright| [sun, is, bright]|
| 921| I own a phone| [I, own, phone]|
| 921| I own a book| [I, own, book]|
+-------+--------------------+--------------------+
Using Canopy and Pandas, I have data frame a which is defined by:
a=pd.read_csv('text.txt')
df=pd.DataFrame(a)
df.columns=["test"]
test.txt is a single-column file containing strings that mix text, numbers and punctuation.
Assuming df looks like:
test
%hgh&12
abc123!!!
porkyfries
I want my results to be:
test
hgh12
abc123
porkyfries
Effort so far:
from string import punctuation   # import punctuation list from Python itself

a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]   # define the dataframe

for p in list(punctuation):
    df2 = df.test.str.replace(p, '')
    df2 = pd.DataFrame(df2)
    df2
The command above basically just returns me with the same data set.
Appreciate any leads.
Edit: The reason I am using pandas is that the data is huge, spanning about 1M rows, and future usage of the code will be applied to lists of up to 30M rows.
Long story short, I need to clean the data very efficiently for big data sets.
Using str.replace with the correct regex would be easier:
In [41]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
text
0 test
1 %hgh&12
2 abc123!!!
3 porkyfries
[4 rows x 1 columns]
use regex with the pattern [^\w\s], which matches anything that is not alphanumeric or whitespace
In [49]:
df['text'] = df['text'].str.replace('[^\w\s]','')
df
Out[49]:
text
0 test
1 hgh12
2 abc123
3 porkyfries
[4 rows x 1 columns]
For removing punctuation from a text column in your dataframe:
In:
import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)
pattern
Out:
'[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~]'
In:
df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df
Out:
text
0 book...regh
1 book...
2 boo,
3 book.
4 ball,
5 ballnroll"
6 "rope"
7 rick %
In:
df['text'] = df['text'].str.replace(pattern, '')
df
You can replace the pattern with your desired character. Ex - replace(pattern, '$')
Out:
text
0 bookregh
1 book
2 boo
3 book
4 ball
5 ballnroll
6 rope
7 rick
Translate is often considered the cleanest and fastest way to remove punctuation (source)
import string

# Python 2 syntax: delete all punctuation except the double quote character
text = text.translate(None, string.punctuation.translate(None, '"'))
You may find that it works better to remove punctuation in 'a' before loading it into pandas.
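For reference, a Python 3 sketch of the same translate idea applied directly to a pandas column (str.maketrans builds the deletion table; the column name is taken from the question):
import string
import pandas as pd

df = pd.DataFrame({'test': ['%hgh&12', 'abc123!!!', 'porkyfries']})

# translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)
df['test'] = df['test'].str.translate(table)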