I have a dataframe like below:
data = {'speaker':['Adam','Ben','Clair'],
'speech': ['Thank you very much and good afternoon.',
'Let me clarify that because I want to make sure we have got everything right',
'By now you should have some good rest']}
df = pd.DataFrame(data)
I want to count the number of words in the speech column but only for the words from a pre-defined list. For example, the list is:
wordlist = ['much', 'good','right']
I want to generate a new column which shows the frequency of these three words in each row. My expected output is therefore:
speaker speech words
0 Adam Thank you very much and good afternoon. 2
1 Ben Let me clarify that because I want to make sur... 1
2 Clair By now you should have some good rest 1
I tried:
df['total'] = 0
for word in df['speech'].str.split():
    if word in wordlist:
        df['total'] += 1
But after running it, the total column is still all zeros. What is wrong with my code?
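The loop never increments anything because df['speech'].str.split() is a Series whose elements are lists of words, so the loop variable is a whole list on each iteration and the membership check against wordlist is never True; even if it were, df['total'] += 1 would add 1 to every row rather than to the current one. A minimal per-row fix, kept close to the original loop, would be:
df['total'] = 0
for i, words in enumerate(df['speech'].str.split()):
    # words is the list of tokens for row i
    for w in words:
        if w in wordlist:
            df.loc[i, 'total'] += 1
That said, an explicit loop is rarely needed in pandas.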
You could use the following vectorised approach:
data = {'speaker':['Adam','Ben','Clair'],
'speech': ['Thank you very much and good afternoon.',
'Let me clarify that because I want to make sure we have got everything right',
'By now you should have some good rest']}
df = pd.DataFrame(data)
wordlist = ['much', 'good','right']
df['total'] = df['speech'].str.count(r'\b(?:{})\b'.format('|'.join(wordlist)))
Which gives:
>>> df
speaker speech total
0 Adam Thank you very much and good afternoon. 2
1 Ben Let me clarify that because I want to make sur... 1
2 Clair By now you should have some good rest 1
This can be a faster (runtime-wise) solution if you have a very large word list and a large dataframe to search through.
I suspect that is because it takes advantage of a dictionary (a Counter), which takes O(n) to construct and O(1) per lookup, whereas a regex search has to scan each string against the whole alternation and is slower.
import pandas as pd
from collections import Counter
def occurrence_counter(target_string, search_list):
    data = dict(Counter(target_string.split()))
    count = 0
    for key in search_list:
        if key in data:
            count += data[key]
    return count
data = {'speaker':['Adam','Ben','Clair'],
'speech': ['Thank you very much and good afternoon.',
'Let me clarify that because I want to make sure we have got everything right',
'By now you should have some good rest']}
df = pd.DataFrame(data)
wordlist = ['much', 'good','right']
df['speech'].apply(lambda x: occurrence_counter(x, wordlist))
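To store the result as a new column, assign the output of apply back to the dataframe:
df['total'] = df['speech'].apply(lambda x: occurrence_counter(x, wordlist))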
import pandas as pd
data = {'speaker': ['Adam', 'Ben', 'Clair'],
'speech': ['Thank you very much and good afternoon.',
'Let me clarify that because I want to make sure we have got everything right',
'By now you should have some good rest']}
df = pd.DataFrame(data)
wordlist = ['much', 'good', 'right']
df["speech"] = df["speech"].str.split()
df = df.explode("speech")
counts = df[df.speech.isin(wordlist)].groupby("speaker").size()
print(counts)
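Note that this prints per-speaker counts rather than the per-row column the question asks for. Since each speaker appears only once in this example, one way to map the counts back (a sketch, rebuilding the original frame from data) is:
# rebuild the unexploded frame and look up each speaker's count
df_out = pd.DataFrame(data)
df_out['total'] = df_out['speaker'].map(counts).fillna(0).astype(int)
print(df_out)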
This is my text file. I want to convert it into columns such as speaker and comments and save it as a csv. I have a huge file, so an automated way of doing this would be helpful.
>bernardo11_5
Have you had quiet guard?
>francisco11_5
Not a mouse stirring.
>bernardo11_6
Well, good night.
If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
>francisco11_6
I think I hear them.--Stand, ho! Who is there?
>horatio11_1
Friends to this ground.
>marcellus11_1
And liegemen to the Dane.
Something like this?
import re
from pathlib import Path
import pandas as pd
text = Path('input.txt').read_text()
speaker = re.findall(">(.*)", text)
comments = re.split(">.*", text)
comments = [c.strip() for c in comments if c.strip()]
df = pd.DataFrame({'speaker': speaker, 'comments': comments})
This will give you full comments including newline characters.
For saving:
a) replace '\n' before calling to_csv()
df.comments = df.comments.str.replace('\n', '\\n')
b) save to a more suitable format, e.g., to_parquet()
c) split single comment into multiple rows
df.comments = df.comments.str.split('\n')
df.explode('comments')
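Putting option (c) together with to_csv, a minimal end-to-end sketch (assuming the same input.txt) could look like:
import re
from pathlib import Path
import pandas as pd

text = Path('input.txt').read_text()
speaker = re.findall(">(.*)", text)
comments = [c.strip() for c in re.split(">.*", text) if c.strip()]

df = pd.DataFrame({'speaker': speaker, 'comments': comments})
df['comments'] = df['comments'].str.split('\n')   # one list of lines per turn
df = df.explode('comments')                       # one row per line
df.to_csv('result.csv', index=False)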
One way is to parse the file line by line and load the result into a dataframe.
read the file
with open("test.txt") as fp:
    data = fp.readlines()
remove empty lines
data = [x for x in data if x != "\n"]
separate into speaker and comments
speaker = []
comments = []
speaker_text = ""
for value in data:
    if ">" in value:
        speaker_text = value
    else:
        speaker.append(speaker_text)
        comments.append(value)
convert to dataframe
df = pd.DataFrame({
    "speaker": speaker,
    "comments": comments
})
save as csv
df.to_csv("result.csv", index=False)
output
speaker comments
0 >bernardo11_5\n Have you had quiet guard?\n
1 >francisco11_5\n Not a mouse stirring.\n
2 >bernardo11_6\n Well, good night.\n
3 >bernardo11_6\n If you do meet Horatio and Marcellus,\n
4 >bernardo11_6\n The rivals of my watch, bid them make haste.\n
5 >francisco11_6\n I think I hear them.--Stand, ho! Who is there?\n
6 >horatio11_1\n Friends to this ground.\n
7 >marcellus11_1\n And liegemen to the Dane.\n
I have a corpus of conversations (400) between two people, stored as strings (or more precisely as plain text files). A small example of this might be:
my_textfiles = ['john: hello \nmary: hi there \njohn: nice weather \nmary: yes',
'nancy: hello \nbill: hi there \nnancy: nice weather \nbill: yes',
'ringo: hello \npaul: hi there \nringo: nice weather \npaul: yes',
'michael: hello \nbubbles: hi there \nmichael: nice weather \nbubbles: yes',
'steve: hello \nsally: hi there \nsteve: nice weather \nsally: yes']
In addition to speaker names, I have also noted each speaker's role in the conversation (leader or follower, depending on whether they are the first or second speaker). I then have a simple script that converts each conversation into a dataframe by separating the speaker ID from the content:
import pandas as pd
import re
import numpy as np
import random
def convo_tokenize(tf):
    turnTokenize = re.split(r'\n(?=.*:)', tf, flags=re.MULTILINE)
    turnTokenize = [turn.split(':', 1) for turn in turnTokenize]
    dataframe = pd.DataFrame(turnTokenize, columns=['speaker', 'turn'])
    return dataframe
df_list = [convo_tokenize(tf) for tf in my_textfiles]
The corresponding dataframe then forms the basis of a much longer piece of analysis. However, I would now like to be able to shuffle speakers so that I create entirely random (and likely nonsense) conversations. For instance, John, who is having a conversation with Mary in the first string, might be randomly assigned Paul (the second speaker in the third string). Crucially, I would need to maintain the order of speech within each speaker. It is also important that, when randomly assigning new speakers, I preserve the leader/follower mix, so that I am not creating conversations between two leaders or two followers.
To begin, my thinking was to create a standardized speaker label (where 1 = leader, 2 = follower), then split each DF into role-specific sub-DFs and store them in separate lists:
def speaker_role(dataframe):
    leader = dataframe['speaker'].iat[0]
    dataframe['sp_role'] = np.where(dataframe['speaker'].eq(leader), 1, 2)
    return dataframe
df_list = [speaker_role(df) for df in df_list]
leader_df = []
follower_df = []
for df in df_list:
    is_leader = df['sp_role'] == 1
    is_follower = df['sp_role'] != 1
    leader_df.append(df[is_leader])
    follower_df.append(df[is_follower])
I have worked out that I can now simply shuffle the list of sub-DFs for one of the roles, in this case follower_df:
follower_rand = random.sample(follower_df, len(follower_df))
Having got to this stage I'm not sure where to turn next. I suspect I will need some sort of zip function, but am unsure exactly what. I'm also unsure how I go about merging the turns together such that they form the same dataframe structure I initially had. Assuming Ringo (leader) is randomly assigned to Bubbles (follower) for one of the DFs, I would hope to have something like this...
speaker | turn | sp_role
------------------------------------
ringo hello 1
bubbles hi there 2
ringo nice weather 1
bubbles yes 2
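For what it's worth, a minimal sketch of the shuffle-and-interleave step described above, assuming the leader_df and follower_df lists built earlier: it pairs each leader with a randomly chosen follower and interleaves their turns by position.
follower_rand = random.sample(follower_df, len(follower_df))

shuffled_convos = []
for leader, follower in zip(leader_df, follower_rand):
    # give both sub-frames a 0..n positional index so each speaker's turn order is kept
    leader = leader.reset_index(drop=True)
    follower = follower.reset_index(drop=True)
    # a stable sort on that index puts leader turn i before follower turn i
    convo = pd.concat([leader, follower]).sort_index(kind='mergesort').reset_index(drop=True)
    shuffled_convos.append(convo)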
I'm working on text analysis and I'm trying to quantify the value of a sentence as the sum of the values assigned to certain words when they appear in the sentence. I have a DF with words and values such as:
import pandas as pd
df_w = pd.DataFrame( { 'word': [ 'high', 'sell', 'hello'],
'value': [ 32, 45, 12] } )
Then I have sentences in another DF such as:
df_s = pd.DataFrame({'sentence': [ 'hello life if good',
'i sell this at a high price',
'i sell or you sell'] } )
Now, I want to add a column in df_s with the sum of the values of each word in the sentence, if the word is in df_w. To do so, I tried:
df_s['value'] = df_s['sentence'].apply(lambda x: sum(df_w['value'][df_w['word'].isin(x.split(' '))]))
The result is:
sentence value
0 hello life if good 12
1 i sell this at a high price 77
2 i sell or you sell 45
My problem with this answer is that in the last sentence, i sell or you sell, sell appears twice and I was expecting 90 (2*45), but sell is only counted once, so I get 45.
To solve this, I decided to create a dictionary and then use apply:
dict_w = pd.Series(df_w['value'].values,index=df_w['word']).to_dict()
df_s['value'] = df_s['sentence'].apply(lambda x: sum([dict_w[word] for word in x.split(' ') if word in dict_w.keys()]))
This time, the result is what I expected (90 for the last sentence). But my problem comes with larger DFs: the method with dict_w takes about 20 times longer than the method with isin for my test case.
Do you know a way to multiply the value of a word by its number of occurrences within the isin method? Any other solution is welcome too.
You can use str.split with stack, filter the result with isin, replace the matched key words with their values, then sum per row and assign it back:
s=df_s.sentence.str.split(' ',expand=True).stack()
df_s['Value']=s[s.isin(df_w.word)].replace(dict(zip(df_w.word,df_w.value))).sum(level=0)
df_s
Out[984]:
sentence Value
0 hello life if good 12
1 i sell this at a high price 77
2 i sell or you sell 90
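Note that sum(level=0) has since been deprecated and removed in recent pandas versions; with those versions the equivalent is a groupby on the index level:
df_s['Value'] = s[s.isin(df_w.word)].replace(dict(zip(df_w.word, df_w.value))).groupby(level=0).sum()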
Create a lookup function, with a default value of 0, out of the dictionary's get method:
dw = lambda x: dict(zip(df_w.word, df_w.value)).get(x, 0)
df_s.assign(value=[sum(map(dw, s.split())) for s in df_s.sentence])
sentence value
0 hello life if good 12
1 i sell this at a high price 77
2 i sell or you sell 90
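One small refinement: as written, the lambda rebuilds the dictionary on every call, so for larger frames it is worth building the lookup once and closing over it:
# build the word -> value lookup a single time
lookup = dict(zip(df_w.word, df_w.value))
dw = lambda x: lookup.get(x, 0)
df_s.assign(value=[sum(map(dw, s.split())) for s in df_s.sentence])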
Thanks to piRSquared's answer with its map function, I had the idea to use merge:
df_s['value'] = df_s['sentence'].apply(lambda x: sum(pd.merge(pd.DataFrame({'word':x.split(' ')}),df_w)['value']))
Thanks to Wen's answer with its stack function, I used the same idea but with merge:
df_stack = pd.DataFrame({'word': df_s['sentence'].str.split(' ',expand=True).stack()})
df_s['value'] = df_stack.reset_index().merge(df_w).set_index(['level_0','level_1'])['value'].sum(level=0)
And both methods give me the right answer.
Finally, to test which solution is faster, I define functions such as:
def sol_dict(df_s, df_w):  # answer with a dict
    dict_w = pd.Series(df_w['value'].values, index=df_w['word']).to_dict()
    df_s['value'] = df_s['sentence'].apply(lambda x: sum([dict_w[word] for word in x.split(' ') if word in dict_w.keys()]))
    return df_s

def sol_wen(df_s, df_w):  # answer of Wen
    s = df_s.sentence.str.split(' ', expand=True).stack()
    df_s['value'] = s[s.isin(df_w.word)].replace(dict(zip(df_w.word, df_w.value))).sum(level=0)
    return df_s

def sol_pi(df_s, df_w):  # answer of piRSquared
    dw = lambda x: dict(zip(df_w.word, df_w.value)).get(x, 0)
    df_s.assign(value=[sum(map(dw, s.split())) for s in df_s.sentence])
    # or df_s['value'] = [sum(map(dw, s.split())) for s in df_s.sentence]
    return df_s

def sol_merge(df_s, df_w):  # answer with merge
    df_s['value'] = df_s['sentence'].apply(lambda x: sum(pd.merge(pd.DataFrame({'word': x.split(' ')}), df_w)['value']))
    return df_s

def sol_stack(df_s, df_w):  # answer with stack and merge
    df_stack = pd.DataFrame({'word': df_s['sentence'].str.split(' ', expand=True).stack()})
    df_s['value'] = df_stack.reset_index().merge(df_w).set_index(['level_0', 'level_1'])['value'].sum(level=0)
    return df_s
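A sketch of the kind of harness that could be used for such a comparison (the actual timing code from my test is not shown here); fresh copies are passed in so one run does not mutate the DFs for the next:
import timeit

for sol in (sol_dict, sol_wen, sol_pi, sol_merge, sol_stack):
    # average over 10 calls on copies of the frames
    t = timeit.timeit(lambda: sol(df_s.copy(), df_w.copy()), number=10)
    print(sol.__name__, t / 10)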
My "large" test DFs where composed of around 3200 words in df_w and around 42700 words in df_s (once split all sentences). I run timeit with several size of df_w (from 320 to 3200 words) with the full size of df_s and then with several size of df_s (from 3500 to 42700 words) with the full size of df_w. After curve fitting my results, I got:
To conclude, whatever is the size of both DFs, the method using stack then merge is really efficient (around 100ms, sorry not really visible on graphs). I run it on a my full size DFs with around 54k words in df_w 2.4 millions words in df_s and I got the results in few seconds.
Thanks both for your ideas.
I have two pandas data frames. The first one contains a list of unigrams extracted from the text, count and probability of the unigram occurring in the text. The structure looks like this:
unigram_df
word count prob
0 we 109 0.003615
1 investigated 20 0.000663
2 the 1125 0.037315
3 potential 36 0.001194
4 of 1122 0.037215
The second one contains a list of skipgrams extracted from the same text, along with the count and probability of the skipgram occurring in the text. It looks like this:
skipgram_df
word count prob
0 (we, investigated) 5 0.000055
1 (we, the) 31 0.000343
2 (we, potential) 2 0.000022
3 (investigated, the) 11 0.000122
4 (investigated, potential) 3 0.000033
Now, I want to calculate the pointwise mutual information for each skipgram, which is basically the log of the skipgram probability divided by the product of its unigrams' probabilities. I wrote a function for that which iterates through the skipgram df, and it works exactly how I want, but I have huge performance issues, and I wanted to ask if there is a way to improve my code so it calculates the PMI faster.
Here's my code:
def calculate_pmi(row):
    skipgram_prob = float(row[3])
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][0]]['prob'])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][1]]['prob'])
    pmi = math.log10(float(skipgram_prob / (x_unigram_prob * y_unigram_prob)))
    result = str(str(row[1][0]) + ' ' + str(row[1][1]) + ' ' + str(pmi))
    return result
pmi_list = list(map(calculate_pmi, skipgram_df.itertuples()))
Performance of the function for now is around 483.18it/s, which is super slow, as I have hundreds of thousands of skipgrams to iterate through. Any suggestions would be welcome. Thanks.
This is a good question, and a good exercise, for new users of pandas. Iterating row by row (itertuples, iterrows) should be a last resort and, even then, consider alternatives; there are relatively few occasions when it is the right option.
Below is an example of how you can vectorise your calculations.
import pandas as pd
import numpy as np
uni = pd.DataFrame([['we', 109, 0.003615], ['investigated', 20, 0.000663],
['the', 1125, 0.037315], ['potential', 36, 0.001194],
['of', 1122, 0.037215]], columns=['word', 'count', 'prob'])
skip = pd.DataFrame([[('we', 'investigated'), 5, 0.000055],
[('we', 'the'), 31, 0.000343],
[('we', 'potential'), 2, 0.000022],
[('investigated', 'the'), 11, 0.000122],
[('investigated', 'potential'), 3, 0.000033]],
columns=['word', 'count', 'prob'])
# first split column of tuples in skip
skip[['word1', 'word2']] = skip['word'].apply(pd.Series)
# set index of uni to 'word'
uni = uni.set_index('word')
# merge prob1 & prob2 from uni to skip
skip['prob1'] = skip['word1'].map(uni['prob'].get)
skip['prob2'] = skip['word2'].map(uni['prob'].get)
# perform calculation and filter columns
skip['result'] = np.log(skip['prob'] / (skip['prob1'] * skip['prob2']))
skip = skip[['word', 'count', 'prob', 'result']]
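One detail: the question's calculate_pmi uses math.log10 while this answer uses the natural log; to reproduce the base-10 PMI, replace the np.log line with:
skip['result'] = np.log10(skip['prob'] / (skip['prob1'] * skip['prob2']))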
I have a dataset with a column that has comments. These comments are comma-separated phrases.
df_pat['reason'] =
chest pain
chest pain, dyspnea
chest pain, hypertrophic obstructive cariomyop...
chest pain
chest pain
cad, rca stents
non-ischemic cardiomyopathy, chest pain, dyspnea
I would like to generate separate columns in the dataframe, one for each distinct phrase across all comments, containing 1 or 0 depending on whether the row's comment contains that phrase.
For example:
df_pat['chest_pain'] =
1
1
1
1
1
0
1
df_pat['dyspnea'] =
0
1
0
0
0
0
1
And so on...
Thank you!
sklearn.feature_extraction.text has something for you! It looks like you may be trying to predict something. If so, and if you're planning to use scikit-learn at some point, then you can bypass making a dataframe with len(set(words)) columns and just use CountVectorizer. This method returns a matrix with dimensions (rows, columns) = (number of rows in the dataframe, number of distinct phrases in the entire 'reason' column).
from sklearn.feature_extraction.text import CountVectorizer
df = pd.DataFrame({'reason': ['chest pain', 'chest pain, dyspnea', 'chest pain, hypertrophic obstructive cariomyop', 'chest pain', 'chest pain', 'cad, rca stents', 'non-ischemic cardiomyopathy, chest pain, dyspnea']})
# turns body of text into a matrix of features
# split the string on commas instead of spaces, and strip surrounding whitespace
# so that " chest pain" and "chest pain" become the same feature
vectorizer = CountVectorizer(tokenizer=lambda x: [t.strip() for t in x.split(",")])
# X is now a n_documents by n_distinct_words-dimensioned matrix of features
X = vectorizer.fit_transform(df['reason'])
pandas plays really nicely with sklearn.
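For example, to get the indicator-style columns back into a dataframe (get_feature_names_out assumes scikit-learn 1.0+; older versions use get_feature_names):
# dense matrix of counts, one column per distinct phrase
counts_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_out = pd.concat([df, counts_df], axis=1)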
Or, a strict pandas solution that should probably be vectorized, but if you don't have that much data, should work:
# split on the comma instead of spaces to get "chest pain" instead of "chest" and "pain";
# strip whitespace and deduplicate so each phrase becomes exactly one column
reasons = {reason.strip() for case in df['reason'] for reason in case.split(",")}

for reason in reasons:
    for idx in df.index:
        if reason in df.loc[idx, 'reason']:
            df.loc[idx, reason] = 1
        else:
            df.loc[idx, reason] = 0
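A more concise pandas alternative (assuming the phrases are consistently separated by ', ') is str.get_dummies, which builds exactly the 0/1 columns the question describes:
# one 0/1 column per distinct phrase
dummies = df['reason'].str.get_dummies(sep=', ')
df_out = pd.concat([df, dummies], axis=1)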