Groupby multiple columns in pandas dataframe - python

I have a dataframe that looks like:
page reference ids subject word
1 apple ['aaaa', 'bbbbb', 'cccc'] name app
1 apple ['bndv', 'asasa', 'swdsd'] fruit is
1 apple ['bsnm', 'dfsd', 'dgdf'] fruit text
1 bat ['asas', 'ddfgd', 'ff'] thing sport
1 cat ['sds', 'dffd', 'gdg'] fruit color
1 bat ['sds', 'fsss', 'ssfd'] thing was
1 bat ['fsf', 'sff', 'fss'] place that
2 dog ['fffds', 'gd', 'sdg'] name mud
2 egg ['dfff', 'sdf', 'vcv'] place gun
2 dog ['dsfd', 'fds', 'gfdg'] thing kit
2 egg ['ddd', 'fg', 'dfg'] place hut
I want to group by the reference and subject columns. The output should look like this:
page reference ids subject word
1 apple [['bndv', 'asasa', 'swdsd'],['bsnm', 'dfsd', 'dgdf']] fruit [[is], [text]]
1 apple ['aaaa', 'bbbbb', 'cccc'] name [app]
1 bat [['asas', 'ddfgd', 'ff'], ['sds', 'fsss', 'ssfd']] thing [[sport], [was]]
1 bat ['fsf', 'sff', 'fss'] place [that]
1 cat ['sds', 'dffd', 'gdg'] fruit [color]
2 dog ['fffds', 'gd', 'sdg'] name [mud]
2 dog ['dsfd', 'fds', 'gfdg'] thing [kit]
2 egg [['dfff', 'sdf', 'vcv'], ['ddd', 'fg', 'dfg']] place [[gun], [hut]]

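In case you want to run the steps below yourself, here is a minimal sketch that rebuilds the sample dataframe as df (copied from the table above):
import pandas as pd

df = pd.DataFrame({
    "page": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "reference": ["apple", "apple", "apple", "bat", "cat", "bat", "bat", "dog", "egg", "dog", "egg"],
    "ids": [['aaaa', 'bbbbb', 'cccc'], ['bndv', 'asasa', 'swdsd'], ['bsnm', 'dfsd', 'dgdf'],
            ['asas', 'ddfgd', 'ff'], ['sds', 'dffd', 'gdg'], ['sds', 'fsss', 'ssfd'],
            ['fsf', 'sff', 'fss'], ['fffds', 'gd', 'sdg'], ['dfff', 'sdf', 'vcv'],
            ['dsfd', 'fds', 'gfdg'], ['ddd', 'fg', 'dfg']],
    "subject": ["name", "fruit", "fruit", "thing", "fruit", "thing", "place", "name", "place", "thing", "place"],
    "word": ["app", "is", "text", "sport", "color", "was", "that", "mud", "gun", "kit", "hut"],
})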
First, group and aggregate the necessary fields:
res = df.groupby(["reference", "subject"]).agg(
    {"page": min, "ids": list, "word": lambda l: [[ll] for ll in l]}
).reset_index()
reference subject page ids word
0 apple fruit 1 [[bndv, asasa, swdsd], [bsnm, dfsd, dgdf]] [[is], [text]]
1 apple name 1 [[aaaa, bbbbb, cccc]] [[app]]
2 bat place 1 [[fsf, sff, fss]] [[that]]
3 bat thing 1 [[asas, ddfgd, ff], [sds, fsss, ssfd]] [[sport], [was]]
4 cat fruit 1 [[sds, dffd, gdg]] [[color]]
5 dog name 2 [[fffds, gd, sdg]] [[mud]]
6 dog thing 2 [[dsfd, fds, gfdg]] [[kit]]
7 egg place 2 [[dfff, sdf, vcv], [ddd, fg, dfg]] [[gun], [hut]]
Note that this also wraps each word value in a list, matching your desired output. I'm assuming you want the minimum page value in each group, since you didn't state a rule for that column; change min in the agg call to whatever is appropriate.
Then you can unwrap the lists of length 1:
res["word"] = res["word"].apply(lambda l: l[0] if len(l) == 1 else l)
res["ids"] = res["ids"].apply(lambda l: l[0] if len(l) == 1 else l)
reference subject page ids word
0 apple fruit 1 [[bndv, asasa, swdsd], [bsnm, dfsd, dgdf]] [[is], [text]]
1 apple name 1 [aaaa, bbbbb, cccc] [app]
2 bat place 1 [fsf, sff, fss] [that]
3 bat thing 1 [[asas, ddfgd, ff], [sds, fsss, ssfd]] [[sport], [was]]
4 cat fruit 1 [sds, dffd, gdg] [color]
5 dog name 2 [fffds, gd, sdg] [mud]
6 dog thing 2 [dsfd, fds, gfdg] [kit]
7 egg place 2 [[dfff, sdf, vcv], [ddd, fg, dfg]] [[gun], [hut]]

Related

Create new column using str.contains and based on if-else condition

I have a list of names, pol_names_list, that I join into a pattern and wish to match against the strings in column 'url_text'. If there is a match (i.e. True), the matched name should be written to a new column 'pol_names_block'; if False, the row should be left empty.
pattern = '|'.join(pol_names_list)
print(pattern)
'Jon Kyl|Doug Jones|Tim Kaine|Lindsey Graham|Cory Booker|Kamala Harris|Orrin Hatch|Bernie Sanders|Thom Tillis|Jerry Moran|Shelly Moore Capito|Maggie Hassan|Tom Carper|Martin Heinrich|Steve Daines|Pat Toomey|Todd Young|Bill Nelson|John Barrasso|Chris Murphy|Mike Rounds|Mike Crapo|John Thune|John. McCain|Susan Collins|Patty Murray|Dianne Feinstein|Claire McCaskill|Lamar Alexander|Jack Reed|Chuck Grassley|Catherine Masto|Pat Roberts|Ben Cardin|Dean Heller|Ron Wyden|Dick Durbin|Jeanne Shaheen|Tammy Duckworth|Sheldon Whitehouse|Tom Cotton|Sherrod Brown|Bob Corker|Tom Udall|Mitch McConnell|James Lankford|Ted Cruz|Mike Enzi|Gary Peters|Jeff Flake|Johnny Isakson|Jim Inhofe|Lindsey Graham|Marco Rubio|Angus King|Kirsten Gillibrand|Bob Casey|Chris Van Hollen|Thad Cochran|Richard Burr|Rob Portman|Jon Tester|Bob Menendez|John Boozman|Mazie Hirono|Joe Manchin|Deb Fischer|Michael Bennet|Debbie Stabenow|Ben Sasse|Brian Schatz|Jim Risch|Mike Lee|Elizabeth Warren|Richard Blumenthal|David Perdue|Al Franken|Bill Cassidy|Cory Gardner|Lisa Murkowski|Maria Cantwell|Tammy Baldwin|Joe Donnelly|Roger Wicker|Amy Klobuchar|Joel Heitkamp|Joni Ernst|Chris Coons|Mark Warner|John Cornyn|Ron Johnson|Patrick Leahy|Chuck Schumer|John Kennedy|Jeff Merkley|Roy Blunt|Richard Shelby|John Hoeven|Rand Paul|Dan Sullivan|Tim Scott|Ed Markey'
I am using df['url_text'].str.contains(pattern), which returns True when a name from pattern is present in a row of column 'url_text' and False otherwise. With that I have tried the following code:
df['pol_name_block'] = df.apply(
    lambda row: pol_names_list if df['url_text'].str.contains(pattern) in row['url_text'] else ' ',
    axis=1
)
I get the error:
TypeError: 'in <string>' requires string as left operand, not Series
From this toy DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
... id,url_text
... 1,Tim Kaine
... 2,Tim Kain
... 3,Tim
... 4,Lindsey Graham.com
... """), sep=',')
>>> df
id url_text
0 1 Tim Kaine
1 2 Tim Kain
2 3 Tim
3 4 Lindsey Graham.com
From pol_names_list, we build patterns by formatting it like so:
patterns = '(%s)' % '|'.join(pol_names_list)
Then, we can use the extract method to assign the value to the column pol_name_block to get the expected result:
df['pol_name_block'] = df['url_text'].str.extract(patterns)
Output:
id url_text pol_name_block
0 1 Tim Kaine Tim Kaine
1 2 Tim Kain NaN
2 3 Tim NaN
3 4 Lindsey Graham.com Lindsey Graham
Change your pattern to enclose it in a capture group () and use extract:
pattern = fr"({'|'.join(pol_names_list)})"
df['pol_name_block'] = df['url_text'].str.extract(pattern)
print(df)
# Output <- with the sample from @tlentali
id url_text pol_name_block
0 1 Tim Kaine Tim Kaine
1 2 Tim Kain NaN
2 3 Tim NaN
3 4 Lindsey Graham Lindsey Graham
Important: extract returns only one element even if there are multiple matches. If you want to extract all matches, you have to use findall or extractall (only the output format changes).
# New sample, same pattern
>>> df
id url_text
0 1 Tim Kaine and Lindsey Graham
1 2 Tim Kain
2 3 Tim
3 4 Lindsey Graham
# findall
>>> df['url_text'].str.findall(pattern)
0 [Tim Kaine, Lindsey Graham]
1 []
2 []
3 [Lindsey Graham]
Name: url_text, dtype: object
# extractall
>>> df['url_text'].str.extractall(pattern)
0
match
0 0 Tim Kaine
1 Lindsey Graham
3 0 Lindsey Graham
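If you want all matches back in a single column rather than in the long format of extractall, one option (a sketch, not the only way) is to join the lists returned by findall:
df['pol_name_block'] = df['url_text'].str.findall(pattern).str.join(', ')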

Get the Most Popular Trigrams for Each Row in a Pandas Dataframe

I'm new to Python and trying to get a list of the most popular trigrams for each row in a Pandas dataframe, from a column named 'question'.
I've come close to what I need, but I am unable to get the popularity counts at a row level. Ideally I'd just like to keep the ngrams with a frequency above 1.
Minimal Reproducible Example:
import pandas as pd
import nltk
data = {
"question": [
"The quick brown fox jumps over the lazy dog",
"Waltz, bad nymph, for quick jigs vex.",
"Glib jocks quiz nymph to vex dwarf.",
"Sphinx of black quartz, judge my vow.",
"How vexingly quick daft zebras jump!",
] }
df = pd.DataFrame(data)
df["bigrams"] = df['question'].apply(lambda row: list(nltk.bigrams(row.split(' '))))
print(df)
Current Output:
question bigrams
0 The quick brown fox jumps over the lazy dog [(The, quick), (quick, brown), (brown, fox), (...
1 Waltz, bad nymph, for quick jigs vex. [(Waltz,, bad), (bad, nymph,), (nymph,, for), ...
2 Glib jocks quiz nymph to vex dwarf. [(Glib, jocks), (jocks, quiz), (quiz, nymph), ...
3 Sphinx of black quartz, judge my vow. [(Sphinx, of), (of, black), (black, quartz,), ...
4 How vexingly quick daft zebras jump! [(How, vexingly), (vexingly, quick), (quick, d...
Desired Output: (Or close to it - I'm not sure how best to represent the frequency counts!)
question bigrams
0 The quick brown fox jumps over the lazy dog [(The, quick,1), (quick, brown,1), (brown, fox), (...
1 Waltz, bad nymph, for quick jigs vex. [(Waltz,, bad,1), (bad, nymph,2), (nymph,, for), ...
1 Glib jocks quiz nymph to vex dwarf. [(Glib, jocks,1), (jocks,quiz,2),
1 Sphinx of black quartz, judge my vow. [(Sphinx, of,1), (of, black,2), (black, quartz,), ...
1 How vexingly quick daft zebras jump! [(How, vexingly.1), (vexingly, quick,1), (quick, d...
Input data (for demo purposes, all strings have been cleaned):
data = ["she wants to sing she wants to act she wants to dance",
"if you sing I will smile if you laugh I will smile if you love I will smile"]
df = pd.DataFrame({"question": data})
Compute frequency distribution of bigrams with nltk.FreqDist:
bigram_freq = lambda s: list(nltk.FreqDist(nltk.bigrams(s.split(" "))).items())
out = df['question'].apply(bigram_freq).explode()
out = pd.DataFrame(out.to_list(), index=out.index, columns=["bigram", "count"])
Output:
>>> out
bigram count
0 (she, wants) 3
0 (wants, to) 3
0 (to, sing) 1
0 (sing, she) 1
0 (to, act) 1
0 (act, she) 1
0 (to, dance) 1
1 (if, you) 3
1 (you, sing) 1
1 (sing, I) 1
1 (I, will) 3
1 (will, smile) 3
1 (smile, if) 2
1 (you, laugh) 1
1 (laugh, I) 1
1 (you, love) 1
1 (love, I) 1
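To keep only the bigrams with a frequency above 1 (the asker's stated goal), a possible follow-up, assuming out from the snippet above:
# keep only bigrams that occur more than once in their row
popular = out[out["count"] > 1]
print(popular)
# optionally collapse back to one list of popular bigrams per original row
print(popular.groupby(level=0)["bigram"].agg(list))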

Understanding Pandas Pivot function

I want to convert a categorical column in a pandas dataframe to multiple columns containing values. Here is a minimal example dataframe
import numpy as np
import pandas as pd

dfTest = pd.DataFrame({
    'animal': ['cat', 'cat', 'dog', 'dog', 'mouse', 'mouse', 'rat', 'rat'],
    'color': ['black', 'white', 'black', 'white', 'black', 'white', 'black', 'white'],
    'weight': np.random.uniform(3, 20, 8)
})
dfTest
The table looks like this
According to the pandas user guide, it seems that what I want to do is called a pivot. Namely, the result should look something like this:
animal weight_black weight_white
0 cat 1.23456 2.34234
1 dog 3.634634 3.4554646
2 mouse 5.24234 5.463452
3 rat 4.56456 2.3364
However, when I run
dfTest.pivot(columns='color', values='weight')
I get the following:
I don't want other categorical columns (such as animal) to disappear. Also, I don't want NaNs in between; I want everything to be compact. How do I do this?
EDIT: Here's a more involved example of what I want
animal color hair_length weight
1 cat black long 1.23
2 cat white long 2.34
3 cat black short 34534
4 cat white short 345
5 dog black long 234
6 dog white long 123
7 dog black short 444
8 dog white short 345
9 rat black long 5465
10 rat white long 2343
11 rat black short 123
12 rat white short 2343
13 bat black long 423
14 bat white long 23
15 bat black short 11123
16 bat white short 13423
I want to convert it to
animal hair_length weight_black weight_white
1 cat long 2.34 235
2 cat short 345 3423
3 dog long 123 56346
4 dog short 345 .... you get the point
5 rat long 2343
6 rat short 2343
7 bat long 23
8 bat short 13423
OK, I think I figured it out; @Randy's hint was actually enough:
index = list(set(df.columns) - {'color', 'weight'})
dfResult = df.pivot(index=index, columns='color', values='weight').reset_index()
So we:
1. Put all of the columns except the two columns of interest into the index.
2. Perform the pivot, which results in a hierarchical index.
3. Convert back to a simple index with reset_index().
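Applying the same idea to the original dfTest gives the first desired output; a minimal sketch, assuming pandas 1.1+ (which accepts a list for index in pivot). The renamed weight_black/weight_white columns are my own choice:
# index holds every column except the ones being pivoted; here that is just ['animal']
index = list(set(dfTest.columns) - {'color', 'weight'})
res = dfTest.pivot(index=index, columns='color', values='weight').reset_index()
res.columns = ['animal', 'weight_black', 'weight_white']  # flatten the column index
print(res)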

How to select the subset of data from each category using for loop in Python?

I have customer data (in CSV format) as:
index category text
0 spam you win much money
1 spam you win 7000 car
2 not_spam the weather in Chicago is nice
3 neutral we have a party now
4 neutral they are driving to downtown
5 not_spam pizza is an Italian food
As an example, the categories have the following counts:
customer.category.value_counts():
spam 100
not_spam 20
neutral 45
where:
min(customer.category.value_counts()): 20
I want to write a for loop in Python that creates a new data file in which every category contains the same number of rows, equal to the smallest category count (in this example the smallest category is not_spam).
My expected output would be:
new_customer.category.value_counts():
spam 20
not_spam 20
neutral 20
It's easier to use groupby:
min_count = df.category.value_counts().min()
df.groupby('category').head(min_count)
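Note that head(min_count) keeps the first min_count rows of each category; if a random subset is preferred, newer pandas versions (1.1+) also offer a per-group sample, a minimal sketch:
min_count = df.category.value_counts().min()
balanced = df.groupby('category').sample(n=min_count)  # random min_count rows per category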
That said, if you really want a loop, you can write it as a list comprehension, which is faster:
categories = df.category.unique()
min_count = df.category.value_counts().min()
df = pd.concat([df.query('category == @cat')[:min_count] for cat in categories])
My randomly generated dataframe has 38 rows with the following distribution of categories:
spam 17
not_spam 16
neutral 5
Name: category, dtype: int64
I was thinking that the first thing you need to do is find the smallest category, and once you know that, you can .sample() each category using the calculated value as n:
def sample(df: pd.DataFrame, category: str):
    threshold = df[category].value_counts().min()
    for cat in df[category].unique():
        data = df.loc[df[category].eq(cat)]
        yield data.sample(threshold)

data = sample(df, "category")
pd.concat(data, ignore_index=True)
text category
0 v not_spam
1 l not_spam
2 q not_spam
3 j not_spam
4 f not_spam
5 l spam
6 t spam
7 r spam
8 n spam
9 k spam
10 n neutral
11 n neutral
12 d neutral
13 q neutral
14 l neutral
This should work. It keeps concatenating the top minval records from each category:
minval = min(df1.category.value_counts())
df2 = pd.concat([df1[df1.category == cat].head(minval) for cat in df1.category.unique() ])
print(df2)

How to store a calculated value in new column by iterating through each row in a dataframe in Python?

The dataframe I am working with looks like this:
vid2 FStart FEnd cap2 VDuration COS cap1
0 -_aaMGK6GGw_57_61 0 3 A man grabbed a boy from his collar and threw ... 4 2 A man and woman are yelling at a young boy and...
1 -_aaMGK6GGw_57_61 3 4 A lady is waking up a man lying on a chair and... 4 2 A man and woman are yelling at a young boy and...
2 -_hbPLsZvvo_5_8 0 1 A white dog is barking and a caption is writte... 3 2 a dog barking and cooking with her master in t...
... ... ... ... ... ... ...
I am trying to calculate a similarity score between the two columns cap1 and cap2. However, I want to create a new column FSim that stores this similarity score for each row.
The code I have implemented till now is:
#The function that calculates the similarity score
def get_cosine_similarity(feature_vec_1, feature_vec_2):
return cosine_similarity(feature_vec_1.reshape(1, -1), feature_vec_2.reshape(1, -1))[0][0]
for i, row in merged.iterrows():
    captions = []
    captions.append(row['cap1'])
    captions.append(row['cap2'])
    for c in range(len(captions)):
        captions[c] = pre_process(captions[c])
        captions[c] = lemmatize_sentence(captions[c])
    feature_vectors = tfidf_vectorizer.transform(captions)
    fsims = get_cosine_similarity(feature_vectors[0], feature_vectors[1])
    merged['fsim'] = fsims
But I am getting the same similarity score stored for each row, like this:
fsim
0 0.054464
1 0.054464
2 0.054464
3 0.054464
4 0.054464
The same value appears for all rows.
How do I properly store the score for each row?
How about this? (I'm assuming the DataFrame you showed first is merged.) The reason every row ends up with the same number is that merged['fsim'] = fsims assigns a single scalar to the entire column on every iteration, so only the last computed value survives.
def preproc_and_lemmatize(x):
    v1 = pre_process(x)
    return lemmatize_sentence(v1)

def calc_sim(x, y):
    x2 = preproc_and_lemmatize(x)
    y2 = preproc_and_lemmatize(y)
    feature_vectors = tfidf_vectorizer.transform([x2, y2])
    return get_cosine_similarity(feature_vectors[0], feature_vectors[1])

merged['fsim'] = [
    calc_sim(x, y) for x, y in zip(merged['cap1'], merged['cap2'])
]
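For completeness, the same row-wise computation can also be written with apply, reusing the calc_sim helper above:
merged['fsim'] = merged.apply(lambda r: calc_sim(r['cap1'], r['cap2']), axis=1)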
If you prefer fewer edits to your original loop, this will also work:
merged["fsim"] = 0
for i, row in merged.iterrows():
captions = []
captions.append(row['cap1'])
captions.append(row['cap2'])
for c in range(len(captions)):
captions[c] = pre_process(captions[c])
captions[c] = lemmatize_sentence(captions[c])
feature_vectors = tfidf_vectorizer.transform(captions)
fsims = get_cosine_similarity(feature_vectors[0], feature_vectors[1])
merged['fsim'].iloc[i] = fsims
