Pandas Dataframe group by, column with a list - python

I'm using Jupyter notebooks, and my current dataframe looks like the following:
players_mentioned | tweet_text | polarity
-----------------------------------------
[Mane, Salah]     | xyz        | 0.12
[Salah]           | asd        | 0.06
How can I group all players individually and average their polarity?
Currently I have tried to use:
df.groupby(df['players_mentioned'].map(tuple))['polarity'].mean()
But this returns a grouping that treats [Mane, Salah] together as its own key, as well as the separate mentions. How best can I go about splitting the players up and then grouping them back together?
An expected output would contain
player | polarity_average
-------------------------
Mane   | 0.12
Salah  | 0.09
In other words how to group by each item in the lists in every row.

You can use the unnesting idiom from this answer:
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each row's index once per list element in the exploded column
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode],
        axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
You can now call groupby on the unnested "players_mentioned" column:
(unnesting(df, ['players_mentioned'])
    .groupby('players_mentioned', as_index=False)['polarity'].mean())
players_mentioned polarity
0 Mane 0.12
1 Salah 0.09
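On pandas 0.25 or newer, the same result can be had without the helper by using the built-in DataFrame.explode (a minimal sketch, re-creating the question's frame):

```python
import pandas as pd

# re-create the question's dataframe
df = pd.DataFrame({
    'players_mentioned': [['Mane', 'Salah'], ['Salah']],
    'tweet_text': ['xyz', 'asd'],
    'polarity': [0.12, 0.06],
})

# explode gives each list element its own row, repeating the other columns,
# after which an ordinary groupby averages per player
result = (df.explode('players_mentioned')
            .groupby('players_mentioned', as_index=False)['polarity'].mean())
print(result)
```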

If you are just looking to group by players_mentioned as-is and get the average polarity for each group, this should do it:
df.groupby('players_mentioned').polarity.agg('mean')

Related

Use row values from a pandas dataframe as new column labels

If I have a pandas dataframe, is it possible to get values from a row and use them as labels for new columns?
I have something like this:
| Team | DateTime   | Score
| Red  | 2021/03/19 | 5
| Red  | 2021/03/20 | 10
| Blue | 2022/04/10 | 20
I would like to write this data to a new dataframe that has:
a Team column
a Year/Month SumScore column per month
So I would have one row per team, with a new column for each month in a year that contains the sum of the score for that specific month.
It should be like this:
Team | 2021/03 | 2022/04
------------------------
Red  | 15      | 0
Blue | 0       | 20
The date format is YYYY/MM/DD.
I hope I was clear.
You can use:
df = (df.assign(YM=df['DateTime'].str.rsplit('/', n=1).str[0])
        .pivot_table(index='Team', columns='YM', values='Score',
                     aggfunc='sum', fill_value=0)
        .reset_index())
print(df)
YM Team 2021/03 2022/04
0 Blue 0 20
1 Red 15 0
We can use pd.crosstab, which allows us to "compute a simple cross tabulation of two (or more) factors".
Below I've changed df['DateTime'] to contain year/month only.
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.strftime('%Y/%m')
pd.crosstab(
    df['Team'],
    df['DateTime'],
    values=df['Score'],
    aggfunc='sum'
).fillna(0)
If you don't want multiple levels in the index, just call reset_index on the crosstab result and then drop the leftover DateTime name from the columns.
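For example, flattening the crosstab result back to an ordinary frame could look like this (a sketch using the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Red', 'Red', 'Blue'],
    'DateTime': ['2021/03/19', '2021/03/20', '2022/04/10'],
    'Score': [5, 10, 20],
})
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.strftime('%Y/%m')

out = pd.crosstab(df['Team'], df['DateTime'],
                  values=df['Score'], aggfunc='sum').fillna(0)

# reset_index turns Team back into a column; rename_axis(None, axis=1)
# drops the leftover 'DateTime' name on the column index
out = out.reset_index().rename_axis(None, axis=1)
print(out)
```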

Pandas DataFrame: Fill NA values based on group mean

I would like to update the NA values of a Pandas DataFrame column with the values in a groupby object.
Let's illustrate with an example:
We have the following DataFrame columns:
|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1 | 1 | 1 | 14.3 |
| 2 | 1 | 1 | 14.8 |
| 3 | 1 | 2 | 13.1 |
|--------|-------|-----|-------------|
We're simply measuring temperature multiple times a day for many months. Now, let's assume that for some of our records, the temperature reading failed and we have a NA.
|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1 | 1 | 1 | 14.3 |
| 2 | 1 | 1 | 14.8 |
| 3 | 1 | 2 | 13.1 |
| 4 | 1 | 2 | NA |
| 5 | 1 | 3 | 14.8 |
| 6 | 1 | 4 | NA |
|--------|-------|-----|-------------|
We could just use pandas' .fillna(), however we want to be a little more sophisticated. Since there are multiple readings per day (there could be hundreds per day), we'd like to take the daily average and use that as our fill value.
We can get the daily averages with a simple groupby:
avg_temp_by_month_day = df.groupby(['Month', 'Day'])['Temperature'].mean()
which gives us the mean temperature for each (month, day) pair. The question is, how best to fill the NA values with these groupby values?
We could use an apply(),
df['Temperature'] = df.apply(
    lambda row: avg_temp_by_month_day.loc[row['Month'], row['Day']]
    if pd.isna(row['Temperature']) else row['Temperature'],
    axis=1
)
however this is really slow (1M+ records).
Is there a vectorized approach, perhaps using np.where(), or maybe creating another Series and merging?
What's a more efficient way to perform this operation?
Thank you!
I'm not sure if this is the fastest, however instead of taking ~1 hour with apply, it takes ~20 sec for 1M+ records. The code below has been updated to work on one or many columns.
local_avg_cols = ['temperature']  # can work with multiple columns

# Create groupby's to get local averages
local_averages = df.groupby(['month', 'day'])[local_avg_cols].mean()
# Convert to DataFrame and prepare for merge
local_averages = pd.DataFrame(local_averages, columns=local_avg_cols).reset_index()
# Merge into original dataframe
df = df.merge(local_averages, on=['month', 'day'], how='left', suffixes=('', '_avg'))
# Now overwrite na values with values from the new '_avg' col
for col in local_avg_cols:
    df[col] = df[col].mask(df[col].isna(), df[col + '_avg'])
# Drop new avg cols
df = df.drop(columns=[col + '_avg' for col in local_avg_cols])
If anyone finds a more efficient way to do this, (efficient in processing time, or in just readability), I'll unmark this answer and mark yours. Thank you!
I'm guessing that what slows down your process are two things. First, you don't need to convert your groupby result to a dataframe. Second, you don't need the for loop.
import pandas as pd
from numpy import nan

# Populate the dataset
data = {"Month": [1] * 6,
        "Day": [1, 1, 2, 2, 3, 4],
        "Temperature": [14.3, 14.8, 13.1, nan, 14.8, nan]}
# Create the dataframe
df = pd.DataFrame(data)
local_averages = df.groupby(['Month', 'Day'])['Temperature'].mean().reset_index()
df = df.merge(local_averages, on=['Month', 'Day'], how='left', suffixes=('', '_avg'))
# Fill the missing values of the Temperature column with what is available in Temperature_avg
df['Temperature'] = df['Temperature'].fillna(df['Temperature_avg'])
df = df.drop(columns="Temperature_avg")
Groupby is a resource-heavy process, so make the most of it when you use it. Furthermore, as you already know, loops are not a good idea when it comes to dataframes. Additionally, if you have large data you may want to avoid creating extra variables from it; I might inline the groupby into the merge if my data had 1M rows and many columns.
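A shorter route still, assuming the same Month/Day/Temperature columns, is groupby().transform('mean'): it returns a Series aligned to the original index, so the group means can be fed straight into fillna with no merge at all:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Month': [1] * 6,
    'Day': [1, 1, 2, 2, 3, 4],
    'Temperature': [14.3, 14.8, 13.1, np.nan, 14.8, np.nan],
})

# transform('mean') broadcasts each group's mean back onto that group's rows,
# so the result lines up with the original column index-for-index
group_mean = df.groupby(['Month', 'Day'])['Temperature'].transform('mean')
df['Temperature'] = df['Temperature'].fillna(group_mean)
print(df)
```

Note that a day whose readings are all NA (Day 4 here) stays NA, since its group mean is itself NA.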

Split Pivoted Index Column Pandas

I have a pivoted data frame that looks like this:
                | Units_sold | Revenue
--------------------------------------
California_2015 | 10         | 600
California_2016 | 15         | 900
There are additional columns, but basically what I'd like to do is unstack the index column, and have my table look like this:
State      | Year | Units_sold | Revenue
----------------------------------------
California | 2015 | 10         | 600
California | 2016 | 15         | 900
Basically I had two data frames that I needed to merge, on the state and year, but I'm just not sure how to split the index column/ if that's possible. Still pretty new to Python, so I really appreciate any input!!
df = pd.DataFrame({'Units_sold':[10,15],'Revenue':[600,900]}, index=['California_2015','California_2016'])
df = df.reset_index()
df['State'] = df['index'].str.split("_").str.get(0)
df['Year'] = df['index'].str.split("_").str.get(1)
df = df.set_index('State')[['Year','Units_sold','Revenue']]
df
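The two split calls can also be collapsed into one by passing expand=True, which returns the pieces as columns of a frame in a single step (a sketch on the same data):

```python
import pandas as pd

df = pd.DataFrame({'Units_sold': [10, 15], 'Revenue': [600, 900]},
                  index=['California_2015', 'California_2016'])

# expand=True makes str.split return one column per piece; assignment
# aligns on the index, so both new columns land in one statement
df[['State', 'Year']] = df.index.to_series().str.split('_', expand=True)
df = df.reset_index(drop=True)[['State', 'Year', 'Units_sold', 'Revenue']]
print(df)
```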

Grouping dataframe based on column similarities in Python

I have a dataframe with commonalities in groups of column names:
Sample1.Feature1 | Sample1.Feature2 | ... | Sample99.Feature1 | Sample99.Feature 2
And I'd like to reorder this as
Sample1            | ... | Sample99
Feature1, Feature2 | ... | Feature1, Feature2
I'd then have summary stats, e.g. mean, for Feature1, Feature2, grouped by Sample#. I've played with df.groupby() with no luck so far.
I hope my lack of table formatting skills doesn't distract from the question.
Thanks in advance.
Consider the dataframe df:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.ones((1, 6)),
    columns='s1.f1 s1.f2 s1.f3 s2.f1 s2.f2 s2.f3'.split())
df
Split the columns:
df.columns = df.columns.str.split('.', expand=True)
df
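From there, the per-sample summary stats the question asks about drop out by grouping on the first column level. A sketch (transposing first, since grouping columns directly with groupby(axis=1) is deprecated in recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.ones((1, 6)),
    columns='s1.f1 s1.f2 s1.f3 s2.f1 s2.f2 s2.f3'.split())
df.columns = df.columns.str.split('.', expand=True)

# transpose, group the (sample, feature) row index on its sample level,
# average, and transpose back to get one mean column per sample
sample_means = df.T.groupby(level=0).mean().T
print(sample_means)
```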

How do I apply a value from a dataframe based on the value of a multi-index of another dataframe?

I have the following:
Dataframe 1 (Multi-index dataframe):
| Assay_A |
---------------------------------------------------
Index_A | Index_B | Index_C | mean | std | count |
---------------------------------------------------
128 12345 AAA 123 2 4
Dataframe 2:
Index | Col_A | Col_B | Col_C | mean
-------------------------------------
1 128 12345 AAA 456
where Col_X corresponds to Index_X for X in {A, B, C}.
I have been spending all morning trying to do the following:
How do I pick the correct mean in dataframe 2 (which has to match up on columns A, B, and C) so I can do mathematical operations on it? For example, I want to take the mean of dataframe 1 and divide it by the correctly chosen mean of dataframe 2.
Ideally, I want to store the results of the operation in a new column. So the final output should look like this:
| Assay_A |
------------------------------------------------------------
Index_A | Index_B | Index_C | mean | std | count | result |
------------------------------------------------------------
128 12345 AAA 123 2 4 0.26
Perhaps there is an easier way to do this I would be open to any such suggestions as well.
What I suggest you do is: 1) rename the columns of Dataframe 2 to the respective names of the index columns of Dataframe 1, 2) reset the index on Dataframe 1, and 3) merge the two tables based on the now-matching column names. Afterwards you can compute whatever you like. The MultiIndex on the columns of Dataframe 1 adds a bit of additional overhead.
Explicitly:
import pandas as pd

# re-create table1
row_index = pd.MultiIndex.from_tuples([(128, 12345, 'AAA')])
row_index.names = ['Index_A', 'Index_B', 'Index_C']
table1 = pd.DataFrame(data={'mean': 123, 'std': 2, 'count': 4}, index=row_index)
table1.columns = pd.MultiIndex.from_tuples(list(zip(['Assay A'] * 3, table1.columns)))
print("*** table 1:")
print(table1)
print("")
# re-create table2
table2 = pd.DataFrame([{'Col_A': 128, 'Col_B': 12345, 'Col_C': 'AAA', 'mean': 456}], index=[1])
table2.index.name = 'Index'
print("*** table 2:")
print(table2)
print("")
# rename columns of table2 to match the names of the respective index columns in table1
table2 = table2.rename(columns={'Col_A': 'Index_A', 'Col_B': 'Index_B', 'Col_C': 'Index_C'})
# Drop the 'Assay A' index level on the columns of table1;
# without doing that, the following reset_index() will produce a column multi-index
# for Index_A/B/C, so column names will not match the simple column index of table2.
# If you need to keep the 'Assay A' level here, you will need to also construct a column
# multi-index for table2 (with empty values for the second level).
table1.columns = table1.columns.levels[1]
# Move the index columns of table1 back to regular columns
table1 = table1.reset_index()
# Merge the two tables on the now-common column names. 'mean' appears in both tables,
# so give the column from table2 the suffix '_2'.
joint = pd.merge(table1.reset_index(), table2, on=['Index_A', 'Index_B', 'Index_C'], suffixes=('', '_2'))
print("*** joint, before re-setting index:")
print(joint)
print("")
# Restore the index of the joint table
joint = joint.set_index(['Index_A', 'Index_B', 'Index_C'])
# Compute the 'result'
joint['result'] = joint['mean'] / joint['mean_2']
# drop unused columns
joint = joint.drop(['index', 'mean_2'], axis=1)
# restore the column index level
joint.columns = pd.MultiIndex.from_tuples(list(zip(['Assay A'] * 4, joint.columns)))
print("*** final result:")
print(joint)
print("")
The script output is:
*** table 1:
Assay A
count mean std
Index_A Index_B Index_C
128 12345 AAA 4 123 2
*** table 2:
Col_A Col_B Col_C mean
Index
1 128 12345 AAA 456
*** joint, before re-setting index:
index Index_A Index_B Index_C count mean std mean_2
0 0 128 12345 AAA 4 123 2 456
*** final result:
Assay A
count mean std result
Index_A Index_B Index_C
128 12345 AAA 4 123 2 0.269737
Hope that helps!
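A more compact alternative, assuming the same two frames (and dropping the 'Assay A' column level for simplicity): make the key columns of Dataframe 2 its index and let pandas align the division on the shared index values, with no merge at all:

```python
import pandas as pd

# re-create the two frames from the question, without the 'Assay A' column level
row_index = pd.MultiIndex.from_tuples([(128, 12345, 'AAA')],
                                      names=['Index_A', 'Index_B', 'Index_C'])
table1 = pd.DataFrame({'mean': [123], 'std': [2], 'count': [4]}, index=row_index)

table2 = pd.DataFrame([{'Col_A': 128, 'Col_B': 12345, 'Col_C': 'AAA', 'mean': 456}])

# give table2 the same index levels, then divide; pandas aligns the rows
# of the two 'mean' Series on their (now identical) MultiIndex values
table2 = table2.set_index(['Col_A', 'Col_B', 'Col_C'])
table2.index.names = ['Index_A', 'Index_B', 'Index_C']
table1['result'] = table1['mean'] / table2['mean']
print(table1)
```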
