Python pandas - DataFrame groupby and re-construct

I have a question about groupby() in pandas.
If I have a DataFrame df like
  user  day  click
0   U1  Mon     15
1   U2  Mon      7
2   U1  Wed     15
3   U3  Tue     21
4   U2  Tue     15
5   U2  Tue     10
When I use df.groupby(['user', 'day']).sum()
It would be
          click
user day
U1   Mon     15
     Tue    NaN
     Wed     15
U2   Mon      7
     Tue     25
     Wed    NaN
U3   Mon    NaN
     Tue     21
     Wed    NaN
How can I get a DataFrame like this?
day   Mon  Tue  Wed
user
U1     15  NaN   15
U2      7   25  NaN
U3    NaN   21  NaN
That is, transform one column's values into the column names of the DataFrame. Is there any method to do this?

Use the pivot function with day as the columns and click as the values:
df.groupby(['user', 'day']).sum().reset_index()\
.pivot(index='user',columns='day',values='click')
Out[388]:
day   Mon   Tue   Wed
user
U1   15.0   NaN  15.0
U2    7.0  25.0   NaN
U3    NaN  21.0   NaN
Or you can reset only the second index level, so you don't need to specify the index column in the pivot function:
df.groupby(['user', 'day']).sum().reset_index(level=1)\
.pivot(columns='day',values='click')
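As a side note, the same table can be produced in one step with pivot_table, which aggregates and pivots together. A minimal sketch, assuming the same df as in the question:
import pandas as pd

df = pd.DataFrame({'user': ['U1', 'U2', 'U1', 'U3', 'U2', 'U2'],
                   'day': ['Mon', 'Mon', 'Wed', 'Tue', 'Tue', 'Tue'],
                   'click': [15, 7, 15, 21, 15, 10]})

# aggfunc='sum' collapses duplicate (user, day) pairs, so no separate groupby is needed
print(df.pivot_table(index='user', columns='day', values='click', aggfunc='sum'))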

Just another way to use unstack():
df = df.groupby(['user', 'day']).sum().unstack('day')  # unstack
df.columns = df.columns.droplevel()  # drop the first column level (the 'click' label)
df
Output:
day   Mon   Tue   Wed
user
U1   15.0   NaN  15.0
U2    7.0  25.0   NaN
U3    NaN  21.0   NaN
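Selecting the click column before unstacking gives a Series, so the result already has plain day columns and the droplevel step is not needed. The same idea as a one-line sketch:
df = df.groupby(['user', 'day'])['click'].sum().unstack('day')  # single-level columns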

Related

Python/Pandas outer merge not including all relevant columns

I want to merge the following 2 data frames in Pandas, but the result doesn't contain all the relevant columns:
L1aIn[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L1aIn
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a 2021 3 29 1
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a 2021 3 29 1
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b 2021 3 29 1
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a 2021 3 29 1
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a 2021 3 29 1
L2Std[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L2Std
0 oco2_L2StdGL_35861a_210329_B10206r_21042704283... 35861 GL a 2021 3 29 1
1 oco2_L2StdXS_35860a_210329_B10206r_21042700342... 35860 XS a 2021 3 29 1
2 oco2_L2StdND_35852a_210329_B10206r_21042622540... 35852 ND a 2021 3 29 1
3 oco2_L2StdGL_35862a_210329_B10206r_21042622403... 35862 GL a 2021 3 29 1
4 oco2_L2StdTG_35856a_210329_B10206r_21042622422... 35856 TG a 2021 3 29 1
>>> df = L1aIn.copy(deep=True)
>>> df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a ... NaN NaN NaN NaN
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a ... NaN NaN NaN NaN
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b ... NaN NaN NaN NaN
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a ... NaN NaN NaN NaN
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a ... NaN NaN NaN NaN
5 NaN 35861 GL a ... 2021.0 3.0 29.0 1.0
6 NaN 35860 XS a ... 2021.0 3.0 29.0 1.0
7 NaN 35852 ND a ... 2021.0 3.0 29.0 1.0
8 NaN 35862 GL a ... 2021.0 3.0 29.0 1.0
9 NaN 35856 TG a ... 2021.0 3.0 29.0 1.0
[10 rows x 13 columns]
>>> df.columns
Index(['Filename', 'OrbitNumber', 'OrbitMode', 'OrbitModeCounter', 'Year',
'Month', 'Day', 'L1aIn'],
dtype='object')
I want the resulting merged table to include both the "L1aIn" and "L2Std" columns, but as you can see it doesn't, and only picks up the original columns from L1aIn.
I'm also puzzled about why it seems to be returning a dataframe object rather than None.
A toy example works fine for me, but the real-life one does not. What circumstances provoke this kind of behavior for merge?
Seems to me that you just need to assign the output of merge to a variable. merge returns a new DataFrame rather than modifying df in place, which is why the interactive call prints a DataFrame (not None) and why df.columns still shows only the original columns:
merged_df = df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
print(merged_df.columns)
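To see why the assignment matters, a toy sketch (frames and column names are hypothetical) showing that merge returns a new DataFrame and leaves its inputs untouched:
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})
right = pd.DataFrame({'key': [2, 3], 'b': ['p', 'q']})

merged = left.merge(right, how='outer', on='key')
print(left.columns)    # Index(['key', 'a'], dtype='object') -- unchanged
print(merged.columns)  # Index(['key', 'a', 'b'], dtype='object') -- both sides present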

ValueError: Data overlaps. in Python

I have a dataframe df3 that looks like this, with an unknown number of columns, since the AAA_??? columns can be anything from the dataset:
Date ID Calendar_Year Month DayName... AAA_1E AAA_BMITH AAA_4.1 AAA_CH
0 2019-09-17 8661 2019 Sep Sun... NaN NaN NaN NaN
1 2019-09-18 8662 2019 Sep Sun... 1.0 3.0 34.0 1.0
2 2019-09-19 8663 2019 Sep Sun... NaN NaN NaN NaN
3 2019-09-20 8664 2019 Sep Mon... NaN NaN NaN NaN
4 2019-09-20 8664 2019 Sep Mon... 2.0 4.0 32.0 3.0
5 2019-09-20 8664 2019 Sep Sat... NaN NaN NaN NaN
6 2019-09-20 8664 2019 Sep Sat... NaN NaN NaN NaN
7 2019-09-20 8664 2019 Sep Sat... 0.0 4.0 30.0 0.0
and another dataframe dfMeans that holds the means of a third dataframe:
Month Dayname ID ... AAA_BMITH AAA_4.1 AAA_CH
0 Jan Thu 7686.500000 ... 0.000000 28.045455 0.0
1 Jan Fri 7636.272727 ... 0.000000 28.136364 0.0
2 Jan Sat 7637.272727 ... 0.000000 27.045455 0.0
3 Jan Sun 7670.090909 ... 0.000000 27.090909 0.0
4 Jan Mon 7702.909091 ... 0.000000 27.727273 0.0
5 Jan Tue 7734.260870 ... 0.000000 27.956522 0.0
The dataframes are to be joined on Month and Dayname.
I want to replace the NaNs in df3 with values from dfMeans, using this line:
df3.update(dfMeans, overwrite=False, errors="raise")
but I get this error
raise ValueError("Data overlaps.")
ValueError: Data overlaps.
How can I update the NaNs with values from dfMeans and avoid this error?
Edit:
I have put all the dataframes in one dataframe df:
Month Dayname ID ... AAA_BMITH AAA_4.1 AAA_CH
0 Jan Thu 7686.500000 ... 0.000000 28.045455 0.0
1 Jan Fri 7636.272727 ... 0.000000 28.136364 0.0
2 Jan Sat 7637.272727 ... 0.000000 27.045455 0.0
3 Jan Sun 7670.090909 ... 0.000000 27.090909 0.0
4 Jan Mon 7702.909091 ... 0.000000 27.727273 0.0
5 Jan Tue 7734.260870 ... 0.000000 27.956522 0.0
How can I fill the NaNs with the averages based on Month and Dayname?
Using fillna:
Data:
Date ID Calendar_Year Month Dayname AAA_1E AAA_BMITH AAA_4.1 AAA_CH
2019-09-17 8661 2019 Jan Sun NaN NaN NaN NaN
2019-09-18 8662 2019 Jan Sun 1.0 3.0 34.0 1.0
2019-09-19 8663 2019 Jan Sun NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Mon NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Mon 2.0 4.0 32.0 3.0
2019-09-20 8664 2019 Jan Sat NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Sat NaN NaN NaN NaN
2019-09-20 8664 2019 Jan Sat 0.0 4.0 30.0 0.0
df.set_index(['Month', 'Dayname'], inplace=True)
df_mean:
Month Dayname ID AAA_BMITH AAA_4.1 AAA_CH
Jan Thu 7686.500000 0.0 28.045455 0.0
Jan Fri 7636.272727 0.0 28.136364 0.0
Jan Sat 7637.272727 0.0 27.045455 0.0
Jan Sun 7670.090909 0.0 27.090909 0.0
Jan Mon 7702.909091 0.0 27.727273 0.0
Jan Tue 7734.260870 0.0 27.956522 0.0
df_mean.set_index(['Month', 'Dayname'], inplace=True)
Update df:
This operation is based on matching index values.
It doesn't work with multiple column names at once, so you'll have to take the columns of interest and iterate through them.
Note that AAA_1E isn't in df_mean.
for col in df.columns:
    if col in df_mean.columns:
        df[col].fillna(df_mean[col], inplace=True)
You can groupby on 'Month' and 'DayName' and use apply to edit the dataframe.
Use fillna to fill the NaN values. fillna accepts a dictionary as its value parameter: the keys are column names and the values are scalars, which are substituted for the NaNs in each column. With loc you can select the proper value from dfMeans.
You can create the dictionary with a dict comprehension, using the intersection between the columns of df3 and dfMeans.
All this corresponds to the following statement:
df3filled = df3.groupby(['Month', 'DayName']).apply(
    lambda x: x.fillna(
        {col: dfMeans.loc[(dfMeans['Month'] == x.name[0]) & (dfMeans['Dayname'] == x.name[1]), col].iloc[0]
         for col in x.columns.intersection(dfMeans.columns)})).reset_index(drop=True)
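An alternative sketch of the same fill, assuming the column spellings above (df3 uses DayName, dfMeans uses Dayname): align the means to df3's rows with a left merge, then fill column by column.
# assumes df3 has a default RangeIndex, so the merged rows line up with df3's rows
aligned = df3[['Month', 'DayName']].merge(
    dfMeans.rename(columns={'Dayname': 'DayName'}),  # unify the key spelling
    on=['Month', 'DayName'], how='left')
for col in df3.columns.intersection(aligned.columns):
    df3[col] = df3[col].fillna(aligned[col])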

How to make a pandas multi-index data frame a simple table with only one row of column labels?

I was comparing SQL to pandas using http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html and found that the results of groupby differ between pandas and SQL.
For example:
In pandas:
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv')
df.head()
g = df.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
print(g)
Gives:
              tip
             size      mean
smoker day
No     Fri    4.0  3.187500
       Sat   45.0  3.361556
       Sun   57.0  3.386491
       Thur  45.0  3.122667
Yes    Fri   15.0  3.114000
       Sat   41.0  3.048049
       Sun   19.0  3.595789
       Thur  17.0  3.030000
How can I get the output like that given by SQL?
smoker day tip_size tip_mean
0 No Fri 4 2.812500
1 No Sat 45 3.102889
2 No Sun 57 3.167895
3 No Thur 45 2.673778
4 Yes Fri 15 2.714000
5 Yes Sat 41 2.701707
6 Yes Sun 19 3.516842
7 Yes Thur 17 3.030000
I found out the answer.
g = g.reset_index()
print(g)
gives:
smoker day tip
size mean
0 No Fri 4.0 2.812500
1 No Sat 45.0 3.102889
2 No Sun 57.0 3.167895
3 No Thur 45.0 2.673778
4 Yes Fri 15.0 2.714000
5 Yes Sat 42.0 2.875476
6 Yes Sun 19.0 3.516842
7 Yes Thur 17.0 3.030000
Now, g.columns.values gives:
array([('smoker', ''), ('day', ''), ('tip', 'size'), ('tip', 'mean')],
dtype=object)
Using a list comprehension, we can get the required column names:
g.columns = ['_'.join(e) if e[1] else ''.join(e) for e in g.columns.values]
print(g)
This gives:
smoker day tip_size tip_mean
0 No Fri 4.0 2.812500
1 No Sat 45.0 3.102889
2 No Sun 57.0 3.167895
3 No Thur 45.0 2.673778
4 Yes Fri 15.0 2.714000
5 Yes Sat 42.0 2.875476
6 Yes Sun 19.0 3.516842
7 Yes Thur 17.0 3.030000
Look into the g.reset_index() method.
This would solve the issue with the multi-index.
For the columns, I would suggest flattening using the get_level_values() method:
g.columns = g.columns.get_level_values(0) + '_' + g.columns.get_level_values(1)
(Note this leaves a trailing underscore on columns whose second level is empty, e.g. smoker_, which the conditional join above avoids.)
Also note from pandas group by documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
as_index : boolean, default True.
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
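For example, combining as_index=False with named aggregation produces the flat SQL-style table directly. A minimal sketch (named aggregation requires pandas 0.25 or later):
g = df.groupby(['smoker', 'day'], as_index=False).agg(
    tip_size=('tip', 'size'),
    tip_mean=('tip', 'mean'))
print(g)  # columns: smoker, day, tip_size, tip_mean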

Write column backwards with condition in Python

I have the following df and want to fill the Number column from the bottom up, overwriting values where necessary. The condition is to always carry the previous value forward unless the new value differs from the old value by more than 10%.
Date Number
2019 150
2018 NaN
2017 118
2016 NaN
2015 115
2014 107
2013 105
2012 NaN
2011 100
Because of the condition, the value in e.g. 2013 becomes 100: 105 is not smaller than 90 and not greater than 110. The result would look like this:
Date Number
2019 150
2018 115
2017 115
2016 115
2015 115
2014 100
2013 100
2012 100
2011 100
You can reverse your column, apply a function that updates values as it goes, then reverse the result back to the original order:
def get_val(x):
    global prev_num
    if x and x > prev_num * 1.1:
        prev_num = x
    return prev_num

prev_num = 0
df['number'] = df['number'][::-1].apply(get_val)[::-1]
Just groupby on groups that start wherever the floor division by 10 changes (on the backfilled, reversed column), then transform the min, i.e.
df['x'] = df.groupby((df['number'].bfill()[::-1]//10).diff().ne(0).cumsum())['number'].transform(min)
date number x
0 2019 150.0 150.0
1 2018 NaN 115.0
2 2017 118.0 115.0
3 2016 NaN 115.0
4 2015 115.0 115.0
5 2014 107.0 100.0
6 2013 105.0 100.0
7 2012 NaN 100.0
8 2011 100.0 100.0
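For readability, the same one-liner unpacked step by step (the intermediate names are just for illustration):
filled = df['number'].bfill()         # fill each NaN from the next (older) row
tens = filled[::-1] // 10             # reverse to ascending years, bucket by tens: 100 -> 10.0, 115 -> 11.0
groups = tens.diff().ne(0).cumsum()   # start a new group whenever the bucket changes
df['x'] = df.groupby(groups)['number'].transform(min)  # groupby aligns the key on the index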
Here is one way. It assumes the earliest value (100) is not NaN and that the original dataframe is ordered descending by year. If performance is an issue, the loop can be converted to a list comprehension.
lst = df.sort_values('date')['number'].ffill().tolist()
for i in range(1, len(lst)):
    if abs(lst[i] - lst[i-1]) / lst[i] <= 0.10:
        lst[i] = lst[i-1]
df['number'] = list(reversed(lst))
# date number
# 0 2019 150.0
# 1 2018 115.0
# 2 2017 115.0
# 3 2016 115.0
# 4 2015 115.0
# 5 2014 100.0
# 6 2013 100.0
# 7 2012 100.0
# 8 2011 100.0

Problem using pivot_table in Python: is there any way to keep the original order of the data and avoid a MultiIndex?

I am trying to reshape my data frame. Below is the original dataframe:
df = pd.DataFrame([['January', 'Monday', 0, 1, 20],
                   ['January', 'Monday', 1, 2, 15],
                   ['January', 'Wednesday', 0, 1, 35],
                   ['March', 'Monday', 0, 1, 23],
                   ['March', 'Monday', 1, 2, 50],
                   ['March', 'Monday', 2, 3, 60],
                   ['April', 'Wednesday', 0, 1, 75]],
                  columns=['Month', 'Day', 'Data1', 'Data2', 'Random'])
Month Day Data1 Data2 Random
0 January Monday 0 1 20
1 January Monday 1 2 15
2 January Wednesday 0 1 35
3 March Monday 0 1 23
4 March Monday 1 2 50
5 March Monday 2 3 60
6 April Wednesday 0 1 75
I am aiming to achieve the below result:
Month Day 0 1 2
0 January Monday 1 2.0 NaN
1 January Monday 1 2.0 NaN
2 January Wednesday 1 NaN NaN
3 March Monday 1 2.0 3.0
I tried to use pivot_table as below, but of course it did not work, since pivot_table does not allow any duplicates in the index, and I would also get a MultiIndex, which causes problems in my later process.
df1 = pd.pivot_table(df, values = 'Data2', index = ['Month','Day'], columns = ['Data1'])
Data1 0 1 2
Month Day
April Wednesday 1.0 NaN NaN
January Monday 1.0 2.0 NaN
Wednesday 1.0 NaN NaN
March Monday 1.0 2.0 3.0
Is there any other way to get my desired result? Many thanks in advance.
You can try using groupby with unstack:
df.groupby(['Month','Day','Data2'])['Data2'].first().unstack().reset_index()
Output:
Data2 Month Day 1 2 3
0 April Wednesday 1.0 NaN NaN
1 January Monday 1.0 2.0 NaN
2 January Wednesday 1.0 NaN NaN
3 March Monday 1.0 2.0 3.0
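If keeping the original order of the data matters (as the title asks), note that groupby sorts the group keys by default; passing sort=False keeps each (Month, Day) pair in order of first appearance instead. A minimal sketch:
df.groupby(['Month', 'Day', 'Data2'], sort=False)['Data2'].first().unstack().reset_index()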
