Here is a reproducible example:
import pandas as pd
df = pd.DataFrame([['Type A', 'Event1', 1, 2, 3], ['Type A', 'Event1', 4, 5, 6], ['Type A', 'Event1', 7, 8, 9],
                   ['Type A', 'Event2', 10, 11, 12], ['Type A', 'Event2', 13, 14, 15], ['Type A', 'Event2', 16, 17, 18],
                   ['Type B', 'Event1', 19, 20, 21], ['Type B', 'Event1', 22, 23, 24], ['Type B', 'Event1', 25, 26, 27],
                   ['Type B', 'Event2', 28, 29, 30], ['Type B', 'Event2', 31, 32, 33], ['Type B', 'Event2', 34, 35, 36]])
df.columns = ['TypeName', 'EventNumber', 'PricePart1', 'PricePart2', 'PricePart3']
print(df)
Gives:
TypeName EventNumber PricePart1 PricePart2 PricePart3
0 Type A Event1 1 2 3
1 Type A Event1 4 5 6
2 Type A Event1 7 8 9
3 Type A Event2 10 11 12
4 Type A Event2 13 14 15
5 Type A Event2 16 17 18
6 Type B Event1 19 20 21
7 Type B Event1 22 23 24
8 Type B Event1 25 26 27
9 Type B Event2 28 29 30
10 Type B Event2 31 32 33
11 Type B Event2 34 35 36
Here is what I've tried:
df['Average'] = df[['PricePart1', 'PricePart2', 'PricePart3']].mean(axis = 1)
print(df)
TypeName EventNumber PricePart1 PricePart2 PricePart3 Average
0 Type A Event1 1 2 3 2.0
1 Type A Event1 4 5 6 5.0
2 Type A Event1 7 8 9 8.0
3 Type A Event2 10 11 12 11.0
4 Type A Event2 13 14 15 14.0
5 Type A Event2 16 17 18 17.0
6 Type B Event1 19 20 21 20.0
7 Type B Event1 22 23 24 23.0
8 Type B Event1 25 26 27 26.0
9 Type B Event2 28 29 30 29.0
10 Type B Event2 31 32 33 32.0
11 Type B Event2 34 35 36 35.0
Now that I have this new column called Average, I can group by the TypeName and EventNumber columns and find the 25th and 50th percentiles using this piece of code:
print(df.groupby(['TypeName', 'EventNumber'])['Average'].quantile([0.25, 0.50]).reset_index())
What I have:
TypeName EventNumber level_2 Average
0 Type A Event1 0.25 3.5
1 Type A Event1 0.50 5.0
2 Type A Event2 0.25 12.5
3 Type A Event2 0.50 14.0
4 Type B Event1 0.25 21.5
5 Type B Event1 0.50 23.0
6 Type B Event2 0.25 30.5
7 Type B Event2 0.50 32.0
I want the level_2 values as separate columns, filled with the values from the Average column, like the output DataFrame I've created here:
df1 = pd.DataFrame([['Type A', 'Event1', 3.5, 5], ['Type A', 'Event2', 12.5, 14], ['Type B', 'Event1', 21.5, 23], ['Type B', 'Event2', 30.5, 32]])
df1.columns = ['TypeName', 'EventNumber', '0.25', '0.50']
print(df1)
What I want:
TypeName EventNumber 0.25 0.50
0 Type A Event1 3.5 5
1 Type A Event2 12.5 14
2 Type B Event1 21.5 23
3 Type B Event2 30.5 32
I'm fairly sure this is a duplicate of something, but I've searched Stack Overflow and couldn't find an answer, probably because this question is hard to word.
Use unstack with reset_index:
df = (df.groupby(['TypeName', 'EventNumber'])['Average']
.quantile([0.25, 0.50])
.unstack()
.reset_index())
print (df)
TypeName EventNumber 0.25 0.5
0 Type A Event1 3.5 5.0
1 Type A Event2 12.5 14.0
2 Type B Event1 21.5 23.0
3 Type B Event2 30.5 32.0
Simpler alternative - the new Average column is not necessary; you can compute the mean as a Series and group it directly by the TypeName and EventNumber Series:
s = df[['PricePart1', 'PricePart2', 'PricePart3']].mean(axis = 1)
df = (s.groupby([df['TypeName'], df['EventNumber']])
.quantile([0.25, 0.50])
.unstack()
.reset_index())
print (df)
TypeName EventNumber 0.25 0.5
0 Type A Event1 3.5 5.0
1 Type A Event2 12.5 14.0
2 Type B Event1 21.5 23.0
3 Type B Event2 30.5 32.0
Related
I have a dataframe that looks something like this
data = [['Location 1', 'Oranges', 9, 12, 5, 10, 7, 12], ['Location 1', 'Apples', 2, 6, 4, 3, 7, 2], ['Location 1', 'Total', 11, 18, 9, 13, 14, 14],
['Location 2', 'Oranges', 11, 8, 14, 8, 10, 9], ['Location 2', 'Apples', 5, 4, 6, 2, 9, 9], ['Location 2', 'Total', 16, 12, 20, 10, 19, 18]]
df = pd.DataFrame(data, columns=['Location', 'Fruit', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'])
Location Fruit Jan Feb Mar Apr May Jun
0 Location 1 Oranges 9 12 5 10 7 12
1 Location 1 Apples 2 6 4 3 7 2
2 Location 1 Total 11 18 9 13 14 14
3 Location 2 Oranges 11 8 14 8 10 9
4 Location 2 Apples 5 4 6 2 9 9
5 Location 2 Total 16 12 20 10 19 18
I would like to group by location, get the percent apples (Apples/Total) and transpose the dataframe to ultimately look like this
                       Jan                       Feb                       Mar
Location     # of Apples  % Fruit     # of Apples  % Fruit     # of Apples  % Fruit
Location 1             2    18.2%               6    33.3%               4    44.4%
Location 2             5    31.3%               4    33.3%               6    20.0%
I've tried using this, but it seemed sort of tedious since my complete dataset has more than two locations
df.iloc[3, 2:4] = df.iloc[1, 2:4] / df.iloc[2, 2:4]
Thank you!
Solution
# Set the index to location and fruit
s = df.set_index(['Location', 'Fruit'])
# Select the rows corresponding to Apples and Total
apples, total = s.xs('Apples', level=1), s.xs('Total', level=1)
# Divide apples by total to calculate pct then concat
pd.concat([apples / total * 100, apples], keys=['%_fruit', '#_of_apples']).unstack(0)
Result
Jan Feb Mar Apr May Jun
%_fruit #_of_apples %_fruit #_of_apples %_fruit #_of_apples %_fruit #_of_apples %_fruit #_of_apples %_fruit #_of_apples
Location
Location 1 18.181818 2.0 33.333333 6.0 44.444444 4.0 23.076923 3.0 50.000000 7.0 14.285714 2.0
Location 2 31.250000 5.0 33.333333 4.0 30.000000 6.0 20.000000 2.0 47.368421 9.0 50.000000 9.0
To achieve the desired outcome, you can use the pivot method to reshape the dataframe, then divide the values of the 'Apples' rows by the values of the 'Total' rows. Here's one way to do it:
df_pivot = df.pivot(index='Location', columns='Fruit',
                    values=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'])
df_pivot.columns = df_pivot.columns.map('_'.join)
df_pivot['Jan_% Fruit'] = df_pivot['Jan_Apples'] / df_pivot['Jan_Total']
df_pivot['Feb_% Fruit'] = df_pivot['Feb_Apples'] / df_pivot['Feb_Total']
df_pivot['Mar_% Fruit'] = df_pivot['Mar_Apples'] / df_pivot['Mar_Total']
# Format the percentages while the column labels are still unique
for col in ['Jan_% Fruit', 'Feb_% Fruit', 'Mar_% Fruit']:
    df_pivot[col] = df_pivot[col].apply('{:.1%}'.format)
df_pivot = df_pivot.loc[:, ['Jan_Apples', 'Jan_% Fruit', 'Feb_Apples', 'Feb_% Fruit',
                            'Mar_Apples', 'Mar_% Fruit']]
df_pivot = df_pivot.reset_index()
df_pivot.columns = ['Location', '# of Apples', '% Fruit',
                    '# of Apples', '% Fruit', '# of Apples', '% Fruit']
Here's an example of the input and output using the data you provided:
Input:
Location Fruit Jan Feb Mar Apr May Jun
0 Location 1 Oranges 9 12 5 10 7 12
1 Location 1 Apples 2 6 4 3 7 2
2 Location 1 Total 11 18 9 13 14 14
3 Location 2 Oranges 11 8 14 8 10 9
4 Location 2 Apples 5 4 6 2 9 9
5 Location 2 Total 16 12 20 10 19 18
Output:
Location # of Apples % Fruit # of Apples % Fruit # of Apples % Fruit
0 Location 1 2 18.2% 6 33.3% 4 44.4%
1 Location 2 5 31.3% 4 33.3% 6 30.0%
I only just saw @Shubham's answer, which is very similar to what I came up with. I'll still post this answer as it is slightly different: by setting the index to ['Fruit', 'Location'], you can avoid using xs() and instead use a simple .loc[]. But really this is nitpicking; the two approaches are very similar.
z = df.set_index(['Fruit', 'Location'])
out = pd.concat([
z.loc['Apples'],
100 * z.loc['Apples'] / z.loc['Total']
], axis=1, keys=['# Apples', '% Fruit']).swaplevel(axis=1).reindex(z.columns, axis=1, level=0)
>>> out.round(1)
Jan Feb Mar Apr May Jun
# Apples % Fruit # Apples % Fruit # Apples % Fruit # Apples % Fruit # Apples % Fruit # Apples % Fruit
Location
Location 1 2 18.2 6 33.3 4 44.4 3 23.1 7 50.0 2 14.3
Location 2 5 31.2 4 33.3 6 30.0 2 20.0 9 47.4 9 50.0
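If you also want the percentages rendered exactly like the desired output ('18.2%' and so on), one option is to format those columns afterwards. A small sketch against out (my own addition; the counts stay numeric):
pct_cols = [c for c in out.columns if c[1] == '% Fruit']
formatted = out.copy()
formatted[pct_cols] = formatted[pct_cols].apply(lambda col: col.map('{:.1f}%'.format))
print(formatted)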
This question already has answers here:
Pandas Merging 101
However, I have the following problem:
If a year or date does not exist in df2, a price and a listing_id still get attached during the merge; those values should be NaN instead.
The second problem is that when merging, as soon as several rows share the same day and year, the temperature is also copied onto the second match, for example:
d = {'id': [1], 'day': [1], 'temperature': [20], 'year': [2001]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
d2 = {'id': [122, 244], 'day': [1, 1],
'listing_id': [2, 4], 'price': [20, 440], 'year': [2001, 2001]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 1 4 440 2001
df3 = pd.merge(df,df2[['day', 'listing_id', 'price']],
left_on='day', right_on = 'day',how='left')
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 1 1 20 2001 4 440 # <-- The second temperature is wrong :/
This should not happen: if I later also have a row for day 1 from year 2002 with a temperature of 30 and I want to calculate the average, I end up with (20 + 20 + 30) / 3 = 23.3, whereas it should be (20 + 30) / 2 = 25. Therefore, if a value has already been used once, the duplicated row should contain NaN instead.
Code Snippet
d = {'id': [1, 2, 3, 4, 5], 'day': [1, 2, 3, 4, 2],
'temperature': [20, 40, 50, 60, 20], 'year': [2001, 2002, 2004, 2005, 1999]}
df = pd.DataFrame(data=d)
print(df)
id day temperature year
0 1 1 20 2001
1 2 2 40 2002
2 3 3 50 2004
3 4 4 60 2005
4 5 2 20 1999
d2 = {'id': [122, 244, 387, 4454, 521], 'day': [1, 2, 3, 4, 2],
'listing_id': [2, 4, 5, 6, 7], 'price': [20, 440, 500, 6600, 500],
'year': [2001, 2002, 2004, 2005, 2005]}
df2 = pd.DataFrame(data=d2)
print(df2)
id day listing_id price year
0 122 1 2 20 2001
1 244 2 4 440 2002
2 387 3 5 500 2004
3 4454 4 6 6600 2005
4 521 2 7 500 2005
df3 = pd.merge(df,df2[['day','listing_id', 'price']],
left_on='day', right_on = 'day',how='left').drop('day',axis=1)
print(df3)
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 40 2002 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 4 440
6 5 2 20 1999 7 500
What I want
id day temperature year listing_id price
0 1 1 20 2001 2 20
1 2 2 40 2002 4 440
2 2 2 NaN 2005 7 500
3 3 3 50 2004 5 500
4 4 4 60 2005 6 6600
5 5 2 20 1999 NaN NaN
If I understand correctly, merge on both day and year with an outer join:
>>> df.merge(df2[['day', 'listing_id', 'price', 'year']],
             on=['day', 'year'], how='outer')
id day temperature year listing_id price
0 1.0 1 20.0 2001 2.0 20.0
1 2.0 2 40.0 2002 4.0 440.0
2 3.0 3 50.0 2004 5.0 500.0
3 4.0 4 60.0 2005 6.0 6600.0
4 5.0 2 20.0 1999 NaN NaN
5 NaN 2 NaN 2005 7.0 500.0
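A small follow-up on the averaging concern from the question: because the unmatched rows carry NaN in the temperature column, pandas skips them when computing a mean, so each measured temperature is only counted once. A quick sketch using the frames above (my own illustration, not part of the original answer):
df3 = df.merge(df2[['day', 'listing_id', 'price', 'year']],
               on=['day', 'year'], how='outer')
# NaN temperatures are ignored, e.g. day 2 averages (40 + 20) / 2 = 30
print(df3.groupby('day')['temperature'].mean())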
import pandas as pd
import numpy as np
# setting up the dataframe
data = [
['day 1','day 2', 2, 50],
['day 2','day 4', 2, 60],
['day 3','day 3', 1, 45],
['day 4','day 7', 2, 45],
['day 5','day 10', 3, 90],
['day 6','day 7', 3, 10],
['day 7','day 8', 2, 10]
]
columns = ['invoicedate', 'paymentdate', 'clientid', 'amounts']
df = pd.DataFrame(data=data, columns=columns)
I have the above dataframe, and I want to check whether the last invoice of a certain client ('clientid') was paid ('paymentdate') before a new invoice was issued ('invoicedate').
Does anyone have a good (pandas?) solution for this? I tried something with the .rolling() function.
This is less a technology question and more a business-modelling one: really, you are looking at accounting concepts.
One approach is to treat the data as payables and receivables; once it is modelled with these basic accounting concepts, you can run whatever rolling functions you want on top.
# setting up the dataframe
data = [
['day 1','day 2', 2, 50],
['day 2','day 4', 2, 60],
['day 3','day 3', 1, 45],
['day 4','day 7', 2, 45],
['day 5','day 10', 3, 90],
['day 6','day 7', 3, 10],
['day 7','day 8', 2, 10]
]
columns = ['invoicedate', 'paymentdate', 'clientid', 'amounts']
df = pd.DataFrame(data=data, columns=columns)
# make abstract dates actual dates
cols = [c for c in df.columns if "date" in c]
df.loc[:,cols] = df.loc[:,cols].applymap(lambda d: pd.to_datetime(f'{d.split(" ")[1]}-jan-2021'))
# give it an invoice id...
df = df.reset_index().rename(columns={"index":"invoiceid"})
# make a payables / receivables structure
dfpr = pd.concat([df.loc[:,[c for c in df.columns if c!="paymentdate"]].rename(columns={"invoicedate":"date"}).assign(type="pay"),
df.loc[:,[c for c in df.columns if c!="invoicedate"]].rename(columns={"paymentdate":"date"}).assign(
type="rec",amounts=lambda dfa: dfa.amounts*-1),
]).sort_values(["clientid","date"]).reset_index(drop=True)
# analysis - what does the client owe at each point in time
dfpr.assign(rolling=dfpr.groupby("clientid")["amounts"].cumsum())
    invoiceid                date  clientid  amounts type  rolling
0           2 2021-01-03 00:00:00         1       45  pay       45
1           2 2021-01-03 00:00:00         1      -45  rec        0
2           0 2021-01-01 00:00:00         2       50  pay       50
3           1 2021-01-02 00:00:00         2       60  pay      110
4           0 2021-01-02 00:00:00         2      -50  rec       60
5           3 2021-01-04 00:00:00         2       45  pay      105
6           1 2021-01-04 00:00:00         2      -60  rec       45
7           6 2021-01-07 00:00:00         2       10  pay       55
8           3 2021-01-07 00:00:00         2      -45  rec       10
9           6 2021-01-08 00:00:00         2      -10  rec        0
10          4 2021-01-05 00:00:00         3       90  pay       90
11          5 2021-01-06 00:00:00         3       10  pay      100
12          5 2021-01-07 00:00:00         3      -10  rec       90
13          4 2021-01-10 00:00:00         3      -90  rec        0
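If you just need the yes/no check asked for in the question, a more direct sketch (my own, assuming the date columns have already been converted to real dates as above; the new column name is just for illustration) compares each invoice date with the same client's previous payment date:
# Sort chronologically per client, then look at the previous invoice's payment date
df = df.sort_values(['clientid', 'invoicedate'])
prev_paid = df.groupby('clientid')['paymentdate'].shift()
# True if the previous invoice was settled on or before the day the new one was issued;
# the first invoice of each client has no predecessor, so the comparison yields False
df['prev_invoice_paid_first'] = prev_paid <= df['invoicedate']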
In pandas I managed to do the following transformation, which basically splits the first non-null value evenly across the null values that follow it.
[100, None, None, 40, None, 120]
into
[33.33, 33.33, 33.33, 20, 20, 120]
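For reference, a minimal sketch of that split on a plain Series (my own illustration, not the code from the linked solution):
import pandas as pd

s = pd.Series([100, None, None, 40, None, 120])
# Each non-null value starts a new block; measure how long each block is
block = s.notnull().cumsum()
block_size = s.groupby(block).transform('size')
# Spread the forward-filled value evenly over its block
print(s.ffill() / block_size)   # -> 33.33..., 33.33..., 33.33..., 20.0, 20.0, 120.0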
Thanks to the solution given here, I managed to produce the following code for my specific task:
cols = ['CUSTOMER', 'WEEK', 'PRODUCT_ID']
colsToSplit = ['VOLUME', 'REVENUE']
df = pd.concat([
d.asfreq('W')
for _, d in df.set_index('WEEK').groupby(['CUSTOMER', 'PRODUCT_ID'])
]).reset_index()
df[cols] = df[cols].ffill()
df['nb_nan'] = df.groupby(['CUSTOMER', 'PRODUCT_ID', df['VOLUME'].notnull().cumsum()])['VOLUME'].transform('size')
df[colsToSplit] = df.groupby(['CUSTOMER', 'PRODUCT_ID'])[colsToSplit].ffill()[colsToSplit].div(df.nb_nan, axis=0)
df
My full dataframe looks like this :
df = pd.DataFrame(map(list, zip(*[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
['2018-01-14', '2018-01-28', '2018-01-14', '2018-01-28', '2018-01-14', '2018-02-04', '2018-02-11', '2018-01-28', '2018-02-11'],
[1, 1, 2, 2, 1, 1, 1, 3, 3],
[50, 44, 22, 34, 42, 41, 43, 12, 13],
[15, 14, 6, 11, 14, 13.5, 13.75, 3, 3.5]])), columns =['CUSTOMER', 'WEEK', 'PRODUCT_ID', 'VOLUME', 'REVENUE'])
df
Out[16]:
CUSTOMER WEEK PRODUCT_ID VOLUME REVENUE
0 a 2018-01-14 1 50 15.00
1 a 2018-01-28 1 44 14.00
2 a 2018-01-14 2 22 6.00
3 a 2018-01-28 2 34 11.00
4 b 2018-01-14 1 42 14.00
5 b 2018-02-04 1 41 13.50
6 b 2018-02-11 1 43 13.75
7 c 2018-01-28 3 12 3.00
8 c 2018-02-11 3 13 3.50
In this case for example, the result would be :
CUSTOMER WEEK PRODUCT_ID VOLUME REVENUE
a 2018-01-14 1 25 7.50
a 2018-01-21 1 25 7.50
a 2018-01-28 1 44 14.00
a 2018-01-14 2 11 3.00
a 2018-01-21 2 11 3.00
a 2018-01-28 2 34 11.00
b 2018-01-14 1 14 4.67
b 2018-01-21 1 14 4.67
b 2018-01-28 1 14 4.67
b 2018-02-04 1 41 13.50
b 2018-02-11 1 43 13.75
c 2018-01-28 3 6 1.50
c 2018-02-04 3 6 1.50
c 2018-02-11 3 13 3.50
Sadly, my dataframe is far too big for further use and for joins with other datasets, so I would like to try this in Spark. I have gone through many tutorials that cover most of these steps in PySpark, but none of them really showed how to include the groupby part: I found how to do a transform('size'), but not how to do df.groupby(...).transform('size'), nor how to combine all my steps.
Is there a tool that can translate pandas code to PySpark? Otherwise, could I get a hint on how to translate this piece of code? Thanks, maybe I'm just overcomplicating this.
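The pandas API on Spark (pyspark.pandas) may cover some of this code unchanged, but for the groupby(...).transform('size') part specifically, here is a hedged sketch in plain PySpark (untested; it assumes the weekly rows have already been created in the Spark frame and that WEEK sorts chronologically), using window functions in place of the grouped transform:
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)   # the already weekly-expanded frame

# Running count of non-null VOLUME values per CUSTOMER/PRODUCT_ID,
# i.e. the equivalent of df['VOLUME'].notnull().cumsum() within each group
running = (Window.partitionBy('CUSTOMER', 'PRODUCT_ID')
                 .orderBy('WEEK')
                 .rowsBetween(Window.unboundedPreceding, Window.currentRow))
sdf = sdf.withColumn('block', F.count('VOLUME').over(running))

# groupby(...).transform('size') becomes a count over a window partitioned
# by the group keys plus the block id
block_win = Window.partitionBy('CUSTOMER', 'PRODUCT_ID', 'block')
sdf = sdf.withColumn('nb_nan', F.count(F.lit(1)).over(block_win))

# Forward fill and split, mirroring the ffill().div(nb_nan) step
sdf = sdf.withColumn('VOLUME_split',
                     F.last('VOLUME', ignorenulls=True).over(running) / F.col('nb_nan'))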
I have a DataFrame that looks something like this:
A B C D
1 10 22 14
1 12 20 37
1 11 8 18
1 10 10 6
2 11 13 4
2 12 10 12
3 14 0 5
and a function that looks something like this (NOTE: it's actually doing something more complex that can't be easily separated into three independent calls, but I'm simplifying for clarity):
def myfunc(g):
    return g.min(), g.mean(), g.max()
I want to use groupby on A with myfunc to get an output on columns B and C (ignoring D) something like this:
B C
min mean max min mean max
A
1 10 10.75 12 8 15.0 22
2 11 11.50 12 10 11.5 13
3 14 14.00 14 0 0.0 0
I can do the following:
df2.groupby('A')[['B','C']].agg(
{
'min': lambda g: myfunc(g)[0],
'mean': lambda g: myfunc(g)[1],
'max': lambda g: myfunc(g)[2]
})
But then—aside from this being ugly and calling myfunc multiple times—I end up with
max mean min
B C B C B C
A
1 12 22 10.75 15.0 10 8
2 12 13 11.50 11.5 11 10
3 14 0 14.00 0.0 14 0
I can use .swaplevel(axis=1) to swap the column levels, but even then B and C are in multiple duplicated columns, and with the multiple function calls it feels like barking up the wrong tree.
If you arrange for myfunc to return a DataFrame whose columns are the group's columns (here 'B' and 'C') and whose row index is ['min', 'mean', 'max'], then you can use groupby/apply to call the function (once for each group) and concatenate the results as desired:
import numpy as np
import pandas as pd
def myfunc(g):
result = pd.DataFrame({'min':np.min(g),
'mean':np.mean(g),
'max':np.max(g)}).T
return result
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3],
'B': [10, 12, 11, 10, 11, 12, 14],
'C': [22, 20, 8, 10, 13, 10, 0],
'D': [14, 37, 18, 6, 4, 12, 5]})
result = df.groupby('A')[['B','C']].apply(myfunc)
result = result.unstack(level=-1)
print(result)
prints
B C
max mean min max mean min
A
1 12.0 10.75 10.0 22.0 15.0 8.0
2 12.0 11.50 11.0 13.0 11.5 10.0
3 14.0 14.00 14.0 0.0 0.0 0.0
For others who may run across this and who do not need a custom function, note that it is worth using the builtin aggregators (below, specified by the strings 'min', 'mean' and 'max') whenever possible: they perform better than custom Python functions. Happily, in this toy problem, they produce the desired result:
In [99]: df.groupby('A')[['B','C']].agg(['min','mean','max'])
Out[99]:
B C
min mean max min mean max
A
1 10 10.75 12 8 15.0 22
2 11 11.50 12 10 11.5 13
3 14 14.00 14 0 0.0 0
Something like this might work:
aggregated = df2.groupby('A')[['B', 'C']].agg(['min', 'mean', 'max'])
Then you can use swaplevel to get the column order swapped around:
aggregated.columns = aggregated.columns.swaplevel(0, 1)
aggregated = aggregated.sort_index(axis=1, level=0)