One additional note to frame the problem better: in the actual data set there is also a column called store, and the table can be grouped by store, date & product. When I tried the pivot solution and the cartesian product solution, they did not work. Is there a solution that could work for 3 grouping columns? Also, the table has millions of rows.
Assuming a data frame with the following format:
d = {'product': ['a', 'b', 'c', 'a', 'b'],
     'amount': [1, 2, 3, 5, 2],
     'date': ['2020-6-6', '2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7']}
df = pd.DataFrame(data=d)
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
Product c is no longer present on 2020-6-7. I want to be able to calculate things like percent change or the difference in the amount of each product.
For example: df['diff'] = df.groupby('product')['amount'].diff()
But in order for this to work and show, for example, that the difference for c is -3 (i.e. -100%), c would need to be present on the next date with its amount set to 0.
This is the result I am looking for:
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
5 c 0 2020-6-7
Please note this is just a snippet of the data frame; in reality there may be many date periods. I am only looking to fill in the product and amount on the first date after it has been removed, not on all dates thereafter.
What is the best way to go about this?
Let us try pivot then unstack (keyword arguments are required for pivot in pandas 2.0+):
out = df.pivot(index='product', columns='date', values='amount').fillna(0).unstack().reset_index(name='amount')
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
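With the zero row in place, the diff and percent change asked for in the question work directly; a quick sketch on the completed frame (hard-coding the values produced above for self-containedness):

```python
import pandas as pd

# Completed frame, as produced by the pivot/unstack step above
out = pd.DataFrame({
    'date': ['2020-6-6'] * 3 + ['2020-6-7'] * 3,
    'product': ['a', 'b', 'c', 'a', 'b', 'c'],
    'amount': [1.0, 2.0, 3.0, 5.0, 2.0, 0.0],
})

# Per-product difference and percent change between consecutive dates
out['diff'] = out.groupby('product')['amount'].diff()
out['pct_change'] = out.groupby('product')['amount'].pct_change() * 100
```

Product c now shows a diff of -3.0 and a pct_change of -100.0 on 2020-6-7.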
You could use the complete function from pyjanitor to explicitly expose the missing values and combine with fillna to fill the missing values with 0:
# pip install pyjanitor
import janitor
df.complete('date', 'product').fillna(0)
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
Another way is to create a cartesian product of your products & dates, then outer-join that to your main dataframe to fill in the missing values.
# ensure you have a proper datetime object:
# df['date'] = pd.to_datetime(df['date'])
s = pd.merge(df[['product']].drop_duplicates().assign(ky=-1),
             df[['date']].drop_duplicates().assign(ky=-1),
             on=['ky']).drop(columns='ky')
df1 = pd.merge(df, s, on=['product', 'date'], how='outer').fillna(0)
print(df1)
product amount date
0 a 1.0 2020-06-06
1 b 2.0 2020-06-06
2 c 3.0 2020-06-06
3 a 5.0 2020-06-07
4 b 2.0 2020-06-07
5 c 0.0 2020-06-07
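For the three-column case raised in the comment (store, date & product), the same idea extends to a reindex over a full MultiIndex. A sketch, assuming each (store, date, product) combination appears at most once (the store names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with a third grouping column, store
df = pd.DataFrame({
    'store':   ['s1', 's1', 's1', 's1', 's2', 's2'],
    'product': ['a', 'b', 'c', 'a', 'a', 'b'],
    'amount':  [1, 2, 3, 5, 4, 6],
    'date':    ['2020-6-6', '2020-6-6', '2020-6-6',
                '2020-6-7', '2020-6-6', '2020-6-7'],
})

# Cartesian product of all unique stores, dates and products
full = pd.MultiIndex.from_product(
    [df['store'].unique(), df['date'].unique(), df['product'].unique()],
    names=['store', 'date', 'product'])

# Reindex onto the full grid, filling missing combinations with 0
out = (df.set_index(['store', 'date', 'product'])
         .reindex(full, fill_value=0)
         .reset_index())
```

Note that this materializes every combination (2 stores x 2 dates x 3 products = 12 rows here), including products a store never carried; with millions of rows, restrict the grid first if that blow-up is a concern.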
I have the following pandas dataframe and would like to build a new column 'c', which is the sum of column 'b' and the previous value of column 'a'. By shifting column 'a' it is possible to do so. However, I would like to know how I can access the previous values of column 'a' inside the apply() function.
l1 = [1,2,3,4,5]
l2 = [3,2,5,4,6]
df = pd.DataFrame(data=l1, columns=['a'])
df['b'] = l2
df['shifted'] = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['shifted']+ row['b'], axis=1)
print(df)
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
I appreciate your help.
Edit: this is a dummy example. I need to use the apply function because I'm passing another function to it which uses previous rows of some columns and checks some condition.
First let's make it clear that you do not need apply for this simple operation, so I'll consider it as a dummy example of a complex function.
Assuming non-duplicate indices, you can generate a shifted Series and reference it in apply using the name attribute:
s = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['b'] + s[row.name], axis=1)
output:
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
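As noted above, no apply is needed for this particular computation at all; a vectorized sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [3, 2, 5, 4, 6]})

# Shift once, then add column-wise -- no row-wise apply required
df['c'] = df['a'].shift(1) + df['b']
```

This produces the same 'c' column (NaN, 3.0, 7.0, 7.0, 10.0) in a single vectorized operation.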
My df looks like this: (There are dozens of other columns in the df but these are the three I am focused on)
Param Value Limit
A 1.50 1
B 2.50 1
C 2.00 2
D 2.00 2.5
E 1.50 2
I am trying to use pandas to calculate how many [Value] entries are less than [Limit] per [Param], hoping to get a list like this:
Param Count
A 1
B 1
C 1
D 0
E 0
I've tried with a few methods, the first being
value_count = df.loc[df['Value'] < df['Limit']].count()
but this just gives the full count per column in the df.
I've also tried the groupby function, which I think could be the correct idea, by creating a subset of the df with the chosen columns:
df_below_limit = df[df['Value'] < df['Limit']]
df_below_limit.groupby('Param')['Value'].count()
This is nearly what I want, but it drops the Params whose count is zero, which I also need. Not sure how to go about getting the list as I need it.
Assuming you want the count per Param, you can use:
out = df['Value'].ge(df['Limit']).groupby(df['Param']).sum()
output:
Param
A 1
B 2
C 1
D 0
E 0
dtype: int64
used input (with a duplicated row "B" for the example):
Param Value Limit
0 A 1.5 1.0
1 B 2.5 1.0
2 B 2.5 1.0
3 C 2.0 2.0
4 D 2.0 2.5
5 E 1.5 2.0
As a DataFrame:
df['Value'].ge(df['Limit']).groupby(df['Param']).sum().reset_index(name='Count')
# or
df['Value'].ge(df['Limit']).groupby(df['Param']).agg(Count='sum').reset_index()
output:
Param Count
0 A 1
1 B 2
2 C 1
3 D 0
4 E 0
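The filter-then-count attempt from the question can also be salvaged: after filtering, the groups with zero matches disappear, but a reindex over all Params restores them. A sketch, using the >= comparison that matches the desired output:

```python
import pandas as pd

df = pd.DataFrame({'Param': list('ABCDE'),
                   'Value': [1.5, 2.5, 2.0, 2.0, 1.5],
                   'Limit': [1.0, 1.0, 2.0, 2.5, 2.0]})

# Filter first, count per Param, then bring back the zero-count Params
counts = (df[df['Value'].ge(df['Limit'])]
            .groupby('Param')['Value'].count()
            .reindex(df['Param'].unique(), fill_value=0)
            .rename('Count'))
```

This yields Count = 1, 1, 1, 0, 0 for A through E, matching the desired list including the zero rows.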
The actual dataframe consists of more than a million rows.
Say for example a dataframe is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
4 B 8 Z10
I want to obtain ratio of A/B for each UniqueID and put it in a new dataframe. For example, for UniqueID 1, its ratio of A/B = 5/6.
What is the most efficient way to do this in Python?
Want:
UniqueID RatioAB
1 5/6
2 10/11
3 Inf
4 0
Thank you.
One approach is using pivot_table, aggregating with sum in case there are multiple occurrences of the same letter (otherwise a simple pivot will do), then evaluating A/B on the result:
df.pivot_table(index='UniqueID', columns='Code', values='Value', aggfunc='sum').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
If there is maximum one occurrence of each letter per group:
df.pivot(index='UniqueID', columns='Code', values='Value').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
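If you want the literal Inf and 0 from the desired output rather than NaN for the one-sided groups, fill the missing pivot cells with 0 before dividing; on float columns, division by zero yields inf rather than raising. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'UniqueID': [1, 1, 1, 2, 2, 2, 3, 4],
                   'Code': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
                   'Value': [5, 6, 7, 10, 11, 12, 10, 8]})

# Missing cells become 0, so A/B gives inf when B is absent and 0 when A is
ratio = (df.pivot_table(index='UniqueID', columns='Code',
                        values='Value', aggfunc='sum')
           .fillna(0)
           .eval('A / B'))
```

UniqueID 3 (no B row) comes out as inf and UniqueID 4 (no A row) as 0.0, as in the desired output.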
If you only care about A/B ratio:
df1 = df[df['Code'].isin(['A','B'])][['UniqueID', 'Code', 'Value']]
df1 = df1.pivot(index='UniqueID',
columns='Code',
values='Value')
df1['RatioAB'] = df1['A']/df1['B']
The most apparent way is via groupby. Note that a bare .iloc[0] lookup would raise an IndexError for IDs where code A or B is missing (here IDs 3 and 4), so sum each code's values instead; an empty selection sums to 0, so the division yields Inf and 0 for the one-sided groups:
df.groupby('UniqueID').apply(lambda g: g.loc[g['Code'].eq('A'), 'Value'].sum() / g.loc[g['Code'].eq('B'), 'Value'].sum())
I was wondering if there is an efficient way to add rows to a DataFrame that, for example, contain the average or a predefined value in case there are not enough rows for a specific value in another column. I guess this description is not ideal, which is why you will find an example below:
Say we have the Dataframe
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
And we want to have 2 rows for each client A, B, C, D, no matter whether these 2 rows already exist or not. For clients A and B we can just copy the existing rows. For C we want to add a row with Client = C, NumberOfProducts = the average of the existing rows = 9, and an ID that is not of interest (so we could set it to ID = smallest existing one - 1 = 0; any other value, even NaN, would also be possible). For client D there is not a single existing row, so we want to add 2 rows where NumberOfProducts equals the constant 2.5. The output should then look like this:
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
C 9 0
D 2.5 NaN
D 2.5 NaN
What I have done so far is loop through the dataframe and add rows where necessary. Since this is pretty inefficient, any better solution would be highly appreciated.
Use:
clients = ['A','B','C','D']
N = 2
#test only values from list and also filter only 2 rows for each client if necessary
df = df[df['Client'].isin(clients)].groupby('Client').head(N)
#create helper counter and reshape by unstack
df1 = df.set_index(['Client',df.groupby('Client').cumcount()]).unstack()
#set first if only 1 row per client - replace second NumberOfProducts by first
df1[('NumberOfProducts',1)] = df1[('NumberOfProducts',1)].fillna(df1[('NumberOfProducts',0)])
# ... replace second ID by first subtracted by 1
df1[('ID',1)] = df1[('ID',1)].fillna(df1[('ID',0)] - 1)
#add missing clients by reindex
df1 = df1.reindex(clients)
#replace NumberOfProducts by constant 2.5
df1['NumberOfProducts'] = df1['NumberOfProducts'].fillna(2.5)
print (df1)
NumberOfProducts ID
0 1 0 1
Client
A 1.0 5.0 2.0 1.0
B 1.0 6.0 2.0 1.0
C 9.0 9.0 1.0 0.0
D 2.5 2.5 NaN NaN
#last reshape to original
df2 = df1.stack().reset_index(level=1, drop=True).reset_index()
print (df2)
Client NumberOfProducts ID
0 A 1.0 2.0
1 A 5.0 1.0
2 B 1.0 2.0
3 B 6.0 1.0
4 C 9.0 1.0
5 C 9.0 0.0
6 D 2.5 NaN
7 D 2.5 NaN
I have a dataframe like this:
person action_type time
A 4 2014-11-10
A 4 2014-11-15
A 3 2014-11-16
A 1 2014-11-18
A 4 2014-11-19
B 4 2014-11-13
B 2 2014-11-15
B 4 2014-11-19
So I want to add a new column named 'action_4' which represents, for each person, the count of rows with action_type 4 over the past 7 days (not including the current row).
The result like this:
person action_type time action_4
A 4 2014-11-10 0
A 4 2014-11-15 1
A 3 2014-11-16 2
A 1 2014-11-18 1
A 4 2014-11-19 1
B 4 2014-11-13 0
B 2 2014-11-15 1
B 4 2014-11-19 1
As the shape of my dataframe is 21649900 x 3, please avoid for loops.
Here is my approach.
I think checking an interval based on time (e.g. 7 days) is always very expensive, so it is better to rely on the number of observations. (Actually, recent pandas versions introduced a "time-aware" rolling, but I have no experience with it...)
So my approach is, for each person, to force daily frequency and then simply count the number of action_4 occurrences in the last 7 days, excluding today. I have added comments to the code that should make it clear, but feel free to ask for more explanation.
import pandas as pd
from io import StringIO
inp_str = u"""
person action_type time
A 4 2014-11-10
A 4 2014-11-15
A 3 2014-11-16
A 1 2014-11-18
A 4 2014-11-19
B 4 2014-11-13
B 2 2014-11-15
B 4 2014-11-19
"""
or_df = pd.read_csv(StringIO(inp_str), sep = " ").set_index('time')
or_df.index = pd.to_datetime(or_df.index)
# Find first and last date for each person
min_dates = or_df.groupby('person').apply(lambda x: x.index[0])
max_dates = or_df.groupby('person').apply(lambda x: x.index[-1])
# Resample each person to daily frequency so that 1 obs = 1 day
transf_df = pd.concat([el.reindex(pd.date_range(min_dates[pp], max_dates[pp], freq = 'D')) for pp, el in or_df.groupby('person')])
# Forward fill person
transf_df.loc[:, 'person'] = transf_df['person'].ffill()
# Set a null value for action_type (possibly integer so you preserve the column type)
transf_df = transf_df.fillna({'action_type' : -1})
# For each person count the number of action 4, excluding today
result = transf_df.groupby('person').transform(lambda x: x.rolling(7, 1).apply(lambda y: len(y[y==4])).shift(1).fillna(0))
result.columns = ['action_4']
# Bring back to original index
pd.concat([transf_df, result], axis = 1).set_index('person', append = True).loc[or_df.set_index('person', append = True).index, :]
This gives the expected output:
action_type action_4
time person
2014-11-10 A 4.0 0.0
2014-11-15 A 4.0 1.0
2014-11-16 A 3.0 2.0
2014-11-18 A 1.0 1.0
2014-11-19 A 4.0 1.0
2014-11-13 B 4.0 0.0
2014-11-15 B 2.0 1.0
2014-11-19 B 4.0 1.0
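The "time-aware" rolling mentioned above can in fact do this directly in recent pandas versions: a time-offset window with closed='left' covers [t - 7 days, t), i.e. the past 7 days excluding the current row, with no need to resample to daily frequency. A sketch, assuming the rows are sorted by person and time:

```python
import pandas as pd

df = pd.DataFrame({
    'person': list('AAAAABBB'),
    'action_type': [4, 4, 3, 1, 4, 4, 2, 4],
    'time': pd.to_datetime(['2014-11-10', '2014-11-15', '2014-11-16',
                            '2014-11-18', '2014-11-19',
                            '2014-11-13', '2014-11-15', '2014-11-19']),
}).sort_values(['person', 'time'])

# Indicator for action_type == 4, summed per person over the window
# [t - 7 days, t): closed='left' excludes the current row itself
df['action_4'] = (
    df.assign(is_4=df['action_type'].eq(4).astype(float))
      .set_index('time')
      .groupby('person')['is_4']
      .rolling('7D', closed='left')
      .sum()
      .fillna(0)            # each person's first row has an empty window
      .to_numpy()
)
```

Because the frame is already sorted person-major and groupby preserves within-group order, the rolled values can be assigned back positionally; this reproduces the expected action_4 column (0, 1, 2, 1, 1 for A and 0, 1, 1 for B) without any Python-level loop.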
I don't take your action_type column into account (the code counts all rows in the window), but this can help you find the correct answer:
import datetime

df2 = pd.DataFrame(None, columns=["person", "action_type", "time", "action_4"])
df["action_4"] = 0
for index, table in df.groupby('person'):
    table["action_4"] = table.time.apply(
        lambda x: table[(table.time > (x - datetime.timedelta(days=7)))
                        & (table.time < x)].shape[0])
    df2 = pd.concat([df2, table])