Handling complex data transformations with pivot_table in Python - python

I am working with a pandas DataFrame of shape 7837 rows by 19 columns. I want the number of times each product_id appears per month (the Date column), together with the associated amount, because a product_id can have various amounts. So I am looking for a way to say, for example, that product_id 1921 with amount 59 appeared ....
Here is the small version of the pandas dataframe
print(df)
CompanyName Produktname product_id amount Date
0 companyA productA 1921 59.0 Jan-2020
1 companyB productB 114 NaN May-2020
2 companyC productC 469 NaN Feb-2020
3 companyD productD 569 18.0 Jun-2020
4 companyE productE 569 18.0 March-2020
I think pivot_table might be helpful. I wanted to first see how many times each product_id appeared with the date as the column
pd.pivot_table(df, index="product_id", values="product_id", columns="Date", aggfunc="count")
but I get an error:
ValueError: Grouper for 'product_id' not 1-dimensional
Is there a way around this or a more efficient way to handle this?

IIUC, the problem is that product_id is used as both the index and the values, so count a different column instead:
df = df.pivot_table(index="product_id", values="amount", columns="Date", aggfunc="count")
print (df)
Date Feb-2020 Jan-2020 Jun-2020 March-2020 May-2020
product_id
114 NaN NaN NaN NaN 0.0
469 0.0 NaN NaN NaN NaN
569 NaN NaN 1.0 1.0 NaN
1921 NaN 1.0 NaN NaN NaN
Note that counting the amount column counts non-missing values only, which is why product 114 shows 0.0 for May-2020. For the correct chronological column order, convert Date to datetime first:
# parse without an explicit format, because the sample mixes
# abbreviated ("Jan") and full ("March") month names
df['Date'] = pd.to_datetime(df['Date'])
df = df.pivot_table(index="product_id",
                    values="amount",
                    columns="Date",
                    aggfunc="count",
                    fill_value=0).rename(columns=lambda x: x.strftime('%b-%Y'))
print (df)
Date Jan-2020 Feb-2020 Mar-2020 May-2020 Jun-2020
product_id
114 0 0 0 0 0
469 0 0 0 0 0
569 0 0 1 0 1
1921 1 0 0 0 0
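If you only need row counts per (product_id, Date) pair, rather than counts of non-missing amounts, pd.crosstab is an equivalent one-liner; a sketch applied to the original frame before pivoting:
# counts every row, including those where amount is NaN,
# so product 114 would show 1 here instead of 0
counts = pd.crosstab(df['product_id'], df['Date'])
print(counts)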

Related

Generate date ranges using group by + apply in pandas

I want to imitate Prophet's make_future_dataframe() functionality for multiple groups in a pandas dataframe.
If I would like to create a date range as a separate column I could do:
import pandas as pd

my_dataframe['prediction_range'] = pd.date_range(start=my_dataframe['date_column'].min(),
                                                 periods=48,
                                                 freq='M')
However my dataframe has the following structure:
id feature1 feature2 date_column
1 0 4.3 2022-01-01
2 0 3.3 2022-01-01
3 0 2.2 2022-01-01
4 1034 1.11 2022-01-01
5 1090 0.98 2022-01-01
6 1078 0 2022-01-01
I wanted to do the following:
def generate_date_range(date_column, data):
    dates = pd.date_range(start=data[date_column].unique()[0],
                          periods=48,
                          freq='M')
    return dates
And then:
my_dataframe = my_dataframe.groupby('id').apply(generate_date_ranges('date_columns', my_dataframe))
But I am getting the following:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda/envs/scoring_env/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1377, in apply
func = com.is_builtin_func(func)
File "/anaconda/envs/scoring_env/lib/python3.9/site-packages/pandas/core/common.py", line 615, in is_builtin_func
return _builtin_table.get(arg, arg)
TypeError: unhashable type: 'DatetimeIndex'
I am not sure if I am approaching the problem in the right way. I have also done this with a MultiIndex:
multi_index = pd.MultiIndex.from_product([pd.Index(file['id'].unique()), dates], names=('customer', 'prediction_date'))
And then reindexing and filling the NaNs, but I am not able to understand why the apply version does not work.
The desired output is:
id feature1 feature2 date_column prediction_date
1 0 4.3 2022-01-01 2022-03-01
1 0 4.3 2022-01-01 2022-04-01
1 0 4.3 2022-01-01 2022-05-01
1 0 4.3 2022-01-01 2022-06-01
--- Up to 48 periods --
2 0 3.3 2022-01-01 2022-03-01
2 0 3.3 2022-01-01 2022-04-01
2 0 3.3 2022-01-01 2022-05-01
2 0 3.3 2022-01-01 2022-06-01
Try a list comprehension over your groupby object: reindex each group to the full date range, then forward-fill the id:
df['date_column'] = pd.to_datetime(df['date_column'])
df = df.set_index('date_column')
new_df = pd.concat([g.reindex(pd.date_range(g.index.min(), periods=48, freq='MS'))
                    for _, g in df.groupby('id')])
new_df['id'] = new_df['id'].ffill().astype(int)
id feature1 feature2
2022-01-01 1 0.0 4.3
2022-02-01 1 NaN NaN
2022-03-01 1 NaN NaN
2022-04-01 1 NaN NaN
2022-05-01 1 NaN NaN
... .. ... ...
2025-08-01 6 NaN NaN
2025-09-01 6 NaN NaN
2025-10-01 6 NaN NaN
2025-11-01 6 NaN NaN
2025-12-01 6 NaN NaN
Update
If there is only one record per ID we can do the following. If there is more than one record per ID, keep only the row with the minimum date for each ID, perform the task below, and merge everything back together.
# make sure your date is datetime
df['date_column'] = pd.to_datetime(df['date_column'])
# use index.repeat for the number of months you want
# in this case we will offset the min date for 48 months
new_df = df.reindex(df.index.repeat(48)).reset_index(drop=True)
# groupby the id, cumcount and set the type so we can offset
new_df['date_column'] = new_df['date_column'].values.astype('datetime64[M]') + \
    new_df.groupby('id')['date_column'].cumcount().values.astype('timedelta64[M]')
id feature1 feature2 date_column
0 1 0 4.3 2022-01-01
1 1 0 4.3 2022-02-01
2 1 0 4.3 2022-03-01
3 1 0 4.3 2022-04-01
4 1 0 4.3 2022-05-01
.. .. ... ... ...
283 6 1078 0.0 2025-08-01
284 6 1078 0.0 2025-09-01
285 6 1078 0.0 2025-10-01
286 6 1078 0.0 2025-11-01
287 6 1078 0.0 2025-12-01
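For the multi-record case mentioned in the update, a minimal sketch (starts and out are hypothetical names, assuming the same column names as above): keep each id's earliest date, expand it into 48 monthly periods, and merge the original rows back:
# one row per id with its earliest date
starts = df.groupby('id', as_index=False)['date_column'].min()
# repeat each start row 48 times, one per future month
starts = starts.loc[starts.index.repeat(48)].reset_index(drop=True)
starts['prediction_date'] = starts['date_column'].values.astype('datetime64[M]') + \
    starts.groupby('id').cumcount().values.astype('timedelta64[M]')
# attach the original feature rows back onto the expanded ranges
out = starts.merge(df, on=['id', 'date_column'], how='left')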

Insert Value to a dataframe column

I have a pandas dataframe
0 1 2 3
0 173.0 147.0 161 162.0
1 NaN NaN 23 NaN
I just want to append a value to a column, such as
3
0 161
1 23
2 181
But I can't go with the approach of loc and iloc positions, because the file can contain columns of any length and I will not know the positions in advance. Hence I just want to append a value to a column. Thanks in advance.
I believe you need setting with enlargement:
df.loc[len(df.index), 2] = 181
print (df)
0 1 2 3
0 173.0 147.0 161.0 162.0
1 NaN NaN 23.0 NaN
2 NaN NaN 181.0 NaN
If that 2x4 dataframe is your original dataframe, you can add an extra row to the dataframe with pandas.concat().
For example:
pandas.concat([your_original_dataframe, pandas.DataFrame([[181]], columns=[2])], axis=0)
This will add 181 at the bottom of column 2
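One caveat: the appended row keeps index 0 from the single-row frame, so the result has a duplicate index label. A sketch using ignore_index to get a clean 0..n index instead:
import pandas as pd

# reset the row labels so the new row gets index 2, not a second 0
result = pd.concat([your_original_dataframe,
                    pd.DataFrame([[181]], columns=[2])],
                   ignore_index=True)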

Give scores to dataframe based on id

I have a dataframe which is indexed by date. I am trying to provide scores for each accountid based on category: if that category value exists on the index date, the dataframe will look like this.
accountid category Smooth Hard Sharp Narrow
timestamp
2018-03-29 101 Smooth 1 NaN NaN NaN
2018-03-29 102 Hard NaN 1 NaN NaN
2018-03-30 103 Narrow NaN NaN NaN 1
2018-04-30 104 Sharp NaN NaN 1 NaN
2018-04-21 105 Narrow NaN NaN NaN 1
What is the best way to loop through the dataframe per accountid and assign scores for each category, unstacked?
Here is the dataframe creation script.
import pandas as pd
import datetime
idx = pd.date_range('02-28-2018', '04-29-2018')
df = pd.DataFrame(
    [['101', '2018-03-29', 'Smooth', 'NaN', 'NaN', 'NaN', 'NaN'],
     ['102', '2018-03-29', 'Hard', 'NaN', 'NaN', 'NaN', 'NaN'],
     ['103', '2018-03-30', 'Narrow', 'NaN', 'NaN', 'NaN', 'NaN'],
     ['104', '2018-04-30', 'Sharp', 'NaN', 'NaN', 'NaN', 'NaN'],
     ['105', '2018-04-21', 'Narrow', 'NaN', 'NaN', 'NaN', 'NaN']],
    columns=['accountid', 'timestamp', 'category', 'Smooth', 'Hard', 'Sharp', 'Narrow'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index(['timestamp'])
print(df)
You can use the str accessor with get_dummies:
df[['accountid','category']].assign(**df['category'].str.get_dummies())
Output:
accountid category Hard Narrow Sharp Smooth
timestamp
2018-03-29 101 Smooth 0 0 0 1
2018-03-29 102 Hard 1 0 0 0
2018-03-30 103 Narrow 0 1 0 0
2018-04-30 104 Sharp 0 0 1 0
2018-04-21 105 Narrow 0 1 0 0
And replace 0 with NaN (np.nan needs numpy imported):
import numpy as np

df[['accountid','category']].assign(**df['category'].str.get_dummies())\
    .replace(0, np.nan)
Output:
accountid category Hard Narrow Sharp Smooth
timestamp
2018-03-29 101 Smooth NaN NaN NaN 1.0
2018-03-29 102 Hard 1.0 NaN NaN NaN
2018-03-30 103 Narrow NaN 1.0 NaN NaN
2018-04-30 104 Sharp NaN NaN 1.0 NaN
2018-04-21 105 Narrow NaN 1.0 NaN NaN
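Equivalently (a sketch), pd.get_dummies on the column builds the same indicator frame, which you can join back on the index; note that depending on pandas version get_dummies may return bool rather than int columns:
# same one-hot expansion via the top-level function
scored = df[['accountid', 'category']].join(pd.get_dummies(df['category']))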

Pivoting DataFrame with multiple columns for the index

I have a dataframe and I want to transpose only a few rows to columns.
This is what I have now.
Entity Name Date Value
0 111 Name1 2018-03-31 100
1 111 Name2 2018-02-28 200
2 222 Name3 2018-02-28 1000
3 333 Name1 2018-01-31 2000
I want to create date as the column and then add value. Something like this:
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
I can have an identical Name for two different Entity values. Here is an updated dataset.
Code:
import pandas as pd
import datetime
data1 = {
    'Entity': [111, 111, 222, 333],
    'Name': ['Name1', 'Name2', 'Name3', 'Name1'],
    'Date': [datetime.date(2018, 3, 31), datetime.date(2018, 2, 28),
             datetime.date(2018, 2, 28), datetime.date(2018, 1, 31)],
    'Value': [100, 200, 1000, 2000]
}
df1 = pd.DataFrame(data1, columns=['Entity', 'Name', 'Date', 'Value'])
How do I achieve this? Any pointers? Thanks all.
Based on your update, you'd need pivot_table with two index columns -
v = df1.pivot_table(
    index=['Entity', 'Name'],
    columns='Date',
    values='Value'
).reset_index()
v.index.name = v.columns.name = None
v
Entity Name 2018-01-31 2018-02-28 2018-03-31
0 111 Name1 NaN NaN 100.0
1 111 Name2 NaN 200.0 NaN
2 222 Name3 NaN 1000.0 NaN
3 333 Name1 2000.0 NaN NaN
Or, using unstack:
df1.set_index(['Entity','Name','Date']).Value.unstack().reset_index()
Date Entity Name 2018-01-31 00:00:00 2018-02-28 00:00:00 \
0 111 Name1 NaN NaN
1 111 Name2 NaN 200.0
2 222 Name3 NaN 1000.0
3 333 Name1 2000.0 NaN
Date 2018-03-31 00:00:00
0 100.0
1 NaN
2 NaN
3 NaN
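The unstack route leaves full timestamps as the column labels. A minimal sketch (out is a hypothetical name) to format the date-like labels while leaving Entity and Name alone:
out = df1.set_index(['Entity', 'Name', 'Date']).Value.unstack().reset_index()
# format only labels that are dates; 'Entity' and 'Name' pass through
out.columns = [c.strftime('%Y-%m-%d') if hasattr(c, 'strftime') else c
               for c in out.columns]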

How to shift a value in a dataframe using pandas?

I have data like this, without z1. What I need is to add a column to the DataFrame: it should contain z1 with values as in the example, i.e. the z value taken from one day earlier for the same start value.
I was thinking it could be done with apply and a lambda in pandas, but I'm not sure how to define the lambda function:
data = pd.read_csv("....")
data["Z"] = data[["Start", "Z"]].apply(lambda x:
You can use DataFrameGroupBy.shift with merge:
#if not datetime
df['date'] = pd.to_datetime(df.date)
df.set_index('date', inplace=True)
df1 = df.groupby('start')['z'].shift(freq='1D', periods=1).reset_index()
print(pd.merge(df.reset_index(), df1, on=['start', 'date'], how='left', suffixes=('', '1')))
date start z z1
0 2012-12-01 324 564545 NaN
1 2012-12-01 384 5555 NaN
2 2012-12-01 349 554 NaN
3 2012-12-02 855 635 NaN
4 2012-12-02 324 56 564545.0
5 2012-12-01 341 98 NaN
6 2012-12-03 324 888 56.0
EDIT:
Try finding duplicates and filling NaN with 0:
df['date'] = pd.to_datetime(df.date)
df.set_index('date', inplace=True)
df1 = df.groupby('start')['z'].shift(freq='1D', periods=1).reset_index()
df2 = pd.merge(df.reset_index(), df1, on=['start', 'date'], how='left', suffixes=('', '1'))
mask = df2.start.duplicated(keep=False)
df2.loc[mask, 'z1'] = df2.loc[mask, 'z1'].fillna(0)  # .loc replaces the removed .ix indexer
print (df2)
date start z z1
0 2012-12-01 324 564545 0.0
1 2012-12-01 384 5555 NaN
2 2012-12-01 349 554 NaN
3 2012-12-02 855 635 NaN
4 2012-12-02 324 56 564545.0
5 2012-12-01 341 98 NaN
6 2012-12-03 324 888 56.0
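For reference, a minimal sketch of the sample frame these snippets assume (values reconstructed from the output above):
import pandas as pd

df = pd.DataFrame({
    'date': ['2012-12-01', '2012-12-01', '2012-12-01', '2012-12-02',
             '2012-12-02', '2012-12-01', '2012-12-03'],
    'start': [324, 384, 349, 855, 324, 341, 324],
    'z': [564545, 5555, 554, 635, 56, 98, 888],
})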
