Adding a column in pandas based on values from other columns with conditions - python

I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the same unit's previous sale, or NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment I am grouping on the unit, keeping the units that have several rows, then extracting the information for those units associated with the minimal date. I then join this table with my original table, keeping only the rows whose dates differ between the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.

Use DataFrameGroupBy.shift with add_prefix and join to append the new DataFrame to the original:
# if the real data are not sorted
# df = df.sort_values(['unit', 'year', 'month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print(df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
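For a self-contained check, you can rebuild the example frame and apply the one-liner (a minimal sketch; values copied from the question):
import pandas as pd

df = pd.DataFrame({
    'unit':  [1, 1, 2, 2, 3, 3, 4],
    'year':  [2018, 2013, 2015, 2015, 2017, 2002, 2016],
    'month': [6, 4, 10, 2, 4, 6, 1],
    'price': [100, 70, 80, 110, 120, 90, 55],
})

# each group's next row (in the current order) holds that sale's previous sale
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print(df)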

Related

Putting Na on several columns according to conditions on certain columns

I have a pandas dataframe with bill-amount columns, along with dates and IDs associated with those amounts. I would like to set a bill's amount, date, and ID columns to NaN when its date is earlier than 2016-12-31. Here is an example:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3  Gender  Age
4            6      2000-10-04  1          45     2000-11-05  2          51     1999-12-05  8          M       25
6            8      2016-05-03  7          39     2017-08-09  8          38     2018-07-14  17         W       54
12           14     2016-11-16  10         73     2017-05-04  15         14     2017-07-04  35         M       68
And I would like to get this:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3  Gender  Age
4            NaN    NaN         NaN        NaN    NaN         NaN        NaN    NaN         NaN        M       25
6            NaN    NaN         NaN        39     2017-08-09  8          38     2018-07-14  17         W       54
12           NaN    NaN         NaN        73     2017-05-04  15         14     2017-07-04  35         M       68
One option is to create a MultiIndex.from_frame based on the values extracted with str.extractall:
new_df = df.set_index(['ID customer', 'Gender', 'Age'])
orig_cols = new_df.columns  # save for later

# split each column name into (label, bill number), e.g. 'ID Bill 1' -> ('ID Bill', '1')
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extractall(r'(.*?)(?:\s+)?(\d+)')
)
0 Bill Date ID Bill Bill Date ID Bill Bill Date ID Bill
1 1 1 1 2 2 2 3 3 3
ID customer Gender Age
4 M 25 6 2000-10-04 1 45 2000-11-05 2 51 1999-12-05 8
6 W 54 8 2016-05-03 7 39 2017-08-09 8 38 2018-07-14 17
12 M 68 14 2016-11-16 10 73 2017-05-04 15 14 2017-07-04 35
Then mask on the Date column (in level 0) where dates are less than the threshold:
new_df = new_df.mask(new_df['Date'].lt(pd.to_datetime('2016-12-31')))
0 Bill Date ID Bill Bill Date ID Bill Bill Date ID Bill
1 1 1 1 2 2 2 3 3 3
ID customer Gender Age
4 M 25 NaN NaT NaN NaN NaT NaN NaN NaT NaN
6 W 54 NaN NaT NaN 39.0 2017-08-09 8.0 38.0 2018-07-14 17.0
12 M 68 NaN NaT NaN 73.0 2017-05-04 15.0 14.0 2017-07-04 35.0
Lastly, restore columns and order:
new_df.columns = orig_cols # Restore from "save"
new_df = new_df.reset_index().reindex(columns=df.columns)
ID customer Bill1 Date 1 ID Bill 1 Bill2 Date 2 ID Bill 2 Bill3 Date3 ID Bill 3 Gender Age
0 4 NaN NaT NaN NaN NaT NaN NaN NaT NaN M 25
1 6 NaN NaT NaN 39.0 2017-08-09 8.0 38.0 2018-07-14 17.0 W 54
2 12 NaN NaT NaN 73.0 2017-05-04 15.0 14.0 2017-07-04 35.0 M 68
All together (first ensure the date columns are datetime):
df['Date 1'] = pd.to_datetime(df['Date 1'])
df['Date 2'] = pd.to_datetime(df['Date 2'])
df['Date3'] = pd.to_datetime(df['Date3'])

new_df = df.set_index(['ID customer', 'Gender', 'Age'])
orig_cols = new_df.columns  # save for later
new_df.columns = pd.MultiIndex.from_frame(
    new_df.columns.str.extractall(r'(.*?)(?:\s+)?(\d+)')
)
new_df = new_df.mask(new_df['Date'].lt(pd.to_datetime('2016-12-31')))
new_df.columns = orig_cols  # restore from "save"
new_df = new_df.reset_index().reindex(columns=df.columns)
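For a self-contained run, the example frame can be rebuilt from the question's input table first (a sketch; values copied from above):
import pandas as pd

df = pd.DataFrame({
    'ID customer': [4, 6, 12],
    'Bill1': [6, 8, 14],
    'Date 1': ['2000-10-04', '2016-05-03', '2016-11-16'],
    'ID Bill 1': [1, 7, 10],
    'Bill2': [45, 39, 73],
    'Date 2': ['2000-11-05', '2017-08-09', '2017-05-04'],
    'ID Bill 2': [2, 8, 15],
    'Bill3': [51, 38, 14],
    'Date3': ['1999-12-05', '2018-07-14', '2017-07-04'],
    'ID Bill 3': [8, 17, 35],
    'Gender': ['M', 'W', 'M'],
    'Age': [25, 54, 68],
})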
Another way:
# assumption: your date columns are of dtype datetime64[ns]
c = ~df.filter(like='Date').lt(pd.to_datetime('2016-12-31'))
# repeat each Date mask 3 times so it covers its Bill/Date/ID triple
c = pd.DataFrame(c.values.repeat(3, axis=1), columns=df.columns[1:10])
Finally:
out = df[df.columns[1:10]]
out = out[c].join(df[['ID customer', 'Gender', 'Age']])
Now if you print out, you will get the desired output.

Python/Pandas outer merge not including all relevant columns

I want to merge the following 2 data frames in Pandas but the result isn't containing all the relevant columns:
L1aIn[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L1aIn
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a 2021 3 29 1
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a 2021 3 29 1
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b 2021 3 29 1
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a 2021 3 29 1
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a 2021 3 29 1
L2Std[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L2Std
0 oco2_L2StdGL_35861a_210329_B10206r_21042704283... 35861 GL a 2021 3 29 1
1 oco2_L2StdXS_35860a_210329_B10206r_21042700342... 35860 XS a 2021 3 29 1
2 oco2_L2StdND_35852a_210329_B10206r_21042622540... 35852 ND a 2021 3 29 1
3 oco2_L2StdGL_35862a_210329_B10206r_21042622403... 35862 GL a 2021 3 29 1
4 oco2_L2StdTG_35856a_210329_B10206r_21042622422... 35856 TG a 2021 3 29 1
>>> df = L1aIn.copy(deep=True)
>>> df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a ... NaN NaN NaN NaN
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a ... NaN NaN NaN NaN
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b ... NaN NaN NaN NaN
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a ... NaN NaN NaN NaN
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a ... NaN NaN NaN NaN
5 NaN 35861 GL a ... 2021.0 3.0 29.0 1.0
6 NaN 35860 XS a ... 2021.0 3.0 29.0 1.0
7 NaN 35852 ND a ... 2021.0 3.0 29.0 1.0
8 NaN 35862 GL a ... 2021.0 3.0 29.0 1.0
9 NaN 35856 TG a ... 2021.0 3.0 29.0 1.0
[10 rows x 13 columns]
>>> df.columns
Index(['Filename', 'OrbitNumber', 'OrbitMode', 'OrbitModeCounter', 'Year',
'Month', 'Day', 'L1aIn'],
dtype='object')
I want the resulting merged table to include both the "L1aIn" and "L2Std" columns but as you can see it doesn't and only picks up the original columns from L1aIn.
I'm also puzzled about why it seems to be returning a dataframe object rather than None.
A toy example works fine for me, but the real-life one does not. What circumstances provoke this kind of behavior for merge?
It seems to me that you just need to assign the output to a variable; merge returns a new DataFrame and does not modify df in place (which is also why df.columns still shows only the original columns):
merged_df = df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
print(merged_df.columns)
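A quick toy check of the same pattern (illustrative frames, not the OP's data) confirms the outer merge carries both value columns once the result is captured:
import pandas as pd

left = pd.DataFrame({'OrbitNumber': [1, 2], 'L1aIn': [1, 1]})
right = pd.DataFrame({'OrbitNumber': [2, 3], 'L2Std': [1, 1]})

merged_df = left.merge(right, how='outer', on='OrbitNumber')
print(merged_df)
#    OrbitNumber  L1aIn  L2Std
# 0            1    1.0    NaN
# 1            2    1.0    1.0
# 2            3    NaN    1.0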

Fill up columns in dataframe based on condition

I have a dataframe that looks as follows:
id cyear month datadate fyear
1 1988 3 nan nan
1 1988 4 nan nan
1 1988 5 1988-05-31 1988
1 1988 6 nan nan
1 1988 7 nan nan
1 1988 8 nan nan
1 1988 9 nan nan
1 1988 12 nan nan
1 1989 1 nan nan
1 1989 2 nan nan
1 1989 3 nan nan
1 1989 4 nan nan
1 1989 5 1989-05-31 1989
1 1989 6 nan nan
1 1989 7 nan nan
1 1989 8 nan nan
1 1990 8 nan nan
4 2000 1 nan nan
4 2000 2 nan nan
4 2000 3 nan nan
4 2000 4 nan nan
4 2000 5 nan nan
4 2000 6 nan nan
4 2000 7 nan nan
4 2000 8 nan nan
4 2000 9 nan nan
4 2000 10 nan nan
4 2000 11 nan nan
4 2000 12 2000-12-31 2000
5 2000 11 nan nan
More specifically, I have a dataframe consisting of monthly (month) data on firms (id) per calendar year (cyear). If the respective row, i.e. month, represents the end of one of the firm's fiscal years, the datadate column denotes the respective month's end as a date variable and the fyear column denotes the fiscal year that just ended.
I now want the fyear value to indicate the respective fiscal year not just in the last month of the respective company's fiscal year, but in every month within that fiscal year:
id cyear month datadate fyear
1 1988 3 nan 1988
1 1988 4 nan 1988
1 1988 5 1988-05-31 1988
1 1988 6 nan 1989
1 1988 7 nan 1989
1 1988 8 nan 1989
1 1988 9 nan 1989
1 1988 12 nan 1989
1 1989 1 nan 1989
1 1989 2 nan 1989
1 1989 3 nan 1989
1 1989 4 nan 1989
1 1989 5 1989-05-31 1989
1 1989 6 nan 1990
1 1989 7 nan 1990
1 1989 8 nan 1990
1 1990 8 nan 1991
4 2000 1 nan 2000
4 2000 2 nan 2000
4 2000 3 nan 2000
4 2000 4 nan 2000
4 2000 5 nan 2000
4 2000 6 nan 2000
4 2000 7 nan 2000
4 2000 8 nan 2000
4 2000 9 nan 2000
4 2000 10 nan 2000
4 2000 11 nan 2000
4 2000 12 2000-12-31 2000
5 2000 11 nan nan
Note that months may be missing, as is evident in the case of id 1, and fiscal years may end in different months, with fyear=cyear or fyear=cyear+1 (I have included only the former; one could construct the latter by adding 1 to the current fyear values of, e.g., id 1). Also, the last row(s) of a given firm may not necessarily be its fiscal-year-end month, as is again evident for id 1. Lastly, there may exist firms for which no information on fiscal years is available.
I appreciate any help on this.
Do you want this?
def backward_fill(x):
    # months up to (and including) a fiscal-year end take that end's fyear
    x = x.bfill()
    # months after the last fiscal-year end take the last fyear + 1;
    # firms with no fyear information at all stay NaN
    x = x.ffill() + x.isna().astype(int)
    return x

df.fyear = df.groupby('id')['fyear'].transform(backward_fill)
Output
id cyear month datadate fyear
0 1 1988 3 <NA> 1988
1 1 1988 4 <NA> 1988
2 1 1988 5 1988-05-31 1988
3 1 1988 6 <NA> 1989
4 1 1988 7 <NA> 1989
5 1 1988 8 <NA> 1989
6 1 1988 9 <NA> 1989
7 1 1988 12 <NA> 1989
8 1 1989 1 <NA> 1989
9 1 1989 2 <NA> 1989
10 1 1989 3 <NA> 1989
11 1 1989 4 <NA> 1989
12 1 1989 5 1989-05-31 1989
13 1 1989 6 <NA> 1990
14 4 2000 1 <NA> 2000
15 4 2000 2 <NA> 2000
16 4 2000 3 <NA> 2000
17 4 2000 4 <NA> 2000
18 4 2000 5 <NA> 2000
19 4 2000 6 <NA> 2000
20 4 2000 7 <NA> 2000
21 4 2000 8 <NA> 2000
22 4 2000 9 <NA> 2000
23 4 2000 10 <NA> 2000
24 4 2000 11 <NA> 2000
25 4 2000 12 2000-12-31 2000
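Note this assumes the rows are already sorted by month within each id, as in the question; if they are not, sort first so that bfill/ffill walk through the months in order (a sketch):
df = df.sort_values(['id', 'cyear', 'month'], ignore_index=True)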

Leading and Trailing Padding Dates in Pandas DataFrame

This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# the 'date' field holds datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty" dates? I have tried to reindex on a date_range and a period_range, and I have tried to merge another index. I have tried all sorts of things all day, and I have read a lot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindexing the data by date_range(start='2018-01-01', periods=12, freq='M')):
(Ideally I would want the month to be transposed by year across the top as columns)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN
One way is to reindex:
# sum amounts per (account_id, year, month)
s = df.groupby([df['account_id'], df.index.year, df.index.month]).sum()
# build the full (account_id, year, month 1..12) grid and align to it
idx = pd.MultiIndex.from_product([s.index.levels[0], s.index.levels[1], list(range(1, 13))])
s = s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN
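For the "months as columns" layout the question mentions, unstacking the month level of s should get close (a sketch built on s above):
# move the month level (1..12) into the columns
wide = s['amount'].unstack(-1)
print(wide)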

Pandas Dataframe: shift/merge multiple rows sharing the same column values into one row

Sorry for any possible confusion with the title; I will describe my question better with the following code and tables.
I have a dataframe with multiple columns, sorted by the first two, 'Route' and 'ID'. (Sorry about the formatting; all the rows here have a 'Route' value of 100 and an 'ID' from 1 to 3.)
df1.head(9)
Route ID Year Vol Truck_Vol Truck_%
0 100 1 2017.0 7016 635.0 9.1
1 100 1 2014.0 6835 NaN NaN
2 100 1 2011.0 5959 352.0 5.9
3 100 2 2018.0 15828 NaN NaN
4 100 2 2015.0 13114 2964.0 22.6
5 100 2 2009.0 11844 1280.0 10.8
6 100 3 2016.0 15434 NaN NaN
7 100 3 2013.0 18699 2015.0 10.8
8 100 3 2010.0 15903 NaN NaN
What I want to have is
Route ID Year Vol1 Truck_Vol1 Truck_%1 Year2 Vol2 Truck_Vol2 Truck_%2 Year3 Vol3 Truck_Vol3 Truck_%3
0 100 1 2017 7016 635.0 9.1 2014 6835 NaN NaN 2011 5959 352.0 5.9
1 100 2 2018 15828 NaN NaN 2015 13114 2964.0 22.6 2009 11844 1280.0 10.8
2 100 3 2016 15434 NaN NaN 2013 18699 2015.0 10.8 2010 15903 NaN NaN
Again, sorry for the messy formatting. Let me try a simplified version.
Input:
Route ID Year Vol T_%
0 100 1 2017 100 1.0
1 100 1 2014 200 NaN
2 100 1 2011 300 2.0
3 100 2 2018 400 NaN
4 100 2 2015 500 3.0
5 100 2 2009 600 4.0
Desired Output:
Route ID Year Vol T_% Year.1 Vol.1 T_%.1 Year.2 Vol.2 T_%.2
0 100 1 2017 100 1.0 2014 200 NaN 2011 300 2
1 100 2 2018 400 NaN 2015 500 3.0 2009 600 4
So basically, just move each group's later rows up beside its first row, as in the desired output above.
I am stumped here. The names for the newly generated columns don't matter.
For this current dataframe, I have three rows per 'group' like shown in the code. It will be great if the answer can accommodate any number of rows each group.
Thanks for your time.
with groupby + cumcount + set_index + unstack
df1 = (df.assign(cid=df.groupby(['Route', 'ID']).cumcount())
         .set_index(['Route', 'ID', 'cid'])
         .unstack(-1)
         .sort_index(axis=1, level=1))
df1.columns = [f'{x}{y}' for x, y in df1.columns]
df1 = df1.reset_index()
Output df1:
Route ID T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
0 100 1 1.0 100 2017 NaN 200 2014 2.0 300 2011
1 100 2 NaN 400 2018 3.0 500 2015 4.0 600 2009
melt + pivot_table
v = df.melt(id_vars=['Route', 'ID'])
# number the repeats of each variable within every (Route, ID) group: Year0, Year1, ...
v['variable'] += v.groupby(['Route', 'ID', 'variable']).cumcount().astype(str)
res = v.pivot_table(index=['Route', 'ID'], columns='variable', values='value')
variable T_% 0 T_% 1 T_% 2 Vol 0 Vol 1 Vol 2 Year 0 Year 1 Year 2
Route ID
100 1 1.0 NaN 2.0 100.0 200.0 300.0 2017.0 2014.0 2011.0
2 NaN 3.0 4.0 400.0 500.0 600.0 2018.0 2015.0 2009.0
If you want to sort these:
import numpy as np

# order the columns by their trailing group number
c = res.columns.str.extract(r'(\d+)')[0].values.astype(int)
res.iloc[:, np.argsort(c)]
variable T_%0 Vol0 Year0 T_%1 Vol1 Year1 T_%2 Vol2 Year2
Route ID
100 1 1.0 100.0 2017.0 NaN 200.0 2014.0 2.0 300.0 2011.0
2 NaN 400.0 2018.0 3.0 500.0 2015.0 4.0 600.0 2009.0
You asked about why I used cumcount. To explain, here is what v looks like right after the melt, before the cumcount suffix is added:
Route ID variable value
0 100 1 Year 2017.0
1 100 1 Year 2014.0
2 100 1 Year 2011.0
3 100 2 Year 2018.0
4 100 2 Year 2015.0
5 100 2 Year 2009.0
6 100 1 Vol 100.0
7 100 1 Vol 200.0
8 100 1 Vol 300.0
9 100 2 Vol 400.0
10 100 2 Vol 500.0
11 100 2 Vol 600.0
12 100 1 T_% 1.0
13 100 1 T_% NaN
14 100 1 T_% 2.0
15 100 2 T_% NaN
16 100 2 T_% 3.0
17 100 2 T_% 4.0
If I used pivot_table on this DataFrame directly, duplicates would be aggregated (mean by default), and you would end up with something like this:
variable T_% Vol Year
Route ID
100 1 1.5 200.0 2014.0
2 3.5 500.0 2014.0
Obviously you are losing data here. cumcount is the solution, as it turns the variable series into this:
Route ID variable value
0 100 1 Year0 2017.0
1 100 1 Year1 2014.0
2 100 1 Year2 2011.0
3 100 2 Year0 2018.0
4 100 2 Year1 2015.0
5 100 2 Year2 2009.0
6 100 1 Vol0 100.0
7 100 1 Vol1 200.0
8 100 1 Vol2 300.0
9 100 2 Vol0 400.0
10 100 2 Vol1 500.0
11 100 2 Vol2 600.0
12 100 1 T_%0 1.0
13 100 1 T_%1 NaN
14 100 1 T_%2 2.0
15 100 2 T_%0 NaN
16 100 2 T_%1 3.0
17 100 2 T_%2 4.0
Where you have a count of repeated elements per unique Route and ID.
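To try either approach end to end, the simplified input can be rebuilt like this (a sketch; values taken from the question):
import pandas as pd

df = pd.DataFrame({
    'Route': [100] * 6,
    'ID':    [1, 1, 1, 2, 2, 2],
    'Year':  [2017, 2014, 2011, 2018, 2015, 2009],
    'Vol':   [100, 200, 300, 400, 500, 600],
    'T_%':   [1.0, None, 2.0, None, 3.0, 4.0],
})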
