Subtract columns from two DFs based on matching condition - python

Suppose I have the following two DFs:
DF A: First column is a date, and then there are columns that start with a year (2021, 2022...)
Date 2021.Water 2021.Gas 2022.Electricity
may-04 500 470 473
may-05 520 490 493
may-06 540 510 513
DF B: First column is a date, and then there are columns that start with a year (2021, 2022...)
Date 2021.Amount 2022.Amount
may-04 100 95
may-05 110 105
may-06 120 115
The expected result is a DF with the same columns as DF A, where each value is divided by the value for the matching year (and the same date) in DF B, such as:
Date 2021.Water 2021.Gas 2022.Electricity
may-04 5.0 4.7 5.0
may-05 4.7 4.5 4.7
may-06 4.5 4.3 4.5
I am really struggling with this problem. Let me know if any clarification is needed and I will be glad to provide it.

Try this:
dfai = dfa.set_index('Date')
dfai.columns = dfai.columns.str.split('.', expand=True)
dfbi = dfb.set_index('Date').rename(columns = lambda x: x.split('.')[0])
df_out = dfai.div(dfbi, level=0).round(1)
df_out.columns = df_out.columns.map('.'.join)
df_out.reset_index()
Output:
Date 2021.Water 2021.Gas 2022.Electricity
0 may-04 5.0 4.7 5.0
1 may-05 4.7 4.5 4.7
2 may-06 4.5 4.2 4.5
Details
First, move 'Date' into the index of both dataframes, then use string split to get years into a level in each dataframe.
Use pd.DataFrame.div with level=0 to align the operation on the top level (the year) of each dataframe's column index.
Flatten multiindex column header back to a single level and reset_index.
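If the MultiIndex step feels heavy, the same year matching can also be sketched without it. This is an alternative to the code above, not the answerer's method: it builds a divisor frame with one column per column of DF A by repeating the matching year column of DF B. The frames below are rebuilt from the question's sample data, and the names `a`, `b`, `years`, `divisor` and `out` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
dfa = pd.DataFrame({
    "Date": ["may-04", "may-05", "may-06"],
    "2021.Water": [500, 520, 540],
    "2021.Gas": [470, 490, 510],
    "2022.Electricity": [473, 493, 513],
})
dfb = pd.DataFrame({
    "Date": ["may-04", "may-05", "may-06"],
    "2021.Amount": [100, 110, 120],
    "2022.Amount": [95, 105, 115],
})

a = dfa.set_index("Date")
b = dfb.set_index("Date")

# Year prefix of every column in A, e.g. ['2021', '2021', '2022'].
years = a.columns.str.split(".").str[0]

# Rename B's columns to the bare year so they can be looked up by prefix.
b.columns = b.columns.str.split(".").str[0]

# Repeat B's year columns so there is one divisor column per column of A,
# then relabel them so the division aligns column by column.
divisor = b[list(years)].copy()
divisor.columns = a.columns

out = a.div(divisor).round(1).reset_index()
print(out)
```

This assumes every year prefix in DF A also exists in DF B; a missing year would raise a KeyError at the `b[list(years)]` lookup.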

Related

Sum one column based on similarity of other column values in Python

I want to sum one column based on matching values in another column. I tried the code below, but it gives me an error and it doesn't bring back all the columns. Can anyone help me, please?
df ["sum"]=df.groupby(['id']).agg({'duration': sum}).reset_index()
df
df
x. y. m. n. duration id
xx. rr. 1.1. 4.4 66 2
xx. rr. 1.1. 4.4 66 2
xx. rr. 1.1 4.4 66 2
tt. uu 2.2 4.4 10 3
tt. uu 2.2 4.4 55 3
What I want is:
x. y. m. n. duration id
xx. rr. 11 4.4 sum(66+66+66) 2
tt. uu. 22. 4.4 sum(10+55) 2
If you need the first row for each id, use GroupBy.transform with DataFrame.drop_duplicates:
df["sum"] = df.groupby('id')['duration'].transform('sum')
df1 = df.drop_duplicates('id')
Or aggregate by all columns:
df2 = df.groupby(['x.','y.','m.','n.', 'id'], as_index=False)['duration'].sum()
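A minimal runnable sketch of both options, using the question's sample rows. The column names are simplified to `x`, `y`, `m`, `n` (the trailing dots in the question look like formatting), which is an assumption. The second option uses named aggregation and takes the first value of the non-key columns, which is a slight variation on grouping by all columns.

```python
import pandas as pd

# Sample rows from the question; column names simplified (assumption).
df = pd.DataFrame({
    "x": ["xx", "xx", "xx", "tt", "tt"],
    "y": ["rr", "rr", "rr", "uu", "uu"],
    "m": [1.1, 1.1, 1.1, 2.2, 2.2],
    "n": [4.4, 4.4, 4.4, 4.4, 4.4],
    "duration": [66, 66, 66, 10, 55],
    "id": [2, 2, 2, 3, 3],
})

# Option 1: broadcast the per-id total onto every row,
# then keep only the first row of each id.
df["sum"] = df.groupby("id")["duration"].transform("sum")
first_rows = df.drop_duplicates("id")

# Option 2: collapse to one row per id, keeping the first value of the
# descriptive columns and summing the duration.
agg = df.groupby("id", as_index=False).agg(
    x=("x", "first"),
    y=("y", "first"),
    m=("m", "first"),
    n=("n", "first"),
    duration=("duration", "sum"),
)

print(first_rows)
print(agg)
```

The original attempt fails because the aggregated frame has one row per id while `df` has one row per record, so it cannot be assigned to a single column of `df`; `transform` returns a result with the same length as `df`.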

Pandas combine two dataseries into one series

I need to combine the dataseries rateScore and rate into one.
This is the current DataFrame I have
rateScore rate
10 NaN 4.5
11 2.5 NaN
12 4.5 NaN
13 NaN 5.0
..
235 NaN 4.7
236 3.8 NaN
This needs to be something like this:
rateScore
10 4.5
11 2.5
12 4.5
13 5.0
..
235 4.7
236 3.8
The rate column needs to be dropped after merging the series and also for each row, the index number needs stay the same.
You can try the following with fillna(), redefining the rateScore column and dropping rate:
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from a second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
Alternatively, use Series.add with fill_value=0:
df['rateScore'] = df['rateScore'].add(df['rate'],fill_value=0)
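A short self-contained sketch of the fill approach on the first rows of the question's data. It uses `Series.fillna` with another Series, which behaves like `combine_first` here. Unlike the two addition-based answers, it would not sum the values if a row ever had both columns populated; whether that matters depends on your data.

```python
import numpy as np
import pandas as pd

# First four rows of the question's frame, with the original index labels.
df = pd.DataFrame(
    {
        "rateScore": [np.nan, 2.5, 4.5, np.nan],
        "rate": [4.5, np.nan, np.nan, 5.0],
    },
    index=[10, 11, 12, 13],
)

# Fill the gaps in rateScore from rate (aligned on the index),
# then drop the now-redundant column.
df["rateScore"] = df["rateScore"].fillna(df["rate"])
df = df.drop(columns="rate")

print(df)
```

The index labels are untouched, since neither `fillna` nor `drop` reorders or resets rows.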

Filtering a dataframe by a list

I have the following dataframe
Dataframe:
Date Name Value Rank Mean
01/02/2019 A 10 100 8.2
02/03/2019 A 9 120 7.9
01/03/2019 B 3 40 6.4
03/02/2019 B 1 39 5.9
...
And following list:
date=['01/02/2019','03/02/2019'...]
I would like to filter the df by the list, but as a date range: for each value in the list, I would like to bring back the rows between that date minus 30 days and that date.
I am using numpy broadcasting here. Note that this method is O(n*m) in memory, which means that if both the df and the date list are huge, it can exceed the memory limit.
s=pd.to_datetime(date).values
df.Date=pd.to_datetime(df.Date)
s1=df.Date.values
t=(s-s1[:,None]).astype('timedelta64[D]').astype(int)
df[np.any((t>=0)&(t<=30),1)]
Out[120]:
Date Name Value Rank Mean
0 2019-01-02 A 10 100 8.2
1 2019-02-03 A 9 120 7.9
3 2019-03-02 B 1 39 5.9
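If the n-by-m matrix is too large, one alternative is to loop over the list and OR together one boolean mask per date, which keeps memory at O(n) at the cost of a Python-level loop. This is a sketch, not the method above. It assumes the strings are month/day/year (consistent with the parsed output shown), and the names `dates`, `mask` and `result` are illustrative.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "Date": ["01/02/2019", "02/03/2019", "01/03/2019", "03/02/2019"],
    "Name": ["A", "A", "B", "B"],
    "Value": [10, 9, 3, 1],
    "Rank": [100, 120, 40, 39],
    "Mean": [8.2, 7.9, 6.4, 5.9],
})
dates = ["01/02/2019", "03/02/2019"]

# Assumption: month/day/year format.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Start with nothing selected, then add one 30-day window per list entry.
mask = pd.Series(False, index=df.index)
for d in pd.to_datetime(dates, format="%m/%d/%Y"):
    # between() is inclusive on both ends by default.
    mask |= df["Date"].between(d - pd.Timedelta(days=30), d)

result = df[mask]
print(result)
```

On the sample data this selects the same rows (index 0, 1 and 3) as the broadcast version.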
If your dates are strings and you only need exact matches, just do:
df[df.Date.isin(date)]

Compare columns in pandas and merge

I have two data frames that I would like to merge on a column called offer_codes. In the first data frame, rows can hold multiple offer codes in a list (I could probably convert it to a tuple), while the second data frame has a single offer code per row. I want to match each offer code in the lists against the second data frame and merge on it. The data frames come from sales data from a website.
df = pd.DataFrame(data={'available': [False, True, True],
'count': [190,285,165],
'offer_codes': ['no_offer_code',['G545', 'G1891'],['G92182', 'G1921']]})
df2 = pd.DataFrame(data={'price':[85.00,99.00],
'offer_codes':['G1891', 'G1921'],
'after_fees':[105, 121]})
I would like to merge these, but my issue is that lists are unhashable, and when I try to merge with tuples they don't seem to match up correctly.
#first df
available count offer_codes
0 False 190 no_offer_code
1 True 285 [G545, G1891]
2 True 165 [G92182, G1921]
#2nd df
after_fees offer_codes price
0 105 G1891 85.0
1 121 G1921 99.0
#after the merge
after_fees available count offer_codes price
0 105 True 285 G1891 85.0
1 121 True 165 G1921 99.0
I thought that putting the list into a tuple would work but it definitely didn't.
This is a little bit long:
df.set_index(['available','count']).offer_codes.apply(pd.Series).stack().\
to_frame('offer_codes').\
reset_index(level=['count','available']).\
merge(df2,on='offer_codes',how='left').dropna()
Out[59]:
available count offer_codes after_fees price
2 True 285 G1891 105.0 85.0
4 True 165 G1921 121.0 99.0
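On pandas 0.25 or later, `DataFrame.explode` gives a shorter route to the same result. This is a sketch of an alternative, not the answer above: it turns each list element into its own row (scalars such as 'no_offer_code' are left as they are) and then does an inner merge, so unmatched codes drop out without a separate dropna.

```python
import pandas as pd

# Data frames from the question.
df = pd.DataFrame(data={
    "available": [False, True, True],
    "count": [190, 285, 165],
    "offer_codes": ["no_offer_code", ["G545", "G1891"], ["G92182", "G1921"]],
})
df2 = pd.DataFrame(data={
    "price": [85.00, 99.00],
    "offer_codes": ["G1891", "G1921"],
    "after_fees": [105, 121],
})

# One row per individual offer code, then keep only codes present in df2.
merged = df.explode("offer_codes").merge(df2, on="offer_codes", how="inner")
print(merged)
```

Because the merge is an inner join, `after_fees` keeps its integer dtype instead of being converted to float by the NaN values that a left join introduces.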

python pandas:get rolling value of one Dataframe by rolling index of another Dataframe

I have two dataframes: one has multi levels of columns, and another has only single level column (which is the first level of the first dataframe, or say the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
first dataframe-df1
second dataframe-df2
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little bit. For example, during the five days (window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened in 2016/2/24, the index of 2016/2/24 in the five-day range is 1. So, in result1, the value of stock sh600870 in 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Let's take the same stock as an example: the stock sh600870 is in the sector '家用电器视听器材白色家电'. So for 2016/2/29, I want to get the sector price on 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternatively, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore may be more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
    sector, stock = col
    mask = pd.notnull(index[col])
    idx = index.loc[mask, col].astype(int)
    result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note in pandas 0.18 the rolling_apply syntax was changed. DataFrames and Series now have a rolling method, so that now you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
.rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
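To see the relative-to-absolute conversion in isolation, here is a small sketch on a single Series using the newer `rolling` API. The data and the names `s`, `other`, `rel`, `left`, `abs_pos` and `picked` are made up for illustration. `raw=True` is passed so that `np.argmax` receives a plain NumPy array for each window.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 9, 1, 4, 7, 2, 8, 5])   # hypothetical prices
other = s * 10                             # hypothetical second series to look up
window, min_periods = 5, 4

# Position of the max *within* each window (0-based, relative to the window start).
rel = s.rolling(window, min_periods=min_periods).apply(np.argmax, raw=True)

# Ordinal position of the left edge of each window.
left = (pd.Series(np.arange(len(s)), index=s.index)
        .rolling(window, min_periods=min_periods).min())

# Absolute ordinal position of the max within s.
abs_pos = rel + left

# Use the absolute positions to look up values in another series.
valid = abs_pos.dropna().astype(int)
picked = other.iloc[valid.to_numpy()].to_numpy()

print(abs_pos)
print(picked)
```

The first three entries stay NaN because fewer than `min_periods` observations are available there, which is why they are dropped before calling `.iloc`.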
