Sum one column based on similarity of other column values in Python - python

I want to sum one column based on similarity of other column. I tried the below code, but it gives me error, and It docent bring all the column. can anyone help me please?
df ["sum"]=df.groupby(['id']).agg({'duration': sum}).reset_index()
df
df
x. y. m. n. duration id
xx. rr. 1.1. 4.4 66 2
xx. rr. 1.1. 4.4 66 2
xx. rr. 1.1 4.4 66 2
tt. uu 2.2 4.4 10 3
tt. uu 2.2 4.4 55 3
What I want is:
x. y. m. n. duration id
xx. rr. 11 4.4 sum(66+66+66) 2
tt. uu. 22. 4.4 sum(10+55) 2

If need first rows by id use GroupBy.transform with DataFrame.drop_duplicates:
df["sum"] = df.groupby('id')['duration'].transform('sum')
df1 = df.drop_duplicates('id')
Or aggregate by all columns:
df2 = df.groupby(['x.','y.','m.','n.', 'id'], as_index=False)['duration'].sum()

Related

Subtract columns from two DFs based on matching condition

Suppose I have the following two DFs:
DF A: First column is a date, and then there are columns that start with a year (2021, 2022...)
Date 2021.Water 2021.Gas 2022.Electricity
may-04 500 470 473
may-05 520 490 493
may-06 540 510 513
DF B: First column is a date, and then there are columns that start with a year (2021, 2022...)
Date 2021.Amount 2022.Amount
may-04 100 95
may-05 110 105
may-06 120 115
The expected result is a DF with the columns from DF A, but that have the rows divided by the values for the matching year in DF B. Such as:
Date 2021.Water 2021.Gas 2022.Electricity
may-04 5.0 4.7 5.0
may-05 4.7 4.5 4.7
may-06 4.5 4.3 4.5
I am really struggling with this problem. Let me know if any clarifications are needed and will be glad to help.
Try this:
dfai = dfa.set_index('Date')
dfai.columns = dfai.columns.str.split('.', expand=True)
dfbi = dfb.set_index('Date').rename(columns = lambda x: x.split('.')[0])
df_out = dfai.div(dfbi, level=0).round(1)
df_out.columns = df_out.columns.map('.'.join)
df_out.reset_index()
Output:
Date 2021.Water 2021.Gas 2022.Electricity
0 may-04 5.0 4.7 5.0
1 may-05 4.7 4.5 4.7
2 may-06 4.5 4.2 4.5
Details
First, move 'Date' into the index of both dataframes, then use string split to get years into a level in each dataframe.
Use, pd.DataFrame.div with level=0 to align operations on the top level index of each dataframe.
Flatten multiindex column header back to a single level and reset_index.

Pandas combine two dataseries into one series

I need to combine the dataseries rateScore and rate into one.
This is the current DataFrame I have
rateScore rate
10 NaN 4.5
11 2.5 NaN
12 4.5 NaN
13 NaN 5.0
..
235 NaN 4.7
236 3.8 NaN
This needs to be something like this:
rateScore
10 4.5
11 2.5
12 4.5
13 5.0
..
235 4.7
236 3.8
The rate column needs to be dropped after merging the series and also for each row, the index number needs stay the same.
You can try with the following with fillna(), redifining the rateScore column and dropping rate:
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from a second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rateScore'])
Let us do add
df['rateScore'] = df['rateScore'].add(df['rate'],fill_value=0)

How to create new columns by looping through columns in different dataframes?

I have two pd.dataframes:
df1:
Year Replaced Not_replaced
2015 1.5 0.1
2016 1.6 0.3
2017 2.1 0.1
2018 2.6 0.5
df2:
Year HI LO RF
2015 3.2 2.9 3.0
2016 3.0 2.8 2.9
2017 2.7 2.5 2.6
2018 2.6 2.2 2.3
I need to create a third df3 by using the following equation:
df3[column1]=df1['Replaced']-df1['Not_replaced]+df2['HI']
df3[column2]=df1['Replaced']-df1['Not_replaced]+df2['LO']
df3[column3]=df1['Replaced']-df1['Not_replaced]+df2['RF']
I can merge the two dataframes and manually create 3 new columns one by one, but I can't figure out how to use the loop function to create the results.
You can create an empty dataframe & fill it with values while looping
(Note: col_names & df3.columns must be of the same length)
df3 = pd.DataFrame(columns = ['column1','column2','column3'])
col_names = ["HI", "LO","RF"]
for incol,df3column in zip(col_names,df3.columns):
df3[df3column] = df1['Replaced']-df1['Not_replaced']+df2[incol]
print(df3)
output
column1 column2 column3
0 4.6 4.3 4.4
1 4.3 4.1 4.2
2 4.7 4.5 4.6
3 4.7 4.3 4.4
for the for loop, I would first merge df1 and df2 into to create a new df, called df3. Then, I would create a list of te names of the columns you want to iterate through:
col_names = ["HI", "LO","RF"]
for col in col_names:
df3[f"column_{col}]= df3['Replaced']-df3['Not_replaced]+df3[col]

Filtering a dataframe by a list

I have the following dataframe
Dataframe:
Date Name Value Rank Mean
01/02/2019 A 10 100 8.2
02/03/2019 A 9 120 7.9
01/03/2019 B 3 40 6.4
03/02/2019 B 1 39 5.9
...
And following list:
date=['01/02/2019','03/02/2019'...]
I would like to filter the df by the list, but as a date range, so for each value in the list I would like to bring back data between the date and the date-30 days
I am using numpy broadcast here, notice this method is o(n*m) , which mean if both of the df and date list is huge , it will exceed the memory limit
s=pd.to_datetime(date).values
df.Date=pd.to_datetime(df.Date)
s1=df.Date.values
t=(s-s1[:,None]).astype('timedelta64[D]').astype(int)
df[np.any((t>=0)&(t<=30),1)]
Out[120]:
Date Name Value Rank Mean
0 2019-01-02 A 10 100 8.2
1 2019-02-03 A 9 120 7.9
3 2019-03-02 B 1 39 5.9
If your date is a string, just do:
df[df.date.isin(list_of_dates)]

python pandas:get rolling value of one Dataframe by rolling index of another Dataframe

I have two dataframes: one has multi levels of columns, and another has only single level column (which is the first level of the first dataframe, or say the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
first dataframe-df1
second dataframe-df2
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little bit. For example, during the five days (window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened in 2016/2/24, the index of 2016/2/24 in the five-day range is 1. So, in result1, the value of stock sh600870 in 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Let's take the same stock as example, the stock sh600870 is in sector ’家用电器视听器材白色家电‘. So in 2016/2/29, I wanna get the sector price in 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore it maybe more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
sector, stock = col
mask = pd.notnull(index[col])
idx = index.loc[mask, col].astype(int)
result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note in pandas 0.18 the rolling_apply syntax was changed. DataFrames and Series now have a rolling method, so that now you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
.rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)

Categories