I have a dataframe df:
2019 2020 2021 2022
A 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
B 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
C 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
D 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
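For reference, a minimal sketch to rebuild this frame (the index labels A-D and 1-4 are simply read off the printed layout above):

import pandas as pd

idx = pd.MultiIndex.from_product([list("ABCD"), [1, 2, 3, 4]])
df = pd.DataFrame(
    [[10, 15, 15, 31], [5, 4, 7, 9], [0.3, 0.4, 0.4, 0.7], [500, 600, 70, 90]] * 4,
    index=idx,
    columns=[2019, 2020, 2021, 2022],
)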
I am trying to group by the level 1 index (1, 2, 3, 4) and assign a different aggregation function to each of those index values, so that 1 is aggregated by sum, 2 by mean, and so on. The end result would look like this:
2019 2020 2021 2022
1 40 ... ... # sum
2 5 ... ... # mean
3 0.3 ... ... # mean
4 2000 ... ... # sum
I tried:
df.groupby(level = 1).agg({'1':'sum', '2':'mean', '3':'sum', '4':'mean'})
But I get an error saying that none of 1, 2, 3, 4 are in the columns (which they are not), so I am not sure how to proceed with this problem.
You could use apply with a custom function as follows:
import numpy as np

aggs = {1: np.sum, 2: np.mean, 3: np.mean, 4: np.sum}

def f(x):
    # x.name is the level 1 label of the group; fall back to sum if it is not in aggs.
    func = aggs.get(x.name, np.sum)
    return func(x)

df.groupby(level=1).apply(f)
The above code uses sum by default, so 1 and 4 could be removed from aggs without changing the results. This way, only the groups that should be handled differently from the rest need to be specified.
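For example, a minimal sketch with only the non-default groups listed (reusing f and df from above) gives the same output:

# Groups 1 and 4 fall back to the default np.sum inside f.
aggs = {2: np.mean, 3: np.mean}
df.groupby(level=1).apply(f)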
Result:
2019 2020 2021 2022
1 40.0 60.0 60.0 124.0
2 5.0 4.0 7.0 9.0
3 0.3 0.4 0.4 0.7
4 2000.0 2400.0 280.0 360.0
Just in case you were after avoiding explicit loops: slice and group by the index, then aggregate conditionally.
df1 = (
    df.groupby([df.index.get_level_values(level=1)]).agg(
        lambda x: x.sum() if x.index.get_level_values(level=1).isin([1, 4]).any() else x.mean()
    )
)
df1
2019 2020 2021 2022
1 40.0 60.0 60.0 124.0
2 5.0 4.0 7.0 9.0
3 0.3 0.4 0.4 0.7
4 2000.0 2400.0 280.0 360.0
Here is a sample CSV file of cricket scores:
>>> df
venue ball run extra wide noball
0 a 0.1 0 1 NaN NaN
1 a 0.2 4 0 NaN NaN
2 a 0.3 1 5 5.0 NaN
3 a 0.4 1 0 NaN NaN
4 a 0.5 1 1 NaN 1.0
5 a 0.6 2 1 NaN NaN
6 a 0.7 6 2 1.0 1.0
7 a 0.8 0 0 NaN NaN
8 a 0.9 1 1 NaN NaN
9 a 1.1 2 2 NaN NaN
10 a 1.2 1 0 NaN NaN
11 a 1.3 6 1 NaN NaN
12 a 1.4 0 2 NaN 2.0
13 a 1.5 1 0 NaN NaN
14 a 1.6 2 0 NaN NaN
15 a 1.7 0 1 NaN NaN
16 a 0.1 0 5 NaN NaN
17 a 0.2 4 0 NaN NaN
18 a 0.3 1 1 NaN NaN
19 a 0.4 3 0 NaN NaN
20 a 0.5 0 0 NaN NaN
21 a 0.6 0 2 2.0 NaN
22 a 0.7 6 1 NaN NaN
23 a 1.1 4 0 NaN NaN
From this dataframe I want to update the ball value, generate 2 new columns, and drop 4 columns. The conditions are:
when "wide" or "noball" is null, crun = crun + run + extra, continuing until ball restarts at 0.1
when "wide" or "noball" is not null, the current ball value won't be incremented and the row will be dropped after the crun calculation (crun = crun + run + extra). This also continues until ball restarts at 0.1.
For example, breaking down row index 0 to 8:
0.1: "wide" or "noball" is null, crun = 1
0.2: "wide" or "noball" is null, crun = 1 + 4 = 5
0.3: "wide" or "noball" is not null (removed)
0.4: "wide" or "noball" is null (becomes 0.3), crun = 5 + 1 + 5 + 1 = 12
0.5: "wide" or "noball" is not null (removed)
0.6: "wide" or "noball" is null (becomes 0.4), crun = 12 + 1 + 1 + 2 + 1 = 17
0.7: "wide" or "noball" is not null (removed)
0.8: "wide" or "noball" is null (becomes 0.5), crun = 17 + 6 + 2 = 25
0.9: "wide" or "noball" is null (becomes 0.6), crun = 25 + 1 + 1 = 27
Finally a "total" column will be created which holds the max of crun within each group (again until ball restarts at 0.1). Then the "run", "extra", "wide", and "noball" columns should be dropped.
The output I want:
venue ball crun total
0 a 0.1 1 45
1 a 0.2 5 45
2 a 0.3 12 45
3 a 0.4 17 45
4 a 0.5 25 45
5 a 0.6 27 45
6 a 1.1 31 45
7 a 1.2 32 45
8 a 1.3 39 45
9 a 1.4 42 45
10 a 1.5 44 45
11 a 1.6 45 45
12 a 0.1 5 27
13 a 0.2 9 27
14 a 0.3 11 27
15 a 0.4 14 27
16 a 0.5 14 27
17 a 0.6 23 27
18 a 1.1 27 27
I find it too complex; please help. Code I tried:
df = pd.read_csv("data.csv")
gr = df.groupby(df.ball.eq(0.1).cumsum())
df["crun"] = gr.runs.cumsum()
df["total"] = gr.current_run.transform("max")
df = df.drop(['run', 'extra', 'wide', 'noball'], axis=1)
Alrighty. This was a fun one.
(I tried to add comments for clarity.)
Note: "ball," "run," "extra," "wide," and "noball" are all numeric fields.
Note Note: This all assumes your initial DataFrame is under a variable named df.
# Create target groupings by ball value.
df["target_groups"] = df.loc[df["ball"] == 0.1].groupby(level=-1).ngroup()
df["target_groups"].fillna(method="ffill", inplace=True)
# --- Create subgroups --- #
df["target_subgroups"] = df["ball"].astype(int)
# Add field for sum of run and extra
df["run_extra"] = df[["run", "extra"]].sum(axis=1)
# Apply groupby() and cumsum() as follows to get the cumulative sum
# of each ball group for run and extra.
df["crun"] = df.groupby(["target_groups"])["run_extra"].cumsum()
# Create dataframe for max crun value of each group
group_max_df = df.groupby(["target_groups"])["crun"].max().to_frame().reset_index()
# Merge both of the DataFrames with the given suffixes. The first one
# Just prevents crun from having a suffix added, which is an additional
# step to remove.
# You could probably use .join() in a similar manner.
df = pd.merge(df, group_max_df,
on=["target_groups"],
suffixes=("", "_total"),
sort=False
)
# Rename your new total field.
df.rename(columns={"crun_total": "total"}, inplace = True)
# Apply your wide and noball condition here.
df = df[(df["wide"].isna()) & (df["noball"].isna())].copy()
# -- Reset `ball` column -- #
# Add temp column with static value
df["tmp_ball"] = 0.1
# Generate cumulative sum by subgroup.
# Set `ball` to modulo 0.6
df.loc[:, "ball"] = df.groupby(["target_subgroups"])["tmp_ball"].cumsum() % 0.6
# Find rows where ball == 0.0 and set those to 0.6
df.loc[df["ball"] == 0.0, "ball"] = 0.6
# Add ball and target_subgroups columns to get final ball value.
df["ball"] = df["ball"] + df["target_subgroups"]
# Reset your main index, if desired
df.reset_index(drop=True, inplace=True)
# Select only desired field for output.
df = df.loc[:, ["venue","ball","crun","total"]].copy()
Output of df:
venue ball crun total
0 a 0.1 1 45
1 a 0.2 5 45
2 a 0.4 12 45
3 a 0.6 17 45
4 a 0.8 25 45
5 a 0.9 27 45
6 a 1.1 31 45
7 a 1.2 32 45
8 a 1.3 39 45
9 a 1.5 42 45
10 a 1.6 44 45
11 a 1.7 45 45
12 a 0.1 5 27
13 a 0.2 9 27
14 a 0.3 11 27
15 a 0.4 14 27
16 a 0.5 14 27
17 a 0.7 23 27
18 a 1.1 27 27
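As a side note, if you only need the crun and total columns (before dropping the wide/noball rows and renumbering ball), a shorter sketch in the spirit of the question's own attempt, starting again from the original frame, could be:

# A new innings starts each time ball restarts at 0.1; cumulate run + extra within it.
inning = df["ball"].eq(0.1).cumsum()
df["crun"] = (df["run"] + df["extra"]).groupby(inning).cumsum()
df["total"] = df.groupby(inning)["crun"].transform("max")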
df1 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250]})

df2 = pd.DataFrame({'Chr': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3'],
                    'start': [0, 100, 1000, 2000, 3000, 0, 100, 1000, 2000, 3000, 0, 100, 1000, 2000, 3000],
                    'end': [100, 1000, 2000, 3000, 4000, 100, 1000, 2000, 3000, 4000, 100, 1000, 2000, 3000, 4000],
                    'logr': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.5, 0.5, 0.2, 0.2, 0.1, 0.2, 0.1, 0.5, 0.5, 0.9, 0.3]})
I want to loop through 'Chr' and 'position' in df1 and, for each row, find the matching 'Chr' and interval in df2 (where the position in df1 falls between 'start' and 'end'), then add the 'logr' and 'seg' columns to df1.
my desired output is :
df3 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250],
                    'logr': [3, 4, 10, 11, 18, 13, "NA"],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.3, 0.1, "NA"]})
Thank you in advance.
Use DataFrame.merge with an outer join to get all combinations, then filter with Series.between and boolean indexing, using DataFrame.pop to extract the helper columns, and finally do a left join to add back the missing rows:
df3 = df1.merge(df2, on='Chr', how='outer')
# between is inclusive by default (>=, <=), or exclusive (>, <) with the parameter inclusive=False
df3 = df3[df3['position'].between(df3.pop('start'), df3.pop('end'))]
# if you need one side inclusive and the other not (e.g. >, <=)
#df3 = df3[(df3['position'] > df3.pop('start')) & (df3['position'] <= df3.pop('end'))]
df3 = df1.merge(df3, how='left')
print (df3)
Chr position logr seg
0 1 50 3.0 0.2
1 1 500 4.0 0.5
2 2 1030 10.0 0.2
3 2 2005 11.0 0.1
4 3 3575 18.0 0.3
5 3 50 13.0 0.1
6 4 250 NaN NaN
Another solution:
df3 = df1.merge(df2, on='Chr', how='outer')
s = df3.pop('start')
e = df3.pop('end')
df3 = df3[df3['position'].between(s, e) | s.isna() | e.isna()]
#if different closed intervals
#df3 = df3[(df3['position'] > s) & (df3['position'] <= e) | s.isna() | e.isna()]
print (df3)
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
Try using pd.merge() and np.where():
import pandas as pd
import numpy as np
res_df = pd.merge(df1,df2,on=['Chr'],how='outer')
res_df['check_between'] = np.where((res_df['position']>=res_df['start'])&(res_df['position']<=res_df['end']),True,False)
df3 = res_df[(res_df['check_between']==True) |
(res_df['start'].isnull())|
(res_df['end'].isnull()) ]
df3.drop(['check_between','start','end'],axis=1,inplace=True)
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
Do a left merge with indicator=True. Next, query checks whether position is between start and end, or whether the _merge value is "left_only". Finally, drop the unwanted columns:
df1.merge(df2, 'left', indicator=True).query('(start<=position<=end) | _merge.eq("left_only")') \
   .drop(['start', 'end', '_merge'], axis=1)
Out[364]:
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
df = pd.read_csv('test.txt',dtype=str)
print(df)
HE WE
0 aa NaN
1 181 76
2 22 13
3 NaN NaN
I want to overwrite values in this dataframe with the following dataframe, matching on its indexes:
dff = pd.DataFrame({'HE' : [100,30]},index=[1,2])
print(dff)
HE
1 100
2 30
for i in dff.index:
    df._set_value(i, 'HE', dff._get_value(i, 'HE'))
print(df)
HE WE
0 aa NaN
1 100 76
2 30 13
3 NaN NaN
Is there a way to change it all at once without using 'for'?
Use DataFrame.update, which works in place:
df.update(dff)
print (df)
HE WE
0 aa NaN
1 100 76.0
2 30 13.0
3 NaN NaN
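Note that DataFrame.update aligns on index and columns and only writes the non-NaN values from dff, so the other rows and the WE column are untouched. If you simply want to overwrite the rows listed in dff, a plain label-based assignment is an alternative sketch:

# Overwrite 'HE' for the index labels present in dff (1 and 2); alignment is by label.
df.loc[dff.index, 'HE'] = dff['HE']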
As an example, imagine I have a df with columns for 'year', 'quarter' (sequential through a year), a variable ('var'), and a measurement ('value'):
year quarter var value
2015 1 A 0.1
2015 2 A 0.5
2015 3 A 0.6
2015 4 A 1.0
2015 1 B 0.1
2015 4 B 0.5
2015 2 C 0.0
2015 3 C 0.7
2015 4 C 1.2
But sometimes there is missing data (example: see [2015, 2, 'B']). It's not too much of a stretch to insert NaNs into the data using reindexing so that I get this:
year quarter var value
2015 1 A 0.1
2015 2 A 0.5
2015 3 A 0.6
2015 4 A 1.0
2015 1 B 0.1
2015 2 B NaN
2015 3 B NaN
2015 4 B 0.5
2015 1 C NaN
2015 2 C 0.0
2015 3 C 0.7
2015 4 C 1.2
But what I'd like to do is fill in the 'missing' data using forward-filling to propagate values, i.e. df.ffill(), and then fill in the remaining values with zero, i.e. df.fillna(0), so that you end up with something like this:
year quarter var value
2015 1 A 0.1
2015 2 A 0.5
2015 3 A 0.6
2015 4 A 1.0
2015 1 B 0.1
2015 2 B 0.1
2015 3 B 0.1
2015 4 B 0.5
2015 1 C 0.0
2015 2 C 0.0
2015 3 C 0.7
2015 4 C 1.2
However, when I use df.ffill(), I haven't found a way to restrict/partition the fill by 'var' or 'year'.
My first idea was to convert the data to a pivot table:
pd.pivot_table(data,values='value',index=['year','quarter'],columns='var',aggfunc=np.sum)
and then do the forward fill, but I cannot figure out how to restrict it by year (or how to unpack the pivot table back to its original form).
Any assistance is appreciated!
You basically need your data in a table with time along your row indices and everything else in columns. You can use a pivot table or stack/unstack:
df2 = df.set_index(['year', 'quarter', 'var']).unstack('var')
>>> df2
value
var A B C
year quarter
2015 1 0.1 0.1 NaN
2 0.5 NaN 0.0
3 0.6 NaN 0.7
4 1.0 0.5 1.2
Once the data is in this form, forward fill and then fill the remaining NaNs with zero:
df2 = df2.ffill().fillna(0)
Finally, stack and sort your data, and then reset your index if desired:
>>> df2.stack('var').sort_index(level=2).reset_index()
year quarter var value
0 2015 1 A 0.1
1 2015 2 A 0.5
2 2015 3 A 0.6
3 2015 4 A 1.0
4 2015 1 B 0.1
5 2015 2 B 0.1
6 2015 3 B 0.1
7 2015 4 B 0.5
8 2015 1 C 0.0
9 2015 2 C 0.0
10 2015 3 C 0.7
11 2015 4 C 1.2
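As a variation, if the NaN rows have already been inserted by reindexing (as in the long frame shown in the question), a groupby-based sketch that restricts the fill to each year/var group could be:

# Forward fill within each (year, var) group, then zero-fill the leading gaps.
# Assumes rows are ordered by quarter within each group, as shown above.
df['value'] = df.groupby(['year', 'var'])['value'].ffill().fillna(0)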