How to create pairs of column names based on a condition? - python

I have the following DataFrame df:
df =
  min(arc)  max(arc)  min(gbm)_p1  max(gbm)_p1
         1        10            2            5
         0        11            1            6
How can I calculate the difference between pairs of max and min columns?
Expected result:
  diff(arc)  diff(gbm)_p1
          9             3
         11             5
I assume that apply(lambda x: ...) should be used to calculate the differences row-wise, but how can I create the pairs of columns? In my case, I should only calculate the difference between columns that share the same base name, e.g. ...(arc) or ...(gbm)_p1. Note that the min and max prefixes always appear at the beginning of the column names.

The idea is to filter the DataFrame twice with DataFrame.filter, using a regex where ^ matches the start of the string, and then rename the columns so that both filtered frames share the same column names and can be subtracted:
df1 = df.filter(regex='^min').rename(columns=lambda x: x.replace('min', 'diff'))
df2 = df.filter(regex='^max').rename(columns=lambda x: x.replace('max', 'diff'))
df = df2.sub(df1)
print (df)
   diff(arc)  diff(gbm)_p1
0          9             3
1         11             5
EDIT:
print (df)
    id  min(arc)  max(arc)  min(gbm)_p1  max(gbm)_p1
0  123         1        10            2            5
1  546         0        11            1            6
df1 = df.filter(regex='^min').rename(columns=lambda x: x.replace('min', 'diff'))
df2 = df.filter(regex='^max').rename(columns=lambda x: x.replace('max', 'diff'))
df = df[['id']].join(df2.sub(df1))
print (df)
    id  diff(arc)  diff(gbm)_p1
0  123          9             3
1  546         11             5
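For reuse, the same idea can be wrapped in a small helper. This is a minimal sketch under the same assumption that the prefixes sit at the start of the column names; the function name pair_diff is mine, not part of the answer above:
import pandas as pd

def pair_diff(df, lo='min', hi='max', out='diff'):
    # filter the lo/hi columns, rename both sides to a common name, subtract
    lo_df = df.filter(regex=f'^{lo}').rename(columns=lambda c: c.replace(lo, out))
    hi_df = df.filter(regex=f'^{hi}').rename(columns=lambda c: c.replace(hi, out))
    return hi_df.sub(lo_df)

df = pd.DataFrame({'min(arc)': [1, 0], 'max(arc)': [10, 11],
                   'min(gbm)_p1': [2, 1], 'max(gbm)_p1': [5, 6]})
print (pair_diff(df))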

Related

Pandas: Groupby.transform -> assign specific values to column

In general terms, is there a way to assign specific values to a column via groupby.transform(), where the size of each group is known in advance?
For example:
df = pd.DataFrame(data = {'A':[10,10,20,20],'B':['abc','def','ghi','jkl'],'GroupID':[1,1,2,2]})
funcDict = {'A':'sum','B':['specific_val_1', 'specific_val_2']}
df = df.groupby('GroupID').transform(funcDict)
where the result would be:
index  A   B
1      20  specific_val_1
2      20  specific_val_2
3      40  specific_val_1
4      40  specific_val_2
transform cannot accept a dict, so we can aggregate first and then build the combinations with a cross merge:
out = df.groupby('GroupID',as_index=False)[['A']].sum()
out = out.merge(pd.DataFrame({'B':['specific_val_1', 'specific_val_2']}),how='cross')
Out[90]:
   GroupID   A               B
0        1  20  specific_val_1
1        1  20  specific_val_2
2        2  40  specific_val_1
3        2  40  specific_val_2
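If you would rather stay inside groupby, a possible alternative is to combine transform('sum') with cumcount. This is a sketch assuming each group has exactly as many rows as there are specific values, as in the question:
import pandas as pd

df = pd.DataFrame(data={'A': [10, 10, 20, 20],
                        'B': ['abc', 'def', 'ghi', 'jkl'],
                        'GroupID': [1, 1, 2, 2]})
# broadcast the group sum of A back onto every row
df['A'] = df.groupby('GroupID')['A'].transform('sum')
# cumcount numbers rows 0, 1, ... within each group; that position
# picks the specific value (assumes every group has len(vals) rows)
vals = ['specific_val_1', 'specific_val_2']
df['B'] = df.groupby('GroupID').cumcount().map(dict(enumerate(vals)))
print (df)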

Update dataframe column based on another dataframe column without for loop

I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val'] / df2['val'] for matching ids. I can do that by iterating over all df2 rows; df2 is a subset of df1, so some ids may be missing from it, and I want to keep their values unchanged. This is what I have right now:
for row in df2.iterrows():
    df1.loc[df1['id'] == row[1]['id'], 'val'] /= row[1]['val']
df1:
id val
1 5
2 5
3 78
How can I achieve the same without using for loop to improve speed?
Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
   id   val
0   1   5.0
1   2   5.0
2   3  78.0
A solution with merge and a left join:
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
   id   val
0   1   5.0
1   2   5.0
2   3  78.0
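For intuition, it may help to print the intermediate Series that map produces: ids present in df2 get their divisor, ids missing from df2 get NaN, and fill_value=1 is what turns those NaN divisors into 1. A minimal sketch:
divisors = df1['id'].map(df2.set_index('id')['val'])
print (divisors.tolist())
# [5.0, 8.0, nan] -> the NaN for id 3 means "divide by 1, i.e. leave unchanged"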

How to check consecutive same value and count of value occur same time pandas

In DataFrame column B, check which values occur in consecutive runs and how many times each of them repeats.
df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 11, 11, 11, 11, 11, 15, 16, 16]})
df_out = pd.DataFrame({'value': [11],
                       'values count': [5]})
First keep only the consecutive groups: compare each value with its shifted neighbour via Series.ne and Series.shift, take the cumulative sum to label each run, keep every member of a repeated run with Series.duplicated, and finally count with GroupBy.size:
df1 = (df[df.B.ne(df.B.shift()).cumsum().duplicated(keep=False)]
         .groupby(df['B'].rename('value'))
         .size()
         .reset_index(name='count'))
print (df1)
   value  count
0     11      5
Details:
print (df[df.B.ne(df.B.shift()).cumsum().duplicated(keep=False)])
    A   B
4   8  11
5  11  11
6   1  11
7  15  11
8  20  11
Or use Series.value_counts for counts:
df2 = (df.loc[df.B.ne(df.B.shift()).cumsum().duplicated(keep=False), 'B']
         .value_counts()
         .rename_axis('value')
         .reset_index(name='count'))
print (df2)
   value  count
0     11      5
It seems the input data was changed after an edit, so the trailing 16s create a new group:
df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 11, 11, 11, 11, 11, 15, 16, 16]})
df1 = (df[df.B.ne(df.B.shift()).cumsum().duplicated(keep=False)]
         .groupby(df['B'].rename('value'))
         .size()
         .reset_index(name='count'))
print (df1)
   value  count
0     11      5
1     16      2
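For intuition, the run labels produced by the ne/shift/cumsum step can be printed directly; rows whose label repeats belong to a run longer than one. A minimal sketch on the data above:
labels = df.B.ne(df.B.shift()).cumsum()
print (labels.tolist())
# [1, 2, 3, 4, 5, 5, 5, 5, 5, 6, 7, 7] -> labels 5 (the 11s) and 7 (the 16s) repeat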

Is there a way to create multiple dataframes from single dataframe using pandas [duplicate]

I have a DataFrame with a column Sales.
How can I split it into 2 based on Sales value?
First DataFrame will have data with 'Sales' < s and second with 'Sales' >= s
You can use boolean indexing:
df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
print (df)
   A  Sales
0  3     10
1  4     20
2  7     30
3  6     40
4  1     50
s = 30
df1 = df[df['Sales'] >= s]
print (df1)
   A  Sales
2  7     30
3  6     40
4  1     50
df2 = df[df['Sales'] < s]
print (df2)
   A  Sales
0  3     10
1  4     20
It's also possible to invert the mask with ~:
mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
   A  Sales
2  7     30
3  6     40
4  1     50
print (df2)
   A  Sales
0  3     10
1  4     20
print (mask)
0    False
1    False
2     True
3     True
4     True
Name: Sales, dtype: bool
print (~mask)
0     True
1     True
2    False
3    False
4    False
Name: Sales, dtype: bool
Using groupby, you could split into two dataframes in one line. Note that groupby sorts its keys and False sorts before True, so the first unpacked frame is the one where the condition Sales < 30 is False, i.e. the Sales >= 30 rows:
In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]
In [1048]: df1
Out[1048]:
   A  Sales
2  7     30
3  6     40
4  1     50
In [1049]: df2
Out[1049]:
   A  Sales
0  3     10
1  4     20
Using groupby and a list comprehension:
Store all of the split dataframes in a list variable and access each separated dataframe by its index.
DF = pd.DataFrame({'chr':["chr3","chr3","chr7","chr6","chr1"],'pos':[10,20,30,40,50],})
ans = [y for x, y in DF.groupby('chr')]
Access the separated DFs like this:
ans[0]
ans[1]
ans[len(ans) - 1]  # this is the last separated DF
Access a column of a separated DF like this:
ans_i_chr = ans[i].chr
One-liner using the walrus operator (Python 3.8+):
df1, df2 = df[(mask:=df['Sales'] >= 30)], df[~mask]
Consider using copy to avoid SettingWithCopyWarning:
df1, df2 = df[(mask:=df['Sales'] >= 30)].copy(), df[~mask].copy()
Alternatively, you can use the method query:
df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()
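If the threshold lives in a variable, query can reference local variables with the @ prefix, for example:
s = 30
df1, df2 = df.query('Sales >= @s').copy(), df.query('Sales < @s').copy()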
I like to use this for speeding up searches or rolling-average .apply(lambda x: ...) style functions, so I split big files into dictionaries of dataframes:
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}
This should do it if you want to split based on categorical groups.
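Hypothetical usage of the resulting dict: each key is one unique Sales value mapping to the matching rows, so for the sample data above:
print (df_dict[30])   # the rows where Sales == 30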

Selecting columns from two dataframes according to another column

I have 2 dataframes: one of them contains some general information about football players, and the second contains other information, like winning matches, for each player. They both have an "id" column; however, they are not the same length.
What I want to do is create a new dataframe which contains 2 columns: "x" from the first dataframe and "y" from the second, ONLY where the "id" column contains the same value in both dataframes. That way I can match the "x" and "y" columns which belong to the same person.
I tried to do it using the concat function:
pd.concat([firstdataframe['x'], seconddataframe['y']], axis=1, keys=['x', 'y'])
But I couldn't work out how to apply the condition that "id" is equal in both dataframes.
It seems you need merge with the default inner join; note this assumes each value in the id columns is unique:
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
Sample:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
   id  x
0   1  4
1   2  3
2   3  8
df2 = pd.DataFrame({'id':[1,2],'y':[7,0]})
print (df2)
   id  y
0   1  7
1   2  0
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
   id  x  y
0   1  4  7
1   2  3  0
A solution with concat is possible, but a bit more complicated, because it needs an inner join on the indexes:
df = (pd.concat([df1.set_index('id')['x'],
                 df2.set_index('id')['y']], axis=1, join='inner')
        .reset_index())
print (df)
   id  x  y
0   1  4  7
1   2  3  0
EDIT:
If the ids are not unique, duplicates create all combinations and the output dataframe is expanded:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
   id  x
0   1  4
1   2  3
2   3  8
df2 = pd.DataFrame({'id':[1,2,1,1],'y':[7,0,4,2]})
print (df2)
   id  y
0   1  7
1   2  0
2   1  4
3   1  2
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
   id  x  y
0   1  4  7
1   1  4  4
2   1  4  2
3   2  3  0
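As a side note, merge can enforce the uniqueness assumption for you via its validate argument; with the duplicated ids above, validate='one_to_one' raises a MergeError instead of silently expanding the output:
# raises pandas.errors.MergeError because id is duplicated in df2
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id', validate='one_to_one')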
