Extract corresponding df value with reference from another df - python

There are two dataframes with a 1-to-1 row correspondence. I can retrieve the idxmax from all value columns in df1.
Input:
df1 = pd.DataFrame({'ref': [2, 4, 6, 8, 10, 12, 14],
                    'value1': [76, 23, 43, 34, 0, 78, 34],
                    'value2': [1, 45, 8, 0, 76, 45, 56]})
df2 = pd.DataFrame({'ref': [2, 4, 6, 8, 10, 12, 14],
                    'value1_pair': [0, 0, 0, 0, 180, 180, 90],
                    'value2_pair': [0, 0, 0, 0, 90, 180, 90]})
df = df1.loc[df1.iloc[:, 1:].idxmax(), 'ref']
Output: df1, df2 and df
ref value1 value2
0 2 76 1
1 4 23 45
2 6 43 8
3 8 34 0
4 10 0 76
5 12 78 45
6 14 34 56
ref value1_pair value2_pair
0 2 0 0
1 4 0 0
2 6 0 0
3 8 0 0
4 10 180 90
5 12 180 180
6 14 90 90
5 12
4 10
Name: ref, dtype: int64
Now I want to create a df which contains 3 columns
Desired Output df:
ref max value corresponding value
12 78 180
10 76 90
What are the best options to extract the corresponding values from df2?

Your main problem is matching the columns between df1 and df2. Let's rename them properly, melt both dataframes, merge and extract:
(df1.melt('ref')
    .merge(df2.rename(columns={'value1_pair': 'value1',
                               'value2_pair': 'value2'})
              .melt('ref'),
           on=['ref', 'variable'])
    .sort_values('value_x')
    .groupby('variable').last()
)
Output:
ref value_x value_y
variable
value1 12 78 180
value2 10 76 90
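Since the two frames share the same index (the rows correspond 1-to-1), another option is to look up df2 directly by the idxmax labels. A minimal sketch, assuming the sample data from the question:
import pandas as pd

df1 = pd.DataFrame({'ref': [2, 4, 6, 8, 10, 12, 14],
                    'value1': [76, 23, 43, 34, 0, 78, 34],
                    'value2': [1, 45, 8, 0, 76, 45, 56]})
df2 = pd.DataFrame({'ref': [2, 4, 6, 8, 10, 12, 14],
                    'value1_pair': [0, 0, 0, 0, 180, 180, 90],
                    'value2_pair': [0, 0, 0, 0, 90, 180, 90]})

rows = []
for col in ['value1', 'value2']:
    i = df1[col].idxmax()  # row label of this column's maximum
    rows.append({'ref': df1.at[i, 'ref'],
                 'max value': df1.at[i, col],
                 'corresponding value': df2.at[i, f'{col}_pair']})
print(pd.DataFrame(rows))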

Related

How to calculate Frequency for a column where consists integers and blanks

I have a CSV file like the one below; it has 10,000+ rows:
ID Ref_r
235R23 3
56982B 3
62C879 blank
625478 11
9S4284 11
985U12 11
524555 58
99L852 60
1024T4 58
102W49 3
258q34 blank
.....
I'd like to calculate the frequency for column Ref_r (values 1 to 99), which consists of integers and blanks:
df = pd.DataFrame(data)
df1 = df['Ref_r'].value_counts()
However, it doesn't work...
The expected result would be:
Ref_r Frequency
1 0
2 0
3 3
...
11 3
...
58 2
59 0
60 1
99 0
blank 2
IIUC, you can use pandas with:
import pandas as pd
(pd.read_csv('input.csv', sep=r'\s+', dtype='str')['Ref_r']  # read as strings so 'blank' survives
 .value_counts(sort=False)
 .reindex(list(map(str, range(1, 100))) + ['blank'], fill_value=0)  # full 1-99 range plus 'blank'
 .rename_axis('Ref_r')
 .reset_index(name='Frequency')
 .to_csv('out.csv', index=False, sep='\t')
)
Example output:
Ref_r Frequency
1 0
2 0
3 3
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 3
12 0
...
98 0
99 0
blank 2
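For reference, here is a minimal in-memory sketch of the same idea (the sample Series below is hypothetical, rebuilt from the rows shown in the question); reading everything as strings is what lets the literal 'blank' entries survive value_counts:
import pandas as pd

s = pd.Series(['3', '3', 'blank', '11', '11', '11', '58', '60', '58', '3', 'blank'])
freq = (s.value_counts(sort=False)
         .reindex([str(n) for n in range(1, 100)] + ['blank'], fill_value=0)
         .rename_axis('Ref_r')
         .reset_index(name='Frequency'))
print(freq[freq['Frequency'] > 0])  # non-zero rows: 3 -> 3, 11 -> 3, 58 -> 2, 60 -> 1, blank -> 2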

Groupby of different columns with different aggregation, with cumsum for the next date

I have a dataframe sorted by date and time:
ID Date Time A B C
abc 06/Feb 11 12 12 10
abc 06/Feb 12 14 13 5
xyz 07/Feb 1 16 14 50
xyz 07/Feb 2 18 15 0
xyz 07/Feb 3 20 16 10
I want to group it by ID and Date and get the sum as the numerator and the count as the denominator, but for the next date the sum should be the cumulative sum over previous dates, and likewise the count should be cumulative; three more columns holding the last value of the A, B and C columns should also be added, such as:
ID Date A_Num A_denom B_Num B_Denom C_Num C_Denom A_Last B_Last C_Last
abc 06/Feb 26 2 25 2 15 2 14 13 5
xyz 07/Feb 54 3 45 3 60 3 20 16 10
I am not able to perform all of this in one go. Can anyone help me with this? Thanks in advance.
Now I want to add df2 to df1 according to ID, as:
ID Date A_Num A_denom B_Num B_Denom C_Num C_Denom A_Last B_Last C_Last
abc 06/Feb 52 4 50 4 30 4 14 13 5
xyz 07/Feb 108 6 90 6 120 6 20 16 10
You can aggregate sum and size per group with GroupBy.agg, rename the aggregates to Num and denom, and apply a cumulative sum; then concat another DataFrame created by aggregating GroupBy.last:
cols = ['A','B','C']
print (df[cols].dtypes)
A int64
B int64
C int64
dtype: object
d = {'sum': 'Num', 'size': 'denom'}
df1 = (df.groupby(['ID', 'Date'])[cols]
         .agg(['sum', 'size'])   # per-group sum and row count
         .rename(columns=d)
         .cumsum())              # running totals across the ordered groups
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')  # flatten the MultiIndex columns
df2 = df.groupby(['ID', 'Date'])[cols].last().add_suffix('_Last')
df3 = pd.concat([df1, df2], axis=1).reset_index()
print (df3)
ID Date A_Num A_denom B_Num B_denom C_Num C_denom A_Last \
0 abc 06/Feb 26 2 25 2 15 2 14
1 xyz 07/Feb 80 5 70 5 75 5 20
B_Last C_Last
0 13 5
1 16 10
To write it to a file without the index, use:
df3.to_csv('file', index=False)
If there is no .reset_index in the solution, use:
df3.to_csv('file')
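To run the snippet above end-to-end, the sample frame from the question can be rebuilt like this (a sketch of the data exactly as shown):
import pandas as pd

df = pd.DataFrame({'ID': ['abc', 'abc', 'xyz', 'xyz', 'xyz'],
                   'Date': ['06/Feb', '06/Feb', '07/Feb', '07/Feb', '07/Feb'],
                   'Time': [11, 12, 1, 2, 3],
                   'A': [12, 14, 16, 18, 20],
                   'B': [12, 13, 14, 15, 16],
                   'C': [10, 5, 50, 0, 10]})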

Is it possible to shuffle a dataframe while grouping by index in pandas or sklearn? [duplicate]

My dataframe looks like this
sampleID col1 col2
1 1 63
1 2 23
1 3 73
2 1 20
2 2 94
2 3 99
3 1 73
3 2 56
3 3 34
I need to shuffle the dataframe while keeping the rows of each sample together, and the order of col1 within each sample must stay the same as in the dataframe above.
So I need it like this:
sampleID col1 col2
2 1 20
2 2 94
2 3 99
3 1 73
3 2 56
3 3 34
1 1 63
1 2 23
1 3 73
How can I do this? If my example is not clear please let me know.
Assuming you want to shuffle by sampleID: first df.groupby, then shuffle the groups (import random first), and finally call pd.concat:
import random
groups = [df for _, df in df.groupby('sampleID')]
random.shuffle(groups)
pd.concat(groups).reset_index(drop=True)
sampleID col1 col2
0 2 1 20
1 2 2 94
2 2 3 99
3 1 1 63
4 1 2 23
5 1 3 73
6 3 1 73
7 3 2 56
8 3 3 34
You can reset the index with df.reset_index(drop=True), but it is an optional step.
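If the shuffle needs to be reproducible, a hedged variant is a seeded random.Random instance (the seed 42 is arbitrary; df is the frame from the question):
import random

import pandas as pd

rng = random.Random(42)  # fixed seed => the same group order every run
groups = [g for _, g in df.groupby('sampleID')]
rng.shuffle(groups)
shuffled = pd.concat(groups).reset_index(drop=True)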
I found this to be significantly faster than the accepted answer:
ids = df["sampleID"].unique()
random.shuffle(ids)
df = df.set_index("sampleID").loc[ids].reset_index()
for some reason the pd.concat was the bottleneck in my usecase. Regardless this way you avoid the concatenation.
Just to add one thing to @cs95's answer: if you want to shuffle by sampleID but also want the sampleIDs renumbered from 1 (so the original IDs are not important to keep), you can iterate over the grouped dataframes and reassign the ID:
import random

groups = [df for _, df in df.groupby('doc_id')]  # column names follow this answer's own data
random.shuffle(groups)
for i, df in enumerate(groups):
    df['doc_id'] = i + 1  # renumber the groups from 1
shuffled = pd.concat(groups).reset_index(drop=True)
doc_id sent_id word_id
0 1 1 20
1 1 2 94
2 1 3 99
3 2 1 63
4 2 2 23
5 2 3 73
6 3 1 73
7 3 2 56
8 3 3 34
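As another sketch that avoids concatenation entirely, you can assign each sampleID a random rank and do a single stable sort (df as in the question; the _rank helper column is hypothetical):
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, arbitrary seed
order = rng.permutation(df['sampleID'].unique())
rank = {sid: r for r, sid in enumerate(order)}
shuffled = (df.assign(_rank=df['sampleID'].map(rank))
              .sort_values('_rank', kind='mergesort')  # mergesort is stable: col1 order is preserved
              .drop(columns='_rank')
              .reset_index(drop=True))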

How to make the first row of each group as sum of other rows in the same group in pandas dataframe?

Let's say I have a Pandas dataframe that looks like this:
A B
0 67 1
1 78 1
2 53 1
3 44 1
4 84 1
5 2 2
6 63 2
7 13 2
8 56 2
9 24 2
My goal is to:
1) group column A based on column B
2) overwrite the first row of each group formed by groupby() with the sum of all the other rows of that group, so the value in the first row is replaced by the sum.
My desired output would be:
A B
0 259 1
1 78 1
2 53 1
3 44 1
4 84 1
5 156 2
6 63 2
7 13 2
8 56 2
9 24 2
So, in the first row of group 1 (grouped on column B), column A holds 259 because the group's values, excluding that first row, are 78+53+44+84 = 259.
For group 2, the first row holds 156 because 63+13+56+24 = 156.
I spent days trying to figure out how to do this and finally surrender; here's hoping someone in this great community will help.
Here is one way:
grp = df.groupby('B')
Method 1 (similar to #Kent deleted answer):
s = grp['A'].transform('sum').sub(df['A'])  # per row: group total minus the row itself
idx = grp.head(1).index                     # index of the first row of each group
df.loc[idx, 'A'] = s
Method 2:
v = [g['A'].iloc[1:].sum() for _, g in grp]  # sum of each group excluding its first row
idx = grp.head(1).index
df.loc[idx, 'A'] = v
print(df)
A B
0 259 1
1 78 1
2 53 1
3 44 1
4 84 1
5 156 2
6 63 2
7 13 2
8 56 2
9 24 2
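For reference, the sample frame can be rebuilt like this so both methods run as-is:
import pandas as pd

df = pd.DataFrame({'A': [67, 78, 53, 44, 84, 2, 63, 13, 56, 24],
                   'B': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]})
grp = df.groupby('B')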

Compare two pandas dataframe with different size

I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller, like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or numpy way to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching column. (In this example, Patrik from df1 does not exist in df2 because the ages differ, so his row will not merge.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
Name Special ability Age
0 Sara Walk on water 12
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[12,12,11]})
df2
Name Age
0 Sara 12
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 12 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G")
This gives you a dataframe with four columns, because both join keys are kept. Dropping G and renaming H will then give you the column names you want:
df = df.drop(columns="G").rename(columns={"H": "C"})
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
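One caveat: if a value of A has no counterpart in G, map leaves NaN in C. A small sketch for that case (the -1 sentinel is an arbitrary choice):
df1['C'] = df1['A'].map(df2.set_index('G')['H']).fillna(-1).astype(int)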
Or merge with drop and rename:
df = (df1.merge(df2, left_on='A', right_on='G', how='left')
         .drop('G', axis=1)
         .rename(columns={'H': 'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach:
import numpy as np

idx = np.searchsorted(df2.G.values, df1.A.values)  # position of each A value among the sorted G keys
df1['C'] = df2.H.values[idx]
Note this assumes df2.G is sorted and contains every value of df1.A. idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to work on the underlying arrays via .values for performance, as done above.
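If df2.G is not already sorted, a hedged variant of the same idea uses np.searchsorted's sorter argument (an argsort over the keys) so the lookup stays valid:
import numpy as np

order = np.argsort(df2.G.values)                                 # sort order of the keys
pos = np.searchsorted(df2.G.values, df1.A.values, sorter=order)  # positions in the sorted view
df1['C'] = df2.H.values[order[pos]]                              # map back to original row positions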
