Compare two pandas dataframes with different sizes - python

I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris

If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching column. (In this example, Patrik from df1 does not match in df2 because the ages differ, and therefore his Special ability will not merge.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
Name Special ability Age
0 Sara Walk on water 12
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[12,12,11]})
df2
Name Age
0 Sara 12
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 12 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
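If you literally want only the mutual rows (with no NaN fill for the non-matching ones), an inner merge drops them instead. A minimal sketch on the same two dataframes as above:
# keeps only rows whose Name and Age appear in both dataframes
df = df2.merge(df1, on=['Name', 'Age'], how='inner')
print(df)
#    Name  Age Special ability
# 0  Sara   12   Walk on water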

You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G", how="left")
will give you a dataframe with four columns (A, B, G, H), because both key columns are kept when their names differ. Dropping the duplicate key and renaming then gives you the column names you want:
df = df.drop(columns="G")
df.columns = ["A", "B", "C"]
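A compact variant (a sketch on the question's df1/df2) renames df2's key columns up front, so there is nothing to drop or rename afterwards:
# rename G -> A and H -> C before merging, then merge on the shared key
df1 = df1.merge(df2.rename(columns={'G': 'A', 'H': 'C'}), on='A', how='left')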

You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
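One caveat worth knowing: Series.map raises an error when the lookup Series has a duplicated index, so if df2.G could ever contain duplicates, deduplicate first. Which occurrence to keep is an assumption here; pick the rule that fits your data:
# keep='last' is an arbitrary choice -- adjust to your needs
mapping = df2.drop_duplicates('G', keep='last').set_index('G')['H']
df1['C'] = df1['A'].map(mapping)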
Or merge with drop and rename (the chain is wrapped in parentheses so it can span several lines):
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H':'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31

Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to work on the underlying arrays with .values for performance, as done above. Note that this approach assumes df2.G is sorted and that every value of df1.A actually occurs in df2.G; otherwise searchsorted returns insertion points rather than exact matches.
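Putting it together as a self-contained sketch (shortened data from the question, with the sortedness assumption made explicit):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [12, 15, 45, 78, 45, 10]})
df2 = pd.DataFrame({'G': [0, 1, 2], 'H': [15, 45, 31]})

# searchsorted only gives correct positions if df2.G is sorted
assert df2.G.is_monotonic_increasing

idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
print(df1)
#    A   B   C
# 0  0  12  15
# 1  0  15  15
# 2  1  45  45
# 3  1  78  45
# 4  2  45  31
# 5  2  10  31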

Related

Dropping duplicates on one specific column and adding a new column counting the repeated records (pandas)

I have a pandas df like this
student_id   A   B
         1   3  13
         2   4  23
         1   5  12
         4  28  32
         1  38  12
         2  21  14
My desired output:
I want to drop the duplicates according to student_id, keeping the last record/row, and append a new column counting how many rows each student_id had; I also want new columns with the rounded averages of the duplicated rows' entries in A and B:
student_id   A   B  count  average A rounded  average B rounded
         1  38  12      3                 15                 12
         2  21  14      2                 13                 19
         4  28  32      1                 28                 32
You can use named aggregation:
import numpy as np

df.groupby('student_id', as_index=False).agg(
    A=('A', 'last'),
    B=('B', 'last'),
    count=('student_id', 'count'),
    average_A_rounded=('A', lambda x: np.mean(x).round()),
    average_B_rounded=('B', lambda x: np.mean(x).round()),
)
# student_id A B count average_A_rounded average_B_rounded
# 0 1 38 12 3 15 12
# 1 2 21 14 2 12 18
# 2 4 28 32 1 28 32
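Note that the averages come out as 12 and 18 rather than the 13 and 19 in the desired output because NumPy rounds halves to the nearest even integer (banker's rounding); a quick check:
import numpy as np
print(np.round(12.5), np.round(18.5))  # 12.0 18.0 -- halves round to even
The next answer shows how to round half-up instead.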
I see that you want to round the values half-up. So, to extend @tdy's answer:
def round_half_up(x):
    mask = x >= 0
    out = np.empty_like(x)
    out[mask] = np.floor(x[mask] + 0.5)
    out[~mask] = np.ceil(x[~mask] - 0.5)
    return out

df = df.groupby("student_id", as_index=False).agg(
    A=("A", "last"),
    B=("B", "last"),
    count=("A", "count"),
    average_A_rounded=("A", "mean"),
    average_B_rounded=("B", "mean"),
)
print(df.apply(round_half_up).astype(int))
Prints:
student_id A B count average_A_rounded average_B_rounded
0 1 38 12 3 15 12
1 2 21 14 2 13 19
2 4 28 32 1 28 32
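As a quick sanity check of the helper defined above on its own (values chosen to exercise both the positive and negative branches):
import numpy as np
print(round_half_up(np.array([12.5, 18.5, -12.5])))  # [ 13.  19. -13.]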

Extract corresponding df value with reference from another df

There are two dataframes with a 1-to-1 row correspondence. I can retrieve the idxmax of all value columns in df1.
Input:
df1 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1':[76,23,43,34,0,78,34],'value2':[1,45,8,0,76,45,56]})
df2 = pd.DataFrame({'ref':[2,4,6,8,10,12,14],'value1_pair':[0,0,0,0,180,180,90],'value2_pair':[0,0,0,0,90,180,90]})
df=df1.loc[df1.iloc[:,1:].idxmax(), 'ref']
Output: df1, df2 and df
ref value1 value2
0 2 76 1
1 4 23 45
2 6 43 8
3 8 34 0
4 10 0 76
5 12 78 45
6 14 34 56
ref value1_pair value2_pair
0 2 0 0
1 4 0 0
2 6 0 0
3 8 0 0
4 10 180 90
5 12 180 180
6 14 90 90
5 12
4 10
Name: ref, dtype: int64
Now I want to create a df which contains 3 columns
Desired Output df:
ref  max value  corresponding value
 12         78                  180
 10         76                   90
What are the best options to extract the corresponding values from df2?
Your main problem is matching the columns between df1 and df2. Let's rename them properly, melt both dataframes, merge and extract:
(df1.melt('ref')
    .merge(df2.rename(columns={'value1_pair': 'value1',
                               'value2_pair': 'value2'})
              .melt('ref'),
           on=['ref', 'variable'])
    .sort_values('value_x')
    .groupby('variable').last()
)
Output:
ref value_x value_y
variable
value1 12 78 180
value2 10 76 90
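To match the column names of the desired output exactly, a small follow-up (assuming the chained expression above was assigned to a variable, here called out):
out = (out.reset_index(drop=True)
          .rename(columns={'value_x': 'max value', 'value_y': 'corresponding value'}))
print(out)
#    ref  max value  corresponding value
# 0   12         78                  180
# 1   10         76                   90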

Get order of subgroups in pandas dataframe

I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'Name' : ['Kate', 'John', 'Peter','Kate', 'John', 'Peter'],'Distance' : [23,16,32,15,31,26], 'Time' : [3,5,2,7,9,4]})
df
Distance Name Time
0 23 Kate 3
1 16 John 5
2 32 Peter 2
3 15 Kate 7
4 31 John 9
5 26 Peter 4
I want to add a column that tells me, for each Name, what's the order of the times.
I want something like this:
Order Distance Name Time
0 16 John 5
1 31 John 9
0 23 Kate 3
1 15 Kate 7
0 32 Peter 2
1 26 Peter 4
I can do it using a for loop:
df2 = df[df['Name'] == 'aaa'].reset_index().reset_index()  # I did this just to create an empty data frame with the columns I want
for name, row in df.groupby('Name').count().iterrows():
    table = df[df['Name'] == name].sort_values('Time').reset_index().reset_index()
    to_concat = [df2, table]
    df2 = pd.concat(to_concat)
df2.drop('index', axis=1, inplace=True)
df2.columns = ['Order', 'Distance', 'Name', 'Time']
df2
This works, but the problem is that (apart from being very unpythonic) for large tables (my actual table has about 50 thousand rows) it takes about half an hour to run.
Can someone help me write this in a simpler way that runs faster?
I'm sorry if this has been answered somewhere, but I didn't really know how to search for it.
Best,
Use sort_values with cumcount:
df = df.sort_values(['Name','Time'])
df['Order'] = df.groupby('Name').cumcount()
print (df)
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
If you need Order as the first column, use insert:
df = df.sort_values(['Name','Time'])
df.insert(0, 'Order', df.groupby('Name').cumcount())
print (df)
Order Distance Name Time
1 0 16 John 5
4 1 31 John 9
0 0 23 Kate 3
3 1 15 Kate 7
2 0 32 Peter 2
5 1 26 Peter 4
In [67]: df = df.sort_values(['Name','Time']) \
.assign(Order=df.groupby('Name').cumcount())
In [68]: df
Out[68]:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
PS I'm not sure this is the most elegant way to do this...
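If you would rather keep the original row order and just add the column, a hedged alternative on the question's original df uses groupby with rank:
# method='first' breaks ties by position; subtract 1 for a 0-based order
df['Order'] = df.groupby('Name')['Time'].rank(method='first').astype(int) - 1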

Returning new dataframe from column range (Pandas)

I currently have the following dataframe:
df1
3 4 5 6
0 NaN NaN Sea NaN
1 light medium light medium
2 26 41.5 15 14
3 32 40 18 29
4 41 29 19 42
And I am trying to return a new dataframe where only the Sea column and onwards remains:
df1
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
I feel I am very close with my code:
for i in range(len(df.columns)):
    if pd.Series.any(df.iloc[:, i].str.contains(pat="Sea")):
        xyz = df.columns[i]  # This is the piece of code I am having trouble with
        df = df.loc[:, [xyz:??]]
Essentially I would like to return the column index of where the word 'Sea' is contained and then create a new dataframe from that index to the length of the dataframe. Hopefully that explanation makes sense, and any help is appreciated
Step 1: Get the column name:
In [542]: c = df[df == 'Sea'].any().idxmax(); c
Out[542]: '5'
Step 2: Use df.loc to index:
In [544]: df.loc[:, c:]
Out[544]:
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
If df.loc[:, c:] doesn't work, you may want to fall back on a more explicit version (thanks to piRSquared for the simplification):
df.iloc[:, df.columns.get_loc(c):]
Maybe you could write a little rudimentary function to do so.
def match_cut(df, to_match):
    for col in df.columns:
        if df[col].str.match(to_match).any():
            return df.loc[:, col:]
    return pd.DataFrame()
With that being said, cᴏʟᴅsᴘᴇᴇᴅ's answer should be preferred as it avoids column looping like this function.
>>> match_cut(df, 'Sea')
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42
You can try this by using list and index. (Note that .ix has been removed from modern pandas, so use .iloc instead:)
df2.iloc[:, df2.iloc[0, :].tolist().index('Sea'):]
Out[85]:
5 6
0 Sea NaN
1 light medium
2 15 14
3 18 29
4 19 42

pandas - exponentially weighted moving average - similar to excel

Consider a dataframe of 10 rows having two columns A and B, as follows:
A B
0 21 6
1 87 0
2 87 0
3 25 0
4 25 0
5 14 0
6 79 0
7 70 0
8 54 0
9 35 0
In Excel I can calculate the rolling mean like this, excluding the first row (the screenshot of the worksheet formula is omitted here):
How can I do this in pandas?
Here is what I've tried:
import pandas as pd

# copying the dataframe given above and calling read_clipboard will get df populated
df = pd.read_clipboard()
for i in range(1, len(df)):
    df.loc[i, 'B'] = df[['A', 'B']].loc[i-1].mean()
This gives me the desired result, matching Excel. But is there a better pandas way to do it? I've tried using expanding and rolling, but they did not produce the desired result.
You have an exponentially weighted moving average, rather than a simple moving average. That's why pd.DataFrame.rolling didn't work. You might be looking for pd.DataFrame.ewm instead.
Starting from
df
Out[399]:
A B
0 21 6
1 87 0
2 87 0
3 25 0
4 25 0
5 14 0
6 79 0
7 70 0
8 54 0
9 35 0
df['B'] = df["A"].shift().fillna(df["B"]).ewm(com=1, adjust=False).mean()
df
Out[401]:
A B
0 21 6.000000
1 87 13.500000
2 87 50.250000
3 25 68.625000
4 25 46.812500
5 14 35.906250
6 79 24.953125
7 70 51.976562
8 54 60.988281
9 35 57.494141
Even on just ten rows, doing it this way speeds up the code by about a factor of 10 with %timeit (959 µs vs 10.3 ms). On 100 rows, this becomes a factor of 100 (1.1 ms vs 110 ms).
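For reference, com=1 means alpha = 1/(1+com) = 0.5, and with adjust=False the ewm call implements exactly the Excel-style recurrence B[i] = 0.5*A[i-1] + 0.5*B[i-1]. A minimal sketch verifying that on the same df (assuming df['B'] now holds the ewm result from above):
import numpy as np

b = np.empty(len(df))
b[0] = 6.0  # the seed value from the original B column
for i in range(1, len(df)):
    b[i] = 0.5 * df.loc[i - 1, 'A'] + 0.5 * b[i - 1]
assert np.allclose(b, df['B'])  # the explicit loop matches the vectorized ewm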
