Custom Label mapping pandas - python

I have a dataset which looks something like this:
ID CD
1 70
2 51
3 54
4 55
5 57
6 14
I want to map these labels to values like 70->1,(51,54,55)->2,57->3,else 4.
Final dataset would look something like this:
ID CD CD_New
1 70 1
2 51 2
3 54 2
4 55 2
5 57 3
6 14 4
How do achieve this in Pandas?

Use np.select
import numpy as np
df = pd.read_clipboard()
conditions = [df['CD']==70, df['CD'].isin([51,54,55]), df['CD']==57]
choices = [1,2,3]
df['CD_New'] = np.select(conditions, choices, default=4)
df
Results:
ID CD CD_New
0 1 70 1
1 2 51 2
2 3 54 2
3 4 55 2
4 5 57 3
5 6 14 4

Related

Efficiently extract numbers from a column in python

I have a column in pandas dataframe as below:
Manufacture_Id Score Rank
0 S1 93 1
1 S1 91 2
2 S1 86 3
3 S2 88 1
4 S25 73 2
5 S100 72 3
6 S100 34 1
7 S100 24 2
I want to extract the ending numbers from the 'Manufacture_Id' column into a new column as below:
Manufacture_Id Score Rank Id
0 S1 93 1 1
1 S1 91 2 1
2 S1 86 3 1
3 S2 88 1 2
4 S25 73 2 25
5 S100 72 3 100
6 S100 34 1 100
7 S100 24 2 100
I have written the below code which gives the results but it is not efficient when the data becomes big.
test['id'] = test.Manufacture_Id.str.extract(r'(\d+\.\d+|\d+)')
Is there a way to do it efficiently?
Data:
#Ceate dataframe
data = [
["S1",93,1],
["S1",91,2],
["S1",86,3],
["S2",88,1],
["S25",73,2],
["S100",72,3],
["S100",34,1],
["S100",24,2],
]
#dataframe
test = pd.DataFrame(data, columns = ['Manufacture_Id', 'Score', 'Rank'])
Following code will be more efficient than regex.
test["id"] = test['Manufacture_Id'].str[1:].astype(int)
But if the S is not constant then you try following snippet.
test["id"] = test.Manufacture_Id.str.extract('(\d+)').astype(int)

Is it possible to shuffle a dataframe while using while grouping by index in pandas or sklearn? [duplicate]

My dataframe looks like this
sampleID col1 col2
1 1 63
1 2 23
1 3 73
2 1 20
2 2 94
2 3 99
3 1 73
3 2 56
3 3 34
I need to shuffle the dataframe keeping same samples together and the order of the col1 must be same as in above dataframe.
So I need it like this
sampleID col1 col2
2 1 20
2 2 94
2 3 99
3 1 73
3 2 56
3 3 34
1 1 63
1 2 23
1 3 73
How can I do this? If my example is not clear please let me know.
Assuming you want to shuffle by sampleID. First df.groupby, shuffle (import random first), and then call pd.concat:
import random
groups = [df for _, df in df.groupby('sampleID')]
random.shuffle(groups)
pd.concat(groups).reset_index(drop=True)
sampleID col1 col2
0 2 1 20
1 2 2 94
2 2 3 99
3 1 1 63
4 1 2 23
5 1 3 73
6 3 1 73
7 3 2 56
8 3 3 34
You reset the index with df.reset_index(drop=True), but it is an optional step.
I found this to be significantly faster than the accepted answer:
ids = df["sampleID"].unique()
random.shuffle(ids)
df = df.set_index("sampleID").loc[ids].reset_index()
for some reason the pd.concat was the bottleneck in my usecase. Regardless this way you avoid the concatenation.
Just to add one thing to #cs95 answer.
If you want to shuffle by sampleID but you want to have your sampleIDs ordered from 1. So here the sampleID is not that important to keep.
Here is a solution where you have just to iterate over the gourped dataframes and change the sampleID.
groups = [df for _, df in df.groupby('doc_id')]
random.shuffle(groups)
for i, df in enumerate(groups):
df['doc_id'] = i+1
shuffled = pd.concat(groups).reset_index(drop=True)
doc_id sent_id word_id
0 1 1 20
1 1 2 94
2 1 3 99
3 2 1 63
4 2 2 23
5 2 3 73
6 3 1 73
7 3 2 56
8 3 3 34

How to make the first row of each group as sum of other rows in the same group in pandas dataframe?

Let's say I have a Pandas dataframe that looks like this:
A B
0 67 1
1 78 1
2 53 1
3 44 1
4 84 1
5 2 2
6 63 2
7 13 2
8 56 2
9 24 2
My goal is to:
1) group column A based on column B
2) make the first row of each formed group as a result of groupby() a sum of all other rows of this group. In this case, the value in the first row will be overwritten by the sum.
My desired output would be:
A B
0 259 1
1 78 1
2 53 1
3 44 1
4 84 1
5 156 2
6 63 2
7 13 2
8 56 2
9 24 2
So, the first row of group 1 (grouped based on column B), we have 259 in column A because the values, except the very first row, for group 1 are 78+53+44+84 = 259
For group 2, the first row of group 2 is 156 because 63+13+56+24 = 156
I spent days trying to figure out how to do this and I finally surrender, here's hoping someone in this great community will help.
Here is one way:
grp = df.groupby('B')
Method 1 (similar to #Kent deleted answer):
s=grp['A'].transform('sum').sub(df['A'])
idx=grp.head(1).index
df.loc[idx,'A']=s
Method 2:
v= [g.iloc[1:].groupby('B')['A'].sum().iat[0] for _,g in grp]
idx = grp.head(1).index
df.loc[idx,'A'] = v
print(df)
A B
0 259 1
1 78 1
2 53 1
3 44 1
4 84 1
5 156 2
6 63 2
7 13 2
8 56 2
9 24 2

Compare two pandas dataframe with different size

I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I manage to do it with for loops, but the database is massive and the code run really slowly so I am looking for a Pandas-way or numpy to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 NaN
1 Gustaf 12 Walk on water
2 Patrik 11 NaN
This Can allso be done with more than one matching argument: (In this example Patrik from df1 does not exist in df2 becuse they have different ages and therfore will not merge)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
Name Special ability Age
0 Sara Walk on water 12
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 12 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df=df1.merge(df2,left_on="A",right_on="G")
will give you a dataframe with 3 columns, but the third one's name will be H
df.columns=["A","B","C"]
will then give you the column names you want
You can use map by Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Or merge with drop and rename:
df = df1.merge(df2,left_on="A",right_on="G", how='left')
.drop('G', axis=1)
.rename(columns={'H':'C'})
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed in a simpler way with : df2.G.searchsorted(df1.A), but don't think that would be anymore efficient, because we want to use the underlying array with .values for performance as done earlier.

Speeding up duplicating of rows in Pandas groupby?

I have a very large data frame (hundreds of millions of rows). There are two group ID's, group_id_1 and group_id_2. The data frame looks like this:
group_id_1 group_id_2 value1 time
1 2 45 1
1 2 49 2
1 4 95 1
1 4 55 2
2 2 44 1
2 4 88 1
2 4 90 2
For each group_id_1 x group_id_2 combo, I need to duplicate the row with the latest time, and increment the time by one. In other words, my table should look like:
group_id_1 group_id_2 value1 time
1 2 45 1
1 2 49 2
1 2 49 3
1 4 95 1
1 4 55 2
1 4 55 3
2 2 44 1
2 2 44 2
2 4 88 1
2 4 90 2
2 4 90 3
Right now, I am doing:
for name, group in df.groupby(['group_id_1', 'group_id_2']):
last, = group.sort_values(by='time').tail(1)['time'].values
temp = group[group['time']==last]
temp.loc[:, 'time'] = last + 1
group = group.append(temp)
This is insanely inefficient. If I put the above code into a function, and use the .apply() method with the groupby object, it also takes an enormous amount of time.
How do I speed this process up?
You can use groupby with aggregate last, add time by add and concat to original:
df1 = df.sort_values(by='time').groupby(['group_id_1', 'group_id_2']).last().reset_index()
df1.time = df1.time.add(1)
print (df1)
group_id_1 group_id_2 value1 time
0 1 2 49 3
1 1 4 55 3
2 2 2 44 2
3 2 4 90 3
df = pd.concat([df,df1])
df = df.sort_values(['group_id_1','group_id_2']).reset_index(drop=True)
print (df)
group_id_1 group_id_2 value1 time
0 1 2 45 1
1 1 2 49 2
2 1 2 49 3
3 1 4 95 1
4 1 4 55 2
5 1 4 55 3
6 2 2 44 1
7 2 2 44 2
8 2 4 88 1
9 2 4 90 2
10 2 4 90 3
First, sort the dataframe by time (this should be more efficient than sorting each group by time):
df = df.sort_values('time')
Second, get the last row in each group (without sorting the groups to improve performance):
last = df.groupby(['group_id_1', 'group_id_2'], sort=False).last()
Third, increment the time:
last['time'] = last['time'] + 1
Fourth, concatenate:
df = pd.concat([df, last])
Fifth, sort back to the original order:
df = df.sort_values(['group_id_1', 'group_id_2'])
Explanation: concatenating and then sorting will be much faster than inserting rows one by one.

Categories