Efficiently extract numbers from a column in Python

I have a column in a pandas dataframe, as below:
  Manufacture_Id  Score  Rank
0             S1     93     1
1             S1     91     2
2             S1     86     3
3             S2     88     1
4            S25     73     2
5           S100     72     3
6           S100     34     1
7           S100     24     2
I want to extract the ending numbers from the 'Manufacture_Id' column into a new column as below:
  Manufacture_Id  Score  Rank   Id
0             S1     93     1    1
1             S1     91     2    1
2             S1     86     3    1
3             S2     88     1    2
4            S25     73     2   25
5           S100     72     3  100
6           S100     34     1  100
7           S100     24     2  100
I have written the below code which gives the results but it is not efficient when the data becomes big.
test['id'] = test.Manufacture_Id.str.extract(r'(\d+\.\d+|\d+)')
Is there a way to do it efficiently?
Data:
import pandas as pd

# Create dataframe
data = [
    ["S1", 93, 1],
    ["S1", 91, 2],
    ["S1", 86, 3],
    ["S2", 88, 1],
    ["S25", 73, 2],
    ["S100", 72, 3],
    ["S100", 34, 1],
    ["S100", 24, 2],
]
test = pd.DataFrame(data, columns=['Manufacture_Id', 'Score', 'Rank'])

The following code will be more efficient than regex:
test["id"] = test['Manufacture_Id'].str[1:].astype(int)
But if the leading 'S' is not constant, then you can try the following snippet:
test["id"] = test.Manufacture_Id.str.extract(r'(\d+)', expand=False).astype(int)
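As a further option (my own sketch, not from the answers above): when every value is a short Python string, a plain list comprehension is often competitive with the `.str` accessor, since pandas string methods also loop in Python under the hood.

```python
import pandas as pd

test = pd.DataFrame(
    [["S1", 93, 1], ["S2", 88, 1], ["S25", 73, 2], ["S100", 72, 3]],
    columns=["Manufacture_Id", "Score", "Rank"],
)

# Plain Python loop over the strings; each ID is sliced past its first
# character and parsed as an integer.
test["Id"] = [int(x[1:]) for x in test["Manufacture_Id"]]
```

Like the `.str[1:]` answer, this assumes exactly one leading character; fall back to the regex version when the prefix length varies.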

Custom Label mapping pandas

I have a dataset which looks something like this:
ID  CD
 1  70
 2  51
 3  54
 4  55
 5  57
 6  14
I want to map these labels to values like 70->1,(51,54,55)->2,57->3,else 4.
Final dataset would look something like this:
ID  CD  CD_New
 1  70       1
 2  51       2
 3  54       2
 4  55       2
 5  57       3
 6  14       4
How do I achieve this in pandas?
Use np.select
import numpy as np
import pandas as pd

df = pd.read_clipboard()
conditions = [df['CD']==70, df['CD'].isin([51,54,55]), df['CD']==57]
choices = [1,2,3]
df['CD_New'] = np.select(conditions, choices, default=4)
df
Results:
   ID  CD  CD_New
0   1  70       1
1   2  51       2
2   3  54       2
3   4  55       2
4   5  57       3
5   6  14       4
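A dict-based alternative sketch (my own variant, not part of the answer above): Series.map with a fillna default expresses the same mapping without building condition lists.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                   "CD": [70, 51, 54, 55, 57, 14]})

# Explicit value -> label dictionary; values missing from the dict map
# to NaN, which fillna then replaces with the default label 4.
mapping = {70: 1, 51: 2, 54: 2, 55: 2, 57: 3}
df["CD_New"] = df["CD"].map(mapping).fillna(4).astype(int)
```

np.select scales better when the conditions are genuine boolean expressions rather than membership tests.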

Is it possible to shuffle a dataframe while grouping by index in pandas or sklearn? [duplicate]

My dataframe looks like this
sampleID  col1  col2
       1     1    63
       1     2    23
       1     3    73
       2     1    20
       2     2    94
       2     3    99
       3     1    73
       3     2    56
       3     3    34
I need to shuffle the dataframe keeping the same samples together, and the order of col1 within each sample must stay the same as in the dataframe above.
So I need it like this
sampleID  col1  col2
       2     1    20
       2     2    94
       2     3    99
       3     1    73
       3     2    56
       3     3    34
       1     1    63
       1     2    23
       1     3    73
How can I do this? If my example is not clear please let me know.
Assuming you want to shuffle by sampleID: first df.groupby, then shuffle the list of groups (import random first), and finally pd.concat:
import random
groups = [df for _, df in df.groupby('sampleID')]
random.shuffle(groups)
pd.concat(groups).reset_index(drop=True)
   sampleID  col1  col2
0         2     1    20
1         2     2    94
2         2     3    99
3         1     1    63
4         1     2    23
5         1     3    73
6         3     1    73
7         3     2    56
8         3     3    34
The index is reset with reset_index(drop=True), but that step is optional.
I found this to be significantly faster than the accepted answer:
import random

ids = df["sampleID"].unique()
random.shuffle(ids)
df = df.set_index("sampleID").loc[ids].reset_index()
For some reason pd.concat was the bottleneck in my use case; regardless, this way you avoid the concatenation.
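A reproducible variant of the same approach (a sketch; the seeded generator is my addition, not part of the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sampleID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "col1": [1, 2, 3] * 3,
                   "col2": [63, 23, 73, 20, 94, 99, 73, 56, 34]})

# Shuffle the unique IDs with a seeded generator for reproducibility;
# .loc on the non-unique index pulls each group's rows in stored order.
rng = np.random.default_rng(0)
ids = rng.permutation(df["sampleID"].unique())
df = df.set_index("sampleID").loc[ids].reset_index()
```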
Just to add one thing to cs95's answer.
If you want to shuffle by sampleID but also want the resulting sampleIDs renumbered from 1 (so the original IDs are not important to keep), here is a solution where you just iterate over the grouped dataframes and change the sampleID:
groups = [df for _, df in df.groupby('sampleID')]
random.shuffle(groups)
for i, df in enumerate(groups):
    df['sampleID'] = i + 1
shuffled = pd.concat(groups).reset_index(drop=True)
   sampleID  col1  col2
0         1     1    20
1         1     2    94
2         1     3    99
3         2     1    63
4         2     2    23
5         2     3    73
6         3     1    73
7         3     2    56
8         3     3    34
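If the goal is only to renumber the shuffled groups from 1, pd.factorize can replace the explicit loop. A sketch of this alternative, using the question's sampleID data:

```python
import random
import pandas as pd

df = pd.DataFrame({"sampleID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "col1": [1, 2, 3] * 3,
                   "col2": [63, 23, 73, 20, 94, 99, 73, 56, 34]})

groups = [g for _, g in df.groupby("sampleID")]
random.shuffle(groups)
shuffled = pd.concat(groups).reset_index(drop=True)

# factorize assigns 0, 1, 2, ... in order of first appearance, so
# adding 1 renumbers the shuffled groups as 1..n without a loop.
shuffled["sampleID"] = pd.factorize(shuffled["sampleID"])[0] + 1
```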

How to make the first row of each group as sum of other rows in the same group in pandas dataframe?

Let's say I have a Pandas dataframe that looks like this:
    A  B
0  67  1
1  78  1
2  53  1
3  44  1
4  84  1
5   2  2
6  63  2
7  13  2
8  56  2
9  24  2
My goal is to:
1) group column A based on column B
2) make the first row of each resulting group the sum of all the other rows in that group; the value in the first row is overwritten by the sum.
My desired output would be:
     A  B
0  259  1
1   78  1
2   53  1
3   44  1
4   84  1
5  156  2
6   63  2
7   13  2
8   56  2
9   24  2
So, in the first row of group 1 (grouped on column B) we have 259 in column A, because the values in the remaining rows of group 1 are 78+53+44+84 = 259.
For group 2, the first row is 156 because 63+13+56+24 = 156.
I spent days trying to figure out how to do this and I finally surrender, here's hoping someone in this great community will help.
Here is one way:
grp = df.groupby('B')
Method 1 (similar to Kent's deleted answer):
s = grp['A'].transform('sum').sub(df['A'])
idx = grp.head(1).index
df.loc[idx, 'A'] = s
Method 2:
v = [g.iloc[1:].groupby('B')['A'].sum().iat[0] for _, g in grp]
idx = grp.head(1).index
df.loc[idx, 'A'] = v
print(df)
     A  B
0  259  1
1   78  1
2   53  1
3   44  1
4   84  1
5  156  2
6   63  2
7   13  2
8   56  2
9   24  2
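For reference, a self-contained sketch of Method 1 run against the question's data (the data literals are reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": [67, 78, 53, 44, 84, 2, 63, 13, 56, 24],
                   "B": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]})

grp = df.groupby("B")
# Group sum minus the row's own value = sum of the *other* rows.
s = grp["A"].transform("sum").sub(df["A"])
# head(1).index is the first row of each group, in original order.
idx = grp.head(1).index
df.loc[idx, "A"] = s
```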

pandas - exponentially weighted moving average - similar to excel

Consider a dataframe of 10 rows with two columns A and B, as follows:
    A  B
0  21  6
1  87  0
2  87  0
3  25  0
4  25  0
5  14  0
6  79  0
7  70  0
8  54  0
9  35  0
In Excel I can calculate this, excluding the first row, by setting each value in B to the average of the previous row's A and B values. How can I do this in pandas?
Here is what I've tried:
import pandas as pd

# copying the dataframe given above and calling read_clipboard populates df
df = pd.read_clipboard()
for i in range(1, len(df)):
    df.loc[i, 'B'] = df[['A', 'B']].loc[i - 1].mean()
This gives me the desired result, matching Excel. But is there a better pandas way to do it? I tried expanding and rolling, but neither produced the desired result.
You have an exponentially weighted moving average, rather than a simple moving average. That's why pd.DataFrame.rolling didn't work. You might be looking for pd.DataFrame.ewm instead.
Starting from
df
Out[399]:
    A  B
0  21  6
1  87  0
2  87  0
3  25  0
4  25  0
5  14  0
6  79  0
7  70  0
8  54  0
9  35  0
df['B'] = df["A"].shift().fillna(df["B"]).ewm(com=1, adjust=False).mean()
df
Out[401]:
    A          B
0  21   6.000000
1  87  13.500000
2  87  50.250000
3  25  68.625000
4  25  46.812500
5  14  35.906250
6  79  24.953125
7  70  51.976562
8  54  60.988281
9  35  57.494141
Even on just ten rows, doing it this way speeds up the code by about a factor of 10 with %timeit (959 microseconds versus 10.3 ms). On 100 rows it becomes a factor of 100 (1.1 ms versus 110 ms).
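A small sketch illustrating the com parameter (my own addition, not part of the answer): com=1 corresponds to alpha = 1 / (1 + com) = 0.5, and with adjust=False pandas applies the simple recurrence written out below.

```python
import pandas as pd

a = pd.Series([21.0, 87, 87, 25, 25, 14, 79, 70, 54, 35])

# com=1 means alpha = 1 / (1 + com) = 0.5; with adjust=False pandas
# applies the recurrence y[t] = (1 - alpha) * y[t-1] + alpha * x[t].
by_com = a.ewm(com=1, adjust=False).mean()
by_alpha = a.ewm(alpha=0.5, adjust=False).mean()

# The same recurrence written out by hand:
manual = [a.iloc[0]]
for x in a.iloc[1:]:
    manual.append(0.5 * manual[-1] + 0.5 * x)
```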

Speeding up duplicating of rows in Pandas groupby?

I have a very large data frame (hundreds of millions of rows). There are two group ID's, group_id_1 and group_id_2. The data frame looks like this:
group_id_1  group_id_2  value1  time
         1           2      45     1
         1           2      49     2
         1           4      95     1
         1           4      55     2
         2           2      44     1
         2           4      88     1
         2           4      90     2
For each group_id_1 x group_id_2 combo, I need to duplicate the row with the latest time, and increment the time by one. In other words, my table should look like:
group_id_1  group_id_2  value1  time
         1           2      45     1
         1           2      49     2
         1           2      49     3
         1           4      95     1
         1           4      55     2
         1           4      55     3
         2           2      44     1
         2           2      44     2
         2           4      88     1
         2           4      90     2
         2           4      90     3
Right now, I am doing:
for name, group in df.groupby(['group_id_1', 'group_id_2']):
    last, = group.sort_values(by='time').tail(1)['time'].values
    temp = group[group['time'] == last]
    temp.loc[:, 'time'] = last + 1
    group = group.append(temp)
This is insanely inefficient. If I put the above code into a function, and use the .apply() method with the groupby object, it also takes an enormous amount of time.
How do I speed this process up?
You can use groupby with the last aggregation, add 1 to time with add, and concat back to the original:
df1 = df.sort_values(by='time').groupby(['group_id_1', 'group_id_2']).last().reset_index()
df1.time = df1.time.add(1)
print (df1)
   group_id_1  group_id_2  value1  time
0           1           2      49     3
1           1           4      55     3
2           2           2      44     2
3           2           4      90     3
df = pd.concat([df,df1])
df = df.sort_values(['group_id_1','group_id_2']).reset_index(drop=True)
print (df)
    group_id_1  group_id_2  value1  time
0            1           2      45     1
1            1           2      49     2
2            1           2      49     3
3            1           4      95     1
4            1           4      55     2
5            1           4      55     3
6            2           2      44     1
7            2           2      44     2
8            2           4      88     1
9            2           4      90     2
10           2           4      90     3
First, sort the dataframe by time (this should be more efficient than sorting each group by time):
df = df.sort_values('time')
Second, get the last row in each group (without sorting the groups to improve performance):
last = df.groupby(['group_id_1', 'group_id_2'], sort=False).last()
Third, increment the time:
last['time'] = last['time'] + 1
Fourth, concatenate:
df = pd.concat([df, last])
Fifth, sort back to the original order:
df = df.sort_values(['group_id_1', 'group_id_2'])
Explanation: concatenating and then sorting will be much faster than inserting rows one by one.
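As a further variant (my own sketch, not from the answers above): once the frame is sorted by time, drop_duplicates with keep='last' picks the latest row per combo without any groupby at all.

```python
import pandas as pd

df = pd.DataFrame({"group_id_1": [1, 1, 1, 1, 2, 2, 2],
                   "group_id_2": [2, 2, 4, 4, 2, 4, 4],
                   "value1": [45, 49, 95, 55, 44, 88, 90],
                   "time": [1, 2, 1, 2, 1, 1, 2]})

# After sorting by time, the row drop_duplicates keeps with keep='last'
# is exactly the latest row of each combo.
last = (df.sort_values("time")
          .drop_duplicates(["group_id_1", "group_id_2"], keep="last")
          .copy())
last["time"] += 1
out = (pd.concat([df, last])
         .sort_values(["group_id_1", "group_id_2", "time"])
         .reset_index(drop=True))
```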
