pandas split string and extract up to n index position - python

I have my input data like below, stored in a dataframe column
active_days_revenue
active_days_rate
total_revenue
gap_days_rate
I would like to do the below
a) split the string using _ delimiter
b) extract the first n elements from the split
So, I tried the below
df['text'].str.split('_')[:1]     # but this doesn't work
df['text'].str.split('_').str[0]  # this works but returns only the 1st element
I expect my output like below. Instead of getting only the item at index position 0, I would like the items from index positions 0 through 1, joined back together
active_days
active_days
total_revenue
gap_days

You can use str.extract with a dynamic regex (fastest):
N = 2
df['out'] = df['text'].str.extract(fr'([^_]+(?:_[^_]+){{,{N-1}}})', expand=False)
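For clarity, with N = 2 the f-string expands to ([^_]+(?:_[^_]+){,1}): one token, then at most one more _token. A quick self-check (a small sketch, assuming N = 2):
import re

N = 2
pattern = fr'([^_]+(?:_[^_]+){{,{N-1}}})'
assert pattern == r'([^_]+(?:_[^_]+){,1})'
assert re.match(pattern, 'active_days_revenue').group(1) == 'active_days'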
Or slicing and agg:
df['out'] = df['text'].str.split('_').str[:2].agg('_'.join)
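Equivalently, a small variant with the same result: pandas' own str.join can replace the agg call:
df['out'] = df['text'].str.split('_').str[:2].str.join('_')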
Or str.extractall and groupby.agg:
df['out'] = df['text'].str.extractall('([^_]+)')[0].groupby(level=0).agg(lambda x: '_'.join(x.head(2)))
Output:
                  text            out
0  active_days_revenue    active_days
1     active_days_rate    active_days
2        total_revenue  total_revenue
3        gap_days_rate       gap_days
Timings, on 4k rows:
# extract
2.17 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# split/slice/agg
3.56 ms ± 811 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# extractall
361 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using as grouper for columns:
import re
N = 2
df1.groupby(lambda x: m.group() if (m:=re.match(fr'([^_]+(?:_[^_]+){{,{N-1}}})', x)) else x, axis=1, sort=False)
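To make the grouper concrete, here is a minimal sketch; df1 and the final sum() are assumptions for illustration, and note that groupby(axis=1) is deprecated in recent pandas versions:
import re
import pandas as pd

N = 2
df1 = pd.DataFrame([[1, 2, 3, 4]],
                   columns=['active_days_revenue', 'active_days_rate',
                            'total_revenue', 'gap_days_rate'])
# same walrus-based grouper as above: map each column label to its first N tokens
grouper = lambda x: m.group() if (m := re.match(fr'([^_]+(?:_[^_]+){{,{N-1}}})', x)) else x
print(df1.groupby(grouper, axis=1, sort=False).sum())
#    active_days  total_revenue  gap_days
# 0            3              3         4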

Related

How to apply changes to subset dataframe to source dataframe

I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby( ['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I make the changes / apply the "DuplicateSample" flag to the source rdtRows data? I'm stumped :(
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if you need a faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
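As a quick sanity check on a tiny made-up Sample column (an illustration, not your data), both lines flag the same rows:
import pandas as pd

df = pd.DataFrame({'Sample': ['a', 'b', 'b', 'c']})
by_size = df.groupby('Sample')['Sample'].transform('size') > 1
by_dup = df['Sample'].duplicated(keep=False)
assert by_size.equals(by_dup)  # both are [False, True, True, False]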
Performance on some sample data (real data will differ, depending on the number of rows and of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Stef's solution is unfortunately 2734 times slower than the duplicated solution
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
   Sample  DuplicateSample
0       1            False
1       2             True
2       2             True
3       3            False
4       4             True
5       4             True

Create a new dataframe based on two columns of value in pandas dataframe

I have a dataframe df1 in python as below:
Type Category
a 1
b 2
c 3
d 4
Expected output:
Type
a/1
b/2
c/3
d/4
The actual dataframe is way larger than this, thus I can't type out every cell for the new dataframe.
How can I extract the columns and output them to another dataframe, separated by '/'? Maybe using some for loop?
Using str.cat
The right pandas-y way to proceed is by using str.cat
df['Type'] = df.Type.str.cat(others=df.Category.astype(str), sep='/')
others contains the pd.Series to concatenate, and sep the separator to use.
Result
Type
0 a/1
1 b/2
2 c/3
3 d/4
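One hedged aside: if the calling column can contain missing values, str.cat produces NaN for those rows unless you pass na_rep:
# assumption: Type may hold NaN; na_rep substitutes a placeholder instead of NaN
df['Type'] = df.Type.str.cat(others=df.Category.astype(str), sep='/', na_rep='?')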
Performance comparison
%%timeit
df.Type.str.cat(others=df.Category.astype(str), sep='/')
>> 286 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df['Type'] + '/' + df['Category'].astype(str)
>> 348 µs ± 5.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Both solutions give the same result, but str.cat is about 20% faster.

avoid repeating the dataframe name when operating on pandas columns

Very much a beginner question, sorry: is there a way to avoid repeating the dataframe name when operating on pandas columns?
In R, data.table lets you operate on a column without repeating the data table's name, like this:
very_long_dt_name = data.table::data.table(col1=c(1,2,3),col2=c(3,3,1))
# operate on the columns without repeating the dt name:
very_long_dt_name[,ratio:=round(col1/col2,2)]
I couldn't figure out how to do it with pandas in Python so I keep repeating the df name:
data = {'col1': [1,2,3], 'col2': [3, 3, 1]}
very_long_df_name = pd.DataFrame(data)
# operate on the columns requires repeating the df name
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
I'm sure there's a way to avoid it but I can't find anything on Google. Any hint please? Thank you.
Try assign:
very_long_df_name.assign(ratio=lambda x: np.round(x.col1/x.col2,2))
Output:
   col1  col2  ratio
0     1     3   0.33
1     2     3   0.67
2     3     1   3.00
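Note that assign returns a new DataFrame rather than modifying in place, so bind the result back if you want to keep the column:
very_long_df_name = very_long_df_name.assign(ratio=lambda x: np.round(x.col1 / x.col2, 2))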
Edit: to reflect comments, tests on 1 million rows:
%%timeit
very_long_df_name.assign(ratio = lambda x:x.col1/x.col2)
# 18.6 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and
%%timeit
very_long_df_name['ratio'] = very_long_df_name['col1']/very_long_df_name['col2']
# 13.3 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And with np.round, using assign:
%%timeit
very_long_df_name.assign(ratio = lambda x: np.round(x.col1/x.col2,2))
# 64.8 ms ± 958 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
and without assign:
%%timeit
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
# 55.8 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it appears that assign is vectorized, just not as well tuned.

Pandas: Replace a string with 'other' if it is not present in a list of strings

I have the following data frame, df, with column 'Class'
Class
0 Individual
1 Group
2 A
3 B
4 C
5 D
6 Group
I would like to replace everything apart from Group and Individual with 'Other', so the final data frame is
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
The dataframe is huge, with over 600K rows. What is the optimal way to look for values other than 'Group' and 'Individual' and replace them with 'Other'?
I have seen examples for replace, such as:
df['Class'] = df['Class'].replace({'A':'Other', 'B':'Other'})
but since I have far too many unique values, I cannot do this individually. I would rather just exclude everything outside the subset of 'Group' and 'Individual'.
I think you need:
df['Class'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
print (df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
Another solution (slower):
m = (df['Class'] == 'Individual') | (df['Class'] == 'Group')
df['Class'] = np.where(m, df['Class'], 'Other')
Another solution:
df['Class'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
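For clarity on why this works: map sends any label missing from the dict to NaN, and fillna then rewrites exactly those; the identity mapping is what preserves the two allowed labels. A tiny illustration:
import pandas as pd

s = pd.Series(['Individual', 'A'])
s.map({'Individual': 'Individual', 'Group': 'Group'})                  # ['Individual', NaN]
s.map({'Individual': 'Individual', 'Group': 'Group'}).fillna('Other')  # ['Individual', 'Other']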
Performance (on real data it depends on the number of replacements):
#[700000 rows x 1 columns]
df = pd.concat([df] * 100000, ignore_index=True)
#print (df)
In [208]: %timeit df['Class1'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
25.9 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit df['Class2'] = np.where((df['Class'] == 'Individual') | (df['Class'] == 'Group'), df['Class'], 'Other')
120 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [210]: %timeit df['Class3'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
95.7 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [211]: %timeit df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
97.8 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another approach could be:
df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
You can do it this way, for example:
# get the list of unique items
vals = df['Class'].unique().tolist()
# remove your known classes
vals.remove('Individual')
vals.remove('Group')
# select all remaining rows and replace their Class values
df.loc[df['Class'].isin(vals), 'Class'] = 'Other'
The principle is the same as the isin answers above: build the exclusion list, then assign.
You can use pd.Series.where:
df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other', inplace=True)
print(df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
This should be more efficient than map + fillna:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other')
# 60.3 ms per loop
%timeit df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
# 133 ms per loop
Another way, using apply:
df['Class'] = df['Class'].apply(lambda cl: cl if cl in ["Individual", "Group"] else "Other")

Convert a list of hashIDs stored as string to a column of unique values

I have a dataframe where in one column I have a list of hash values stored like strings:
'[d85235f50b3c019ad7c6291e3ca58093,03e0fb034f2cb3264234b9eae09b4287]' just to be clear.
the dataframe looks like
1
0 [8a88e629c368001c18619c7cd66d3e96, 4b0709dd990a0904bbe6afec636c4213, c00a98ceb6fc7006d572486787e551cc, 0e72ae6851c40799ec14a41496d64406, 76475992f4207ee2b209a4867b42c372]
1 [3277ded8d1f105c84ad5e093f6e7795d]
2 [d85235f50b3c019ad7c6291e3ca58093, 03e0fb034f2cb3264234b9eae09b4287]
I'd like to create a list of the unique hash IDs present in this column. What is an efficient way to do this?
Thank you
Option 1
See timing below for fastest option
You can embed the parsing and flattening in one comprehension
[y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')]
['8a88e629c368001c18619c7cd66d3e96',
'4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc',
'0e72ae6851c40799ec14a41496d64406',
'76475992f4207ee2b209a4867b42c372',
'3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093',
'03e0fb034f2cb3264234b9eae09b4287']
From there, you can use either list(set()), pd.unique, or np.unique
pd.unique([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')])
array(['8a88e629c368001c18619c7cd66d3e96',
'4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc',
'0e72ae6851c40799ec14a41496d64406',
'76475992f4207ee2b209a4867b42c372',
'3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093',
'03e0fb034f2cb3264234b9eae09b4287'], dtype=object)
Option 2
For brevity, use pd.Series.extractall
list(set(df['1'].str.extractall(r'(\w+)')[0]))
['8a88e629c368001c18619c7cd66d3e96',
'4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc',
'0e72ae6851c40799ec14a41496d64406',
'76475992f4207ee2b209a4867b42c372',
'3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093',
'03e0fb034f2cb3264234b9eae09b4287']
#jezrael's list(set()) with my comprehension is fastest
Parse Timing
I kept the same list(set()) for purposes of comparing parsing and flattening.
%timeit list(set(np.concatenate(df['1'].apply(yaml.load).values).tolist()))
%timeit list(set([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')]))
%timeit list(set(chain.from_iterable(df['1'].str.strip('[]').str.split(', '))))
%timeit list(set(df['1'].str.extractall(r'(\w+)')[0]))
1.01 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
6.42 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
279 µs ± 8.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
941 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This takes my comprehension and uses various ways of making the values unique, to compare those speeds:
%timeit pd.unique([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')])
%timeit np.unique([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')])
%timeit list(set([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')]))
57.8 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.5 µs ± 552 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.18 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You need strip with split first, then chain for flattening:
print (df.columns.tolist())
['col']
# convert strings to lists, row by row
# change to your column name if necessary
s = df['col'].str.strip('[]').str.split(', ')
print (s)
0 [8a88e629c368001c18619c7cd66d3e96, 4b0709dd990...
1 [3277ded8d1f105c84ad5e093f6e7795d]
2 [d85235f50b3c019ad7c6291e3ca58093, 03e0fb034f2...
Name: col, dtype: object
#check first value
print (type(s.iat[0]))
<class 'list'>
# get unique values by using set
from itertools import chain
L = list(set(chain.from_iterable(s)))
['76475992f4207ee2b209a4867b42c372', '3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093', '4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc', '03e0fb034f2cb3264234b9eae09b4287',
'8a88e629c368001c18619c7cd66d3e96', '0e72ae6851c40799ec14a41496d64406']
from itertools import chain
s = [x.strip('[]').split(', ') for x in df['col'].values.tolist()]
L = list(set(chain.from_iterable(s)))
print (L)
['76475992f4207ee2b209a4867b42c372', '3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093', '4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc', '03e0fb034f2cb3264234b9eae09b4287',
'8a88e629c368001c18619c7cd66d3e96', '0e72ae6851c40799ec14a41496d64406']
IIUC, you want to flatten your data. Convert it to a column of lists using yaml.load.
import yaml
df = df.applymap(yaml.load)
print(df)
1
0 [8a88e629c368001c18619c7cd66d3e96, 4b0709dd990...
1 [3277ded8d1f105c84ad5e093f6e7795d]
2 [d85235f50b3c019ad7c6291e3ca58093, 03e0fb034f2...
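A caveat worth adding to this answer: on PyYAML 5.1+, yaml.load without an explicit Loader emits a warning, so a safer spelling is yaml.safe_load (and note that in pandas 2.1+ DataFrame.map replaces the deprecated applymap):
import yaml

df = df.applymap(yaml.safe_load)  # same parse, explicit safe loader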
The easiest way would be to construct a new dataframe from the old one's values.
out = pd.DataFrame(np.concatenate(df.iloc[:, 0].values.tolist()))
print(out)
0
0 8a88e629c368001c18619c7cd66d3e96
1 4b0709dd990a0904bbe6afec636c4213
2 c00a98ceb6fc7006d572486787e551cc
3 0e72ae6851c40799ec14a41496d64406
4 76475992f4207ee2b209a4867b42c372
5 3277ded8d1f105c84ad5e093f6e7795d
6 d85235f50b3c019ad7c6291e3ca58093
7 03e0fb034f2cb3264234b9eae09b4287
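If the same hash can appear in more than one row, deduplicate while building the frame (a small extension of this answer, not in the original):
out = pd.DataFrame(pd.unique(np.concatenate(df.iloc[:, 0].values.tolist())))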
