Improve performance of a nested apply in pandas - python

I have a pandas DataFrame names and a Series illegal_words.
# names - ca. 250k rows
name
0 MISS ELFRIEDA ALPERT
1 DALE VON PETTY
2 MOHAMMAD IBN MASILLAH
3 YELENA THE MORRIS
4 MR. SHENNA DEMOSS
...
# illegal_words - ca. 2k rows
0 MISS
1 VON
2 THE
...
I want to remove any illegal word from names.
I'm running a loop inside a loop:
import re
for word in illegal_words:
    names['name'] = names['name'].apply(lambda x: re.sub(word, '', x))
Output:
# names
name
0 ELFRIEDA ALPERT
1 DALE PETTY
2 MOHAMMAD MASILLAH
3 YELENA MORRIS
4 SHENNA DEMOSS
...
(double spaces are not much of an issue)
And it works, but... this is very slow. The re.sub() method is called 2'000 * 250'000 = 500'000'000 times!
What can I do to speed it up?
(since the illegal_words list is very project-specific, I cannot use any external package)

You can try this:
illegal_words = ['MISS', 'VON', 'THE']
out = df['name'].str.replace(fr"({'|'.join(illegal_words)}) ", '', regex=True)
>>> out
0 ELFRIEDA ALPERT
1 DALE PETTY
2 MOHAMMAD IBN MASILLAH
3 YELENA MORRIS
4 MR. SHENNA DEMOSS
Name: name, dtype: object
Performance
For a random list of 2,500 words and 250,000 records:
%timeit df['name'].str.replace(fr"({'|'.join(illegal_words)}) ", '', regex=True)
130 ms ± 870 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
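One caveat worth adding (a hedged refinement, not part of the answer above): a plain alternation also matches inside longer words and breaks on words containing regex metacharacters. Escaping each word and anchoring on word boundaries avoids that, at a small cost:
import re
# Sketch: escape each illegal word and anchor on word boundaries so that, e.g.,
# 'MISS' does not eat into 'MISSISSIPPI'; then collapse leftover double spaces.
pattern = r"\b(?:" + "|".join(map(re.escape, illegal_words)) + r")\b"
out = (df['name']
       .str.replace(pattern, '', regex=True)
       .str.replace(r'\s{2,}', ' ', regex=True)
       .str.strip())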

Related

Most efficient way of testing a value is in a list in pandas

I have a dataframe loaded from a CSV that I am testing various aspects of. The checks all go along the lines of either "does this column match this regex" or "is this column's value in this list".
So I have the dataframe a bit like this:
import pandas as pd
df = pd.DataFrame({'full_name': ['Mickey Mouse', 'M Mouse', 'Mickey RudeWord Mouse'], 'nationality': ['Mouseland', 'United States', 'Canada']})
I am generating new columns based on that content like so:
def full_name_metrics(full_name):
    lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
    # metric of whether full name has less than two distinct elements
    full_name_less_than_2_parts = len(full_name.split(' ')) < 2
    # metric of whether full_name contains an initial
    full_name_with_initial = 1 in [len(x) for x in full_name.split(' ')]
    # metric of whether name matches an offensive word
    full_name_with_offensive_word = any(item in full_name.upper().split(' ') for item in lst_rude_words)
    return pd.Series([full_name_less_than_2_parts, full_name_with_initial, full_name_with_offensive_word])

df[['full_name_less_than_2_parts', 'full_name_with_initial', 'full_name_with_offensive_word']] = df.apply(lambda x: full_name_metrics(x['full_name']), axis=1)
               full_name    nationality  full_name_less_than_2_parts  full_name_with_initial  full_name_with_offensive_word
0           Mickey Mouse      Mouseland                        False                   False                          False
1                M Mouse  United States                        False                    True                          False
2  Mickey RudeWord Mouse         Canada                        False                   False                           True
It works, but for 25k records and more of these types of checks it's taking more time than I'd like.
So is there a better way? Am I better off having the rude word list as another dataframe or am I barking up the wrong tree?
If it is the list checking that you want to speed up - then probably the Series.str.contains method can help -
lst_rude_words_as_str = '|'.join(lst_rude_words)
df['full_name_with_offensive_word'] = df['full_name'].str.upper().str.contains(lst_rude_words_as_str, regex=True)
Here's how the %timeit looks for me:
def func_in_list(full_name):
    '''Your function - just removed the other two columns.'''
    lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
    return any(item in full_name.upper().split(' ') for item in lst_rude_words)
%timeit df.apply(lambda x: func_in_list(x['full_name']), axis=1) #3.15 ms
%timeit df['full_name'].str.upper().str.contains(lst_rude_words_as_str, regex=True) #505 µs
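One difference worth noting (a hedged side note, not from the answer above): str.contains with a plain alternation also matches substrings (e.g. 'RUDEWORDS'), while the original apply-based check matched whole tokens only. A word-boundary pattern narrows that gap:
import re

lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
# Escaping each word and adding \b word boundaries keeps the regex check closer
# to the original whole-word test and tolerates regex metacharacters.
pattern = r'\b(?:' + '|'.join(map(re.escape, lst_rude_words)) + r')\b'
df['full_name_with_offensive_word'] = df['full_name'].str.upper().str.contains(pattern, regex=True)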
EDIT
I added the other two columns that I'd left out before - here's the full code
import pandas as pd
df = pd.DataFrame({'full_name': ['Mickey Mouse', 'M Mouse', 'Mickey Rudeword Mouse']})
def df_metrics(input_df):
    input_df['full_name_less_than_2_parts'] = input_df['full_name'].str.split().map(len) < 2
    input_df['full_name_with_initial'] = input_df['full_name'].str.split(expand=True)[0].map(len) == 1
    lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
    lst_rude_words_as_str = '|'.join(lst_rude_words)
    input_df['full_name_with_offensive_word'] = input_df['full_name'].str.upper().str.contains(lst_rude_words_as_str, regex=True)
    return input_df
RESULTS
For the 3 row dataset - there is no difference between the two functions -
%timeit df_metrics(df)
#3.5 ms ± 67.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[['full_name_less_than_2_parts', 'full_name_with_initial', 'full_name_with_offensive_word']] = df.apply(lambda x: full_name_metrics(x['full_name']), axis=1)
#3.7 ms ± 59.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But when I increase the size of the dataframe, the speed-up becomes substantial (roughly 85x here):
df_big = pd.concat([df] * 10000)
%timeit df_metrics(df_big)
#135 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df_big[['full_name_less_than_2_parts', 'full_name_with_initial', 'full_name_with_offensive_word']] = df_big.apply(lambda x: full_name_metrics(x['full_name']), axis=1)
#11.5 s ± 173 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm going to answer piecemeal...
All your ops rely on splitting the full name column on whitespace so do it once:
>>> stuff = df.full_name.str.split()
For name less than two parts:
>>> df['full_name_less_than_2_parts'] = stuff.agg(len) < 2
>>> df
full_name nationality full_name_less_than_2_parts
0 Mickey Mouse Mouseland False
1 M Mouse United States False
2 Mickey RudeWord Mouse Canada False
Name with only an initial.
Explode the split Series; find the items with length one; group by the index to consolidate the exploded Series and aggregate with any.
>>> q = (stuff.explode().agg(len) == 1)
>>> df['full_name_with_initial'] = q.groupby(q.index).agg('any')
>>> df
full_name nationality full_name_less_than_2_parts full_name_with_initial
0 Mickey Mouse Mouseland False False
1 M Mouse United States False True
2 Mickey RudeWord Mouse Canada False False
Check for undesirable words.
Make a regular expression pattern from the undesirable words list and use it as an argument to the .str.contains method.
>>> rude_words = r'|'.join(['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'])
>>> df['rude'] = df.full_name.str.upper().str.contains(rude_words,regex=True)
>>> df
full_name nationality full_name_less_than_2_parts full_name_with_initial rude
0 Mickey Mouse Mouseland False False False
1 M Mouse United States False True False
2 Mickey RudeWord Mouse Canada False False True
Put them together in a function (mainly to do a timing test) that returns three Series.
import pandas as pd
from timeit import Timer

df = pd.DataFrame(
    {
        "full_name": ["Mickey Mouse", "M Mouse", "Mickey RudeWord Mouse"] * 8000,
        "nationality": ["Mouseland", "United States", "Canada"] * 8000,
    }
)

rude_words = r'|'.join(['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'])

def f(df):
    rude_words = r'|'.join(['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'])
    stuff = df.full_name.str.split()
    s1 = stuff.agg(len) < 2
    stuff = (stuff.explode().agg(len) == 1)
    s2 = stuff.groupby(stuff.index).agg('any')
    s3 = df.full_name.str.upper().str.contains(rude_words, regex=True)
    return s1, s2, s3

t = Timer('f(df)', 'from __main__ import pd,df,f')
print(t.timeit(1))  # <--- 0.12 seconds on my computer

x, y, z = f(df)
df.loc[:, 'full_name_less_than_2_parts'] = x
df.loc[:, 'full_name_with_initial'] = y
df.loc[:, 'rude'] = z
# print(df.head(100))
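If you prefer a single assignment (a hypothetical follow-up, assuming f(df) from the snippet above), the three returned Series can be concatenated and assigned in one step:
# Hypothetical follow-up: assign the three Series returned by f(df) at once.
cols = ['full_name_less_than_2_parts', 'full_name_with_initial', 'rude']
df[cols] = pd.concat(f(df), axis=1).set_axis(cols, axis=1)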

pandas getting most frequent names from a column which has list of names

my dataframe is like this
star_rating actors_list
0 9.3 [u'Tim Robbins', u'Morgan Freeman']
1 9.2 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 9.1 [u'Al Pacino', u'Robert De Niro']
3 9.0 [u'Christian Bale', u'Heath Ledger']
4 8.9 [u'John Travolta', u'Uma Thurman']
I want to extract the most frequent names from the actors_list column. I found this code; do you have a better suggestion, especially for big data?
import pandas as pd
df = pd.read_table(r'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv', sep=',')
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
Expected output (for this data):
robert de niro 13
tom hanks 12
clint eastwood 11
johnny depp 10
al pacino 10
james stewart 9
By my tests, it would be much faster to do the regex cleanup after counting.
from itertools import chain
import re
p = re.compile("""^u['"](.*)['"]$""")
ser = pd.Series(list(chain.from_iterable(
x.title().split(', ') for x in df.actors_list.str[1:-1]))).value_counts()
ser.index = [p.sub(r"\1", x) for x in ser.index.tolist()]
ser.head()
Robert De Niro 18
Brad Pitt 14
Clint Eastwood 14
Tom Hanks 14
Al Pacino 13
dtype: int64
It's often better to go with plain Python than to depend on pandas here, since pandas consumes a huge amount of memory if the lists are large.
If the longest list has 1,000 elements, then all shorter lists get padded with NaNs when you use expand=True, which is a waste of memory. Try this instead:
df = pd.concat([df]*1000) # For the sake of large df.
%%timeit
df.actors_list.str.replace("(u\'|[\[\]]|\')",'').str.lower().str.split(',',expand=True).stack().value_counts()
10 loops, best of 3: 65.9 ms per loop
%%timeit
df['actors_list'] = df['actors_list'].str.strip('[]').str.replace(', ',',').str.split(',')
10 loops, best of 3: 24.1 ms per loop
%%timeit
words = {}
for i in df['actors_list']:
    for w in i:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
100 loops, best of 3: 5.44 ms per loop
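For what it's worth (a minimal sketch, not from the answer above), collections.Counter expresses the same counting loop more idiomatically:
from collections import Counter

# Assumes df['actors_list'] already holds Python lists (after the split above).
words = Counter()
for actors in df['actors_list']:
    words.update(actors)
words.most_common(5)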
I will use ast to convert the list-like strings into actual lists:
import ast
df.actors_list=df.actors_list.apply(ast.literal_eval)
pd.DataFrame(df.actors_list.tolist()).melt().value.value_counts()
According to this code I got the chart below (timing comparison; chart not shown), in which:
coldspeed's code is wen2()
Dark's code is wen4()
my code is wen1()
W-B's code is wen3()

optimize a string query with pandas. large data

I have a dataframe data with close to 4 million rows. It is a list of cities of the world. I need to query by city name as fast as possible.
I found one approach that takes 346 ms, by indexing on the city name:
d2=data.set_index("city",inplace=False)
timeit d2.loc[['PARIS']]
1 loop, best of 3: 346 ms per loop
This is still much too slow. I wonder if I could achieve a faster query with group-by (and how to write such a query). Each city has around 10 rows in the dataframe (duplicate city names).
I searched for several days and could not find a clear solution on the internet.
Thank you
Setup
df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(100000)]*10,columns=['city','value'])
Baseline
df2 = df.set_index('city')
%timeit df2.loc[['Paris9999']]
10 loops, best of 3: 45.6 ms per loop
Solution
Build a lookup dict and then use iloc:
idx_dict = df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
%timeit df.iloc[idx_dict['Paris9999']]
1000 loops, best of 3: 432 µs per loop
It seems this approach is almost 100 times faster than the baseline.
Comparing to other approaches:
%timeit df2[df2.index.values=="Paris9999"]
100 loops, best of 3: 16.7 ms per loop
%timeit full_array_based(df2, "Paris9999")
10 loops, best of 3: 19.6 ms per loop
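For convenience (a hypothetical helper, assuming idx_dict built as above), the lookup can be wrapped so that unknown cities return an empty frame rather than raising a KeyError:
def lookup_city(df, idx_dict, city):
    # .get with an empty default makes missing cities return an empty frame.
    return df.iloc[idx_dict.get(city, [])]

lookup_city(df, idx_dict, 'Paris9999')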
Working with the index's underlying array data, comparing against the needed index value and then using the resulting mask might be one option when looking for performance. A sample case might make things clearer.
1) Input dataframes :
In [591]: df
Out[591]:
city population
0 Delhi 1000
1 Paris 56
2 NY 89
3 Paris 36
4 Delhi 300
5 Paris 52
6 Paris 34
7 Delhi 40
8 NY 89
9 Delhi 450
In [592]: d2 = df.set_index("city",inplace=False)
In [593]: d2
Out[593]:
population
city
Delhi 1000
Paris 56
NY 89
Paris 36
Delhi 300
Paris 52
Paris 34
Delhi 40
NY 89
Delhi 450
2) Indexing with .loc :
In [594]: d2.loc[['Paris']]
Out[594]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
3) Use mask based indexing :
In [595]: d2[d2.index.values=="Paris"]
Out[595]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
4) Finally timings :
In [596]: %timeit d2.loc[['Paris']]
1000 loops, best of 3: 475 µs per loop
In [597]: %timeit d2[d2.index.values=="Paris"]
10000 loops, best of 3: 156 µs per loop
Further boost
Going further with using array data, we can extract the entire input dataframe as array and index into it. Thus, an implementation using that philosophy would look something like this -
def full_array_based(d2, indexval):
df0 = pd.DataFrame(d2.values[d2.index.values==indexval])
df0.index = [indexval]*df0.shape[0]
df0.columns = d2.columns
return df0
Sample run and timings -
In [635]: full_array_based(d2, "Paris")
Out[635]:
population
Paris 56
Paris 36
Paris 52
Paris 34
In [636]: %timeit full_array_based(d2, "Paris")
10000 loops, best of 3: 146 µs per loop
If we are allowed to pre-process and set up a dictionary that can be indexed for city-string-based data extraction from the input dataframe, here's one solution using NumPy to do so -
import numpy as np

def indexed_dict_numpy(df):
    cs = df.city.values.astype(str)
    sidx = cs.argsort()
    scs = cs[sidx]
    idx = np.concatenate(([0], np.flatnonzero(scs[1:] != scs[:-1]) + 1, [cs.size]))
    return {n: sidx[i:j] for n, i, j in zip(cs[sidx[idx[:-1]]], idx[:-1], idx[1:])}
Sample run -
In [10]: df
Out[10]:
city population
0 Delhi 1000
1 Paris 56
2 NY 89
3 Paris 36
4 Delhi 300
5 Paris 52
6 Paris 34
7 Delhi 40
8 NY 89
9 Delhi 450
In [11]: dict1 = indexed_dict_numpy(df)
In [12]: df.iloc[dict1['Paris']]
Out[12]:
city population
1 Paris 56
3 Paris 36
5 Paris 52
6 Paris 34
Runtime test against @Allen's solution to set up a similar dictionary with 4 million rows -
In [43]: # Setup 4 miliion rows of df
...: df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(400000)]*10,\
...: columns=['city','value'])
...: np.random.shuffle(df.values)
...:
In [44]: %timeit df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
1 loops, best of 3: 2.01 s per loop
In [45]: %timeit indexed_dict_numpy(df)
1 loops, best of 3: 1.15 s per loop
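Another pre-processing option worth benchmarking (my own hedged suggestion, not from the answers above): sorting the index. .loc lookups on a monotonic index can use binary search, which is typically far faster than scanning an unsorted index:
# Sketch: sort the city index once; subsequent .loc lookups exploit the sorted order.
d2_sorted = df.set_index('city').sort_index()
result = d2_sorted.loc[['Paris9999']]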

count common entries between two string variables via Python

I would greatly appreciate someone's help with counting the number of matching state names from two columns in my csv file. For instance, consider the first 7 observations from columns State_born_in and State_lives_in:
State_born_in State_lives_in
New York Florida
Massachusetts Massachusetts
Florida Massachusetts
Illinois Illinois 
Iowa Texas
New Hampshire Massachusetts
California California
Basically I want to count the number of people who live in the same state they were born in. I then want the percentage of all people who live in the same state they were born in. So in the example above I would have count = 2, since there are two people (Massachusetts and California) who live in the same state they were born in. And if I wanted the percentage I would just divide 2 by the number of observations. I'm still relatively new to using pandas, but this is what I've tried so far:
import pandas as pd
df = pd.read_csv("uscitizens.csv")
counts = df[(df['State_born_in'] == df['State_lives_in'])] ; counts
percentage = counts/len(df['State_born_in'])
Moreover, how would I do this on a dataset that has over 2 million observations? I would greatly appreciate anyone's help
You can first use boolean indexing and then simply divide the length of the filtered DataFrame by the length of the original (use the length of the index, which is fastest):
print (df)
State_born_in State_lives_in
0 New York Florida
1 Massachusetts Massachusetts
2 Massachusetts Massachusetts
3 Massachusetts Massachusetts
4 Florida Massachusetts
5 Illinois Illinois
6 Iowa Texas
7 New Hampshire Massachusetts
8 California California
same = df[(df['State_born_in'] == df['State_lives_in'])]
print (same)
State_born_in State_lives_in
1 Massachusetts Massachusetts
2 Massachusetts Massachusetts
3 Massachusetts Massachusetts
5 Illinois Illinois
8 California California
counts = len(same.index)
print (counts)
5
percentage = 100 * counts/len(df.index)
print (percentage)
55.55555555555556
Timings:
In [21]: %timeit len(same.index)
The slowest run took 18.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 546 ns per loop
In [22]: %timeit same.shape[0]
The slowest run took 21.82 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.37 µs per loop
In [23]: %timeit len(same['State_born_in'])
The slowest run took 46.92 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop
Faster solution:
same = (df['State_born_in'] == df['State_lives_in'])
print (same)
0 False
1 True
2 True
3 True
4 False
5 True
6 False
7 False
8 True
dtype: bool
counts = same.sum()
print (counts)
5
percentage = 100 * counts/len(df.index)
print (percentage)
55.5555555556
Timings in 2M DataFrame:
#[2000000 rows x 2 columns]
df = pd.concat([df]*200000).reset_index(drop=True)
#print (df)
In [127]: %timeit (100 * (df['State_born_in'] == df['State_lives_in']).sum()/len(df.index))
1 loop, best of 3: 444 ms per loop
In [128]: %timeit (100 * len(df[(df['State_born_in'] == df['State_lives_in'])].index)/len(df.index))
1 loop, best of 3: 472 ms per loop
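A slightly more direct variant (a sketch, not from the answer above): the mean of a boolean Series is the fraction of True values, so the percentage falls out in one expression:
# Equivalent one-liner: the mean of the boolean mask is the share of matches.
percentage = 100 * (df['State_born_in'] == df['State_lives_in']).mean()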
Are you expecting this?
counts = df[ df['State_born_in'] == df['State_lives_in'] ].groupby('State_born_in').agg(['count']).sum()
counts / len(df['State_born_in'])

Python and Pandas - Moving Average Crossover

There is a Pandas DataFrame object with some stock data. SMAs are simple moving averages calculated from the previous 45/15 days.
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
I want to find all dates, when SMA_15 and SMA_45 intersect.
Can it be done efficiently using Pandas or Numpy? How?
EDIT:
What I mean by 'intersection':
The data row, when:
long SMA(45) value was bigger than short SMA(15) value for longer than short SMA period(15) and it became smaller.
long SMA(45) value was smaller than short SMA(15) value for longer than short SMA period(15) and it became bigger.
I'm taking a crossover to mean when the SMA lines -- as functions of time --
intersect, as depicted on this investopedia
page.
Since the SMAs represent continuous functions, there is a crossing when,
for a given row, (SMA_15 is less than SMA_45) and (the previous SMA_15 is
greater than the previous SMA_45) -- or vice versa.
In code, that could be expressed as
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
If we change your data to
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
so that there are crossings,
then
import pandas as pd
df = pd.read_table('data', sep='\s+')
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
crossing_dates = df.loc[crossing, 'Date']
print(crossing_dates)
yields
1 20150128
2 20150129
Name: Date, dtype: int64
The following method gives similar results, but takes less time than the previous methods:
import numpy as np

df['position'] = df['SMA_15'] > df['SMA_45']
df['pre_position'] = df['position'].shift(1)
df.dropna(inplace=True)  # dropping the NaN values
df['crossover'] = np.where(df['position'] == df['pre_position'], False, True)
Time taken for this approach: 2.7 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for previous approach: 3.46 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
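As a side note (a hedged simplification, equivalent to the np.where line above): comparing the two boolean columns directly gives the same crossover flag:
# Equivalent to np.where(position == pre_position, False, True):
df['crossover'] = df['position'] != df['pre_position']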
As an alternative to unutbu's answer, something like the below can also be done to find the indices where SMA_15 crosses SMA_45.
diff = df['SMA_15'] < df['SMA_45']
diff_forward = diff.shift(1)
crossing = np.where(abs(diff - diff_forward) == 1)[0]
print(crossing)
>>> [1,2]
print(df.iloc[crossing])
>>>
Date Price SMA_15 SMA_45
1 20150128 103.05 100 106
2 20150129 105.10 112 105
