Pandas: improve running time looping over string contains substring - python

I have a Pandas dataframe containing a column with pretty long strings (say, URL paths) and a list of unique substrings (the reference list). For every row in my dataframe, I want to determine the corresponding reference element from my list. For example, if the URL in a given row is abcd1234 and one of the reference values is cd123, then I want to add cd123 as the reference for that row, to categorize this row/URL.
I got my code working (see example below), but it's pretty slow, presumably due to a for loop which I can't get rid of. I have the feeling that my code can be much faster, but I can't think of a way to improve it.
How can I improve running time?
See working example below:
import string
import secrets
import pandas as pd
import time
from random import randint
n_ref = 100
n_target = 1000000
## Build reference Series, and target dataframe
reference = pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(randint(10, 19)))
                      for _ in range(n_ref))
target = pd.Series(reference.sample(n = n_target, replace = True)).reset_index().iloc[:,1]
dfTarget = pd.DataFrame({
    'target' : target,
    'pre-string' : pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits)
                                     for _ in range(randint(1, 10)))
                             for _ in range(n_target)),
    'post-string' : pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits)
                                      for _ in range(randint(1, 10)))
                              for _ in range(n_target)),
    'reference' : pd.Series()})
dfTarget['target_combined'] = dfTarget[['pre-string', 'target', 'post-string']].apply(lambda x: ''.join(x), axis=1)
## Fill in reference column
## Loop over references and return reference in reference column
start_time = time.time()
for x in reference:
    dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
print("--- %s seconds ---" % (time.time() - start_time))
Out: 42.60... seconds

On my machine, I see a 17x improvement using pd.Series.apply:
reference_set = set(reference)

def calculator(x):
    return next((i for i in reference_set if i in x), None)

dfTarget['reference'] = dfTarget['target_combined'].apply(calculator)
But for optimal performance, see @unutbu's solution.
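That solution isn't reproduced here, but a single-pass extraction along similar lines (my sketch, not necessarily the exact code from that answer) would build one alternation pattern from the reference values and let str.extract pull the matching reference out of every row at once, leaving NaN where nothing matches:
import re

# One regex alternation over all reference values (variable names taken from
# the question); str.extract returns the captured reference for each row.
pat = '({})'.format('|'.join(map(re.escape, reference)))
dfTarget['reference'] = dfTarget['target_combined'].str.extract(pat, expand=False)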

Here is a slightly (4.3 times) faster approach:
RegEx pattern:
In [23]: pat = '.*({}).*'.format(reference.str.cat(sep='|'))
In [24]: pat
Out[24]: '.*(J6BUVB2BRDLL3IR9S1J|ZOXS91UK513RR18YREI|92KWUFKOK4G9XJAHIBJ|PMEH6N96091AK9XCA5J|3CICA38SDIXLFVED74I|V48OJCY2DS|LX8KGGBORWP6A|7H
V3NN71MU|JMA2K7QSHK72X|CNAOYI3C8T|NZE9SFKPYX|EU9K88XA29YATWR|SB871PEZ7TOPCG8|ZPP76BSDULM8|3QHLISVYEBWH|ST8VOI959D8YPCZ0|02BW83KYG3TEPWMOP|TG
I3P5QZC988GNM8FI0|GJG9MC18G5TU1TIDQB6|V7V5ZZJ5W7O|51KMJ07HEBIX|27GPT3B9DLY|O8KSR85BUB6WBKRC|ZKUEEFX5JFRE0IFRN0|FH8CUWHDETQ5TXWHSS1|N77FTB9VG
LK|JS4RUUQLD7IFP|3R45N7LOY1BZ8RR6O|JY3RXZ0OTC|YJQYOO03G0N7H7E56D|RVJ2VFNK6T7P30|GKPGAK6WAQ2QCAU6H3|7XNJ7A24CHWO1PK|1DVD5G1AE3I40|9F7CCWKHMMF
MBYD18|FWPEUWOWNK2SXR36SG|VTE64VCRY5|YGM8TT19EZTX|GKJYM3QS9ONTERQY1O0|KWMB1TMQTWMC6QCY|JS9SY7W5HI0KK|WNSHPK9KNEP77B|7EIS883NUXSO5Q6|K3HL2UYW
458LCBOSL|XI1FRVGHN0IL0F53CK4|F4HL7GKMOL2Q4Y13|IAXPAA4OX2J1X1|SXPLPYVB6EFSN4U5ZW|5L947F08PX8UW|IONNAOC26A|VQVHXHGYP8634|509ALPOKABO|SUJA66H2
DS7UOXFV|3GYIZATSZAXF8283SZO|A5612XI7X3N4|IH3RB3640D23Q28O|MH0YD83OELSI|RIFFPNRIV0XCY|Y0CXWE6GZPQ3FKH|WSCWR598Z8GBW9G|7C9O59EIA23POSI|UG4D5H
AAOYU5E|F249VSIILZ6KXDQSX|06XZSJHWSM|X01Y9AZ2W5V8HZ|1JLPWMPRGRFWIK|3ZVBSLEQ8DO|WMLKKETELHC|WDPHDS7A7XN7|6X4O4AE2IB3OS|V5J5HWO9RO19ZW2LGT|MK9
P8D9N8V4AJZB|0VT48C38I4T1V6S|R987QUQBTPRHCT7QWA4|D4XXBMCYWQ1172OY|ZUY1O565D2W5GSAL8|V8AR792X1K5UL9DLCKV|CXYK6IQWK3MUC3CO|6X7B6240VC9YL|4QV2D
13ZY15A9D5M1H|WJ7HOMK2FNBZZ6N2Z|QCOWSA3RLR|81I6Z0I5GM|KRD9Y1H3E2WEY9710Q|0161MNQHKEC30E8UI|HGB4XB0QDVHM4H92|RWD6L6EZJUSRK|6U9WOE3YVYKY31K8Q0
K|KCXWHL43B16MRQ1|EO330WAPN7XMX4|VYUX5W2NN277W09NMDB|J8EXE4YIMN0FB|SHE8D14C5A3X|PMPYKSY2FVXFR4Y8X3W|G3YU894U5QGOOM3Z|58J37WJPJBOC7QNKV|NE9WE
JSRXTYFXYZ0TBI|7UPR5XSVOJ244HHZ|N0QZCN6NADW|W2CTEUISOHUY).*'
Replacement:
dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1')
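Note (my addition, not part of the original answer): in pandas 2.0 and later, str.replace defaults to regex=False, so the call above may need an explicit regex=True:
dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1', regex=True)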
Timing against a 10,000-row DF:
In [25]: %%timeit
...: dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1')
...:
617 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [26]: %%timeit
...: [dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] for x in reference]
...:
1.96 s ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [27]: %%timeit
...: for x in reference:
...: dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
...:
2.64 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [28]: 2.64/0.617
Out[28]: 4.278768233387359
In [29]: 2.64/1.96
Out[29]: 1.3469387755102042

Related

Smoothing / interpolating categorical data (fast)

I am currently working with an array containing categorical data.
The categories are organised like this: None, zoneA, zoneB.
My array holds sensor measurements: it tells me whether, at any given time, the sensor is in zoneA, in zoneB, or not in a zone.
My goal here is to smooth those values.
For example, the sensor could drop out of zoneA or zoneB for a period of 30 measurements, and if that happens I want those measurements to be "smoothed".
Ex :
array[zoneA, zoneA, zoneA, None, None, zoneA, zoneA, None, None, None, zoneA]
should give
array[zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, None, None, None, zoneA]
with a threshold of 2.
Currently, I am iterating over the arrays, but this is too expensive and can take 1 or 2 minutes of computation. Is there an existing algorithm that solves this problem?
My current code :
def smooth(self, df: pd.DataFrame) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.
    Returns: dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= self.delay)
            and (
                df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")]
                == "None"
            )
        ):
            df_iter.iloc[
                last_index: (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter
Python implementation
I like to start these kinds of things clean and simple. Therefore I just wrote a simple class that does exactly what is needed, without thinking too much about optimization. I call it Interpolator, as this looks like categorical interpolation to me.
class Interpolator:
    def __init__(self, data):
        self.data = data
        self.current_idx = 0
        self.current_nan_region_start = None
        self.result = None
        self.maxgap = 1

    def run(self, maxgap=2):
        # Initialization
        self.result = [None] * len(self.data)
        self.maxgap = maxgap
        self.current_nan_region_start = None
        prev_isnan = 0

        for idx, item in enumerate(self.data):
            isnan = item is None
            self.current_idx = idx
            if isnan:
                if prev_isnan:
                    # Result is already filled with empty data.
                    # Do nothing.
                    continue
                else:
                    self.entered_nan_region()
                    prev_isnan = 1
            else:  # not nan
                if prev_isnan:
                    self.exited_nan_region()
                    prev_isnan = 0
                else:
                    self.continuing_in_categorical_region()

    def entered_nan_region(self):
        self.current_nan_region_start = self.current_idx

    def continuing_in_categorical_region(self):
        self.result[self.current_idx] = self.data[self.current_idx]

    def exited_nan_region(self):
        nan_region_end = self.current_idx - 1
        nan_region_length = nan_region_end - self.current_nan_region_start + 1
        # Always copy the empty region endpoint even if gap is not filled
        self.result[self.current_idx] = self.data[self.current_idx]

        if nan_region_length > self.maxgap:
            # Do not interpolate as exceeding maxgap
            return
        if self.current_nan_region_start == 0:
            # Special case. data starts with "None"
            # -> Cannot interpolate
            return
        if self.data[self.current_nan_region_start - 1] != self.data[self.current_idx]:
            # Do not fill as both ends of missing data
            # region do not have same value
            return

        # Fill the gap
        for idx in range(self.current_nan_region_start, self.current_idx):
            self.result[idx] = self.data[self.current_idx]
def interpolate(data, maxgap=2):
    """
    Interpolate categorical variables over missing
    values (None's).

    Parameters
    ----------
    data: list of objects
        The data to interpolate. Holds
        categorical data, such as 'cat', 'dog'
        or 108. None is handled as missing data.
    maxgap: int
        The maximum gap to interpolate over.
        For example, with maxgap=2, ['car', None,
        None, 'car', None, None, None, 'car']
        would become ['car', 'car', 'car', 'car',
        None, None, None, 'car'].

    Note: Interpolation will only occur on missing
    data regions where both ends contain the same value.
    For example, [1, None, 2, None, 2] will become
    [1, None, 2, 2, 2].
    """
    interpolator = Interpolator(data)
    interpolator.run(maxgap=maxgap)
    return interpolator.result
This is how one would use it (code for get_data() below):
data = get_data(k=100)
interpolated_data = interpolate(data)
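As a quick check (my addition) against the example from the question: with the default maxgap=2, the two-value gap between matching zones is filled, while the longer gap is left alone:
example = ["zoneA", "zoneA", "zoneA", None, None, "zoneA", "zoneA", None, None, None, "zoneA"]
print(interpolate(example, maxgap=2))
# ['zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', None, None, None, 'zoneA']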
Copy-paste Cython implementation
Most probably the Python implementation is fast enough: with an array size of 1,000,000, processing the data takes 0.504 seconds on my laptop. Anyway, creating a Cython version is fun and might give a small additional timing bonus.
Needed steps:
Copy-paste the Python implementation into a new file, called fast_categorical_interpolate.pyx
Create setup.py in the same folder, with the following contents:
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "fast_categorical_interpolate.pyx",
        language_level="3",
    ),
)
Run python setup.py build_ext --inplace to build the Cython extension. You'll see something like fast_categorical_interpolate.cp38-win_amd64.pyd in the same folder.
Now, you may use the interpolator like this:
import fast_categorical_interpolate as fpi
data = get_data(k=100)
interpolated_data = fpi.interpolate(data)
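A quick sanity check (my sketch, assuming both versions are importable from the same folder) is to confirm that the compiled module agrees with the pure Python implementation:
import fast_categorical_interpolate as fpi

data = get_data(k=10_000)
# both return plain lists, so the results can be compared directly
assert fpi.interpolate(data, maxgap=2) == interpolate(data, maxgap=2)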
Of course, there might be some optimizations that you could do in the Cython code to make this even faster, but on my machine the speed improvement was 38% out of the box with N=1,000,000 and 126% with N=10,000.
Timings on my machine
When N=100 (number of items in the list), the Python implementation is about 160x, and the Cython implementation about 250x, faster than smooth.
In [8]: timeit smooth(test_df, delay=2)
10.2 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: timeit interpolate(data)
64.8 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: timeit fpi.interpolate(data)
41.3 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
When N=10,000, the timing difference is about 190x (Python) to 302x (Cython).
In [5]: timeit smooth(test_df, delay=2)
1.08 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: timeit interpolate(data)
5.69 ms ± 852 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: timeit fpi.interpolate(data)
3.57 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When N=1,000,000, the Python implementation is about 210x faster and the Cython implementation is about 287x faster.
In [9]: timeit smooth(test_df, delay=2)
1min 45s ± 24.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: timeit interpolate(data)
504 ms ± 67.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: timeit fpi.interpolate(data)
365 ms ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Appendix
Test data creator get_data()
import random
random.seed(0)
def get_data(k=100):
    return random.choices(population=[None, "ZoneA", "ZoneB"], weights=[4, 3, 2], k=k)
Function and test data for testing smooth()
import pandas as pd
data = get_data(k=1000)
test_df = pd.DataFrame(dict(landlot=data)).fillna("None")
def smooth(df: pd.DataFrame, delay=2) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.
    Returns: dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= delay)
            and (df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")] == "None")
        ):
            df_iter.iloc[
                last_index : (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter
Note on the "current code"
I think there must be a copy-paste error somewhere, as the "current code" does not work at all. I replaced self.delay with a delay=2 keyword argument to indicate the max gap; I assume that is what it was supposed to be. Even with that, the logic did not work correctly with the simple example data you provided.

Fast method to create nested list with different types: numpy, pandas or list concatenation?

I am trying to accelerate the code below, which produces a list of lists with different types for each column. I originally created a pandas dataframe and then converted it to a list, but this seems to be fairly slow. How can I create this list faster, say by an order of magnitude? All columns are constant except one.
import pandas as pd
import numpy as np
import time
import datetime
def overflow_check(x):
    # in SQL code the column is decimal(13, 2)
    p = 13
    s = 3
    max_limit = float("9"*(p-s) + "." + "9"*s)
    #min_limit = 0.01 #float("0" + "." + "0"*(s-2) + '1')
    #min_limit = 0.1
    if np.logical_not(isinstance(x, np.ndarray)) or len(x) < 1:
        raise Exception("Non-numeric or empty array.")
    else:
        #print(x)
        return x * (np.abs(x) < max_limit) + np.sign(x) * max_limit * (np.abs(x) >= max_limit)

def list_creation(y_forc):
    backcast_length = len(y_forc)
    backcast = pd.DataFrame(data=np.full(backcast_length, 2),
                            columns=['TypeId'])
    backcast['id2'] = None
    backcast['Daily'] = 1
    backcast['ForecastDate'] = y_forc.index.strftime('%Y-%m-%d')
    backcast['ReportDate'] = pd.to_datetime('today').strftime('%Y-%m-%d')
    backcast['ForecastMethodId'] = 1
    backcast['ForecastVolume'] = overflow_check(y_forc.values)
    backcast['CreatedBy'] = 'test'
    backcast['CreatedDt'] = pd.to_datetime('today')
    return backcast.values.tolist()
i=pd.date_range('05-01-2010', '21-05-2018', freq='D')
x=pd.DataFrame(index=i, data = np.random.randint(0, 100, len(i)))
t=time.perf_counter()
y =list_creation(x)
print(time.perf_counter()-t)
This should be a bit faster, it just directly creates the list:
def list_creation1(y_forc):
    zipped = zip(y_forc.index.strftime('%Y-%m-%d'), overflow_check(y_forc.values)[:,0])
    t = pd.to_datetime('today').strftime('%Y-%m-%d')
    t1 = pd.to_datetime('today')
    return [
        [2, None, 1, i, t,
         1, v, 'test', t1]
        for i, v in zipped
    ]
%%timeit
list_creation(x)
> 29.3 ms ± 468 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
list_creation1(x)
> 17.1 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Edit: one of the large contributors to the slowness is the time it takes to convert the datetimes to the specified string format. If we can get rid of that by phrasing it as follows:
def list_creation1(i, v):
    zipped = zip(i, overflow_check(np.array([[_x] for _x in v]))[:,0])
    t = pd.to_datetime('today').strftime('%Y-%m-%d')
    t1 = pd.to_datetime('today')
    return [
        [2, None, 1, i, t,
         1, v, 'test', t1]
        for i, v in zipped
    ]

start = datetime.datetime.strptime("05-01-2010", "%d-%m-%Y")
end = datetime.datetime.strptime("21-05-2018", "%d-%m-%Y")
i = [(start + datetime.timedelta(days=x)).strftime("%d-%m-%Y") for x in range(0, (end-start).days)]
x = np.random.randint(0, 100, len(i))
Then this is now a lot faster:
%%timeit
list_creation1(i, x)
> 1.87 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Add leading zeros based on condition in python

I have a dataframe with 5 million rows. Let's say the dataframe looked like below:
>>> df = pd.DataFrame(data={"Random": "86 7639103627 96 32 1469476501".split()})
>>> df
Random
0 86
1 7639103627
2 96
3 32
4 1469476501
Note that the Random column is stored as a string.
If the number in column Random has fewer than 9 digits, I want to add leading zeros to make it 9 digits. If the number has 9 or more digits, I want to add leading zeros to make it 20 digits.
What I have done is this:
for i in range(0, len(df['Random'])):
    if len(df['Random'][i]) < 9:
        df['Random'][i] = df['Random'][i].zfill(9)
    else:
        df['Random'][i] = df['Random'][i].zfill(20)
Since the number of rows is over 5 million, this process takes a lot of time! (Throughput was about 5 it/sec measured with tqdm; the estimated time of completion was in days!)
Is there an easier and faster way of performing this task?
Let us do np.where combined with zfill; alternatively, you can check str.pad:
df.Random=np.where(df.Random.str.len()<9,df.Random.str.zfill(9),df.Random.str.zfill(20))
df
Out[9]:
Random
0 000000086
1 00000000007639103627
2 000000096
3 000000032
4 00000000001469476501
I used 'apply' combined with the fill_zeros function written below to get a run time of 603ms over a dataframe of 1,000,000 rows.
from random import randint  # needed for the random test data

data = {
    'Random': [str(randint(0, 100_000_000)) for i in range(0, 1_000_000)]
}
df = pd.DataFrame(data)

def fill_zeros(x):
    if len(x) < 9:
        return x.zfill(9)
    else:
        return x.zfill(20)

%timeit df['Random'].apply(fill_zeros)
603 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Compared to:
%timeit np.where(df.Random.str.len()<9,df.Random.str.zfill(9),df.Random.str.zfill(20))
1.57 s ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Since you are asking about efficiency: string operations are one of the common "gotchas" with Pandas. While they are vectorized (in the sense that you can apply them to an entire Series in one go), that does not mean they are more efficient than looping, and this is one example where looping is actually going to be faster than using the string accessor, which tends to be more about convenience than speed.
When in doubt, make sure you time functions on your actual data, since something you think may be clunky and slow may be faster than something that looks clean!
I'm going to propose a very basic looping function that I think will beat any approach using the string accessor.
def loopy(series):
    return pd.Series(
        (
            el.zfill(9) if len(el) < 9 else el.zfill(20)
            for el in series
        ),
        name=series.name,
    )

# to compare more fairly with the apply version
def cache_loopy(series, _len=len, _zfill=str.zfill):
    return pd.Series(
        (_zfill(el, 9 if _len(el) < 9 else 20) for el in series), name=series.name)
Now let's check the timings, using the code provided by Martijn above and simple_benchmark.
Functions
def loopy(series):
    series.copy()  # not necessary but just to make timings fair
    return pd.Series(
        (
            el.zfill(9) if len(el) < 9 else el.zfill(20)
            for el in series
        ),
        name=series.name,
    )

def str_accessor(series):
    target = series.copy()
    mask = series.str.len() < 9
    unmask = ~mask
    target[mask] = target[mask].str.zfill(9)
    target[unmask] = target[unmask].str.zfill(20)
    return target

def np_where_str_accessor(series):
    target = series.copy()
    return np.where(target.str.len() < 9, target.str.zfill(9), target.str.zfill(20))

def fill_zeros(x, _len=len, _zfill=str.zfill):
    # len() and str.zfill() are cached as parameters for performance
    return _zfill(x, 9 if _len(x) < 9 else 20)

def apply_fill(series):
    series = series.copy()
    return series.apply(fill_zeros)

def cache_loopy(series, _len=len, _zfill=str.zfill):
    series.copy()
    return pd.Series(
        (_zfill(el, 9 if _len(el) < 9 else 20) for el in series), name=series.name)
Setup
import pandas as pd
import numpy as np
from random import choices, randrange
from simple_benchmark import benchmark

def randvalue(chars="0123456789", _c=choices, _r=randrange):
    return "".join(_c(chars, k=randrange(5, 30))).lstrip("0")

fns = [loopy, str_accessor, np_where_str_accessor, apply_fill, cache_loopy]
args = {2**i: pd.Series([randvalue() for _ in range(2**i)]) for i in range(14, 21)}
b = benchmark(fns, args, 'Series Length')
b.plot()
You need to vectorize this; select the rows using a boolean index and use .str.zfill() on the resulting subsets:
# select the right rows to avoid wasting time operating on longer strings
shorter = df.Random.str.len() < 9
longer = ~shorter
df.Random[shorter] = df.Random[shorter].str.zfill(9)
df.Random[longer] = df.Random[longer].str.zfill(20)
Note: I did not use np.where() because we wouldn't want to double the work. A vectorized df.Random.str.zfill() is faster than looping over the rows, but doing it twice still takes more time than doing it just once for each set of rows.
Speed comparison on 1 million rows of strings with values of random lengths (from 5 characters all the way up to 30):
In [1]: import numpy as np, pandas as pd
In [2]: import platform; print(platform.python_version_tuple(), platform.platform(), pd.__version__, np.__version__, sep="\n")
('3', '7', '3')
Darwin-17.7.0-x86_64-i386-64bit
0.24.2
1.16.4
In [3]: !sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
In [4]: from random import choices, randrange
In [5]: def randvalue(chars="0123456789", _c=choices, _r=randrange):
...: return "".join(_c(chars, k=randrange(5, 30))).lstrip("0")
...:
In [6]: df = pd.DataFrame(data={"Random": [randvalue() for _ in range(10**6)]})
In [7]: %%timeit
...: target = df.copy()
...: shorter = target.Random.str.len() < 9
...: longer = ~shorter
...: target.Random[shorter] = target.Random[shorter].str.zfill(9)
...: target.Random[longer] = target.Random[longer].str.zfill(20)
...:
...:
825 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: target = df.copy()
...: target.Random = np.where(target.Random.str.len()<9,target.Random.str.zfill(9),target.Random.str.zfill(20))
...:
...:
929 ms ± 69.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(The target = df.copy() line is needed to make sure that each repeated test run is isolated from the one before.)
Conclusion: on 1 million rows, using np.where() is about 10% slower.
However, using df.Random.apply(), as proposed by jackbicknell14, beats either method by a huge margin:
In [9]: def fill_zeros(x, _len=len, _zfill=str.zfill):
...: # len() and str.zfill() are cached as parameters for performance
...: return _zfill(x, 9 if _len(x) < 9 else 20)
In [10]: %%timeit
...: target = df.copy()
...: target.Random = target.Random.apply(fill_zeros)
...:
...:
299 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's about 3 times faster!
df.Random.str.zfill(9).where(df.Random.str.len() < 9, df.Random.str.zfill(20))
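This works because Series.where keeps the values where the condition is True (here, the 9-padded strings for values shorter than 9 digits) and takes values from its second argument everywhere else. A small illustration (my addition) using two of the values from the question:
import pandas as pd

s = pd.Series(["86", "7639103627"])
print(s.str.zfill(9).where(s.str.len() < 9, s.str.zfill(20)))
# 0               000000086
# 1    00000000007639103627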

Python pandas return value from other column

I have a file "specieslist.txt" which contain the following information:
Bacillus,genus
Borrelia,genus
Burkholderia,genus
Campylobacter,genus
Now, I want Python to look for a variable in the first column (in this example "Campylobacter") and return the value of the second column ("genus"). I wrote the following code:
import csv
import pandas as pd
species_import = 'Campylobacter'
df = pd.read_csv('specieslist.txt', header=None, names = ['species', 'level'] )
input = df.loc[df['species'] == species_import]
print (input['level'])
However, my code returns too much, while I only want "genus":
3 genus
Name: level, dtype: object
You can select the first value of the Series with iat:
species_import = 'Campylobacter'
out = df.loc[df['species'] == species_import, 'level'].iat[0]
#alternative
#out = df.loc[df['species'] == species_import, 'level'].values[0]
print (out)
genus
A better solution, which also works if no value is matched and an empty Series is returned - it returns 'no match'. (As @jpp comments, this approach pays off only when you have a large series and the matched value is expected to be near the top.)
species_import = 'Campylobacter'
out = next(iter(df.loc[df['species'] == species_import, 'level']), 'no match')
print (out)
genus
EDIT:
Idea from the comments, thanks @jpp:
def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'
print (get_first_val(species_import))
genus
print (get_first_val('aaa'))
no match
EDIT:
df = pd.DataFrame({'species':['a'] * 10000 + ['b'], 'level':np.arange(10001)})

def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'
In [232]: %timeit next(iter(df.loc[df['species'] == 'a', 'level']), 'no match')
1.3 ms ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [233]: %timeit (get_first_val('a'))
1.1 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [235]: %timeit (get_first_val('b'))
1.48 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [236]: %timeit next(iter(df.loc[df['species'] == 'b', 'level']), 'no match')
1.24 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Performance of various methods, to demonstrate when it is useful to use next(...).
n = 10**6
df = pd.DataFrame({'species': ['b']+['a']*n, 'level': np.arange(n+1)})
def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'
%timeit next(iter(df.loc[df['species'] == 'b', 'level']), 'no match') # 123 ms per loop
%timeit get_first_val('b') # 125 ms per loop
%timeit next(idx for idx, val in enumerate(df['species']) if val == 'b') # 20.3 µs per loop
get
With pandas.Series.get, you can return either a scalar value if the 'species' is unique or a pandas.Series if not unique.
f = df.set_index('species').level.get
f('Campylobacter')
'genus'
If not in the data, you can provide a default
f('X', 'Not In Data')
'Not In Data'
We could also use dict.get and only return scalars. If not unique, this will return the last one.
f = dict(zip(df.species, df.level)).get
If you want to return the first one, you can do that a few ways
f = dict(zip(df.species[::-1], df.level[::-1])).get
Or
f = df.drop_duplicates('species').pipe(
    lambda d: dict(zip(d.species, d.level)).get
)
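For instance (my usage example, based on the sample data from the question), either dict-backed f behaves the same way as the Series.get version above:
print(f('Campylobacter'))     # genus
print(f('X', 'Not In Data'))  # Not In Data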
# Change the last line of your code to
print(input['level'].values)

# For an explanation, refer to the code below
import csv
import pandas as pd

species_import = 'Campylobacter'
df = pd.read_csv('specieslist.txt', header=None, names = ['species', 'level'] )
input = df['species'] == species_import  # returns a boolean Series (mask)
print(type(df[input]))                   # a Pandas DataFrame
print(type(df[input]['level']))          # a Pandas Series
# To obtain the value from this Series:
print(df[input]['level'].values)         # prints ['genus']

Do both lines return a string?

I am currently working through the beginner Codebat track. Both pieces of code work; however, is there anything fundamentally wrong/different between the two ways of writing the code below?
thanks,
def mine(myStr, x):
    myResult = myStr * x
    return myResult

def codebat(thierStr, i):
    codeResult = ''
    for i in range(i):
        codeResult += thierStr
    return codeResult
import string  # string.ascii_letters = 'abcde...ABCDE...'

def mine(s, x):
    return s * x  # fixed your code so it multiplies by x, not 4

def theirs(s, x):  # renamed but the same as codebat
    res = ''
    for _ in range(x):
        res += s
    return res
We can see they give the same results
mine(string.ascii_letters, 10) == theirs(string.ascii_letters, 10) # --> True
We can test the time efficiency of these functions however
%timeit mine(string.ascii_letters, 1000)
2.27 µs ± 9.69 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit theirs(string.ascii_letters, 1000)
202 µs ± 4.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As you can see, mine is almost 100 times faster, because under the hood Python allocates the memory needed for the new string in one go. In theirs, it has to keep reallocating memory each time the string length is increased.
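For completeness, a related idiom (my addition, not part of the original answer) is to collect the pieces in a list and concatenate them once with str.join, which likewise avoids growing the string incrementally:
def joined(s, x):
    # build all the pieces first, then concatenate them in a single pass
    return ''.join([s] * x)

print(joined(string.ascii_letters, 10) == mine(string.ascii_letters, 10))  # True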
