I am reading two types of CSV files that are very similar.
They are about the same length, 20,000 lines. Each line represents parameters recorded each second.
Thus, the first column is the timestamp.
In the first file, the pattern is the following: 2018-09-24 15:38
In the second file, the pattern is the following: 2018-09-24 03:38:06 PM
In both cases, the command is the same:
data = pd.read_csv(file)
data['Timestamp'] = pd.to_datetime(data['Timestamp'])
I checked the execution time of both lines:
pd.read_csv is equally fast in both cases
the second line of the code takes ~3 to 4 seconds longer on the second file
The only difference is the date pattern. I would not have suspected that. Do you know why? Do you know how to fix this?
pandas.to_datetime is extremely slow (in certain instances) when it needs to parse the dates automatically. Since it seems like you know the formats, you should explicitly pass them to the format parameter, which will greatly improve the speed.
Here's an example:
import pandas as pd
df1 = pd.DataFrame({'Timestamp': ['2018-09-24 15:38:06']*10**5})
df2 = pd.DataFrame({'Timestamp': ['2018-09-24 03:38:06 PM']*10**5})
%timeit pd.to_datetime(df1.Timestamp)
#21 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.to_datetime(df2.Timestamp)
#14.3 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's 700x slower. Now specify the format explicitly:
%timeit pd.to_datetime(df2.Timestamp, format='%Y-%m-%d %I:%M:%S %p')
#384 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pandas is still parsing the second date format more slowly, but it's not nearly as bad as it was before.
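For the two patterns in the question, a minimal sketch of passing the format explicitly could look like the following. The frame names file1 and file2 and the two FORMAT constants are only illustrative, and the sample values are copied from the question rather than read from real files.

```python
import pandas as pd

# Sample values copied from the two patterns in the question
# (illustrative frames standing in for the two CSV files).
file1 = pd.DataFrame({"Timestamp": ["2018-09-24 15:38"] * 3})
file2 = pd.DataFrame({"Timestamp": ["2018-09-24 03:38:06 PM"] * 3})

# strftime-style directives: %H is the 24-hour clock,
# %I with %p is the 12-hour clock with an AM/PM marker.
FORMAT_24H = "%Y-%m-%d %H:%M"
FORMAT_12H = "%Y-%m-%d %I:%M:%S %p"

file1["Timestamp"] = pd.to_datetime(file1["Timestamp"], format=FORMAT_24H)
file2["Timestamp"] = pd.to_datetime(file2["Timestamp"], format=FORMAT_12H)

print(file1["Timestamp"].iloc[0])
print(file2["Timestamp"].iloc[0])
```

With an explicit format, pandas does not have to guess the layout of each string, which is where the slow path in older versions came from.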
Edit: as of pd.__version__ == '1.0.5', automatic parsing seems to have become much faster for the formats that used to parse extremely slowly, likely due to the implementation of this performance improvement in pd.__version__ == '0.25.0'
import pandas as pd
df1 = pd.DataFrame({'Timestamp': ['2018-09-24 15:38:06']*10**5})
df2 = pd.DataFrame({'Timestamp': ['2018-09-24 03:38:06 PM']*10**5})
%timeit pd.to_datetime(df1.Timestamp)
#9.01 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.to_datetime(df2.Timestamp)
#9.1 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I need to group a DataFrame and apply several chained functions on each group.
My problem is basically the same as in pandas - Groupby two functions: apply cumsum then shift on each group.
There are answers there on how to obtain a correct result, however they seem to have a suboptimal performance. My specific question is thus: is there a more efficient way than the ones I describe below?
First here is some large testing data:
from string import ascii_lowercase
import numpy as np
import pandas as pd
n = 100_000_000
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
Below is the performance of each function:
%timeit df.groupby("x")["y"].cumsum()
4.65 s ± 71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby("x")["y"].shift()
5.29 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A basic solution is to group twice. It seems suboptimal since grouping is a large part of the total runtime and should only be done once.
%timeit df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()
10.1 s ± 63.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The accepted answer to the aforementioned question suggests using apply with a custom function to avoid this issue. However, for some reason it actually performs much worse than the previous solution.
def cumsum_shift(s):
    return s.cumsum().shift()
%timeit df.groupby("x")["y"].apply(cumsum_shift)
27.8 s ± 858 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Do you have any idea how to optimize this code? Especially in a case where I'd like to chain more than two functions, performance gains can become quite significant.
Let me know if this helps; a few weeks back I had the same issue.
I solved it by splitting the code and creating a separate groupby object, which holds the information about the groups.
# creating groupby object
g = df.groupby('x')['y']
%timeit g.cumsum()
592 ms ± 8.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit g.shift()
1.7 s ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
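I would suggest trying transform instead of apply. Note that the shift has to happen inside the grouped call: chaining a second transform or shift on the returned Series shifts across group boundaries.
Try this:
%timeit df.groupby("x")["y"].transform(lambda s: s.cumsum().shift())
Or try pipe, which is a method of the groupby object, so no extra import is needed:
%timeit df.groupby("x").pipe(lambda g: g["y"].cumsum().groupby(df["x"]).shift())
I suspect that pipe can be more efficient than apply or transform, but I have not benchmarked it.
Let us know if it works well.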
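Before timing any rewrite, it is worth checking on a tiny frame that it still shifts within each group. The sketch below uses made-up data and illustrative variable names. The last variant is an assumption on my part rather than something from the answers above: it reuses a single groupby object and recovers the shifted cumulative sum by subtracting the current value, so it can differ from the reference by floating-point rounding.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up data, not the 100M-row benchmark).
small = pd.DataFrame({
    "x": ["a", "b", "a", "b", "a"],
    "y": [1.0, 10.0, 2.0, 20.0, 3.0],
})

g = small.groupby("x")["y"]

# Reference: group twice, as in the question.
reference = g.cumsum().groupby(small["x"]).shift()

# Both steps inside one transform call.
via_transform = g.transform(lambda s: s.cumsum().shift())

# Incorrect variant: the trailing shift runs over the whole Series,
# so values leak from one group into the next.
leaky = g.cumsum().shift()

# Assumed alternative: shifted cumsum == cumsum minus the current value,
# with the first row of every group masked to NaN.
exclusive = (g.cumsum() - small["y"]).where(g.cumcount() > 0)

print(reference.equals(via_transform))
print(reference.equals(leaky))
print(np.allclose(reference, exclusive, equal_nan=True))
```

Whether the last variant is actually faster on the large frame would need to be measured; it only avoids building a second groupby object.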
I am trying to structure a df for productivity. At some point I need to check whether an id exists in a list and set an indicator accordingly, but it is too slow (something like 30 seconds per df).
Can you enlighten me on a better way to do it?
That's my current code:
data['first_time_it_happen'] = data['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
(I already tried using the column as a Series, but it did not work correctly.)
To settle some debate in the comment section, I ran some timings.
Methods to time:
def isin(df, old_data):
    return df["id"].isin(old_data["id"])
def apply(df, old_data):
    return df['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
def set_(df, old_data):
    old = set(old_data['id'].values)
    return [x in old for x in df['id']]
import pandas as pd
import string
old_data = pd.DataFrame({"id": list(string.ascii_lowercase[:15])})
df = pd.DataFrame({"id": list(string.ascii_lowercase)})
Small DataFrame tests:
# Tests ran in jupyter notebook
%timeit isin(df, old_data)
184 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit apply(df, old_data)
926 µs ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit set_(df, old_data)
28.8 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Large dataframe tests:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit isin(df, old_data)
122 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit apply(df, old_data)
56.9 s ± 6.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set_(df, old_data)
974 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For a small DataFrame, the set method is the fastest (about 6x faster than isin). That comparison flips for a much larger DataFrame, where isin is about 8x faster than the set method. In most cases isin will be the best way to go. The apply method is always the slowest of the bunch, regardless of DataFrame size.
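Note that the timed isin function returns True for ids that are already in old_data, while the code in the question wants 0 for those and 1 for new ids. A minimal sketch of the isin version of that indicator, using made-up sample ids, might be:

```python
import pandas as pd

# Made-up sample data for illustration.
old_data = pd.DataFrame({"id": ["a", "b", "c"]})
data = pd.DataFrame({"id": ["a", "x", "c", "y"]})

# isin is True for ids already present in old_data; negate it so that
# 1 marks an id seen for the first time and 0 marks a known id.
data["first_time_it_happen"] = (~data["id"].isin(old_data["id"])).astype(int)

print(data)
```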
I often find myself looping over some long INPUT list (or dataframe, or dictionary). Per iteration I do some calculations on the input data, I then push the results into some OUTPUT data structure. Often the final output is a dataframe (since it is convenient to deal with).
Below are two methods that loop over a long list, and aggregate some dummy results into a dataframe. Approach 1 is very slow (~3 seconds per run), whereas Approach 2 is very fast (~18 ms per run). Approach 1 is not good, because it is slow. Approach 2 is faster, but it is not ideal either, because it effectively "caches" data in a local file (and then relies on pandas to read that file back in very quickly). Ideally, we do everything in memory.
What approaches can people suggest to efficiently aggregate results? Bonus: And what if we don't know the exact size/length of our output structure (e.g. the actual output size may exceed the initial size estimate)? Any ideas appreciated.
import time
import pandas as pd
def run1(long_list):
    my_df = pd.DataFrame(columns=['A','B','C'])
    for el in long_list:
        # Append a row at the next integer label
        my_df.loc[len(my_df)] = [el, el+1, 1/el] # Dummy calculations
    return my_df
def run2(long_list):
    with open('my_file.csv', 'w') as f:
        f.write('A,B,C\n')
        for el in long_list:
            f.write(f'{el},{el+1},{1/el}\n') # Dummy calculations
    return pd.read_csv('my_file.csv')
long_list = range(1, 2000)
%timeit df1 = run1(long_list) # 3 s ± 349 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = run2(long_list) # 18 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can do this by creating and then dropping a dummy input column and doing all of the calculations directly in pandas:
def func(long_list):
    my_df = pd.DataFrame(long_list, columns=['input'])
    my_df = my_df.assign(
        A=my_df.input,
        B=my_df.input+1,
        C=1/my_df.input)
    return my_df.drop('input', axis=1)
Comparing the times:
%timeit df1 = run1(long_list)
%timeit df2 = run2(long_list)
%timeit df3 = func(long_list)
3.81 s ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.54 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.19 ms ± 3.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pros:
All in memory
Really fast
Easy to read
Cons:
Probably not as fast as vectorized Numpy operations
You can directly build a DataFrame from a list of lists:
def run3(long_list):
    return pd.DataFrame([[el, el+1, 1/el] for el in long_list],
                        columns=['A','B','C'])
It should be much faster than the first one, and still faster than the second one, because it does not use disk I/O.
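For the bonus question (output size not known in advance), one common pattern is to collect rows in a plain Python list and build the DataFrame once at the end. This is only a sketch: run4 is a hypothetical name, and the skip condition is made up to show that the number of output rows does not need to be known upfront.

```python
import pandas as pd

def run4(long_list):
    rows = []  # a list grows as needed, so no size estimate is required
    for el in long_list:
        if el % 2 == 0:  # made-up filter: output length is not known upfront
            continue
        rows.append({"A": el, "B": el + 1, "C": 1 / el})  # Dummy calculations
    # Build the DataFrame once, after the loop.
    return pd.DataFrame(rows, columns=["A", "B", "C"])

df4 = run4(range(1, 2000))
print(df4.shape)
```

Appending to a list is cheap, whereas growing a DataFrame row by row reallocates on every insertion, which is what makes run1 slow.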
I'm testing out feather-format as a way to store pandas DataFrame files. The performance of feather seems to be extremely poor when writing columns consisting entirely of None (info() gives 0 non-null object). The following code well encapsulates the issue:
df1 = pd.DataFrame(data={'x': 1000*[None]})
%timeit df1.to_feather('.../x.feather')
5.35 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df1.to_pickle('.../x.pkl')
734 ms ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df1.to_parquet('.../x.parquet')
200 ms ± 5.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm using feather-format 0.4.0, pandas 0.23.4, and pyarrow 0.13.0.
How can I get these kinds of DataFrames to save without taking forever?
You could try adding a specific dtype. That being said, the numbers are a little surprising in terms of how poor feather performance is.
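As a sketch of what adding a specific dtype could mean here: an all-None column is stored as object, and casting it to a concrete dtype turns the None values into NaN. I am assuming float64 is acceptable for the column; whether this removes the feather slowdown on the versions in the question is something you would have to time yourself.

```python
import pandas as pd

df1 = pd.DataFrame(data={"x": 1000 * [None]})

# Cast the all-None object column to a concrete dtype (float64 is an
# assumption); the None values become NaN.
typed = df1.astype({"x": "float64"})

print(typed.dtypes)
# typed.to_feather('.../x.feather')  # then time the write as in the question
```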
I'm using the harmonic mean from scipy.stats as the aggfunc parameter in pandas pivot_table, but it is orders of magnitude slower than a simple mean.
I would like to know whether this is expected behavior, or whether there is a way to make this calculation more efficient, as I need to run it thousands of times.
I need to use the harmonic mean, but it takes a huge amount of processing time.
I've tried using harmonic_mean from the statistics module in Python 3.6, but the overhead is still the same.
Thanks
import numpy as np
import pandas as pd
import statistics
from scipy.stats import hmean
data = pd.DataFrame({'value1':np.random.randint(1000,size=200000),
                     'value2':np.random.randint(24,size=200000),
                     'value3':np.random.rand(200000)+1,
                     'value4':np.random.randint(100000,size=200000)})
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=hmean)
1.74 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
1.9 s ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=np.mean)
37.4 ms ± 938 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Single run for both functions
%timeit hmean(data.value3[:100])
155 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.mean(data.value3[:100])
138 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I would recommend using multiprocessing.Pool. The code below has been tested on 20 million records and is about 3 times faster than the original, so please give it a try. The code still needs more improvement to answer your specific question about the slow performance of statistics.harmonic_mean.
Note: you can get even better results for more than 100 million records.
import time
import numpy as np
import pandas as pd
import statistics
import multiprocessing
data = pd.DataFrame({'value1':np.random.randint(1000,size=20000000),
                     'value2':np.random.randint(24,size=20000000),
                     'value3':np.random.rand(20000000)+1,
                     'value4':np.random.randint(100000,size=20000000)})
def chunk_pivot(data):
    result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
    return result
DataFrameDict=[]
for i in range(4):
    print(i*250,i*250+250)
    DataFrameDict.append(data[:][data.value1.between(i*250,i*250+249)])
def parallel_pivot(prcsr):
    # 6 is a number of processes I've tested
    p = multiprocessing.Pool(prcsr)
    out_df=[]
    for result in p.imap(chunk_pivot, DataFrameDict):
        #print (result)
        out_df.append(result)
    return out_df
start =time.time()
dict_pivot=parallel_pivot(6)
multiprocessing_result=pd.concat(dict_pivot,axis=0)
#singleprocessing_result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
end = time.time()
print(end-start)
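Another option that stays on the fast built-in aggregation path: the harmonic mean is the reciprocal of the mean of the reciprocals, so you can average 1/value3 and invert the pivoted result. This is a sketch under the assumption that all values are strictly positive, as they are in the sample data. The column name inv_value3 is made up, and the data is generated with a seeded generator instead of np.random.randint.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000
data = pd.DataFrame({
    "value1": rng.integers(1000, size=n),
    "value2": rng.integers(24, size=n),
    "value3": rng.random(n) + 1,  # strictly positive values
})

# Harmonic mean = 1 / mean(1 / x): average the reciprocals with the
# built-in "mean" aggregation, then invert the pivoted table.
inv = data.assign(inv_value3=1.0 / data["value3"])
result = 1.0 / pd.pivot_table(
    inv, index="value1", columns="value2", values="inv_value3", aggfunc="mean"
)

print(result.shape)
```

Because the aggregation is a plain mean, pandas does not need to call a Python function once per group, which is where most of the time goes with hmean or statistics.harmonic_mean.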