Why is Pandas so madly fast? How to define such functions? - python

I tried comparing the performance of Pandas and a traditional loop. I realized that, with the same input and output, Pandas performed the calculation dramatically faster than the traditional loop.
My code:
#df_1h has been imported before
import time
n = 14
pd.options.display.max_columns = 8
display("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
close = df_1h['close']
start = time.time()
df_1h['sma_14_pandas'] = close.rolling(14).mean()
end = time.time()
display('pandas: {}'.format(end - start))
start = time.time()
df_1h['sma_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    df_1h['sma_14_loop'][i] = close[i-n+1:i+1].mean()
end = time.time()
display('loop: {}'.format(end - start))
display(df_1h.tail())
Output:
"df_1h's Shape 16598 rows x 15 columns"
'pandas: 0.0030088424682617188'
'loop: 7.2529966831207275'
open_time open high low ... ignore rsi_14 sma_14_pandas sma_14_loop
16593 1.562980e+12 11707.39 11739.90 11606.04 ... 0.0 51.813151 11646.625714 11646.625714
16594 1.562983e+12 11664.32 11712.61 11625.00 ... 0.0 49.952679 11646.834286 11646.834286
16595 1.562987e+12 11632.64 11686.47 11510.00 ... 0.0 47.583619 11643.321429 11643.321429
16596 1.562990e+12 11582.06 11624.04 11500.00 ... 0.0 48.725262 11644.912857 11644.912857
16597 1.562994e+12 11604.96 11660.00 11588.16 ... 0.0 50.797087 11656.723571 11656.723571
5 rows × 15 columns
Pandas is almost 2.5k times faster!!!
My Questions:
Is my code wrong?
If my code is correct, why is Pandas so fast?
How can I define custom functions that run this fast with Pandas?

As to your three questions:
Your code is correct in the sense that it produces the correct result. However, explicitly iterating over the rows of a dataframe is, as a rule, a bad idea performance-wise. Most often the same result can be achieved far more efficiently with built-in pandas methods (as you demonstrated yourself).
Pandas is so fast because it uses numpy under the hood. Numpy implements highly efficient array operations. Also, the original creator of pandas, Wes McKinney, is kinda obsessed with efficiency and speed.
Use numpy or other optimized libraries. I recommend reading the Enhancing performance section of the pandas docs. If you can't use built-in pandas methods, it often makes sense to retrieve a numpy representation of the dataframe or series (using the values attribute or the to_numpy() method), do all the calculations on the numpy array, and only then store the result back into the dataframe or series.
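A minimal sketch of that pattern (the column name and the log-return calculation are just illustrative, not taken from the question):
import numpy as np
import pandas as pd

# Hypothetical example frame; only the 'close' column name mirrors the question.
df = pd.DataFrame({'close': np.random.uniform(10000, 15000, 1000)})

# 1. Pull the raw numpy array out of the Series.
arr = df['close'].to_numpy()          # or df['close'].values

# 2. Do the heavy lifting in numpy (here: simple log returns).
log_returns = np.empty_like(arr)
log_returns[0] = np.nan
log_returns[1:] = np.log(arr[1:] / arr[:-1])

# 3. Store the result back into the dataframe in a single assignment.
df['log_return'] = log_returns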
Why is the loop algorithm so slow?
In your loop algorithm, the mean is calculated over 16,500 times, and each call slices out 14 elements and adds them all up again from scratch (with pandas indexing overhead on every row on top of that). Pandas' rolling method uses a more sophisticated approach, essentially maintaining a running window sum, which heavily reduces the number of arithmetic operations.
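To illustrate the idea (this is only a sketch of the principle, not pandas' actual compiled implementation):
import numpy as np

def rolling_mean_running_sum(a, n=14):
    # Rolling mean that updates a running sum instead of re-summing the window.
    a = np.asarray(a, dtype=float)
    out = np.full(a.shape, np.nan)
    window_sum = a[:n].sum()
    out[n - 1] = window_sum / n
    for i in range(n, len(a)):
        window_sum += a[i] - a[i - n]   # add the newest element, drop the oldest
        out[i] = window_sum / n
    return out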
You can achieve similar (and in fact about 3 times better) performance than pandas if you do the calculations in numpy. This is illustrated in the following example:
import pandas as pd
import numpy as np
import time
data = np.random.uniform(10000,15000,16598)
df_1h = pd.DataFrame(data, columns=['Close'])
close = df_1h['Close']
n = 14
print("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
start = time.time()
df_1h['SMA_14_pandas'] = close.rolling(14).mean()
print('pandas: {}'.format(time.time() - start))
start = time.time()
df_1h['SMA_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    df_1h['SMA_14_loop'][i] = close[i-n+1:i+1].mean()
print('loop: {}'.format(time.time() - start))

def np_sma(a, n=14):
    ret = np.cumsum(a)
    ret[n:] = ret[n:] - ret[:-n]
    return np.append([np.nan]*(n-1), ret[n-1:] / n)
start = time.time()
df_1h['SMA_14_np'] = np_sma(close.values)
print('np: {}'.format(time.time() - start))
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_pandas.values, equal_nan=True)
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_np.values, equal_nan=True)
Output:
df_1h's Shape 16598 rows x 1 columns
pandas: 0.0031278133392333984
loop: 7.605962753295898
np: 0.0010571479797363281

Related

Slow down in pandas operations in python

I have the following code that operates on a pandas dataframe. Each repetition of the operation gets geometrically slower than the previous one.
import time
import pandas as pd
import numpy as np
def computation(_DF):
    for key in _DF.columns:
        _DF[f'{key}_1'] = _DF[f'{key}'].shift(1)
        _DF[f'{key}_diff'] = _DF[f'{key}'].diff()
    return _DF[-1:]

DF = pd.DataFrame()
for idx in range(5):
    DF[f'col_{idx}'] = np.random.randint(10, size=10)

avg = 0
for idx in range(10):
    startnow = time.time()
    result = computation(DF)
    avg += time.time() - startnow
    print(f'Time: {time.time()-startnow}')
print(f'Average: {float(avg/10)}')
The result is:
Time: 0.0017206668853759766
Time: 0.0038039684295654297
Time: 0.008582830429077148
Time: 0.023931026458740234
Time: 0.03588294982910156
Time: 0.07429194450378418
Time: 0.15292000770568848
Time: 0.34928417205810547
Time: 0.880706787109375
Time: 2.5228471755981445
Average: 0.4053963661193848
The problem seems to be very basic and yet I cannot find the 'duh' solution. In the real code, computation only needs n rows from the dataframe (where n >> 10) and only a few rows from the computation need to be sent back. I have tried copying the result to a variable and deleting the _DF dataframe (setting it up for the garbage collector), but it makes no difference.
Thanks
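A likely explanation, based only on reading the snippet above (it is not confirmed in the thread): computation() mutates the dataframe it is handed, so every call adds a _1 and a _diff column for each column that already exists. The column count therefore grows geometrically across the ten calls, and so does the work per call. A minimal sketch of the same computation done on a copy, which keeps the input frame at its original five columns:
import numpy as np
import pandas as pd

def computation(df):
    out = df.copy()                      # work on a copy, leave the caller's frame untouched
    for key in df.columns:
        out[f'{key}_1'] = df[key].shift(1)
        out[f'{key}_diff'] = df[key].diff()
    return out[-1:]

DF = pd.DataFrame({f'col_{i}': np.random.randint(10, size=10) for i in range(5)})
for _ in range(10):
    result = computation(DF)             # every call now sees the same 5 input columns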

Data Cleaning (Flagging) Dead Sensor

I have a large time series (pandas dataframe) of wind speed (10-minute averages) which contains erroneous data from a dead sensor. How can it be flagged automatically? I was trying a moving average.
Any approach other than a moving average would be much appreciated. I have attached the sample data image below.
There are several ways to deal with this problem. I will first switch to differences:
%matplotlib inline
import pandas as pd
import numpy as np
np.random.seed(0)
n = 200
y = np.cumsum(np.random.randn(n))
y[100:120] = 2
y[150:160] = 0
ts = pd.Series(y)
ts.diff().plot();
The next step is to find how long the streaks of consecutive zeros are.
def getZeroStrikeLen(x):
    """Accept a boolean array only."""
    res = np.diff(np.where(np.concatenate(([x[0]],
                                           x[:-1] != x[1:],
                                           [True])))[0])[::2]
    return res

vec = ts.diff().values == 0
out = getZeroStrikeLen(vec)
Now if len(out) > 0 you can conclude that there is a problem. If you want to go one step further you can have a look at this. It is in R but it's not that hard to replicate in Python.
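As a follow-up sketch of my own (not part of the original answer), continuing with the ts series defined above: you can also flag the affected samples directly, treating any run of at least k consecutive zero differences as a dead-sensor stretch (the threshold k is an assumption you would tune):
k = 5                                                    # minimum run length to call the sensor "dead"
zero_diff = ts.diff().eq(0)

run_id = (zero_diff != zero_diff.shift()).cumsum()       # label each run of equal values
run_len = zero_diff.groupby(run_id).transform('size')    # length of the run each sample sits in
dead = zero_diff & (run_len >= k)                        # True where the sensor looks dead

flagged = ts[dead]                                       # the suspicious samples, ready for inspection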

Huge quantity of Dask computations causing memory issues

I am working on a task where I need to determine whether two geospatial points are within 250 meters of each other and occur within 20 minutes of each other. My data set is approximately 1.2M rows and 10 columns, so I need to compute a distance, a time difference, and whether they meet my criteria for 1.2M**2 pairs.
I have been able to run the code below where I create 10,000 Dask objects to compute without problem. However, when I attempt to test 100,000 objects Dask runs up against memory limitations and I see significant CPU usage for swap. To be clear, I'm running this on a 32 core node with 125 GB of memory.
Admittedly, I'm quite new to Dask, so I'd like to know: is there a better way to solve this problem than processing in 10,000 row chunks?
#!/usr/bin/env python
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.array import sqrt
import time
import multiprocessing as mp
df = pd.read_hdf(...) # Used to select single item for comparison
ddf = dd.read_hdf(...) # Used for Dask operations
def distCheck(item, df=ddf):
    '''
    Determine if any records in df are within 250m of item and within 20
    minutes of item. Return Dask object for calculation.
    '''
    dist = sqrt(((ddf.LCC_x1-item.LCC_x1)**2+(ddf.LCC_y1-item.LCC_y1)**2))
    distcrit = dist[dist < 250]
    delta = (ddf.Date - item.Date).abs()
    timecrit = delta[delta < np.timedelta64(20,'m')]
    res1 = ddf.copy()
    res1['dist'] = dist
    res1['delta'] = delta
    res1 = res1.loc[(distcrit.index) & (timecrit.index) & (idcrit.index)]
    res1['MatchMMSI'] = item.MMSI
    res1['MatchVoy'] = item.Voyage
    out = res1
    return out

def getDaskCalls(start, stop):
    '''
    Get Dask objects to assess temporal and spatial proximity for df
    indices from start to stop.
    '''
    # Kick off multiprocessing pool, submit, and close
    pool = mp.Pool(processes=32)
    daskers = []
    for i in range(start, stop):
        result = pool.apply_async(distCheck, args=(df.iloc[i,:], ddf,))
        daskers.append(result)
    dasky = [i.get() for i in daskers]
    pool.close()
    return dasky

def runDask(calls):
    result = pd.DataFrame([], columns=calls[0].columns)
    output = dd.compute(calls)
    result = pd.concat([result] + [i for i in output[0] if i.shape[0] != 0])
    return result
###
### Process
###
# Get initial timestamp
start = time.time()
# Create Dask Calls & determine duration
dcalls = getDaskCalls(0,10000)
callsCreated = time.time()
# Print time required to create calls
print("Dask Calls Created.")
print(callsCreated-start)
# Compute the calls with Dask
print("Computing...")
result = runDask(dcalls)
# Print the time for computation
computation = time.time()
print(" ...Done.")
print(computation-callsCreated)

Why does using multiprocessing with pandas apply lead to such a dramatic speedup?

Suppose I have a pandas dataframe and a function I'd like to apply to each row. I can call df.apply(apply_fn, axis=1), which should take time linear in the size of df. Or I can split df and use pool.map to call my function on each piece, and then concatenate the results.
I was expecting the speedup factor from using pool.map to be roughly equal to the number of processes in the pool (new_execution_time = original_execution_time/N if using N processors -- and that's assuming zero overhead).
Instead, in this toy example, time falls to around 2% (0.005272 / 0.230757) when using 4 processors. I was expecting 25% at best. What is going on and what am I not understanding?
import numpy as np
from multiprocessing import Pool
import pandas as pd
import pdb
import time
n = 1000
variables = {"hello":np.arange(n), "there":np.random.randn(n)}
df = pd.DataFrame(variables)
def apply_fn(series):
    return pd.Series({"col_5": 5, "col_88": 88,
                      "sum_hello_there": series["hello"] + series["there"]})

def call_apply_fn(df):
    return df.apply(apply_fn, axis=1)
n_processes = 4 # My machine has 4 CPUs
pool = Pool(processes=n_processes)
t0 = time.process_time()
new_df = df.apply(apply_fn, axis=1)
t1 = time.process_time()
df_split = np.array_split(df, n_processes)
pool_results = pool.map(call_apply_fn, df_split)
new_df2 = pd.concat(pool_results)
t2 = time.process_time()
new_df3 = df.apply(apply_fn, axis=1) # Try df.apply a second time
t3 = time.process_time()
print("identical results: %s" % np.all(np.isclose(new_df, new_df2))) # True
print("t1 - t0 = %f" % (t1 - t0)) # I got 0.230757
print("t2 - t1 = %f" % (t2 - t1)) # I got 0.005272
print("t3 - t2 = %f" % (t3 - t2)) # I got 0.229413
I saved the code above and ran it using python3 my_filename.py.
PS I realize that in this toy example new_df can be created in a much more straightforward way, without using apply. I'm interested in applying similar code with a more complex apply_fn that doesn't just add columns.
Edit (My previous answer was actually wrong.)
time.process_time() (doc) measures time only in the current process (and doesn't include sleeping time). So the time spent in child processes is not taken into account.
I ran your code with time.time(), which measures real-world (wall-clock) time, and saw no speedup at all; with the more reliable timeit.timeit I got about a 50% speedup. I have 4 cores.
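For reference, a minimal sketch of the wall-clock measurement (reusing the names from the question's snippet; this is my illustration, not the exact code I ran):
import time

t0 = time.time()
new_df = df.apply(apply_fn, axis=1)               # serial apply
t1 = time.time()
pool_results = pool.map(call_apply_fn, df_split)  # parallel apply, work happens in child processes
new_df2 = pd.concat(pool_results)
t2 = time.time()

print("serial:   %.6f s" % (t1 - t0))   # wall-clock time includes the children's work,
print("parallel: %.6f s" % (t2 - t1))   # unlike time.process_time()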

Python, parsing data 24 hours at a time out of 263 days

I have an Excel file (to be converted to CSV; a link).
The data has 8 columns. The first two are day of the year and time respectively, while the two before the last are minimum temperature and maximum temperature. For each day I need to find the maximum and minimum of the day, subtract them, and save the value for that day.
Two problems I ran into: how do I parse 24 lines at a time (there are no missing data lines!), and how do I find the maximum and minimum in each batch?
I have 63126 lines = 24 hr * 263 days.
So, to iterate through the lines:
import numpy as np
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
dt_per_all_days=[]
for i in range(0, 63126, 1):
    # I get stuck here how to limit the iteration for 24 at a time.
    # if I can do that I think I can get the rest done.
    min_d = []
    max_d = []
    min_d.append(up_air_min[i])
    max_d.append(up_air_max[i])
    max_per_day = max(max_d)
    min_per_day = min(min_d)
    dt_d = max_per_day - min_per_day
    dt_per_all_days.append(dt_d)
    del(min_d)
    del(max_d)
    # move to the next batch of 24 lines....
Use the Numpy, Luke, avoid for-loops.
Once you have the up_air_min and up_air_max numpy arrays, you can easily do what you want using numpy element-wise functions.
First, create 2D arrays with 263 rows (one per day) and 24 columns, like this:
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
Then use np.min and np.max functions along axis 1 (good array tip sheet):
min_temperature = np.min(min_matrix, axis=1)
max_temperature = np.max(max_matrix, axis=1)
And find the difference:
dt = max_temperature - min_temperature
dt is an array with the needed values. Let's save it to foo.csv:
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
And the final code looks like this:
import numpy as np
# This I got from your answer.
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
# Split arrays and create matrix with 263 lines-days and 24 values in every line.
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
# Find min temperature for every day. min_temperature is an array with 263 values.
min_temperature = np.min(min_matrix, axis=1)
# The same for max temperature.
max_temperature = np.max(max_matrix, axis=1)
# Subtract min temperature from max.
dt = max_temperature - min_temperature
# Save result in csv.
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
A reasonably pythonic way to do this would be to have a function that loops over the rows, gathering them up and spitting out the gathered rows using yield when the day changes. This gives you a generator that generates 263 lists each holding 24 values, which is a bit easier to process.
If you've definitely not got any missing values, you could use a trivial doubly-nested loop without batching up the elements first. That's a bit more fragile, but it sounds like you might not be planning to re-use the code anyway.
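A minimal sketch of that generator approach (my own illustration; it assumes the day-of-year value is in column 0 and the min/max temperatures in columns 5 and 6, as in the question's usecols):
def days(rows):
    # Yield one list of rows per day, starting a new batch when the
    # day-of-year value in column 0 changes.
    batch = []
    current_day = None
    for row in rows:
        if current_day is not None and row[0] != current_day:
            yield batch
            batch = []
        current_day = row[0]
        batch.append(row)
    if batch:
        yield batch

# Usage (hypothetical): each `day` is a list of 24 rows read from the CSV.
# for day in days(rows):
#     dt = max(r[6] for r in day) - min(r[5] for r in day)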
Here's a somewhat contrived example of how you could chunk things by 24 lines at a time.
from StringIO import StringIO
from random import random as r
import numpy as np
import operator
s = StringIO()
for x in xrange(0, 10000):
    s.write('%f,%f,%f\n' % (r(), r()*10, r()*100))
s.seek(0)
data = np.genfromtxt(s, dtype=None, names=['pitch','yaw','thrust'], delimiter=',')
for x in range(0, len(data), 24):
    print('Acting on hours %d through %d' % (x, x+24))
    one_day = data[x:x+24]
    minimum_yaw = min(one_day['yaw'])
    max_yaw = max(one_day['yaw'])
    print 'min', minimum_yaw, 'max', max_yaw, 'one_day', one_day['yaw']
