I have the following code that operates on a pandas dataframe. Repeating the operation makes the operation slower, geometrically.
import time
import pandas as pd
import numpy as np
def computation(_DF):
for key in _DF.columns:
_DF[f'{key}_1'] = _DF[f'{key}'].shift(1)
_DF[f'{key}_diff'] = _DF[f'{key}'].diff()
return _DF[-1:]
DF = pd.DataFrame()
for idx in range(5):
DF[f'col_{idx}']=np.random.randint(10,size=10)
avg=0
for idx in range(10):
startnow = time.time()
result = computation(DF)
avg+=time.time()-startnow
print(f'Time: {time.time()-startnow}')
print(f'Average: {float(avg/10)}')
The result is:
Time: 0.0017206668853759766
Time: 0.0038039684295654297
Time: 0.008582830429077148
Time: 0.023931026458740234
Time: 0.03588294982910156
Time: 0.07429194450378418
Time: 0.15292000770568848
Time: 0.34928417205810547
Time: 0.880706787109375
Time: 2.5228471755981445
Average: 0.4053963661193848
The problem seems to be very basic and yet I cannot find the 'duh' solution. In the real code, computation only needs n rows from the dataframe (where n>>10) and only a few rows from the computation needs to be sent back. I have tried copying the result to a variable and deleting the _DF dataframe (setting it up for the garbage collector) but it makes no difference.
Thanks
I have a large timeseries(pandas dataframe) of windspeed (10min average) which contains error data (dead sensor). How can it be flagged automatically. I was trying with moving average.
Some other approach other then moving average is much appreciated. I have attached the sample data image below.
There are several ways to deal with this problem. I will first pass to differences:
%matplotlib inline
import pandas as pd
import numpy as np
np.random.seed(0)
n = 200
y = np.cumsum(np.random.randn(n))
y[100:120] = 2
y[150:160] = 0
ts = pd.Series(y)
ts.diff().plot();
The next step is to find how long are the strikes of consecutive zeros.
def getZeroStrikeLen(x):
""" Accept a boolean array only
"""
res = np.diff(np.where(np.concatenate(([x[0]],
x[:-1] != x[1:],
[True])))[0])[::2]
return res
vec = ts.diff().values == 0
out = getZeroStrikeLen(vec)
Now if len(out)>0 you can conclude that there is a problem. If you want to go one step further you can have a look to this. It is in R but it's not that hard to replicate in Python.
I am working on a task where I need to determine where two geospatial points are within 250 meters of each other and occur within 20 minutes of each other. My data set is approximately 1.2M rows and 10 columns. So, I need to determine a distance, time difference, and whether they meet my criteria by going through 1.2M**2 calculations.
I have been able to run the code below where I create 10,000 Dask objects to compute without problem. However, when I attempt to test 100,000 objects Dask runs up against memory limitations and I see significant CPU usage for swap. To be clear, I'm running this on a 32 core node with 125 GB of memory.
Admittedly, I'm quite new to Dask, so I'd like to know: is there a better way to solve this problem than processing in 10,000 row chunks?
#!/usr/bin/env python
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.array import sqrt
import time
import multiprocessing as mp
df = pd.read_hdf(...) # Used to select single item for comparison
ddf = dd.read_hdf(...) # Used for Dask operations
def distCheck(item,df=ddf):
'''
Determine if any records in df are within 250m of item and within 20
minutes of item. Return Dask object for calculation.
'''
dist = sqrt(((ddf.LCC_x1-item.LCC_x1)**2+(ddf.LCC_y1-item.LCC_y1)**2))
distcrit = dist[dist < 250]
delta = (ddf.Date - item.Date).abs()
timecrit = delta[delta < np.timedelta64(20,'m')]
res1 = ddf.copy()
res1['dist'] = dist
res1['delta'] = delta
res1 = res1.loc[(distcrit.index) & (timecrit.index) & (idcrit.index)]
res1['MatchMMSI'] = item.MMSI
res1['MatchVoy'] = item.Voyage
out = res1
return out
def getDaskCalls(start,stop):
'''
Get Dask objects to assess temporal and spatial proximity for df
indices from start to stop.
'''
# Kick off multiprocessing pool, submit, and close
pool = mp.Pool(processes=32)
daskers = []
for i in range(start,stop):
result = pool.apply_async(distCheck,args=(df.iloc[i,:],ddf,))
daskers.append(result)
dasky = [i.get() for i in daskers]
pool.close()
return dasky
def runDask(calls):
result = pd.DataFrame([],columns=calls[0].columns)
output = dd.compute(calls)
result = pd.concat([result]+[i for i in output[0] if i.shape[0] != 0])
return result
###
### Process
###
# Get initial timestamp
start = time.time()
# Create Dask Calls & determine duration
dcalls = getDaskCalls(0,10000)
callsCreated = time.time()
# Print time required to create calls
print("Dask Calls Created.")
print(callsCreated-start)
# Compute the calls with Dask
print("Computing...")
result = runDask(dcalls)
# Print the time for computation
computation = time.time()
print(" ...Done.")
print(computation-callsCreated)
Suppose I have a pandas dataframe and a function I'd like to apply to each row. I can call df.apply(apply_fn, axis=1), which should take time linear in the size of df. Or I can split df and use pool.map to call my function on each piece, and then concatenate the results.
I was expecting the speedup factor from using pool.map to be roughly equal to the number of processes in the pool (new_execution_time = original_execution_time/N if using N processors -- and that's assuming zero overhead).
Instead, in this toy example, time falls to around 2% (0.005272 / 0.230757) when using 4 processors. I was expecting 25% at best. What is going on and what am I not understanding?
import numpy as np
from multiprocessing import Pool
import pandas as pd
import pdb
import time
n = 1000
variables = {"hello":np.arange(n), "there":np.random.randn(n)}
df = pd.DataFrame(variables)
def apply_fn(series):
return pd.Series({"col_5":5, "col_88":88,
"sum_hello_there":series["hello"] + series["there"]})
def call_apply_fn(df):
return df.apply(apply_fn, axis=1)
n_processes = 4 # My machine has 4 CPUs
pool = Pool(processes=n_processes)
t0 = time.process_time()
new_df = df.apply(apply_fn, axis=1)
t1 = time.process_time()
df_split = np.array_split(df, n_processes)
pool_results = pool.map(call_apply_fn, df_split)
new_df2 = pd.concat(pool_results)
t2 = time.process_time()
new_df3 = df.apply(apply_fn, axis=1) # Try df.apply a second time
t3 = time.process_time()
print("identical results: %s" % np.all(np.isclose(new_df, new_df2))) # True
print("t1 - t0 = %f" % (t1 - t0)) # I got 0.230757
print("t2 - t1 = %f" % (t2 - t1)) # I got 0.005272
print("t3 - t2 = %f" % (t3 - t2)) # I got 0.229413
I saved the code above and ran it using python3 my_filename.py.
PS I realize that in this toy example new_df can be created in a much more straightforward way, without using apply. I'm interested in applying similar code with a more complex apply_fn that doesn't just add columns.
Edit (My previous answer was actually wrong.)
time.process_time() (doc) measures time only in the current process (and doesn't include sleeping time). So the time spent in child processes is not taken into account.
I run your code with time.time(), which measures real-world time (showing no speedup at all) and with a more reliable timeit.timeit (about 50% speedup). I have 4 cores.
I have an excel/( to be converted to CSV a link) file.
The data- , has 8 columns. The first two are day of the year and time respectively while two before the last are minimum temperature and maximum temperature. For each day I need to find the maximum and minimum of the day subtract and save the value for that day.
Two problems I ran into, how do I parse 24 lines at a time ( there are no missing data lines!) and in each batch find the maximum or minimum.
I have 63126 lines=24 hr*263 days
So to iterate through the lines;
import numpy as np
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
dt_per_all_days=[]
for i in range (0,63126,1):
# I get stuck here how to limit the iteration for 24 at a time.
# if I can do that I think I can get the rest done.
min_d=[]
max_d=[]
min_d.append( up_air_min[i])
max_d.append( up_air_max[i])
max_per_day=max(max_d)
min_per_day=min(min_d)
dt_d=max_per_day-min_per_day
dt_per_all_days.append(dt_d)
del(min_d)
del(max_d)
move to the next batch of 24 lines....
`
Use the Numpy, Luke, avoid for-loops.
Then you have ap_air_min and ap_air_max numpy arrays you can easily do what you want by using numpy element-wise functions.
At first, create 2d array with 263 rows (one for a day) and 24 columns like this:
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
Then use np.min and np.max functions along axis 1 (good array tip sheet):
min_temperature = np.min(min_matrix, axis=1)
max_temperature = mp.max(max_matrix, axis=1)
And find the difference:
dt = max_temperature - min_temperature
dt is array with needed values. Let's save it to foo.csv:
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
And final code looks like this:
import numpy as np
# This I got from your answer.
input_temps='/L7_HW_SASP_w1112.csv'
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
# Split arrays and create matrix with 263 lines-days and 24 values in every line.
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
# Find min temperature for every day. min_temperature is an array with 263 values.
min_temperature = np.min(min_matrix, axis=1)
# The same for max temperature.
max_temperature = mp.max(max_matrix, axis=1)
# Subtract min temperature from max.
dt = max_temperature - min_temperature
# Save result in csv.
np.savetxt('foo.csv', np.swapaxes([day_year, dt], 0, 1), delimiter=',')
A reasonably pythonic way to do this would be to have a function that loops over the rows, gathering them up and spitting out the gathered rows using yield when the day changes. This gives you a generator that generates 263 lists each holding 24 values, which is a bit easier to process.
If you've definitely not got any missing values, you could use a trivial doubly-nested loop without batching up the elements first. That's a bit more fragile, but it sounds like you might not be planning to re-use the code anyway.
Here's a somewhat contrived example of how you could chunk things by 24 lines at a time.
from StringIO import StringIO
from random import random as r
import numpy as np
import operator
s = StringIO()
for x in xrange(0,10000):
s.write('%f,%f,%f\n' % (r(),r()*10,r()*100))
s.seek(0)
data = np.genfromtxt(s,dtype=None, names=['pitch','yaw','thrust'], delimiter=',')
for x in range(0,len(data),24):
print('Acting on hours %d through %d' % (x, x+24))
one_day = data[x:x+24]
minimum_yaw = min(one_day['yaw'])
max_yaw = max(one_day['yaw'])
print 'min',minimum_yaw,'max',max_yaw,'one_day',one_day['yaw']