Applying a function to every observation in a dataframe - python

I have a large df of coordinates that I'm putting through a function (a reverse geocoder).
How can I run the whole df through the function without iterating? (Iterating takes very long.)
Example df:
Latitude Longitude
0 -25.66026 28.0914
1 -25.67923 28.10525
2 -30.68456 19.21694
3 -30.12345 22.34256
4 -15.12546 17.12365
After running through the function I want (without a for loop...) a df:
City
0 HappyPlace
1 SadPlace
2 AveragePlace
3 CoolPlace
4 BadPlace
Note: I don't need to know how to do reverse geocoding; this is a question about applying a function to a whole df without iteration.
EDIT:
Using df.apply() might not work, as my code looks like this:
for i in range(len(df)):
    results = g.reverse_geocode(df['LATITUDE'][i], df['LONGITUDE'][i])
    city.append(results.city)
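(For reference, the loop above maps onto a row-wise apply; below is a minimal sketch, assuming the columns really are named LATITUDE/LONGITUDE and that g is the geocoder object. Note that apply still calls the function once per row, so on its own it does not remove the per-call latency.)

df['City'] = df.apply(
    lambda row: g.reverse_geocode(row['LATITUDE'], row['LONGITUDE']).city,
    axis=1)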

Slower approach: iterating through the list of geo points and fetching the city of each geo point.
import pandas as pd
import time

d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)

# example stand-in for g.reverse_geocode() -> geo_reverse
def geo_reverse(lat, long):
    time.sleep(2)  # assuming that your reverse_geocode takes 2 seconds
    print(lat, long)

for i in range(len(df)):
    results = geo_reverse(df['Latitude'][i], df['Longitude'][i])
Because of time.sleep(2), the above program will take at least 20 seconds to process all ten geo points.
A better approach than the above:
import pandas as pd
import threading
import time

d = {'Latitude': [-25.66026,-25.67923,-30.68456,-30.12345,-15.12546,-25.66026,-25.67923,-30.68456,-30.12345,-15.12546], 'Longitude': [28.0914, 28.10525,19.21694,22.34256,17.12365,28.0914, 28.10525,19.21694,22.34256,17.12365]}
df = pd.DataFrame(data=d)

def runnable_method(f, args):
    result_info = [threading.Event(), None]
    def runit():
        result_info[1] = f(args)
        result_info[0].set()
    threading.Thread(target=runit).start()
    return result_info

def gather_results(result_infos):
    results = []
    for i in range(len(result_infos)):
        result_infos[i][0].wait()
        results.append(result_infos[i][1])
    return results

def geo_reverse(args):
    time.sleep(2)
    return "City Name of (" + str(args[0]) + "," + str(args[1]) + ")"

geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)

result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]
cities_result = gather_results(result_info)
print(cities_result)
Notice that the method geo_reverse takes 2 seconds per call to fetch the data for a geo point. In this second example the code takes only about 2 seconds to process as many points as you want.
Note: Try both approaches assuming that your geo_reverse takes approx. 2 seconds to fetch data. The first approach will take 20+1 seconds and its processing time grows with the number of inputs, but the second approach has an almost constant processing time (approx. 2+1 seconds) no matter how many geo points you want to process.
Assume the g.reverse_geocode() method corresponds to geo_reverse() in the code above. Run both approaches separately and see the difference for yourself.
Explanation:
The major part of the code above is creating a list of tuples and then using a list comprehension to pass each tuple to a dynamically created thread:
# Converting the df of geo points into a list of tuples
geo_points = []
for i in range(len(df)):
    tuple_i = (df['Latitude'][i], df['Longitude'][i])
    geo_points.append(tuple_i)

# List comprehension with the custom method, creating one runnable thread per tuple
result_info = [runnable_method(geo_reverse, geo_point) for geo_point in geo_points]

# Gather the result from each thread
cities_result = gather_results(result_info)
print(cities_result)
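As an aside, the same fan-out can be written more compactly with a thread pool from the standard library. A minimal sketch, reusing geo_reverse and geo_points from above (executor.map keeps the results in input order):

from concurrent.futures import ThreadPoolExecutor

# one worker per geo point mirrors the thread-per-call approach above;
# in practice you would cap max_workers at something reasonable
with ThreadPoolExecutor(max_workers=len(geo_points)) as executor:
    cities_result = list(executor.map(geo_reverse, geo_points))

print(cities_result)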

Related

Assemble a dataframe from two csv files

I wrote the following code to form a data frame containing the energy consumption and the temperature. The data for each of the variables is collected from a different csv file:
def match_data():
    pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
    temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
    new_time = []
    new_pwr = []
    new_tmp = []
    for i in range(1, len(pwr_data)):
        for j in range(1, len(temp_data)):
            if pwr_data['time'][i] == temp_data['Date'][j]:
                time = pwr_data['time'][i]
                pwr = pwr_data['watt_hour'][i]
                tmp = temp_data['Temp'][j]
                new_time.append(time)
                new_pwr.append(pwr)
                new_tmp.append(tmp)
    return pd.DataFrame({'Time': new_time, 'watt_hour': new_pwr, 'Temp': new_tmp})
I was trying to collect data with matching time indices so that I can assemble them in a data frame.
The code works, but it takes a long time (43 seconds for around 1300 data points). At the moment I don't have much data, but I was wondering if there is a more efficient and faster way to do this.
Do the pwr_data['time'] and temp_data['Date'] columns have the same granularity?
If so, you can pd.merge() the two dataframes after reading them.
# read data
pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
# merge data on time and Date columns
# you can set the how to be 'inner' or 'right' depending on your needs
df = pd.merge(pwr_data, temp_data, how='left', left_on='time', right_on='Date')
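If the two time columns turn out to have different granularities (say, one is per minute and the other per hour), one option is to parse both as datetimes and floor them to a common resolution before merging. A sketch, assuming hourly data:

# hypothetical alignment step before the merge
pwr_data['time'] = pd.to_datetime(pwr_data['time']).dt.floor('H')
temp_data['Date'] = pd.to_datetime(temp_data['Date']).dt.floor('H')
df = pd.merge(pwr_data, temp_data, how='left', left_on='time', right_on='Date')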
Just like @greco recommended, this did the trick, and in no time!
pd.merge(pwr_data,temp_data,how='inner',left_on='time',right_on='Date')
'time' and 'Date' are the columns on which you want to base the merge.

How to speed up iterating through an array / matrix? Tried both pandas and numpy arrays

I would like to do some feature enrichment on a large 2-dimensional array (15 columns × 100m rows).
Working on a sample set with 100'000 records showed that I need to get this faster.
Edit (data model info)
To simplify, let's say we have only two relevant columns:
IP (identifier)
Unix (timestamp in seconds since 1970)
I would like to add a 3rd column, counting how many times this IP has shown up in the past 12 hours.
End edit
My first attempt was using pandas, because it was comfortable working with named dimensions, but too slow:
for index, row in tqdm_notebook(myData.iterrows(), desc='iterrows'):
    # how many times was the IP address (and specific device) around in the prior 12h?
    hours = 12
    seen = myData[(myData['ip'] == row['ip'])
                  & (myData['device'] == row['device'])
                  & (myData['os'] == row['os'])
                  & (myData['unix'] < row['unix'])
                  & (myData['unix'] > (row['unix'] - (60*60*hours)))].shape[0]
    ip_seen = myData[(myData['ip'] == row['ip'])
                     & (myData['unix'] < row['unix'])
                     & (myData['unix'] > (row['unix'] - (60*60*hours)))].shape[0]
    myData.loc[index, 'seen'] = seen
    myData.loc[index, 'ip_seen'] = ip_seen
Then I switched to numpy arrays and hoped for a better result, but it is still too slow to run against the full dataset:
# speed test with numpy arrays
for i in np.arange(myArray.shape[0]):
    hours = 12
    ip, device, os, ts = myArray[i, [0, 3, 4, 12]]
    ip_seen = myArray[(np.where((myArray[:, 0] == ip)
                                & (myArray[:, 12] < ts)
                                & (myArray[:, 12] > (ts - 60*60*hours))))].shape[0]
    device_seen = myArray[(np.where((myArray[:, 0] == ip)
                                    & (myArray[:, 2] == device)
                                    & (myArray[:, 3] == os)
                                    & (myArray[:, 12] < ts)
                                    & (myArray[:, 12] > (ts - 60*60*hours))))].shape[0]
    myArray[i, 13] = ip_seen
    myArray[i, 14] = device_seen
My next idea would be to iterate only once, and maintain a growing dictionary of the current count, instead of looking backwards in every iteration.
But that would have some other drawbacks (e.g. how to keep track of when to reduce the count for observations falling out of the 12h window).
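For illustration, a bare sketch of that single-pass idea (assuming the rows are sorted by unix and keeping a per-IP queue of timestamps that gets expired on each visit; whether this is fast enough at this scale is exactly what I'm unsure about):

from collections import defaultdict, deque

WINDOW = 60 * 60 * 12  # 12 hours in seconds

def ip_seen_counts(rows):
    """rows: iterable of (ip, unix) tuples, sorted by unix ascending."""
    recent = defaultdict(deque)  # ip -> timestamps seen within the window
    counts = []
    for ip, ts in rows:
        q = recent[ip]
        # drop observations that fell out of the 12h window
        while q and q[0] <= ts - WINDOW:
            q.popleft()
        counts.append(len(q))  # how often this IP was seen in the prior 12h
        q.append(ts)
    return counts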
How would you approach this problem?
Could it even be an option to use low-level TensorFlow functions to involve a GPU?
Thanks
The only way to speed things up is to avoid looping. In your case you can try using rolling with a window of the time span that you want, using the Unix timestamp as a datetime index (assuming that the records are sorted by timestamp; otherwise you would need to sort first). This should work fine for ip_seen:
ip = myData['ip']
ip.index = pd.to_datetime(myData['unix'], unit='s')
myData['ip_seen'] = ip.rolling('5h') \
    .agg(lambda w: np.count_nonzero(w[:-1] == w[-1])) \
    .values.astype(np.int32)
However, when the aggregation involves multiple columns, like in the seen column, it gets more complicated. Currently (see Pandas issue #15095) rolling functions do not support aggregations spanning two dimensions. A workaround could be merging the columns of interest into a single new series, for example a tuple (which may work better if values are numbers) or a string (which may be better if values are already strings). For example:
criteria = myData['ip'] + '|' + myData['device'] + '|' + myData['os']
criteria.index = pd.to_datetime(myData['unix'], unit='s')
myData['seen'] = criteria.rolling('5h') \
    .agg(lambda w: np.count_nonzero(w[:-1] == w[-1])) \
    .values.astype(np.int32)
EDIT
Apparently rolling only works with numeric types, which leaves us with two options:
1. Manipulate the data to use numeric types. For the IP this is easy, since it actually represents a 32-bit number (or 64 if IPv6, I guess). For device and OS, assuming they are strings now, it gets more complicated: you would have to map each possible value to an integer and then merge it with the IP into a long value, e.g. putting these in the higher bits or something like that (maybe even impossible with IPv6, since the biggest integers NumPy supports right now are 64 bits). A short sketch of the IP-to-integer step is given after the code below.
2. Roll over the index of myData (which should now not be a datetime index, because rolling cannot work with that either) and use the index window to get the necessary data and operate:
# Use a sequential integer index
idx_orig = myData.index
myData.reset_index(drop=True, inplace=True)

# Index to roll
idx = pd.Series(myData.index)
idx.index = pd.to_datetime(myData['unix'], unit='s')

# Rolling aggregation function
def agg_seen(w, data, fields):
    # Use a slice for faster data frame slicing
    slc = slice(int(w[0]), int(w[-2])) if len(w) > 1 else []
    match = data.loc[slc, fields] == data.loc[int(w[-1]), fields]
    return np.count_nonzero(np.all(match, axis=1))

# Do the rolling
myData['ip_seen'] = idx.rolling('5h') \
    .agg(lambda w: agg_seen(w, myData, ['ip'])) \
    .values.astype(np.int32)
myData['seen'] = idx.rolling('5h') \
    .agg(lambda w: agg_seen(w, myData, ['ip', 'device', 'os'])) \
    .values.astype(np.int32)

# Put the index back
myData.index = idx_orig
This is not how rolling is meant to be used, though, and I'm not sure if this gives much better performance than just looping.
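Regarding option 1 above, a minimal sketch of mapping IPv4 strings to 32-bit integers with the standard library (column names and toy data are assumptions; packing device/OS alongside the IP is left out, since that is the tricky part):

import ipaddress

import numpy as np
import pandas as pd

# toy stand-in for myData
myData = pd.DataFrame({'ip': ['10.0.0.1', '192.168.1.7', '10.0.0.1']})

# each IPv4 address fits in an unsigned 32-bit integer
myData['ip_int'] = myData['ip'].map(lambda s: int(ipaddress.ip_address(s))).astype(np.uint32)
print(myData)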
As mentioned in the comment to @jdehesa, I took another approach which allows me to iterate only once through the entire dataset and pull the (decaying) weight from an index.
decay_window = 60*60*12  # every 12 hours
decay = 0.5              # fall by 50% every window

ip_idx = pd.DataFrame(myData.ip.unique())
ip_idx['ts_seen'] = 0
ip_idx['ip_seen'] = 0
ip_idx.columns = ['ip', 'ts_seen', 'ip_seen']
ip_idx.set_index('ip', inplace=True)

for index, row in myData.iterrows():  # all rows
    # How often was this IP seen?
    prior_ip_seen = ip_idx.loc[(row['ip'], 'ip_seen')]
    prior_ts_seen = ip_idx.loc[(row['ip'], 'ts_seen')]
    delay_since_count = row['unix'] - ip_idx.loc[(row['ip'], 'ts_seen')]
    new_ip_seen = prior_ip_seen * decay**(delay_since_count/decay_window) + 1
    ip_idx.loc[(row['ip'], 'ip_seen')] = new_ip_seen
    ip_idx.loc[(row['ip'], 'ts_seen')] = row['unix']
    myData.iloc[index, 14] = new_ip_seen - 1
That way the result is not the fixed time window requested initially; instead, prior observations "fade out" over time, giving frequent recent observations a higher weight.
This feature carries more information than the simplified (and, as it turned out, more expensive) approach initially planned.
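As a quick numeric check of the update rule used above, new = prior * decay**(Δt / decay_window) + 1 (the numbers are made up):

decay = 0.5
decay_window = 60 * 60 * 12  # 12 hours

prior = 4.0             # hypothetical previous weight for this IP
delta_t = decay_window  # last seen exactly one window (12 h) ago

new = prior * decay ** (delta_t / decay_window) + 1
print(new)  # 4.0 * 0.5 + 1 = 3.0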
Thanks for your input!
Edit
In the meantime I switched to numpy arrays for the same operation, which now only takes a fraction of the time (loop with 200m updates in <2h).
Just in case somebody looks for a starting point:
%%time
import sys

## temporary lookup
ip_seen_ts = [0]*365000
ip_seen_count = [0]*365000
cnt = 0
window = 60*60*12  # 12h
decay = 0.5
counter = 0
chunksize = 10000000

store = pd.HDFStore('store.h5')
t = time.process_time()
try:
    store.remove('myCount')
except:
    print("myCount not present.")

for myHdfData in store.select_as_multiple(['myData', 'myFeatures'],
                                          columns=['ip', 'unix', 'ip_seen'],
                                          chunksize=chunksize):
    print(counter, time.process_time() - t)
    #display(myHdfData.head(5))
    counter += chunksize
    t = time.process_time()
    sys.stdout.flush()
    keep_index = myHdfData.index.values
    myArray = myHdfData.as_matrix()
    for row in myArray[:, :]:
        i = (row[0].astype('uint32'))  # IP as identifier
        u = (row[1].astype('uint32'))  # timestamp
        try:
            delay = u - ip_seen_ts[i]
        except:
            delay = 0
        ip_seen_ts[i] = u
        try:
            ip_seen_count[i] = ip_seen_count[i]*decay**(delay/window) + 1
        except:
            ip_seen_count[i] = 1
        row[3] = np.tanh(ip_seen_count[i] - 1)  # tanh to normalize between 0 and 1
    myArrayAsDF = pd.DataFrame(myArray, columns=['c_ip', 'c_unix', 'c_ip2', 'ip_seen'])
    myArrayAsDF.set_index(keep_index, inplace=True)
    store.append('myCount', myArrayAsDF)
store.close()

Dask read_csv: skip periodically occurring lines

I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called XYZ file, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the atom number, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom number and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask dataframe does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO

import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed

atom_nr = ...
filename = ...

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)

bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offers this kind of functionality (to my knowledge).
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (plus whatever extends up to the next delimiter). You can control how fine you want your partitions to be using this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data

RAM consumption by pandas DataFrame

I am trying to work with around 100 csv files to do a time series analysis.
To build an efficient algorithm, I've structured my read_csv function so that it reads all the files at once and I don't have to repeat the same process again and again. To explain further, the following is my code:
start_date = '2016-06-01'
end_date = '2017-09-02'
allocation = 170000

# contains 100 symbols
usesymbols = ['']

cost_matrix = []

def data():
    dates = pd.date_range(start_date, end_date)
    df = pd.DataFrame(index=dates)
    for symbol in usesymbols:
        df_temp = pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)),
                              usecols=['Date', 'Close'],
                              parse_dates=True, index_col='Date', na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    return df

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))

power_set = list(powerset(usesymbols))
dataframe = data()
Problem is that if I run the above code with 15 symbols it works perfectly.
But that's not sufficient, I want to use 100 symbols.
If I run the code with 100 items in usesymbols, my RAM is used up completely and the machine freezes.
Is there anything that can be done to avoid this situation?
Edited part:
1) I have 16 GB RAM.
2) The issue is with the variable power_set; if I don't call the powerset function, the data gets retrieved easily.
DataFrame.memory_usage(index=False)
Returns:
sizes : Series — a Series whose index is the column names and whose values are the memory usage of each column, in bytes.
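A minimal usage sketch (the column names and data here are illustrative):

import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': [0.5] * 1000})

# per-column memory usage in bytes (index excluded)
print(df.memory_usage(index=False))

# total in megabytes
print(df.memory_usage(index=False).sum() / 1024**2)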

Trouble with pandas iterrows and loop counter

I have a dataset containing the US treasury curve for each day over a few years. Rows = Dates, Columns = tenor of specific treasury bond (3 mo, 1 yr, 10yr, etc)
I have python code that loops through each day and calibrates parameters for an interest rate model. I am having trouble looping through each row via iterrows and with my loop counter. The goal is to go row by row and calibrate the model to that daily curve, store the calibrated parameters in a dataframe, and then move onto the next row and repeat.
def do_calibration_model1():
    global i
    for index, row in curves.iterrows():
        day = np.array(row)  # the subsequent error_fxn uses this daily curve
        calibration()
        i += 1

def calibration():
    i = 0
    param = scipy.brute(error_fxn, bounds...., etc.)
    opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
    calibration.loc[i] = np.array(opt)  # store result of minimization (parameters for that day)
The code works correctly for the first iteration but then keeps repeating the calibration for the first row in the dataframe (curves). Further, it does not store the parameters in the next row of the calibration dataframe. I view the first issue as relating to the iterrows while the second is an issue of the loop counter.
Any thoughts on what is going wrong? I have a Matlab background and find the pandas setup to be very frustrating.
For reference I have consulted the links below to no avail.
https://www.python.org/dev/peps/pep-0212/
http://nipunbatra.github.io/2015/06/pandas-iteration/
Per Jason's comment below I have updated the code to:
def do_calibration_model1():
    global i
    for index, row in curves.iterrows():
        for i in range(0, len(curves)):
            day = np.array(row)  # the subsequent error_fxn uses this daily curve
            param = scipy.brute(error_fxn, bounds...., etc.)
            opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
            calibration.loc[i] = np.array(opt)  # store result of minimization (parameters for that day)
            i += 1
The revised code now places the appropriate parameters in each row of the calibration dataframe based on the loop counter.
However, it still does not move to the second (or subsequent) rows of the curves dataframe with the pandas iterrows function.
Each time calibration is called, you set i = 0. As a result, when you call calibration.loc[i] = np.array(opt), what is being written is item 0 of calibration. The variable i is never actually anything except 0 in this function.
In function do_calibration_model1(), you declare global i and then augment i by one at the end of the function call. I'm not sure what this i counter is meant to accomplish. Perhaps you think that the i in do_calibration_model1() is updating the value of the i variable in the calibration() function, but this is not the case. Given that there is no global i statement in calibration(), the i in this function is a local variable.
Regarding iterrows, I don't think you need the embedded for loop that cycles through the length of curves. Here's a quick example to show you how iterrows works:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
new = pd.DataFrame({'sum': [],
                    'mean': []})
for index, row in df.iterrows():
    temp = {'sum': sum(row), 'mean': np.mean(row)}
    new = new.append(temp, ignore_index=True)
In the above, df looks like this:
A B C D
0 -2.197018 1.905543 0.773851 -0.006683
1 0.675442 0.818040 -0.561957 0.002737
2 -0.833482 0.248135 -1.159698 -0.302912
3 0.784216 -0.156225 -0.043505 -2.539486
4 -0.637248 0.034303 -1.405159 -1.590045
5 0.289257 -0.085030 -0.619899 -0.211158
6 0.804702 -0.838365 0.199911 0.210378
7 -0.031306 0.166793 -0.200867 1.343865
And the new dataframe populated through the iterrows loop looks like this:
mean sum
0 0.118923 0.475693
1 0.233566 0.934262
2 -0.511989 -2.047958
3 -0.488750 -1.954999
4 -0.899537 -3.598148
5 -0.156707 -0.626830
6 0.094157 0.376626
7 0.319621 1.278485
Note that using append here makes unnecessary the use of an i counter and simplifies the code.
Returning to your code, I suggest something like the following:
def do_calibration_model1():
    calibration = pd.DataFrame({'a': [],
                                'b': []})
    for index, row in curves.iterrows():
        day = np.array(row)
        param = scipy.brute(error_fxn, bounds...., etc.)
        opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
        temp = {'a': ..., 'b': ...}  # put opt values into dict
        calibration = calibration.append(temp, ignore_index=True)
    return calibration
In the step calibration = pd.DataFrame({'a': [], 'b': []}) you will need to set up the dataframe to ingest opt. Previously, you transformed opt into a numpy array, but you will need to arrange the values of opt so they fit your calibration dataframe, in the same way that I did for temp here: temp = {'sum': sum(row), 'mean': np.mean(row)}.
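Side note: DataFrame.append has been deprecated in newer pandas versions (and removed in pandas 2.0). The same row-collection idea can be written by accumulating dicts in a list and building the frame once at the end; a sketch, with the same hypothetical column names 'a' and 'b' and the scipy details elided as above:

def do_calibration_model1():
    rows = []
    for index, row in curves.iterrows():
        day = np.array(row)
        param = scipy.brute(error_fxn, bounds...., etc.)
        opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
        rows.append({'a': ..., 'b': ...})  # put opt values into the dict
    return pd.DataFrame(rows)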
