Need help optimizing this code for faster results - python

To give an overview of the data, there are multiple rows of data which have the same id, and furthermore, have multiple columns with the same values. Now there are some functions which will output the same result for rows with the same id. Therefore, I group by this id, perform the functions I need to perform on them, and then I begin looping through each row within each group, to perform the functions which will yield different results for each row, even with the same id.
Here is some sample data:
id map_sw_lon map_sw_lat map_ne_lon map_ne_lat exact_lon exact_lat
1 10 15 11 16 20 30
1 10 15 11 16 34 50
2 20 16 21 17 44 33
2 20 16 21 17 50 60
Here is my code:
import h3
from shapely.geometry import box

for id, group in df.groupby("id", sort=False):
    viewport = box(group["map_sw_lon"].iloc[0],
                   group["map_sw_lat"].iloc[0], group["map_ne_lon"].iloc[0],
                   group["map_ne_lat"].iloc[0])
    center_of_viewport = viewport.centroid
    center_hex = h3.geo_to_h3(center_of_viewport.y, center_of_viewport.x, 8)
    # everything above here can be done only once per group.
    # everything below needs to be done per row per group.
    for index, row in group.iterrows():
        current_hex = h3.geo_to_h3(row["exact_lat"], row["exact_lon"], 8)
        df.at[index, 'hex_id'] = current_hex
        df.at[index, 'hit_count'] = 1
        df.at[index, 'center_hex'] = center_hex
        distance_to_center = h3.h3_distance(current_hex, center_hex)
        df.at[index, 'hex_dist_to_center'] = distance_to_center
This code works in around 5 mins for 1 million rows of data. The problem is I’m dealing with data much larger than that, and need something that works faster. I know it isn’t recommended to use for loops in Pandas, but I’m not sure how to solve this problem without using them. Any help would be appreciated.
Edit: Still struggling with this..any help would be appreciated!

You need to do some profiling to see how much time each part of the code takes to run. I conjecture the most time consuming parts are the geo_to_h3 and h3_distance calls. If so, other possible improvements on data frame operations (e.g., using DataFrame.apply and GroupBy.transform) would not help a lot.
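As a rough illustration of that kind of profiling, here is a minimal sketch using the standard library's cProfile. The functions slow_step, fast_step and process are hypothetical stand-ins (slow_step plays the role of the h3 calls); swap in the real loop to see where the time actually goes.

```python
import cProfile
import io
import pstats


def slow_step(n):
    # Hypothetical stand-in for an expensive call such as geo_to_h3 / h3_distance.
    return sum(i * i for i in range(n))


def fast_step(n):
    # Hypothetical stand-in for cheap per-row bookkeeping.
    return n + 1


def process(rows):
    # Mimics the per-row loop: one expensive call and one cheap call per row.
    return [slow_step(2000) + fast_step(r) for r in rows]


profiler = cProfile.Profile()
profiler.enable()
process(range(200))
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

If the stand-in for the h3 calls dominates the cumulative time in the report, restructuring the pandas side alone is unlikely to change much.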


Speed up "is in" for DataFrame

Given a DataFrame, I would like to add a row if it's not in the DF already.
if state not in df.index:
    # append new state DataFrame
    df = df.append(pd.Series([0] * len(self.actions), index=df.columns, name=state))
state is a string like this [0 1 12 36 67 0 14 5 6 4] (a list of 10 entries, handed over as a string).
For the first few rows added, this takes about 0.0045 seconds on average. Having 10.000+ rows already makes it significantly slower, about 0.0623 seconds, and with 100.000+ rows it becomes something like 0.1364 seconds...
Is there any way to speed up the check if the index already exists? I am new to python, but maybe there is a way to keep the index in the RAM and check there for better performance? Maybe hashing the index would speed it up, or maybe a combination of those?
Any hint is highly appreciated!
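
One thing worth checking before optimizing the lookup: membership tests on a pandas index are already hash-based, whereas df.append copies the whole frame on every call (and was removed in pandas 2.0), so the growing cost may well come from the append rather than the check. A minimal sketch, assuming the rows can be collected in a plain dict and converted to a DataFrame once at the end; the names actions, rows and ensure_state are hypothetical.

```python
import pandas as pd

actions = ["left", "right", "up"]  # hypothetical action names
rows = {}  # maps state string -> list of values; dict lookups are O(1) on average


def ensure_state(state):
    """Add a zero-filled row for `state` unless it is already present."""
    if state not in rows:
        rows[state] = [0] * len(actions)


for state in ["[0 1 12]", "[3 4 5]", "[0 1 12]"]:
    ensure_state(state)

# Build the DataFrame once, after all states have been collected.
df = pd.DataFrame.from_dict(rows, orient="index", columns=actions)
print(df)
```

Whether this fits depends on how often the table has to be read as a DataFrame between insertions; if that happens on every step, the conversion cost would need to be measured as well.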

pandas data frame, apply t-test on rows simultaneously grouping by column names (have duplicates!)

I have a data frame with particular readouts as indexes (different types of measurements for a given sample), each column is a sample for which these readouts were taken. I also have a treatment group assigned as the column name for each sample. You can see the example below.
What I need to do: for a given readout (row) group samples by treatment (column name) and perform a t-test (Welch's t-test) on each group (each treatment). T-test must be done as a comparison with one fixed treatment (control treatment). I do not care about tracking out the sample ids (it was required, now I dropped them on purpose), I'm not going to do paired tests.
For example here, for readout1 I need to compare treatment1 vs treatment3, treatment2 vs treatment3 (it's ok if I'll also get treatment3 vs treatment3).
Example of data frame:
frame = pd.DataFrame(np.arange(27).reshape((3, 9)),
                     index=['readout1', 'readout2', 'readout3'],
                     columns=['treatment1', 'treatment1', 'treatment1',
                              'treatment2', 'treatment2', 'treatment2',
                              'treatment3', 'treatment3', 'treatment3'])
treatment1 treatment1 ... treatment3 treatment3
readout1 0 1 ... 7 8
readout2 9 10 ... 16 17
readout3 18 19 ... 25 26
[3 rows x 9 columns]
I've been fighting with this for several days now. I tried unstacking/stacking the data, transposing the data frame, then grouping by index, removing NaN and applying a lambda. I tried other strategies too, but none worked. I would appreciate any help.
thank you!
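
A possible approach, sketched under the assumption that treatment3 is the control and that only the Welch t statistic is needed: transpose the frame so the duplicated treatment labels become the index, group on that index, and combine the per-group means and squared standard errors. The variable names below are illustrative. For p-values, scipy.stats.ttest_ind(a, b, equal_var=False) performs the same Welch test on two arrays.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(27).reshape((3, 9)),
    index=["readout1", "readout2", "readout3"],
    columns=["treatment1"] * 3 + ["treatment2"] * 3 + ["treatment3"] * 3,
)
control = "treatment3"  # assumed control treatment

# After transposing, the duplicated treatment labels form the index,
# so grouping on index level 0 collects the samples of each treatment.
by_treatment = frame.T.groupby(level=0)
mean = by_treatment.mean().T  # readouts x treatments
se2 = (by_treatment.var() / by_treatment.count()).T  # s^2 / n per treatment

# Welch's t statistic of every treatment against the control, per readout.
welch_t = mean.sub(mean[control], axis=0) / np.sqrt(se2.add(se2[control], axis=0))
print(welch_t)
```

The control column compared with itself comes out as 0, which matches the "treatment3 vs treatment3 is ok" remark in the question.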

Selecting, slicing, and aggregating temporal data with Pandas

I'm trying to handle temporal data with pandas and I'm having a hard time...
Here is a sample of the DataFrame :
index ip app dev os channel click_time
0 29540 3 1 42 489 2017-11-08 03:57:46
1 26777 11 1 25 319 2017-11-09 11:02:14
2 140926 12 1 13 140 2017-11-07 04:36:14
3 69375 2 1 19 377 2017-11-09 13:17:20
4 119166 9 2 15 445 2017-11-07 12:11:37
This is a click prediction problem, so I want to create a time window aggregating the past behaviour of a specific ip ( for a given ip, how many clicks in the last 4 hours, 8 hours ? ).
I tried creating one new column which was simply :
I wanted to use this so that for each row I have a specific 8 hours window on which to aggregate my data.
I have also tried resampling with little success, my understanding of the function isn't optimal let's say.
Can anyone help ?
If you just need to select a particular 8 hours, you can do as follows:
import datetime

start_time = datetime.datetime(2017, 11, 9, 11, 2, 14)
df[(df['click_time'] >= start_time)
   & (df['click_time'] <= start_time + datetime.timedelta(0, 60*60*8))]
Otherwise I really think you need to look more at resample. Mind you, if you want resample to have your data divided into 8 hour chunks that are always consistent (e.g. from 00:00-08:00, 08:00-16:00, 16:00-00:00), then you will probably want to crop your data to a certain start time.
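For the fixed-chunk case, a minimal sketch using pd.Grouper, assuming click_time is already a datetime column; the small frame below is made-up sample data. With an 8-hour frequency the bins are aligned to midnight of the first day by default, so they come out as 00:00-08:00, 08:00-16:00 and 16:00-00:00.

```python
import pandas as pd

# Made-up sample data in the same shape as the question's frame.
df = pd.DataFrame({
    "ip": [1, 1, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-08 03:57:46", "2017-11-08 05:10:00",
        "2017-11-08 11:02:14", "2017-11-08 13:17:20",
    ]),
})

# Count clicks per ip within consistent 8-hour bins.
clicks_per_bin = df.groupby(["ip", pd.Grouper(key="click_time", freq="8h")]).size()
print(clicks_per_bin)
```

Note that this counts clicks per fixed bin, which is not the same thing as a window that trails each individual row.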
Using parts of the solution given by Martin, I was able to create this function that outputs what I wanted :
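from datetime import timedelta

def window_filter_clicks(df, h):
    col = 'nb_clicks_{}h'.format(h)
    df[col] = 0  # make sure the output column exists before filling it
    ip_array = df.ip.unique()
    for ip in ip_array:
        df_ip = df[df.ip == ip]  # all clicks of the current ip
        for row, i in zip(df_ip['click_time'], df_ip['click_time'].index):
            df_window = df_ip[(df_ip['click_time'] >= row - timedelta(hours=h)) & (df_ip['click_time'] <= row)]
            nb_clicks = len(df_window)
            df.loc[i, col] = nb_clicks  # i is an index label, so use .loc
    return df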
h allows me to select the size of the window on which to iterate.
Now this works fine, but it is very slow and I am working with a lot of rows.
Does anyone know how to improve the speed of such a function ? ( Or if there is anything similar built-in ? )

pandas - how to combine selected rows in a DataFrame

I've been reading a huge (5 GB) gzip file in the form:
User1 User2 W
0 11 12 1
1 12 11 2
2 13 14 1
3 14 13 2
which is basically a directed graph representation of connections among users with a certain weight W. Since the file is so big, I tried to read it through networkx, building a directed graph and then converting it to an undirected one, but that took too much time. So I was thinking of doing the same thing with a pandas dataframe. I would like to transform the previous dataframe into the form:
User1 User2 W
0 11 12 3
1 13 14 3
where the common links in the two directions have been merged into one having as W the sum of the single weights. Any help would be appreciated.
There is probably a more concise way, but this works. The main trick is just to normalize the data such that User1 is always the lower number ID. Then you can use groupby since 11,12 and 12,11 are now recognized as representing the same thing.
In [330]: df = pd.DataFrame({"User1":[11,12,13,14],"User2":[12,11,14,13],"W":[1,2,1,2]})
In [331]: df['U1'] = df[['User1','User2']].min(axis=1)
In [332]: df['U2'] = df[['User1','User2']].max(axis=1)
In [333]: df = df.drop(['User1','User2'],axis=1)
In [334]: df.groupby(['U1','U2'])['W'].sum()
U1 U2
11 12 3
13 14 3
Name: W, dtype: int64
For more concise code that avoids creating new variables, you could replace the middle 3 steps with:
In [400]: df.loc[df.User1>df.User2,['User1','User2']] = df.loc[df.User1>df.User2,['User2','User1']].values
Note that column switching can be trickier than you'd think, see here: What is correct syntax to swap column values for selected rows in a pandas data frame using just one line?
As far as making this code fast in general, it will depend on your data. I don't think the above code will matter as much as other things you might do. For example, your problem should be amenable to a chunking approach where you iterate over sections of the data, gradually shrinking it on each pass. In that case, the main thing you need to think about is sorting the data before chunking, so as to minimize how many passes you need to make. Doing it that way should allow you to do all the work in memory.
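A minimal sketch of the chunking idea, assuming the file is whitespace-separated with a header row; an in-memory io.StringIO stands in for the real gzip file, and the chunk size of 3 is only there to make the example cross a chunk boundary. Each chunk is normalized and reduced on its own, and the small partial results are summed once more at the end.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the real file; pd.read_csv can also read a .gz path directly.
raw = io.StringIO("User1 User2 W\n11 12 1\n12 11 2\n13 14 1\n14 13 2\n")

partials = []
for chunk in pd.read_csv(raw, sep=" ", chunksize=3):
    # Normalize each edge so that User1 always holds the smaller id.
    low = np.minimum(chunk["User1"], chunk["User2"])
    high = np.maximum(chunk["User1"], chunk["User2"])
    chunk = chunk.assign(User1=low, User2=high)
    # Reduce the chunk right away so only a small summary stays in memory.
    partials.append(chunk.groupby(["User1", "User2"], as_index=False)["W"].sum())

# Edges split across chunks are combined by summing the partial results again.
merged = pd.concat(partials).groupby(["User1", "User2"], as_index=False)["W"].sum()
print(merged)
```

Here the 13-14 edge is split across the two chunks and only gets its full weight in the final groupby.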

Indexing by row counts in a pandas dataframe

I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.
My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.
In other words, the first row in my result should be the average across all items of the first row in the dataframe for each item, the second result row should be the average across all items of the second observation for that item, and so on.
Stated another way, if we were to take all the date-ordered rows for each item and index them from i=1,2,...,n, I need the average across all items of the values of rows 1,2,...,n. That is, I want the average of the first observation for each item across all items, the average of the second observation across all items, and so on.
How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like I describe in the previous paragraph), or is my only recourse to iterate across the rows for each item and keep a running average? This would work, but is not leveraging the power of pandas whatsoever.
Adding some example data:
item_id date X DUMMY_ROWS
20 2010-11-01 16759 0
2010-12-01 16961 1
2011-01-01 17126 2
2011-02-01 17255 3
2011-03-01 17400 4
2011-04-01 17551 5
21 2007-09-01 4 6
2007-10-01 5 7
2007-11-01 6 8
2007-12-01 10 9
22 2006-05-01 10 10
2006-07-01 13 11
23 2006-05-01 2 12
24 2008-01-01 2 13
2008-02-01 9 14
2008-03-01 18 15
2008-04-01 19 16
2008-05-01 23 17
2008-06-01 32 18
I've added a dummy rows column that does not exist in the data for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0,6,10,12, and 13 (the first observation for each item), then the mean of rows 1,7,11,and 15 (the second observation for each item, excluding item 23 because it has only one observation), and so on.
One option is to reset the index then group by id.
df_new = df.reset_index()
This leaves your original df intact and gets you the mean across all months for each item id.
For your updated question (great example, by the way), I think the approach would be to add an "item_sequence_id". I've done this in the past with similar data.
df.sort_values(['item_id', 'date'], inplace=True)
def sequence_id(item):
    item['seq_id'] = range(len(item))
    return item
df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
The idea here is that the seq_id identifies the position of the data point in time within each item_id. Because the seq_id values repeat across items, you can then group on seq_id to aggregate across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc. actions taken by users regardless of their absolute time and user id.
Hopefully this is more of what you want.
Here's an alternative method for this I finally figured out (which assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by @cwharland:
def sequence_id(item):
    item['seq'] = range(0,len(item),1)
    return item
shrinkWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)
Testing this on a 10,000 row subset of the data frame:
%timeit -n10 dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop
It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe numbered from 0 to n (the number of rows in the frame). We can leverage this like so:
dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, BUT it's almost 4 times faster (I can't compare the methods for the full 13 million row dataframe, as the first method was resulting in memory errors):
%timeit -n10 dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop
Calculating the average as in my original question is only slightly different. The original method was:
But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:
The result is the same.
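For reference, a further option that avoids apply altogether is GroupBy.cumcount, which numbers the rows within each group starting at 0. A minimal sketch on a made-up subset of the example data, assuming the frame is indexed by item_id and date and sorted by date within each item; the column name seq is illustrative.

```python
import pandas as pd

# Made-up subset of the example data from the question.
df = pd.DataFrame({
    "item_id": [20, 20, 22, 22, 23],
    "date": pd.to_datetime([
        "2010-11-01", "2010-12-01", "2006-05-01", "2006-07-01", "2006-05-01",
    ]),
    "X": [16759, 16961, 10, 13, 2],
}).set_index(["item_id", "date"]).sort_index()

# Number the observations within each item: 0 for the first, 1 for the second, ...
df["seq"] = df.groupby(level="item_id").cumcount()

# Average X over all items for each observation number.
mean_by_observation = df.groupby("seq")["X"].mean()
print(mean_by_observation)
```

Items with fewer observations simply drop out of the later averages, as with item 23 in the question's example.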
