Selecting, slicing, and aggregating temporal data with Pandas - python

I'm trying to handle temporal data with pandas and I'm having a hard time...
Here is a sample of the DataFrame:
index ip app dev os channel click_time
0 29540 3 1 42 489 2017-11-08 03:57:46
1 26777 11 1 25 319 2017-11-09 11:02:14
2 140926 12 1 13 140 2017-11-07 04:36:14
3 69375 2 1 19 377 2017-11-09 13:17:20
4 119166 9 2 15 445 2017-11-07 12:11:37
This is a click prediction problem, so I want to create a time window aggregating the past behaviour of a specific ip (for a given ip, how many clicks in the last 4 hours, or the last 8 hours?).
I tried creating one new column which was simply :
df['minus_8'] = df['click_time'] - timedelta(hours=8)
I wanted to use this so that for each row I have a specific 8-hour window over which to aggregate my data.
I have also tried resampling, with little success; my understanding of that function isn't great, let's say.
Can anyone help?

If you just need to select a particular 8 hours, you can do as follows:
import datetime

start_time = datetime.datetime(2017, 11, 9, 11, 2, 14)
df[(df['click_time'] >= start_time)
   & (df['click_time'] <= start_time + datetime.timedelta(hours=8))]
Otherwise I really think you need to look more at resample. Mind you, if you want resample to divide your data into 8-hour chunks that are always consistent (e.g. 00:00-08:00, 08:00-16:00, 16:00-00:00), then you will probably want to crop your data to a certain start time.
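For the resample route, a minimal sketch (assuming click_time is already a datetime dtype) that counts clicks in fixed 8-hour bins could look like this:
clicks_per_8h_bin = (
    df.set_index('click_time')
      .resample('8H')
      .size()
)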

Using parts of the solution given by Martin, I was able to create this function that outputs what I wanted:
def window_filter_clicks(df, h):
    df['nb_clicks_{}h'.format(h)] = 0
    ip_array = df.ip.unique()
    for ip in ip_array:
        df_ip = df[df['ip'] == ip]
        for row, i in zip(df_ip['click_time'], df_ip['click_time'].index):
            # clicks by this ip in the h hours up to (and including) the current click
            df_window = df_ip[(df_ip['click_time'] >= row - timedelta(hours=h)) & (df_ip['click_time'] <= row)]
            nb_clicks = len(df_window)
            df.loc[i, 'nb_clicks_{}h'.format(h)] = nb_clicks
    return df
h allows me to select the size of the window on which to iterate.
Now this works fine, but it is very slow and I am working with a lot of rows.
Does anyone know how to improve the speed of such a function? (Or is there anything similar built in?)
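For reference, one vectorized direction that seems worth trying (a sketch, not benchmarked against this data) is a time-based rolling count per ip, which avoids the Python-level double loop:
import pandas as pd

def window_count(df, h):
    # Sort by ip then time so the groupby/rolling output lines up positionally.
    df = df.sort_values(['ip', 'click_time']).copy()
    rolled = (
        df.set_index('click_time')
          .groupby('ip')['ip']            # any non-null column works for counting
          .rolling('{}h'.format(h))       # time-based window ending at each click
          .count()
    )
    df['nb_clicks_{}h'.format(h)] = rolled.values
    return df
The positional alignment at the end relies on groupby returning the groups in the same sorted ip order as the pre-sort, so it is worth checking the output against the loop version on a small slice.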


Need help optimizing this code for faster results

To give an overview of the data: multiple rows share the same id and, furthermore, share the values of several columns. Some functions produce the same result for all rows with the same id, so I group by this id and perform those functions once per group; I then loop through each row within each group to perform the functions that yield a different result for each row, even with the same id.
Here is some sample data:
id map_sw_lon map_sw_lat map_ne_lon map_ne_lat exact_lon exact_lat
1 10 15 11 16 20 30
1 10 15 11 16 34 50
2 20 16 21 17 44 33
2 20 16 21 17 50 60
Here is my code:
for id, group in df.groupby("id", sort=False):
    viewport = box(group["map_sw_lon"].iloc[0],
                   group["map_sw_lat"].iloc[0], group["map_ne_lon"].iloc[0],
                   group["map_ne_lat"].iloc[0])
    center_of_viewport = viewport.centroid
    center_hex = h3.geo_to_h3(center_of_viewport.y, center_of_viewport.x, 8)
    # everything above here can be done only once per group.
    # everything below needs to be done per row per group.
    for index, row in group.iterrows():
        current_hex = h3.geo_to_h3(row["exact_lat"], row["exact_lon"], 8)
        df.at[index, 'hex_id'] = current_hex
        df.at[index, 'hit_count'] = 1
        df.at[index, 'center_hex'] = center_hex
        distance_to_center = h3.h3_distance(current_hex, center_hex)
        df.at[index, 'hex_dist_to_center'] = distance_to_center
This code runs in around 5 minutes for 1 million rows of data. The problem is I'm dealing with data much larger than that and need something that works faster. I know it isn't recommended to use for loops in Pandas, but I'm not sure how to solve this problem without them. Any help would be appreciated.
Edit: Still struggling with this... any help would be appreciated!
You need to do some profiling to see how much time each part of the code takes to run. I suspect the most time-consuming parts are the geo_to_h3 and h3_distance calls. If so, other possible improvements to the data frame operations (e.g., using DataFrame.apply and GroupBy.transform) would not help a lot.
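As a rough check (a sketch that reuses the df and the h3 calls from the question), timing the two h3 calls in isolation will tell you whether they dominate:
import time

t0 = time.perf_counter()
hexes = [h3.geo_to_h3(lat, lon, 8)
         for lat, lon in zip(df['exact_lat'], df['exact_lon'])]
print('geo_to_h3 over all rows: {:.2f}s'.format(time.perf_counter() - t0))

t0 = time.perf_counter()
_ = [h3.h3_distance(h, hexes[0]) for h in hexes]
print('h3_distance over all rows: {:.2f}s'.format(time.perf_counter() - t0))
If those two calls account for most of the runtime, the remaining gain is mostly limited to replacing the row-wise df.at writes with bulk column assignments.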

Date manipulation periods

I have this problem for work. So I have this dataset as follows:
Client Date Transaction Num
A 7/20/2017 1
A 7/26/2017 1
A 7/31/2017 1
A 8/23/2017 2
A 8/31/2017 2
A 9/11/2017 2
A 9/19/2017 3
A 9/27/2017 3
A 10/4/2017 3
B 6/1/2017 1
B 6/29/2017 1
B 7/6/2017 2
B 8/27/2017 3
B 9/28/2017 4
B 10/16/2017 4
B 11/30/2017 5
What I need to do is generate the transaction num based on the date for each client as follows:
For the starting date (for client A, it is 7/20/17), I need to assign a starting transaction number = 1. Then for every 30 days from this starting date, I need to increment the transaction number by one. So 30 days from 7/20/17 is 8/19/17, and all dates falling within that range get transaction num = 1; if they exceed it, the transaction number increments by one for every further 30 days from the starting date. The pattern goes on: 30 days from 8/19/17 is 9/18/17, so dates within that range get transaction num = 2, dates after 9/18/17 get transaction num = 3, and so on.
I need to do this for a large Excel file. Any help would be appreciated. If it is easier in Python, please let me know as well.
Thanks,
Sammy
Interesting question, with possibly multiple solutions, but I came up with the one below:
So in C1 enter this formula:
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)+1
Confirm with CTRL+SHIFT+ENTER, and drag your formula down.
Note: Sorry for the difference in layout of dates, I have to deal with Dutch version of Excel :)
EDIT: Explanation
Step 1 - Get minimum date corresponding to Cell A1:
=MIN(IF($A$1:$A$17=A1,$B$1:$B$17))
Step 2 - Get the difference between cell B1 and the minimum and round it off (it doesn't matter whether it's one or zero decimals):
=ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)
Step 3 - Divide the difference by 30 days:
=ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30
Step 4 - Round this outcome down with the FLOOR function to the closest multiple you want to round to, in this case 1.
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)
Step 5 - Now we just need to add 1 to this outcome so the numbering doesn't start at 0:
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)+1
Confirm all through CTRL+SHIFT+ENTER
If the dates are in order, you could just do a VLOOKUP to get the first one and subtract, but @JvdV's answer is more general:
=INT((B2-VLOOKUP(A2,A:B,2,FALSE))/30)+1
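Since the question mentions Python as an option, here is a minimal pandas sketch of the same FLOOR logic (30-day buckets measured from each client's first date; the file name and column names are assumed from the sample):
import pandas as pd

df = pd.read_excel('transactions.xlsx')      # placeholder file name
df['Date'] = pd.to_datetime(df['Date'])

first_date = df.groupby('Client')['Date'].transform('min')
df['Transaction Num'] = (df['Date'] - first_date).dt.days // 30 + 1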

Create pandas timeseries from list of dictionaries with many keys

I have multiple timeseries that are outputs of various algorithms. These algorithms can have various parameters and they produce timeseries as a result:
timestamp1 = 1
value1 = 5
timestamp2 = 2
value2 = 8
timestamp3 = 3
value3 = 4
timestamp4 = 4
value4 = 12

resultsOfAlgorithms = [
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '200',
        'result-of-algorithm': [[timestamp1, value1], [timestamp2, value2]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp1, value1], [timestamp3, value3]]
    },
    {
        'algorithm': 'minmax',
        'param-a': '12',
        'param-b': '30',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    },
    {
        'algorithm': 'delta',
        'param-a': '12',
        'param-b': '50',
        'result-of-algorithm': [[timestamp2, value2], [timestamp4, value4]]
    }
]
I would like to be able to filter the timeseries by algorithm and parameters and plot the filtered timeseries to see how given parameters affect the output. To do that I need to know all the occurring values for a given parameter and then be able to select the timeseries with the desired parameters. E.g. I would like to plot all results of the minmax algorithm with param-b==30. There are 2 results that were produced with the minmax algorithm and param-b==30, so I would like to have a plot with 2 timeseries in it.
Is this possible with pandas or is this out of pandas functionality? How could this be implemented?
Edit:
Searching the internet some more, I think I am looking for a way to use hierarchical indexing. Also, the timeseries should stay separate: each result is an individual timeseries and should not be merged with the other results. I need to filter the results of the algorithms by the parameters used. The result of the filter should still be a list of timeseries.
Edit 2:
There are multiple sub-problems:
Find all existing values for each parameter (the user does not know all the values, since parameters can be auto-generated by the system)
The user selects some of the values for filtering
One way this could be provided by the user is a dictionary (but more user-friendly ideas are welcome):
filter = {
    'param-b': [30, 50],
    'algorithm': 'minmax'
}
Timeseries from resultsOfAlgorithms[1:3] (the 2nd and 3rd results) are given as the result of filtering, since these results were produced by the minmax algorithm and param-b was 30. Thus in this case:
[
    [[timestamp1, value1], [timestamp3, value3]],
    [[timestamp2, value2], [timestamp4, value4]]
]
The result of filtering will return multiple time series, which I want to plot and compare.
The user wants to try various filters to see how they affect the results.
I am doing all this in a Jupyter notebook, and I would like to allow the user to try various filters with the least hassle possible.
Timestamps between results are not necessarily shared. E.g. all the timeseries might occur between 1pm and 3pm and have roughly the same number of values, but neither the timestamps nor the number of values are identical.
So there are two options here: one is to clean up the dict first and then convert it easily to a dataframe; the second is to convert it to a dataframe and then clean up the column that will have nested lists in it. For the first solution, you can just restructure the dict like this:
import pandas as pd
from collections import defaultdict

data = defaultdict(list)
for roa in resultsOfAlgorithms:
    for i in range(len(roa['result-of-algorithm'])):
        data['algorithm'].append(roa['algorithm'])
        data['param-a'].append(roa['param-a'])
        data['param-b'].append(roa['param-b'])
        data['time'].append(roa['result-of-algorithm'][i][0])
        data['value'].append(roa['result-of-algorithm'][i][1])

df = pd.DataFrame(data)
In [31]: df
Out[31]:
algorithm param-a param-b time value
0 minmax 12 200 1 5
1 minmax 12 200 2 8
2 minmax 12 30 1 5
3 minmax 12 30 3 4
4 minmax 12 30 2 8
5 minmax 12 30 4 12
6 delta 12 50 2 8
7 delta 12 50 4 12
And from here you can do whatever analysis you need with it, whether it's plotting or making the time column the index or grouping and aggregating, and so on. You can compare this to making a dataframe first in this link:
Splitting a List inside a Pandas DataFrame
Where they basically did the same thing, with splitting a column of lists into multiple rows. I think fixing the dictionary will be easier though, depending on how representative your fairly simple example is of the real data.
Edit: If you wanted to turn this into a multi-index, you can add one more line:
df_mi = df.set_index(['algorithm', 'param-a', 'param-b'])
In [25]: df_mi
Out[25]:
time value
algorithm param-a param-b
minmax 12 200 1 5
200 2 8
30 1 5
30 3 4
30 2 8
30 4 12
delta 12 50 2 8
50 4 12
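To keep each result as a separate series when filtering (the two minmax results with param-b == '30' share identical parameter values), one option is to tag every row with the position of its source result; a minimal sketch along those lines, reusing the flattening idea above:
from collections import defaultdict
import pandas as pd
import matplotlib.pyplot as plt

data = defaultdict(list)
for run_id, roa in enumerate(resultsOfAlgorithms):
    for t, v in roa['result-of-algorithm']:
        data['run'].append(run_id)              # identifies the originating result
        data['algorithm'].append(roa['algorithm'])
        data['param-a'].append(roa['param-a'])
        data['param-b'].append(roa['param-b'])
        data['time'].append(t)
        data['value'].append(v)
df2 = pd.DataFrame(data)

# All minmax results with param-b == '30', plotted as separate lines.
subset = df2[(df2['algorithm'] == 'minmax') & (df2['param-b'] == '30')]
for run_id, grp in subset.groupby('run'):
    plt.plot(grp['time'], grp['value'], label='run {}'.format(run_id))
plt.legend()
plt.show()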

Comparison between one element and all the others of a DataFrame column

I have a list of tuples which I turned into a DataFrame with thousands of rows, like this:
frag mass prot_position
0 TFDEHNAPNSNSNK 1573.675712 2
1 EPGANAIGMVAFK 1303.659458 29
2 GTIK 417.258734 2
3 SPWPSMAR 930.438172 44
4 LPAK 427.279469 29
5 NEDSFVVWEQIINSLSALK 2191.116099 17
...
and I have the follow rule:
def are_dif(m1, m2, ppm=10):
    if abs((m1 - m2) / m1) < ppm * 0.000001:
        v = False
    else:
        v = True
    return v
So, I only want the "frag"s whose mass differs from all the other fragments' masses. How can I achieve that "selection"?
Then, I have a list named "pinfo" that contains:
d = {'id':id, 'seq':seq_code, "1HW_fit":hits_fit}
# one for each protein
# each dictionary has the position of the protein that it describes.
So, I want to add 1 to the "hits_fit" value in the dictionary corresponding to the protein.
If I'm understanding correctly (not sure if I am), you can accomplish quite a bit by just sorting. First though, let me adjust the data to have a mix of close and far values for mass:
Unnamed: 0 frag mass prot_position
0 0 TFDEHNAPNSNSNK 1573.675712 2
1 1 EPGANAIGMVAFK 1573.675700 29
2 2 GTIK 417.258734 2
3 3 SPWPSMAR 417.258700 44
4 4 LPAK 427.279469 29
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17
Then I think you can do something like the following to select the "good" ones. First, create 'pdiff' (percent difference) to see how close mass is to the nearest neighbors:
ppm = .00001
df = df.sort_values('mass')   # DataFrame.sort was removed from pandas; sort_values does the same here
df['pdiff'] = (df.mass - df.mass.shift()) / df.mass
Unnamed: 0 frag mass prot_position pdiff
3 3 SPWPSMAR 417.258700 44 NaN
2 2 GTIK 417.258734 2 8.148421e-08
4 4 LPAK 427.279469 29 2.345241e-02
1 1 EPGANAIGMVAFK 1573.675700 29 7.284831e-01
0 0 TFDEHNAPNSNSNK 1573.675712 2 7.625459e-09
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 2.817926e-01
The first and last data lines make this a little tricky so this next line backfills the first line and repeats the last line so that the following mask works correctly. This works for the example here, but might need to be tweaked for other cases (but only as far as the first and last lines of data are concerned).
df = df.iloc[list(range(len(df))) + [-1]].bfill()
df[(df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm)]
Results:
Unnamed: 0 frag mass prot_position pdiff
4 4 LPAK 427.279469 29 0.023452
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 0.281793
Sorry, I don't understand the second part of the question at all.
Edit to add: As mentioned in a comment on @AmiTavory's answer, I think the sorting approach and the groupby approach could possibly be combined for a simpler answer than this. I might try at a later time, but everyone should feel free to give it a shot themselves if interested.
Here's something that's slightly different from what you asked, but it is very simple, and I think gives a similar effect.
Using numpy.round, you can create a new column
import numpy as np
df['roundedMass'] = np.round(df.mass, 6)
Following that, you can do a groupby of the frags on the rounded mass, and use nunique to count the numbers in the group. Filter for the groups of size 1.
So, the number of frags per bin is:
df.frag.groupby(np.round(df.mass, 6)).nunique()
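To turn that count into the actual selection, one way (a sketch; it uses the row count per rounded-mass bin, which is equivalent to nunique here as long as frags are not duplicated) is:
# Keep only rows whose rounded-mass bin contains a single row.
bin_size = df.groupby(np.round(df.mass, 6))['frag'].transform('size')
unique_frags = df[bin_size == 1]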
Another solution is to create a duplicate of your list (if you need to preserve it for further processing later), iterate over it, and remove every element that does not satisfy your rule (m1 & m2).
You will get a new list with all the unique masses.
Just don't forget that if you do need to use the original list later, you will need to use deepcopy.

pandas - how to combine selected rows in a DataFrame

I've been reading a huge (5 GB) gzip file in the form:
User1 User2 W
0 11 12 1
1 12 11 2
2 13 14 1
3 14 13 2
which is basically a directed-graph representation of connections among users with a certain weight W. Since the file is so big, I tried to read it with networkx, building a directed graph and then converting it to undirected, but it took too much time. So I was thinking of doing the same thing by analysing a pandas dataframe. I would like to transform the previous dataframe into the form:
User1 User2 W
0 11 12 3
1 13 14 3
where links that exist in both directions have been merged into one, with W being the sum of the individual weights. Any help would be appreciated.
There is probably a more concise way, but this works. The main trick is just to normalize the data such that User1 is always the lower number ID. Then you can use groupby since 11,12 and 12,11 are now recognized as representing the same thing.
In [330]: df = pd.DataFrame({"User1":[11,12,13,14],"User2":[12,11,14,13],"W":[1,2,1,2]})
In [331]: df['U1'] = df[['User1','User2']].min(axis=1)
In [332]: df['U2'] = df[['User1','User2']].max(axis=1)
In [333]: df = df.drop(['User1','User2'],axis=1)
In [334]: df.groupby(['U1','U2'])['W'].sum()
Out[334]:
U1 U2
11 12 3
13 14 3
Name: W, dtype: int64
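If you want the result back in the exact User1/User2/W shape from the question, a small follow-up (sketch) is:
result = (
    df.groupby(['U1', 'U2'])['W'].sum()
      .reset_index()
      .rename(columns={'U1': 'User1', 'U2': 'User2'})
)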
For more concise code that avoids creating new variables, you could replace the middle 3 steps with:
In [400]: df.loc[df.User1 > df.User2, ['User1','User2']] = df.loc[df.User1 > df.User2, ['User2','User1']].values
Note that column switching can be trickier than you'd think, see here: What is correct syntax to swap column values for selected rows in a pandas data frame using just one line?
As far as making this code fast in general, it will depend on your data. I don't think the above code will matter as much as other things you might do. For example, your problem should be amenable to a chunking approach where you iterate over sections of the data, gradually shrinking it on each pass. In that case, the main thing you need to think about is sorting the data before chunking, so as to minimize how many passes you need to make. Doing it that way should allow you to do all the work in memory.
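A minimal sketch of that chunked approach (assuming a whitespace-separated gzip file with columns User1, User2, W; the file name and chunk size are placeholders):
import pandas as pd

partials = []
for chunk in pd.read_csv('edges.txt.gz', sep=r'\s+', chunksize=1_000_000):
    chunk['U1'] = chunk[['User1', 'User2']].min(axis=1)
    chunk['U2'] = chunk[['User1', 'User2']].max(axis=1)
    partials.append(chunk.groupby(['U1', 'U2'])['W'].sum())

# Combine the per-chunk sums; pairs that were split across chunks are summed again here.
result = pd.concat(partials).groupby(level=[0, 1]).sum().reset_index()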
