I need to run a function on a large groupby query that checks whether two subGroups have any overlapping dates. Below is an example of a single group tmp:
   ID  num       start        stop  subGroup
0  21   10  2006-10-10  2008-10-03         1
1  21   46  2006-10-10  2100-01-01         2
2  21    5  1997-11-25  1998-09-29         1
3  21   42  1998-09-29  2100-01-01         2
4  21    3  1997-01-07  1997-11-25         1
5  21    6  2006-10-10  2008-10-03         1
6  21   47  1998-09-29  2006-10-10         2
7  21    4  1997-01-07  1998-09-29         1
The function I wrote to do this looks like this:
def hasOverlap(tmp):
    d2_starts = tmp[tmp['subGroup']==2]['start']
    d2_stops = tmp[tmp['subGroup']==2]['stop']
    return tmp[tmp['subGroup']==1].apply(lambda row_d1:
        (
            # Check for partly nested D2 in D1
            ((d2_starts >= row_d1['start']) &
             (d2_starts < row_d1['stop'])) |
            ((d2_stops >= row_d1['start']) &
             (d2_stops < row_d1['stop'])) |
            # Check for fully nested D1 in D2
            ((d2_stops >= row_d1['stop']) &
             (d2_starts <= row_d1['start']))
        ).any(),
        axis=1
    ).any()
The problem is that this code has many redundancies, and when I run the query:
groups.agg(hasOverlap)
It takes an unreasonably long time to terminate.
Are there any performance fixes (such as using built-in functions or set_index) that I could do to speed this up?
Are you just looking to return "True" or "False" based on the presence of an overlap? If so, I'd just get a list of the dates for each subgroup, and then use pandas' isin method to check if they overlap.
You could try something like this:
# split subgroups into separate DataFrames
group1 = groups[groups.subGroup == 1]
group2 = groups[groups.subGroup == 2]

# check if any of the start dates from group 2 are in group 1
if len(group1[group1.start.isin(list(group2.start))]) > 0:
    print("Group1 overlaps group2")

# check if any of the start dates from group 1 are in group 2
if len(group2[group2.start.isin(list(group1.start))]) > 0:
    print("Group2 overlaps group1")
I am trying to create a variable (Days) that displays how many days a bulb was functional, based on the score variables (Score_Day_0 to Score_Day_40).
The dataset I am using is like the one below, where the scores at the different days mean: 1 --> working very well and 10 --> stopped working.
What I want is to understand / know how to create the variable Days, which should display the number of days the bulb was working, i.e. for sample 2 the score at day 10 is 8 and at day 20 is 10 (stopped working), and therefore the number of days the bulb was working is 20.
Any suggestion?
Thank you so much for your help, hope you have a terrific day!!
sample    Score_Day_0  Score_Day_10  Score_Day_20  Score_Day_30  Score_Day_40  Days
sample 1  1            3             5             8             10            40
sample 2  3            8             10            10            10            20
I've tried to solve it by myself with a conditional loop, but the number of observations in Days is much higher than the number of observations in the original df.
Here is the code I used:
cols = df[['Score_Day_0', 'Score_Day_10', 'Score_Day_20', 'Score_Day_30', 'Score_Day_40']]
Days = []
for j in cols['Score_Day_0']:
    if j == 10:
        Days.append(0)
for k in cols['Score_Day_10']:
    if k == 10:
        Days.append('10')
for l in cols['Score_Day_20']:
    if l == 10:
        Days.append('20')
for n in cols['Score_Day_30']:
    if n == 10:
        Days.append('30')
for m in cols['Score_Day_40']:
    if m == 10:
        Days.append('40')
You're looking for the first column label (left to right) at which the value is maximal in each row.
You can apply a given function to each row using pandas.DataFrame.apply with axis=1:
df.apply(function, axis=1)
The passed function receives the row as a Series object. To find the first occurrence of the maximum, we select the entries equal to the row's maximum and take the first value of the resulting index, which is exactly what we are after: the label of the column where the row first reaches its maximal value.
lambda x: x[x == x.max()].index[0]
Example:
df = pd.DataFrame(dict(d0=[1,1,1],d10=[1,5,10],d20=[5,10,10],d30=[8,10,10]))
# d0 d10 d20 d30
# 0 1 1 5 8
# 1 1 5 10 10
# 2 1 10 10 10
df['days'] = df.apply(lambda x: x[x == x.max()].index[0], axis=1)
df
# d0 d10 d20 d30 days
# 0 1 1 5 8 d30
# 1 1 5 10 10 d20
# 2 1 10 10 10 d10
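Since you ultimately want a number of days rather than the column label, you could strip the non-numeric prefix afterwards. A small sketch, building on the toy df above (with your real data the same pattern would pull the digits out of 'Score_Day_30' and friends):

# extract the digits from the label, e.g. 'd30' -> 30
df['days'] = df['days'].str.extract(r'(\d+)', expand=False).astype(int)
# 0    30
# 1    20
# 2    10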
I have a parking lot with cars of different models (nr), and the cars are so closely packed that in order for one to get out, some others might need to be moved first. A little like a 15-puzzle, except that I can take one or more cars out of the parking lot. Ordered_car_List contains the cars that will be picked up today, and they need to be taken out of the parking lot while moving as few non-ordered cars as possible. There are more columns to this DataFrame, but this is the part I can't figure out.
I have a program that works well for small sets of data, but it seems that this is not the way of the PANDAS :-)
I have this:
cars = pd.DataFrame({'x': [1,1,1,1,1,2,2,2,2],
                     'y': [1,2,3,4,5,1,2,3,4],
                     'order_number': [6,6,7,6,7,9,9,10,12]})
cars['order_number_no_dublicates_down'] = None
Ordered_car_List = [6,9,9,10,28]

i = 0
while i < len(cars):
    temp_val = cars.at[i, 'order_number']
    if temp_val in Ordered_car_List:
        cars.at[i, 'order_number_no_dublicates_down'] = temp_val
        Ordered_car_List.remove(temp_val)
    i += 1
If I use cars.apply(lambda..., how can I change the Ordered_car_List in each iteration?
Is there another approach that I can take?
I found this page, and it made me want to be faster. The Lambda approach is in the middle when it comes to speed, but it still is so much faster than what I am doing now.
https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
Updating cars
We can vectorize this based on two counters:
cumcount() to cumulatively count each unique value in cars['order_number']
collections.Counter() to count each unique value in Ordered_car_List
from collections import Counter

cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
# order_number cumcount maxcount
# 0 6 1 1
# 1 6 2 1
# 2 7 1 0
# 3 6 3 1
# 4 7 2 0
# 5 9 1 2
# 6 9 2 2
# 7 10 1 1
# 8 12 1 0
So then we only want to keep cars['order_number'] where cumcount <= maxcount:
either use DataFrame.loc[]
cars.loc[cumcount <= maxcount, 'nodup'] = cars['order_number']
or Series.where()
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
or Series.mask() with the condition inverted
cars['nodup'] = cars['order_number'].mask(cumcount > maxcount)
Updating Ordered_car_List
The final Ordered_car_List is a Counter() difference:
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
# [6, 9, 9, 10]
Ordered_car_List = list(Counter(Ordered_car_List) - Counter(Used_car_List))
# [28]
Final output
cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
# x y order_number nodup
# 0 1 1 6 6.0
# 1 1 2 6 NaN
# 2 1 3 7 NaN
# 3 1 4 6 NaN
# 4 1 5 7 NaN
# 5 2 1 9 9.0
# 6 2 2 9 9.0
# 7 2 3 10 10.0
# 8 2 4 12 NaN
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
Ordered_car_List = list(Counter(Ordered_car_List) - Counter(Used_car_List))
# [28]
Timings
Note that your loop is still very fast on small data, but the vectorized Counter approach scales much better.
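If you want to reproduce a comparison yourself, a rough timing harness along these lines could be used (loop_version and counter_version are just illustrative names wrapping your loop and the Counter approach above):

import timeit
from collections import Counter
import pandas as pd

def loop_version(cars, ordered):
    cars, ordered = cars.copy(), list(ordered)
    cars['nodup'] = None
    i = 0
    while i < len(cars):
        temp_val = cars.at[i, 'order_number']
        if temp_val in ordered:
            cars.at[i, 'nodup'] = temp_val
            ordered.remove(temp_val)
        i += 1
    return cars

def counter_version(cars, ordered):
    cars = cars.copy()
    cumcount = cars.groupby('order_number').cumcount().add(1)
    maxcount = cars['order_number'].map(Counter(ordered))
    cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
    return cars

cars = pd.DataFrame({'x': [1,1,1,1,1,2,2,2,2],
                     'y': [1,2,3,4,5,1,2,3,4],
                     'order_number': [6,6,7,6,7,9,9,10,12]})
Ordered_car_List = [6,9,9,10,28]

print(timeit.timeit(lambda: loop_version(cars, Ordered_car_List), number=100))
print(timeit.timeit(lambda: counter_version(cars, Ordered_car_List), number=100))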
I have a very big DataFrame (20,000,000+ rows) that contains a column called 'sequence', amongst others.
The 'sequence' column is calculated from a time series by applying a few conditional statements. The value 2 flags the start of a sequence, the value 3 flags the end of a sequence, the value 1 flags a datapoint within a sequence, and the value 4 flags datapoints that need to be ignored. (Note: the flag values don't necessarily have to be 1, 2, 3, 4.)
What I want to achieve is a continuous ID value (written to a separate column, see 'desired_Id_Output' in the example below) that labels the slices of sequences from 2 to 3 in a unique fashion (the length of a sequence is variable, ranging from 2 [start + end only] to 5000+ datapoints), so that I can do further groupby calculations on the individual sequences.
index sequence desired_Id_Output
0 2 1
1 1 1
2 1 1
3 1 1
4 1 1
5 3 1
6 2 2
7 1 2
8 1 2
9 3 2
10 4 NaN
11 4 NaN
12 2 3
13 3 3
Thanks in advance and BR!
I can't think of anything better than the "dumb" solution of looping through the entire thing, something like this:
import numpy as np

counter = 0
tmp = np.empty_like(df['sequence'].values, dtype=float)
for i in range(len(tmp)):
    if df['sequence'][i] == 4:
        tmp[i] = np.nan
    else:
        if df['sequence'][i] == 2:
            counter += 1
        tmp[i] = counter

df['desired_Id_output'] = tmp
Of course this is going to be pretty slow with a 20M-row DataFrame. One way to improve it is just-in-time compilation with numba:

import numba

@numba.njit
def foo(sequence):
    # put in an appropriate modification of the above code block
    return tmp
and call this with argument df['sequence'].values.
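For completeness, here is one way the jitted function could be filled in, mirroring the plain loop above (the name label_sequences is just illustrative):

import numba
import numpy as np

@numba.njit
def label_sequences(sequence):
    # same logic as the loop above, but compiled: NaN for flag 4,
    # otherwise the running count of sequence starts (flag 2)
    tmp = np.empty(sequence.shape[0], dtype=np.float64)
    counter = 0
    for i in range(sequence.shape[0]):
        if sequence[i] == 4:
            tmp[i] = np.nan
        else:
            if sequence[i] == 2:
                counter += 1
            tmp[i] = counter
    return tmp

df['desired_Id_output'] = label_sequences(df['sequence'].values)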
Does it work to just count the sequence starts, and then set the ignore values (flag 4) to NaN afterwards? Like this:

import numpy as np

sequence_starts = df.sequence == 2
sequence_ignore = df.sequence == 4
sequence_id = sequence_starts.cumsum()
sequence_id[sequence_ignore] = np.nan
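A quick check of that idea against the sample data from the question, using .mask() for the NaN step so the integer Series is not modified in place:

import pandas as pd

df = pd.DataFrame({'sequence': [2, 1, 1, 1, 1, 3, 2, 1, 1, 3, 4, 4, 2, 3]})
sequence_starts = df.sequence == 2
sequence_ignore = df.sequence == 4
# cumulative count of the starts gives the running sequence id; flag-4 rows become NaN
df['desired_Id_Output'] = sequence_starts.cumsum().mask(sequence_ignore)
# -> 1,1,1,1,1,1,2,2,2,2,NaN,NaN,3,3  (matches the desired output above)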
I am trying to extend my current pattern to accommodate an extra condition: matching when a value is within +- a percentage of the last value, rather than requiring strict equality with the previous value.
data = np.array([[2,30],[2,900],[2,30],[2,30],[2,30],[2,1560],[2,30],
[2,300],[2,30],[2,450]])
df = pd.DataFrame(data)
df.columns = ['id','interval']
UPDATE 2 (id fix): Updated Data 2 with more data:
data2 = np.array([[2,30],[2,900],[2,30],[2,29],[2,31],[2,30],[2,29],[2,31],
                  [2,1560],[2,30],[2,300],[2,30],[2,450],
                  [3,40],[3,900],[3,40],[3,39],[3,41],[3,40],[3,39],[3,41],
                  [3,1560],[3,40],[3,300],[3,40],[3,450]])
df2 = pd.DataFrame(data2)
df2.columns = ['id','interval']
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
results in [30, 30, 30]
However, I really want to catch near-number conditions, say when a number is within +-10% of the previous number.
So looking at df2 I would like to pick up the series [30, 29, 31]:
for i, g in df2.groupby([(df2.interval != <??? +-10% magic ???>).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
UPDATE: Here is the end of line processing code where I store the gathered lists into a dictionary with the ID as the key
leak_intervals = {}
final_leak_intervals = {}
serials = []

for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
        serial = g.id.values[0]
        if serial not in serials:
            serials.append(serial)
        if serial not in leak_intervals:
            leak_intervals[serial] = g.interval.tolist()
        else:
            leak_intervals[serial] = leak_intervals[serial] + (g.interval.tolist())
UPDATE: group on the cumulative sum of a "changed by more than 10%" flag: pct_change() gives each row's fractional change from the previous row, .abs().gt(0.1) marks the rows that jump by more than 10%, and .cumsum() turns those marks into group ids, so consecutive rows within +-10% of their predecessor share a group. Then keep only the groups with at least 3 rows:

In [116]: df2.groupby(df2.interval.pct_change().abs().gt(0.1).cumsum()) \
              .filter(lambda x: len(x) >= 3)
Out[116]:
    id  interval
2    2        30
3    2        29
4    2        31
5    2        30
6    2        29
7    2        31
15   3        40
16   3        39
17   3        41
18   3        40
19   3        39
20   3        41
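If you then want to feed the matching runs into the per-serial dictionary from your UPDATE, one possible sketch (grouping by id as well so runs cannot bleed across serial numbers; setdefault simply accumulates the lists per id):

grouper = df2.interval.pct_change().abs().gt(0.1).cumsum()

leak_intervals = {}
for (serial, _), g in df2.groupby(['id', grouper]):
    if len(g) >= 3:
        leak_intervals.setdefault(serial, []).extend(g.interval.tolist())

# leak_intervals -> {2: [30, 29, 31, 30, 29, 31], 3: [40, 39, 41, 40, 39, 41]}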
I'm a beginner in Python data science. I'm working on clickstream data and want to find out the duration of each session. For that I find the start time and end time of the session; however, on subtraction I'm getting the wrong answer.
Here is the data
Sid Tstamp Itemid Category
0 1 2014-04-07T10:51:09.277Z 214536502 0
1 1 2014-04-07T10:54:09.868Z 214536500 0
2 1 2014-04-07T10:54:46.998Z 214536506 0
3 1 2014-04-07T10:57:00.306Z 214577561 0
4 2 2014-04-07T13:56:37.614Z 214662742 0
5 2 2014-04-07T13:57:19.373Z 214662742 0
6 2 2014-04-07T13:58:37.446Z 214825110 0
7 2 2014-04-07T13:59:50.710Z 214757390 0
8 2 2014-04-07T14:00:38.247Z 214757407 0
9 2 2014-04-07T14:02:36.889Z 214551617 0
10 3 2014-04-02T13:17:46.940Z 214716935 0
11 3 2014-04-02T13:26:02.515Z 214774687 0
12 3 2014-04-02T13:30:12.318Z 214832672 0
I referred to this question for the code: Timestamp Conversion
Here is my code:
import datetime
import time

k.columns = ['Sid', 'Tstamp', 'Itemid', 'Category']
k = k.loc[:, ('Sid', 'Tstamp')]

# Find max timestamp
idx = k.groupby(['Sid'])['Tstamp'].transform(max) == k['Tstamp']
ah = k[idx].reset_index()

# Find min timestamp
idy = k.groupby(['Sid'])['Tstamp'].transform(min) == k['Tstamp']
ai = k[idy].reset_index()

# grouping by Sid and applying count to retain the distinct Sid values
kgrp = k.groupby('Sid').count()

i = 0
for temp1, temp2 in zip(ah['Tstamp'], ai['Tstamp']):
    sv1 = datetime.datetime.strptime(temp1, "%Y-%m-%dT%H:%M:%S.%fZ")
    sv2 = datetime.datetime.strptime(temp2, "%Y-%m-%dT%H:%M:%S.%fZ")
    d1 = time.mktime(sv1.timetuple()) + (sv1.microsecond / 1000000.0)
    d2 = time.mktime(sv2.timetuple()) + (sv2.microsecond / 1000000.0)
    kgrp.loc[i, 'duration'] = d1 - d2
    i = i + 1
Here is the output.
kgrp
Out[5]:
Tstamp duration
Sid
1 4 359.275
2 6 745.378
3 3 1034.468
For session id 2, the duration should be close to 6 minutes; however, I'm getting almost 12 minutes. I reckon I'm making some silly mistake here.
Also, I'm grouping by Sid and applying count on it so as to get the Sid column, and I store each duration in a separate column. Is there an easier method through which I can store only the Sid (without the 'Tstamp' count column) and its duration values?
You are assigning the duration value to the wrong label.
In your test data sid starts from 1 but i starts from 0:
# for sid 1, i == 0
kgrp.loc[i,'duration']= d1-d2
i=i+1
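A minimal fix along those lines is to label by the actual session id instead of the counter, e.g. by zipping in kgrp's index (this assumes, as in your sample, that ah and ai contain exactly one row per Sid and are in the same Sid order as kgrp):

import datetime

for sid, temp1, temp2 in zip(kgrp.index, ah['Tstamp'], ai['Tstamp']):
    sv1 = datetime.datetime.strptime(temp1, "%Y-%m-%dT%H:%M:%S.%fZ")
    sv2 = datetime.datetime.strptime(temp2, "%Y-%m-%dT%H:%M:%S.%fZ")
    # subtracting the datetimes directly avoids the mktime round trip
    kgrp.loc[sid, 'duration'] = (sv1 - sv2).total_seconds()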
Update
A more pythonic way to handle this :)
def calculate_duration(dt1, dt2):
    # do the calculation here, return the duration in seconds
    ...

k = k.loc[:, ('Sid', 'Tstamp')]

result = k.groupby(['Sid'])['Tstamp'].agg({
    'Duration': lambda x: calculate_duration(x.max(), x.min()),
    'Count': lambda x: x.count()
})
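One possible way to fill in that helper, assuming Tstamp is still a string column in the ISO format shown above:

import pandas as pd

def calculate_duration(dt1, dt2):
    # parse the ISO-8601 timestamps and return the difference in seconds
    return (pd.to_datetime(dt1) - pd.to_datetime(dt2)).total_seconds()

Note that recent pandas versions no longer accept the dict-of-lambdas renaming form of .agg on a grouped Series; in that case the same result can be written with named aggregation, e.g. .agg(Duration=lambda x: calculate_duration(x.max(), x.min()), Count='count').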