Time Series from different variables - python

I am trying to create a variable that displays how many days a bulb was functional, based on several other variables (Score_Day_0 through Score_Day_40).
The dataset I am using looks like the one below, where the score at each day ranges from 1 (working very well) to 10 (stopped working).
What I want is to understand how to create the variable Days, which displays the number of days the bulb was working; e.g. for sample 2, the score at day 10 is 8 and at day 20 is 10 (stopped working), so the number of days the bulb was working is 20.
Any suggestion?
Thank you so much for your help, hope you have a terrific day!!
sample      Score_Day_0  Score_Day_10  Score_Day_20  Score_Day_30  Score_Day_40  Days
sample 1              1             3             5             8            10    40
sample 2              3             8            10            10            10    20
I've tried to solve it myself with a conditional loop, but the number of observations in Days ends up much higher than the number of observations in the original df.
Here is the code I used:
cols = df[['Score_Day_0', 'Score_Day_10', 'Score_Day_20', 'Score_Day_30', 'Score_Day_40']]
Days = []
for j in cols['Score_Day_0']:
    if j == 10:
        Days.append(0)
for k in cols['Score_Day_10']:
    if k == 10:
        Days.append(10)
for l in cols['Score_Day_20']:
    if l == 10:
        Days.append(20)
for n in cols['Score_Day_30']:
    if n == 10:
        Days.append(30)
for m in cols['Score_Day_40']:
    if m == 10:
        Days.append(40)

You're looking for the first column label (left to right) at which the value is maximal in each row.
You can apply a given function on each row using pandas.DataFrame.apply with axis=1:
df.apply(function, axis=1)
The passed function receives each row as a Series object. To find the first occurrence of the maximal value in a row, we filter the Series with a boolean condition and take the first entry of the resulting index, which is exactly what we were looking for: the label of the column where the row first reaches its maximum.
lambda x: x[x == x.max()].index[0]
Example:
df = pd.DataFrame(dict(d0=[1,1,1],d10=[1,5,10],d20=[5,10,10],d30=[8,10,10]))
# d0 d10 d20 d30
# 0 1 1 5 8
# 1 1 5 10 10
# 2 1 10 10 10
df['days'] = df.apply(lambda x: x[x == x.max()].index[0], axis=1)
df
# d0 d10 d20 d30 days
# 0 1 1 5 8 d30
# 1 1 5 10 10 d20
# 2 1 10 10 10 d10
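Equivalently, since idxmax returns the label of the first occurrence of the maximum, the same result can be obtained without a lambda (a small sketch on the same example data):
import pandas as pd

df = pd.DataFrame(dict(d0=[1, 1, 1], d10=[1, 5, 10], d20=[5, 10, 10], d30=[8, 10, 10]))
# idxmax(axis=1) returns, for each row, the label of the first column
# holding the row's maximum value - the same as the lambda above
df['days'] = df.idxmax(axis=1)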

Related

Conditional Iteration over a Pandas Dataframe

I am trying to loop over a pandas data frame to build expressions that meet specific conditions in an optimization task.
Let me provide some background and what I have done so far.
The table below is a sample of the first rows of my input data (named df_long) after loading and melting with pandas. I have 150 rows in my actual dataset.
Hour TypeofTask TaskFrequency TotalTaskatSpecificHour
0 08 A 5 50
1 09 D 8 30
2 08 D 7 50
3 10 C 4 20
4 09 B 6 30
5 08 B 9 50
6 10 A 2 20
7 09 D 1 30
8 08 C 3 50
9 08 E 2 50
10 09 A 7 30
I have also created decision variables, i.e. x0, x1, x2, ..., xn, one for each row of the input data, using a loop as below:
decision_variables = []
for rownum, row in df_long.iterrows():
    variable = str('x' + str(rownum))
    variable = pulp.LpVariable(str(variable), lowBound=0, cat='Integer')
    decision_variables.append(variable)
My actual question:
I want to loop through the dataframe, find all the TaskFrequency values that occur at a specific hour, and multiply each TaskFrequency by the decision variable for that row; the whole expression should be less than or equal to the TotalTaskatSpecificHour for that hour.
e.g. an expression like this for Hour 10 would be:
4*x3 + 2*x6 <= 20
So far I have been able to do this:
to = ""
for rownum, row in df_long.iterrows():
for i, wo in enumerate(decision_variables):
if rownum == i:
formula = row['TaskFrequency']*wo
to += formula
prob += to
this gave me:
5*x0 + 8*x1 + 7*x2 + 4*x3 + 6*x4 + 9*x5 + 2*x6 + 1*x7 +3*x8 + 2*x9 + 7*x10
I also tried this:
for rownum, row in df_long.iterrows():
    for i, wo in enumerate(decision_variables):
        for x, y, z in zip(df_long['Hour'], df_long['TypeofTask'], df_long['TaskFrequency']):
            if rownum == i:
                formula1 = row['TaskFrequency']*wo
I just get 7*x10
What I wish to get is the same kind of expression, but for a specific Hour instead of everything combined, e.g.
for Hour 10 it should be,
4*x3 + 2*x6 <= 20
for Hour 9 it should be,
8*x1 + 6*x4 + 1*x7 + 7*x10 <= 30
I look forward to your suggestions and help.
Regards
Diva
You want one expression per hour; in essence you don't need to apply a function row by row, but can condense the df with groupby, or by slicing.
I think groupby is the standard way to do it, but a lambda is a no-brainer:
def fun1(df, Hours, prod):
    # row.name is the row index, used here only as a stand-in for that row's decision variable
    return sum(df[df['Hour'] == Hours].apply(lambda row: int(row.name) * row['TaskFrequency'], axis=1)) <= prod
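For reference, a minimal sketch of how the per-hour constraints could be built directly with groupby and PuLP, assuming df_long, prob and decision_variables are defined exactly as in the question (the decision variable for row i is decision_variables[i]):
import pulp

# one constraint per hour: sum(TaskFrequency * x_i over the rows at that hour) <= TotalTaskatSpecificHour
for hour, grp in df_long.groupby('Hour'):
    expr = pulp.lpSum(row['TaskFrequency'] * decision_variables[idx]
                      for idx, row in grp.iterrows())
    prob += expr <= grp['TotalTaskatSpecificHour'].iloc[0]
For Hour 10 this produces 4*x3 + 2*x6 <= 20, matching the example in the question.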

Discard points with X,Y coordinate close to each other in Dataframe

I have the following dataframe (it is actually several hundred MB long):
X Y Size
0 10 20 5
1 11 21 2
2 9 35 1
3 8 7 7
4 9 19 2
I want to discard any X, Y point whose Euclidean distance to any other X, Y point in the dataframe is less than delta=3. In those cases I want to keep only the row with the bigger Size.
In this example the intended result would be:
X Y Size
0 10 20 5
2 9 35 1
3 8 7 7
As the question is stated, it is not clear how the desired algorithm should deal with chaining of distances.
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighbourhood radius eps to delta and the min_samples parameter to 1 to allow isolated points as clusters. Then you can find, in each group, which point has the maximum Size.
from sklearn.cluster import DBSCAN
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
df_new = df.loc[df.groupby('grp').idxmax()['Size']]
print(df_new)
>>>
X Y Size grp
0 10 20 5 0
2 9 35 1 1
3 8 7 7 2
You can use the script below and also try improving it.
#get all euclidean distances using sklearn;
#it will create an array of euc distances;
#then get index from df whose euclidean distance is less than 3
from sklearn.metrics.pairwise import euclidean_distances
Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc)-1) for j in range(i+1, len(euc)) if euc[i, j] < 3]
# collect all indices of df that appear in a pair with euc dist < 3 and find the max-Size row among them
# then keep every row of df NOT involved in any close pair
# and build df_new by combining those rows with the max-Size row
from itertools import chain
df_idx = list(set(chain(*idx)))
df2 = df.loc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
df_new = pd.concat([df.loc[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5

Python Data Frame: cumulative sum of column until condition is reached and return the index

I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I am not able to express myself properly.
Say I have a simple data frame with two columns:
index Num_Albums Num_authors
0 10 4
1 1 5
2 4 4
3 7 1000
4 1 44
5 3 8
Num_Albums_tot = sum(Num_Albums) = 30
I need to do a cumulative sum of the data in Num_Albums until a certain condition is reached, record the index at which the condition is reached, and get the corresponding value from Num_authors.
Example:
cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (--> 15±2):
10 = 15±2? No, then continue;
10+1 =15±2? No, then continue
10+1+4 = 15±2? Yes, stop.
Condition reached at index 2. Then get Num_Authors at that index: Num_Authors(2)=4
I would like to see if there's a function already implemented in pandas, before I start thinking how to do it with a while/for loop....
[I would like to be able to specify the column from which I retrieve the value at the relevant index (this comes in handy when I have e.g. 4 columns and I want to sum the elements in column 1, then, once the condition is reached, get the corresponding value in column 2; and then do the same with columns 3 and 4).]
Opt - 1:
You could compute the cumulative sum using cumsum. Then use np.isclose with its built-in tolerance parameter to check which values of this series lie within the specified threshold of 15 +/- 2. This returns a boolean array.
With np.flatnonzero, get the ordinal indices for which the condition holds, and select the first True value.
Finally, use .iloc to retrieve the value of the column you require at the index computed earlier.
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val] # for faster access, use .iat
4
When performing np.isclose on the series later converted to an array:
np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False, True, False, False, False], dtype=bool)
Opt - 2:
Use pd.Index.get_loc on the cumsum calculated series which also supports a tolerance parameter on the nearest method.
val = pd.Index(df.Num_Albums.cumsum()).get_loc(15, 'nearest', tolerance=2)
df.get_value(val, 'Num_authors')
4
Opt - 3:
Use idxmax to find the first index of a True value for the boolean mask created after sub and abs operations on the cumsum series:
df.get_value(df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors')
4
I think you can directly add a column with the cumulative sum as:
In [3]: df
Out[3]:
index Num_Albums Num_authors
0 0 10 4
1 1 1 5
2 2 4 4
3 3 7 1000
4 4 1 44
5 5 3 8
In [4]: df['cumsum'] = df['Num_Albums'].cumsum()
In [5]: df
Out[5]:
index Num_Albums Num_authors cumsum
0 0 10 4 10
1 1 1 5 11
2 2 4 4 15
3 3 7 1000 22
4 4 1 44 23
5 5 3 8 26
Then apply the condition you want to the cumsum column. For instance, you can use where to get the full rows matching the filter, with the tolerance tol:
In [18]: tol = 2
In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()
In [20]: cond
Out[20]:
index Num_Albums Num_authors cumsum
2 2.0 4.0 4.0 15.0
This could even be done with the following code:
def your_function(df):
    sum = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        sum += i
        index += 1
        if sum == ( " your_condition " ):
            return (index, df.loc[df.Num_Albums == i, 'Num_authors'])
This would actually return a tuple of your index and the corresponding value of Num_authors as soon as the "your condition" is reached.
Or it could even be returned as an array with:
def your_function(df):
    sum = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        sum += i
        index += 1
        if sum == ( " your_condition " ):
            return df.loc[df.Num_Albums == i, 'Num_authors'].index.values
I was not able to figure out the exact condition for when to stop the cumulative sum, so I left it as " your_condition " in the code!!
I am also new so hope it helps !!
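For the concrete threshold used in the question (15 ± 2), the placeholder condition could be filled in, for example, like this (a sketch; first_author_at_threshold, target and tol are illustrative names, not from the original post):
def first_author_at_threshold(df, target=15, tol=2):
    # walk the cumulative sum and stop at the first row within target +/- tol
    total = 0
    for index, albums in enumerate(df['Num_Albums']):
        total += albums
        if abs(total - target) <= tol:
            return index, df['Num_authors'].iloc[index]
    return None  # condition never reached
On the sample data this returns (2, 4), matching the walkthrough above.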

Pandas check for overlapping dates in multiple rows

I need to run a function on a large groupby query that checks whether two subGroups have any overlapping dates. Below is an example of a single group tmp:
ID num start stop subGroup
0 21 10 2006-10-10 2008-10-03 1
1 21 46 2006-10-10 2100-01-01 2
2 21 5 1997-11-25 1998-09-29 1
3 21 42 1998-09-29 2100-01-01 2
4 21 3 1997-01-07 1997-11-25 1
5 21 6 2006-10-10 2008-10-03 1
6 21 47 1998-09-29 2006-10-10 2
7 21 4 1997-01-07 1998-09-29 1
The function I wrote to do this looks like this:
def hasOverlap(tmp):
    d2_starts = tmp[tmp['subGroup']==2]['start']
    d2_stops = tmp[tmp['subGroup']==2]['stop']
    return tmp[tmp['subGroup']==1].apply(lambda row_d1:
        (
            # Check for partly nested D2 in D1
            ((d2_starts >= row_d1['start']) &
             (d2_starts < row_d1['stop'])) |
            ((d2_stops >= row_d1['start']) &
             (d2_stops < row_d1['stop'])) |
            # Check for fully nested D1 in D2
            ((d2_stops >= row_d1['stop']) &
             (d2_starts <= row_d1['start']))
        ).any(),
        axis=1
    ).any()
The problem is that this code has many redundancies and when I run the query:
groups.agg(hasOverlap)
It takes an unreasonably long time to terminate.
Are there any performance fixes (such as using built-in functions or set_index) that I could do to speed this up?
Are you just looking to return "True" or "False" based on the presence of an overlap? If so, I'd just get a list of the dates for each subgroup, and then use pandas' isin method to check if they overlap.
You could try something like this:
# split subgroups into separate DF's
group1 = groups[groups.subGroup == 1]
group2 = groups[groups.subGroup == 2]
# check if any of the start dates from group 2 are in group 1
if len(group1[group1.start.isin(list(group2.start))]) > 0:
    print("Group1 overlaps group2")
# check if any of the start dates from group 1 are in group 2
if len(group2[group2.start.isin(list(group1.start))]) > 0:
    print("Group2 overlaps group1")

Timestamp String to Seconds Conversion in Python

I'm a beginner in Python data science. I'm working on clickstream data and want to find out the duration of a session. For that I find the start time and end time of the session. However, on subtraction I'm getting the wrong answer.
Here is the data
Sid Tstamp Itemid Category
0 1 2014-04-07T10:51:09.277Z 214536502 0
1 1 2014-04-07T10:54:09.868Z 214536500 0
2 1 2014-04-07T10:54:46.998Z 214536506 0
3 1 2014-04-07T10:57:00.306Z 214577561 0
4 2 2014-04-07T13:56:37.614Z 214662742 0
5 2 2014-04-07T13:57:19.373Z 214662742 0
6 2 2014-04-07T13:58:37.446Z 214825110 0
7 2 2014-04-07T13:59:50.710Z 214757390 0
8 2 2014-04-07T14:00:38.247Z 214757407 0
9 2 2014-04-07T14:02:36.889Z 214551617 0
10 3 2014-04-02T13:17:46.940Z 214716935 0
11 3 2014-04-02T13:26:02.515Z 214774687 0
12 3 2014-04-02T13:30:12.318Z 214832672 0
I referred to this question for the code: Timestamp Conversion
Here is my code:
import datetime
import time

k.columns = ['Sid', 'Tstamp', 'Itemid', 'Category']
k = k.loc[:, ('Sid', 'Tstamp')]
# Find max timestamp
idx = k.groupby(['Sid'])['Tstamp'].transform(max) == k['Tstamp']
ah = k[idx].reset_index()
# Find min timestamp
idy = k.groupby(['Sid'])['Tstamp'].transform(min) == k['Tstamp']
ai = k[idy].reset_index()
# grouping by Sid and applying count to retain the distinct Sid values
kgrp = k.groupby('Sid').count()
i = 0
for temp1, temp2 in zip(ah['Tstamp'], ai['Tstamp']):
    sv1 = datetime.datetime.strptime(temp1, "%Y-%m-%dT%H:%M:%S.%fZ")
    sv2 = datetime.datetime.strptime(temp2, "%Y-%m-%dT%H:%M:%S.%fZ")
    d1 = time.mktime(sv1.timetuple()) + (sv1.microsecond / 1000000.0)
    d2 = time.mktime(sv2.timetuple()) + (sv2.microsecond / 1000000.0)
    kgrp.loc[i, 'duration'] = d1 - d2
    i = i + 1
Here is the output.
kgrp
Out[5]:
Tstamp duration
Sid
1 4 359.275
2 6 745.378
3 3 1034.468
For session id 2, the duration should be close to 6 minutes however I'm getting almost 12 minutes. I reckon I'm making some silly mistake here.
Also, I'm grouping by Sid and applying count on it so as to get the Sid column and store each duration as a separate column. Is there any easier method through which I can store only the Sid (not the 'Tstamp' Count Column) and its duration values?
You are assigning the duration value to the wrong label.
In your test data sid starts from 1 but i starts from 0:
# for sid 1, i == 0
kgrp.loc[i,'duration']= d1-d2
i=i+1
Update
A more pythonic way to handle this :)
def calculate_duration(dt1, dt2):
    # one way to do the calculation, matching the timestamp format in the question:
    # parse the ISO 8601 strings and return the duration in seconds
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    d1 = datetime.datetime.strptime(dt1, fmt)
    d2 = datetime.datetime.strptime(dt2, fmt)
    return (d1 - d2).total_seconds()

k = k.loc[:, ('Sid', 'Tstamp')]
result = k.groupby(['Sid'])['Tstamp'].agg({
    'Duration': lambda x: calculate_duration(x.max(), x.min()),
    'Count': lambda x: x.count()
})
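As a caveat, newer pandas versions reject the dict form of agg on a grouped Series; a rough equivalent using pd.to_datetime and named aggregation (a sketch, assuming pandas 0.25 or later) would be:
import pandas as pd

k['Tstamp'] = pd.to_datetime(k['Tstamp'])  # parse the ISO timestamps once
result = (k.groupby('Sid')['Tstamp']
            .agg(Duration=lambda x: (x.max() - x.min()).total_seconds(),
                 Count='count'))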
