Conditional Iteration over a Pandas Dataframe - python

I am trying to loop over a pandas data frame to meet specific conditions in an optimization task.
Let me provide some background and what I have done so far.
The table below is a sample of the first rows of my input data (named df_long) after loading and melting with pandas. I have 150 rows in my actual dataset.
Hour TypeofTask TaskFrequency TotalTaskatSpecificHour
0 08 A 5 50
1 09 D 8 30
2 08 D 7 50
3 10 C 4 20
4 09 B 6 30
5 08 B 9 50
6 10 A 2 20
7 09 D 1 30
8 08 C 3 50
9 08 E 2 50
10 09 A 7 30
I have also created decision variables, i.e. x0, x1, x2, ..., xn, one for each row of the input data, using a loop as below:
decision_variables = []
for rownum, row in df_long.iterrows():
    variable = str('x' + str(rownum))
    variable = pulp.LpVariable(str(variable), lowBound=0, cat='Integer')
    decision_variables.append(variable)
My actual question:
I want to loop through the data frame, find all the TaskFrequency values that occur at a specific hour, and multiply each TaskFrequency by the decision variable for its row; the whole expression should be less than or equal to TotalTaskatSpecificHour for that hour.
e.g. an expression like this for Hour 10 would be:
4*x3 + 2*x6 <= 20
So far I have been able to do this:
to = ""
for rownum, row in df_long.iterrows():
    for i, wo in enumerate(decision_variables):
        if rownum == i:
            formula = row['TaskFrequency']*wo
            to += formula
prob += to
this gave me:
5*x0 + 8*x1 + 7*x2 + 4*x3 + 6*x4 + 9*x5 + 2*x6 + 1*x7 +3*x8 + 2*x9 + 7*x10
I also tried this:
for rownum, row in df_long.iterrows():
    for i, wo in enumerate(decision_variables):
        for x, y, z in zip(df_long['Hour'], df_long['TypeofTask'], df_long['TaskFrequency']):
            if rownum == i:
                formula1 = row['TaskFrequency']*wo
but that only gives me 7*x10.
What I wish to get is the same kind of expression, but restricted to a specific Hour instead of everything combined, e.g.
for Hour 10 it should be,
4*x3 + 2*x6 <= 20
for Hour 9 it should be,
8*x1 + 6*x4 + 1*x7 + 7*x10 <= 30
I look forward to your suggestions and help.
Regards
Diva

You want one expression per hour, so in essence you don't need to apply a function row by row; condense the df with groupby or by slicing. I think groupby is the standard way to do it, but a slice plus a lambda is also straightforward:
def fun1(df, Hours, prod):
    return sum(df[df['Hour'] == Hours].apply(lambda row: int(row.name)*row['TaskFrequency'], axis=1)) <= prod
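For the per-hour constraints themselves, a fuller sketch along the groupby line could look like this (my sketch, not the original poster's code: it assumes the df_long and decision_variables built above, an existing pulp LpProblem called prob, a default 0..n-1 index on df_long so row numbers line up with the decision-variable list, and a column literally named TotalTaskatSpecificHour):
import pulp

# one capacity constraint per hour: sum(TaskFrequency * x_i) <= TotalTaskatSpecificHour
for hour, grp in df_long.groupby('Hour'):
    expr = pulp.lpSum(row['TaskFrequency'] * decision_variables[rownum]
                      for rownum, row in grp.iterrows())
    prob += expr <= grp['TotalTaskatSpecificHour'].iloc[0], "capacity_hour_" + str(hour)
With the sample data this adds 4*x3 + 2*x6 <= 20 for Hour 10 and 8*x1 + 6*x4 + 1*x7 + 7*x10 <= 30 for Hour 09.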

Related

Time Series from different variables

I am trying to create a variable that shows how many days a bulb was functional, based on the score variables (Score_Day_0, ...).
The dataset I am using is like the one below, where the scores at different days are: 1 -> working very well and 10 -> stopped working.
What I want is to understand how to create the variable Days, which should hold the number of days the bulb was working; e.g. for sample 2, the score at day 10 is 8 and at day 20 it is 10 (stopped working), so the number of days the bulb was working is 20.
Any suggestion?
Thank you so much for your help, hope you have a terrific day!!
sample    Score_Day_0  Score_Day_10  Score_Day_20  Score_Day_30  Score_Day_40  Days
sample 1  1            3             5             8             10            40
sample 2  3            8             10            10            10            20
I've tried to solve it myself with a conditional loop, but the number of observations in Days comes out much higher than the number of observations in the original df.
Here is the code I used:
cols = df[['Score_Day_0', 'Score_Day_10....,'Score_Day_40']]
Days = []
for j in cols['Score_Day_0']:
    if j == 10:
        Days.append(0)
for k in cols['Score_Day_10']:
    if k == 10:
        Days.append('10')
for l in cols['Score_Day_20']:
    if l == 10:
        Days.append('20')
for m in cols['Score_Day_30']:
    if m == 10:
        Days.append('30')
for n in cols['Score_Day_40']:
    if n == 10:
        Days.append('40')
You're looking for the first column label (left to right) at which the value is maximal in each row.
You can apply a given function on each row using pandas.DataFrame.apply with axis=1:
df.apply(function, axis=1)
The passed function receives the row as a Series object. To find the first occurrence of the maximum in that Series, we use a simple boolean locator with our condition and take the first entry of the resulting index, which is exactly what we were looking for: the label of the column where the row first reaches its maximal value.
lambda x: x[x == x.max()].index[0]
Example:
df = pd.DataFrame(dict(d0=[1,1,1],d10=[1,5,10],d20=[5,10,10],d30=[8,10,10]))
# d0 d10 d20 d30
# 0 1 1 5 8
# 1 1 5 10 10
# 2 1 10 10 10
df['days'] = df.apply(lambda x: x[x == x.max()].index[0], axis=1)
df
# d0 d10 d20 d30 days
# 0 1 1 5 8 d30
# 1 1 5 10 10 d20
# 2 1 10 10 10 d10
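If you want the numeric day rather than the column label (as in the Days column you sketched), one possible follow-up, assuming the real labels embed the day number like Score_Day_20, is to pull the digits out of the winning label:
# extract the digits from the chosen column label, e.g. 'Score_Day_20' -> 20 (or 'd20' -> 20)
df['days'] = df['days'].str.extract(r'(\d+)', expand=False).astype(int)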

Function for extract-min on Young tableau succeeds on one array and not the other

For those not familiar with Young tableaus, they must increase in value from left to right and from top to bottom. We may have infinity values, but infinity values can only be at the end, as they are the largest.
I've written a function, extract_min, that removes the smallest value and replaces it with the largest value, and puts infinity at the index where the largest value once was. It then must shift the value at the first position until the rules of the Young tableau are restored (values increase from left to right and top to bottom). For example, in the following table:
12  13  15
13  18  20
15  23  25
We will remove 12, replace it with 25, and replace 25 with infinity, resulting in the following table:
25  13  15
13  18  20
15  23  inf
We then perform an operation that moves the value 25 until the rows and columns are in increasing order from left to right and from top to bottom, respectively. By the end, it should look like this:
11  13  15
18  20  25
22  23  inf
My code is as follows:
def extract_min(arr):
    min = arr[0][0]
    arr[0][0], arr[len(arr)-1][len(arr)-1] = arr[len(arr)-1][len(arr)-1], float('inf')
    young_order(arr, 0, 0)
    return min

def young_order(arr, x, y):
    while (arr[x][y] >= arr[x + 1][y] and arr[x][y] >= arr[x][y + 1]):
        if arr[x + 1][y] < arr[x][y + 1] or y + 1 == len(arr):
            arr[x][y], arr[x + 1][y] = arr[x + 1][y], arr[x][y]
            x += 1
        if arr[x][y + 1] < arr[x + 1][y] or x + 1 == len(arr):
            arr[x][y], arr[x][y + 1] = arr[x][y + 1], arr[x][y]
            y += 1

for i in range(len(empty)):
    print(empty[i])
For some reason, this function works on the example I've provided, but not on the following table:
1  2  3
4  5  6
7  8  9
My questions are:
Why, and how do I fix it?
How can I possibly make this algorithm run recursively?
TIA.
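One likely culprit: the while condition reads arr[x + 1][y] and arr[x][y + 1] without first checking whether x or y is already on the last row or column, so once the displaced value reaches the right or bottom edge the next comparison raises an IndexError, which is what happens on the second table. As for recursion, here is a minimal sketch of a boundary-checked, recursive sift-down (my own sketch, not a fix of your exact code: it assumes a square list-of-lists tableau and treats out-of-range neighbours as infinity):
import math

def young_order(arr, x, y):
    # recursively sift arr[x][y] down/right until rows and columns are non-decreasing again
    n = len(arr)
    down = arr[x + 1][y] if x + 1 < n else math.inf   # out of bounds counts as infinity
    right = arr[x][y + 1] if y + 1 < n else math.inf
    if arr[x][y] <= min(down, right):
        return                          # tableau order restored
    if down <= right:
        arr[x][y], arr[x + 1][y] = arr[x + 1][y], arr[x][y]
        young_order(arr, x + 1, y)
    else:
        arr[x][y], arr[x][y + 1] = arr[x][y + 1], arr[x][y]
        young_order(arr, x, y + 1)

def extract_min(arr):
    n = len(arr)
    smallest = arr[0][0]
    # move the largest (bottom-right) value to the top-left and leave inf behind
    arr[0][0], arr[n - 1][n - 1] = arr[n - 1][n - 1], math.inf
    young_order(arr, 0, 0)
    return smallest
Calling extract_min on [[1, 2, 3], [4, 5, 6], [7, 8, 9]] with this version returns 1 and leaves [[2, 3, 6], [4, 5, 9], [7, 8, inf]], which satisfies the tableau rules.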

How to use Pandas to block average data frame with a length of 10 frames?

I am new to Pandas, so I wonder whether there is a better way to finish this task.
I have a data frame in the following format; it is DNA simulation data from molecular dynamics.
The data set is here: BPdata.csv
There are 1000 frames in total, and my purpose is to get the average of every 10 frames, so in the end I want the data to look like this:
Block Base1 Base2 Shear Stretch Stagger .....
1 1 66 XX XX XX
1 2 65 XX XX XX
... ... ... ... ... ...
1 33 34 XX XX XX
2 1 66 XX XX XX
2 2 65 XX XX XX
... ... ... ... ... ...
2 33 34 XX XX XX
3 1 66 XX XX XX
3 2 65 XX XX XX
... ... ... ... ... ...
3 33 34 XX XX XX
4 1 66 XX XX XX
4 2 65 XX XX XX
... ... ... ... ... ...
4 33 34 XX XX XX
Here Block 1 represents the mean over frames 1-10 and Block 2 the mean over frames 11-20.
Although I think I could finish this task by carefully assigning the index of each row, I wonder whether there is a more convenient way. I have checked some web pages about the groupby functions in pandas, but there does not seem to be a built-in "group every 10 rows and take a block average" function.
Thank you!
=============================== Update ==================================
Sorry for not being clear in the description of my purpose. I have figured out a way to do the task, and a sample output to better illustrate what I am after.
For double-stranded DNA, we know it is a double helix built from A, G, C and T, so Base1 is a base on one strand and Base2 is the complementary base on the other strand. The two corresponding bases are linked together by hydrogen bonds,
like:
Base1 : AAAGGGCCCTTT
        ||||||||||||
Base2 : TTTCCCGGGAAA
So in BPdata.csv each combination of Base1 and Base2 is one DNA base pair.
BPdata.csv contains a 33-base-pair DNA simulated over time frames noted 1, 2, 3, 4, ..., 1000.
I then want to group every 10 time frames together (1-10, 11-20, 21-30, ...) and, within each group, average each base pair.
And here is the code I figured out:
# -*- coding: utf-8 -*-
import pandas as pd
'''
Data Input
'''
# Import CSV data to Python
BPdata = pd.read_csv("BPdata.csv", delim_whitespace = True, skip_blank_lines = False)
BPdata.rename(columns={'#Frame':'Frame'}, inplace=True)
'''
Data Processing
'''
# constant block average parameters
Interval20ns = 10
IntervalInBPdata = 34
# BPdataBlockAverageSummary
LEN_BPdata = len(BPdata)
# For Frame 1
i = 1
indexStarting = 0
indexEnding = 0
indexStarting = indexEnding
indexEnding = Interval20ns * IntervalInBPdata * i - 1
GPtemp = BPdata.loc[indexStarting : indexEnding]
GPtemp['Frame'] = str(i)
BPdata_blockOF1K_mean = GPtemp.groupby(['Frame','Base1','Base2']).mean()
BPdata_blockOF1K_mean.loc[len(BPdata_blockOF1K_mean)] = str(i)
# For Frame 2 and so on
i = i + 1
indexStarting = indexEnding + 1
indexEnding = Interval20ns * IntervalInBPdata * i - 1
while indexEnding <= LEN_BPdata - 1:
    GPtemp = BPdata.loc[indexStarting : indexEnding]
    GPtemp['Frame'] = str(i)
    meanTemp = GPtemp.groupby(['Frame','Base1','Base2']).mean()
    meanTemp.loc[len(meanTemp)] = str(i)
    BPdata_blockOF1K_mean = pd.concat([BPdata_blockOF1K_mean, meanTemp])
    i = i + 1
    indexStarting = indexEnding + 1
    indexEnding = Interval20ns * IntervalInBPdata * i - 1
The result is something like this, which is what I wanted; here is the sample output: BPdataresult.csv
But so far I get this warning (repeated):
/home/iphyer/Downloads/dataProcessing.py:62: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  GPtemp['Frame'] = str(i)
And here I wonder:
Is this warning serious?
Because of the groupby, the index of the data frame is now a combination of (Frame, Base1, Base2); how can I separate them back into ordinary columns like the original form, replacing #Frame with the Block index?
Can I improve the code, or is there a more pandas-idiomatic way to do this task?
Best!
Grouping in pandas can be done in a variety of ways. One of those ways is to pass a series, so you could pass a series whose values label blocks of 10 rows. The solution works as follows:
import pandas as pd
import numpy as np
#create dataframe with 1000 rows
df = pd.DataFrame(np.random.rand(1000, 1))
#create series for grouping
groups_of_ten = pd.Series(np.repeat(range(int(len(df)/10)), 10))
#group the data
grouped = df.groupby(groups_of_ten)
#aggregate
grouped.agg('mean')
The grouping series looks like this on the inside:
In [21]: groups_of_ten.head(20)
Out[21]:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
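Applied to the data in the question, that idea might look like the sketch below. It is only a sketch under assumptions: BPdata has already been read in and renamed as in your code, the Frame column runs 1..1000, and the remaining value columns are numeric.
# map each frame to a block: frames 1-10 -> block 1, 11-20 -> block 2, ...
BPdata['Block'] = (BPdata['Frame'].astype(int) - 1) // 10 + 1

# average every numeric column per block and per base pair, then turn the
# groupby MultiIndex back into ordinary columns with reset_index()
block_mean = (BPdata.drop(columns='Frame')
                    .groupby(['Block', 'Base1', 'Base2'])
                    .mean()
                    .reset_index())
Since nothing is written into a slice of BPdata here, this also avoids the SettingWithCopyWarning, and reset_index() answers the question about splitting the (Frame, Base1, Base2) index back into columns.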

Pandas track consecutive near numbers via compare-cumsum-groupby pattern

I am trying to extend my current pattern to accommodate an extra condition: treating a value as "the same" when it is within +- a percentage of the last value, rather than requiring a strict match with the previous value.
data = np.array([[2,30],[2,900],[2,30],[2,30],[2,30],[2,1560],[2,30],
[2,300],[2,30],[2,450]])
df = pd.DataFrame(data)
df.columns = ['id','interval']
UPDATE 2 (id fix): Updated Data 2 with more data:
data2 = np.array([[2,30],[2,900],[2,30],[2,29],[2,31],[2,30],[2,29],[2,31],[2,1560],[2,30],[2,300],[2,30],[2,450], [3,40],[3,900],[3,40],[3,39],[3,41], [3,40],[3,39],[3,41] ,[3,1560],[3,40],[3,300],[3,40],[3,450]])
df2 = pd.DataFrame(data2)
df2.columns = ['id','interval']
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
results in [30,30,30]
However, I really want to catch near-number conditions, say when a number is within +-10% of the previous number.
So, looking at df2, I would also like to pick up the series [30, 29, 31].
for i, g in df2.groupby([(df2.interval != <??? +-10% magic ???>).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
UPDATE: Here is the downstream processing code where I store the gathered lists in a dictionary with the id as the key:
leak_intervals = {}
final_leak_intervals = {}
serials = []
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
        serial = g.id.values[0]
        if serial not in serials:
            serials.append(serial)
        if serial not in leak_intervals:
            leak_intervals[serial] = g.interval.tolist()
        else:
            leak_intervals[serial] = leak_intervals[serial] + (g.interval.tolist())
UPDATE:
In [116]: df2.groupby(df2.interval.pct_change().abs().gt(0.1).cumsum()) \
.filter(lambda x: len(x) >= 3)
Out[116]:
id interval
2 2 30
3 2 29
4 2 31
5 2 30
6 2 29
7 2 31
15 3 40
16 3 39
17 3 41
18 3 40
19 3 39
20 3 41
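Combining that pct_change grouping with the dictionary accumulation from the earlier update, and additionally grouping on id so a run can never straddle two ids, might look like this sketch (my own combination of the pieces above, not tested against your full data):
# start a new run whenever the interval jumps by more than 10% of the previous value
breaks = df2.interval.pct_change().abs().gt(0.1).cumsum()

leak_intervals = {}
for _, g in df2.groupby(['id', breaks]):
    if len(g) >= 3:
        serial = g.id.iloc[0]
        leak_intervals.setdefault(serial, []).extend(g.interval.tolist())
# with df2 above this yields {2: [30, 29, 31, 30, 29, 31], 3: [40, 39, 41, 40, 39, 41]}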

Pandas check for overlapping dates in multiple rows

I need to run a function on a large groupby query that checks whether two subGroups have any overlapping dates. Below is an example of a single group tmp:
ID num start stop subGroup
0 21 10 2006-10-10 2008-10-03 1
1 21 46 2006-10-10 2100-01-01 2
2 21 5 1997-11-25 1998-09-29 1
3 21 42 1998-09-29 2100-01-01 2
4 21 3 1997-01-07 1997-11-25 1
5 21 6 2006-10-10 2008-10-03 1
6 21 47 1998-09-29 2006-10-10 2
7 21 4 1997-01-07 1998-09-29 1
The function I wrote to do this looks like this:
def hasOverlap(tmp):
    d2_starts = tmp[tmp['subGroup']==2]['start']
    d2_stops = tmp[tmp['subGroup']==2]['stop']
    return tmp[tmp['subGroup']==1].apply(lambda row_d1:
        (
            # Check for partly nested D2 in D1
            ((d2_starts >= row_d1['start']) &
             (d2_starts < row_d1['stop'])) |
            ((d2_stops >= row_d1['start']) &
             (d2_stops < row_d1['stop'])) |
            # Check for fully nested D1 in D2
            ((d2_stops >= row_d1['stop']) &
             (d2_starts <= row_d1['start']))
        ).any(),
        axis=1
    ).any()
The problem is that this code has many redundancies and when I run the query:
groups.agg(hasOverlap)
It takes an unreasonably long time to terminate.
Are there any performance fixes (such as using built-in functions or set_index) that I could do to speed this up?
Are you just looking to return "True" or "False" based on the presence of an overlap? If so, I'd just get a list of the dates for each subgroup, and then use the pandas isin method to check whether they overlap.
You could try something like this:
#split the subgroups into separate DataFrames
group1 = groups[groups.subgroup == 1]
group2 = groups[groups.subgroup == 2]
#check if any of the start dates from group 2 are in group 1
if len(group1[group1.start.isin(list(group2.start))]) > 0:
    print("Group1 overlaps group2")
#check if any of the start dates from group 1 are in group 2
if len(group2[group2.start.isin(list(group1.start))]) > 0:
    print("Group2 overlaps group1")
