pandas: rapidly calculating sum of column with certain values

pandas: rapidly calculating sum of column with certain values - python

I have a pandas dataframe and I need to calculate the sum of a column of values that fall within a certain window. So for instance, if I have a window of 500, and my initial value is 1000, I want to sum all values that are between 499 and 999, and also between 1001 and 1501.
This is easier to explain with some data:
chrom pos end AFR EUR pi
0 1 10177 10177 0.4909 0.4056 0.495988
1 1 10352 10352 0.4788 0.4264 0.496369
2 1 10617 10617 0.9894 0.9940 0.017083
3 1 11008 11008 0.1346 0.0885 0.203142
4 1 11012 11012 0.1346 0.0885 0.203142
5 1 13110 13110 0.0053 0.0567 0.053532
6 1 13116 13116 0.0295 0.1869 0.176091
7 1 13118 13118 0.0295 0.1869 0.176091
8 1 13273 13273 0.0204 0.1471 0.139066
9 1 13550 13550 0.0008 0.0080 0.007795
10 1 14464 14464 0.0144 0.1859 0.161422
11 1 14599 14599 0.1210 0.1610 0.238427
12 1 14604 14604 0.1210 0.1610 0.238427
13 1 14930 14930 0.4811 0.5209 0.500209
14 1 14933 14933 0.0015 0.0507 0.044505
15 1 15211 15211 0.5371 0.7316 0.470848
16 1 15585 15585 0.0008 0.0020 0.002635
17 1 15644 15644 0.0008 0.0080 0.007795
18 1 15777 15777 0.0159 0.0149 0.030470
19 1 15820 15820 0.4849 0.2714 0.477153
20 1 15903 15903 0.0431 0.4652 0.349452
21 1 16071 16071 0.0091 0.0010 0.011142
22 1 16142 16142 0.0053 0.0020 0.007721
23 1 16949 16949 0.0227 0.0159 0.038759
24 1 18643 18643 0.0023 0.0080 0.009485
25 1 18849 18849 0.8411 0.9911 0.170532
26 2 30923 30923 0.6687 0.9364 0.338400
27 2 20286 46286 0.0053 0.0010 0.006863
28 2 21698 46698 0.0015 0.0010 0.002566
29 2 42159 47159 0.0083 0.0696 0.067187
So I need to subset based on the first two columns. For example, if my window = 500, my chrom = 1 and my pos = 15500, I will need to subset my df to include only those rows that have chrom = 1 and 15000 > pos < 16000.
I would then like to sum the AFR column of this subset of data.
Here is the function I have made:
#vdf is my main dataframe,
#polyChrom is the chromosome to subset by,
#polyPos is the position to subset by.
#Distance is how far the window should be from the polyPos.
#windowSize is the size of the window itself
#E.g. if distance=20000 and windowSize= 500, we are looking at a window
#that is (polyPos-20000)-500 to (polyPos-20000) and a window that is
#(polyPos+20000) to (polyPos+20000)+500.
def mafWindow(vdf, polyChrom, polyPos, distance, windowSize):
#If start position becomes less than 0, set it to 0
if(polyPos - distance < 0):
start1 = 0
end1 = windowSize
else:
start1 = polyPos - distance
end1 = start1 + windowSize
end2 = polyPos + distance
start2 = end2 - windowSize
#subset df
df = vdf.loc[(vdf['chrom'] == polyChrom) & ((vdf['pos'] <= end1) & (vdf['pos'] >= start1))|
((vdf['pos'] <= end2) & (vdf['pos'] >= start2))].copy()
return(df.AFR.sum())
This whole method works on subsetting the dataframe and is very slow when my dataframe contains ~55k rows. Is there a quicker and more efficient way of doing this?

The trick is to drop down to numpy arrays. Pandas indexing and slicing is slow.
import pandas as pd
df = pd.DataFrame([[1, 10177, 0.5], [1, 10178, 0.2], [1, 20178, 0.1],
[2, 10180, 0.3], [1, 10180, 0.4]], columns=['chrom', 'pos', 'AFR'])
chrom = df['chrom'].values
pos = df['pos'].values
afr = df['AFR'].values
def filter_sum(chrom_arr, pos_arr, afr_arr, chrom_val, pos_start, pos_end):
return sum(k for i, j, k in zip(chrom_arr, pos_arr, afr_arr) \
if pos_start < j < pos_end and i == chrom_val)
filter_sum(chrom, pos, afr, 1, 10150, 10200)
# 1.1

Related

Attribute change with variable number of time steps

I would like to simulate individual changes in growth and mortality for a variable number of days. My dataframe is formatted as follows...
import pandas as pd
data = {'unique_id': ['2', '4', '5', '13'],
'length': ['27.7', '30.2', '25.4', '29.1'],
'no_fish': ['3195', '1894', '8', '2774'],
'days_left': ['253', '253', '254', '256'],
'growth': ['0.3898', '0.3414', '0.4080', '0.3839']
}
df = pd.DataFrame(data)
print(df)
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Ideally, I would like the initial length (i.e., length) to increase by the daily growth rate (i.e., growth) for each of the days remaining in the year (i.e., days_left).
df['final'] = df['length'] + (df['days_left'] * df['growth']
However, I would also like to update the number of fish that each individual represents (i.e., no_fish) on a daily basis using a size-specific equation. I'm fairly new to python so I initially thought to use a for-loop (I'm not sure if there is another, more efficient way). My code is as follows:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_to_forecast[indx]):
# (1) update individual length
df.lgth[indx] = df.lgth[indx] + df.linearGR[indx]
# (2) estimate daily size-specific mortality
if df.lgth[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.lgth[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.lgth[indx])
elif df.lgth[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.lgth[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
The above code now works correctly, but it is still far to inefficient to run for 40,000 individuals each for 200+ days.
I would really appreciate any advice on how to modify the following code to make it pythonic.
Thanks

Another option that was suggested to me is to use the pd.dataframe.apply function. This dramatically reduced the overall the run time and could be useful to someone else in the future.
### === RUN SIMULATION === ###
start_time = time.perf_counter() # keep track of run time -- START
#-------------------------------------------------------------------------#
def function_to_apply( df ):
df['z_instantMort'] = ''
for indx in range(int(df['days_left'])):
# (1) update individual length
df['length'] = df['length'] + df['growth']
# (2) estimate daily size-specific mortality
if df['length'] > 50.0:
df['z_instantMort'] = 0.01
else:
if df['length'] <= 50.0:
df['z_instantMort'] = 0.052857-((0.03/35)*df['length'])
elif df['length'] < 15.0:
df['z_instantMort'] = 0.728*np.exp(-0.1892*df['length'])
whole_fish = round(df['no_fish'], 0)
if whole_fish < 1.0:
df['no_fish'] = 0.0
elif whole_fish >= 1.0:
df['no_fish'] = df['no_fish']*np.exp(-(df['z_instantMort']))
return df
#-------------------------------------------------------------------------#
sim_results = df.apply(function_to_apply, axis=1)
total_elapsed_time = round(time.perf_counter() - start_time, 2) # END
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(sim_results)
### ====================== ###
output being...
Forecast iteration completed in 0.05 seconds
unique_id length no_fish days_left growth z_instantMort
0 2.0 126.3194 148.729190 253.0 0.3898 0.01
1 4.0 116.5742 93.018465 253.0 0.3414 0.01
2 5.0 129.0320 0.000000 254.0 0.4080 0.01
3 13.0 127.3784 132.864757 256.0 0.3839 0.01

As I said in my comment, a preferable alternative to for loops in this setting is using vector operations. For instance, running your code:
import pandas as pd
import time
import math
import numpy as np
data = {'unique_id': [2, 4, 5, 13],
'length': [27.7, 30.2, 25.4, 29.1],
'no_fish': [3195, 1894, 8, 2774],
'days_left': [253, 253, 254, 256],
'growth': [0.3898, 0.3414, 0.4080, 0.3839]
}
df = pd.DataFrame(data)
print(df)
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for indx in range(len(df)):
count = 1
while count <= int(df.days_left[indx]):
# (1) update individual length
df.length[indx] = df.length[indx] + df.growth[indx]
# (2) estimate daily size-specific mortality
if df.length[indx] > 50.0:
df.z[indx] = 0.01
else:
if df.length[indx] <= 50.0:
df.z[indx] = 0.052857-((0.03/35)*df.length[indx])
elif df.length[indx] < 15.0:
df.z[indx] = 0.728*math.exp(-0.1892*df.length[indx])
df['no_fish'].round(decimals = 0)
if df.no_fish[indx] < 1.0:
df.no_fish[indx] = 0.0
elif df.no_fish[indx] >= 1.0:
df.no_fish[indx] = df.no_fish[indx]*math.exp(-(df.z[indx]))
# (3) reduce no. of days left in forecast by 1
count = count + 1
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output:
unique_id length no_fish days_left growth
0 2 27.7 3195 253 0.3898
1 4 30.2 1894 253 0.3414
2 5 25.4 8 254 0.4080
3 13 29.1 2774 256 0.3839
Forecast iteration completed in 31.75 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.01
1 4 116.5742 93.018465 253 0.3414 0.01
2 5 129.0320 0.000000 254 0.4080 0.01
3 13 127.3784 132.864757 256 0.3839 0.01
Now with vector operations, you could do something like:
# keep track of run time - START
start_time = time.perf_counter()
df['z'] = 0.0
for day in range(1, df.days_left.max() + 1):
update = day <= df['days_left']
# (1) update individual length
df[update]['length'] = df[update]['length'] + df[update]['growth']
# (2) estimate daily size-specific mortality
df[update]['z'] = np.where( df[update]['length'] > 50.0, 0.01, 0.052857-( ( 0.03 / 35)*df[update]['length'] ) )
df[update]['z'] = np.where( df[update]['length'] < 15.0, 0.728 * np.exp(-0.1892*df[update]['length'] ), df[update]['z'] )
df[update]['no_fish'].round(decimals = 0)
df[update]['no_fish'] = np.where(df[update]['no_fish'] < 1.0, 0.0, df[update]['no_fish'] * np.exp(-(df[update]['z'])))
# keep track of run time - END
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("Forecast iteration completed in {} seconds".format(total_elapsed_time))
print(df)
with output
Forecast iteration completed in 1.32 seconds
unique_id length no_fish days_left growth z
0 2 126.3194 148.729190 253 0.3898 0.0
1 4 116.5742 93.018465 253 0.3414 0.0
2 5 129.0320 0.000000 254 0.4080 0.0
3 13 127.3784 132.864757 256 0.3839 0.0

creating a dataframe and based on 2 dataframe sets that have different lengths

I have 2 dataframe sets , I want to create a third one. I am trying to to write a code that to do the following :
if A_pd["from"] and A_pd["To"] is within the range of B_pd["from"]and B_pd["To"] then add to the C_pd dateframe A_pd["from"] and A_pd["To"] and B_pd["Value"].
if the A_pd["from"] is within the range of B_pd["from"]and B_pd["To"] and A_pd["To"] within the range of B_pd["from"]and B_pd["To"] of teh next row , then i want to split the range A_pd["from"] and A_pd["To"] to 2 ranges (A_pd["from"] and B_pd["To"]) and ( B_pd["To"] and A_pd["To"] ) and the corresponded B_pd["Value"].
I created the following code:
import pandas as pd
A_pd = {'from':[0,20,80,180,250],
'To':[20, 50,120,210,300]}
A_pd=pd.DataFrame(A_pd)
B_pd = {'from':[0,20,100,200],
'To':[20, 100,200,300],
'Value':[20, 17,15,12]}
B_pd=pd.DataFrame(B_pd)
for i in range(len(A_pd)):
numberOfIntrupt=0
for j in range(len(B_pd)):
if A_pd["from"].values[i] >= B_pd["from"].values[j] and A_pd["from"].values[i] > B_pd["To"].values[j]:
numberOfIntrupt+=1
cols = ['C_from', 'C_To', 'C_value']
C_dp=pd.DataFrame(columns=cols, index=range(len(A_pd)+numberOfIntrupt))
for i in range(len(A_pd)):
for j in range(len(B_pd)):
a=A_pd ["from"].values[i]
b=A_pd["To"].values[i]
c_eval=B_pd["Value"].values[j]
range_s=B_pd["from"].values[j]
range_f=B_pd["To"].values[j]
if a >= range_s and a <= range_f and b >= range_s and b <= range_f :
C_dp['C_from'].loc[i]=a
C_dp['C_To'].loc[i]=b
C_dp['C_value'].loc[i]=c_eval
elif a >= range_s and b > range_f:
C_dp['C_from'].loc[i]=a
C_dp['C_To'].loc[i]=range_f
C_dp['C_value'].loc[i]=c_eval
C_dp['C_from'].loc[i+1]=range_f
C_dp['C_To'].loc[i+1]=b
C_dp['C_value'].loc[i+1]=B_pd["Value"].values[j+1]
print(C_dp)
The current result is C_dp:
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 180 200 15
4 250 300 12
5 200 300 12
6 NaN NaN NaN
7 NaN NaN NaN
the expected should be :
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12
Thank you a lot for the support

I'm sure there is a better way to do this without loops, but this will help your logic flow.
import pandas as pd
A_pd = {'from':[0, 20, 80, 180, 250],
'To':[20, 50, 120, 210, 300]}
A_pd=pd.DataFrame(A_pd)
B_pd = {'from':[0, 20, 100, 200],
'To':[20, 100,200, 300],
'Value':[20, 17, 15, 12]}
B_pd=pd.DataFrame(B_pd)
cols = ['C_from', 'C_To', 'C_value']
C_dp=pd.DataFrame(columns=cols)
spillover = False
for i in range(len(A_pd)):
for j in range(len(B_pd)):
a_from = A_pd["from"].values[i]
a_to = A_pd["To"].values[i]
b_from = B_pd["from"].values[j]
b_to = B_pd["To"].values[j]
b_value = B_pd['Value'].values[j]
if (a_from >= b_to):
# a_from outside b range
continue # next b
elif (a_from >= b_from):
# a_from within b range
if a_to <= b_to:
C_dp = C_dp.append({"C_from": a_from, "C_To": a_to, "C_value": b_value}, ignore_index=True)
break # next a
else:
C_dp = C_dp.append({"C_from": a_from, "C_To": b_to, "C_value": b_value}, ignore_index=True)
if j < len(B_pd):
spillover = True
continue
if spillover:
if a_to <= b_to:
C_dp = C_dp.append({"C_from": b_from, "C_To": a_to, "C_value": b_value}, ignore_index=True)
spillover = False
break
else:
C_dp = C_dp.append({"C_from": b_from, "C_To": b_to, "C_value": b_value}, ignore_index=True)
spillover = True
continue
print(C_dp)
Output
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12

Gurobi Multiple Objective Function Hierarchical Degradation

I'm trying to implement a Gurobi model with multiple objective functions (specifically 2) that solves lexicographically (in a hierarchy) but I'm running into an issue where when optimizing the second objective function it degrades the solution to the first one, which should not happen with hierarchical optimizations. It is degrading the first solution up by 1, to decrease the second by 5, could this be an error in how I setup my model hierarchically? This is the code where I set up my model:
m = Model('lexMin Model')
m.ModelSense = GRB.MINIMIZE
variable = m.addVars(k.numVars, vtype=GRB.BINARY, name='variable')
m.setObjectiveN(LinExpr(quicksum([variable[j]*k.obj[0][j] for j in range(k.numVars)])),0)
m.setObjectiveN(LinExpr(quicksum([variable[j]*k.obj[1][j] for j in range(k.numVars)])),1)
for i in range(0,k.numConst):
m.addConstr(quicksum([k.const[i,j]*variable[j] for j in range(k.numVars)] <= k.constRHS[i]))
m.addConstr(quicksum([variable[j]*k.obj[0][j] for j in range(k.numVars)]) >= r2[0][0])
m.addConstr(quicksum([variable[j]*k.obj[0][j] for j in range(k.numVars)]) <= r2[1][0])
m.addConstr(quicksum([variable[j]*k.obj[1][j] for j in range(k.numVars)]) >= r2[1][1])
m.addConstr(quicksum([variable[j]*k.obj[1][j] for j in range(k.numVars)]) <= r2[0][1])
m.Params.ObjNumber = 0
m.ObjNPriority = 1
m.update()
m.optimize()
I've double checked and the priority of the second function is 0, the value for the objective functions are nowhere near where they'd be if I prioritized the wrong function. When optimizing the first function it finds the right value, even, but when it moves on to the second value it chooses values that degrade the first value.
The Gurobi output looks like this:
Optimize a model with 6 rows, 375 columns and 2250 nonzeros
Model fingerprint: 0xac5de9aa
Variable types: 0 continuous, 375 integer (375 binary)
Coefficient statistics:
Matrix range [1e+01, 1e+02]
Objective range [1e+01, 1e+02]
Bounds range [1e+00, 1e+00]
RHS range [1e+04, 1e+04]
---------------------------------------------------------------------------
Multi-objectives: starting optimization with 2 objectives ...
---------------------------------------------------------------------------
Multi-objectives: applying initial presolve ...
---------------------------------------------------------------------------
Presolve time: 0.00s
Presolved: 6 rows and 375 columns
---------------------------------------------------------------------------
Multi-objectives: optimize objective 1 () ...
---------------------------------------------------------------------------
Presolve time: 0.00s
Presolved: 6 rows, 375 columns, 2250 nonzeros
Variable types: 0 continuous, 375 integer (375 binary)
Root relaxation: objective -1.461947e+04, 10 iterations, 0.00 seconds
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
0 0 -14619.473 0 3 - -14619.473 - - 0s
H 0 0 -14569.00000 -14619.473 0.35% - 0s
H 0 0 -14603.00000 -14619.473 0.11% - 0s
H 0 0 -14608.00000 -14619.473 0.08% - 0s
H 0 0 -14611.00000 -14618.032 0.05% - 0s
0 0 -14617.995 0 5 -14611.000 -14617.995 0.05% - 0s
0 0 -14617.995 0 3 -14611.000 -14617.995 0.05% - 0s
H 0 0 -14613.00000 -14617.995 0.03% - 0s
0 0 -14617.995 0 5 -14613.000 -14617.995 0.03% - 0s
0 0 -14617.995 0 5 -14613.000 -14617.995 0.03% - 0s
0 0 -14617.995 0 7 -14613.000 -14617.995 0.03% - 0s
0 0 -14617.995 0 3 -14613.000 -14617.995 0.03% - 0s
0 0 -14617.995 0 4 -14613.000 -14617.995 0.03% - 0s
0 0 -14617.995 0 6 -14613.000 -14617.995 0.03% - 0s
0 0 -14617.995 0 6 -14613.000 -14617.995 0.03% - 0s
0 0 -14617.995 0 6 -14613.000 -14617.995 0.03% - 0s
0 0 -14617.720 0 7 -14613.000 -14617.720 0.03% - 0s
0 0 -14617.716 0 8 -14613.000 -14617.716 0.03% - 0s
0 0 -14617.697 0 8 -14613.000 -14617.697 0.03% - 0s
0 0 -14617.661 0 9 -14613.000 -14617.661 0.03% - 0s
0 2 -14617.661 0 9 -14613.000 -14617.661 0.03% - 0s
* 823 0 16 -14614.00000 -14616.351 0.02% 2.8 0s
Cutting planes:
Gomory: 6
Cover: 12
MIR: 4
StrongCG: 2
Inf proof: 6
Zero half: 1
Explored 1242 nodes (3924 simplex iterations) in 0.29 seconds
Thread count was 8 (of 8 available processors)
Solution count 6: -14614 -14613 -14611 ... -14569
No other solutions better than -14614
Optimal solution found (tolerance 1.00e-04)
Best objective -1.461400000000e+04, best bound -1.461400000000e+04, gap 0.0000%
---------------------------------------------------------------------------
Multi-objectives: optimize objective 2 () ...
---------------------------------------------------------------------------
Loaded user MIP start with objective -12798
Presolve removed 1 rows and 0 columns
Presolve time: 0.01s
Presolved: 6 rows, 375 columns, 2250 nonzeros
Variable types: 0 continuous, 375 integer (375 binary)
Root relaxation: objective -1.282967e+04, 28 iterations, 0.00 seconds
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
0 0 -12829.673 0 3 -12798.000 -12829.673 0.25% - 0s
0 0 -12829.378 0 4 -12798.000 -12829.378 0.25% - 0s
0 0 -12829.378 0 3 -12798.000 -12829.378 0.25% - 0s
0 0 -12828.688 0 4 -12798.000 -12828.688 0.24% - 0s
H 0 0 -12803.00000 -12828.688 0.20% - 0s
0 0 -12825.806 0 5 -12803.000 -12825.806 0.18% - 0s
0 0 -12825.193 0 5 -12803.000 -12825.193 0.17% - 0s
0 0 -12823.156 0 6 -12803.000 -12823.156 0.16% - 0s
0 0 -12822.694 0 7 -12803.000 -12822.694 0.15% - 0s
0 0 -12822.679 0 7 -12803.000 -12822.679 0.15% - 0s
0 2 -12822.679 0 7 -12803.000 -12822.679 0.15% - 0s
Cutting planes:
Cover: 16
MIR: 6
StrongCG: 3
Inf proof: 4
RLT: 1
Explored 725 nodes (1629 simplex iterations) in 0.47 seconds
Thread count was 8 (of 8 available processors)
Solution count 2: -12803 -12798
No other solutions better than -12803
Optimal solution found (tolerance 1.00e-04)
Best objective -1.280300000000e+04, best bound -1.280300000000e+04, gap 0.0000%
So it finds the values (-14613,-12803) instead of (-14614,-12798)

The default MIPGap is 1e-4. The first objective is degrading by less than that. (1/14614 =~ 0.7 e-4). If you lower the MIPGap, your issue should go away. In your code add
m.setObjective('MipGap', 1e-6)
before the optimize.
One way to reason about this behavior is that since you had a MIPGap of 1e-4, you would have accepted the a solution with value -14113, even if you didn't have a second objective.

Issue in Appending data using Pandas

Problem: - I want to build a logic that take data like Attendance data, In Time, Employee Id and return a data frame with employee id, in time, attendance date and basically in which slot the employee entered. (Suppose In time is 9:30:00 of date 14-10-2019 so that if employee came at 9:30 so for that date and for that column it insert a value one.)
Given Example below
I tried lots of time to build logic for this problem but failed to build.
I have a dataset that looks like this.
I want an output like this so that whatever the time (employee enter's) it only insert data to that time column only column only.:
This is my code but its only repeating last loop.
temp =[]
for date in nf['DaiGong']:
for en in nf['EnNo']:
for i in nf['DateTime']:
col=['EnNo','Date','InTime','9:30-10:30','10:30-11:00','11:00-11:30','11:30-12:30','12:30-13:00','13:00-13:30']
ndf=pd.DataFrame(columns=col)
if i < '10:30:00' and i > '09:30:00':
temp.append(1)
ndf['9:30-10:30'] = temp
ndf['InTime'] = i
ndf['Date'] = date
ndf['EnNo'] = en
elif i < '11:00:00' and i > '10:30:00':
temp.append(1)
ndf['10:30-11:00'] = temp
ndf['InTime'] = i
ndf['Date'] = date
ndf['EnNo'] = en
elif i < '11:30:00' and i > '11:00:00':
temp.append(1)
ndf['11:00-11:30'] = temp
ndf['InTime'] = i
ndf['Date'] = date
ndf['EnNo'] = en
elif i < '12:30:00' and i > '11:30:00':
temp.append(1)
ndf['11:30-12:30'] = temp
ndf['InTime'] = i
ndf['Date'] = date
ndf['EnNo'] = en
elif i < '13:00:00' and i > '12:30:00':
temp.append(1)
ndf['12:30-13:00'] = temp
ndf['InTime'] = i
ndf['Date'] = date
ndf['EnNo'] = en
elif i < '13:30:00' and i > '13:00:00':
temp.append(1)
ndf['13:00-13:30'] = temp
ndf['InTime'] = i
ndf['Date'] = date
ndf['EnNo'] = en
This is the output of my code.

IIUC,
df = pd.DataFrame({'EnNo':[2,2,2,2,2,3,3,3,3],
'DaiGong':['2019-10-12', '2019-10-13', '2019-10-14', '2019-10-15', '2019-10-16', '2019-10-12', '2019-10-13', '2019-10-14', '2019-10-15'],
'DateTime':['09:53:56', '10:53:56', '09:23:56', '11:53:56', '11:23:56', '10:33:56', '12:53:56', '12:23:56', '09:53:56']})
df
DaiGong DateTime EnNo
0 2019-10-12 09:53:56 2
1 2019-10-13 10:53:56 2
2 2019-10-14 09:23:56 2
3 2019-10-15 11:53:56 2
4 2019-10-16 11:23:56 2
5 2019-10-12 10:33:56 3
6 2019-10-13 12:53:56 3
7 2019-10-14 12:23:56 3
8 2019-10-15 09:53:56 3
import datetime
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.time #converting to datetime
def time_range(row): # I only wrote two conditions - add more
i = row['DateTime']
if i < datetime.time(10, 30, 0) and i > datetime.time(9, 30, 0):
return '9:30-10:30'
elif i < datetime.time(11, 0, 0) and i > datetime.time(10, 30, 0):
return '10:30-11:00'
else:
return 'greater than 11:00'
df['time range'] = df.apply(time_range, axis=1)
df1 = pd.concat([df[['EnNo', 'DaiGong', 'DateTime']], pd.get_dummies(df['time range'])], axis=1)
df1
EnNo DaiGong DateTime 10:30-11:00 9:30-10:30 greater than 11:00
0 2 2019-10-12 09:53:56 0 1 0
1 2 2019-10-13 10:53:56 1 0 0
2 2 2019-10-14 09:23:56 0 0 1
3 2 2019-10-15 11:53:56 0 0 1
4 2 2019-10-16 11:23:56 0 0 1
5 3 2019-10-12 10:33:56 1 0 0
6 3 2019-10-13 12:53:56 0 0 1
7 3 2019-10-14 12:23:56 0 0 1
8 3 2019-10-15 09:53:56 0 1 0
To get sum of count by employee,
df1.groupby(['EnNo'], as_index=False).sum()
Let me know if you have any questions

My test data:
df:
EnNo DaiGong DateTime
2 2019-10-12 09:53:56
2 2019-10-13 09:42:00
2 2019-10-14 12:00:01
1 2019-11-01 11:12:00
1 2019-11-02 10:13:45
Create helper datas:
tdr=pd.timedelta_range("09:00:00","12:30:00",freq="30T")
s=pd.Series( len(tdr)*["-"] )
s[0]=1
cls=[ t.rsplit(":",maxsplit=1)[0] for t in tdr.astype(str) ]
cols=[ t1+"-"+t2 for (t1,t2) in zip(cls,cls[1:]) ]
cols.append(cls[-1]+"-")
tdr:
TimedeltaIndex(['09:00:00', '09:30:00', '10:00:00', '10:30:00', '11:00:00', '11:30:00', '12:00:00', '12:30:00'], dtype='timedelta64[ns]', freq='30T')
cols:
['09:00-09:30', '09:30-10:00', '10:00-10:30', '10:30-11:00', '11:00-11:30', '11:30-12:00', '12:00-12:30', '12:30-']
s:
0 1
1 -
2 -
3 -
4 -
5 -
6 -
7 -
dtype: object
Use 'apply' and 'searchsorted' to get time slots:
df2= df.DateTime.apply(lambda t: \
s.shift(tdr.searchsorted(t)-1,fill_value="-"))
df2.columns=cols
df2:
09:00-09:30 09:30-10:00 10:00-10:30 10:30-11:00 11:00-11:30 11:30-12:00 12:00-12:30 12:30-
0 - 1 - - - - - -
1 - 1 - - - - - -
2 - - - - - - 1 -
3 - - - - 1 - - -
4 - - 1 - - - - -
Finally, concatenate the two data frames:
df_rslt= pd.concat([df,df2],axis=1)
df_rslt:
EnNo DaiGong DateTime 09:00-09:30 09:30-10:00 10:00-10:30 10:30-11:00 11:00-11:30 11:30-12:00 12:00-12:30 12:30-
0 2 2019-10-12 09:53:56 - 1 - - - - - -
1 2 2019-10-13 09:42:00 - 1 - - - - - -
2 2 2019-10-14 12:00:01 - - - - - - 1 -
3 1 2019-11-01 11:12:00 - - - - 1 - - -
4 1 2019-11-02 10:13:45 - - 1 - - - - -

Numpy Finding Matching number with Array

Any help is greatly appreciated!! I have been trying to solve this for the last few days....
I have two arrays:
import pandas as pd
OldDataSet = {
'id': [20,30,40,50,60,70]
,'OdoLength': [26.12,43.12,46.81,56.23,111.07,166.38]}
NewDataSet = {
'id': [3000,4000,5000,6000,7000,8000]
,'OdoLength': [25.03,42.12,45.74,46,110.05,165.41]}
df1= pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
OldDataSetArray = df1.as_matrix()
NewDataSetArray = df2.as_matrix()
The result that I am trying to get is:
Array 1 and Array 2 Match by closes difference, based on left over number from Array2
20 26.12 3000 25.03
30 43.12 4000 42.12
40 46.81 6000 46
50 56.23 7000 110.05
60 111.07 8000 165.41
70 166.38 0 0
Starting at Array 1, ID 20, find the nearest which in this case would be the first Number in Array 2 ID 3000 (26.12-25.03). so ID 20, gets matched to 3000.
Where it gets tricky is if one value in Array 2 is not the closest, then it is skipped. for example, ID 40 value 46.81 is compared to 45.74, 46 and the smallest value is .81 from 46 ID 6000. So ID 40--> ID 6000. ID 5000 in array 2 is now skipped for any future comparisons. So now when comparing array 1 ID 50, it is compared to the next available number in array 2, 110.05. array 1 ID 50 is matched to Array 2 ID 7000.
UPDATE
so here's the code that i have tried and it works. Yes, it is not the greatest, so if someone has another suggestion please let me know.
import pandas as pd
import operator
OldDataSet = {
'id': [20,30,40,50,60,70]
,'OdoLength': [26.12,43.12,46.81,56.23,111.07,166.38]}
NewDataSet = {
'id': [3000,4000,5000,6000,7000,8000]
,'OdoLength': [25.03,42.12,45.74,46,110.05,165.41]}
df1= pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
OldDataSetArray = df1.as_matrix()
NewDataSetArray = df2.as_matrix()
newPos = 1
CurrentNumber = 0
OldArrayLen = len(OldDataSetArray) -1
NewArrayLen = len(NewDataSetArray) -1
numberResults = []
for oldPos in range(len(OldDataSetArray)):
PreviousNumber = abs(OldDataSetArray[oldPos, 0]- NewDataSetArray[oldPos, 0])
while newPos <= len(NewDataSetArray) - 1:
CurrentNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[newPos, 0])
#if it is the last row for the inner array, then match the next available
#in Array 1 to that last record
if newPos == NewArrayLen and oldPos < newPos and oldPos +1 <= OldArrayLen:
numberResults.append([OldDataSetArray[oldPos +1, 1],NewDataSetArray[newPos, 1],OldDataSetArray[oldPos +1, 0],NewDataSetArray[newPos, 0]])
if PreviousNumber < CurrentNumber:
numberResults.append([OldDataSetArray[oldPos, 1], NewDataSetArray[newPos - 1, 1], OldDataSetArray[oldPos, 0], NewDataSetArray[newPos - 1, 0]])
newPos +=1
break
elif PreviousNumber > CurrentNumber:
PreviousNumber = CurrentNumber
newPos +=1
#sort by array one values
numberResults = sorted(numberResults, key=operator.itemgetter(0))
numberResultsDf = pd.DataFrame(numberResults)

You can use NumPy broadcasting to build a distance matrix:
a = numpy.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38,])
b = numpy.array([25.03, 42.12, 45.74, 46, 110.05, 165.41,])
numpy.abs(a[:, None] - b[None, :])
# array([[ 1.09, 16. , 19.62, 19.88, 83.93, 139.29],
# [ 18.09, 1. , 2.62, 2.88, 66.93, 122.29],
# [ 21.78, 4.69, 1.07, 0.81, 63.24, 118.6 ],
# [ 31.2 , 14.11, 10.49, 10.23, 53.82, 109.18],
# [ 86.04, 68.95, 65.33, 65.07, 1.02, 54.34],
# [ 141.35, 124.26, 120.64, 120.38, 56.33, 0.97]])
of that matrix you can then find the closest elements using argmin, either row- or columnwise (depending of if you want to search in a or b).
numpy.argmin(numpy.abs(a[:, None] - b[None, :]), axis=1)
# array([0, 1, 3, 3, 4, 5])

Compute all the differences, and use `np.argmin to lookup the closest.
a,b=np.random.rand(2,10)
all_differences=np.abs(np.subtract.outer(a,b))
ia=all_differences.argmin(axis=1)
for i in range(10):
print(i,a[i],ia[i], b[ia[i]])
0 0.231603891949 8 0.21177584152
1 0.27810475456 7 0.302647382888
2 0.582133214953 2 0.548920922033
3 0.892858042793 1 0.872622982632
4 0.67293347218 6 0.677971552011
5 0.985227546492 1 0.872622982632
6 0.82431697833 5 0.83765895237
7 0.426992114791 4 0.451084369838
8 0.181147161752 8 0.21177584152
9 0.631139744522 3 0.653554586691
EDIT
with dataframes and indexes:
va,vb=np.random.rand(2,10)
na,nb=np.random.randint(0,100,(2,10))
dfa=pd.DataFrame({'id':na,'odo':va})
dfb=pd.DataFrame({'id':nb,'odo':vb})
all_differences=np.abs(np.subtract.outer(dfa.odo,dfb.odo))
ia=all_differences.argmin(axis=1)
dfc=dfa.merge(dfb.loc[ia].reset_index(drop=True),\
left_index=True,right_index=True)
Input :
In [337]: dfa
Out[337]:
id odo
0 72 0.426457
1 12 0.315997
2 96 0.623164
3 9 0.821498
4 72 0.071237
5 5 0.730634
6 45 0.963051
7 14 0.603289
8 5 0.401737
9 63 0.976644
In [338]: dfb
Out[338]:
id odo
0 95 0.333215
1 7 0.023957
2 61 0.021944
3 57 0.660894
4 22 0.666716
5 6 0.234920
6 83 0.642148
7 64 0.509589
8 98 0.660273
9 19 0.658639
Output :
In [339]: dfc
Out[339]:
id_x odo_x id_y odo_y
0 72 0.426457 64 0.509589
1 12 0.315997 95 0.333215
2 96 0.623164 83 0.642148
3 9 0.821498 22 0.666716
4 72 0.071237 7 0.023957
5 5 0.730634 22 0.666716
6 45 0.963051 22 0.666716
7 14 0.603289 83 0.642148
8 5 0.401737 95 0.333215
9 63 0.976644 22 0.666716

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

pandas: rapidly calculating sum of column with certain values - python

Related

Attribute change with variable number of time steps

creating a dataframe and based on 2 dataframe sets that have different lengths

Gurobi Multiple Objective Function Hierarchical Degradation

Issue in Appending data using Pandas

Numpy Finding Matching number with Array

Categories

Resources