I have a dataframe with 4 sections
Section 1: Product details
Section 2: Six potential product values based on a range of simulations
Section 3: Upper and lower bound for the input parameter to the simulations
Section 4: Randomly generated values for the input parameters
Section 2 is generated by pricing the product at equal intervals between the upper and lower bound.
I need to take the values in Section 4 and figure out the corresponding product value. Here is a possible setup for this dataframe:
import pandas as pd

table2 = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
I will run through a couple of examples of this calculation to make it clear what my question is.
Product A - sim_2
The input here is 1.0, which equals the upper bound for this product, so the simulation value is simply the State_6 value: 60.
Product B - sim_2
The input here is 1.5. The LB-to-UB range is (1, 2), so the six states are {1, 1.2, 1.4, 1.6, 1.8, 2}. 1.5 is exactly halfway between state 3 (input 1.4, value 31) and state 4 (input 1.6, value 41), so the simulation value is 36.
Product C - sim_1
The input here is .61. The LB-to-UB range is (.5, .625), so the six states are {.5, .525, .55, .575, .6, .625}. .61 falls between states 5 and 6; specifically, its bucket is 5*(.61-.5)/(.625-.5) + 1 = 5.4 (the multiplier is 5 because that is the number of intervals; you can calculate it other ways and get the same result). That bucket is then used in a weighting of the values for state 5 and state 6: (62-52)*(5.4-5) + 52 = 56.
Product B - sim_1
The input here is 0, which is below the lower bound of 1, so we need to extrapolate the value. The same formula applies, we just use the values of states 1 and 2. The bucket is 5*(0-1)/(2-1) + 1 = -4. The two values used are 11 and 21, so the result is (21-11)*(-4-1) + 11 = -39.
I've simplified the problem here to make the solution easier to visualize; my final code needs to run on 500 state values and 10,000 simulations, and the dataframe will have about 200 rows.
Here are the formulas I've used for the interpolation, although I'm not committed to them specifically.
Bucket = N*(sim_value - LB)/(UB - LB) + 1
where N is the number of intervals.
nLower is then the state value directly below the bucket and nHigher the state value directly above it. If the bucket falls outside the 1..6 state range, force nLower and nHigher to be either the first two or the last two state values, which extrapolates linearly.
Final_value = (nHigher - nLower)*(Bucket - state_number_of_nLower) + nLower
To summarize, my question is how to generate the final results from the combination of input data provided. The most challenging part to me is how to make the connection from the bucket number to the nLower and nHigher values.
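To make that bucket-to-state connection concrete, here is a minimal scalar sketch of the formulas above (the function name interp_one and the clamping via np.clip are my own, not part of the question):

import numpy as np

def interp_one(sim, lb, ub, states, n_intervals=5):
    """Map one simulated input to a product value via the bucket formula.

    states is the sequence [State_1_Value, ..., State_6_Value]. Buckets
    outside [1, 6] clamp to the first/last pair of states, which makes
    the final line extrapolate linearly.
    """
    bucket = n_intervals * (sim - lb) / (ub - lb) + 1
    lo = int(np.clip(np.floor(bucket), 1, n_intervals))  # 1-based index of nLower
    hi = lo + 1                                          # 1-based index of nHigher
    n_lower, n_higher = states[lo - 1], states[hi - 1]
    return (n_higher - n_lower) * (bucket - lo) + n_lower

# Product B, sim_1 from the worked example above: expect -39
print(interp_one(0, 1, 2, [11, 21, 31, 41, 51, 61]))  # -39.0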
I was able to generate the result using the following code. I'm not sure of the memory implications on a large dataframe, so I'm still interested in better answers or improvements.
Edit: Ran this code on the full dataset, 141 rows, 500 intervals, 10,000 simulations, and it took slightly over 1.5 hours. So not quite as useless as I assumed, but there is probably a smarter way of doing this in a tiny fraction of that time.
for i in range(1, 3):
    # bucket position on the 1..6 state scale
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    # clamp the state indices so out-of-range buckets extrapolate from the edge pair
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
Output:
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2 \
0 40 50 60 1.000 0.00 1.0
1 41 51 61 2.000 0.00 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.00 9.0
Bucket1 lv hv nLower nHigher Final_value_1 Bucket2 Final_value_2
0 3.5 5 6 50 60 35.0 6.0 60.0
1 -4.0 3 4 31 41 -39.0 3.5 36.0
2 5.4 5 6 52 62 56.0 9.0 92.0
3 2.0 3 4 33 43 23.0 3.0 33.0
I posted a superior solution with no loops here:
Alternate method to avoid loop in pandas dataframe
df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
import numpy as np

buckets = df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0).div(df['Upper_Bound'] - df['Lower_Bound'], axis=0) * 5 + 1
low = buckets.astype(int).clip(lower=1, upper=5)         # state number of nLower
high = (buckets.astype(int) + 1).clip(lower=2, upper=6)  # state number of nHigher
# 'Product Type' is deliberately included at position 0 so that state number i
# lines up with column i of the filtered block
state_block = df.filter(regex="State|Type").values
low_value = pd.DataFrame(state_block[np.arange(len(df))[:, None], low.to_numpy()])
high_value = pd.DataFrame(state_block[np.arange(len(df))[:, None], high.to_numpy()])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
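For reference, a fully NumPy variant of the same idea (my own sketch, not from the linked post) sidesteps the object-dtype gather entirely by clipping the integer bucket with np.clip and collecting state values with np.take_along_axis:

import numpy as np
import pandas as pd

states = df.filter(regex="State").to_numpy(dtype=float)  # (n_products, 6)
sims = df.filter(regex="sim").to_numpy(dtype=float)      # (n_products, n_sims)
lb = df['Lower_Bound'].to_numpy()[:, None]
ub = df['Upper_Bound'].to_numpy()[:, None]

bucket = 5 * (sims - lb) / (ub - lb) + 1                 # (n_products, n_sims)
low = np.clip(np.floor(bucket).astype(int), 1, 5)        # 1-based index of nLower
n_lower = np.take_along_axis(states, low - 1, axis=1)
n_higher = np.take_along_axis(states, low, axis=1)
final = (n_higher - n_lower) * (bucket - low) + n_lower  # one column per sim

Since every step is a single whole-array operation, this should scale to the 141-row, 10,000-simulation case far better than the row-wise loop, though I have not benchmarked it on that data.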
Related
I have a customer table, and I am trying to filter the rows of each ParentCustomerID based on the points they have, selecting one row per ParentCustomerID according to the conditions below:
IF 0 points & negative points: select the row with the largest negative points (i.e. -30 outranks -20)
IF 0 points & positive points: select the row with the highest positive points
IF positive & negative points: select the row with the highest positive points
IF positive, 0, and negative points: select the row with the highest positive points
IF all rows have 0 points: select any row with 0 points
IF all negative: select the row with the largest negative points (i.e. -30 outranks -20)
1:M relationship between ParentCustomerID and ChildCustomerID
ParentCustomerID  ChildCustomerID  Points
101               1                  0.00
101               2                -20.00
101               3                -30.50
102               4                 20.86
102               5                  0.00
102               6                 50.00
103               7                 10.00
103               8                 50.00
103               9                -30.00
104               10               -30.00
104               11                 0.00
104               12                60.80
104               13                40.00
105               14                 0.00
105               15                 0.00
105               16                 0.00
106               17               -20.00
106               18               -30.80
106               19               -40.20
Output should be:
ParentCustomerID  ChildCustomerID  Points
101               3                -30.50
102               6                 50.00
103               8                 50.00
104               12                60.80
105               16                 0.00
106               19               -40.20
Note: for customer 105, any row can be chosen because they all have 0 points.
Note2: Points can be float and ChildCustomerID can be missing (np.nan)
I do not know how to group each ParentCustomerID, check the above conditions, and select a specific row for each ParentCustomerID.
Thank you in advance!
Code
import numpy as np

df['abs'] = df['Points'].abs()
df['pri'] = np.sign(df['Points']).replace(0, -2)
(
    df.sort_values(['pri', 'abs'])
      .drop_duplicates('ParentCustomerID', keep='last')
      .drop(['pri', 'abs'], axis=1)
      .sort_index()
)
How this works
Assign a temporary column named abs with the absolute values of Points
Assign a temporary column named pri (priority) holding the arithmetic sign (i.e., -1, 0, 1) of each value in Points. Important hack: replace 0 with -2 so that zero always has the lowest priority.
Sort the values by priority and absolute values
Drop the duplicates in sorted dataframe keeping the last row per ParentCustomerID
Result
ParentCustomerID ChildCustomerID Points
2 101 3 -30.5
5 102 6 50.0
7 103 8 50.0
11 104 12 60.8
15 105 16 0.0
18 106 19 -40.2
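To see why keep='last' lands on the right row, it can help to print the sorted intermediate frame for one group (a sketch reusing the helper columns above; tmp is my name and the repr shown in the comments is approximate):

tmp = df.assign(abs=df['Points'].abs(), pri=np.sign(df['Points']).replace(0, -2))
print(tmp[tmp['ParentCustomerID'] == 101].sort_values(['pri', 'abs']))
#    ParentCustomerID  ChildCustomerID  Points   abs  pri
# 0               101                1     0.0   0.0 -2.0   zeros sort first (lowest priority)
# 1               101                2   -20.0  20.0 -1.0
# 2               101                3   -30.5  30.5 -1.0   <- keep='last' picks this row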
import pandas as pd
import numpy as np
df = pd.DataFrame([
[101, 1, 0.0],
[101, 2, -20.0],
[101, 3, -30.50],
[102, 4, 20.86],
[102, 5, 0.0],
[102, 6, 50.0],
[103, 7, 10.0],
[103, 8, 50.0],
[103, 9, -30.0],
[104, 10, -30.0],
[104, 11, 0.0],
[104, 12, 60.80],
[104, 13, 40.0],
[105, 14, 0.0],
[105, 15, 0.0],
[105, 16, 0.0],
[106, 17, -20.0],
[106, 18, -30.80],
[106, 19, -40.20]
],columns=['ParentCustomerID', 'ChildCustomerID', 'Points'])
data = df.groupby('ParentCustomerID').agg({
    # store the position of the max Points if the group has any positive
    # value, otherwise the position of the min (i.e. the most negative)
    'Points': [lambda x: np.argmax(x) if (np.array(x) > 0).sum() else np.argmin(x), list],
    'ChildCustomerID': list
})
# use the stored position to pull the matching child id and points out of each group
pd.DataFrame(data.apply(lambda x: (x["ChildCustomerID", "list"][x["Points", "<lambda_0>"]],
                                   x["Points", "list"][x["Points", "<lambda_0>"]]), axis=1).tolist(),
             index=data.index).rename(columns={
    0: "ChildCustomerID",
    1: "Points"
}).reset_index()
I have the following series and I am trying to find the index of the peaks, which should be [1, 8.5], or the peak values, which should be [279, 139], using a threshold of 100. I have tried many ways, but the result always ignores the series index and returns [1, 16].
0.5 0
1.0 279
1.5 256
2.0 84
2.5 23
3.0 11
3.5 3
4.0 2
4.5 7
5.0 5
5.5 4
6.0 4
6.5 10
7.0 30
7.5 88
8.0 133
8.5 139
9.0 84
9.5 55
10.0 26
10.5 10
11.0 8
11.5 4
12.0 4
12.5 1
13.0 0
13.5 0
14.0 1
14.5 0
I tried this code
from scipy.signal import find_peaks
import matplotlib.pyplot as plt

thresh = 100
peak_idx, _ = find_peaks(out.value_counts(sort=False), height=thresh)
plt.plot(out.value_counts(sort=False).index[peak_idx], out.value_counts(sort=False)[peak_idx], 'r.')
out.value_counts(sort=False).plot.bar()
plt.show()
peak_idx
here is the output
array([ 1, 16], dtype=int64)
You are doing it right; the only thing you misunderstood is that find_peaks returns the positional indices of the peaks, not the peaks themselves.
Here is the documentation that clearly states that:
Returns
-------
peaks : ndarray
    Indices of peaks in x that satisfy all given conditions.
Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html
Try this code here:
from scipy.signal import find_peaks

thresh = 100
y = [0, 279, 256, 84, 23, 11, 3, 2, 7, 5, 4, 4, 10, 30, 88, 133, 139, 84, 55, 26, 10, 8, 4, 4, 1, 0, 0, 1, 0]
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5]
peak_idx, _ = find_peaks(y, height=thresh)   # search the counts, not the x positions
out_values = [x[peak] for peak in peak_idx]  # map positions back to the index -> [1.0, 8.5]
Here out_values will contain what you want.
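Alternatively, since out.value_counts(sort=False) is a pandas Series, you can map the positional indices straight back through its index without rebuilding the lists by hand (a sketch assuming out from the question):

from scipy.signal import find_peaks

counts = out.value_counts(sort=False)
peak_idx, _ = find_peaks(counts, height=100)
peak_positions = counts.index[peak_idx]  # the x locations: 1.0 and 8.5
peak_heights = counts.iloc[peak_idx]     # the peak values: 279 and 139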
I have a dataframe as follows,
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
Expected output,
value score
54 scaled value
74 scaled value
71 scaled value
78 50.000
12 600.00
I want to assign a score between 50 and 600 to each value, but the lowest value must get the highest score. Do you have an idea?
I'm not sure exactly what you want to achieve; it would help to provide the exact expected output for this input. But if I understand correctly, you could try:
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
vmin = df['value'].min()
vmax = df['value'].max()
step = 550 / (vmax - vmin)  # 550 = 600 - 50, the width of the target range
df['score'] = 600 - (df['value'] - vmin) * step
print(df)
This will output
value score
0 54 250.000000
1 74 83.333333
2 71 108.333333
3 78 50.000000
4 12 600.000000
This is my idea, but I think your scores follow a scale that is missing from your question.
import numpy as np

dfmin = df['value'].min()
dfmax = df['value'].max()
dfrange = dfmax - dfmin
score_value = (600 - 50) / dfrange
df.loc[:, 'score'] = np.where(df['value'] == dfmin, 600,
                              np.where(df['value'] == dfmax,
                                       50,
                                       600 - ((df['value'] - dfmin) * (1 / score_value))))
df
that produces:
value score
0 54 594.96
1 74 592.56
2 71 592.92
3 78 50.00
4 12 600.00
Not matching your output, because of the missing scale.
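For completeness, the inverted linear mapping from the first answer can also be written with np.interp, which clamps at the endpoints by construction (a sketch; same [50, 600] target range):

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
# np.interp needs increasing x, so negate the values to flip the mapping:
# the largest value maps to 50 and the smallest to 600.
df['score'] = np.interp(-df['value'],
                        [-df['value'].max(), -df['value'].min()],
                        [50, 600])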
Any help is greatly appreciated!! I have been trying to solve this for the last few days....
I have two arrays:
import pandas as pd
OldDataSet = {
'id': [20,30,40,50,60,70]
,'OdoLength': [26.12,43.12,46.81,56.23,111.07,166.38]}
NewDataSet = {
'id': [3000,4000,5000,6000,7000,8000]
,'OdoLength': [25.03,42.12,45.74,46,110.05,165.41]}
df1 = pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
OldDataSetArray = df1.to_numpy()  # as_matrix() is deprecated
NewDataSetArray = df2.to_numpy()
The result that I am trying to get is (Array 1 and Array 2 matched by closest difference, based on the leftover numbers from Array 2):
20 26.12 3000 25.03
30 43.12 4000 42.12
40 46.81 6000 46
50 56.23 7000 110.05
60 111.07 8000 165.41
70 166.38 0 0
Starting at Array 1, ID 20: find the nearest value, which in this case is the first number in Array 2, ID 3000 (26.12 - 25.03), so ID 20 gets matched to 3000.
Where it gets tricky is that once a value in Array 2 is beaten by its successor, it is skipped. For example, ID 40 (value 46.81) is compared to 45.74 and 46, and the smallest difference is 0.81, from 46 (ID 6000). So ID 40 → ID 6000, and ID 5000 in Array 2 is skipped for all future comparisons. When comparing Array 1 ID 50, it is therefore compared to the next available number in Array 2, 110.05, so Array 1 ID 50 is matched to Array 2 ID 7000.
UPDATE
Here's the code that I have tried, and it works. It's not the greatest, so if someone has another suggestion please let me know.
import pandas as pd
import operator

OldDataSet = {
    'id': [20, 30, 40, 50, 60, 70],
    'OdoLength': [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]}
NewDataSet = {
    'id': [3000, 4000, 5000, 6000, 7000, 8000],
    'OdoLength': [25.03, 42.12, 45.74, 46, 110.05, 165.41]}
# column 0 must be OdoLength and column 1 the id for the array indexing below
df1 = pd.DataFrame(OldDataSet, columns=['OdoLength', 'id'])
df2 = pd.DataFrame(NewDataSet, columns=['OdoLength', 'id'])
OldDataSetArray = df1.to_numpy()
NewDataSetArray = df2.to_numpy()
newPos = 1
CurrentNumber = 0
OldArrayLen = len(OldDataSetArray) - 1
NewArrayLen = len(NewDataSetArray) - 1
numberResults = []
for oldPos in range(len(OldDataSetArray)):
    PreviousNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[oldPos, 0])
    while newPos <= len(NewDataSetArray) - 1:
        CurrentNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[newPos, 0])
        # if it is the last row of the inner array, match the next available
        # entry in Array 1 to that last record
        if newPos == NewArrayLen and oldPos < newPos and oldPos + 1 <= OldArrayLen:
            numberResults.append([OldDataSetArray[oldPos + 1, 1], NewDataSetArray[newPos, 1],
                                  OldDataSetArray[oldPos + 1, 0], NewDataSetArray[newPos, 0]])
        if PreviousNumber < CurrentNumber:
            numberResults.append([OldDataSetArray[oldPos, 1], NewDataSetArray[newPos - 1, 1],
                                  OldDataSetArray[oldPos, 0], NewDataSetArray[newPos - 1, 0]])
            newPos += 1
            break
        elif PreviousNumber > CurrentNumber:
            PreviousNumber = CurrentNumber
            newPos += 1
# sort by Array 1 values
numberResults = sorted(numberResults, key=operator.itemgetter(0))
numberResultsDf = pd.DataFrame(numberResults)
You can use NumPy broadcasting to build a distance matrix:
import numpy

a = numpy.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38])
b = numpy.array([25.03, 42.12, 45.74, 46, 110.05, 165.41])
numpy.abs(a[:, None] - b[None, :])
# array([[ 1.09, 16. , 19.62, 19.88, 83.93, 139.29],
# [ 18.09, 1. , 2.62, 2.88, 66.93, 122.29],
# [ 21.78, 4.69, 1.07, 0.81, 63.24, 118.6 ],
# [ 31.2 , 14.11, 10.49, 10.23, 53.82, 109.18],
# [ 86.04, 68.95, 65.33, 65.07, 1.02, 54.34],
# [ 141.35, 124.26, 120.64, 120.38, 56.33, 0.97]])
From that matrix you can then find the closest elements using argmin, either row- or column-wise (depending on whether you want to search in a or b).
numpy.argmin(numpy.abs(a[:, None] - b[None, :]), axis=1)
# array([0, 1, 3, 3, 4, 5])
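Note that the full distance matrix costs O(len(a)·len(b)) memory. For large arrays, a KD-tree query from scipy gives the same nearest indices without materializing the matrix (a sketch using the a and b above):

from scipy.spatial import cKDTree

dist, idx = cKDTree(b[:, None]).query(a[:, None])
# idx -> array([0, 1, 3, 3, 4, 5]), matching the argmin result above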
Compute all the differences, and use np.argmin to look up the closest.
import numpy as np

a, b = np.random.rand(2, 10)
all_differences = np.abs(np.subtract.outer(a, b))
ia = all_differences.argmin(axis=1)
for i in range(10):
    print(i, a[i], ia[i], b[ia[i]])
0 0.231603891949 8 0.21177584152
1 0.27810475456 7 0.302647382888
2 0.582133214953 2 0.548920922033
3 0.892858042793 1 0.872622982632
4 0.67293347218 6 0.677971552011
5 0.985227546492 1 0.872622982632
6 0.82431697833 5 0.83765895237
7 0.426992114791 4 0.451084369838
8 0.181147161752 8 0.21177584152
9 0.631139744522 3 0.653554586691
EDIT
with dataframes and indexes:
import pandas as pd

va, vb = np.random.rand(2, 10)
na, nb = np.random.randint(0, 100, (2, 10))
dfa = pd.DataFrame({'id': na, 'odo': va})
dfb = pd.DataFrame({'id': nb, 'odo': vb})
all_differences = np.abs(np.subtract.outer(dfa.odo, dfb.odo))
ia = all_differences.argmin(axis=1)
dfc = dfa.merge(dfb.loc[ia].reset_index(drop=True),
                left_index=True, right_index=True)
Input :
In [337]: dfa
Out[337]:
id odo
0 72 0.426457
1 12 0.315997
2 96 0.623164
3 9 0.821498
4 72 0.071237
5 5 0.730634
6 45 0.963051
7 14 0.603289
8 5 0.401737
9 63 0.976644
In [338]: dfb
Out[338]:
id odo
0 95 0.333215
1 7 0.023957
2 61 0.021944
3 57 0.660894
4 22 0.666716
5 6 0.234920
6 83 0.642148
7 64 0.509589
8 98 0.660273
9 19 0.658639
Output :
In [339]: dfc
Out[339]:
id_x odo_x id_y odo_y
0 72 0.426457 64 0.509589
1 12 0.315997 95 0.333215
2 96 0.623164 83 0.642148
3 9 0.821498 22 0.666716
4 72 0.071237 7 0.023957
5 5 0.730634 22 0.666716
6 45 0.963051 22 0.666716
7 14 0.603289 83 0.642148
8 5 0.401737 95 0.333215
9 63 0.976644 22 0.666716
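Worth noting: neither argmin-based answer enforces the question's rule that a matched Array 2 entry (e.g. ID 5000) is skipped for later comparisons. A minimal greedy sketch of that rule, scanning both sorted lists once (function and variable names are my own):

def greedy_match(a, b):
    """Match each a[i] to the next unused b[j], advancing j while the
    following b value is a strictly closer match to a[i]."""
    pairs, j = [], 0
    for i in range(len(a)):
        while j + 1 < len(b) and abs(b[j + 1] - a[i]) < abs(b[j] - a[i]):
            j += 1
        if j < len(b):
            pairs.append((i, j))
            j += 1
        else:
            pairs.append((i, None))  # b exhausted, like the trailing "70 0 0" row
    return pairs

a = [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]
b = [25.03, 42.12, 45.74, 46.00, 110.05, 165.41]
print(greedy_match(a, b))
# [(0, 0), (1, 1), (2, 3), (3, 4), (4, 5), (5, None)]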
I have a large dataset of the form [t, y(t)] to which I want to apply an IIR low-pass filter (a first- or second-order Butterworth should suffice) using scipy.signal (in particular scipy.signal.butter and scipy.signal.filtfilt). The problem is that t is not regularly spaced, which appears to be a requirement for the functions in scipy.signal.
For any "missing" points, I know that my signal remains unchanged from its previous value (so given two consecutive points t1 and t2 in my t-data and a point T not in the data with t1 < T < t2, the "real" function Y(t) which I'm sampling satisfies Y(T) = Y(t1)). t is integer-valued, so I could simply add the missing points, but this would grow my dataset by a factor of ~10, which is problematic given that it's already very large.
So the question is, is there a (sufficiently simple and low-overhead) way to filter my dataset without inserting all "missing" points?
You can efficiently "wrap" your data into a function.
If your data is in the form of a list of lists then you'll need to convert it into a dict and create a sorted list of your t values. Then you can interpolate the missing values using the list bisection algorithm in the bisect module.
Here's some demo code written in Python 2, but it should be straightforward to convert it to Python 3, if required.
from random import seed, sample
from bisect import bisect

# Create some fake data
seed(37)
data = dict((u, u/10.) for u in sample(xrange(50), 25))
keys = data.keys()
keys.sort()
print keys

def interp(t):
    i = bisect(keys, t)
    k = keys[max(0, i-1)]
    return data[k]

for i in xrange(50):
    print i, interp(i)
output
[2, 4, 8, 10, 14, 15, 19, 21, 22, 23, 26, 27, 29, 30,
32, 33, 34, 35, 37, 38, 39, 42, 43, 44, 48]
0 0.2
1 0.2
2 0.2
3 0.2
4 0.4
5 0.4
6 0.4
7 0.4
8 0.8
9 0.8
10 1.0
11 1.0
12 1.0
13 1.0
14 1.4
15 1.5
16 1.5
17 1.5
18 1.5
19 1.9
20 1.9
21 2.1
22 2.2
23 2.3
24 2.3
25 2.3
26 2.6
27 2.7
28 2.7
29 2.9
30 3.0
31 3.0
32 3.2
33 3.3
34 3.4
35 3.5
36 3.5
37 3.7
38 3.8
39 3.9
40 3.9
41 3.9
42 4.2
43 4.3
44 4.4
45 4.4
46 4.4
47 4.4
48 4.8
49 4.8
(I manually wrapped the output of keys to make it easier to read without horizontal scrolling).
You'll get a tiny speedup by rewriting the body of the interpolation function in one line:
def interp(t):
    return data[keys[max(0, bisect(keys, t)-1)]]
It's much less readable, IMHO, but the speed difference may be worth it if the function gets called a lot.
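With interp in hand, the original filtering question can then be finished by sampling Y onto a regular grid and applying the Butterworth filter; a sketch assuming a unit-spaced integer t as described in the question (the filter order and cutoff are placeholders):

import numpy as np
from scipy.signal import butter, filtfilt

t_regular = np.arange(keys[0], keys[-1] + 1)  # unit-spaced grid over the data
y_regular = np.array([interp(t) for t in t_regular])

b, a = butter(2, 0.05)  # 2nd-order low-pass; cutoff as a fraction of Nyquist
y_filtered = filtfilt(b, a, y_regular)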
The answer by PM 2Ring works, but assuming that your data are already ordered by t, it is less efficient than possible. It takes log-linear time and linear additional space. You can write a generator that produces a transformed dataset with regular sampling intervals in linear time and constant additional space:
# Assumes that dataset rows are lists as described in the question:
# [[t1, Y(t1)], [t2, Y(t2)], [t3, Y(t3)], ..., [tz, Y(tz)]]
# If this assumption is wrong, just extract t and Y(t) in another way.
# The generated range starts at t1 and ends directly after tz.
# Warning: will overgenerate points if the data are more densely sampled
# than the requested sampling interval.
def step_interpolate(dataset, interval):
    left = next(dataset)   # [t1, Y(t1)]
    right = next(dataset)  # [t2, Y(t2)]
    t_regular = left[0]
    while True:
        if left is right:  # same list object
            right = next(dataset)  # iteration stops when dataset stops
        if right[0] <= t_regular:
            left = right
        yield [t_regular, left[1]]
        t_regular += interval
Testing:
data = [[1, 10], [15, 2], [50, 100], [55, 17]]
for item in step_interpolate(iter(data), 10):
    print item[0], item[1]
Output:
1 10
11 10
21 2
31 2
41 2
51 100
61 17
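One Python 3 caveat worth adding (mine, not the answer's): since PEP 479 (enforced from Python 3.7), a StopIteration escaping a generator body becomes a RuntimeError, so the bare next(dataset) call inside the loop needs a guard:

def step_interpolate(dataset, interval):
    left = next(dataset)
    right = next(dataset)
    t_regular = left[0]
    while True:
        if left is right:
            try:
                right = next(dataset)
            except StopIteration:
                return  # end the generator cleanly under PEP 479
        if right[0] <= t_regular:
            left = right
        yield [t_regular, left[1]]
        t_regular += interval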