Applying cumulative correction factor across dataframe - python

I'm fairly new to Pandas so please forgive me if the answer to my question is rather obvious. I've got a dataset like this
Data Correction
0 100 NaN
1 104 NaN
2 108 NaN
3 112 NaN
4 116 NaN
5 120 0.5
6 124 NaN
7 128 NaN
8 132 NaN
9 136 0.4
10 140 NaN
11 144 NaN
12 148 NaN
13 152 0.3
14 156 NaN
15 160 NaN
What I want is to calculate a correction factor for the data which accumulates upwards.
By that I mean that elements from 13 and below should have the factor 0.3 applied, with 9 and below applying 0.3*0.4 and 5 and below 0.3*0.4*0.5.
So the final correction column should look like this
Data Correction Factor
0 100 NaN 0.06
1 104 NaN 0.06
2 108 NaN 0.06
3 112 NaN 0.06
4 116 NaN 0.06
5 120 0.5 0.06
6 124 NaN 0.12
7 128 NaN 0.12
8 132 NaN 0.12
9 136 0.4 0.12
10 140 NaN 0.3
11 144 NaN 0.3
12 148 NaN 0.3
13 152 0.3 0.3
14 156 NaN 1
15 160 NaN 1
How can I do this?

I think you are looking for cumprod() after reversing the Correction column:
df = df.assign(Factor=df.Correction[::-1].cumprod().ffill().fillna(1))
Data Correction Factor
0 100 NaN 0.06
1 104 NaN 0.06
2 108 NaN 0.06
3 112 NaN 0.06
4 116 NaN 0.06
5 120 0.5 0.06
6 124 NaN 0.12
7 128 NaN 0.12
8 132 NaN 0.12
9 136 0.4 0.12
10 140 NaN 0.30
11 144 NaN 0.30
12 148 NaN 0.30
13 152 0.3 0.30
14 156 NaN 1.00
15 160 NaN 1.00
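For completeness, the one-liner can be checked with a small self-contained script (the frame is rebuilt from the question; plain column assignment aligns on the index just like assign does, so there is no need to reverse back):

```python
import numpy as np
import pandas as pd

# rebuild the frame from the question
df = pd.DataFrame({
    "Data": range(100, 164, 4),
    "Correction": [np.nan] * 5 + [0.5] + [np.nan] * 3 + [0.4]
                  + [np.nan] * 3 + [0.3] + [np.nan] * 2,
})

# reverse, take the cumulative product, fill the gaps downwards,
# and default to 1 where no correction applies; the assignment
# aligns the reversed result back onto df's index
df["Factor"] = df["Correction"][::-1].cumprod().ffill().fillna(1)
```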

I can't think of a built-in pandas function that does this; however, you can use a for loop to multiply an array by the values and then attach it as a column.
import numpy as np
import pandas as pd
lst = [np.nan,np.nan,np.nan,np.nan,np.nan,0.5,np.nan,np.nan,np.nan,np.nan,0.4,np.nan,np.nan,np.nan,0.3,np.nan,np.nan]
lst1 = [i + 100 for i in range(len(lst))]
newcol= [1.0 for i in range(len(lst))]
newcol = np.asarray(newcol)
df = pd.DataFrame({'Data' : lst1,'Correction' : lst})
for i in range(len(df['Correction'])):
    if not np.isnan(df.Correction[i]):
        newcol[0:i+1] = newcol[0:i+1] * df.Correction[i]
df['Factor'] = newcol
print(df)
This code prints
Data Correction Factor
0 100 NaN 0.06
1 101 NaN 0.06
2 102 NaN 0.06
3 103 NaN 0.06
4 104 NaN 0.06
5 105 0.5 0.06
6 106 NaN 0.12
7 107 NaN 0.12
8 108 NaN 0.12
9 109 NaN 0.12
10 110 0.4 0.12
11 111 NaN 0.30
12 112 NaN 0.30
13 113 NaN 0.30
14 114 0.3 0.30
15 115 NaN 1.00
16 116 NaN 1.00

Related

Creating row based on differences between dictionary and current Groupby

Ode Proceeds Pos Amount Positions Target Weighting Additions
0 676 30160 FPE 51741.25000 5 0.10 0.187636 NaN
1 676 30160 HFA 57299.63616 5 0.20 0.207794 NaN
2 676 30160 PFL 60437.40563 5 0.20 0.219173 NaN
3 676 30160 PSO 53053.57410 5 0.15 0.192396 NaN
4 676 30160 RNS 53220.36636 5 0.20 0.193001 NaN
5 953 34960 PFL 8506.19390 1 0.20 1.000000 NaN
6 637 14750 PFL 8341.21701 3 0.20 0.302517 NaN
7 637 14750 PSO 12669.65078 3 0.15 0.459499 NaN
8 637 14750 RNS 6561.85824 3 0.20 0.237984 NaN
9 673 12610 FPE 31220.47500 5 0.10 0.175041 NaN
10 673 12610 HFA 34020.29280 5 0.20 0.190738 NaN
11 673 12610 PFL 37754.00236 5 0.20 0.211672 NaN
12 673 12610 PSO 31492.56779 5 0.15 0.176566 NaN
13 673 12610 RNS 43873.58472 5 0.20 0.245982 NaN
14 318 93790 PFL 59859.39180 2 0.20 0.285266 NaN
15 318 93790 PSO 149977.71090 2 0.15 0.714734 NaN
16 222 75250 FPE 21000.00000 6 0.10 0.100000 7525.0
17 222 75250 HFA 42000.00000 6 0.20 0.200000 15050.0
18 222 75250 PFL 42000.00000 6 0.20 0.200000 15050.0
19 222 75250 PSO 31500.00000 6 0.15 0.150000 11287.5
20 222 75250 RNS 42000.00000 6 0.20 0.200000 15050.0
21 222 75250 CRD 31500.00000 6 0.15 0.150000 11287.5
The information below is the desired output - simply a cut-out of the first 5 rows from the information above, showing the new column ['Target Amount'] as well as the creation of the last row. When you compare Ode 676, it has 5 out of the 6 Pos that are in the dictionary below. Since Ode 676 is missing CRD, I need a way to create a row and fill in the information.
target_dict = {"PFL":.20,"RNS":.20,"HFA":.20,"PSO":.15,"CRD":.15,"FPE":.10}
Ode Proceeds Pos Amount Positions Target Weighting Target Amt Additions
0 676 30160 FPE 51741.25000 5 0.10 0.187636 30591.22 -21150.03
1 676 30160 HFA 57299.63616 5 0.20 0.207794 61182.45 3882.81
2 676 30160 PFL 60437.40563 5 0.20 0.219173 61182.45 745.04
3 676 30160 PSO 53053.57410 5 0.15 0.192396 45886.83 -7166.74
4 676 30160 RNS 53220.36636 5 0.20 0.193001 61182.45 7962.08
5 676 30160 CRD 0 0.15 0 45886.83 45886.83
CRD would be added to make the full 6 Positions, then the ['Target Amt'] would be calculated based on the sum of all ['Amount'] plus the ['Proceeds'] to get a total for Ode 676. I can figure out the calculations, but I can't figure out how to add the row for an Ode where ['Positions'] < 6, based on the differences between 'target_dict' and the current ['Pos'] for Ode 676.
You can use reindex with pd.MultiIndex.from_product, which will create every combination between the unique values of 'Ode' and the keys of target_dict, such as:
df_all = (df.set_index(['Ode', 'Pos'])  # first set the index, to reindex afterwards
            .reindex(pd.MultiIndex.from_product([df.Ode.unique(), target_dict.keys()],
                                                names=['Ode', 'Pos']))
            .reset_index())  # index back as columns
print (df_all) #note I took rows for Ode = 676 and 953 only
Ode Pos Proceeds Amount Positions Target Weighting
0 676 PFL 30160.0 60437.40563 5.0 0.20 0.219173
1 676 RNS 30160.0 53220.36636 5.0 0.20 0.193001
2 676 HFA 30160.0 57299.63616 5.0 0.20 0.207794
3 676 PSO 30160.0 53053.57410 5.0 0.15 0.192396
4 676 CRD NaN NaN NaN NaN NaN
5 676 FPE 30160.0 51741.25000 5.0 0.10 0.187636
6 953 PFL 34960.0 8506.19390 1.0 0.20 1.000000
7 953 RNS NaN NaN NaN NaN NaN
8 953 HFA NaN NaN NaN NaN NaN
9 953 PSO NaN NaN NaN NaN NaN
10 953 CRD NaN NaN NaN NaN NaN
11 953 FPE NaN NaN NaN NaN NaN
Now, to complete the data as you want it, you can use fillna, map, and groupby.transform:
# fillna some columns with 0
df_all.Amount = df_all.Amount.fillna(0)
df_all.Weighting = df_all.Weighting.fillna(0)
# map the dictionary to get the values in target column
df_all.Target = df_all.Pos.map(target_dict)
# create the groupby Ode
gr = df_all.groupby('Ode')
# fill Proceeds and Positions with the first not nan value in the group
df_all.Proceeds = gr.Proceeds.transform('first')
df_all.Positions = gr.Positions.transform('first')
# create the columns Target_Amt and Additions according to your equation
df_all['Target_Amt'] = (gr.Amount.transform('sum') + df_all.Proceeds)*df_all.Target
df_all['Additions'] = df_all.Target_Amt - df_all.Amount
and you get:
print (df_all)
Ode Pos Proceeds Amount Positions Target Weighting \
0 676 PFL 30160.0 60437.40563 5.0 0.20 0.219173
1 676 RNS 30160.0 53220.36636 5.0 0.20 0.193001
2 676 HFA 30160.0 57299.63616 5.0 0.20 0.207794
3 676 PSO 30160.0 53053.57410 5.0 0.15 0.192396
4 676 CRD 30160.0 0.00000 5.0 0.15 0.000000
5 676 FPE 30160.0 51741.25000 5.0 0.10 0.187636
6 953 PFL 34960.0 8506.19390 1.0 0.20 1.000000
7 953 RNS 34960.0 0.00000 1.0 0.20 0.000000
8 953 HFA 34960.0 0.00000 1.0 0.20 0.000000
9 953 PSO 34960.0 0.00000 1.0 0.15 0.000000
10 953 CRD 34960.0 0.00000 1.0 0.15 0.000000
11 953 FPE 34960.0 0.00000 1.0 0.10 0.000000
Target_Amt Additions
0 61182.446450 745.040820
1 61182.446450 7962.080090
2 61182.446450 3882.810290
3 45886.834837 -7166.739262
4 45886.834837 45886.834837
5 30591.223225 -21150.026775
6 8693.238780 187.044880
7 8693.238780 8693.238780
8 8693.238780 8693.238780
9 6519.929085 6519.929085
10 6519.929085 6519.929085
11 4346.619390 4346.619390
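Putting the steps together, here is a condensed, runnable sketch of the same approach. It uses a trimmed two-Ode frame with only the columns the calculation needs, so the totals differ from the full example above:

```python
import pandas as pd

target_dict = {"PFL": .20, "RNS": .20, "HFA": .20, "PSO": .15, "CRD": .15, "FPE": .10}

# trimmed frame: two Odes, only the columns the calculation needs
df = pd.DataFrame({
    "Ode":      [676, 676, 953],
    "Proceeds": [30160, 30160, 34960],
    "Pos":      ["PFL", "RNS", "PFL"],
    "Amount":   [60437.40563, 53220.36636, 8506.19390],
})

# one row per (Ode, Pos) combination; missing combinations appear as NaN
df_all = (df.set_index(["Ode", "Pos"])
            .reindex(pd.MultiIndex.from_product([df.Ode.unique(), list(target_dict)],
                                                names=["Ode", "Pos"]))
            .reset_index())

df_all["Amount"] = df_all["Amount"].fillna(0)
df_all["Target"] = df_all["Pos"].map(target_dict)       # target weights from the dict
gr = df_all.groupby("Ode")
df_all["Proceeds"] = gr["Proceeds"].transform("first")  # spread Proceeds to the new rows
df_all["Target_Amt"] = (gr["Amount"].transform("sum") + df_all["Proceeds"]) * df_all["Target"]
df_all["Additions"] = df_all["Target_Amt"] - df_all["Amount"]
```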

Python Pandas-retrieving values in one column while they are less than the value of a second column

Suppose I have a df that looks like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 31.0 0.90
2 30 0.03 41.0 0.70
3 40 0.72 51.0 0.08
4 50 0.09 81.0 0.78
5 60 0.09 NaN NaN
6 70 0.01 NaN NaN
7 80 0.09 NaN NaN
8 90 0.08 NaN NaN
9 100 0.02 NaN NaN
In the posR column, we see that it jumps from 11 to 31, and there is not a value in the "20's". I want to insert a value to fill that space, which would essentially just be the posF value, and NA, so my resulting df would look like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 20 NaN
2 30 0.03 31.0 0.90
3 40 0.72 41.0 0.70
4 50 0.09 50 NaN
5 60 0.09 60 NaN
6 70 0.01 70 NaN
7 80 0.09 80 NaN
8 90 0.08 81.0 0.78
9 100 0.02 100 NaN
So I want to fill the NaN values in the position with the values from posF that are in between the values in posR.
What I have tried to do is just make a dummy list and add values to the list based on if they were less than a (I see the flaw here but I don't know how to fix it).
insert_rows = []
for x in df['posF']:
    for a, b in zip(df['posR'], df['rfreq']):
        if x < a:
            insert_rows.append([x, 'NA'])
print(len(insert_rows))  # 21, should be 5
I realize that it is appending x several times until it reaches the condition of being >a.
After this I will just create a new df and add these values to the original 2 columns so they are the same length.
If you can think of a better title, feel free to edit.
My first thought was to retrieve the new indices for the entries in posR by interpolating with posF and then put the values at their new positions - but since you want 81 one row later than that would give, I'm afraid this is not exactly what you're searching for, and I still don't really get the logic behind your task.
However, perhaps this is a starting point, let's see...
This approach would work like the following:
Retrieve the new index positions of the values in posR according to their order in posF:
import numpy as np
idx = np.interp(df.posR, df.posF, df.index).round()
Get rid of nan entries and cast to int:
idx = idx[np.isfinite(idx)].astype(int)
Create a new column by copying posF in the first step, and set newrfreq to nan respectively:
df['newposR'] = df.posF
df['newrfreq'] = np.nan
Then overwrite with the values from posR and rfreq, but now at the updated positions:
df.loc[idx, 'newposR'] = df.posR[:len(idx)].values
df.loc[idx, 'newrfreq'] = df.rfreq[:len(idx)].values
Result:
posF ffreq posR rfreq newposR newrfreq
0 10 0.50 11.0 0.08 11.0 0.08
1 20 0.20 31.0 0.90 20.0 NaN
2 30 0.03 41.0 0.70 31.0 0.90
3 40 0.72 51.0 0.08 41.0 0.70
4 50 0.09 81.0 0.78 51.0 0.08
5 60 0.09 NaN NaN 60.0 NaN
6 70 0.01 NaN NaN 70.0 NaN
7 80 0.09 NaN NaN 81.0 0.78
8 90 0.08 NaN NaN 90.0 NaN
9 100 0.02 NaN NaN 100.0 NaN
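The four steps above can be combined into one self-contained script (sample frame rebuilt from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "posF":  [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "ffreq": [0.50, 0.20, 0.03, 0.72, 0.09, 0.09, 0.01, 0.09, 0.08, 0.02],
    "posR":  [11.0, 31.0, 41.0, 51.0, 81.0] + [np.nan] * 5,
    "rfreq": [0.08, 0.90, 0.70, 0.08, 0.78] + [np.nan] * 5,
})

# 1) interpolate the posR values onto the index spanned by posF
idx = np.interp(df.posR, df.posF, df.index).round()
# 2) drop NaN entries and cast to int so it can be used with .loc
idx = idx[np.isfinite(idx)].astype(int)
# 3) initialise the new columns
df["newposR"] = df.posF.astype(float)
df["newrfreq"] = np.nan
# 4) overwrite at the interpolated positions
df.loc[idx, "newposR"] = df.posR[:len(idx)].values
df.loc[idx, "newrfreq"] = df.rfreq[:len(idx)].values
```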

Plotting a heatmap for trajectory data from a pandas dataframe

I have a dataframe in pandas containing information that I would like to display as a heatmap of sorts. The dataframe holds the x and y co-ordinates of several objects at varying points in time and includes other information in extra columns (e.g. mass).
time object x y mass
3 1.0 216 12 12
4 1.0 218 13 12
5 1.0 217 12 12
6 1.0 234 13 13
1 2.0 361 289 23
2 2.0 362 287 22
3 2.0 362 286 22
5 3.0 124 56 18
6 3.0 126 52 17
I would like to create a heatmap with the x and y values corresponding to the x and y axes of the heatmap. The greater the number of objects at a particular x/y location, the more intense I would like the color to be. Any ideas on how you would accomplish this?
One idea is to use a seaborn heatmap. First I would pivot your dataframe over your columns of interest, in this case x, y and, say, mass:
In [4]: df
Out[4]:
time object x y mass
0 3 1.0 216 12 12
1 4 1.0 218 13 12
2 5 1.0 217 12 12
3 6 1.0 234 13 13
4 1 2.0 361 289 23
5 2 2.0 362 287 22
6 3 2.0 362 286 22
7 5 3.0 124 56 18
8 6 3.0 126 52 17
In [5]: d = df.pivot(index='x', columns='y', values='mass')
In [6]: d
Out[6]:
y 12 13 52 56 286 287 289
x
124 NaN NaN NaN 18.0 NaN NaN NaN
126 NaN NaN 17.0 NaN NaN NaN NaN
216 12.0 NaN NaN NaN NaN NaN NaN
217 12.0 NaN NaN NaN NaN NaN NaN
218 NaN 12.0 NaN NaN NaN NaN NaN
234 NaN 13.0 NaN NaN NaN NaN NaN
361 NaN NaN NaN NaN NaN NaN 23.0
362 NaN NaN NaN NaN 22.0 22.0 NaN
Then you can apply a simple heatmap with:
ax = sns.heatmap(d)
As a result you get the image below. In case you need a more complex attribute instead of the single mass, you can add a new column to the original dataframe. Finally, there are plenty of samples around on how to define colormaps, styles, etc.
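Since the question asks for intensity by the number of objects rather than by mass, one variation (a sketch, not part of the original answer; seaborn only needed for the final plot) is to count occurrences per (x, y) cell with pd.crosstab and feed that to the heatmap instead:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [216, 218, 217, 234, 361, 362, 362, 124, 126],
    "y": [12, 13, 12, 13, 289, 287, 286, 56, 52],
})

# number of observations per (x, y) cell; zero where nothing was seen
counts = pd.crosstab(df["x"], df["y"])

# then, e.g.:  import seaborn as sns; ax = sns.heatmap(counts)
```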

sorting a column with missing values

There are 6 columns of data; the 4th column has the same values as the first one, but some values are missing. I would like to know how to sort the 4th column so that equal values fall on the same row, using Python.
Sample data
255 12 0.1 255 12 0.1
256 13 0.1 259 15 0.15
259 15 0.15 272 18 0.12
272 18 0.12
290 19 0.09
Desired output
255 12 0.1 255 12 0.1
256 13 0.1
259 15 0.15 259 15 0.15
272 18 0.12 272 18 0.12
290 19 0.09
You can try merge:
print(df)
a b c d e f
0 255 12 0.10 255.0 12.0 0.10
1 256 13 0.10 259.0 15.0 0.15
2 259 15 0.15 272.0 18.0 0.12
3 272 18 0.12 NaN NaN NaN
4 290 19 0.09 NaN NaN NaN
print(pd.merge(df[['a','b','c']],
               df[['d','e','f']],
               left_on=['a','b'],
               right_on=['d','e'],
               how='left'))
a b c d e f
0 255 12 0.10 255.0 12.0 0.10
1 256 13 0.10 NaN NaN NaN
2 259 15 0.15 259.0 15.0 0.15
3 272 18 0.12 272.0 18.0 0.12
4 290 19 0.09 NaN NaN NaN
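A self-contained version of the merge (frame rebuilt from the sample data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [255, 256, 259, 272, 290],
    "b": [12, 13, 15, 18, 19],
    "c": [0.10, 0.10, 0.15, 0.12, 0.09],
    "d": [255.0, 259.0, 272.0, np.nan, np.nan],
    "e": [12.0, 15.0, 18.0, np.nan, np.nan],
    "f": [0.10, 0.15, 0.12, np.nan, np.nan],
})

# left-join the (d, e, f) triple onto (a, b, c) by key;
# rows with no partner in (d, e) come back as NaN
aligned = pd.merge(df[["a", "b", "c"]], df[["d", "e", "f"]],
                   left_on=["a", "b"], right_on=["d", "e"], how="left")
```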

Pandas Pivot table nearest neighbor

SOLUTION
df = pd.read_csv('data.txt')
df['z-C+1'] = df.groupby(['a','b','d'])['z'].transform(lambda x:x.shift(+1))
df['z-C-1'] = df.groupby(['a','b','d'])['z'].transform(lambda x:x.shift(-1))
df['z-D+1'] = df.groupby(['a','b','c'])['z'].transform(lambda x:x.shift(+1))
df['z-D-1'] = df.groupby(['a','b','c'])['z'].transform(lambda x:x.shift(-1))
QUESTION
I have a CSV which is sorted by a few indexes. There is one index in particular I am interested in, and I want to keep the table the same. All I want to do is add extra columns which are a function of the table. So, let's say "v" is the column of interest. I want to take the "z" column and add more "z" columns from other places in the table where "c" = "c+1", "c-1", "d+1" and "d-1", and just join those on the end. In the end I want the same number of rows, but with the "z" column expanded to columns that are "Z.C-1.D", "Z.C.D", "Z.C+1.D", "Z.C.D-1", "Z.C.D+1", if that makes any sense. I'm having difficulties. I've tried the pivot_table method, and that got me somewhere while also adding confusion.
If this helps: Think about it like a point in a matrix, and I have an independent variable & dependent variable. I want to extract the neighboring independent variables for every location I have an observation
Here is my example csv:
a b c d v z
10 1 15 42 0.90 5460
10 2 15 42 0.97 6500
10 1 16 42 1.04 7540
10 2 16 42 1.11 8580
10 1 15 43 1.18 9620
10 2 15 43 0.98 10660
10 1 16 43 1.32 3452
10 2 16 43 1.39 4561
11 1 15 42 0.54 5670
11 2 15 42 1.53 6779
11 1 16 42 1.60 7888
11 2 16 42 1.67 8997
11 1 15 43 1.74 10106
11 2 15 43 1.81 11215
11 1 16 43 1.88 12324
11 2 16 43 1.95 13433
And my desired output:
a b c d v z z[c-1] z[c+1] z[d-1] z[d+1]
10 1 15 42 0.90 5460 NaN 7540 NaN 9620
10 2 15 42 0.97 6500 NaN 8580 NaN 10660
10 1 16 42 1.04 7540 5460 NaN NaN 3452
10 2 16 42 1.11 8580 6500 NaN NaN 4561
10 1 15 43 1.18 9620 NaN 3452 5460 NaN
10 2 15 43 0.98 10660 NaN 4561 6500 NaN
10 1 16 43 1.32 3452 9620 NaN 7540 NaN
10 2 16 43 1.39 4561 10660 NaN 8580 NaN
11 1 15 42 0.54 5670 NaN 7888 NaN 10106
11 2 15 42 1.53 6779 NaN 8997 NaN 11215
11 1 16 42 1.60 7888 5670 NaN NaN 12324
11 2 16 42 1.67 8997 6779 NaN NaN 13433
11 1 15 43 1.74 10106 NaN 12324 5670 NaN
11 2 15 43 1.81 11215 NaN 13433 6779 NaN
11 1 16 43 1.88 12324 10106 NaN 7888 NaN
11 2 16 43 1.95 13433 11215 NaN 8997 NaN
Don't know if I understood you, but you can use the shift() method to add shifted columns, like:
df['z-1'] = df.groupby('a')['z'].transform(lambda x:x.shift(-1))
update
If you want selection by values, you can use apply():
def lkp_data(c, d, v):
    res = df[(df['c'] == c) & (df['d'] == d) & (df['v'] == v)]['z']
    return None if len(res) == 0 else res.values[0]

df['z[c-1]'] = df.apply(lambda x: lkp_data(x['c'] - 1, x['d'], x['v']), axis=1)
df['z[c+1]'] = df.apply(lambda x: lkp_data(x['c'] + 1, x['d'], x['v']), axis=1)
df['z[d-1]'] = df.apply(lambda x: lkp_data(x['c'], x['d'] - 1, x['v']), axis=1)
df['z[d+1]'] = df.apply(lambda x: lkp_data(x['c'], x['d'] + 1, x['v']), axis=1)
c d z v z[c-1] z[c+1] z[d-1] z[d+1]
0 15 42 5460 1 NaN 7540 NaN 9620
1 15 42 6500 2 NaN 8580 NaN 10660
2 16 42 7540 1 5460 NaN NaN 3452
3 16 42 8580 2 6500 NaN NaN 4561
4 15 43 9620 1 NaN 3452 5460 NaN
5 15 43 10660 2 NaN 4561 6500 NaN
6 16 43 3452 1 9620 NaN 7540 NaN
7 16 43 4561 2 10660 NaN 8580 NaN
But I think this one would be really inefficient.
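For reference, the groupby/shift idea from the SOLUTION at the top can be run directly on the sample data. This sketch uses SeriesGroupBy.shift (equivalent to the transform(lambda x: x.shift(...)) form) and assumes, as in the sample, that rows are already ordered by c within each (a, b, d) group and by d within each (a, b, c) group:

```python
import pandas as pd

# sample rows for a == 10 from the question, in their original order
df = pd.DataFrame({
    "a": [10] * 8,
    "b": [1, 2, 1, 2, 1, 2, 1, 2],
    "c": [15, 15, 16, 16, 15, 15, 16, 16],
    "d": [42, 42, 42, 42, 43, 43, 43, 43],
    "z": [5460, 6500, 7540, 8580, 9620, 10660, 3452, 4561],
})

# shifting z while (a, b, d) is held fixed walks along c, and vice versa
df["z[c-1]"] = df.groupby(["a", "b", "d"])["z"].shift(+1)
df["z[c+1]"] = df.groupby(["a", "b", "d"])["z"].shift(-1)
df["z[d-1]"] = df.groupby(["a", "b", "c"])["z"].shift(+1)
df["z[d+1]"] = df.groupby(["a", "b", "c"])["z"].shift(-1)
```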
