Creating a new column from two columns using a dictionary in Pandas - python

I want to create a new column based on a per-group threshold cutoff applied to another column, where the groups come from a third column.
The dataframe is below:
df_in ->
unique_id myvalue identif
0 CTA15 19.0 TOP
1 CTA15 22.0 TOP
2 CTA15 28.0 TOP
3 CTA15 18.0 TOP
4 CTA15 22.4 TOP
5 AC007 2.0 TOP
6 AC007 2.3 SDME
7 AC007 2.0 SDME
8 AC007 5.0 SDME
9 AC007 3.0 SDME
10 AC007 31.4 SDME
11 AC007 4.4 SDME
12 CGT6 9.7 BTME
13 CGT6 44.5 BTME
14 TVF5 6.7 BTME
15 TVF5 9.1 BTME
16 TVF5 10.0 BTME
17 BGD1 1.0 BTME
18 BGD1 1.6 NON
19 GHB 51.0 NON
20 GHB 54.0 NON
21 GHB 4.7 NON
So I have created a dictionary keyed by each group of the 'identif' column:
md = {'TOP': 22, 'SDME': 10, 'BTME': 20, 'NON':20}
So my goal is to create a new column, say 'chk', based on the following condition:
If the "identif" value matches a key in the dictionary "md" and the corresponding "myvalue" is greater than or equal to that key's value, then 'chk' should be 1, otherwise 0.
I am trying to find a good way using map/groupby/apply to create the new output data frame. At the moment I am doing it in a very inefficient way (which takes considerable time on real data with millions of rows), using a function as follows:
def myfilter(df, idCol, valCol, mydict):
    for index, row in df.iterrows():
        for key, value in mydict.items():
            if row[idCol] == key and row[valCol] >= value:
                df.loc[index, 'chk'] = 1
            elif row[idCol] == key and row[valCol] < value:
                df.loc[index, 'chk'] = 0
    return df
Getting the output via the following call:
df_out = myfilter(df_in, 'identif', 'myvalue', md)
So my output will be like:
df_out ->
unique_id myvalue identif chk
0 CTA15 19.0 TOP 0
1 CTA15 22.0 TOP 1
2 CTA15 28.0 TOP 1
3 CTA15 18.0 TOP 0
4 CTA15 22.4 TOP 1
5 AC007 2.0 TOP 0
6 AC007 2.3 SDME 0
7 AC007 2.0 SDME 0
8 AC007 5.0 SDME 0
9 AC007 3.0 SDME 0
10 AC007 31.4 SDME 1
11 AC007 4.4 SDME 0
12 CGT6 9.7 BTME 0
13 CGT6 44.5 BTME 1
14 TVF5 6.7 BTME 0
15 TVF5 9.1 BTME 0
16 TVF5 10.0 BTME 0
17 BGD1 1.0 BTME 0
18 BGD1 1.6 NON 0
19 GHB 51.0 NON 1
20 GHB 54.0 NON 1
21 GHB 4.7 NON 0
This works, but it is extremely inefficient, and I would like a much better way to do it.

This should be faster:
def func(identif, value):
    if identif in md:
        if value >= md[identif]:
            return 1.0
        else:
            return 0.0
    else:
        return np.NaN

df['chk'] = df.apply(lambda row: func(row['identif'], row['myvalue']), axis=1)
The timing on this little example:
CPU times: user 1.64 ms, sys: 73 µs, total: 1.71 ms
Wall time: 1.66 ms
Your version timing:
CPU times: user 8.6 ms, sys: 1.92 ms, total: 10.5 ms
Wall time: 8.79 ms
Although on such a small example it's not conclusive.
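A fully vectorized alternative (a sketch, assuming the df and md from the question) skips the row-wise apply entirely by mapping the dictionary onto the 'identif' column and comparing the Series directly; rows whose 'identif' has no key in md map to NaN, compare as False, and get 0:
# Look up each row's threshold from the dictionary, then compare element-wise.
thresholds = df['identif'].map(md)
df['chk'] = (df['myvalue'] >= thresholds).astype(int)
On large frames this typically beats apply by a wide margin, since the comparison is done in one vectorized pass.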

First, for each row in the data frame you're traversing every element in your dictionary, so you do several lookups per row. You can change your function to do a single dictionary lookup per row instead. This will speed up your original function. Try something like:
def myfilter(df, idCol, valCol, mydict):
    for index, row in df.iterrows():
        value = mydict.get(row[idCol])
        if row[valCol] >= value:
            df.loc[index, 'chk'] = 1
        else:
            df.loc[index, 'chk'] = 0
    return df

Related

PatsyError: numbers besides '0' and '1' are only allowed with ** doesn't resolve when using Q

I'm trying to run an ANOVA test on a dataframe that looks like this:
>>>code 2020-11-01 2020-11-02 2020-11-03 2020-11-04 ...
0 1 22.5 73.1 12.2 77.5
1 1 23.1 75.4 12.4 78.3
2 2 43.1 72.1 13.4 85.4
3 2 41.6 85.1 34.1 96.5
4 3 97.3 43.2 31.1 55.3
5 3 12.1 44.4 32.2 52.1
...
I want to calculate a one-way ANOVA for each column, grouped by the code column. For that I have used statsmodels and a for loop:
keys = []
tables = []
for variable in df.columns[1:]:
    model = ols('{} ~ code'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
df_anova
The problem is that I keep getting an error on the 4th line (the ols call):
PatsyError: numbers besides '0' and '1' are only allowed with **
2020-11-01 ~ code
^^^^
I have tried to use the Q argument as suggested here:
...
model = ols('{Q(x)} ~ code'.format(x=variable), data=df).fit()
KeyError: 'Q(x)'
I have also tried to place the Q outside the braces but got the same error.
My end goal is to calculate a one-way ANOVA for each day (each column), grouped by the "code" column.
You can try to pivot it long and skip the iteration through columns:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.DataFrame({"code": [1, 1, 2, 2, 3, 3],
                   "2020-11-01": [22.5, 23.1, 43.1, 41.6, 97.3, 12.1],
                   "2020-11-02": [73.1, 75.4, 72.1, 85.1, 43.2, 44.4]})
df_long = df.melt(id_vars="code")
df_long
code variable value
0 1 2020-11-01 22.5
1 1 2020-11-01 23.1
2 2 2020-11-01 43.1
3 2 2020-11-01 41.6
4 3 2020-11-01 97.3
5 3 2020-11-01 12.1
6 1 2020-11-02 73.1
7 1 2020-11-02 75.4
8 2 2020-11-02 72.1
9 2 2020-11-02 85.1
10 3 2020-11-02 43.2
11 3 2020-11-02 44.4
Then applying your code:
tables = []
keys = df_long.variable.unique()
for D in keys:
    model = ols('value ~ code', data=df_long[df_long.variable == D]).fit()
    anova_table = sm.stats.anova_lm(model)
    tables.append(anova_table)
pd.concat(tables, keys=keys)
Or simply:
def aov_func(x):
    model = ols('value ~ code', data=x).fit()
    return sm.stats.anova_lm(model)

df_long.groupby("variable").apply(aov_func)
Gives this result:
df sum_sq mean_sq F PR(>F)
variable
2020-11-01 code 1.0 1017.6100 1017.610000 1.115768 0.350405
Residual 4.0 3648.1050 912.026250 NaN NaN
2020-11-02 code 1.0 927.2025 927.202500 6.194022 0.067573
Residual 4.0 598.7725 149.693125 NaN NaN
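As a side note, if you want to keep the original per-column loop, the PatsyError can be avoided by quoting the non-identifier column names with patsy's Q() around a string literal (a sketch, assuming the same df and the ols/sm/pd imports from above):
# Q("...") tells patsy to treat the quoted string as a literal column name,
# so date-like names such as "2020-11-01" are not parsed as arithmetic.
keys = []
tables = []
for variable in df.columns[1:]:
    model = ols('Q("{}") ~ code'.format(variable), data=df).fit()
    keys.append(variable)
    tables.append(sm.stats.anova_lm(model))
df_anova = pd.concat(tables, keys=keys, axis=0)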

pandas split cumsum to upper limits and then continue with remainder in different column

I have a DataFrame such that:
Amt
Date
01/01/2000 10
01/02/2000 10
01/03/2000 10
01/04/2000 10
01/05/2000 10
01/06/2000 10
01/07/2000 10
Now suppose I have two storage facilities to store the Amt of product that I purchase; Storage 1 which has a cap of 22.5, and Storage 2 which has a capacity of 30. I would like to add both of these as columns and have them each sum cumulatively at a SPLIT quantity of Amt (for every 10, 5 goes into each). Once Storage 1 reaches capacity, I would like the remainder to go into Storage 2 until it becomes full, at which point the remainder would go into a third column, Sell. After this, the Amt can continue to accumulate in the Sell column for the remainder of the DataFrame, such that the output would look like:
Amt | Storage 1 | Storage 2 | Sell |
Date
01/01/2000 10 5 5 0
01/02/2000 10 10 10 0
01/03/2000 10 15 15 0
01/04/2000 10 20 20 0
01/05/2000 10 22.5 27.5 0
01/06/2000 10 22.5 30 7.5
01/07/2000 10 22.5 30 17.5
I am aware of cumsum but I am not sure how to set conditions on it, nor do I know how to retrieve the remainder value in the case the storage fills up.
I apologize if this is unclear. If I am missing any necessary information, please let me know. Thanks in advance.
Use np.select to get the storage amount:
s = df["Amt"].cumsum()
# Storage 1 takes half of the cumulative total until it hits its 22.5 cap,
# which happens once the total s exceeds 45 (= 2 * 22.5).
df["Storage 1"] = np.select([s <= 45, s > 45],
                            [s / 2, 22.5])
# Storage 2 takes the remainder until the combined caps are reached
# (52.5 = 22.5 + 30); after that it stays at 30.
df["Storage 2"] = np.select([s <= 52.5, s > 52.5],
                            [s - df["Storage 1"], 30])
# Whatever exceeds both caps is sold.
df["Sell"] = s - df["Storage 1"] - df["Storage 2"]
print(df)
Amt Storage 1 Storage 2 Sell
Date
01/01/2000 10 5.0 5.0 0.0
01/02/2000 10 10.0 10.0 0.0
01/03/2000 10 15.0 15.0 0.0
01/04/2000 10 20.0 20.0 0.0
01/05/2000 10 22.5 27.5 0.0
01/06/2000 10 22.5 30.0 7.5
01/07/2000 10 22.5 30.0 17.5
Since all values in Amt are the same, you can create each column using cumsum and clip as follows:
s = df.Amt.cumsum()
df['Storage 1'] = (s/2).clip(upper=22.5)
df['Storage 2'] = (s - df['Storage 1']).clip(upper=30)
df['sell'] = s - df['Storage 1'] - df['Storage 2']
Out[556]:
Amt Storage 1 Storage 2 sell
Date
01/01/2000 10 5.0 5.0 0.0
01/02/2000 10 10.0 10.0 0.0
01/03/2000 10 15.0 15.0 0.0
01/04/2000 10 20.0 20.0 0.0
01/05/2000 10 22.5 27.5 0.0
01/06/2000 10 22.5 30.0 7.5
01/07/2000 10 22.5 30.0 17.5
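If the caps or the split fraction ever change, the same clip idea can be wrapped in a small helper (a sketch; the function name and parameters are illustrative, not part of the answers above):
import pandas as pd

def allocate_storage(amt, cap1=22.5, cap2=30.0, split=0.5):
    # amt is the Series of purchased amounts; returns cumulative allocations.
    s = amt.cumsum()
    storage1 = (s * split).clip(upper=cap1)     # fill Storage 1 up to its cap
    storage2 = (s - storage1).clip(upper=cap2)  # overflow goes to Storage 2
    sell = s - storage1 - storage2              # anything beyond both caps is sold
    return pd.DataFrame({"Storage 1": storage1, "Storage 2": storage2, "Sell": sell})

df = df.join(allocate_storage(df["Amt"]))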

Array into dataframe interpolation

I have the following array:
[299.13953679 241.1902389 192.58645951 ... 8.53750551 24.38822528
71.61117789]
For each value in the array I want to get the interpolated wind speed based on the values in the column power in the following pd.DataFrame:
wind speed power
5 2.5 0
6 3.0 25
7 3.5 82
8 4.0 154
9 4.5 244
10 5.0 354
11 5.5 486
12 6.0 643
13 6.5 827
14 7.0 1038
15 7.5 1272
16 8.0 1525
17 8.5 1794
18 9.0 2037
19 9.5 2211
20 10.0 2362
21 10.5 2386
22 11.0 2400
So basically I'd like to retrieve the following array:
[4.7 4.5 4.3 ... 2.6 3.0 3.4]
Any suggestions on where to start? I was looking at the pd.DataFrame.interpolate function, but reading through its functionality it does not seem to help with my problem. Or am I wrong?
Using interp from numpy
np.interp(ary,df['power'].values,df['wind speed'].values)
Out[202]:
array([4.75063426, 4.48439022, 4.21436922, 2.67075011, 2.98776451,
3.40886998])
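For a self-contained check (a sketch, assuming the array is called ary and the lookup table is the df shown in the question), note that np.interp expects its x-coordinates, here the power column, to be monotonically increasing, which they are in this table:
import numpy as np
import pandas as pd

df = pd.DataFrame({"wind speed": np.arange(2.5, 11.5, 0.5),
                   "power": [0, 25, 82, 154, 244, 354, 486, 643, 827, 1038,
                             1272, 1525, 1794, 2037, 2211, 2362, 2386, 2400]})
ary = np.array([299.13953679, 241.1902389, 192.58645951,
                8.53750551, 24.38822528, 71.61117789])

# Interpolate wind speed as a function of power; values outside the table's
# power range are clamped to the first/last wind speed.
wind_speed = np.interp(ary, df["power"].values, df["wind speed"].values)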

Difference between dates in Pandas dataframe

This is related to this question, but now I need to find the difference between dates that are stored in 'YYYY-MM-DD' format. Essentially the difference between values in the count column is what we need, but normalized by the number of days between each row.
My dataframe is:
date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,53.0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,53.0
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,53.0
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,54.0
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,54.0
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,54.0
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,58.0
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,521.0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,524.0
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,531.0
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,533.0
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,534.0
And I'd like to find the difference between each date after grouping by date+site+country+kind+ID tuples.
[date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count,day_diff
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,0,0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,0,1
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,0,1
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,0,1
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,0,1
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,0,1
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,4,2
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,0,0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,3,1
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,7,4
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,3,1
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,1,1]
One option would be to convert the date column to Pandas datetime using pd.to_datetime() and use the diff function, but that results in values like "x days" of type timedelta64. I'd like to use this difference to find the daily average count, so if this can be accomplished in a single or less painful step, that would work well.
You can use the .dt.days accessor:
In [72]: df['date'] = pd.to_datetime(df['date'])
In [73]: df['day_diff'] = df.groupby(['site','country_code','kind','ID'])['date'] \
.diff().dt.days.fillna(0)
In [74]: df
Out[74]:
date site country_code kind ID rank votes sessions avg_score count day_diff
0 2017-03-20 website1 US 0 84 226 0.0 15.0 3.370812 53.0 0.0
1 2017-03-21 website1 US 0 84 214 0.0 15.0 3.370812 53.0 1.0
2 2017-03-22 website1 US 0 84 226 0.0 16.0 3.370812 53.0 1.0
3 2017-03-23 website1 US 0 84 234 0.0 16.0 3.369048 54.0 1.0
4 2017-03-24 website1 US 0 84 226 0.0 16.0 3.369048 54.0 1.0
5 2017-03-25 website1 US 0 84 212 0.0 16.0 3.369048 54.0 1.0
6 2017-03-27 website1 US 0 84 228 0.0 16.0 3.369048 58.0 2.0
7 2017-02-15 website2 AU 1 91 144 4.0 148.0 4.727272 521.0 0.0
8 2017-02-16 website2 AU 1 91 144 3.0 147.0 4.727272 524.0 1.0
9 2017-02-20 website2 AU 1 91 100 4.0 148.0 4.727272 531.0 4.0
10 2017-02-21 website2 AU 1 91 118 6.0 149.0 4.727272 533.0 1.0
11 2017-02-22 website2 AU 1 91 114 4.0 151.0 4.727272 534.0 1.0
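To get the daily average change in count mentioned at the end of the question, a sketch building on the day_diff column computed above (the column name daily_avg_count is illustrative):
# Change in count within each group, divided by the number of elapsed days.
# The first row of each group has no previous value, so diff() yields NaN there.
grp = df.groupby(['site', 'country_code', 'kind', 'ID'])
df['daily_avg_count'] = grp['count'].diff() / df['day_diff']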

Need to compare one Pandas (Python) dataframe with values from another dataframe

So I've pulled data from an SQL server and loaded it into a dataframe. All the data is discrete and increases in 0.1 steps in one direction (0.0, 0.1, 0.2 ... 9.8, 9.9, 10.0), with multiple power values for each step (e.g. 1000, 1412, 134.5, 657.1 at 0.1; 14.5, 948.1, 343.8 at 5.5) - hopefully you see what I'm trying to say.
I've managed to group the data into these individual steps using the following, and have then taken the mean and standard deviation for each group.
group = df.groupby('step').power.mean()
group2 = df.groupby('step').power.std().fillna(0)
This results in two data frames (group and group2) which have the mean and standard deviation for each of the 0.1 steps. It's then easy to create an upper and lower limit for each step using the following:
upperlimit = group + 3*group2
lowerlimit = group - 3*group2
lowerlimit[lowerlimit<0] = 0
Now comes the bit I'm confused about! I need to go back into the original dataframe and remove rows/instances where the power value is outside these calculated limits (note there is a different upper and lower limit for each 0.1 step).
Here's 50 lines of the sample data:
Index Power Step
0 106.0 5.0
1 200.4 5.5
2 201.4 5.6
3 226.9 5.6
4 206.8 5.6
5 177.5 5.3
6 124.0 4.9
7 121.0 4.8
8 93.9 4.7
9 135.6 5.0
10 211.1 5.6
11 265.2 6.0
12 281.4 6.2
13 417.9 6.9
14 546.0 7.4
15 619.9 7.9
16 404.4 7.1
17 241.4 5.8
18 44.3 3.9
19 72.1 4.6
20 21.1 3.3
21 6.3 2.3
22 0.0 0.8
23 0.0 0.9
24 0.0 3.2
25 0.0 4.6
26 33.3 4.2
27 97.7 4.7
28 91.0 4.7
29 105.6 4.8
30 97.4 4.6
31 126.7 5.0
32 134.3 5.0
33 133.4 5.1
34 301.8 6.3
35 298.5 6.3
36 312.1 6.5
37 505.3 7.5
38 491.8 7.3
39 404.6 6.8
40 324.3 6.6
41 347.2 6.7
42 365.3 6.8
43 279.7 6.3
44 351.4 6.8
45 350.1 6.7
46 573.5 7.9
47 490.1 7.5
48 520.4 7.6
49 548.2 7.9
To put your goal another way, you want to perform some manipulations on grouped data, and then project the results of those manipulations back to the ungrouped rows so you can use them for filtering those rows. One way to do this is with transform:
The transform method returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk.
You can then create the new columns directly:
grouped = df.groupby('step').power
df['upper'] = grouped.transform('mean') + 3 * grouped.transform('std').fillna(0)
df['lower'] = grouped.transform('mean') - 3 * grouped.transform('std').fillna(0)
df.loc[df['lower'] < 0, 'lower'] = 0
And filter accordingly:
df = df[(df.power <= df.upper) & (df.power >= df.lower)]
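If the grouped computation becomes a bottleneck on large data, an equivalent sketch (using the same lowercase column names as above) aggregates the bounds once per step and merges them back before filtering:
# Compute mean and std once per step, derive the bounds,
# then broadcast them back onto the original rows with a merge.
bounds = df.groupby('step').power.agg(['mean', 'std']).fillna(0).reset_index()
bounds['upper'] = bounds['mean'] + 3 * bounds['std']
bounds['lower'] = (bounds['mean'] - 3 * bounds['std']).clip(lower=0)
df = df.merge(bounds[['step', 'upper', 'lower']], on='step', how='left')
df = df[(df.power <= df.upper) & (df.power >= df.lower)]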
