Creating a new column from two columns using a dictionary in Pandas - python
I want to create a new column based on a per-group threshold: each group in one column has its own cutoff, and the new column should flag whether the value in another column meets that cutoff.
The dataframe is below:
df_in ->
unique_id myvalue identif
0 CTA15 19.0 TOP
1 CTA15 22.0 TOP
2 CTA15 28.0 TOP
3 CTA15 18.0 TOP
4 CTA15 22.4 TOP
5 AC007 2.0 TOP
6 AC007 2.3 SDME
7 AC007 2.0 SDME
8 AC007 5.0 SDME
9 AC007 3.0 SDME
10 AC007 31.4 SDME
11 AC007 4.4 SDME
12 CGT6 9.7 BTME
13 CGT6 44.5 BTME
14 TVF5 6.7 BTME
15 TVF5 9.1 BTME
16 TVF5 10.0 BTME
17 BGD1 1.0 BTME
18 BGD1 1.6 NON
19 GHB 51.0 NON
20 GHB 54.0 NON
21 GHB 4.7 NON
So I have created a dictionary with a cutoff for each group of the 'identif' column:
md = {'TOP': 22, 'SDME': 10, 'BTME': 20, 'NON':20}
So my goal is to create a new column, say 'chk', based on the following condition:
If the "identif" value matches a key in the dictionary "md" and the corresponding value in the "myvalue" column is greater than or equal to the dictionary value for that key, then 'chk' is 1; otherwise it is 0.
I am trying to find a good way using map/groupby/apply to create the new output data frame. Right now I am doing this in a very inefficient way (which takes considerable time on real data with millions of rows), using a function as follows:
def myfilter(df, idCol, valCol, mydict):
    # Very slow: iterates over every row, and over every dictionary entry for each row.
    for index, row in df.iterrows():
        for key, value in mydict.items():
            if row[idCol] == key and row[valCol] >= value:
                df.loc[index, 'chk'] = 1
            elif row[idCol] == key and row[valCol] < value:
                df.loc[index, 'chk'] = 0
    return df
Getting the output via the following call:
df_out = myfilter(df_in, 'identif', 'myvalue', md)
So my output will be like:
df_out ->
unique_id myvalue identif chk
0 CTA15 19.0 TOP 0
1 CTA15 22.0 TOP 1
2 CTA15 28.0 TOP 1
3 CTA15 18.0 TOP 0
4 CTA15 22.4 TOP 1
5 AC007 2.0 TOP 0
6 AC007 2.3 SDME 0
7 AC007 2.0 SDME 0
8 AC007 5.0 SDME 0
9 AC007 3.0 SDME 0
10 AC007 31.4 SDME 1
11 AC007 4.4 SDME 0
12 CGT6 9.7 BTME 0
13 CGT6 44.5 BTME 1
14 TVF5 6.7 BTME 0
15 TVF5 9.1 BTME 0
16 TVF5 10.0 BTME 0
17 BGD1 1.0 BTME 0
18 BGD1 1.6 NON 0
19 GHB 51.0 NON 1
20 GHB 54.0 NON 1
21 GHB 4.7 NON 0
This works but is extremely inefficient, and I would like a much better way to do it.
This should be faster:
import numpy as np

def func(identif, value):
    # 1.0/0.0 based on the per-group threshold, NaN if the group is not in the dictionary.
    if identif in md:
        if value >= md[identif]:
            return 1.0
        else:
            return 0.0
    else:
        return np.nan

df['chk'] = df.apply(lambda row: func(row['identif'], row['myvalue']), axis=1)
The timing on this little example:
CPU times: user 1.64 ms, sys: 73 µs, total: 1.71 ms
Wall time: 1.66 ms
Your version timing:
CPU times: user 8.6 ms, sys: 1.92 ms, total: 10.5 ms
Wall time: 8.79 ms
Although on such a small example it's not conclusive.
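For millions of rows it is usually worth avoiding apply altogether. A fully vectorized sketch (assuming, as in the example, that every 'identif' value has an entry in md; unknown groups would otherwise come out as 0 and are masked back to NaN in the last line):

import numpy as np

# Map each group to its threshold, then compare element-wise.
thresholds = df['identif'].map(md)                           # NaN where identif is not a key of md
df['chk'] = (df['myvalue'] >= thresholds).astype(int)        # comparison with NaN is False -> 0
df['chk'] = np.where(thresholds.isna(), np.nan, df['chk'])   # keep unknown groups as NaN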
First, you're doing redundant work: for each row in the data frame you're checking every entry in your dictionary, so you effectively traverse the dictionary (four keys here) once per row. You can change your function to look the threshold up once per row instead. This will speed up your original function. Try something like:
def myfilter(df, idCol, valCol, mydict):
    for index, row in df.iterrows():
        # Single dictionary lookup per row instead of a loop over every key
        # (assumes every value of idCol appears in mydict).
        value = mydict.get(row[idCol])
        if row[valCol] >= value:
            df.loc[index, 'chk'] = 1
        else:
            df.loc[index, 'chk'] = 0
    return df
Related
PatsyError: numbers besides '0' and '1' are only allowed with ** does not resolve when using Q
I'm trying to run an ANOVA test on a dataframe that looks like this:

   code  2020-11-01  2020-11-02  2020-11-03  2020-11-04  ...
0     1        22.5        73.1        12.2        77.5
1     1        23.1        75.4        12.4        78.3
2     2        43.1        72.1        13.4        85.4
3     2        41.6        85.1        34.1        96.5
4     3        97.3        43.2        31.1        55.3
5     3        12.1        44.4        32.2        52.1
...

I want to calculate a one-way ANOVA for each column based on the code. I have used statsmodels and a for loop:

keys = []
tables = []
for variable in df.columns[1:]:
    model = ols('{} ~ code'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
df_anova

The problem is that I keep getting an error on the fourth line (the ols call):

PatsyError: numbers besides '0' and '1' are only allowed with **
    2020-11-01 ~ code
    ^^^^

I have tried to use the Q argument as suggested here:

model = ols('{Q(x)} ~ code'.format(x=variable), data=df).fit()

KeyError: 'Q(x)'

I have also tried to place the Q outside the braces but got the same error. My end goal: to calculate a one-way ANOVA for each day (each column) based on the "code" column.
You can try to pivot it long and skip the iteration through columns:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({"code": [1, 1, 2, 2, 3, 3],
                   "2020-11-01": [22.5, 23.1, 43.1, 41.6, 97.3, 12.1],
                   "2020-11-02": [73.1, 75.4, 72.1, 85.1, 43.2, 44.4]})

df_long = df.melt(id_vars="code")
df_long

    code    variable  value
0      1  2020-11-01   22.5
1      1  2020-11-01   23.1
2      2  2020-11-01   43.1
3      2  2020-11-01   41.6
4      3  2020-11-01   97.3
5      3  2020-11-01   12.1
6      1  2020-11-02   73.1
7      1  2020-11-02   75.4
8      2  2020-11-02   72.1
9      2  2020-11-02   85.1
10     3  2020-11-02   43.2
11     3  2020-11-02   44.4

Then applying your code:

tables = []
keys = df_long.variable.unique()
for D in keys:
    model = ols('value ~ code', data=df_long[df_long.variable == D]).fit()
    anova_table = sm.stats.anova_lm(model)
    tables.append(anova_table)
pd.concat(tables, keys=keys)

Or simply:

def aov_func(x):
    model = ols('value ~ code', data=x).fit()
    return sm.stats.anova_lm(model)

df_long.groupby("variable").apply(aov_func)

Gives this result:

                      df     sum_sq      mean_sq         F    PR(>F)
variable
2020-11-01 code      1.0  1017.6100  1017.610000  1.115768  0.350405
           Residual  4.0  3648.1050   912.026250       NaN       NaN
2020-11-02 code      1.0   927.2025   927.202500  6.194022  0.067573
           Residual  4.0   598.7725   149.693125       NaN       NaN
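If you would rather keep the wide format, a sketch of another option (not from the answer above) is patsy's Q() helper, which quotes column names that are not valid Python identifiers; the original attempt failed because Q(x) was written inside the braces passed to str.format rather than wrapping the substituted name:

# Loop over the date columns, quoting each name with patsy's Q():
for variable in df.columns[1:]:
    model = ols('Q("{}") ~ code'.format(variable), data=df).fit()
    print(variable)
    print(sm.stats.anova_lm(model))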
pandas split cumsum to upper limits and then continue with remainder in different column
I have a DataFrame such that:

            Amt
Date
01/01/2000   10
01/02/2000   10
01/03/2000   10
01/04/2000   10
01/05/2000   10
01/06/2000   10
01/07/2000   10

Now suppose I have two storage facilities to store the Amt of product that I purchase: Storage 1, which has a cap of 22.5, and Storage 2, which has a capacity of 30. I would like to add both of these as columns and have them each sum cumulatively with a split quantity of Amt (for every 10, 5 goes into each). Once Storage 1 reaches capacity, I would like the remainder to go into Storage 2 until it becomes full, at which point the remainder would go into a third column, Sell. After this, the Amt can continue to accumulate in the Sell column for the remainder of the DataFrame, such that the output would look like:

            Amt  Storage 1  Storage 2  Sell
Date
01/01/2000   10        5.0        5.0   0.0
01/02/2000   10       10.0       10.0   0.0
01/03/2000   10       15.0       15.0   0.0
01/04/2000   10       20.0       20.0   0.0
01/05/2000   10       22.5       27.5   0.0
01/06/2000   10       22.5       30.0   7.5
01/07/2000   10       22.5       30.0  17.5

I am aware of cumsum, but I am not sure how to set conditions on it, nor do I know how to retrieve the remainder value in case the storage fills up. I apologize if this is unclear. If I am missing any necessary information, please let me know. Thanks in advance.
Use np.select to get the storage amount:

s = df["Amt"].cumsum()
df["Storage 1"] = np.select([s <= 45, s > 45], [s/2, 22.5])
df["Storage 2"] = np.select([s <= 52.5, s > 52.5], [s - df["Storage 1"], 30])
df["Sell"] = s - df["Storage 1"] - df["Storage 2"]
print(df)

            Amt  Storage 1  Storage 2  Sell
Date
01/01/2000   10        5.0        5.0   0.0
01/02/2000   10       10.0       10.0   0.0
01/03/2000   10       15.0       15.0   0.0
01/04/2000   10       20.0       20.0   0.0
01/05/2000   10       22.5       27.5   0.0
01/06/2000   10       22.5       30.0   7.5
01/07/2000   10       22.5       30.0  17.5
Since all values in Amt are the same, you may create each column by using cumsum and clip as follows:

s = df.Amt.cumsum()
df['Storage 1'] = (s/2).clip(upper=22.5)
df['Storage 2'] = (s - df['Storage 1']).clip(upper=30)
df['sell'] = s - df['Storage 1'] - df['Storage 2']

Out[556]:
            Amt  Storage 1  Storage 2  sell
Date
01/01/2000   10        5.0        5.0   0.0
01/02/2000   10       10.0       10.0   0.0
01/03/2000   10       15.0       15.0   0.0
01/04/2000   10       20.0       20.0   0.0
01/05/2000   10       22.5       27.5   0.0
01/06/2000   10       22.5       30.0   7.5
01/07/2000   10       22.5       30.0  17.5
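For reference, a self-contained sketch that runs the clip-based approach end to end (dates, column names, and capacities taken from the question). Because the rule is expressed purely in terms of the cumulative total, it does not appear to rely on Amt being constant, though that is worth checking against your real data:

import pandas as pd

df = pd.DataFrame({"Amt": [10] * 7},
                  index=pd.date_range("2000-01-01", periods=7, freq="D"))
df.index.name = "Date"

s = df["Amt"].cumsum()
df["Storage 1"] = (s / 2).clip(upper=22.5)              # half of the running total, capped at 22.5
df["Storage 2"] = (s - df["Storage 1"]).clip(upper=30)  # whatever Storage 1 can't take, capped at 30
df["Sell"] = s - df["Storage 1"] - df["Storage 2"]      # anything left over
print(df)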
Array into dataframe interpolation
I have the following array:

[299.13953679 241.1902389  192.58645951 ...   8.53750551  24.38822528  71.61117789]

For each value in the array I want to get the interpolated wind speed based on the values in the power column of the following pd.DataFrame:

    wind speed  power
5          2.5      0
6          3.0     25
7          3.5     82
8          4.0    154
9          4.5    244
10         5.0    354
11         5.5    486
12         6.0    643
13         6.5    827
14         7.0   1038
15         7.5   1272
16         8.0   1525
17         8.5   1794
18         9.0   2037
19         9.5   2211
20        10.0   2362
21        10.5   2386
22        11.0   2400

So basically I'd like to retrieve the following array:

[4.7 4.5 4.3 ... 2.6 3.0 3.4]

Any suggestions on where to start? I was looking at the pd.DataFrame.interpolate function, but reading through its functionality it does not seem to help with my problem. Or am I wrong?
Using interp from numpy:

np.interp(ary, df['power'].values, df['wind speed'].values)

Out[202]:
array([4.75063426, 4.48439022, 4.21436922, 2.67075011, 2.98776451,
       3.40886998])
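A self-contained sketch of the same call (the table and the truncated array are reconstructed from the question; note that np.interp expects its x-coordinates, here the power column, to be increasing, which holds for this table):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "wind speed": np.arange(2.5, 11.5, 0.5),
    "power": [0, 25, 82, 154, 244, 354, 486, 643, 827, 1038,
              1272, 1525, 1794, 2037, 2211, 2362, 2386, 2400],
})

ary = np.array([299.13953679, 241.1902389, 192.58645951,
                8.53750551, 24.38822528, 71.61117789])

# Interpolate wind speed as a function of power at each value in ary.
wind = np.interp(ary, df["power"].values, df["wind speed"].values)
print(wind)   # roughly [4.75, 4.48, 4.21, 2.67, 2.99, 3.41]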
Difference between dates in Pandas dataframe
This is related to this question, but now I need to find the difference between dates that are stored as 'YYYY-MM-DD'. Essentially the difference between values in the count column is what we need, but normalized by the number of days between each row. My dataframe is:

date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,53.0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,53.0
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,53.0
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,54.0
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,54.0
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,54.0
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,58.0
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,521.0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,524.0
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,531.0
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,533.0
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,534.0

And I'd like to find the difference between each date after grouping by date+site+country+kind+ID tuples:

date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count,day_diff
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,0,0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,0,1
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,0,1
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,0,1
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,0,1
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,0,1
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,4,2
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,0,0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,3,1
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,7,4
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,3,1
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,1,1

One option would be to convert the date column to a Pandas datetime column using pd.to_datetime() and use the diff function, but that results in values like "x days", of type timedelta64. I'd like to use this difference to find the daily average count, so if this can be accomplished in even a single/less painful step, that would work well.
you can use the .dt.days accessor:

In [72]: df['date'] = pd.to_datetime(df['date'])

In [73]: df['day_diff'] = df.groupby(['site','country_code','kind','ID'])['date'] \
                            .diff().dt.days.fillna(0)

In [74]: df
Out[74]:
         date      site country_code  kind  ID  rank  votes  sessions  avg_score  count  day_diff
0  2017-03-20  website1           US     0  84   226    0.0      15.0   3.370812   53.0       0.0
1  2017-03-21  website1           US     0  84   214    0.0      15.0   3.370812   53.0       1.0
2  2017-03-22  website1           US     0  84   226    0.0      16.0   3.370812   53.0       1.0
3  2017-03-23  website1           US     0  84   234    0.0      16.0   3.369048   54.0       1.0
4  2017-03-24  website1           US     0  84   226    0.0      16.0   3.369048   54.0       1.0
5  2017-03-25  website1           US     0  84   212    0.0      16.0   3.369048   54.0       1.0
6  2017-03-27  website1           US     0  84   228    0.0      16.0   3.369048   58.0       2.0
7  2017-02-15  website2           AU     1  91   144    4.0     148.0   4.727272  521.0       0.0
8  2017-02-16  website2           AU     1  91   144    3.0     147.0   4.727272  524.0       1.0
9  2017-02-20  website2           AU     1  91   100    4.0     148.0   4.727272  531.0       4.0
10 2017-02-21  website2           AU     1  91   118    6.0     149.0   4.727272  533.0       1.0
11 2017-02-22  website2           AU     1  91   114    4.0     151.0   4.727272  534.0       1.0
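Since the question also asks about normalizing the change in count by the elapsed days, a minimal follow-up sketch (the count_per_day column name is hypothetical; it assumes the same grouping keys and the day_diff column computed above):

group_keys = ['site', 'country_code', 'kind', 'ID']

# Change in count since the previous row of the same group, divided by the days elapsed.
# The first row of each group has no previous value (NaN), so it is filled with 0.
df['count_per_day'] = (df.groupby(group_keys)['count'].diff() / df['day_diff']).fillna(0)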
Need to compare one Pandas (Python) dataframe with values from another dataframe
So I've pulled data from an SQL server and put it into a dataframe. All the data is of discrete form and increases in 0.1 steps in one direction (0.0, 0.1, 0.2 ... 9.8, 9.9, 10.0), with multiple power values for each step (e.g. 1000, 1412, 134.5, 657.1 at 0.1; 14.5, 948.1, 343.8 at 5.5) - hopefully you see what I'm trying to say.

I've managed to group the data into these individual steps using the following, and have then taken the mean and standard deviation for each group:

group = df.groupby('step').power.mean()
group2 = df.groupby('step').power.std().fillna(0)

This results in two objects (group and group2) which have the mean and standard deviation for each of the 0.1 steps. It's then easy to create an upper and lower limit for each step using the following:

upperlimit = group + 3*group2
lowerlimit = group - 3*group2
lowerlimit[lowerlimit < 0] = 0

Now comes the bit I'm confused about! I need to go back into the original dataframe and remove rows/instances where the power value is outside these calculated limits (note there is a different upper and lower limit for each 0.1 step). Here are 50 lines of the sample data:

Index  Power  Step
0      106.0   5.0
1      200.4   5.5
2      201.4   5.6
3      226.9   5.6
4      206.8   5.6
5      177.5   5.3
6      124.0   4.9
7      121.0   4.8
8       93.9   4.7
9      135.6   5.0
10     211.1   5.6
11     265.2   6.0
12     281.4   6.2
13     417.9   6.9
14     546.0   7.4
15     619.9   7.9
16     404.4   7.1
17     241.4   5.8
18      44.3   3.9
19      72.1   4.6
20      21.1   3.3
21       6.3   2.3
22       0.0   0.8
23       0.0   0.9
24       0.0   3.2
25       0.0   4.6
26      33.3   4.2
27      97.7   4.7
28      91.0   4.7
29     105.6   4.8
30      97.4   4.6
31     126.7   5.0
32     134.3   5.0
33     133.4   5.1
34     301.8   6.3
35     298.5   6.3
36     312.1   6.5
37     505.3   7.5
38     491.8   7.3
39     404.6   6.8
40     324.3   6.6
41     347.2   6.7
42     365.3   6.8
43     279.7   6.3
44     351.4   6.8
45     350.1   6.7
46     573.5   7.9
47     490.1   7.5
48     520.4   7.6
49     548.2   7.9
To put your goal another way: you want to perform some manipulations on grouped data and then project the results of those manipulations back onto the ungrouped rows, so you can use them for filtering those rows. One way to do this is with transform:

    The transform method returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk.

You can then create the new columns directly (note that p.std() returns a scalar inside the lambda, so single-row groups need a scalar-safe fallback rather than fillna):

df['upper'] = df.groupby('step').power.transform(lambda p: p.mean() + 3 * (p.std() if len(p) > 1 else 0))
df['lower'] = df.groupby('step').power.transform(lambda p: p.mean() - 3 * (p.std() if len(p) > 1 else 0))
df.loc[df['lower'] < 0, 'lower'] = 0

And filter accordingly:

df = df[(df.power <= df.upper) & (df.power >= df.lower)]
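An equivalent sketch (not from the answer above) that uses the built-in 'mean' and 'std' aggregations with transform, so the NaN handling stays on a Series and no Python lambda runs per group:

grp = df.groupby('step')['power']
mean = grp.transform('mean')
std = grp.transform('std').fillna(0)   # std is NaN for single-row groups

df['upper'] = mean + 3 * std
df['lower'] = (mean - 3 * std).clip(lower=0)

# Keep only the rows whose power lies within the per-step limits (bounds inclusive).
df = df[df['power'].between(df['lower'], df['upper'])]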