I'm trying to create a new column in an R dataframe based on a set of mutually exclusive conditions. There is a clever way to achieve this in Python using np.select(conditions, choices) instead of np.where (see this solved question). I've been looking for an equivalent in R that lets me avoid writing a gigantic nested ifelse (the equivalent of np.where), without any success.
The number of conditions I have can change, and I'm implementing a function for this, so an equivalent would be really helpful. Is there any option to do this? I'm new to R and come from Python.
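For reference, the np.select pattern I'd like to replicate looks like this (a minimal sketch with made-up data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'cyl': [4, 6, 8]})
# each condition maps to the choice at the same position; default covers the rest
conditions = [df['cyl'] > 7, df['cyl'] == 6]
choices = ['High', 'Medium']
df['cyl2'] = np.select(conditions, choices, default='Low')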
Thank you!
Yes, you can use case_when in R:
library(dplyr)
mtcars %>%
  mutate(cyl2 = case_when(cyl > 7 ~ "High",
                          cyl == 6 ~ "Medium",
                          TRUE ~ "Low"))
mpg cyl disp hp drat wt qsec vs am gear carb cyl2
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Medium
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Medium
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Low
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Medium
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 High
There's also cut() (see Convert Numeric to Factor), with or without your own labels:
df <- data.frame(a = 1:10)
df$b <- cut(df$a,
            breaks = c(-Inf, 3, 7, Inf),
            labels = c("lo", "med", "hi"))
df$c <- cut(df$a,
            breaks = c(-Inf, 3, 7, Inf))
df
#> a b c
#> 1 1 lo (-Inf,3]
#> 2 2 lo (-Inf,3]
#> 3 3 lo (-Inf,3]
#> 4 4 med (3,7]
#> 5 5 med (3,7]
#> 6 6 med (3,7]
#> 7 7 med (3,7]
#> 8 8 hi (7, Inf]
#> 9 9 hi (7, Inf]
#> 10 10 hi (7, Inf]
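Since the question comes from the Python side, it may help to note that pandas has an analogous pd.cut; a minimal sketch mirroring the R example above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1, 11)})
# same breaks and labels as the R cut() call
df['b'] = pd.cut(df['a'], bins=[-np.inf, 3, 7, np.inf], labels=['lo', 'med', 'hi'])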
Related
dataframe = pd.DataFrame(data={'user': [1,1,1,1,1,2,2,2,2,2],
                               'usage': [12,18,76,32,43,45,19,42,9,10]})
dataframe['mean'] = dataframe.groupby('user')['usage'].apply(pd.rolling_mean, 2)
Why is this code not working?
I am getting an error saying that rolling_mean is not found in pandas.
Use groupby with rolling (pd.rolling_mean was deprecated in pandas 0.18 and later removed, which is why the attribute lookup fails); see the docs:
dataframe['mean'] = (dataframe.groupby('user')['usage']
.rolling(2)
.mean()
.reset_index(level=0, drop=True))
print (dataframe)
user usage mean
0 1 12 NaN
1 1 18 15.0
2 1 76 47.0
3 1 32 54.0
4 1 43 37.5
5 2 45 NaN
6 2 19 32.0
7 2 42 30.5
8 2 9 25.5
9 2 10 9.5
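If the intermediate MultiIndex is confusing, an equivalent per-group formulation with apply gives the same result (a sketch on the same data):
dataframe['mean'] = (dataframe.groupby('user')['usage']
                              .apply(lambda s: s.rolling(2).mean())  # rolling mean within each user
                              .reset_index(level=0, drop=True))      # drop the added 'user' level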
I would like to populate the 'Indicator' column based on both charge columns. If 'Charge1' is within plus or minus 5% of the 'Charge2' value, set the 'Indicator' to RFP, otherwise leave it blank (see example below).
ID Charge1 Charge2 Indicator
1 9.5 10 RFP
2 22 20
3 41 40 RFP
4 65 80
5 160 160 RFP
6 315 320 RFP
7 613 640 RFP
8 800 700
9 759 800
10 1480 1500 RFP
I tried using a .loc approach, but struggled to establish if 'Charge1' was within +/- 5% of 'Charge2'.
In [190]: df.loc[df.eval("Charge2*0.95 <= Charge1 <= Charge2*1.05"), 'RFP'] = 'RFP'
In [191]: df
Out[191]:
ID Charge1 Charge2 RFP
0 1 9.5 10 RFP
1 2 22.0 20 NaN
2 3 41.0 40 RFP
3 4 65.0 80 NaN
4 5 160.0 160 RFP
5 6 315.0 320 RFP
6 7 613.0 640 RFP
7 8 800.0 700 NaN
8 9 759.0 800 NaN
9 10 1480.0 1500 RFP
Pretty simple: create an 'indicator' series of booleans which depends on the percentage difference between Charge1 and Charge2.
df = pd.read_clipboard()
threshold = 0.05
indicator = ( (df['Charge1'] / df['Charge2']) - 1).abs() <= threshold
df.loc[indicator]
Set a threshold figure and compare the values against that.
Wherever the value is within the threshold, the comparison returns True, so you can use the indicator (a boolean series) directly as an input to .loc.
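To populate the column as asked, rather than just filter, the same boolean series can drive a .loc assignment; a sketch reusing the indicator from above:
df['Indicator'] = ''                    # blank by default
df.loc[indicator, 'Indicator'] = 'RFP'  # flag rows inside the 5% band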
Try
cond = ((df['Charge2'] - df['Charge1'])/df['Charge2']*100).abs() <= 5
df['Indicator'] = np.where(cond, 'RFP', np.nan)
ID Charge1 Charge2 Indicator
0 1 9.5 10 RFP
1 2 22.0 20 nan
2 3 41.0 40 RFP
3 4 65.0 80 nan
4 5 160.0 160 RFP
5 6 315.0 320 RFP
6 7 613.0 640 RFP
7 8 800.0 700 nan
8 9 759.0 800 nan
9 10 1480.0 1500 RFP
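One caveat: mixing a string with np.nan inside np.where makes numpy cast everything to strings, so the missing entries above are actually the literal string 'nan', not real NaN. If truly blank cells are wanted, an empty-string default avoids this (a one-line sketch):
df['Indicator'] = np.where(cond, 'RFP', '')  # '' stays blank instead of the string 'nan'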
You can use pct_change:
df[['Charge2','Charge1']].T.pct_change().dropna().T.abs().mul(100).astype(int)<=(5)
Out[245]:
Charge1
0 True
1 False
2 True
3 False
4 True
5 True
6 True
7 False
8 True
9 True
Be very careful!
In Python floating-point arithmetic, 9.5/10 - 1 == -0.050000000000000044.
This is one way to explicitly account for this issue via numpy.
import numpy as np
vals = np.abs(df.Charge1.values / df.Charge2.values - 1)
cond1 = vals <= 0.05
cond2 = np.isclose(vals, 0.05, atol=1e-08)
df['Indicator'] = np.where(cond1 | cond2, 'RFP', '')
Let's say I have a dataframe comprised of 14 columns with all cells as strings.
Some of these strings are actual words with letters. I would like to keep these columns as strings (indexes 0, 3, 4, and 13).
Some of these strings are whole numbers with no decimal place. I would like to convert these columns into ints (indexes 1:2, 5:7, 9:10, 12).
Finally, the remaining strings are numbers with decimal places. I want to convert these remaining columns to floats (indexes 6, 8, 11).
Here's a sample from the dataframe:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 Joe Blow 1947 29 CLE Q 4 1 0.3 4 5 6.5 2.4 Joe.htm
1 Ed Blow 1972 24 HOU Q 18 1 0.8 4 2 2.5 Ed.htm
2 Jim Blow 1974 23 CHI Q 18 3 2.2 2 0.8 3.83 Jim.htm
3 Al Blow 1995 STL Q 16 2 5 1 3.1 4.5 Frank.htm
4 Tom Blow 1969 23 DET Q 14 1 0.8 3 0 2.4 4.0 Tom.htm
[5 rows x 14 columns]
You can use to_numeric with combine_first; note that columns containing NaN are converted from int to float, by design:
df = df.apply(pd.to_numeric, errors='coerce').combine_first(df)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 Joe Blow 1947 29.0 CLE Q 4 1 0.3 4 5.0 6.5 2.40 Joe.htm
1 Ed Blow 1972 24.0 HOU Q 18 1 0.8 4 2.0 2.5 NaN Ed.htm
2 Jim Blow 1974 23.0 CHI Q 18 3 2.2 2 NaN 0.8 3.83 Jim.htm
3 Al Blow 1995 NaN STL Q 16 2 NaN 5 1.0 3.1 4.50 Frank.htm
4 Tom Blow 1969 23.0 DET Q 14 1 0.8 3 0.0 2.4 4.00 Tom.htm
print (df.dtypes)
0 object
1 object
2 int64
3 float64
4 object
5 object
6 int64
7 int64
8 float64
9 int64
10 float64
11 float64
12 float64
13 object
dtype: object
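If keeping integer dtypes in the presence of NaN matters, newer pandas versions (0.24+) offer the nullable Int64 extension dtype; a minimal sketch:
import pandas as pd

s = pd.Series([1, 2, None])
print(s.dtype)                  # float64: NaN forces the column to float
print(s.astype('Int64').dtype)  # Int64: nullable integer, missing value kept as <NA>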
I'm creating a dataframe by paring down a very large dataframe (approximately 400 columns) based on choices an end user makes on a picklist. One of the picklist choices is the type of denominator that the end user would like. Here is one example table with all the information before the final calculation is made.
county _tcount _tvote _f_npb_18_count _f_npb_18_vote
countycode
35 San Benito 28194 22335 2677 1741
36 San Bernardino 912653 661838 108724 61832
countycode _f_npb_30_count _f_npb_30_vote
35 384 288
36 76749 53013
However, I am having trouble creating code that will automatically divide every column starting with the 5th (not including the index) by the column before it (skipping every other column). I've seen examples (Divide multiple columns by another column in pandas), but they all use fixed column names, which is not achievable for this aspect. I've been able to divide variable columns (selected by position) by fixed columns, but not variable columns by other variable columns based on position. I've tried modifying the code in the above link based on the column positions:
calculated_frame = [county_select_frame[county_select_frame.columns[5::2]].div(county_select_frame[4::2], axis=0)]
output:
[ county _tcount _tvote _f_npb_18_count _f_npb_18_vote \
countycode
35 NaN NaN NaN NaN NaN
36 NaN NaN NaN NaN NaN]
RuntimeWarning: invalid value encountered in greater
(abs_vals > 0)).any()
The use of [5::2] does work when the divisor is a fixed field. If I can't get this to work, it's not a big deal (but it would be great to have all the options I wanted).
My preference would be to set the index and use filter to split out counts and votes dataframes separately; renaming strips the suffixes so the division aligns column by column. Then use join:
d1 = df.set_index('county', append=True)
counts = d1.filter(regex=r'.*_\d+_count$').rename(columns=lambda x: x.replace('_count', ''))
votes = d1.filter(regex=r'.*_\d+_vote$').rename(columns=lambda x: x.replace('_vote', ''))
d1[['_tcount', '_tvote']].join(votes / counts)
_tcount _tvote _f_npb_18 _f_npb_30
countycode county
35 San Benito 28194 22335 0.650355 0.750000
36 San Bernardino 912653 661838 0.568706 0.690732
I think you can divide the underlying numpy arrays obtained with .values, because then pandas does not align column names. Then create a new DataFrame with the constructor:
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
Sample:
np.random.seed(10)
county_select_frame = pd.DataFrame(np.random.randint(10, size=(10,10)),
columns=list('abcdefghij'))
print (county_select_frame)
a b c d e f g h i j
0 9 4 0 1 9 0 1 8 9 0
1 8 6 4 3 0 4 6 8 1 8
2 4 1 3 6 5 3 9 6 9 1
3 9 4 2 6 7 8 8 9 2 0
4 6 7 8 1 7 1 4 0 8 5
5 4 7 8 8 2 6 2 8 8 6
6 6 5 6 0 0 6 9 1 8 9
7 1 2 8 9 9 5 0 2 7 3
8 0 4 2 0 3 3 1 2 5 9
9 0 1 0 1 9 0 9 2 1 1
arr = county_select_frame.values
df1 = pd.DataFrame(arr[:,5::2] / arr[:,4::2], columns = county_select_frame.columns[5::2])
print (df1)
f h j
0 0.000000 8.000000 0.000000
1 inf 1.333333 8.000000
2 0.600000 0.666667 0.111111
3 1.142857 1.125000 0.000000
4 0.142857 0.000000 0.625000
5 3.000000 4.000000 0.750000
6 inf 0.111111 1.125000
7 0.555556 inf 0.428571
8 1.000000 2.000000 1.800000
9 0.000000 0.222222 1.000000
How about something like:
cols = my_df.columns
for i in range(2, 6):
    print('Creating new col %s' % cols[i])  # log which column is being created
    my_df['new_{0}'.format(cols[i])] = my_df[cols[i]] / my_df[cols[i-1]]
I am writing some routines that access scalars and vectors from a pandas dataframe and then set results back after some calculations.
Initially I used the form df[var][index] to do this, but encountered problems with chained assignment (http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-view-versus-copy), so I changed it to df.loc[index, var], which solved the view/copy problem but is very slow. For arrays I convert to a pandas Series and use the built-in df.update(). I am now searching for the fastest/best way of doing this without having to worry about chained assignment. The documentation says that, for example, df.at[] is the quickest way to access scalars. Does anyone have experience with this, or can point to some literature that can help?
Thanks
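For reference, the scalar accessors I am comparing look like this (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'CAP1': [1.0, 2.0]},
                  index=pd.period_range('2016Q1', freq='Q', periods=2))
v = df.loc[df.index[0], 'CAP1']   # what I use now: correct but slow
v = df.at[df.index[0], 'CAP1']    # .at: documented fast path for a single scalar
df.at[df.index[1], 'CAP1'] = 5.0  # scalar write, no chained-assignment worries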
Edit: Code looks like this, which I think is pretty standard.
def set_var(self, name, periode, value):
    try:
        if name.upper() not in self.data:
            self.data[name.upper()] = np.nan  # create the column if it does not exist yet
        self.data.loc[periode, name.upper()] = value
    except Exception:
        print('Failed to set ' + name)
def get_var(self, name, periode):
    ''' Get value '''
    value = self.data.loc[periode, name.upper()]
    return value

def set_series(self, data, index):
    ''' Set a whole series at once via update '''
    outputserie = pd.Series(data, index)
    self.data.update(outputserie)
dataframe looks like this:
SC0.data
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 148 entries, 1980Q1 to 2016Q4
Columns: 3111 entries, CAP1 to CHH_DRD
dtypes: float64(3106), int64(2), object(3)
Edit 2: a df could look like
var var1
2012Q4 0.462015 0.01585
2013Q1 0.535161 0.01577
2013Q2 0.735432 0.01401
2013Q3 0.845959 0.01638
2013Q4 0.776809 0.01657
2014Q1 0.000000 0.01517
2014Q2 0.000000 0.01593
and I basically want to perform two operations:
1) perhaps update var1 with the same scalar over all periods
2) solve var in 2014Q1 as var[2014Q1] = var1[2013Q3] / var1[2013Q4] * var[2013Q4]
This is done as part of a bigger model setup, which is read from a txt file. Since I'm doing loads of these calculations, the speed of setting and reading data matters.
The example you gave above can be vectorized.
In [3]: df = DataFrame(dict(A = np.arange(10), B = np.arange(10)),index=pd.period_range('2012',freq='Q',periods=10))
In [4]: df
Out[4]:
A B
2012Q1 0 0
2012Q2 1 1
2012Q3 2 2
2012Q4 3 3
2013Q1 4 4
2013Q2 5 5
2013Q3 6 6
2013Q4 7 7
2014Q1 8 8
2014Q2 9 9
Assign a scalar
In [5]: df['A'] = 5
In [6]: df
Out[6]:
A B
2012Q1 5 0
2012Q2 5 1
2012Q3 5 2
2012Q4 5 3
2013Q1 5 4
2013Q2 5 5
2013Q3 5 6
2013Q4 5 7
2014Q1 5 8
2014Q2 5 9
Perform a shifted operation
In [8]: df['C'] = df['B'].shift()/df['B'].shift(2)
In [9]: df
Out[9]:
A B C
2012Q1 5 0 NaN
2012Q2 5 1 NaN
2012Q3 5 2 inf
2012Q4 5 3 2.000000
2013Q1 5 4 1.500000
2013Q2 5 5 1.333333
2013Q3 5 6 1.250000
2013Q4 5 7 1.200000
2014Q1 5 8 1.166667
2014Q2 5 9 1.142857
Using a vectorized assignment
In [10]: df.loc[df['B']>5,'D'] = 'foo'
In [11]: df
Out[11]:
A B C D
2012Q1 5 0 NaN NaN
2012Q2 5 1 NaN NaN
2012Q3 5 2 inf NaN
2012Q4 5 3 2.000000 NaN
2013Q1 5 4 1.500000 NaN
2013Q2 5 5 1.333333 NaN
2013Q3 5 6 1.250000 foo
2013Q4 5 7 1.200000 foo
2014Q1 5 8 1.166667 foo
2014Q2 5 9 1.142857 foo