Create dummy variable column from value column - python

I know that Pandas has a get_dummy function which you can use to convert categorical variables to dummy variables in a DataFrame. What I'm trying to do is slightly different.
I have a column containing percentage values from 0.0 to 100.0. I need to convert this to a column that has 1s for any value >= 10.0 and 0s for any value < 10.0. Is there a good way to repurpose get_dummy for this, or will I have to write a loop?

You can convert bools to ints directly:
(df.column_of_interest >= 10).astype(int)

I assume you're discussing pandas.get_dummies here, and I don't think this is a use case for it. You are setting one of two values based on a boolean condition. One approach is to build a boolean Series and take its integer representation as the indicator:
df['indicators'] = (df.percentages >= 10.).astype('int')
Demo
>>> df
percentages
0 70.176341
1 70.638246
2 55.078803
3 42.586290
4 73.340089
5 53.308670
6 3.059331
7 49.494812
8 10.379713
9 7.676286
10 55.023261
11 4.417545
12 51.744169
13 49.513638
14 39.189640
15 90.521703
16 29.696734
17 11.546118
18 5.737921
19 83.258049
>>> df['indicators'] = (df.percentages >= 10.).astype('int')
>>> df
percentages indicators
0 70.176341 1
1 70.638246 1
2 55.078803 1
3 42.586290 1
4 73.340089 1
5 53.308670 1
6 3.059331 0
7 49.494812 1
8 10.379713 1
9 7.676286 0
10 55.023261 1
11 4.417545 0
12 51.744169 1
13 49.513638 1
14 39.189640 1
15 90.521703 1
16 29.696734 1
17 11.546118 1
18 5.737921 0
19 83.258049 1

Let's assume you have a dataframe df, with a column Perc that contains your percentages:
import pandas as pd
import numpy as np
np.random.seed(111)
df = pd.DataFrame({"Perc": np.random.uniform(1, 100, 20)})
Now, you can easily form a new column by using a lambda function that recodes your percentages, like so:
df["Category"] = df.Perc.apply(lambda x: 0 if x < 10.0 else 1)

Related

How to create a new column based on a condition in another column

In pandas, how can I create a new column B based on a column A in df, such that:
B = 1 if A_(i+1) - A_(i) > 5 or A_(i) <= 10
B = 0 if A_(i+1) - A_(i) <= 5
However, the first B_i value is always 1.
Example:
A    B
5    1 (the first B_i)
12   1
14   0
22   1
20   0
33   1
Use diff with an le comparison against your threshold, then convert from boolean to int:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB: using an le(5) comparison and inverting it yields 1 for the first value.
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
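To see why the inversion handles the first row, note that diff() produces NaN there, NaN.le(5) evaluates to False, and the negation flips it to True. A minimal check, assuming the same df as above:
print(df['A'].diff())         # first value is NaN
print(df['A'].diff().le(5))   # NaN compares as False
print(~df['A'].diff().le(5))  # the inversion turns the first row into True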
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5) | df['A'].lt(10)).astype(int)
output: same as above with the provided data
I was a little confused by your row numbering, because if we compute B_i from the condition A_(i+1) - A_(i), the missing value should be on the last row rather than the first (the first row has both A_(i) and A_(i+1), while the last row is missing A_(i+1)).
Anyway, based on your example, I assumed that we calculate B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"],data=[5,12,14,22,20,33])
df['shifted_A'] = df['A'].shift(1)  # This row can be removed - it was added only to show how shift works on the final dataframe
df['B']=''
df.loc[((df['A']-df['A'].shift(1))>5) + (df['A'].shift(1)<=10), 'B']=1 #Update rows that fulfill one of conditions with 1
df.loc[(df['A']-df['A'].shift(1))<=5, 'B']=0 #Update rows that fulfill condition with 0
df.loc[df.index==0, 'B']=1 #Update first row in B column
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure if it is the fastest way, but I guess it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname] = newvalue lets us update values in the given column wherever the condition (mask) is fulfilled.
((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10)
Each condition returns True or False. If we add them, the result is True if either one is True (which is simply OR). If we need AND instead, we can multiply the conditions.
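A quick illustration of that behaviour with boolean Series (a small sketch, not part of the original answer):
import pandas as pd

a = pd.Series([True, False, False])
b = pd.Series([True, True, False])
print(a + b)  # element-wise OR:  True, True, False
print(a * b)  # element-wise AND: True, False, False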
Use Series.diff and fill the first missing value with N, so that the first row passes the greater-or-equal comparison done with Series.ge:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print(df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1

Filtering rows that have a unique value in a column using pandas

I have a df:
id value
1 10
2 15
1 10
1 10
2 13
3 10
3 20
I am trying to keep only rows that have 1 unique value in column value so that the result df looks like this:
id value
1 10
1 10
1 10
I dropped id = 2 and 3 because they have more than 1 unique value in column value (15, 13 and 10, 20 respectively).
I read this answer.
But this simply removes duplicates whereas I want to check if a given column - in this case column value has more than 1 unique value.
I tried:
df['uniques'] = pd.Series(df.groupby('id')['value'].nunique())
But this returns NaN for every row, since the groupby produces one value per group and those n values don't align back onto the n + m original rows. I could write a function and apply it to every row, but I was wondering if there is a smart, quick filter that achieves my goal.
Use transform with groupby to align the group values to the individual rows:
df['nuniques'] = df.groupby('id')['value'].transform('nunique')
Output:
id value nuniques
0 1 10 1
1 2 15 2
2 1 10 1
3 1 10 1
4 2 13 2
5 3 10 2
6 3 20 2
If you only need to filter your data, you don't need to assign the new column:
df[df.groupby('id')['value'].transform('nunique') == 1]
Let us use filter:
out = df.groupby('id').filter(lambda x : x['value'].nunique()==1)
Out[6]:
id value
0 1 10
2 1 10
3 1 10

Pandas - groupby ints that are contiguous

So I have a df that looks like this:
some_int another_int
0 1 5
1 2 6
2 10 7
3 11 8
4 15 9
So I want to perform a groupby operation on those elements that have a diff of only 1 between each other. Let's say I want to group by some_int (diff of 1) and perform a sum on another_int. By doing that I would obtain something like:
sum
0    5 + 6 = 11
1    7 + 8 = 15
2    15 = 15
What is the most pythonic way to do so? I tried creating a diff mask, shifting it, and OR-ing the results together, but that seems rather verbose. What do you think?
I suggest making a new column called group:
df['group'] = (df.some_int.diff() > 1).cumsum()
then you can groupby this column and apply a custom function that returns the sum of another_int or the single values in some_int:
def sum_or_val(x):
    # Sum another_int for multi-row groups; for a single-row group,
    # return its some_int value instead.
    if len(x) > 1:
        return sum(x['another_int'])
    return x['some_int'].values[0]
grouped = df.groupby('group').apply(sum_or_val)
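A quick end-to-end check (a sketch assuming the sample frame from the question and the sum_or_val function above); the expected result is shown in the comments:
import pandas as pd

df = pd.DataFrame({'some_int': [1, 2, 10, 11, 15],
                   'another_int': [5, 6, 7, 8, 9]})
df['group'] = (df.some_int.diff() > 1).cumsum()
grouped = df.groupby('group').apply(sum_or_val)
print(grouped)
# group
# 0    11    <- 5 + 6
# 1    15    <- 7 + 8
# 2    15    <- single row, so some_int is returned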

How to fill an alphanumeric series in a column in a pandas dataframe?

I have certain pandas dataframe which has a structure like this
A B C
1 2 2
2 2 2
...
I want to create a new column called ID and fill it with an alphanumeric series which looks somewhat like this
ID A B C
GT001 1 2 2
GT002 2 2 2
GT003 2 2 2
...
I know how to fill it with either alphabets or numerals, but I couldn't figure out if there is a "pandas native" method which would allow me to fill an alphanumeric series. What would be the best way to do this?
Welcome to Stack Overflow!
If you want a custom ID, then you have to create a list with the desired index:
ids = []  # avoid naming this "list", which would shadow the built-in
for i in range(1, df.shape[0] + 1):  # df.shape[0] is the number of rows in the DataFrame
    ids.append(f'GT{i:03d}')  # f-string formatting; 03d gives leading zeros
df['ID'] = ids
And if you want to set that as an index do df.set_index('ID', inplace=True)
import pandas as pd
import numpy as np
df = pd.DataFrame({'player': np.linspace(0, 20, 20)})
n = len(df) + 1  # one ID per row
data = ['GT' + '0' * (3 - len(str(i))) + str(i) for i in range(1, n)]
df['ID'] = data
Output:
player ID
0 0.000000 GT001
1 1.052632 GT002
2 2.105263 GT003
3 3.157895 GT004
4 4.210526 GT005
5 5.263158 GT006
6 6.315789 GT007
7 7.368421 GT008
8 8.421053 GT009
9 9.473684 GT010
10 10.526316 GT011
11 11.578947 GT012
12 12.631579 GT013
13 13.684211 GT014
14 14.736842 GT015
15 15.789474 GT016
16 16.842105 GT017
17 17.894737 GT018
18 18.947368 GT019
19 20.000000 GT020
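If you'd rather stay with pandas string methods, a possible alternative (a sketch, not from the original answers) is to zero-pad a numeric Series with str.zfill and prefix it; this assumes the same df as above:
# Build '001', '002', ... with str.zfill, then prepend the 'GT' prefix.
df['ID'] = 'GT' + pd.Series(range(1, len(df) + 1), index=df.index).astype(str).str.zfill(3)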

Pandas: Replace/ Change Duplicate values within a Time Range

I have a pandas DataFrame where I am trying to replace/change duplicate values to 0 (I don't want to delete the values) within a certain range of days.
So, in the example given below, I want to replace duplicate values in all columns with 0 within a range of, let's say, 3 days (the number can be changed). The desired result is also given below.
A B C
01-01-2011 2 10 0
01-02-2011 2 12 2
01-03-2011 2 10 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 5 23 1
01-07-2011 4 21 4
01-08-2011 2 21 5
01-09-2011 1 11 0
So, the output should look like
A B C
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
Any help will be appreciated.
You can use df.shift() for this to look at a value from a row up or down (or several rows, specified by the number x in .shift(x)).
You can use that in combination with .loc to select all rows that have an identical value to one of the 2 rows above, and then replace it with 0.
Something like this should work :
(edited the code to make it flexible for any number of columns and for the number of days)
numberOfDays = 3  # number of days to compare
for col in df.columns:
    for x in range(1, numberOfDays):
        df.loc[df[col] == df[col].shift(x), col] = 0
print(df)
This gives me the output:
A B C
date
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
I don't find anything better than looping over all columns, because every column leads to a different grouping.
First define a function which does what you want at grouped level, i.e. setting all but the first entry to zero:
def set_zeros(g):
    g.values[1:] = 0
    return g

for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
This custom function can be applied to each group, which is defined by a time range (freq='3D') and equal values of a column within this period. As the columns generally have their equal values in different rows, this has to be done for each column in a loop.
Change freq to 5D, 10D or 20D for your other considerations.
For a detailed description of how to define the time period see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
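Note that pd.Grouper(freq='3D') groups on the frame's index, so this assumes the dates form a DatetimeIndex. If they are plain strings, a small setup sketch (assuming the MM-DD-YYYY format shown above) would be:
# Convert the string index to datetimes so pd.Grouper can bucket by 3-day periods.
df.index = pd.to_datetime(df.index, format='%m-%d-%Y')
df = df.sort_index()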
