I've got a dataframe containing country names & their percentage of energy output.
I need to add a new column that assigns a 1 or 0, based on whether the country's energy output is above or below the median of energy output. Some dummy code is:
import pandas as pd
def answer():
    df = pd.DataFrame({'name': ['china', 'america', 'canada'], 'output': [33.2, 15.0, 5.0]})
    df['newcol'] = df.where(df['output'] > df['output'].median(), 1, 0)
    return df['newcol']
answer()
the code returns
ValueError: Wrong number of items passed 2, placement implies 1
I feel like this is an incredibly simple fix but I'm new to working with Pandas.
Please help end my frustration
Vaishali's answer explains why pd.DataFrame.where didn't work as you expected and suggests using np.where instead, which is very good advice.
I'll offer up that you could have simply converted your boolean result to integers.
Setup
df = pd.DataFrame({
    'name': ['china', 'america', 'canada'],
    'output': [33.2, 15.0, 5.0]
})
Option 1
df['newcol'] = (df['output'] > df['output'].median()).astype(int)
Option 2
Or, faster yet, use the underlying NumPy arrays:
import numpy as np
o = df['output'].values
df['newcol'] = (o > np.median(o)).astype(int)
You don't need a loop, as the solution is vectorized:
df['newcol'] = np.where(df['output'] > df['output'].median(), 1, 0)
name output newcol
0 china 33.2 1
1 america 15.0 0
2 canada 5.0 0
Regarding the error "Wrong number of items passed", df.where works a little differently from np.where. It returns an object of the same shape as self, whose entries are taken from self where cond is True and from other elsewhere. So in your case it returns a DataFrame with two columns instead of a Series, and when you try to assign that DataFrame to a single column you get the error message.
In Python, I am trying to create a new column (degree) within a DataFrame and set its value with if logic based on two other columns (depending on whether single rows of one or both of those columns are null or not). For each row, the new column should take the value of one of those two columns, depending on where the null values are.
I have tried the below code, which gives me the following error message:
KeyError: 'degree'
The code is -
for i in basicdataframe.index:
    if pd.isnull(basicdataframe['section_degree'][i]) and pd.isnull(basicdataframe['model_degree'][i]):
        basicdataframe['degree'][i] = basicdataframe['model_degree'][i]
    elif pd.notnull(basicdataframe['section_degree'][i]) and pd.isnull(basicdataframe['model_degree'][i]):
        basicdataframe['degree'][i] = basicdataframe['section_degree'][i]
    elif pd.isnull(basicdataframe['section_degree'][i]) and pd.notnull(basicdataframe['model_degree'][i]):
        basicdataframe['degree'][i] = basicdataframe['model_degree'][i]
    elif pd.notnull(basicdataframe['section_degree'][i]) and pd.notnull(basicdataframe['model_degree'][i]):
        basicdataframe['degree'][i] = basicdataframe['model_degree'][i]
Does anybody know how to achieve this?
Let's say you have a pandas DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={
    "section_degree": [1, 2, np.nan, np.nan],
    "model_degree": [np.nan, np.nan, np.nan, 3]
})
You can define a function that will be applied to the DataFrame:
def define_degree(x):
    if pd.isnull(x["section_degree"]) and pd.isnull(x["model_degree"]):
        return x["model_degree"]
    elif pd.notnull(x["section_degree"]) and pd.isnull(x["model_degree"]):
        return x["section_degree"]
    elif pd.isnull(x["section_degree"]) and pd.notnull(x["model_degree"]):
        return x["model_degree"]
    elif pd.notnull(x["section_degree"]) and pd.notnull(x["model_degree"]):
        return x["model_degree"]
df["degree"] = df.apply(define_degree, axis=1)
df
# output
section_degree model_degree degree
0 1.0 NaN 1.0
1 2.0 NaN 2.0
2 NaN NaN NaN
3 NaN 3.0 3.0
The error occurs because you are trying to assign values to a column that does not exist yet.
Since you are creating a new column named degree, it makes sense to add the column first with some default value.
basicdataframe['degree'] = ''
This would set an empty string for all rows of the dataframe for this column.
After that, you can set the values.
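For illustration, here is a minimal sketch of that approach (the data is made up to mirror the question, and .loc is used for the per-row assignment, which also avoids the chained-indexing warning mentioned below):
import pandas as pd
import numpy as np

basicdataframe = pd.DataFrame({
    'section_degree': [1, 2, np.nan, np.nan],
    'model_degree': [np.nan, np.nan, np.nan, 3]
})

# Add the column first with a default value, then assign row by row with .loc
basicdataframe['degree'] = np.nan
for i in basicdataframe.index:
    if pd.notnull(basicdataframe.loc[i, 'section_degree']) and pd.isnull(basicdataframe.loc[i, 'model_degree']):
        basicdataframe.loc[i, 'degree'] = basicdataframe.loc[i, 'section_degree']
    else:
        # All remaining cases in the original logic fall back to model_degree
        basicdataframe.loc[i, 'degree'] = basicdataframe.loc[i, 'model_degree']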
P.S. Your code is likely to give you warnings about
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
To fix that, you could take help from https://stackoverflow.com/a/20627316/1388513
I have a DataFrame object df, and I would like to modify the job column so that all retired people get 1 and the rest 0, like this:
df['job'] = df['job'].apply(lambda x: 1 if x == "retired" else 0)
But I get a warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Why did I get it here though? From what I read, it applies to situations where I take a slice of rows and then a column, but here I am just modifying elements in a row. Is there a better way to do that?
Use:
df['job']=df['job'].eq('retired').astype(int)
or
df['job']=np.where(df['job'].eq('retired'),1,0)
So here's an example dataframe:
import pandas as pd
import numpy as np
data = {'job':['retired', 'a', 'b', 'retired']}
df = pd.DataFrame(data)
print(df)
job
0 retired
1 a
2 b
3 retired
Now, you can make use of numpy's where function:
df['job'] = np.where(df['job']=='retired', 1, 0)
print(df)
job
0 1
1 0
2 0
3 1
I would not suggest using apply here, as in the case of a large data frame it could lower your performance.
I would prefer using numpy.select or numpy.where.
See the documentation for numpy.select and numpy.where.
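As a rough sketch of the numpy.select variant (the data and column name simply mirror the example above; numpy.select is more general than needed here, but it shows the pattern):
import pandas as pd
import numpy as np

df = pd.DataFrame({'job': ['retired', 'a', 'b', 'retired']})

# np.select takes a list of conditions and a list of corresponding values;
# the default is used wherever no condition matches.
df['job'] = np.select([df['job'] == 'retired'], [1], default=0)
print(df)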
I am trying to split my dataframe into two based on medical_plan_id. If it is empty, into df1. If not empty, into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How to handle such situation?
My df_with_medicalplanid looks like below:
wellthie_issuer_identifier ... medical_plan_id
0 UHC99806 ... None
1 UHC99806 ... None
Use ==, not is, to test equality
Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
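A quick illustration of the difference with plain Python objects (not pandas-specific):
a = [1, 2, 3]
b = [1, 2, 3]

print(a == b)  # True: the two lists have equal values
print(a is b)  # False: they are two distinct objects
print(a is a)  # True: same object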
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
Empty strings are different from null values
Equality with an empty string can be tested via == '', but testing for null values requires a specialized method: pd.Series.isnull. This is because Pandas represents null values in its underlying NumPy arrays as np.nan, and np.nan != np.nan by design.
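A small sketch of the distinction, using made-up data:
import pandas as pd
import numpy as np

s = pd.Series(['', np.nan, 'x'])

print(np.nan == np.nan)     # False, by design
print((s == '').tolist())   # [True, False, False] -- only the empty string
print(s.isnull().tolist())  # [False, True, False] -- only the null value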
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
Final solution: avoid extra variables
Creating additional variables is something, as a programmer, you should look to avoid. In this case, there's no need to create two new variables, you can use GroupBy with dict to give a dictionary of dataframes with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). A variant of the above, you can forego dictionary construction and use Pandas GroupBy methods:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Example
Putting all the above in action:
df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
medical_plan_id values
2 2134.0 3
3 4325.0 4
4 6543.0 5
medical_plan_id values
0 NaN 1
1 NaN 2
5 NaN 6
6 NaN 7
Another variant is to unpack df.groupby, which returns an iterator of tuples (the first item being the group key and the second the dataframe).
Like this for instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
In Python, _ is conventionally used for variables you are not interested in keeping. I have split the code over two lines for readability.
Full example
import pandas as pd
df_with_medicalplanid = pd.DataFrame({
    'medical_plan_id': ['214212', '', '12251', '12421', ''],
    'value': 1
})
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
medical_plan_id value
0 214212 1
2 12251 1
3 12421 1
I have a pandas DataFrame of mixed types; some columns are strings and some are numbers. I would like to replace the NaN values in string columns with '.', and the NaN values in float columns with 0.
Consider this small fictitious example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Jack', 'Sue', np.nan, 'Bob', 'Alice', 'John'],
                   'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'B': [.25, np.nan, np.nan, 4, 12.2, 14.4],
                   'City': ['Seattle', 'SF', 'LA', 'OC', np.nan, np.nan]})
Now, I can do it in 3 lines:
df['Name'].fillna('.',inplace=True)
df['City'].fillna('.',inplace=True)
df.fillna(0,inplace=True)
Since this is a small dataframe, 3 lines is probably OK. In my real example (which I cannot share here due to data confidentiality reasons), I have many more string columns and numeric columns, so I end up writing many lines just for fillna. Is there a concise way of doing this?
Came across this page while looking for an answer to this problem, but didn't like the existing answers. I ended up finding something better in the DataFrame.fillna documentation, and figured I'd contribute for anyone else that happens upon this.
If you have multiple columns, but only want to replace the NaN in a subset of them, you can use:
df.fillna({'Name':'.', 'City':'.'}, inplace=True)
This also allows you to specify different replacements for each column. And if you want to go ahead and fill all remaining NaN values, you can just throw another fillna on the end:
df.fillna({'Name':'.', 'City':'.'}, inplace=True).fillna(0, inplace=True)
Edit (22 Apr 2021)
Functionality (presumably / apparently) changed since original post, and you can no longer chain 2 inplace fillna() operations. You can still chain, but now must assign that chain to the df instead of modifying in place, e.g. like so:
df = df.fillna({'Name':'.', 'City':'.'}).fillna(0)
You could use apply over your columns, checking whether each column is numeric by looking at dtype.kind:
res = df.apply(lambda x: x.fillna(0) if x.dtype.kind in 'biufc' else x.fillna('.'))
print(res)
A B City Name
0 1.0 0.25 Seattle Jack
1 2.1 0.00 SF Sue
2 0.0 0.00 LA .
3 4.7 4.00 OC Bob
4 5.6 12.20 . Alice
5 6.8 14.40 . John
You can either list the string columns by hand or glean them from df.dtypes. Once you have the list of string/object columns, you can call fillna on all those columns at once.
# str_cols = ['Name','City']
str_cols = df.columns[df.dtypes==object]
df[str_cols] = df[str_cols].fillna('.')
df = df.fillna(0)
Define a function:
import numpy as np

def myfillna(series):
    if series.dtype == np.dtype(float):
        return series.fillna(0)
    elif series.dtype == np.dtype(object):
        return series.fillna('.')
    else:
        return series
You can add other elif statements if you want to fill columns of a different dtype in some other way. Now apply this function over all columns of the dataframe:
df = df.apply(myfillna)
Reassigning to df gives the same result as an in-place fill.
The most concise and readable way to accomplish this, especially with many columns, is to use df.select_dtypes(...).columns.
df.select_dtypes returns a new DataFrame containing only the columns that match the requested dtype, and .columns gives the names of those columns.
Full code:
float_column_names = df.select_dtypes(float).columns
df[float_column_names] = df[float_column_names].fillna(0)
string_column_names = df.select_dtypes(object).columns
df[string_column_names] = df[string_column_names].fillna('.')
There is a simpler way, that can be done in one line:
df.fillna({'Name':0,'City':0},inplace=True)
Not an awesome improvement, but if you scale it up to 100 columns, writing only the column names plus ':0' is much faster than copying and pasting everything 100 times.
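If many columns share the same fill value, a dict comprehension keeps this to one line. A small sketch (string_cols here is a hypothetical list of your string column names):
string_cols = ['Name', 'City']  # hypothetical: whichever columns should get '.'
df.fillna({col: '.' for col in string_cols}, inplace=True)
df.fillna(0, inplace=True)  # everything still missing gets 0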
If you want to fill a list of columns ("lst") with the same value ("v"):
def nan_to_zero(df, lst, v):
    d = {x: v for x in lst}
    df.fillna(d, inplace=True)
    return df
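For example, a usage sketch with the column names from the question's made-up data (the calls below are just illustrative):
df = nan_to_zero(df, ['A', 'B'], 0)
df = nan_to_zero(df, ['Name', 'City'], '.')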
If you don't want to specify individual per-column replacement values, you can do it this way:
df[['Name', 'City']].fillna('.',inplace=True)
If you don't like inplace (like me) you can do it like this:
columns = ['Name', 'City']
df[columns] = df.copy()[columns].fillna('.')
The .copy() is added to avoid the SettingWithCopyWarning, which is designed to warn you when the original values of a dataframe are overwritten, which is what we want here.
If you don't like that syntax, you can see this question to see other ways of dealing with this: How to deal with SettingWithCopyWarning in Pandas
A much easier way is: df.replace(np.nan, "NA").
In case you want a different replacement, use: df.replace("pattern", "new value").
I am basically trying to set a value in a DataFrame using iloc with an index and a column name. Here is the test code:
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(type(df['a'][0]))  # numpy.int64
df.iloc[-1]['a'] = 2000
print(df)  # value changed to 2000
df['c'] = [3.5, 4.5]
print(type(df['a'][0]))  # numpy.float64 -> why does this change automatically?
print(type(df['c'][0]))  # numpy.float64
df.iloc[-1]['c'] = 2000  # yields warning, no value change
print(df)
df.iloc[-1]['a'] = 4000  # yields warning, no value change
print(df)
With Int64, I can do it, but not with Float64. Is there an alternative? Or is this a bug?
Thanks
Jeff's answer is correct, but not optimal pandas syntax. It is considerably faster to use .at instead of .loc (and likewise .iat instead of .iloc). If you're using this to get or set a lot of values, the time savings can really add up. The catch is that .at and .iat are meant only for getting and setting individual values, so you can't get or set a range of values or retrieve an entire column or row. Since all you want to do is change one value, I would recommend:
df.at[df.index[-1], 'a'] = 4000
If for whatever reason you only have the label-based location for one axis and the integer-based location for the other, you can use df.columns.get_loc('column label') to get the integer-based location of that column, or likewise df.index.get_loc('row label') to get the integer-based location of that row.
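For example, a small sketch mixing the two lookups (assuming the df from the question):
# Integer positions of the last row and of column 'a', then set with .iat
row_pos = len(df) - 1
col_pos = df.columns.get_loc('a')
df.iat[row_pos, col_pos] = 4000

# Or mix a label-based lookup with .at
df.at[df.index[-1], 'a'] = 4000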
The warning is telling you that what you are doing is unsafe, and that is because you have a mixed-type frame. Instead, use loc for this. See the docs on why this is a bad idea and may not work (which it doesn't here): http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [12]: df.loc[df.index[-1],'a'] = 4000
In [13]: df
Out[13]:
a b c
0 1 3 3.5
1 4000 4 4.5
[2 rows x 3 columns]