Some columns in my data set have missing values that are represented as None (the NoneType object, not the string 'None'). Some other missing values are represented as 'N/A' or 'No'. I want to be able to handle these missing values with the method below.
df.loc[df.col1.isin('None', 'Yes', 'No'), col1] = 'N/A'
Now my problem is that None is a value, not a string, so I cannot refer to it as 'None'. I have read somewhere that we can convert that None value to the string 'None'.
Can anyone kindly give me a clue how to go about it?
Note 1:
Just for clarity of explanation if I run below code:
df.col1.unique()
I get this output:
array([None, 'No', 'Yes'], dtype=object)
Note 2:
I know I can handle missing or None values with isnull(), but in this case I need to use the .isin() method.
Sample dataframe:
f = {'name': ['john', 'tom', None, 'rock', 'dick'], 'DoB': [None, '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', None]}
df1 = pd.DataFrame(data = f)
When you run the code below you will see None as a value.
df1.Address.unique()
output: array(['NY', 'NJ', 'PA', None], dtype=object)
I want the None to be displayed as 'None'
There is a difference between a null/None and the string 'None'. So you can change your original statement to
df.loc[df.col1.isin([None, 'Yes', 'No']), 'col1'] = 'N/A'
That is, take the apostrophes off None and pass the values as a list.
Or you can first find all the indices where nulls/Nones exist, select those rows by index, and then use your original statement:
df.loc[df["col1"].isnull(), "col1"] = 'None'
Create an example df:
df = pd.DataFrame({"A": [None, 'Yes', 'No', 1, 3, 5]})
which looks like:
A
0 None
1 Yes
2 No
3 1
4 3
5 5
Replace your 'None' with None and pass the values to be replaced as a list (that is how isin works):
df.loc[df.A.isin([None, 'Yes', 'No']), 'A'] = 'N/A'
which returns:
A
0 N/A
1 N/A
2 N/A
3 1
4 3
5 5
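As a side note (not part of the answer above, just a sketch of an equivalent route): you can avoid isin entirely by filling the missing values first and then replacing the strings.
df['A'] = df['A'].fillna('N/A').replace({'Yes': 'N/A', 'No': 'N/A'})
print(df['A'].unique())  # array(['N/A', 1, 3, 5], dtype=object)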
Imagine I have the following pandas DataFrame:
df = pd.DataFrame({
'type': ['A', 'A', 'A', 'B', 'B', 'B'],
'value': [1, 2, 3, 4, 5, 6]
})
I want to adjust the first value when type == 'B' to 999, i.e. the fourth row's value should become 999.
Initially I imagined that
df.loc[df['type'] == 'B'].iloc[0, -1] = 999
or something similar would work. But as far as I can tell, slicing the df twice produces a copy that no longer points to the original df, so the value in the original df is not updated.
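To illustrate what I mean (just a sketch of what I believe happens):
import pandas as pd
df = pd.DataFrame({'type': ['A', 'A', 'A', 'B', 'B', 'B'], 'value': [1, 2, 3, 4, 5, 6]})
subset = df.loc[df['type'] == 'B']   # this slice is a copy, not a view of df
subset.iloc[0, -1] = 999             # only the copy gets modified
print(df.loc[3, 'value'])            # still 4 in the original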
My other attempt is
df.loc[df.loc[df['type'] == 'B'].index[0], df.columns[-1]] = 999
which works, but is quite ugly.
So I'm wondering -- what would be the best approach in such situation?
You can use idxmax, which returns the index of the first occurrence of the maximum value; applied to a boolean Series, that is the first True. Like this:
df.loc[(df['type'] == 'B').idxmax(), 'value'] = 999
Output:
type value
0 A 1
1 A 2
2 A 3
3 B 999
4 B 5
5 B 6
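Another option (a sketch that should be equivalent in effect): grab the first matching index label explicitly and assign through .loc.
first_b = df.index[df['type'] == 'B'][0]   # first index label where type == 'B'
df.loc[first_b, 'value'] = 999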
I have a dataset similar to this one:
Mother ID ChildID ethnicity
0 1 1 White Other
1 2 2 Indian
2 3 3 Black
3 4 4 Other
4 4 5 Other
5 5 6 Mixed-White and Black
To simplify my dataset and make it more relevant to the classification I am performing, I want to categorise ethnicities into 3 categories as such:
White: within this category I will include 'White British' and 'White Other' values
South Asian: the category will include 'Pakistani', 'Indian', 'Bangladeshi'
Other: 'Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian' values
So I want the above dataset to be transformed to:
Mother ID ChildID ethnicity
0 1 1 White
1 2 2 South Asian
2 3 3 Other
3 4 4 Other
4 4 5 Other
5 5 6 Other
To do this I have run the following code, similar to the one provided in this answer:
col = 'ethnicity'
conditions = [ (df[col] in ('White British', 'White Other')),
(df[col] in ('Indian', 'Pakistani', 'Bangladeshi')),
(df[col] in ('Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian'))]
choices = ['White', 'South Asian', 'Other']
df["ethnicity"] = np.select(conditions, choices, default=np.nan)
But when running this, I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any idea why I am getting this error? Am I not handling the string comparison correctly? I am using a similar technique to manipulate other features in my dataset and it is working fine there.
I cannot pinpoint why in is not working, but isin definitely solves the problem; maybe someone else can explain why in has a problem.
conditions = [ (df[col].isin(('White British', 'White Other'))),
(df[col].isin(('Indian', 'Pakistani', 'Bangladeshi'))),
(df[col].isin(('Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian')))]
print(conditions)
choices = ['White', 'South Asian', 'Other']
df["ethnicity"] = np.select(conditions, choices, default=np.nan)
print(df)
output
Mother ID ChildID ethnicity
0 1 1 White
1 2 2 South Asian
2 3 3 Other
3 4 4 Other
4 4 5 Other
5 5 6 nan
With df[col] in some_tuple you are searching for df[col] inside some_tuple, which is obviously not what you want. What you want is df[col].isin(some_tuple), which returns a new Series of booleans of the same length as df[col].
So why do you get that error anyway? Searching for a value in a tuple works more or less like the following:
for v in some_tuple:
    if df[col] == v:
        return True
return False
df[col] == v evaluates to a Series result; no problem here.
Then Python tries to evaluate if result: and you get that error, because you have a Series in a condition clause, meaning that you are (implicitly) trying to evaluate a whole Series as a single boolean; this is not allowed by pandas.
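A quick sketch to make that concrete:
import pandas as pd
s = pd.Series([True, False, True])
try:
    if s:                    # implicit bool(s) on a whole Series
        pass
except ValueError as exc:
    print(exc)               # The truth value of a Series is ambiguous. ...
print(s.any(), s.all())      # the explicit reductions pandas suggests: True False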
For your problem, anyway, I would use apply on the column. It takes a function that maps a value to another; in your case, a function that maps each ethnicity to a category. There are many ways to define it (see the options below).
import numpy as np
import pandas as pd
d = pd.DataFrame({
'field': range(6),
'ethnicity': list('ABCDE') + [np.nan]
})
# Option 1: define a dict {ethnicity: category}
category_of = {
'A': 'X',
'B': 'X',
'C': 'Y',
'D': 'Y',
'E': 'Y',
np.nan: np.nan,
}
result = d.assign(category=d['ethnicity'].apply(category_of.__getitem__))
print(result)
# Option 2: define categories, then "invert" the dict.
categories = {
'X': ['A', 'B'],
'Y': ['C', 'D', 'E'],
np.nan: [np.nan],
}
# If you do this frequently you could define a function invert_mapping(d):
category_of = {eth: cat
for cat, values in categories.items()
for eth in values}
result = d.assign(category=d['ethnicity'].apply(category_of.__getitem__))
print(result)
# Option 3: define a function (a little less efficient)
def ethnicity_to_category(ethnicity):
    if ethnicity in {'A', 'B'}:
        return 'X'
    if ethnicity in {'C', 'D', 'E'}:
        return 'Y'
    if pd.isna(ethnicity):
        return np.nan
    raise ValueError('unknown ethnicity: %s' % ethnicity)
result = d.assign(category=d['ethnicity'].apply(ethnicity_to_category))
print(result)
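A possible simplification (not one of the options above, so treat it as a sketch): Series.map accepts the dict directly. The difference from __getitem__ is that values missing from the dict silently become NaN instead of raising a KeyError.
result = d.assign(category=d['ethnicity'].map(category_of))
print(result)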
Basically, I would like to fill in column Discount_Sub_Dpt with 'Yes' or 'No' depending on whether there is a Discount for that Sub_Dpt for that week, EXCLUDING the product on that row. For instance, I don't want any of the A rows to consider whether there is a Discount for that week for A itself, but only for the other products in that sub-department (in most cases there is more than one other product).
I have tried using groupby with Sub_Dpt and Week to no avail.
Does anyone know how to solve this issue?
The Yellow column is obviously the desired outcome from the code.
Here is some of the code I have used. I am trying to create the column first and then update the values (but it could all potentially be wrong). Also, I intentionally named the data frame df1:
df1['Discount_Sub_Dpt'] = np.where((df1['Discount']=='Yes'),'Yes','No')
grps = []
grps.append(df1.Sub_Dpt.unique())
for x in grps:
    x = str(x)
    yes_weeks = df1.loc[(df1.Discount_SubDpt == 'Yes') & (df1.Sub_Dpt_Description == x), 'Week'].unique()
    df1.loc[df1['Week'].isin(yes_weeks) & df1['Sub_Dpt_Description'] == x, 'Discount_SubDpt'] = 'Yes'
Ok, the following is a bit crazy, but it works pretty nicely, so listen up.
First, we are going to build a NetworkX graph as follows.
import networkx as nx
import numpy as np
import pandas as pd
G = nx.Graph()
Prods = df.Product.unique()
G.add_nodes_from(Prods)
We now add edges between our nodes (which are all of the products) whenever they belong to the same sub_dpt. In this case, since A and B share a department and C and D do, we add edges AB and CD. If we had A, B, and C in the same department, we would add AB, AC, and BC. Confusing, I know, but just trust me on this one.
G.add_edges_from([('A','B'),('C','D')])
Now comes the fun part. We need to convert your Discount column from Yes/No to 1/0.
df['Disc2']=np.nan
df.loc[df['Discount']=='Yes','Disc2']=1
df.loc[df['Discount']=='No','Disc2']=0
Now we pivot the data
tab = df.pivot(index = 'Week',columns='Product',values = 'Disc2')
And now we multiply the pivoted table by the graph's adjacency matrix; each entry then counts how many of the other products in the same sub-department are discounted in that week:
tab = pd.DataFrame(np.dot(tab,nx.adjacency_matrix(G,Prods).todense()), columns=Prods,index=df.Week.unique())
tab = tab.astype(bool)  # True whenever at least one other product in the sub-department is discounted
df = df.merge(tab.unstack().reset_index(),left_on=['Product','Week'],right_on=['level_0','level_1'])
df['Discount_Sub_Dpt']=df[0]
print(df[['Product','Week','Sub_Dpt','Discount','Discount_Sub_Dpt']])
You may ask, why go through this trouble? Well, two reasons. First, it's far more stable: the other answers can't handle all possible cases of this problem. Second, it's much faster than the other solutions. I hope this helped!
You can perform a GroupBy to map ('Week', 'Sub_Dpt') to lists of 'Product' only when Discount is "Yes".
Then use a list comprehension to check if any are on Discount apart from the product in question. Finally, map a Boolean series result to "Yes" / "No".
Data from #SahilPuri.
# GroupBy only when Discount == Yes
g = df1[df1['Discount'] == 'Yes'].groupby(['Week', 'Sub_Dpt'])['Product'].unique()
# calculate index by row
idx = df1.set_index(['Week', 'Sub_Dpt']).index
# construct list of Booleans according to criteria
L = [any(x for x in g.get(i, []) if x!=j) for i, j in zip(idx, df1['Product'])]
# map Boolean to strings
df1['Discount_SubDpt'] = pd.Series(L).map({True: 'Yes', False: 'No'})
print(df1)
Product Week Sub_Dpt Discount Discount_SubDpt
0 A 1 Toys Yes No
1 A 2 Toys No Yes
2 A 3 Toys No No
3 A 4 Toys Yes Yes
4 B 1 Toys No Yes
5 B 2 Toys Yes No
6 B 3 Toys No No
7 B 4 Toys Yes Yes
8 C 1 Candy No No
9 C 2 Candy No No
10 C 3 Candy Yes No
11 C 4 Candy Yes No
12 D 1 Candy No No
13 D 2 Candy No No
14 D 3 Candy No Yes
15 D 4 Candy No Yes
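A vectorized alternative (just a sketch, and it assumes exactly one row per product per week and sub-department): count the discounts per (Week, Sub_Dpt) group with a transform, then subtract the row's own discount so the product itself is excluded.
import numpy as np
disc = (df1['Discount'] == 'Yes').astype(int)
group_total = disc.groupby([df1['Week'], df1['Sub_Dpt']]).transform('sum')
df1['Discount_SubDpt'] = np.where(group_total - disc > 0, 'Yes', 'No')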
Okay, this might not scale well, but should be easy to read.
df1 = pd.DataFrame(data= [[ 'A', 1, 'Toys', 'Yes', ],
[ 'A', 2, 'Toys', 'No', ],
[ 'A', 3, 'Toys', 'No', ],
[ 'A', 4, 'Toys', 'Yes', ],
[ 'B', 1, 'Toys', 'No', ],
[ 'B', 2, 'Toys', 'Yes', ],
[ 'B', 3, 'Toys', 'No', ],
[ 'B', 4, 'Toys', 'Yes', ],
[ 'C', 1, 'Candy', 'No', ],
[ 'C', 2, 'Candy', 'No', ],
[ 'C', 3, 'Candy', 'Yes', ],
[ 'C', 4, 'Candy', 'Yes', ],
[ 'D', 1, 'Candy', 'No', ],
[ 'D', 2, 'Candy', 'No', ],
[ 'D', 3, 'Candy', 'No', ],
[ 'D', 4, 'Candy', 'No', ],], columns=['Product', 'Week', 'Sub_Dpt', 'Discount'])
df2 = df1.set_index(['Product', 'Week', 'Sub_Dpt'])
products = df1.Product.unique()
df1['Discount_SubDpt'] = df1.apply(lambda x: 'Yes' if 'Yes' in df2.loc[(list(products[products != x['Product']]), x['Week'], x['Sub_Dpt']), 'Discount'].tolist() else 'No', axis=1)
The first step creates a MultiIndex DataFrame.
Next, we get the list of all products.
Then, for each row, we select the rows for the same week and sub-department, excluding the current product.
If there is a discount in that list, we select 'Yes', else 'No'.
Edit 1:
If you don't want to create another dataframe (saves memory, but will be a bit slower):
df1['Discount_SubDpt'] = df1.apply(lambda x: 'Yes' if 'Yes' in df1.loc[(df1['Product'] != x['Product']) & (df1['Week'] == x['Week']) & (df1['Sub_Dpt'] == x['Sub_Dpt']), 'Discount'].tolist() else 'No', axis=1)
It's late, but here's a go. I used the sample df in the comments above.
df1['dis'] = df1['Discount'].apply(lambda x: 1 if x =="Yes" else 0)
df2 = df1.groupby(['Sub_Dpt','Week']).sum()
df2.reset_index(inplace = True)
df3 = pd.merge(df1,df2, left_on=['Sub_Dpt','Week'], right_on =['Sub_Dpt','Week'])
df3['Discount_Sub_Dpt'] = np.where(df3['dis_x'] < df3['dis_y'], 'Yes', 'No')
df3.sort_values(by=['Product'], inplace = True)
df3
What's the best way to insert new rows into an existing pandas DataFrame while maintaining column data types and, at the same time, giving user-defined fill values for columns that aren't specified? Here's an example:
df = pd.DataFrame({
'name': ['Bob', 'Sue', 'Tom'],
'age': [45, 40, 10],
'weight': [143.2, 130.2, 34.9],
'has_children': [True, True, False]
})
Assume that I want to add a new record passing just name and age. To maintain data types, I can copy rows from df, modify values and then append df to the copy, e.g.
columns = ('name', 'age')
copy_df = df.loc[0:0, columns].copy()
copy_df.loc[0, columns] = 'Cindy', 42
new_df = copy_df.append(df, sort=False).reset_index(drop=True)
But that converts the bool column to an object.
Here's a really hacky solution that doesn't feel like the "right way" to do this:
columns = ('name', 'age')
copy_df = df.loc[0:0].copy()
missing_remap = {
'int64': 0,
'float64': 0.0,
'bool': False,
'object': ''
}
for c in set(copy_df.columns).difference(columns):
    copy_df.loc[:, c] = missing_remap[str(copy_df[c].dtype)]
new_df = copy_df.append(df, sort=False).reset_index(drop=True)
new_df.loc[0, columns] = 'Cindy', 42
I know I must be missing something.
As you found, since NaN is a float, adding NaN to a series may cause it to be either upcasted to float or converted to object. You are right in determining this is not a desirable outcome.
There is no straightforward approach. My suggestion is to store your input row data in a dictionary and combine it with a dictionary of defaults before appending. Note that this works because pd.DataFrame.append accepts a dict argument.
In Python 3.6, you can use the syntax {**d1, **d2} to combine two dictionaries with preference for the second.
default = {'name': '', 'age': 0, 'weight': 0.0, 'has_children': False}
row = {'name': 'Cindy', 'age': 42}
df = df.append({**default, **row}, ignore_index=True)
print(df)
age has_children name weight
0 45 True Bob 143.2
1 40 True Sue 130.2
2 10 False Tom 34.9
3 42 False Cindy 0.0
print(df.dtypes)
age int64
has_children bool
name object
weight float64
dtype: object
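One caveat: DataFrame.append was deprecated and removed in pandas 2.0, so on recent versions the same idea can be written with pd.concat, e.g. as a sketch:
new_row = pd.DataFrame([{**default, **row}])   # one-row frame with concrete dtypes
df = pd.concat([df, new_row], ignore_index=True)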
It's because the NaN value is a float, while True and False are bool. With mixed dtypes in one column, pandas will automatically convert it to object.
Another instance of this: if you have a column with all integer values and append a float value, pandas changes the entire column to float, adding '.0' to the existing values.
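A small sketch that shows both upcasts:
import numpy as np
import pandas as pd
b = pd.Series([True, False])                                            # bool
print(pd.concat([b, pd.Series([np.nan])], ignore_index=True).dtype)     # object
i = pd.Series([1, 2, 3])                                                # int64
print(pd.concat([i, pd.Series([4.0])], ignore_index=True).dtype)        # float64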
Edit
Based on the comments, here is another hacky way to convert the object column back to bool dtype.
df = pd.DataFrame({
'name': ['Bob', 'Sue', 'Tom'],
'age': [45, 40, 10],
'weight': [143.2, 130.2, 34.9],
'has_children': [True, True, False]
})
row = {'name': 'Cindy', 'age': 12}
df = df.append(row, ignore_index=True)
df['has_children'] = df['has_children'].fillna(False).astype('bool')
Now the new dataframe looks like this:
age has_children name weight
0 45 True Bob 143.2
1 40 True Sue 130.2
2 10 False Tom 34.9
3 12 False Cindy NaN
This DataFrame has two columns, both of object dtype.
Dependents Married
0 0 No
1 1 Yes
2 0 Yes
3 0 Yes
4 0 No
I want to aggregate 'Dependents' based on 'Married'.
table = df.pivot_table(
values='Dependents',
index='Married',
aggfunc = lambda x: x.map({'0':0,'1':1,'2':2,'3':3}).mean())
This works, however, surprisingly, the following doesn't:
table = df.pivot_table(values = 'Dependents',
index = 'Married',
aggfunc = lambda x: x.map(int).mean())
It will produce a None instead.
Can anyone help explain?
Both examples of code provided in your question work. However, they are not the idiomatic way to achieve what you want to do -- particularly the first one.
I think this is the proper way to obtain the expected behavior.
# Test data
df = pd.DataFrame({'Dependents': ['0', '1', '0', '0', '0'],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No']})
# Converting object to int
df['Dependents'] = df['Dependents'].astype(int)
# Computing the mean by group
df.groupby('Married').mean()
Dependents
Married
No 0.00
Yes 0.33
However, the following code works.
df.pivot_table(values = 'Dependents', index = 'Married',
aggfunc = lambda x: x.map(int).mean())
An equivalent (and more readable) approach is to convert to int with map before pivoting the data:
df['Dependents'] = df['Dependents'].map(int)
df.pivot_table(values = 'Dependents', index = 'Married')
Edit
If you have a messy DataFrame, you can use to_numeric with the errors parameter set to 'coerce'.
If coerce, then invalid parsing will be set as NaN
# Test data
df = pd.DataFrame({'Dependents': ['0', '1', '2', '3+', 'NaN'],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No']})
df['Dependents'] = pd.to_numeric(df['Dependents'], errors='coerce')
print(df)
Dependents Married
0 0.0 No
1 1.0 Yes
2 2.0 Yes
3 NaN Yes
4 NaN No
print(df.groupby('Married').mean())
Dependents
Married
No 0.0
Yes 1.5
My original question was why method 2, using map(int), doesn't work. None of the above addresses that question, so there is no best answer.
However, looking back now, I find that in pandas 0.22 method 2 does work, so I guess the problem was in pandas.
To robustly do the aggregation, my solution would be
df.pivot_table(
    values='Dependents',
    index='Married',
    aggfunc=lambda s: s.map(lambda x: int(x.strip("+"))).mean())
To make it cleaner, I guess you could first translate the column "Dependents" to integers and then do the aggregation.
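A sketch of that cleaner two-step version (assuming every value is a digit string, optionally ending in '+', as in the sample above):
df['Dependents'] = df['Dependents'].str.rstrip('+').astype(int)
df.pivot_table(values='Dependents', index='Married')   # mean is the default aggfunc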