I'm pulling my hair out on this one. Help appreciated.
I have a dataframe I'm munging which partially involves consolidating data that resides on several rows into one. I'm trying to use df.loc to do it:
df.loc[df['foo'] == 1, 'Output Column'] = df.loc[df['bar'] == 2, 'Desired Column']
So what I want is this: for any row where 'foo' == 1, look up the row where 'bar' == 2 and copy the value from its 'Desired Column' into the 'Output Column' of the original row. Essentially, this consolidates the rows to create cleaner output. As a toy example...
(Edited to show where my code is going wrong)
Here's what I want...
Before:
idx  foo  bar  Desired Column  Output Column
0    1
1         2    Hi there!
2    1
3    6
After:
idx  foo  bar  Desired Column  Output Column
0    1                         Hi there!
1         2    Hi there!
2    1                         Hi there!
3    6
However here's what I'm actually getting:
Before:
idx  foo  bar  Desired Column  Output Column
0    1
1         2    Hi there!
2    1
3    6
After:
idx  foo  bar  Desired Column  Output Column
0    1
1         2    Hi there!       Hi there!
2    1
3    6
Thanks for your help!
Well, this worked... I'm not sure it is the most Pythonic solution ever, but here goes:
df.loc[df['foo'] == 1, 'Output Column'] = df.loc[df['bar'] == 2, 'Desired Column']
df['Output Column'] = df.groupby(['foo'])['Output Column'].transform('max')
In my toy example, this populated with the single number that corresponded to bar=2
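For what it's worth, the plain .loc assignment in the question leaves the 'foo' == 1 rows empty because pandas aligns the right-hand Series on the index, and the 'bar' == 2 row carries a different index label from the 'foo' == 1 rows. Below is a minimal sketch of one way around that, assuming exactly one row has 'bar' == 2. The toy frame is reconstructed from the question, the 'Output Column' is created with object dtype so it can hold strings, and the variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame reconstructed from the question (an assumption about the real data).
df = pd.DataFrame({
    "foo": [1, np.nan, 1, 6],
    "bar": [np.nan, 2, np.nan, np.nan],
    "Desired Column": [np.nan, "Hi there!", np.nan, np.nan],
    "Output Column": pd.Series([np.nan] * 4, dtype="object"),
})

# Pull the value out as a scalar so that no index alignment takes place.
value = df.loc[df["bar"] == 2, "Desired Column"].iloc[0]

# A scalar is broadcast to every row selected by the mask.
df.loc[df["foo"] == 1, "Output Column"] = value

print(df)
```

If several rows can match 'bar' == 2, the .iloc[0] step silently picks the first one, so this only fits the single-match case.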
Try using where:
df['Output Column']=df['Output Column'].where(df['bar']==2,'Hi There!')
print(df)
Output:
   idx  foo  bar Desired Column Output Column
0    0    1  NaN            NaN     Hi there!
1    1  NaN    2      Hi there!           NaN
To replace NaN's with '', do:
df=df.fillna('')
after where.
Then:
print(df)
Will be:
   idx foo bar Desired Column Output Column
0    0   1                        Hi there!
1    1       2      Hi there!
Or, to avoid hard-coding the value, do:
df['Output Column']=df['Output Column'].where(df['bar']==2,df.loc[df['bar']==2,'Desired Column'].tolist())
print(df)
Then you can do the same thing to replace the NaNs with ''.
Update:
First:
df['Output Column']=df['Output Column'].where(df['foo']!=1,'Hi There!')
print(df)
Output:
Desired Column Output Column bar foo idx
0 NaN Hi There! NaN 1.0 0
1 Hi There! NaN 2.0 NaN 1
2 NaN Hi There! NaN 1.0 2
3 NaN NaN NaN 6.0 3
Second:
df['Output Column']=df['Output Column'].where(df['foo'].notnull(),'Hi There!')
print(df)
Output:
Desired Column Output Column bar foo idx
0 NaN NaN NaN 1.0 0
1 Hi There! Hi There! 2.0 NaN 1
2 NaN NaN NaN 1.0 2
3 NaN NaN NaN 6.0 3
You can do the same thing to replace the NaNs with ''.
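Since the direction of the condition is easy to get backwards, here is a small sketch of the semantics: Series.where keeps the values where the condition is True and replaces the rest, while Series.mask does the opposite. The numbers are made up purely for illustration:

```python
import pandas as pd

s = pd.Series([10, 20, 30])
cond = s > 15

# where: keep values where cond is True, replace everything else.
kept = s.where(cond, 0)

# mask: the inverse, replace values where cond is True.
masked = s.mask(cond, 0)

print(kept.tolist())    # [0, 20, 30]
print(masked.tolist())  # [10, 0, 0]
```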
Related
I have a dataframe where one of the column names is a variable:
xx = pd.DataFrame([{'ID':1, 'Name': 'Abe', 'HasCar':1},
{'ID':2, 'Name': 'Ben', 'HasCar':0},
{'ID':3, 'Name': 'Cat', 'HasCar':1}])
ID Name HasCar
0 1 Abe 1
1 2 Ben 0
2 3 Cat 1
In this dummy example column 2 could be "HasCar", or "IsStaff", or some other unknowable value. I want to select all rows, where column 2 is True, whatever the column name is.
I've tried the following without success:
xx.iloc[:,[2]] == 1
HasCar
0 True
1 False
2 True
and then trying to use that as an index results in:
xx[xx.iloc[:,[2]] == 1]
ID Name HasCar
0 NaN NaN 1.0
1 NaN NaN NaN
2 NaN NaN 1.0
Which isn't helpful. I suppose I could rename column 2, but that feels a little wrong. The issue seems to be that xx.iloc[:,[2]] returns a DataFrame while xx['HasCar'] returns a Series. I can't figure out how to force an (x,1)-shaped DataFrame into a Series without knowing the column name, as described here.
Any ideas?
It was almost correct, but you sliced in 2D. Use Series slicing instead:
xx[xx.iloc[:, 2] == 1]
Output:
ID Name HasCar
0 1 Abe 1
2 3 Cat 1
The difference:
# 2D slicing, this gives a DataFrame (with a single column)
xx.iloc[:,[2]]
HasCar
0 1
1 0
2 1
# 1D slicing, as Series
xx.iloc[:,2]
0 1
1 0
2 1
Name: HasCar, dtype: int64
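If you already hold a single-column DataFrame and want a Series out of it, two other routes should also work. This is a sketch using the frame from the question; DataFrame.squeeze("columns") collapses the one-column frame into a Series, and xx.columns[2] looks the name up by position. The variable names are only illustrative:

```python
import pandas as pd

xx = pd.DataFrame([{'ID': 1, 'Name': 'Abe', 'HasCar': 1},
                   {'ID': 2, 'Name': 'Ben', 'HasCar': 0},
                   {'ID': 3, 'Name': 'Cat', 'HasCar': 1}])

# Route 1: collapse the (n, 1) boolean DataFrame into a boolean Series.
mask = (xx.iloc[:, [2]] == 1).squeeze("columns")
by_squeeze = xx[mask]

# Route 2: recover the column name from its position, then index as usual.
col_name = xx.columns[2]
by_name = xx[xx[col_name] == 1]

print(by_squeeze)
```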
I would like to sum values of a dataframe conditionally, based on the values of a different dataframe. Say for example I have two dataframes:
df1 = pd.DataFrame(data = [[1,-1,5],[2,1,1],[3,0,0]],index=[0,1,2],columns = [0,1,2])
index 0 1 2
-----------------
0 1 -1 5
1 2 1 1
2 3 0 0
df2 = pd.DataFrame(data = [[1,1,3],[1,1,2],[0,2,1]],index=[0,1,2],columns = [0,1,2])
index 0 1 2
-----------------
0 1 1 3
1 1 1 2
2 0 2 1
Now, what I would like is to find the positions where df1 equals a given value, for example 1, and sum the values at those same positions in df2.
In this example, if the condition is 1, then the sum of df2 would be 4. If the condition was 0, the result would be 3.
Another option with Pandas' query:
df2.query("@df1==1").sum().sum()
# 4
You can use a mask with where:
df2.where(df1.eq(1)).to_numpy().sum()
# or
# df2.where(df1.eq(1)).sum().sum()
output: 4.0
intermediate:
df2.where(df1.eq(1))
0 1 2
0 1.0 NaN NaN
1 NaN 1.0 2.0
2 NaN NaN NaN
Assuming that one wants to store the result in a variable named value, there are various ways to do it. Two of them are shown below.
Option 1
One can simply do the following
value = df2[df1 == 1].sum().sum()
[Out]: 4.0 # numpy.float64
# or
value = sum(df2[df1 == 1].sum())
[Out]: 4.0 # float
Option 2
Using pandas.DataFrame.where
value = df2.where(df1 == 1, 0).sum().sum()
[Out]: 4 # numpy.int64
# or
value = sum(df2.where(df1 == 1, 0).sum())
[Out]: 4 # int
Notes:
df2[df1 == 1] gives the following output (df2.where(df1 == 1, 0) has the same layout, but holds 0 instead of NaN and therefore keeps the integer dtype)
0 1 2
0 1.0 NaN NaN
1 NaN 1.0 2.0
2 NaN NaN NaN
Depending on the desired output (float, int, numpy.float64,...) one method might be better than the other.
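If the NaN detour and the resulting float are unwanted, boolean indexing on the underlying NumPy arrays is another possibility. This is only a sketch: it assumes both frames have the same shape and compares them by position, ignoring index and column labels. The helper name conditional_sum is made up for this example:

```python
import pandas as pd

df1 = pd.DataFrame(data=[[1, -1, 5], [2, 1, 1], [3, 0, 0]],
                   index=[0, 1, 2], columns=[0, 1, 2])
df2 = pd.DataFrame(data=[[1, 1, 3], [1, 1, 2], [0, 2, 1]],
                   index=[0, 1, 2], columns=[0, 1, 2])


def conditional_sum(values, condition_frame, target):
    """Sum the entries of `values` at positions where `condition_frame` equals `target`."""
    mask = condition_frame.to_numpy() == target  # 2D boolean array
    return values.to_numpy()[mask].sum()         # selects matching cells, then sums


print(conditional_sum(df2, df1, 1))  # 4
print(conditional_sum(df2, df1, 0))  # 3
```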
This is my current data frame:
sports_gpa music_gpa Activity Sport
2 3 nan nan
0 2 nan nan
3 3.5 nan nan
2 1 nan nan
I have the following condition:
If the 'sports_gpa' is greater than 0 and the 'music_gpa' is greater than the 'sports_gpa', fill the 'Activity' column with the 'sports_gpa' and fill the 'Sport' column with the str 'basketball'.
Expected output:
sports_gpa music_gpa Activity Sport
2 3 2 basketball
0 2 nan nan
3 3.5 3 basketball
2 1 nan nan
To do this I would use the following statement...
df['Activity'], df['Sport'] = np.where(((df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa'])), (df['sports_gpa'],'basketball'), (df['Activity'], df['Sport']))
This of course gives an error that operands could not be broadcast together with shapes.
To fix this I could add a column to the data frame..
df.loc[:,'str'] = 'basketball'
df['Activity'], df['Sport'] = np.where(((df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa'])), (df['sports_gpa'],df['str']), (df['Activity'], df['Sport']))
This gives me my expected output.
I am wondering if there is a way to fix this error without having to create a new column in order to add the str value 'basketball' to the 'Sport' column in the np.where statement.
Use np.where + Series.fillna:
where=df['sports_gpa'].ne(0)&(df['sports_gpa']<df['music_gpa'])
df['Activity'], df['Sport'] = np.where(where, (df['sports_gpa'],df['Sport'].fillna('basketball')), (df['Activity'], df['Sport']))
You can also use Series.where + Series.mask:
df['Activity']=df['sports_gpa'].where(where)
df['Sport']=df['Sport'].mask(where,'basketball')
print(df)
sports_gpa music_gpa Activity Sport
0 2 3.0 2.0 basketball
1 0 2.0 NaN NaN
2 3 3.5 3.0 basketball
3 2 1.0 NaN NaN
Just figured out I could do:
df['Activity'], df['Sport'] = np.where(((df['sports_gpa'] > 0) & (df['music_gpa'] > df['sports_gpa'])), (df['sports_gpa'],df['Sport'].astype(str).replace({"nan": "basketball"})), (df['Activity'], df['Sport']))
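Another way to sidestep the broadcasting problem entirely is to build the mask once and assign each column with .loc, where a scalar such as 'basketball' is broadcast automatically. This is a rough sketch on a frame rebuilt from the question; creating the 'Sport' column with object dtype is an assumption made here so that it can hold strings:

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the question; 'Sport' is object dtype by assumption.
df = pd.DataFrame({
    "sports_gpa": [2, 0, 3, 2],
    "music_gpa": [3, 2, 3.5, 1],
    "Activity": [np.nan] * 4,
    "Sport": pd.Series([np.nan] * 4, dtype="object"),
})

# Build the condition once and reuse it for both columns.
mask = (df["sports_gpa"] > 0) & (df["music_gpa"] > df["sports_gpa"])

# The right-hand Series is aligned on the index of the selected rows.
df.loc[mask, "Activity"] = df.loc[mask, "sports_gpa"]

# A scalar is broadcast to every selected row, so no helper column is needed.
df.loc[mask, "Sport"] = "basketball"

print(df)
```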
I am facing an issue with my code using queries on simple pandas DataFrame, I am sure I am missing a tiny detail. Can you guys help me with this?
I don't understand why I get only NaN values.
You can change [['date']] to ['date'] to select a Series instead of a one-column DataFrame.
Sample:
df = pd.DataFrame({'A':[1,2,3]})
print (df['A'])
0 1
1 2
2 3
Name: A, dtype: int64
print (df[['A']])
A
0 1
1 2
2 3
print (df[df['A'] == 1])
A
0 1
print (df[df[['A']] == 1])
A
0 1.0
1 NaN
2 NaN
I want to replace the missing values in one column of my df with the string "missing".
I tried
result['emp_title'].fillna('missing')
or
result['emp_title'] = result['emp_title'].replace({ np.nan:'missing'})
The second one works, since when I count the missing values after running this code:
result['emp_title'].isnull().sum()
it gave me 0.
However, the first one does not work as I expected: it did not give me 0 but the same count of missing values as before.
Why does the first one not work? Thank you!
You need to fill inplace, or assign:
result['emp_title'].fillna('missing', inplace=True)
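or, more robustly (on newer pandas versions with copy-on-write, an inplace call on a selected column may not update the original DataFrame):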
result['emp_title'] = result['emp_title'].fillna('missing')
MVCE:
In [1697]: df = pd.DataFrame({'Col1' : [1, 2, 3, np.nan, 4, 5, np.nan]})
In [1702]: df.fillna('missing'); df # changes not seen in the original
Out[1702]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1703]: df.fillna('missing', inplace=True); df
Out[1703]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 missing
You should be aware that if you are trying to apply fillna to slices, you should not use inplace=True. Instead, use df.loc/iloc and assign to sub-slices:
In [1707]: df.Col1.iloc[:5].fillna('missing', inplace=True); df # doesn't work
Out[1707]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1709]: df.Col1.iloc[:5] = df.Col1.iloc[:5].fillna('missing')
In [1710]: df
Out[1710]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 NaN
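One caveat: the chained form df.Col1.iloc[:5] = ... may warn, or leave df unchanged, on newer pandas versions that use copy-on-write. A single .loc call on the DataFrame itself avoids the chain. Below is a sketch of that variant; casting the column to object first is an assumption made here so that the string 'missing' can sit next to the numbers, and the name first_five is purely illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2, 3, np.nan, 4, 5, np.nan]})

# Assumption: allow mixed numbers and strings in the same column.
df["Col1"] = df["Col1"].astype(object)

# Select the first five row labels and fill them in one .loc assignment.
first_five = df.index[:5]
df.loc[first_five, "Col1"] = df.loc[first_five, "Col1"].fillna("missing")

print(df)
```

Only the NaN at position 3 falls inside the slice, so the NaN in the last row is left untouched.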