Query errors in a pandas DataFrame - python

I am facing an issue with my code using queries on a simple pandas DataFrame, and I am sure I am missing a tiny detail. Can you help me with this?
I don't understand why I get only NaN values.

You can change [['date']] to ['date'] to select a Series instead of a one-column DataFrame.
Sample:
df = pd.DataFrame({'A': [1, 2, 3]})

print (df['A'])
0    1
1    2
2    3
Name: A, dtype: int64

print (df[['A']])
   A
0  1
1  2
2  3

print (df[df['A'] == 1])
   A
0  1

print (df[df[['A']] == 1])
     A
0  1.0
1  NaN
2  NaN
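If you do end up with a one-column DataFrame, a small sketch using DataFrame.squeeze (which collapses a single-column DataFrame back into a Series) shows how to get row filtering instead of the element-wise NaN masking above:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# squeeze() turns the one-column DataFrame into a Series, so the boolean
# mask filters rows instead of producing NaN for non-matching cells:
print (df[df[['A']].squeeze() == 1])
   A
0  1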

Related

Pandas: Select column by location and rows by value

I have a dataframe where one of the column names is a variable:
xx = pd.DataFrame([{'ID': 1, 'Name': 'Abe', 'HasCar': 1},
                   {'ID': 2, 'Name': 'Ben', 'HasCar': 0},
                   {'ID': 3, 'Name': 'Cat', 'HasCar': 1}])

   ID Name  HasCar
0   1  Abe       1
1   2  Ben       0
2   3  Cat       1
In this dummy example, column 2 could be "HasCar", or "IsStaff", or some other value that isn't known in advance. I want to select all rows where column 2 is True, whatever the column name is.
I've tried the following without success:
xx.iloc[:, [2]] == 1
   HasCar
0    True
1   False
2    True
and then trying to use that as an index results in:
xx[xx.iloc[:, [2]] == 1]
    ID Name  HasCar
0  NaN  NaN     1.0
1  NaN  NaN     NaN
2  NaN  NaN     1.0
Which isn't helpful. I suppose I could go about renaming column 2, but that feels a little wrong. The issue seems to be that xx.iloc[:,[2]] returns a DataFrame while xx['HasCar'] returns a Series. I can't figure out how to force an (x, 1)-shaped DataFrame into a Series without knowing the column name, as described here.
Any ideas?
It was almost correct, but you sliced in 2D; use Series slicing instead:
xx[xx.iloc[:, 2] == 1]
Output:
   ID Name  HasCar
0   1  Abe       1
2   3  Cat       1
The difference:
# 2D slicing: this gives a DataFrame (with a single column)
xx.iloc[:, [2]]
   HasCar
0       1
1       0
2       1

# 1D slicing: this gives a Series
xx.iloc[:, 2]
0    1
1    0
2    1
Name: HasCar, dtype: int64
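If you prefer to keep label-based indexing, a sketch along the same lines is to look the column name up by its position first and then filter with an ordinary Series comparison:

# Look up the unknown column label by position, then filter rows with it:
col = xx.columns[2]
print(xx[xx[col] == 1])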

Pandas How to find duplicate row in group

Hi everyone, I am trying to find duplicate rows in a doubly grouped DataFrame and I don't understand how to do it.
df_part[df_part.income_flag==1].groupby(['app_id', 'month_num'])['amnt'].duplicate()
For example df:
So I want to see something like this:
So if I use this code I can see that there are two rows with the same 'amnt' value, 0.387677, but in different months; that is the information I need:
df_part[(df_part.income_flag==2) & df_part.duplicated(['app_id','amnt'], keep=False)].groupby(['app_id', 'amnt', 'month_num'])['month_num'].count().head(10)
app_id  amnt      month_num
0       0.348838  3             1
        0.387677  6             1
                  10            2
        0.426544  2             2
        0.475654  2             1
        0.488173  1             1
1       0.297589  1             1
                  4             1
        0.348838  2             1
        0.426544  8             3
Name: month_num, dtype: int64
Thanks all.
I think you need to chain another mask with & (bitwise AND) using DataFrame.duplicated, and then use GroupBy.size:
df = (df_part[(df_part.income_flag==1) & df_part.duplicated(['app_id','amnt'], keep=False)]
        .groupby('app_id')['amnt']
        .size()
        .reset_index(name='duplicate_count'))
print (df)
   app_id  duplicate_count
0      12                2
1      13                3
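As a side note, keep=False is what makes DataFrame.duplicated flag every member of a duplicate group rather than only the later repeats; a tiny sketch with made-up data:

import pandas as pd

# Hypothetical sample data, just to show the behaviour of keep=False:
# rows 0 and 1 share the same ('app_id', 'amnt') pair, so both are flagged.
sample = pd.DataFrame({'app_id':    [0, 0, 0, 1],
                       'amnt':      [0.38, 0.38, 0.42, 0.38],
                       'month_num': [6, 10, 2, 3]})
print (sample.duplicated(['app_id', 'amnt'], keep=False))
0     True
1     True
2    False
3    False
dtype: bool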

Replacing missing values with mean

I am exploring the pandas library, and I found this dataset. My task is to fill the ? values with the mean of their group in the column 'num-of-doors'. When I used dataframe.groupby('num-of-doors').mean(), pandas was unable to find the mean of these columns:
'peak-rpm', 'price', 'bore', 'stroke', 'normalized-losses', 'horsepower'
So I tried with my own dataset to understand why it is not working. I created a file with the following contents:
c0,c1,type
1,2,0
2,3,0
2,4,0
1,?,1
1,3,1
and I wrote the following script:
import pandas as pd
import numpy as np

data = pd.read_csv("data.csv")
data = data.replace('?', np.nan)
print(data)
print(data.groupby('type').mean())
This is what I'm getting as output:
   c0   c1  type
0   1    2     0
1   2    3     0
2   2    4     0
3   1  NaN     1
4   1    3     1

            c0
type
0     1.666667
1     1.000000
Can you please explain what is going on here? Why am I not getting a mean for column c1? I even tried some Stack Overflow answers, but still got nothing. Any suggestions?
I really appreciate your help.
The problem is that c1 is not of a numeric dtype; convert it first:
data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)
print(data.groupby('type').mean())
Output
            c0   c1
type
0     1.666667  3.0
1     1.000000  3.0
When you read the original data, the column containing ? gets dtype object (use dtypes to verify):
c0       int64
c1      object
type     int64
dtype: object
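As a sketch of an alternative to replace + astype, pd.to_numeric with errors='coerce' turns anything it cannot parse (such as '?') into NaN in a single step:

# Coerce non-numeric strings like '?' straight to NaN:
data = pd.read_csv("data.csv")
data['c1'] = pd.to_numeric(data['c1'], errors='coerce')
print(data.groupby('type').mean())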
If you want to replace the NaN with the mean of its group, use transform + fillna:
data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)
res = data.groupby('type').transform('mean')
print(data.fillna(res))
Output
   c0   c1  type
0   1  2.0     0
1   2  3.0     0
2   2  4.0     0
3   1  3.0     1
4   1  3.0     1
As a last piece of advice, you could read the CSV as:
data = pd.read_csv("data.csv", na_values='?')
print(data)
Output
   c0   c1  type
0   1  2.0     0
1   2  3.0     0
2   2  4.0     0
3   1  NaN     1
4   1  3.0     1
This saves you the need to convert the columns to numeric afterwards.
df['c1'] = df['c1'].str.replace('[?]', 'NaN', regex=True).astype(float)
df.groupby('type').apply(lambda x: x.fillna(x.mean()))
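For comparison, a sketch of an equivalent per-group fill that touches only the one column (this assumes c1 has already been converted to a numeric dtype):

# Fill each NaN in c1 with the mean of its own 'type' group:
df['c1'] = df['c1'].fillna(df.groupby('type')['c1'].transform('mean'))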

Using df.loc to place a value from a different row

I'm pulling my hair out on this one. Help appreciated.
I have a dataframe I'm munging which partially involves consolidating data that resides on several rows into one. I'm trying to use df.loc to do it:
df.loc[df['foo'] == 1, 'Output Column'] = df.loc[df['bar'] == 2, 'Desired Column']
So what I want is: for any row where 'foo' == 1, go look for the row where 'bar' == 2 and put the value from its 'Desired Column' into the original row's 'Output Column'. Essentially this will consolidate the rows to create cleaner output. As a toy example...
(Edited to show where my code is going wrong)
Here's what I want...
Before:
idx  foo  bar  Desired Column  Output Column
0    1
1         2    Hi there!
2    1
3    6
After:
idx  foo  bar  Desired Column  Output Column
0    1                         Hi there!
1         2    Hi there!
2    1                         Hi there!
3    6
However here's what I'm actually getting:
Before:
idx  foo  bar  Desired Column  Output Column
0    1
1         2    Hi there!
2    1
3    6
After:
idx  foo  bar  Desired Column  Output Column
0    1
1         2    Hi there!       Hi there!
2    1
3    6
Thanks for your help!
Well, this worked... not sure it is the most pythonic solution ever, but here it goes:
df.loc[df['foo'] == 1, 'Output Column'] = df.loc[df['bar'] == 2, 'Desired Column']
df['Output Column'] = df.groupby(['foo'])['Output Column'].transform(max)
In my toy example, this populated the column with the single value that corresponded to bar == 2.
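For what it's worth, the original one-liner produces NaN because pandas aligns the right-hand Series on the index: its values sit at the rows where bar == 2, while the assignment targets the rows where foo == 1, so nothing lines up. A sketch that sidesteps the alignment by extracting the value first (assuming exactly one row has bar == 2):

# Grab the scalar before assigning, so no index alignment takes place:
value = df.loc[df['bar'] == 2, 'Desired Column'].iloc[0]
df.loc[df['foo'] == 1, 'Output Column'] = value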
Try using where:
df['Output Column']=df['Output Column'].where(df['bar']==2,'Hi There!')
print(df)
Output:
   idx  foo  bar Desired Column Output Column
0    0    1  NaN            NaN     Hi there!
1    1  NaN    2      Hi there!           NaN
To replace NaN's with '', do:
df=df.fillna('')
after where.
Then:
print(df)
Will be:
   idx  foo  bar Desired Column Output Column
0    0    1                         Hi there!
1    1         2      Hi there!
Or, to do it less manually:
df['Output Column'] = df['Output Column'].where(df['bar'] == 2, df.loc[df['bar'] == 2, 'Desired Column'].tolist())
print(df)
Then you can do the same thing to replace the NaN's with ''.
Update:
First:
df['Output Column']=df['Output Column'].where(df['foo']!=1,'Hi There!')
print(df)
Output:
  Desired Column Output Column  bar  foo  idx
0            NaN     Hi There!  NaN  1.0    0
1      Hi There!           NaN  2.0  NaN    1
2            NaN     Hi There!  NaN  1.0    2
3            NaN           NaN  NaN  6.0    3
Second:
df['Output Column']=df['Output Column'].where(df['foo'].notnull(),'Hi There!')
print(df)
Output:
  Desired Column Output Column  bar  foo  idx
0            NaN           NaN  NaN  1.0    0
1      Hi There!     Hi There!  2.0  NaN    1
2            NaN           NaN  NaN  1.0    2
3            NaN           NaN  NaN  6.0    3
You can do the same thing to replace the NaN's with ''.
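Since this answer leans on where, a minimal sketch of its semantics may help: where(cond, other) keeps the original values wherever cond is True and substitutes other everywhere else.

import pandas as pd

s = pd.Series([1, 2, 3])
# Keep values where the condition holds, substitute 0 elsewhere:
print(s.where(s > 1, 0))
0    0
1    2
2    3
dtype: int64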

pandas not setting column correctly

I have the following program in Python:
# input
import pandas as pd
import numpy as np
data = pd.DataFrame({'a':pd.Series([1.,2.,3.]), 'b':pd.Series([4.,np.nan,6.])})
Here the data is:
In: print data
   a   b
0  1   4
1  2 NaN
2  3   6
Now I want a isnull column indicating if the row has any nan:
# create data
data['isnull'] = np.zeros(len(data))
data['isnull'][pd.isnull(data).any(axis=1)] = 1
The output is not correct (the second row's isnull should be 1):
In: print data
   a   b  isnull
0  1   4       0
1  2 NaN       0
2  3   6       0
However, if I execute the exact command again, the output will be correct:
data['isnull'][pd.isnull(data).any(axis=1)] = 1
print data
   a   b  isnull
0  1   4       0
1  2 NaN       1
2  3   6       0
Is this a bug with pandas or am I missing something obvious?
My Python version is 2.7.6, pandas is 0.12.0, and numpy is 1.8.0.
You're using chained indexing, which doesn't give reliable results in pandas. I would do the following:
data['isnull'] = pd.isnull(data).any(axis=1).astype(int)
print data
   a   b  isnull
0  1   4       0
1  2 NaN       1
2  3   6       0
For more on the problems with chained indexing, see here:
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#indexing-view-versus-copy
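If you prefer to keep the two-step style from the question, a sketch that avoids chained indexing entirely is a single .loc assignment (one indexing operation, so the write lands on the original frame rather than on a temporary copy):

# Initialize the flag column, then set it with one .loc call:
data['isnull'] = 0
data.loc[pd.isnull(data).any(axis=1), 'isnull'] = 1
print(data)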
