Pandas: Select column by location and rows by value - python

I have a dataframe where one of the column names is a variable:
xx = pd.DataFrame([{'ID':1, 'Name': 'Abe', 'HasCar':1},
{'ID':2, 'Name': 'Ben', 'HasCar':0},
{'ID':3, 'Name': 'Cat', 'HasCar':1}])
ID Name HasCar
0 1 Abe 1
1 2 Ben 0
2 3 Cat 1
In this dummy example, column 2 could be "HasCar", or "IsStaff", or some other value I can't know in advance. I want to select all rows where column 2 is True, whatever the column name is.
I've tried the following without success:
xx.iloc[:,[2]] == 1
HasCar
0 True
1 False
2 True
and then trying to use that as an index results in:
xx[xx.iloc[:,[2]] == 1]
ID Name HasCar
0 NaN NaN 1.0
1 NaN NaN NaN
2 NaN NaN 1.0
Which isn't helpful. I suppose I could go about renaming column 2, but that feels a little wrong. The issue seems to be that xx.iloc[:,[2]] returns a DataFrame while xx['HasCar'] returns a Series. I can't figure out how to force an (n, 1) shaped DataFrame into a Series without knowing the column name, as described here.
Any ideas?

You were almost there, but you sliced in 2D; use 1D (Series) slicing instead:
xx[xx.iloc[:, 2] == 1]
Output:
ID Name HasCar
0 1 Abe 1
2 3 Cat 1
The difference:
# 2D slicing, this gives a DataFrame (with a single column)
xx.iloc[:,[2]]
HasCar
0 1
1 0
2 1
# 1D slicing, as Series
xx.iloc[:,2]
0 1
1 0
2 1
Name: HasCar, dtype: int64
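For the side question of forcing an (n, 1) DataFrame into a Series without knowing the column name, squeeze(axis=1) does exactly that. A minimal sketch on the data from the question:
import pandas as pd
xx = pd.DataFrame([{'ID': 1, 'Name': 'Abe', 'HasCar': 1},
                   {'ID': 2, 'Name': 'Ben', 'HasCar': 0},
                   {'ID': 3, 'Name': 'Cat', 'HasCar': 1}])
# 1D positional slice gives a Series, which works directly as a boolean mask
print(xx[xx.iloc[:, 2] == 1])
# If you already have the (n, 1) DataFrame, squeeze(axis=1) collapses it to a Series
print(xx[xx.iloc[:, [2]].squeeze(axis=1) == 1])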

Related

Query errors in a pandas DataFrame

I am facing an issue with queries on a simple pandas DataFrame, and I am sure I am missing a tiny detail. Can you help me with this?
I don't understand why I get only NaN values.
You can change [['date']] to ['date'] to select a Series instead of a one-column DataFrame.
Sample:
df = pd.DataFrame({'A':[1,2,3]})
print (df['A'])
0 1
1 2
2 3
Name: A, dtype: int64
print (df[['A']])
A
0 1
1 2
2 3
print (df[df['A'] == 1])
A
0 1
print (df[df[['A']] == 1])
A
0 1.0
1 NaN
2 NaN
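The asker's exact query isn't shown, but assuming it compared a double-bracket selection such as df[['date']] against a value, the fix is the same single-bracket change. A small sketch on made-up data (the 'date' column and values are illustrative only):
import pandas as pd
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02', '2021-01-03'], 'value': [10, 20, 30]})
# DataFrame comparison keeps the frame's shape and fills non-matching cells with NaN
print(df[df[['date']] == '2021-01-01'])   # mostly NaN, not a row filter
# Series comparison gives a boolean mask that selects whole rows
print(df[df['date'] == '2021-01-01'])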

Python pandas: map and return NaN

I have two data frames; the first one is:
id code
1 2
2 3
3 3
4 1
and the second one is:
id code name
1 1 Mary
2 2 Ben
3 3 John
I would like to map data frame 1 so that it looks like:
id code name
1 2 Ben
2 3 John
3 3 John
4 1 Mary
I tried to use this code:
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
My mapping is correct, but the mapped values are all NaN:
mapping = {1:"Mary", 2:"Ben", 3:"John"}
id code name
1 2 NaN
2 3 NaN
3 3 NaN
4 1 NaN
Does anyone know why, and how to solve it?
The problem is that the values in the code column have different types in the two frames, so you need to convert one side with astype (to integers or to strings) so that both have the same type:
print (df1['code'].dtype)
object
print (df2['code'].dtype)
int64
print (type(df1.loc[0, 'code']))
<class 'str'>
print (type(df2.loc[0, 'code']))
<class 'numpy.int64'>
mapping = dict(df2[['code','name']].values)
# Option 1: same dtypes - integers (cast df1['code'] to match df2)
df1['name'] = df1['code'].astype(int).map(mapping)
# Option 2: same dtypes - object/strings (cast df2['code'] to match df1 and rebuild the mapping)
df2['code'] = df2['code'].astype(str)
mapping = dict(df2[['code','name']].values)
df1['name'] = df1['code'].map(mapping)
print (df1)
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary
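A slightly more idiomatic way to build the same mapping, once the dtypes are aligned as above, is to map against a Series indexed by code (just a sketch, assuming df2['code'] is still numeric):
# cast df1's string codes to int to match df2, then look names up via the index
df1['name'] = df1['code'].astype(int).map(df2.set_index('code')['name'])
print(df1)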
An alternative way is to use DataFrame.merge (the code dtypes still need to match, as above):
df1.merge(df2.drop(columns='id'), how='left', on='code')
Output:
id code name
0 1 2 Ben
1 2 3 John
2 3 3 John
3 4 1 Mary

Python - Dataframe - Splitting and stacking string column

I have a dataframe that is generated by the following code:
data={'ID':[1,2,3],'String': ['xKx;yKy;zzz','-','z01;x04']}
frame=pd.DataFrame(data)
I would like to transform the frame dataframe into a dataframe that looks like this:
data_trans={'ID':[1,1,1,2,3,3],'String': ['xKx','yKy','zzz','-','z01','x04']}
frame_trans=pd.DataFrame(data_trans)
So, in words, I would like the elements of "String" in data split at the ";" and then stacked underneath each other in a new dataframe, with the associated ID duplicated accordingly. Splitting, of course, is not hard in principle, but I am having trouble with the stacking.
I would appreciate it if you could offer me some hints on how to approach this in Python. Many thanks!!
I'm not sure this is the best way to do this, but here is a working approach:
data={'ID':[1,2,3],'String': ['xKx;yKy;zzz','-','z01;x04']}
frame=pd.DataFrame(data)
print(frame)
data_trans={'ID':[1,1,1,2,3,3],'String': ['xKx','yKy','zzz','-','z01','x04']}
frame_trans=pd.DataFrame(data_trans)
print(frame_trans)
frame2 = frame.set_index('ID')
# This next line does almost all the work. It can be very memory intensive.
frame3 = frame2['String'].str.split(';').apply(pd.Series, 1).stack().reset_index()[['ID', 0]]
frame3.columns = ['ID', 'String']
print(frame3)
# Verbose version
# Setting the index makes it easy to have the index column be repeated for each value later
frame2 = frame.set_index('ID')
print("frame2")
print(frame2)
#Make one column for each of the values in the multi-value columns
frame3a = frame2['String'].str.split(';').apply(pd.Series, 1)
print("frame3a")
print(frame3a)
# Convert from a wide-data format to a long-data format
frame3b = frame3a.stack()
print("frame3b")
print(frame3b)
# Get only the columns we care about
frame3c = frame3b.reset_index()[['ID', 0]]
print("frame3c")
print(frame3c)
# The columns we have have the wrong titles. Let's fix that
frame3d = frame3c.copy()
frame3d.columns = ['ID', 'String']
print("frame3d")
print(frame3d)
Output:
frame2
String
ID
1 xKx;yKy;zzz
2 -
3 z01;x04
frame3a
0 1 2
ID
1 xKx yKy zzz
2 - NaN NaN
3 z01 x04 NaN
frame3b
ID
1  0    xKx
   1    yKy
   2    zzz
2  0      -
3  0    z01
   1    x04
dtype: object
frame3c
ID 0
0 1 xKx
1 1 yKy
2 1 zzz
3 2 -
4 3 z01
5 3 x04
frame3d
ID String
0 1 xKx
1 1 yKy
2 1 zzz
3 2 -
4 3 z01
5 3 x04
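On pandas 0.25 or newer, the same split-and-stack can be written more directly with explode; a sketch on the same data:
import pandas as pd
frame = pd.DataFrame({'ID': [1, 2, 3], 'String': ['xKx;yKy;zzz', '-', 'z01;x04']})
# Split into lists, then explode to one list element per row; the ID is repeated automatically
frame_trans = frame.assign(String=frame['String'].str.split(';')).explode('String').reset_index(drop=True)
print(frame_trans)
#    ID String
# 0   1    xKx
# 1   1    yKy
# 2   1    zzz
# 3   2      -
# 4   3    z01
# 5   3    x04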

Drop Rows by Multiple Column Criteria in DataFrame

I have a pandas dataframe from which I'm trying to drop rows based on criteria across selected columns. If the values in these selected columns are all zero, the row should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
a b c
0 1 1 1
1 0 2 2
2 0 0 3
3 2 0 4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.
Your problem here is that you first assigned the result of your boolean condition, t = t[t[cols_of_interest]!=0], which overwrites your original df and sets the entries where the condition is not met to NaN.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that a row must contain at least one non-NaN value to survive; we can then use loc with the index of that result to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
EDIT
As pointed out by @DSM, you can achieve this more simply by using any with axis=1 to test the condition and using the result to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
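The condition can also be phrased the other way round, dropping rows where all the columns of interest are zero; a small sketch showing the two forms agree:
import pandas as pd
t = pd.DataFrame({'a': [1, 0, 0, 2], 'b': [1, 2, 0, 0], 'c': [1, 2, 3, 4]})
cols_of_interest = ['a', 'b']
# keep rows where at least one column of interest is non-zero ...
kept = t[(t[cols_of_interest] != 0).any(axis=1)]
# ... which is the same as dropping rows where all of them are zero
also_kept = t[~(t[cols_of_interest] == 0).all(axis=1)]
print(kept.equals(also_kept))  # True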

Pandas merge two dataframes with different columns

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except id like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want, producing NaN values where columns are absent from their respective dfs.
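One detail: older pandas sorted the union of columns alphabetically (as in the output above), while current pandas defaults to sort=False and keeps the first frame's column order, appending new columns at the end. A sketch using the frames from the question:
import pandas as pd
df_may = pd.DataFrame({'id': [1, 2, 3, 4], 'quantity': [20, 23, 19, 19],
                       'attr_1': [0, 1, 1, 0], 'attr_2': [1, 1, 1, 0]})
df_jun = pd.DataFrame({'id': [5, 6, 7, 8], 'quantity': [8, 13, 20, 25],
                       'attr_1': [1, 0, 1, 1], 'attr_3': [0, 1, 1, 1]})
combined = pd.concat([df_may, df_jun], ignore_index=True, sort=False)
print(combined.columns.tolist())
# ['id', 'quantity', 'attr_1', 'attr_2', 'attr_3']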
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has 3x trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
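Since _maybe_dedup_names is a private pandas internal whose location has moved between versions, a small standalone helper avoids depending on it; this is just a sketch, not a pandas API:
import pandas as pd
from collections import Counter

def dedup_columns(cols):
    # hypothetical helper: rename duplicates to name, name.1, name.2, ... (mirrors read_csv's mangling)
    seen = Counter()
    out = []
    for c in cols:
        out.append(c if seen[c] == 0 else f"{c}.{seen[c]}")
        seen[c] += 1
    return out

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
A.columns = dedup_columns(A.columns)
B.columns = dedup_columns(B.columns)
print(pd.concat([A, B], ignore_index=True))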
I had this problem today using concat, append, or merge, and I got around it by adding a sequentially numbered helper column and then doing an outer join:
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2, on='helper', how='outer')
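The same helper column can be built without the Python loops; a sketch using a shared running counter, assuming df1 and df2 as above:
# sequential, non-overlapping helper values for both frames
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)
df1.merge(df2, on='helper', how='outer')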
