pandas pivot table aggfunc troubleshooting - python

This DataFrame has two columns, both of object dtype.
Dependents Married
0 0 No
1 1 Yes
2 0 Yes
3 0 Yes
4 0 No
I want to aggregate 'Dependents' based on 'Married'.
table = df.pivot_table(
    values='Dependents',
    index='Married',
    aggfunc=lambda x: x.map({'0': 0, '1': 1, '2': 2, '3': 3}).mean())
This works. Surprisingly, however, the following doesn't:
table = df.pivot_table(values='Dependents',
                       index='Married',
                       aggfunc=lambda x: x.map(int).mean())
It produces None instead.
Can anyone help explain?

Both examples of code provided in your question work. However, they are not the idiomatic way to achieve what you want to do -- particularly the first one.
I think this is the proper way to obtain the expected behavior.
import pandas as pd

# Test data
df = pd.DataFrame({'Dependents': ['0', '1', '0', '0', '0'],
                   'Married': ['No', 'Yes', 'Yes', 'Yes', 'No']})
# Converting object to int
df['Dependents'] = df['Dependents'].astype(int)
# Computing the mean by group
df.groupby('Married').mean()
Dependents
Married
No 0.00
Yes 0.33
However, the following code works.
df.pivot_table(values='Dependents', index='Married',
               aggfunc=lambda x: x.map(int).mean())
It is equivalent to converting to int with map before pivoting the data, which is more readable:
df['Dependents'] = df['Dependents'].map(int)
df.pivot_table(values = 'Dependents', index = 'Married')
Edit
If you have a messy DataFrame, you can use to_numeric with the errors parameter set to 'coerce'. As the docs put it: "If 'coerce', then invalid parsing will be set as NaN".
# Test data
df = pd.DataFrame({'Dependents': ['0', '1', '2', '3+', 'NaN'],
                   'Married': ['No', 'Yes', 'Yes', 'Yes', 'No']})
df['Dependents'] = pd.to_numeric(df['Dependents'], errors='coerce')
print(df)
Dependents Married
0 0.0 No
1 1.0 Yes
2 2.0 Yes
3 NaN Yes
4 NaN No
print(df.groupby('Married').mean())
Dependents
Married
No 0.0
Yes 1.5

My original question was why method 2, using map(int), doesn't work. None of the above answers that question, so there is no best answer.
However, looking back, I find that in pandas 0.22 method 2 does work. I guess the problem was in pandas.
To robustly do the aggregation, my solution would be
df.pivot_table(
    values='Dependents',
    index='Married',
    aggfunc=lambda col: col.map(lambda s: int(s.strip('+'))).mean())
To make it cleaner, I guess you could first convert the column "Dependents" to integer and then do the aggregation.
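For example, a minimal sketch of that cleaner version (assuming the column only contains digit strings, optionally ending in '+', as in the data above):

import pandas as pd

df = pd.DataFrame({'Dependents': ['0', '1', '2', '3+', '0'],
                   'Married': ['No', 'Yes', 'Yes', 'Yes', 'No']})

# Strip a trailing '+' (e.g. '3+' -> '3'), convert to int, then aggregate
df['Dependents'] = df['Dependents'].str.rstrip('+').astype(int)
table = df.pivot_table(values='Dependents', index='Married')  # default aggfunc is the mean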

Related

pandas DataFrame (easy?) manipulation

df = pd.DataFrame({'id': ['id1', 'id1', 'id2', 'id2'],
                   'value': ['1', '2', '10', '20'],
                   'index': ['day1', 'day2', 'day1', 'day2']})
How can I transform this data correctly (and concisely) with pandas so that it results in:
| id1 | id2
day1 : 1 | 10
day2 : 2 | 20
Maybe something with groupby but without aggregation? I don't know what to google; can you help me?
Thank you very much.
Use pandas pivot_table. It reshapes the DataFrame based on the given index, columns, and values.
pd.pivot_table(df, index=['index'], columns=['id'],values='value').reset_index()
Just remember to convert the 'value' column to float or integer first.
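For instance, a minimal sketch of that conversion step (the 'value' column holds strings in the example data, so the default mean aggregation would not work as intended otherwise):

import pandas as pd

df = pd.DataFrame({'id': ['id1', 'id1', 'id2', 'id2'],
                   'value': ['1', '2', '10', '20'],
                   'index': ['day1', 'day2', 'day1', 'day2']})

# Convert the string values to numbers before pivoting
df['value'] = pd.to_numeric(df['value'])
out = pd.pivot_table(df, index=['index'], columns=['id'], values='value').reset_index()
print(out)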

Change None value in a column to string 'None' in Python

Some columns in my data set have missing values that are represented as None (NoneType, not a string). Some other missing values are represented as 'N/A' or 'No'. I want to be able to handle these missing values with the method below.
df.loc[df.col1.isin('None', 'Yes', 'No'), 'col1'] = 'N/A'
Now my problem is that None is a value, not a string, so I cannot match it as 'None'. I have read somewhere that we can convert that None value to the string 'None'.
Can anyone kindly give me a clue how to go about it?
Note 1:
Just for clarity of explanation, if I run the code below:
df.col1.unique()
I get this output:
array([None, 'No', 'Yes'], dtype=object)
Note 2:
I know I can handle missing or None values with isnull(), but in this case I need to use the .isin() method.
Sample dataframe:
f = {'name': ['john', 'tom', None, 'rock', 'dick'], 'DoB': [None, '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', None]}
df1 = pd.DataFrame(data = f)
When you run the code below you will see None as a value.
df1.Address.unique()
output: array(['NY', 'NJ', 'PA', None], dtype=object)
I want the None to be displayed as 'None'
There is a difference between a null/None and the string 'None', so you can change your original statement to
df.loc[df.col1.isin([None, 'Yes', 'No']), 'col1'] = 'N/A'
That is, drop the quotes around None.
Or you can first find all the rows where a null/None exists and set those entries to the string 'None'; then you can use your original statement.
df.loc[df["col1"].isnull(), "col1"] = 'None'
Create an example df:
df = pd.DataFrame({"A": [None, 'Yes', 'No', 1, 3, 5]})
which looks like:
A
0 None
1 Yes
2 No
3 1
4 3
5 5
Replace your 'None' with None and pass the values to match as a list (that's how isin works):
df.loc[df.A.isin([None, 'Yes', 'No']), 'A'] = 'N/A'
which returns:
A
0 N/A
1 N/A
2 N/A
3 1
4 3
5 5

Convert all numeric columns of dataframe to absolute value

I want to convert all numeric columns in a dataframe to their absolute values and am doing this:
df = df.abs()
However, it gives the error:
*** TypeError: bad operand type for abs(): 'unicode'
How do I fix this? I would really prefer not to have to specify the column names manually.
You could use np.issubdtype with apply to check whether each column's dtype is a subtype of np.number. Using @Amy Tavory's example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['-1', '2'], 'b': [-1, 2]})
res = df.apply(lambda x: x.abs() if np.issubdtype(x.dtype, np.number) else x)
In [14]: res
Out[14]:
a b
0 -1 1
1 2 2
Or you could use np.dtype.kind to check whether your dtype is numeric:
res1 = df.apply(lambda x: x.abs() if x.dtype.kind in 'iufc' else x)
In [20]: res1
Out[20]:
a b
0 -1 1
1 2 2
Note: you may also be interested in the NumPy dtype hierarchy.
Borrowing from an answer to this question, how about selecting the columns that are numeric?
Say you start with
df = pd.DataFrame({'a': ['-1', '2'], 'b': [-1, 2]})
>>> df
a b
0 -1 -1
1 2 2
Then just do
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
for c in [c for c in df.columns if df[c].dtype in numerics]:
    df[c] = df[c].abs()
>>> df
a b
0 -1 1
1 2 2
Faster than the existing answers and more to the point:
df.update(df.select_dtypes(include=[np.number]).abs())
(Careful: I noticed that the update sometimes doesn't do anything when df has a non-trivial multi-index. I'll update this answer once I figure out where the problem is. This definitely works fine for trivial range-indices)
If you know which columns you want to change to absolute values, use this:
df.iloc[:, 2:7] = df.iloc[:, 2:7].abs()
which changes all values in the third through seventh columns (positions 2 to 6) to their absolute values.
If you don't, you can create a list of the column names whose dtype is not object
col_list = [col for col in df.columns if df[col].dtype != object]
Then use .loc instead
df.loc[:,col_list] = df.loc[:,col_list].abs()
I know it is wordy, but I think it avoids the slower apply/lambda route.
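If you'd rather not hand-maintain the dtype list from the earlier answer, select_dtypes can build col_list for you; a minimal sketch combining it with the same .loc assignment (assuming "numeric" means the NumPy number dtypes):

import numpy as np

# Select only the numeric columns, then overwrite them with their absolute values
num_cols = df.select_dtypes(include=[np.number]).columns
df.loc[:, num_cols] = df.loc[:, num_cols].abs()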

Convert row to column header for Pandas DataFrame

The data I have to work with is a bit messy. It has header names inside its data. How can I choose a row from an existing pandas DataFrame and make it (rename it to) the column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
0 1 2
0 1 2 3
1 foo bar baz
2 4 5 6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1 foo bar baz
0 1 2 3
2 4 5 6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1 foo bar baz
0 1 2 3
2 4 5 6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
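If you want to guarantee a unique index up front, one simple way is a plain reset; a minimal sketch:

# Give the frame a fresh, unique RangeIndex before dropping rows by label
df = df.reset_index(drop=True)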
This works (pandas v'0.19.2'):
df.rename(columns=df.iloc[0])
It would be easier to recreate the DataFrame.
This would also infer the column types from scratch.
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace=True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace=True)
You can specify the header row in the read_csv or read_html functions via the header parameter, which represents the "Row number(s) to use as the column names, and the start of the data". This has the advantage of automatically dropping all the preceding rows, which are presumably junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
df = pd.read_csv(StringIO(csv), header=2)
print(df)
Out[1]
pears apples lemons plums other
0 40 50 61 72 85
Keeping it Python simple
Pandas DataFrames have a columns attribute, so why not use it with standard Python? It is much clearer what you are doing:
table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
         ['testsala.cxf', '86', '95', '92', '87'],
         ['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
         ['630.cxf', '18', '8', '11', '18'],
         ['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
         ['dedo uv no filtro.cxf', '52', '93', '48', '58']]
import pandas as pd
data = pd.DataFrame(table[1:],columns=table[0])
or, if the header is not the first row but, for instance, the row at index 10:
columns = table.pop(10)
data = pd.DataFrame(table,columns=columns)

Frequency tables in pandas (like plyr in R)

My problem is how to calculate frequencies over multiple variables in pandas.
I want to go from this dataframe:
d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
                   'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
                   'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']},
                  columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
to the following result:
Participated OfWhichpassed
ExamenYear
2007 3 2
2008 4 3
2009 3 2
(1) One possibility I tried is to compute two dataframes and bind them:
t1 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Participated'], aggfunc = len)
t2 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Passed'], aggfunc = len)
tx = pd.concat([t1, t2] , axis = 1)
Res1 = tx['yes']
(2) The second possibility is to use an aggregation function.
import collections
dg = d1.groupby('ExamenYear')
Res2 = dg.agg({'Participated': len,'Passed': lambda x : collections.Counter(x == 'yes')[True]})
Res2.columns = ['Participated', 'OfWhichpassed']
Both ways are awkward, to say the least.
How is this done properly in pandas?
P.S.: I also tried value_counts instead of collections.Counter but could not get it to work.
For reference: a few months ago, I asked a similar question for R here, and plyr could help.
---- UPDATE ------
User DSM is right: there was a mistake in the desired table result.
(1) The code for option one is
t1 = d1.pivot_table(values='StudentID', rows=['ExamenYear'], aggfunc=len)
t2 = d1.pivot_table(values='StudentID', rows=['ExamenYear'], cols=['Participated'], aggfunc=len)
t3 = d1.pivot_table(values='StudentID', rows=['ExamenYear'], cols=['Passed'], aggfunc=len)
Res1 = pd.DataFrame({'All': t1,
                     'OfWhichParticipated': t2['yes'],
                     'OfWhichPassed': t3['yes']})
It will produce the result
All OfWhichParticipated OfWhichPassed
ExamenYear
2007 3 2 2
2008 4 3 3
2009 3 3 2
(2) For option 2, thanks to user herrfz, I figured out how to use value_counts, and the code becomes
Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
                                     'Participated': lambda x: x.value_counts()['yes'],
                                     'Passed': lambda x: x.value_counts()['yes']})
Res2.columns = ['All', 'OfWhichParticipated', 'OfWhichPassed']
which will produce the same result as Res1
My question remains, though:
Using option 2, is it possible to use the same variable twice (for another operation)? And can one pass a custom name for the resulting variable?
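For what it's worth, newer pandas (0.25 and later) supports named aggregation, which covers both points: the same column can be used more than once, and you choose the names of the resulting columns. A minimal sketch:

def count_yes(s):
    # Number of 'yes' entries in a column
    return (s == 'yes').sum()

Res3 = d1.groupby('ExamenYear').agg(
    All=('StudentID', 'size'),
    OfWhichParticipated=('Participated', count_yes),
    OfWhichPassed=('Passed', count_yes),
    PassRate=('Passed', lambda s: (s == 'yes').mean()),  # 'Passed' reused under a custom name
)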
---- A NEW UPDATE ----
I have finally decided to use apply, which I understand is more flexible.
I am posting what I came up with, hoping that it can be useful for others.
From what I understand from Wes' book "Python for Data Analysis":
apply is more flexible than agg and transform because you can define your own function.
The only requirement is that the function returns a pandas object or a scalar value.
The inner mechanics: the function is called on each piece of the grouped object, and the results are glued together using pandas.concat.
One needs to "hard-code" the structure you want at the end.
Here is what I came up with
def ZahlOccurence_0(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes')})
When I run it:
d1.groupby('ExamenYear').apply(ZahlOccurence_0)
I get the correct results
All Part Pass
ExamenYear
2007 3 2 2
2008 4 3 3
2009 3 3 2
This approach would also allow me to combine frequencies with other stats
import numpy as np
d1['testValue'] = np.random.randn(len(d1))
def ZahlOccurence_1(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes'),
                      'test': x['testValue'].mean()})
d1.groupby('ExamenYear').apply(ZahlOccurence_1)
All Part Pass test
ExamenYear
2007 3 2 2 0.358702
2008 4 3 3 1.004504
2009 3 3 2 0.521511
I hope someone else will find this useful
You may use the pandas crosstab function, which by default computes a frequency table of two or more variables. For example,
> import pandas as pd
> pd.crosstab(d1['ExamenYear'], d1['Passed'])
Passed no yes
ExamenYear
2007 1 2
2008 1 3
2009 1 2
Use the margins=True option if you also want to see the subtotal of each row and column.
> pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
Participated no yes All
ExamenYear
2007 1 2 3
2008 1 3 4
2009 0 3 3
All 2 8 10
This:
d1.groupby('ExamenYear').agg({'Participated': len,
                              'Passed': lambda x: sum(x == 'yes')})
doesn't look way more awkward than the R solution, IMHO.
There is another approach that I like to use for similar problems; it uses groupby and unstack:
d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
                   'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
                   'Passed': ['no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']},
                  columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
(this is just the raw data from above)
d2 = d1.groupby("ExamenYear").Participated.value_counts().unstack(fill_value=0)['yes']
d3 = d1.groupby("ExamenYear").Passed.value_counts().unstack(fill_value=0)['yes']
d2.name = "Participated"
d3.name = "Passed"
pd.DataFrame(data=[d2,d3]).T
Participated Passed
ExamenYear
2007 2 2
2008 3 3
2009 3 2
This solution is slightly more cumbersome than the one above using apply, but this one is easier to understand and extend, I feel.
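As an aside, the same table can also be built in one step by counting the 'yes' values per column inside each group; a small sketch of that variant:

# Count 'yes' in both columns within each ExamenYear group
(d1.groupby('ExamenYear')[['Participated', 'Passed']]
   .apply(lambda g: (g == 'yes').sum()))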
