Frequency tables in pandas (like plyr in R) - python

My problem is how to calculate frequencies on multiple variables in pandas.
I want to go from this dataframe:
d1 = pd.DataFrame( {'StudentID': ["x1", "x10", "x2","x3", "x4", "x5", "x6", "x7", "x8", "x9"],
'StudentGender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
columns = ['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
to the following result:
Participated OfWhichpassed
ExamenYear
2007 3 2
2008 4 3
2009 3 2
(1) One possibility I tried is to compute two dataframes and bind them:
t1 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Participated'], aggfunc = len)
t2 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Passed'], aggfunc = len)
tx = pd.concat([t1, t2] , axis = 1)
Res1 = tx['yes']
(2) The second possibility is to use an aggregation function:
import collections
dg = d1.groupby('ExamenYear')
Res2 = dg.agg({'Participated': len,'Passed': lambda x : collections.Counter(x == 'yes')[True]})
Res2.columns = ['Participated', 'OfWhichpassed']
Both ways are awkward, to say the least.
How is this done properly in pandas?
P.S.: I also tried value_counts instead of collections.Counter but could not get it to work.
For reference: a few months ago, I asked a similar question for R here, and plyr could help.
---- UPDATE ------
User DSM is right: there was a mistake in the desired table result.
(1) The code for option one is:
t1 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], aggfunc = len)
t2 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Participated'], aggfunc = len)
t3 = d1.pivot_table(values = 'StudentID', rows=['ExamenYear'], cols = ['Passed'], aggfunc = len)
Res1 = pd.DataFrame( {'All': t1,
'OfWhichParticipated': t2['yes'],
'OfWhichPassed': t3['yes']})
It will produce the result
All OfWhichParticipated OfWhichPassed
ExamenYear
2007 3 2 2
2008 4 3 3
2009 3 3 2
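(Note for readers on current pandas: the rows and cols keyword arguments of pivot_table were later renamed to index and columns. A rough sketch of the same option-one code against a recent pandas release, not tested on the old versions, would be:)
t1 = d1.pivot_table(values='StudentID', index='ExamenYear', aggfunc=len)
t2 = d1.pivot_table(values='StudentID', index='ExamenYear', columns='Participated', aggfunc=len)
t3 = d1.pivot_table(values='StudentID', index='ExamenYear', columns='Passed', aggfunc=len)
Res1 = pd.DataFrame({'All': t1['StudentID'],
                     'OfWhichParticipated': t2['yes'],
                     'OfWhichPassed': t3['yes']})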
(2) For option two, thanks to user herrfz, I figured out how to use value_counts, and the code is:
Res2 = d1.groupby('ExamenYear').agg({'StudentID': len,
'Participated': lambda x: x.value_counts()['yes'],
'Passed': lambda x: x.value_counts()['yes']})
Res2.columns = ['All', 'OfWhichParticipated', 'OfWhichPassed']
which will produce the same result as Res1
My question remains, though:
Using option two, is it possible to use the same variable twice (for another operation)? And can one pass a custom name for the resulting variable?
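(For the record: later pandas versions, 0.25 and up, added "named aggregation", which addresses exactly this. A small sketch, assuming such a version is available:)
Res3 = d1.groupby('ExamenYear').agg(
    All=('StudentID', 'size'),
    OfWhichParticipated=('Participated', lambda x: (x == 'yes').sum()),
    OfWhichPassed=('Passed', lambda x: (x == 'yes').sum()),
    PassRate=('Passed', lambda x: (x == 'yes').mean()),  # same column used twice, custom output name
)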
---- A NEW UPDATE ----
I have finally decided to use apply, which I understand is more flexible.
I am posting what I came up with, hoping that it can be useful to others.
From what I understand from Wes' book "Python for Data Analysis":
apply is more flexible than agg and transform because you can define your own function;
the only requirement is that the function returns a pandas object or a scalar value;
as for the inner mechanics, the function is called on each piece of the grouped object and the results are glued together using pandas.concat;
one needs to "hard-code" the structure you want at the end.
Here is what I came up with:
def ZahlOccurence_0(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes')})
When I run it:
d1.groupby('ExamenYear').apply(ZahlOccurence_0)
I get the correct results
All Part Pass
ExamenYear
2007 3 2 2
2008 4 3 3
2009 3 3 2
This approach would also allow me to combine frequencies with other stats
import numpy as np
d1['testValue'] = np.random.randn(len(d1))
def ZahlOccurence_1(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': sum(x['Participated'] == 'yes'),
                      'Pass': sum(x['Passed'] == 'yes'),
                      'test': x['testValue'].mean()})
d1.groupby('ExamenYear').apply(ZahlOccurence_1)
All Part Pass test
ExamenYear
2007 3 2 2 0.358702
2008 4 3 3 1.004504
2009 3 3 2 0.521511
I hope someone else will find this useful

You may use the pandas crosstab function, which by default computes a frequency table of two or more variables. For example:
> import pandas as pd
> pd.crosstab(d1['ExamenYear'], d1['Passed'])
Passed no yes
ExamenYear
2007 1 2
2008 1 3
2009 1 2
Use the margins=True option if you also want to see the subtotal of each row and column.
> pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
Participated no yes All
ExamenYear
2007 1 2 3
2008 1 3 4
2009 0 3 3
All 2 8 10
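If you want the one-row-per-year summary from the question, you can assemble it from the 'yes' columns of two crosstabs; a small sketch:
res = pd.DataFrame({'All': d1.groupby('ExamenYear').size(),
                    'OfWhichParticipated': pd.crosstab(d1['ExamenYear'], d1['Participated'])['yes'],
                    'OfWhichPassed': pd.crosstab(d1['ExamenYear'], d1['Passed'])['yes']})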

This:
d1.groupby('ExamenYear').agg({'Participated': len,
'Passed': lambda x: sum(x == 'yes')})
doesn't look way more awkward than the R solution, IMHO.

There is another approach that I like to use for similar problems; it uses groupby and unstack:
d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2","x3", "x4", "x5", "x6", "x7", "x8", "x9"],
'StudentGender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
columns = ['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
(this is just the raw data from above)
d2 = d1.groupby("ExamenYear").Participated.value_counts().unstack(fill_value=0)['yes']
d3 = d1.groupby("ExamenYear").Passed.value_counts().unstack(fill_value=0)['yes']
d2.name = "Participated"
d3.name = "Passed"
pd.DataFrame(data=[d2,d3]).T
Participated Passed
ExamenYear
2007 2 2
2008 3 3
2009 3 2
This solution is slightly more cumbersome than the one above using apply, but I feel it is easier to understand and extend.
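As an example of extending it, the overall count per year can be added with one more Series; a small sketch building on d2 and d3 from above:
all_per_year = d1.groupby('ExamenYear').size()
all_per_year.name = 'All'
pd.concat([all_per_year, d2, d3], axis=1)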

Related

Using Python 'in' operator to check if Dataframe column values are in list of strings results in ValueError

I have a dataset similar to this one:
Mother ID ChildID ethnicity
0 1 1 White Other
1 2 2 Indian
2 3 3 Black
3 4 4 Other
4 4 5 Other
5 5 6 Mixed-White and Black
To simplify my dataset and make it more relevant to the classification I am performing, I want to categorise ethnicities into 3 categories as such:
White: within this category I will include 'White British' and 'White Other' values
South Asian: the category will include 'Pakistani', 'Indian', 'Bangladeshi'
Other: 'Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian' values
So I want the above dataset to be transformed to:
Mother ID ChildID ethnicity
0 1 1 White
1 2 2 South Asian
2 3 3 Other
3 4 4 Other
4 4 5 Other
5 5 6 Other
To do this I have run the following code, similar to the one provided in this answer:
col = 'ethnicity'
conditions = [ (df[col] in ('White British', 'White Other')),
(df[col] in ('Indian', 'Pakistani', 'Bangladeshi')),
(df[col] in ('Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian'))]
choices = ['White', 'South Asian', 'Other']
df["ethnicity"] = np.select(conditions, choices, default=np.nan)
But when running this, I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any idea why I am getting this error? Am I not handling the string comparison correctly? I am using a similar technique to manipulate other features in my dataset and it is working fine there.
I cannot say why in is not working, but isin definitely solves the problem; maybe someone else can explain why in has a problem.
conditions = [ (df[col].isin(('White British', 'White Other'))),
(df[col].isin(('Indian', 'Pakistani', 'Bangladeshi'))),
(df[col].isin(('Other', 'Black', 'Mixed-White and Black', 'Mixed-White and South Asian')))]
print(conditions)
choices = ['White', 'South Asian', 'Other']
df["ethnicity"] = np.select(conditions, choices, default=np.nan)
print(df)
output
Mother ID ChildID ethnicity
0 1 1 White
1 2 2 South Asian
2 3 3 Other
3 4 4 Other
4 4 5 Other
5 5 6 nan
With df[col] in some_tuple you are searching for df[col] inside some_tuple, which is obviously not what you want. What you want is df[col].isin(some_tuple), which returns a new Series of booleans of the same length as df[col].
So why do you get that error anyway? The logic for searching a value in a tuple is more or less the following:
for v in some_tuple:
    if df[col] == v:
        return True
return False
df[col] == v evaluates to a Series; no problem here.
Then Python tries to evaluate if result: and you get that error, because you have a Series in a condition clause, which means you are (implicitly) trying to evaluate a Series as a boolean; this is not allowed by pandas.
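To see the ambiguity concretely, here is a minimal sketch (the exact error message may vary slightly between pandas versions):
import pandas as pd

s = pd.Series(['a', 'b', 'c'])
mask = (s == 'a')     # element-wise comparison -> Series([True, False, False])
# bool(mask)          # raises: "The truth value of a Series is ambiguous..."
print(mask.any())     # True  - at least one element matches
print(mask.all())     # False - not every element matches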
For your problem, anyway, I would use Series.apply. It takes a function that maps a value to another; in your case, a function that maps each ethnicity to a category. There are many ways to define it (see the options below).
import numpy as np
import pandas as pd
d = pd.DataFrame({
'field': range(6),
'ethnicity': list('ABCDE') + [np.nan]
})
# Option 1: define a dict {ethnicity: category}
category_of = {
'A': 'X',
'B': 'X',
'C': 'Y',
'D': 'Y',
'E': 'Y',
np.nan: np.nan,
}
result = d.assign(category=d['ethnicity'].apply(category_of.__getitem__))
print(result)
# Option 2: define categories, then "invert" the dict.
categories = {
'X': ['A', 'B'],
'Y': ['C', 'D', 'E'],
np.nan: [np.nan],
}
# If you do this frequently you could define a function invert_mapping(d):
category_of = {eth: cat
for cat, values in categories.items()
for eth in values}
result = d.assign(category=d['ethnicity'].apply(category_of.__getitem__))
print(result)
# Option 3: define a function (a little less efficient)
def ethnicity_to_category(ethnicity):
    if ethnicity in {'A', 'B'}:
        return 'X'
    if ethnicity in {'C', 'D', 'E'}:
        return 'Y'
    if pd.isna(ethnicity):
        return np.nan
    raise ValueError('unknown ethnicity: %s' % ethnicity)
result = d.assign(category=d['ethnicity'].apply(ethnicity_to_category))
print(result)
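As a side note, Series.map does the same dictionary lookup and simply leaves values that are not in the mapping as NaN instead of raising a KeyError; a sketch reusing the category_of dict from above:
result = d.assign(category=d['ethnicity'].map(category_of))
print(result)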

How do I create a data frame and form two separate columns from these two outputs?

I am trying to have a dataframe that includes the following two outputs, side by side as columns:
finalcust = mainorder_df, custname1_df
print(finalcust)
finalcust
Out[46]:
(10 10103.0
26 10104.0
39 10105.0
54 10106.0
72 10107.0
...
2932 10418.0
2941 10419.0
2955 10420.0
2977 10424.0
2983 10425.0
Name: ordernumber, Length: 213, dtype: float64,
1 Signal Gift Stores
2 Australian Collectors, Co.
3 La Rochelle Gifts
4 Baane Mini Imports
5 Mini Gifts Distributors Ltd.
...
117 Motor Mint Distributors Inc.
118 Signal Collectibles Ltd.
119 Double Decker Gift Stores, Ltd
120 Diecast Collectables
121 Kelly's Gift Shop
Name: customerName, Length: 91, dtype: object)
I have tried pd.merge but it says I am not allowed since there is no common column.
Anyone have any idea?
What are you actually trying to accomplish?
General Merging with df.merge()
The data frames cannot be merged because they are not related in any way. Pandas expects them to have a common column in order to know how to merge; see the pandas.DataFrame.merge docs.
Example: If you wanted to take information from a customer information sheet and add it to an order list.
import pandas as pd
customers = ['A', 'B', 'C', 'D']
addresses = ['Address_A', 'Address_B', 'Address_C', 'Address_D']
df1 = pd.DataFrame({'Customer': customers,
'Info': addresses})
df2 = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D','A','B','C','D','A','B'],
'Order': [1,2,3,4,5,6,7,8,9,10]})
df = df1.merge(df2)
df =
Customer Info Order
0 A Address_A 1
1 A Address_A 5
2 A Address_A 9
3 B Address_B 2
4 B Address_B 6
5 B Address_B 10
6 C Address_C 3
7 C Address_C 7
8 D Address_D 4
9 D Address_D 8
Combining with df.concat()
If they were the same size, you would use concat to combine them. There is a post about it here
Example: Adding a new list of customers to the customer df
import pandas as pd
customers = ['A', 'B', 'C', 'D']
addresses = ['Address_A', 'Address_B', 'Address_C', 'Address_D']
new_customers = ['E', 'F', 'G', 'H']
new_addresses = ['Address_E', 'Address_F', 'Address_G', 'Address_H']
df1 = pd.DataFrame({'Customer': customers,
'Info': addresses})
df2 = pd.DataFrame({'Customer': new_customers,
'Info': new_addresses})
df = pd.concat([df1, df2])
df =
Customer Info
0 A Address_A
1 B Address_B
2 C Address_C
3 D Address_D
0 E Address_E
1 F Address_F
2 G Address_G
3 H Address_H
Combining "Side by Side" by Adding a New Column
The side by side method of combination would be adding a column.
Example: Adding a new column to customer information df.
import pandas as pd
customers = ['A', 'B', 'C', 'D']
addresses = ['Address_A', 'Address_B', 'Address_C', 'Address_D']
phones = [1,2,3,4]
df = pd.DataFrame({'Customer': customers,
'Info': addresses})
df['Phones'] = phones
df =
Customer Info Phones
0 A Address_A 1
1 B Address_B 2
2 C Address_C 3
3 D Address_D 4
Actually Doing...?
If you are trying to assign a customer name to an order, that can't be done with the data you have here.
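For what it's worth, if the goal is only to place the two Series side by side as columns (accepting NaN where the shorter one runs out), concatenating along axis=1 after resetting the indexes is one way; a sketch, assuming mainorder_df and custname1_df are the two Series printed in the question:
import pandas as pd

finalcust = pd.concat([mainorder_df.reset_index(drop=True),
                       custname1_df.reset_index(drop=True)], axis=1)
print(finalcust)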
Hope this helps.

Weird behaviour with groupby on ordered categorical columns

MCVE
df = pd.DataFrame({
'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'],
'ID': [1, 1, 1, 2, 2, 2]
})
df.Cat = pd.Categorical(
df.Cat, categories=['R64', 'SF', 'F', 'W'], ordered=True)
As you can see, I've defined an ordered categorical column on Cat. To verify, check:
0 SF
1 W
2 F
3 R64
4 SF
5 F
Name: Cat, dtype: category
Categories (4, object): [R64 < SF < F < W]
I want to find the largest category PER ID. Doing groupby + max works.
df.groupby('ID').Cat.max()
ID
1 W
2 F
Name: Cat, dtype: object
But I don't want ID to be the index, so I specify as_index=False.
df.groupby('ID', as_index=False).Cat.max()
ID Cat
0 1 W
1 2 SF
Oops! Now, the max is taken lexicographically. Can anyone explain whether this is intended behaviour? Or is this a bug?
Note, for this problem, the workaround is df.groupby('ID').Cat.max().reset_index().
Note,
>>> pd.__version__
'0.22.0'
This is not intended behaviour; it's a bug.
Source diving shows the flag leads to two completely different code paths: one simply ignores the grouper levels and names and just takes the values with a new range index, while the other clearly keeps them.
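Until that is fixed, the workaround mentioned in the question keeps the categorical ordering; as a small sketch:
out = df.groupby('ID').Cat.max().reset_index()   # aggregate first, then restore ID as a column
print(out)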

find difference between any two columns of dataframes with a common key column pandas

I have two dataframes with one having
Title Name Quantity ID
as the columns
and the 2nd dataframe has
ID Quantity
as the columns, with fewer rows than the first dataframe.
I need to find the difference between the Quantity of both dataframes, based on the match in the ID columns, and I want to store this difference in a separate column in the first dataframe.
I tried this (didn't work):
DF1[['ID','Quantity']].reset_index(drop=True).apply(lambda id_qty_tup : DF2[DF2.ID==asin_qty_tup[0]].quantity - id_qty_tup[1] , axis = 1)
Another approach is to apply over the ID and Quantity of DF1 and iterate through each row of DF2, but it takes more time. I'm sure there is a better way.
You can perform index-aligned subtraction, and pandas takes care of the rest.
df['Diff'] = df.set_index('ID').Quantity.sub(df2.set_index('ID').Quantity).values
Demo
Here, changetype is the index, and I've already set it, so pd.Series.sub will align subtraction by default. Otherwise, you'd need to set the index as above.
df1
strings test
changetype
0 a very -1.250150
1 very boring text -1.376637
2 I cannot read it -1.011108
3 Hi everyone -0.527900
4 please go home -1.010845
5 or I will go 0.008159
6 now -0.470354
df2
strings test
changetype
0 a very very boring text 0.625465
1 I cannot read it -1.487183
2 Hi everyone 0.292866
3 please go home or I will go now 1.430081
df1.test.sub(df2.test)
changetype
0 -1.875614
1 0.110546
2 -1.303974
3 -1.957981
4 NaN
5 NaN
6 NaN
Name: test, dtype: float64
You can use map in this case:
df['diff'] = df['ID'].map(df2.set_index('ID').Quantity) - df.Quantity
Some Data
import pandas as pd
df = pd.DataFrame({'Title': ['A', 'B', 'C', 'D', 'E'],
'Name': ['AA', 'BB', 'CC', 'DD', 'EE'],
'Quantity': [1, 21, 14, 15, 611],
'ID': ['A1', 'A1', 'B2', 'B2', 'C1']})
df2 = pd.DataFrame({'Quantity': [11, 51, 44],
'ID': ['A1', 'B2', 'C1']})
We will use df2 to create a mapping from ID to Quantity. So anywhere there is ID == A1 in df, it gets assigned the Quantity 11, B2 gets assigned 51, and C1 gets assigned 44. Here, I'll add it as another column just for illustration purposes.
df['Quantity2'] = df['ID'].map(df2.set_index('ID').Quantity)
print(df)
ID Name Quantity Title Quantity2
0 A1 AA 1 A 11
1 A1 BB 21 B 11
2 B2 CC 14 C 51
3 B2 DD 15 D 51
4 C1 EE 611 E 44
Then you can just subtract df['Quantity'] from the column we just created to get the difference (or subtract that from df['Quantity'] if you want the difference the other way round).
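For completeness, that last step in code, using the illustrative Quantity2 column from above (or the map expression directly):
df['Diff'] = df['Quantity2'] - df['Quantity']
# equivalently, without keeping the helper column:
df['Diff'] = df['ID'].map(df2.set_index('ID').Quantity) - df['Quantity']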

Calculating the minimum distance between two DataFrames

I would like to find the item of DF2 that is closest to each item in DF1.
The distance is the Euclidean distance.
For example, for A in DF1, F in DF2 is the closest one.
>>> DF1
X Y name
0 1 2 A
1 3 4 B
2 5 6 C
3 7 8 D
>>> DF2
X Y name
0 3 8 E
1 2 4 F
2 1 9 G
3 6 4 H
My code is
DF1 = pd.DataFrame({'name' : ['A', 'B', 'C', 'D'],'X' : [1,3,5,7],'Y' : [2,4,6,8]})
DF2 = pd.DataFrame({'name' : ['E', 'F', 'G', 'H'],'X' : [3,2,1,6],'Y' : [8,4,9,4]})
def ndis(row):
    try:
        X, Y = row['X'], row['Y']
        DF2['DIS'] = (DF2.X - X) * (DF2.X - X) + (DF2.Y - Y) * (DF2.Y - Y)
        temp = DF2.loc[DF2.DIS.idxmin()]  # .ix is deprecated; .loc works here
        return temp['name']               # name of the closest DF2 item
    except Exception:
        pass
DF1['Z']=DF1.apply(ndis, axis=1)
This works fine, but it will take too long for a large data set.
Another question is how to find the 2nd and 3rd closest ones.
There is more than one approach; for example, one can use numpy:
>>> xy = ['X', 'Y']
>>> distance_array = numpy.sum((df1[xy].values - df2[xy].values)**2, axis=1)
>>> distance_array.argmin()
1
Top 3 closest (not the fastest approach, I suppose, but simplest)
>>> distance_array.argsort()[:3]
array([1, 3, 2])
If speed is a concern, run performance tests.
Look at scipy.spatial.KDTree and the related cKDTree, which is faster but offers only a subset of the functionality. For large sets, you probably won't beat that for speed.
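A rough sketch of that KDTree idea, assuming SciPy is available; query(..., k=3) returns, for every DF1 row, the indices of the three nearest DF2 rows at once:
from scipy.spatial import cKDTree

tree = cKDTree(DF2[['X', 'Y']].values)                 # build the tree on DF2's coordinates
dist, idx = tree.query(DF1[['X', 'Y']].values, k=3)    # 3 nearest DF2 rows per DF1 row

names = DF2['name'].values[idx]                        # shape (len(DF1), 3)
DF1['closest'], DF1['second'], DF1['third'] = names[:, 0], names[:, 1], names[:, 2]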
