Counting String Values in Pivot Across Multiple Columns - python

I'd like to use Pandas to pivot a table into multiple columns, and get the count of their values.
In this example table:
LOCATION  ADDRESS   PARKING TYPE
AAA0001   123 MAIN  LARGE LOT
AAA0001   123 MAIN  SMALL LOT
AAA0002   456 TOWN  LARGE LOT
AAA0003   789 AVE   MEDIUM LOT
AAA0003   789 AVE   MEDIUM LOT
How do I pivot out this table to show total counts of each string within "PARKING TYPE"? Maybe my mistake is calling this a "pivot"?
Desired output:
LOCATION  ADDRESS   SMALL LOT  MEDIUM LOT  LARGE LOT
AAA0001   123 MAIN  1          0           1
AAA0002   456 TOWN  0          0           1
AAA0003   789 AVE   0          2           0
Currently, I have a pivot going, but it is only counting the values of the first column, and leaving everything else as 0s. Any guidance would be amazing.
Current Code:
pivot = pd.pivot_table(df, index=["LOCATION"], columns=['PARKING TYPE'], aggfunc=len)
pivot = pivot.reset_index()
pivot.columns = pivot.columns.to_series().apply(lambda x: "".join(x))
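For reference, the answers below assume a DataFrame like the following (a minimal sketch built from the sample table above):
import pandas as pd

df = pd.DataFrame({
    'LOCATION': ['AAA0001', 'AAA0001', 'AAA0002', 'AAA0003', 'AAA0003'],
    'ADDRESS': ['123 MAIN', '123 MAIN', '456 TOWN', '789 AVE', '789 AVE'],
    'PARKING TYPE': ['LARGE LOT', 'SMALL LOT', 'LARGE LOT', 'MEDIUM LOT', 'MEDIUM LOT'],
})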

You could use pd.crosstab:
out = (pd.crosstab(index=[df['LOCATION'], df['ADDRESS']], columns=df['PARKING TYPE'])
         .reset_index()
         .rename_axis(columns=[None]))
or you could use pivot_table (but you have to pass "ADDRESS" into the index as well):
out = (pd.pivot_table(df, index=['LOCATION', 'ADDRESS'], columns=['PARKING TYPE'],
                      values='ADDRESS', aggfunc=len, fill_value=0)
         .reset_index()
         .rename_axis(columns=[None]))
Output:
LOCATION ADDRESS LARGE LOT MEDIUM LOT SMALL LOT
0 AAA0001 123 MAIN 1 0 1
1 AAA0002 456 TOWN 1 0 0
2 AAA0003 789 AVE 0 2 0

You can use get_dummies() and then a grouped sum to get a row per your groups:
>>> pd.get_dummies(df, columns=['PARKING TYPE']).groupby(['LOCATION','ADDRESS'],as_index=False).sum()
LOCATION ADDRESS PARKING TYPE_LARGE LOT PARKING TYPE_MEDIUM LOT PARKING TYPE_SMALL LOT
0 AAA0001 123 MAIN 1 0 1
1 AAA0002 456 TOWN 1 0 0
2 AAA0003 789 AVE 0 2 0
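If the bare lot names are preferred as headers, the prefix that get_dummies adds can be stripped afterwards; a small sketch (not part of the original answer):
out = (pd.get_dummies(df, columns=['PARKING TYPE'])
         .groupby(['LOCATION', 'ADDRESS'], as_index=False).sum()
         .rename(columns=lambda c: c.replace('PARKING TYPE_', '')))  # drop the dummy-column prefix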


Aggregating in pandas with two different identification columns

I am trying to aggregate a dataset of purchases; I have shortened the example in this post to keep it simple. The purchases are distinguished by two different columns used to identify the customer and the transaction: the Reference refers to the same transaction, while the ID refers to the type of transaction.
I want to sum these records based on ID, while keeping the Reference in mind and not double-counting the Size. The example I provide clears it up.
What I tried so far is:
df_new = df.groupby(by = ['id'], as_index=False).agg(aggregate)
df_new = df.groupby(by = ['id','ref'], as_index=False).agg(aggregate)
Let me know if you have any idea what I can do in pandas, or otherwise in Python.
This is basically what I have,
Name  Reference  Side  Size  ID
Alex  0          BUY   2400  0
Alex  0          BUY   2400  0
Alex  0          BUY   2400  0
Alex  1          BUY   3000  0
Alex  1          BUY   3000  0
Alex  1          BUY   3000  0
Alex  2          SELL  4500  1
Alex  2          SELL  4500  1
Sam   3          BUY   1500  2
Sam   3          BUY   1500  2
Sam   3          BUY   1500  2
What I am trying to achieve is the following,
Name  Side  Size  ID
Alex  BUY   5400  0
Alex  SELL  4500  1
Sam   BUY   1500  2
P.S. The records are not duplicates of each other; what I provide is a simplified version, but in reality 'Name' stands in for about 20 more columns identifying each row.
P.P.S. My solution was to first aggregate by Reference and then by ID.
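For what it's worth, a minimal sketch of that two-step idea (collapse each Name/Reference transaction to one row first, then sum per ID); the names per_ref and result are just illustrative:
# one row per unique transaction (Reference), keeping its Size once
per_ref = df.groupby(['Name', 'Reference', 'Side', 'ID'], as_index=False)['Size'].first()
# then sum the de-duplicated sizes per ID
result = per_ref.groupby(['Name', 'Side', 'ID'], as_index=False)['Size'].sum()
result = result[['Name', 'Side', 'Size', 'ID']]  # match the column order of the desired output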
Use drop_duplicates, groupby, and agg:
new_df = df.drop_duplicates().groupby(['Name', 'Side']).agg({'Size': 'sum', 'ID': 'first'}).reset_index()
Output:
>>> new_df
Name Side Size ID
0 Alex BUY 5400 0
1 Alex SELL 4500 1
2 Sam BUY 1500 2
Edit: richardec's solution is better as this will also sum the ID column.
This double groupby should achieve the output you want, as long as names are unique.
df.groupby(['Name', 'Reference']).max().groupby(['Name', 'Side']).sum()
Explanation: First we group by Name and Reference to get the following dataframe. The ".max()" could just as well be ".min()" or ".mean()" as it seems your data will have the same size per unique transaction:
Name  Reference  Side  Size  ID
Alex  0          BUY   2400  0
      1          BUY   3000  0
      2          SELL  4500  1
Sam   3          BUY   1500  2
Then we group this data by Name and Side with a ".sum()" operation to get the final result.
Name  Side  Size  ID
Alex  BUY   5400  0
      SELL  4500  1
Sam   BUY   1500  2
Just drop duplicates first and then group by the identifying columns and sum.
Something like this should do (not tested); I always like to reset the index afterwards, i.e.:
df.drop_duplicates().groupby(["Name","Side","ID"]).sum()["Size"].reset_index()
or
# stops the double counts
df_dropped = df.drop_duplicates()
# groups by all the fields in your example
df_grouped = df_dropped.groupby(["Name","Side","ID"]).sum()["Size"]
# resets the 3 indexes created with above
df_reset = df_grouped.reset_index()

Pandas: How to match / filter same key / id values (duplicates) from 2 different dataframes and replace values?

I have 2 dataframes of different sizes. The first dataframe (df1) has 4 columns, two of which have the same names as the columns in the second dataframe (df2), which consists of only 2 columns. The columns in common are ['ID'] and ['Department'].
I want to check whether any IDs from df2 are in df1. If so, I want to replace the df1['Department'] value with the df2['Department'] value.
For instance, df1 looks something like this:
ID Department Yrs Experience Education
1234 Science 1 Bachelors
2356 Art 3 Bachelors
2456 Math 2 Masters
4657 Science 4 Masters
And df2 looks something like this:
ID Department
1098 P.E.
1234 Technology
2356 History
I want to check if the ID from df2 is in df1 and, if so, update Department. The output should look something like this:
ID Department Yrs Experience Education
1234 **Technology** 1 Bachelors
2356 **History** 3 Bachelors
2456 Math 2 Masters
4657 Science 4 Masters
The expected updates to df1 are in bold
Is there an efficient way to do this?
Thank you for taking the time to read this and help.
You can map df1's ID against a lookup Series formed by setting ID as the index of df2 and taking its Department column.
Then, where an ID has no match in df2, we fill in the original Department values from df1 (so unmatched rows keep their original values):
df1['Department'] = (df1['ID'].map(df2.set_index('ID')['Department'])
                              .fillna(df1['Department']))
Result:
print(df1)
ID Department Yrs Experience Education
0 1234 Technology 1 Bachelors
1 2356 History 3 Bachelors
2 2456 Math 2 Masters
3 4657 Science 4 Masters
Try:
df1["Department"].update(
df1[["ID"]].merge(df2, on="ID", how="left")["Department"]
)
print(df1)
Prints:
ID Department Yrs Experience Education
0 1234 Technology 1 Bachelors
1 2356 History 3 Bachelors
2 2456 Math 2 Masters
3 4657 Science 4 Masters
df_1 = pd.DataFrame(data={'ID': [1234, 2356, 2456, 4657], 'Department': ['Science', 'Art', 'Math', 'Science']})
df_2 = pd.DataFrame(data={'ID': [1234, 2356], 'Department': ['Technology', 'History']})

# Note: the assignment below aligns df_2['Department'] by index label, so it only works
# because the matching rows of df_1 happen to share index labels 0 and 1 with df_2.
df_1.loc[df_1['ID'].isin(df_2['ID']), 'Department'] = df_2['Department']
Output:
ID Department
0 1234 Technology
1 2356 History
2 2456 Math
3 4657 Science

merging on pandas: reduce the set of merging variables when match is not possible

Using Python, I want to merge on multiple variables (A, B, C), but when the combination a-b-c is missing in one dataset, fall back to the finest combination the observation does have (like b-c).
Example:
Suppose I have a dataset (df1) containing a person's characteristics (gender, married, city), and another dataset (df2) with the median income of a person according to their gender, city, and married status (created with a groupby).
I then want to bring that median income into the first dataset (df1), matching on as many characteristics as possible. That is, if an individual's gender-city-married combination has a median income, use that value; if only a city-married median income exists for that individual, use that instead.
Something like this:
df1 = pd.DataFrame({'Male':['0', '0', '1','1'],'Married':['0', '1', '0','1'], 'City': ['NY', 'NY', 'NY', 'NY']})
Male Married City
0 0 NY
0 1 NY
1 0 NY
1 1 NY
df2 = pd.DataFrame({'Male':['0', '0', '1'],'Married':['0', '1', '1'], 'City': ['NY', 'NY','NY'], 'income':['300','400', '500']})
Male Married City income
0 0 NY 300
0 1 NY 400
1 1 NY 500
and the desired outcome:
desired_df1:
Male Married City income
0 0 NY 300
0 1 NY 400
1 0 NY 300
1 1 NY 400
I was thinking to do a 1st merge by=['male','married','city'], and then fill missing values from a 2nd merge by=['married','city']. But I think there should be a more systematic and simpler way. Any suggestions?
Thanks, and sorry if the formulation is not correct or this is a duplicate (I looked deeply and didn't find anything).
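For reference, a minimal sketch of the two-step merge described above (exact match on all three keys first, then a fallback keyed on ['Married', 'City']). Income is kept numeric here, and the fallback takes the median of df2's medians, which only approximates recomputing the coarser median from the raw data; the names out, fallback, and filler are illustrative:
import pandas as pd

df1 = pd.DataFrame({'Male': ['0', '0', '1', '1'], 'Married': ['0', '1', '0', '1'],
                    'City': ['NY', 'NY', 'NY', 'NY']})
df2 = pd.DataFrame({'Male': ['0', '0', '1'], 'Married': ['0', '1', '1'],
                    'City': ['NY', 'NY', 'NY'], 'income': [300, 400, 500]})

# 1st pass: exact match on the full key
out = df1.merge(df2, on=['Male', 'Married', 'City'], how='left')

# 2nd pass: fill the remaining gaps from the coarser key
fallback = df2.groupby(['Married', 'City'], as_index=False)['income'].median()
filler = out[['Married', 'City']].merge(fallback, on=['Married', 'City'], how='left')['income']
out['income'] = out['income'].fillna(filler)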
You can do a groupby and fillna too after merging:
out = df1.merge(df2, on=['Male', 'Married', 'City'], how='left')
out['income'] = out['income'].fillna(out.groupby(['Married', 'City'])['income']
                                        .fillna(method='ffill'))
print(out)
Male Married City income
0 0 0 NY 300
1 0 1 NY 400
2 1 0 NY 300
3 1 1 NY 500 # <- Note that this should be 500 not 400

How to count Pandas df elements with dynamic condition per row (=countif)

I am trying to do some equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers and the day on which they visited. I want to identify new customers based on 2 logical conditions:
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence newby = 1 - ... to identify new customers).
I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    newby = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as the condition there is static. I would like to avoid introducing "dummy columns", such as by transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates, since there might be additional, varying columns (such as their order).
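For reference, the answers below assume a frame like this (a minimal sketch built from the table above):
import pandas as pd

df = pd.DataFrame({'Day': [3230, 3230, 3231, 3232, 3232],
                   'Guest ID': ['Tom', 'Peter', 'Tom', 'Peter', 'Peter']})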
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
As an alternative, without sorting or merging, you could do:
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1

How to suppress a pandas dataframe?

I have this data frame:
age Income Job yrs
Churn Own Home
0 0 39.497576 42.540247 7.293301
1 42.667392 58.975215 8.346974
1 0 44.499774 45.054619 7.806146
1 47.615546 60.187945 8.525210
Born from this line of code:
gb = df3.groupby(['Churn', 'Own Home'])[['age', 'Income', 'Job yrs']].mean()
I want to "suppress" or unstack this data frame so that it looks like this:
Churn Own Home age Income Job yrs
0 0 0 39.49 42.54 7.29
1 0 1 42.66 58.97 8.34
2 1 0 44.49 45.05 7.80
3 1 1 47.87 60.18 8.52
I have tried using both .stack() and .unstack() with no luck, and I was not able to find anything online about this. Any help is greatly appreciated.
Your DataFrame has a MultiIndex, which you can revert to a single index using the command:
gb.reset_index(level=[0,1])
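For example, assuming the grouped frame is called gb as above, a plain reset_index (all levels) works too; the rounding is only to match the desired display:
out = gb.reset_index()      # same as gb.reset_index(level=[0, 1]) for a 2-level index
print(out.round(2))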
