How to suppress a pandas dataframe?

I have this data frame:
                      age     Income   Job yrs
Churn Own Home
0     0         39.497576  42.540247  7.293301
      1         42.667392  58.975215  8.346974
1     0         44.499774  45.054619  7.806146
      1         47.615546  60.187945  8.525210
produced by this line of code:
gb = df3.groupby(['Churn', 'Own Home'])[['age', 'Income', 'Job yrs']].mean()
I want to "suppress" or unstack this data frame so that it looks like this:
   Churn  Own Home    age  Income  Job yrs
0      0         0  39.49   42.54     7.29
1      0         1  42.66   58.97     8.34
2      1         0  44.49   45.05     7.80
3      1         1  47.61   60.18     8.52
I have tried using both .stack() and .unstack() with no luck, also I was not able to find anything online talking about this. Any help is greatly appreciated.

Your DataFrame has a MultiIndex, which you can revert to a single index using:
gb.reset_index(level=[0, 1])
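A minimal sketch with made-up numbers (df3 here is a stand-in for the original data; plain gb.reset_index() also works, since the index has only those two levels):
import pandas as pd

# stand-in data: the values are illustrative, not the original df3
df3 = pd.DataFrame({
    'Churn':    [0, 0, 0, 1, 1, 1],
    'Own Home': [0, 1, 1, 0, 1, 1],
    'age':      [39.5, 42.6, 42.8, 44.5, 47.5, 47.7],
    'Income':   [42.5, 58.9, 59.1, 45.0, 60.1, 60.3],
    'Job yrs':  [7.3, 8.3, 8.4, 7.8, 8.5, 8.6],
})

gb = df3.groupby(['Churn', 'Own Home'])[['age', 'Income', 'Job yrs']].mean()
flat = gb.reset_index(level=[0, 1])  # 'Churn' and 'Own Home' become columns again
print(flat)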

Related

Pandas Multilevel Dataframe stack columns next to each other

I have a Dataframe in the following format:
id employee date week result
1234565 Max 2022-07-04 27 Project 1
27.1 Customer 1
27.2 100%
27.3 Work
1245513 Susanne 2022-07-04 27 Project 2
27.1 Customer 2
27.2 100%
27.3 In progress
What I want to achieve is the following format:
id employee date week result customer availability status
1234565 Max 2022-07-04 27 Project 1 Customer 1 100% Work
1245513 Susanne 2022-07-04 27 Project 2 Customer 2 100% In progress
The id, employee, date and week columns are the index, so I have a multilevel index.
I have tried several things but nothing really brings the expected result...
So basically I want to unpivot the result.
You can do this (you need pandas >= 1.3.0, since exploding several columns at once was added in 1.3):
cols = ['result', 'customer', 'availability', 'status']
# target order: the non-week index levels, the four value columns, then week
new_cols = df.index.droplevel('week').names + cols + ['week']
# collect each group's values into lists so they survive the reshape
df = df.groupby(df.index.names).agg(list)
# keep the whole-number week (the first sub-row) for each id/employee/date
weeks = df.reset_index('week').groupby(df.index.droplevel('week').names)['week'].first()
# unstack moves the week level into the columns (one column per sub-row);
# the week number is then re-attached as a regular column
df = df.unstack().droplevel('week', axis=1).assign(week=weeks).reset_index()
df.columns = new_cols
# each cell still holds a one-element list; explode unwraps them
df = df.explode(cols)
print(df):
id employee date result customer availability \
0 1234565 Max 2022-07-04 Project 1 Customer 1 100%
1 1245513 Susanne 2022-07-04 Project 2 Customer 2 100%
status week
0 Work 27.0
1 In progress 27.0

Counting String Values in Pivot Across Multiple Columns

I'd like to use Pandas to pivot a table into multiple columns, and get the count of their values.
In this example table:
LOCATION   ADDRESS    PARKING TYPE
AAA0001    123 MAIN   LARGE LOT
AAA0001    123 MAIN   SMALL LOT
AAA0002    456 TOWN   LARGE LOT
AAA0003    789 AVE    MEDIUM LOT
AAA0003    789 AVE    MEDIUM LOT
How do I pivot out this table to show total counts of each string within "Parking Type"? Maybe my mistake is calling this a "pivot?"
Desired output:
LOCATION   ADDRESS    SMALL LOT   MEDIUM LOT   LARGE LOT
AAA0001    123 MAIN   1           0            1
AAA0002    456 TOWN   0           0            1
AAA0003    789 AVE    0           2            0
Currently, I have a pivot going, but it is only counting the values of the first column, and leaving everything else as 0s. Any guidance would be amazing.
Current Code:
pivot = pd.pivot_table(df, index=["LOCATION"], columns=['PARKING TYPE'], aggfunc=len)
pivot = pivot.reset_index()
pivot.columns = pivot.columns.to_series().apply(lambda x: "".join(x))
You could use pd.crosstab:
out = (pd.crosstab(index=[df['LOCATION'], df['ADDRESS']], columns=df['PARKING TYPE'])
.reset_index()
.rename_axis(columns=[None]))
or you could use pivot_table (but you have to pass "ADDRESS" into the index as well; counting with aggfunc='size' sidesteps having to name a values column):
out = (pd.pivot_table(df, index=['LOCATION', 'ADDRESS'], columns=['PARKING TYPE'], aggfunc='size', fill_value=0)
       .reset_index()
       .rename_axis(columns=[None]))
Output:
LOCATION ADDRESS LARGE LOT MEDIUM LOT SMALL LOT
0 AAA0001 123 MAIN 1 0 1
1 AAA0002 456 TOWN 1 0 0
2 AAA0003 789 AVE 0 2 0
You can use get_dummies() and then a grouped sum to get one row per group:
>>> pd.get_dummies(df, columns=['PARKING TYPE']).groupby(['LOCATION','ADDRESS'],as_index=False).sum()
LOCATION ADDRESS PARKING TYPE_LARGE LOT PARKING TYPE_MEDIUM LOT PARKING TYPE_SMALL LOT
0 AAA0001 123 MAIN 1 0 1
1 AAA0002 456 TOWN 1 0 0
2 AAA0003 789 AVE 0 2 0
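If you want the bare parking-type names back, an optional cleanup (my addition, not part of the original answer; str.removeprefix needs Python 3.9+):
out = pd.get_dummies(df, columns=['PARKING TYPE']).groupby(['LOCATION', 'ADDRESS'], as_index=False).sum()
out.columns = [c.removeprefix('PARKING TYPE_') for c in out.columns]  # 'PARKING TYPE_LARGE LOT' -> 'LARGE LOT'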

How to aggregate days by year-month and pivot so that count becomes count_source summed over the month with Python

I am manipulating some data in Python and was wondering if anyone can help.
I have data that looks like this:
count source timestamp tokens
0 1 alt-right-census 2006-03-21 setting
1 1 alt-right-census 2006-03-21 twttr
2 1 stormfront 2006-06-24 head
3 1 stormfront 2006-10-07 five
and I need data that looks like this:
count_stormfront count_alt-right-census month token
2 1 2006-01 setting
or like this:
date token alt_count storm_count
4069995 2016-09 zealand 0 0
4069996 2016-09 zero 11 8
4069997 2016-09 zika 295 160
How can I aggregate days by year-month and pivot so that count becomes count_source summed over the month?
Any help would be appreciated. Thanks!
df.groupby(['source', df['timestamp'].str[:7]]).size().unstack()
Result:
timestamp 2006-03 2006-06 2006-10
source
alt-right-census 2.0 NaN NaN
stormfront NaN 1.0 1.0
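Note that .size() counts rows rather than summing the count column. A sketch closer to the second desired layout (my reading of the question, assuming df holds the sample data above):
out = (df.assign(month=df['timestamp'].str[:7])
         .pivot_table(index=['month', 'tokens'], columns='source',
                      values='count', aggfunc='sum', fill_value=0)
         .rename(columns={'alt-right-census': 'alt_count', 'stormfront': 'storm_count'})
         .rename_axis(columns=[None])
         .reset_index())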

How to Solve a Data Science Question Using Python's Pandas Data Structure Syntax

Good afternoon.
I am trying to solve this question using "pandas" statistical data structures and related syntax from the Python scripting language. I graduated from a US university and am employed, and I am currently taking the "Python for Data Science" course, offered online on Coursera's platform by the University of Michigan, for professional development. I'm not sharing answers with anyone, as I abide by Coursera's Honor Code.
First, I was given this pandas dataframe concerning Olympic medals won by countries around the world:
# Summer Gold Silver Bronze Total # Winter Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
Afghanistan 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2 AFG
Algeria 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15 ALG
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70 ARG
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12 ARM
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12 ANZ
Second, the question asked is, "Which country has won the most gold medals in summer games?"
Third, a hint given to me on how to answer using pandas syntax is this:
"This function should return a single string value."
Fourth, I tried entering this as the answer:
import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

def answer_one():
    if df.columns[:2] == '00':
        df.rename(columns={col: 'Country' + col[4:]}, inplace=True)
    df_max = df[df[max('Gold')]]
    return df_max['Country']

answer_one()
Fifth, I have tried various other answers like this in Coursera's auto-grader, but it keeps giving this error message:
There was a problem evaluating function answer_one, it threw an exception was thus counted as incorrect.
0.125 points were not awarded.
Could you please help me solve that question? Any hints/suggestions/comments are welcome for that.
Thanks, Kevin
You can use pandas' loc function to find the country name corresponding to the maximum of the "Gold" column:
import pandas as pd

data = [('Afghanistan', 13),
        ('Algeria', 12),
        ('Argentina', 23)]
df = pd.DataFrame(data, columns=['Country', 'Gold'])
df['Country'].loc[df['Gold'] == df['Gold'].max()]
The last line returns Argentina as the answer.
Edit 1:
I just noticed you import the .csv file using pd.read_csv('olympics.csv', index_col=0, skiprows=1). If you leave out the skiprows argument, you get a dataframe where the first line of the .csv file corresponds to the column names. This makes handling your dataframe in pandas much easier and is encouraged. Second, I see that with the index_col=0 argument you use the country names as the index. In that case you can use the index instead of the loc function as follows:
df.index[df['Gold'] == df['Gold'].max()][0]
import pandas as pd

def answer_one():
    df1 = df['Gold'].max()
    df1 = df[df['Gold'] == df1]
    return df1.index[0]

answer_one()
The function idxmax() returns the index label of the maximum element in the column (in current pandas, Series.argmax returns the integer position instead, so use idxmax for the label):
return df['Gold'].idxmax()
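Putting it together, a minimal sketch of the whole function (assuming the course's olympics.csv layout, where skiprows=1 leaves the medal counts as columns and the country names as the index):
import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

def answer_one():
    # idxmax() gives the index label (the country name) of the row
    # with the largest value in the summer 'Gold' column
    return df['Gold'].idxmax()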

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer learning Python (+pandas), and I hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. This covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions, like counting records per hour per day and getting averages per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour per day per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
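To check the idea against the toy data above (my own verification sketch, using the question's Dow/Hour values directly instead of deriving them from the dates):
import pandas as pd

df = pd.DataFrame({
    'Date':  ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014']
             + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id':    1234,
    'Dow':   [0] * 9 + [1] * 5,
    'Hour':  [9] * 8 + [10] + [11] * 5,
    'Count': 1,
})

# sum of counts per calendar day first...
per_day = df.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()
# ...then the mean of those daily totals per Id, Dow and Hour
means = per_day.groupby(['Id', 'Dow', 'Hour'])['Count'].mean().reset_index(name='Mean')
print(means)
#      Id  Dow  Hour  Mean
# 0  1234    0     9   4.0
# 1  1234    0    10   1.0
# 2  1234    1    11   2.5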
You can use the groupby function on the 'Id' column and then use the resample function with .sum() (the old how='sum' argument has been removed from resample).
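A sketch of that suggestion (assuming Start_date is the datetime index, as in the question):
hourly_totals = df.groupby('Id').resample('H')['Count'].sum()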
