Pass / Fail Dataframe example - python

I have the following code:
import pandas as pd

status = ['Pass', 'Fail']
item_info = pd.DataFrame({
    'student': ['John', 'Alice', 'Pete', 'Mike', 'John', 'Alice', 'Joseph'],
    'test': ['Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass']
})
item_status = pd.crosstab(item_info['student'], item_info['test'])
print(item_status)
Which produces:
| Student | Pass |
|---------|------|
| Alice | 2 |
| John | 2 |
| Joseph | 1 |
| Mike | 1 |
| Pete | 1 |
However, I want to create something that looks like this:
| Student | Pass | Fail | Total |
|---------|------|------|-------|
| Alice | 2 | 0 | 2 |
| John | 2 | 0 | 2 |
| Joseph | 1 | 0 | 1 |
| Mike | 1 | 0 | 1 |
| Pete | 1 | 0 | 1 |
How do I change the code so that it includes a Fail column with 0 for all of the students and provides a total?

A generic solution that adds the extra label without knowing the existing labels in advance, using reindex:
cols = item_info['test'].unique().tolist() + ['Fail']  # add the extra label
pd.crosstab(item_info['student'], item_info['test']).reindex(columns=cols, fill_value=0)
Or, depending on what you want, you can simply assign the new column directly:
item_status = pd.crosstab(item_info['student'],item_info['test'])
item_status['Fail'] = 0
test Pass Fail
student
Alice 2 0
John 2 0
Joseph 1 0
Mike 1 0
Pete 1 0
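The question also asked for a Total column, which neither snippet above produces. A minimal sketch using the same sample data, summing the two count columns per row:

```python
import pandas as pd

item_info = pd.DataFrame({
    'student': ['John', 'Alice', 'Pete', 'Mike', 'John', 'Alice', 'Joseph'],
    'test': ['Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass'],
})

# Build the crosstab, force both labels to exist, then total each row
item_status = (pd.crosstab(item_info['student'], item_info['test'])
                 .reindex(columns=['Pass', 'Fail'], fill_value=0))
item_status['Total'] = item_status['Pass'] + item_status['Fail']
print(item_status)
```

Alternatively, `pd.crosstab(..., margins=True, margins_name='Total')` adds totals, though it appends a total row as well as a total column.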

Related

How to split a column into several columns by taking the string values as column headers?

This is my dataset:
| Name | Dept | Project area/areas interested |
| -------- | -------- |-----------------------------------|
| Joe | Biotech | Cell culture//Bioinfo//Immunology |
| Ann | Biotech | Cell culture |
| Ben | Math | Trigonometry//Algebra |
| Keren | Biotech | Microbio |
| Alice | Physics | Optics |
This is how I want my result:
| Name | Dept |Cell culture|Bioinfo|Immunology|Trigonometry|Algebra|Microbio|Optics|
| -------- | -------- |------------|-------|----------|------------|-------|--------|------|
| Joe | Biotech | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Ann      | Biotech  | 1          | 0     | 0        | 0          | 0     | 0      | 0    |
| Ben | Math | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Keren | Biotech | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Alice | Physics | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Not only do I have to split the last column into separate columns based on its values; I also have to split entries that are separated by "//". The values in the new columns have to be 1 or 0 (int).
I've been stuck on this for a while now (-_-;)
You can use pandas.concat in combination with Series.str.get_dummies like this:
pd.concat([df[["Name", "Dept"]], df["Project area/areas interested"].str.get_dummies(sep='//')], axis=1)
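A self-contained sketch of that one-liner, reconstructing the sample data with the column names from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Joe', 'Ann', 'Ben', 'Keren', 'Alice'],
    'Dept': ['Biotech', 'Biotech', 'Math', 'Biotech', 'Physics'],
    'Project area/areas interested': [
        'Cell culture//Bioinfo//Immunology',
        'Cell culture',
        'Trigonometry//Algebra',
        'Microbio',
        'Optics',
    ],
})

# str.get_dummies splits each cell on '//' and one-hot encodes every distinct area
dummies = df['Project area/areas interested'].str.get_dummies(sep='//')
result = pd.concat([df[['Name', 'Dept']], dummies], axis=1)
print(result)
```

Note that get_dummies sorts the new columns alphabetically rather than in order of first appearance.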

Is it possible to one-hot encode based on a list of values being an element in a column?

Here's the dataframe that I'm working with:
| Name | Vehicle Types Owned |
| -------- | -------------- |
| Bob | [Car, Bike] |
| Billy | [Car, Truck, Train] |
| Rob | [Plane, Train, Boat] |
| Sally | [Bike, Boat] |
I am looking for something like this
| Name | Car | Bike | Truck | Train | Plane | Boat |
| ---- | ----| --- | --- | --- | --- | --- |
| Bob | 1 | 1 | 0 | 0 | 0 | 0 |
| Billy| 1 | 0 | 1 | 1 | 0 | 0 |
| Rob | 0 | 0 | 0 | 1 | 1 | 1 |
| Sally| 0 | 1 | 0 | 0 | 0 | 1 |
The original dataframe looked like this, in case that'd be more useful to work with:
| Name | Vehicle Type Owned | Num Wheels | Top Speed |
| -------- | -------------- | --------- | --------- |
| Bob | Car | 4 | 200 mph |
| Bob | Bike | 2 | 20 mph |
| Billy | Car | 4 | 220 mph |
| Billy | Truck | 8 | 100 mph |
| Billy | Train | 80 | 86 mph |
| Rob | Plane | 3 | 600 mph |
| Rob | Train | 80 | 98 mph |
| Rob | Boat | 3 | 128 mph |
| Sally | Bike | 2 | 34 mph |
| Sally | Boat | 3 | 78 mph |
I'm using pandas.
Try explode, then crosstab:
df = df.explode('Vehicle Types Owned')
df_ = pd.crosstab([df['Name']], df['Vehicle Types Owned']).reset_index().rename_axis(None, axis=1)
print(df_)
Name Bike Boat Car Plane Train Truck
0 Billy 0 0 1 0 1 1
1 Bob 1 0 1 0 0 0
2 Rob 0 1 0 1 1 0
3 Sally 1 1 0 0 0 0
From the original dataframe, you can use pivot_table:
df.assign(count=1).pivot_table(index='Name', columns='Vehicle Type Owned', values='count', fill_value=0)
Result:
Vehicle Type Owned Bike Boat Car Plane Train Truck
Name
Billy 0 0 1 0 1 1
Bob 1 0 1 0 0 0
Rob 0 1 0 1 1 0
Sally 1 1 0 0 0 0
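Since the long-format original was also posted, note that crosstab works on it directly with no explode step. A sketch reconstructing just the two relevant columns of that data:

```python
import pandas as pd

# Long format: one row per (Name, vehicle) pair, as in the original data
df = pd.DataFrame({
    'Name': ['Bob', 'Bob', 'Billy', 'Billy', 'Billy',
             'Rob', 'Rob', 'Rob', 'Sally', 'Sally'],
    'Vehicle Type Owned': ['Car', 'Bike', 'Car', 'Truck', 'Train',
                           'Plane', 'Train', 'Boat', 'Bike', 'Boat'],
})

# crosstab counts (Name, vehicle) pairs directly on the long data
out = (pd.crosstab(df['Name'], df['Vehicle Type Owned'])
         .reset_index()
         .rename_axis(None, axis=1))
print(out)
```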

Pandas: group and custom transform dataframe long to wide

I have a dataframe in following form:
+---------+-------+-------+---------+---------+
| payment | type | err | country | source |
+---------+-------+-------+---------+---------+
| visa | type1 | OK | AR | source1 |
| paypal | type1 | OK | DE | source1 |
| mc | type2 | ERROR | AU | source2 |
| visa | type3 | OK | US | source2 |
| visa | type2 | OK | FR | source3 |
| visa | type1 | OK | FR | source2 |
+---------+-------+-------+---------+---------+
df = pd.DataFrame({'payment':['visa','paypal','mc','visa','visa','visa'],
'type':['type1','type1','type2','type3','type2','type1'],
'err':['OK','OK','ERROR','OK','OK','OK'],
'country':['AR','DE','AU','US','FR','FR'],
'source':['source1','source1','source2','source2','source3','source2'],
})
My goal is to transform it so that I have group by payment and country, but create new columns:
number_payments - just count for groupby,
num_errors - number of ERROR values for group,
num_type1.. num_type3 - number of corresponding values in column type (only 3 possible values),
num_source1.. num_source3 - number of corresponding values in column source (only 3 possible values).
Like this:
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | num_type3 | num_source1 | num_source2 | num_source3 |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| visa | AR | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| visa | US | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| visa | FR | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| mc | AU | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
I tried to combine pandas groupby and pivot, but couldn't make it all work, and it got ugly. I'm pretty sure there are good, fast methods for this.
Appreciate any help.
You can use get_dummies, then select the two grouper columns to create the groups, and join the group size with the sum:
c = df['err'].eq("ERROR")
g = (df[['payment', 'country']]
     .assign(num_errors=c, **pd.get_dummies(df[['type', 'source']], prefix=['num', 'num']))
     .groupby(['payment', 'country']))
out = g.size().to_frame("number_payments").join(g.sum()).reset_index()
print(out)
print(out)
payment country number_payments num_errors num_type1 num_type2 \
0 mc AU 1 1 0 1
1 paypal DE 1 0 1 0
2 visa AR 1 0 1 0
3 visa FR 2 0 1 1
4 visa US 1 0 0 0
num_type3 num_source1 num_source2 num_source3
0 0 0 1 0
1 0 1 0 0
2 0 1 0 0
3 0 0 1 1
4 1 0 1 0
First, it is better to clean the data for your stated purposes:
df['err_bool'] = (df['err'] == 'ERROR').astype(int)
Then we use groupby with named aggregations for the count and error columns:
df_grouped = df.groupby(['country', 'payment']).agg(
    number_payments=('err_bool', 'count'),
    num_errors=('err_bool', 'sum'))
Then we can use pivot_table (which, unlike pivot, accepts an aggfunc) for type and source:
df['dummy'] = 1
df_type = df.pivot_table(
    index=['country', 'payment'],
    columns='type',
    values='dummy',
    aggfunc='sum',
    fill_value=0
)
df_source = df.pivot_table(
    index=['country', 'payment'],
    columns='source',
    values='dummy',
    aggfunc='sum',
    fill_value=0
)
Then we join everything together:
df_grouped = df_grouped.join(df_type).join(df_source)
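The groupby-plus-pivot approach can be assembled into a self-contained sketch using the question's sample data. Note it uses pivot_table rather than pivot, since only pivot_table accepts an aggfunc, and the num_ prefix from the desired output is applied with add_prefix:

```python
import pandas as pd

df = pd.DataFrame({
    'payment': ['visa', 'paypal', 'mc', 'visa', 'visa', 'visa'],
    'type': ['type1', 'type1', 'type2', 'type3', 'type2', 'type1'],
    'err': ['OK', 'OK', 'ERROR', 'OK', 'OK', 'OK'],
    'country': ['AR', 'DE', 'AU', 'US', 'FR', 'FR'],
    'source': ['source1', 'source1', 'source2', 'source2', 'source3', 'source2'],
})

df['err_bool'] = (df['err'] == 'ERROR').astype(int)
df['dummy'] = 1

# Row count and error total per (country, payment) group
df_grouped = df.groupby(['country', 'payment']).agg(
    number_payments=('dummy', 'count'),
    num_errors=('err_bool', 'sum'),
)

# One pivot_table per categorical column; aggfunc='sum' counts occurrences
df_type = df.pivot_table(index=['country', 'payment'], columns='type',
                         values='dummy', aggfunc='sum',
                         fill_value=0).add_prefix('num_')
df_source = df.pivot_table(index=['country', 'payment'], columns='source',
                           values='dummy', aggfunc='sum',
                           fill_value=0).add_prefix('num_')

result = df_grouped.join(df_type).join(df_source).reset_index()
print(result)
```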

Joining two dataframes based on the columns of one of them and the row of another

Sorry if the title doesn't make sense; I wasn't sure how else to explain it. Here's an example of what I'm talking about.
df_1
| ID | F\_Name | L\_Name |
|----|---------|---------|
| 0 | | |
| 1 | | |
| 2 | | |
| 3 | | |
df_2
| ID | Name\_Type | Name |
|----|------------|--------|
| 0 | First | Bob |
| 0 | Last | Smith |
| 1 | First | Maria |
| 1 | Last | Garcia |
| 2 | First | Bob |
| 2 | Last | Stoops |
| 3 | First | Joe |
df_3 (result)
| ID | F\_Name | L\_Name |
|----|---------|---------|
| 0 | Bob | Smith |
| 1 | Maria | Garcia |
| 2 | Bob | Stoops |
| 3 | Joe | |
Any and all advice is welcome! Thank you
I guess that what you want to do is to reshape your second DataFrame to have the same structure as the first one, right?
You can use the pivot method to achieve it:
df_3 = df_2.pivot(index="ID", columns="Name_Type", values="Name")
Then, you can rename the columns and drop the columns-axis name:
df_3 = df_3.rename(columns={"First": "F_Name", "Last": "L_Name"})
df_3.columns.name = None
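A self-contained sketch of the whole reshape, assuming ID is a regular column of df_2 as shown in the question (Joe has no Last row, so his L_Name comes out as NaN):

```python
import pandas as pd

df_2 = pd.DataFrame({
    'ID': [0, 0, 1, 1, 2, 2, 3],
    'Name_Type': ['First', 'Last', 'First', 'Last', 'First', 'Last', 'First'],
    'Name': ['Bob', 'Smith', 'Maria', 'Garcia', 'Bob', 'Stoops', 'Joe'],
})

# Pivot the Name_Type values into columns, then rename to match df_1
df_3 = (df_2.pivot(index='ID', columns='Name_Type', values='Name')
            .rename(columns={'First': 'F_Name', 'Last': 'L_Name'})
            .rename_axis(None, axis=1)
            .reset_index())
print(df_3)
```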

Change Duplicate values index

I have a dataset like
+---------------------------+
| | Name | Id |
| ------------------------- |
| 0 | nick | 1 |
| 1 | john | 2 |
| 2 | mick | 3 |
| 3 | nick | 4 |
| 4 | mick | 5 |
| 5 | nick | 6 |
+---------------------------+
And I want to reset the Id Index like
index | Name | Id
-------------------------
0 | nick | 1
1 | john | 2
2 | mick | 3
3 | nick | 1
4 | mick | 3
5 | nick | 1
Use factorize on the Name column:
df['Id'] = pd.factorize(df['Name'])[0] + 1
print (df)
Name Id
0 nick 1
1 john 2
2 mick 3
3 nick 1
4 mick 3
5 nick 1
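An equivalent sketch using groupby().ngroup(), which also numbers groups by order of first appearance when sort=False is passed:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['nick', 'john', 'mick', 'nick', 'mick', 'nick'],
                   'Id': [1, 2, 3, 4, 5, 6]})

# ngroup() assigns 0-based group numbers; sort=False keeps first-appearance order
df['Id'] = df.groupby('Name', sort=False).ngroup() + 1
print(df)
```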
