Change Duplicate values index - python

I have a dataset like
| | Name | Id |
|---|------|----|
| 0 | nick | 1 |
| 1 | john | 2 |
| 2 | mick | 3 |
| 3 | nick | 4 |
| 4 | mick | 5 |
| 5 | nick | 6 |
And I want to reset the Id column so that every repeated name gets the same Id, like
| index | Name | Id |
|-------|------|----|
| 0 | nick | 1 |
| 1 | john | 2 |
| 2 | mick | 3 |
| 3 | nick | 1 |
| 4 | mick | 3 |
| 5 | nick | 1 |

Use factorize on the Name column:
df['Id'] = pd.factorize(df['Name'])[0] + 1
print (df)
Name Id
0 nick 1
1 john 2
2 mick 3
3 nick 1
4 mick 3
5 nick 1
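To make the answer above easier to follow, here is a minimal self-contained sketch that rebuilds the sample data (the DataFrame construction is my assumption of how the question's table was created) and shows the two values `pd.factorize` returns: integer codes in order of first appearance, and the unique labels.

```python
import pandas as pd

# Sample data reconstructed from the question's table (assumed construction)
df = pd.DataFrame({
    'Name': ['nick', 'john', 'mick', 'nick', 'mick', 'nick'],
    'Id': [1, 2, 3, 4, 5, 6],
})

# factorize returns (codes, uniques); codes start at 0 and follow
# the order in which each name first appears
codes, uniques = pd.factorize(df['Name'])

# Shift by one so the Ids start at 1, as in the desired output
df['Id'] = codes + 1
print(df)
print(list(uniques))
```

Because the codes follow first appearance, the result does not depend on the original Id values at all, only on the order of the names.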

Is it possible to one-hot encode based on a list of values being an element in a column?

Here's the dataframe that I'm working with:
| Name | Vehicle Types Owned |
| -------- | -------------- |
| Bob | [Car, Bike] |
| Billy | [Car, Truck, Train] |
| Rob | [Plane, Train, Boat] |
| Sally | [Bike, Boat] |
I am looking for something like this
| Name | Car | Bike | Truck | Train | Plane | Boat |
| ---- | --- | --- | --- | --- | --- | --- |
| Bob | 1 | 1 | 0 | 0 | 0 | 0 |
| Billy| 1 | 0 | 1 | 1 | 0 | 0 |
| Rob | 0 | 0 | 0 | 1 | 1 | 1 |
| Sally| 0 | 1 | 0 | 0 | 0 | 1 |
The original dataframe looked like this, in case that'd be more useful to work with:
| Name | Vehicle Type Owned | Num Wheels | Top Speed |
| -------- | -------------- | --------- | --------- |
| Bob | Car | 4 | 200 mph |
| Bob | Bike | 2 | 20 mph |
| Billy | Car | 4 | 220 mph |
| Billy | Truck | 8 | 100 mph |
| Billy | Train | 80 | 86 mph |
| Rob | Plane | 3 | 600 mph |
| Rob | Train | 80 | 98 mph |
| Rob | Boat | 3 | 128 mph |
| Sally | Bike | 2 | 34 mph |
| Sally | Boat | 3 | 78 mph |
I'm using pandas.
Try explode, then crosstab:
df = df.explode('Vehicle Types Owned')
df_ = pd.crosstab([df['Name']], df['Vehicle Types Owned']).reset_index().rename_axis(None, axis=1)
print(df_)
Name Bike Boat Car Plane Train Truck
0 Billy 0 0 1 0 1 1
1 Bob 1 0 1 0 0 0
2 Rob 0 1 0 1 1 0
3 Sally 1 1 0 0 0 0
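As a possible alternative for the list-valued column, the lists can be joined into delimited strings and passed to `Series.str.get_dummies`. This is only a sketch: the DataFrame construction is assumed from the question's table, and it relies on no vehicle name containing the `|` separator.

```python
import pandas as pd

# List-valued column reconstructed from the question (assumed construction)
df = pd.DataFrame({
    'Name': ['Bob', 'Billy', 'Rob', 'Sally'],
    'Vehicle Types Owned': [
        ['Car', 'Bike'],
        ['Car', 'Truck', 'Train'],
        ['Plane', 'Train', 'Boat'],
        ['Bike', 'Boat'],
    ],
})

# Join each list into "Car|Bike"-style strings, then split them into
# indicator columns ('|' is the default separator of str.get_dummies)
dummies = df['Vehicle Types Owned'].str.join('|').str.get_dummies()

# Put the names back next to the indicator columns
result = pd.concat([df[['Name']], dummies], axis=1)
print(result)
```

Unlike the explode approach, this keeps the original row order (Bob, Billy, Rob, Sally) instead of sorting by name.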
From the original dataframe, you can use pivot_table:
df.assign(count=1).pivot_table(index='Name', columns='Vehicle Type Owned', values='count', fill_value=0)
Result:
Vehicle Type Owned Bike Boat Car Plane Train Truck
Name
Billy 0 0 1 0 1 1
Bob 1 0 1 0 0 0
Rob 0 1 0 1 1 0
Sally 1 1 0 0 0 0
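Starting from the original long-format dataframe, `pd.crosstab` can also be applied directly, with no explode step. The sketch below rebuilds only the two relevant columns (the variable names `long_df` and `one_hot` are mine) and clips the counts at 1 in case a person owns the same vehicle type more than once.

```python
import pandas as pd

# Two columns of the original long-format table (assumed construction)
long_df = pd.DataFrame({
    'Name': ['Bob', 'Bob', 'Billy', 'Billy', 'Billy',
             'Rob', 'Rob', 'Rob', 'Sally', 'Sally'],
    'Vehicle Type Owned': ['Car', 'Bike', 'Car', 'Truck', 'Train',
                           'Plane', 'Train', 'Boat', 'Bike', 'Boat'],
})

# crosstab counts each (Name, vehicle) pair; clip turns counts into 0/1 flags
one_hot = pd.crosstab(long_df['Name'], long_df['Vehicle Type Owned']).clip(upper=1)
print(one_hot)
```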

Pandas Add New Column using Lookup using Multiple Columns from another DataFrame

I have two dataframes.
df1 = pd.DataFrame({
'id':[1,1,1,1,1,1,2,2,2,2,2,2],
'pp':[3,'',2,'',1,0,4, 3, 2, 1, '', 0],
'pc':[6,5,4,3,2,1,6,5,4,3,2,1]
})
| | id | pp | pc |
|---:|-----:|:-----|-----:|
| 0 | 1 | 3 | 6 |
| 1 | 1 | | 5 |
| 2 | 1 | 2 | 4 |
| 3 | 1 | | 3 |
| 4 | 1 | 1 | 2 |
| 5 | 1 | 0 | 1 |
| 6 | 2 | 4 | 6 |
| 7 | 2 | 3 | 5 |
| 8 | 2 | 2 | 4 |
| 9 | 2 | 1 | 3 |
| 10 | 2 | | 2 |
| 11 | 2 | 0 | 1 |
df2 = pd.DataFrame({
'id':[1,1,1,2,2,2],
'pp':['', 3, 4, 1, 2, ''],
'yu':[1,2,3,4,5,6]
})
| | id | pp | yu |
|---:|-----:|:-----|-----:|
| 0 | 1 | | 1 |
| 1 | 1 | 3 | 2 |
| 2 | 1 | 4 | 3 |
| 3 | 2 | 1 | 4 |
| 4 | 2 | 2 | 5 |
| 5 | 2 | | 6 |
I'd like to merge the two so that final results look like this.
| | id | pp | pc | yu |
|---:|-----:|:-----|:-----|-----:|
| 0 | 1 | | | 1 |
| 1 | 1 | 0 | 1 | 2 |
| 2 | 1 | 3 | 6 | 3 |
| 3 | 2 | 1 | 3 | 4 |
| 4 | 2 | 2 | 4 | 5 |
| 5 | 2 | | | 6 |
Basically, df1 has the values that I need to look up.
df2 has the id and pp columns that are used as the lookup keys.
However, running
pd.merge(df2, df1, on=['id', 'pp'], how='left')
results in
| | id | pp | pc | yu |
|---:|-----:|:-----|-----:|-----:|
| 0 | 1 | | 5 | 1 |
| 1 | 1 | | 3 | 1 |
| 2 | 1 | 3 | 6 | 2 |
| 3 | 1 | 4 | nan | 3 |
| 4 | 2 | 1 | 3 | 4 |
| 5 | 2 | 2 | 4 | 5 |
| 6 | 2 | | 2 | 6 |
This is not correct because it matches on the empty values as well.
If the value in df2 is empty, there should be no mapping.
I do want to keep the empty rows in df2 as shown, so I can't use an inner join.
We can drop the rows with an empty pp from df1 before merging, so that an empty pp in df2 has nothing to match:
import numpy as np
out = pd.merge(df2, df1.replace({'':np.nan}).dropna(), on=['id', 'pp'], how='left')
print(out)
id pp yu pc
0 1 1 NaN
1 1 3 2 6.0
2 1 4 3 NaN
3 2 1 4 3.0
4 2 2 5 4.0
5 2 6 NaN
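The same idea can be sketched without converting empty strings to NaN, by filtering df1 with a boolean mask before the merge. This is a variant of the answer above rather than the answerer's exact code; the name `df1_valid` is mine.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'pp': [3, '', 2, '', 1, 0, 4, 3, 2, 1, '', 0],
    'pc': [6, 5, 4, 3, 2, 1, 6, 5, 4, 3, 2, 1],
})
df2 = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'pp': ['', 3, 4, 1, 2, ''],
    'yu': [1, 2, 3, 4, 5, 6],
})

# Keep only the df1 rows that have a real lookup key
df1_valid = df1[df1['pp'] != '']

# A left join keeps every df2 row; rows with an empty or unknown pp get NaN in pc
out = pd.merge(df2, df1_valid, on=['id', 'pp'], how='left')
print(out)
```

The left join preserves the six rows of df2 in their original order, which is what rules out the duplicated rows seen in the question.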

Pass / Fail Dataframe example

I have the following code:
import pandas as pd
status = ['Pass','Fail']
item_info = pd.DataFrame({
'student': ['John','Alice','Pete','Mike','John','Alice','Joseph'],
'test': ['Pass','Pass','Pass','Pass','Pass','Pass','Pass']
})
item_status = pd.crosstab(item_info['student'],item_info['test'])
print(item_status)
Which produces:
| Student | Pass |
|---------|------|
| Alice | 2 |
| John | 2 |
| Joseph | 1 |
| Mike | 1 |
| Pete | 1 |
However, I want to create something that looks like this:
| Student | Pass | Fail | Total |
|---------|------|------|-------|
| Alice | 2 | 0 | 2 |
| John | 2 | 0 | 2 |
| Joseph | 1 | 0 | 1 |
| Mike | 1 | 0 | 1 |
| Pete | 1 | 0 | 1 |
How do I change the code so that it includes a Fail column with 0 for all of the students and provides a total?
A generic solution with reindex, which adds an extra label without needing to know the existing labels in advance:
cols = item_info['test'].unique().tolist()+['Fail'] #adding the extra label
pd.crosstab(item_info['student'],item_info['test']).reindex(columns=cols,fill_value=0)
Or, if you do not need to chain methods (which I assumed above), assign the column directly:
item_status = pd.crosstab(item_info['student'],item_info['test'])
item_status['Fail'] = 0
test Pass Fail
student
Alice 2 0
John 2 0
Joseph 1 0
Mike 1 0
Pete 1 0
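The answers above do not cover the Total column the question asks for. One way to sketch the full table is to reindex against the `status` list already defined in the question and then sum across the columns; this is my extension, not part of the original answer.

```python
import pandas as pd

status = ['Pass', 'Fail']
item_info = pd.DataFrame({
    'student': ['John', 'Alice', 'Pete', 'Mike', 'John', 'Alice', 'Joseph'],
    'test': ['Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass'],
})

# Force both status columns to exist, filling the missing one with 0
item_status = (pd.crosstab(item_info['student'], item_info['test'])
                 .reindex(columns=status, fill_value=0))

# Row-wise total over the Pass and Fail counts
item_status['Total'] = item_status.sum(axis=1)
print(item_status)
```

Reindexing against a fixed list also fixes the column order, so the layout stays the same whether or not any student has failed.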

How do you create rows for every category in a column?

Say that I have the following data, for example how many times my kids opened the fridge in each hour from 1 PM to 3 PM.
| ----- | ----- | ----- |
| Name | Hour | Open |
| ----- | ----- | ----- |
| Bob | 1 | 4 |
| ----- | ----- | ----- |
| Bob | 3 | 2 |
| ----- | ----- | ----- |
| Jane | 1 | 1 |
| ----- | ----- | ----- |
| Jane | 2 | 7 |
| ----- | ----- | ----- |
If I load this with pandas, how do I fill in the missing hours so that I get the following dataframe?
| ----- | ----- | ----- |
| Name | Hour | Open |
| ----- | ----- | ----- |
| Bob | 1 | 4 |
| ----- | ----- | ----- |
| Bob | 2 | None | <<-- New row with Null or 0 for 'Open' column.
| ----- | ----- | ----- |
| Bob | 3 | 2 |
| ----- | ----- | ----- |
| Jane | 1 | 1 |
| ----- | ----- | ----- |
| Jane | 2 | 7 |
| ----- | ----- | ----- |
| Jane | 3 | None | <<-- New row with Null or 0 for 'Open' column.
| ----- | ----- | ----- |
Obviously, I need it to be automatic so I can use it on real data, so I can't just insert a row by hand. The index or value sorting is not important.
The idea is to use DataFrame.reindex with all possible combinations, created by MultiIndex.from_product:
mux = pd.MultiIndex.from_product([df['Name'].unique(),
range(1, df['Hour'].max() + 1)], names=['Name','Hour'])
df1 = (df.set_index(['Name','Hour'])
.reindex(mux)
.reset_index())
print (df1)
Name Hour Open
0 Bob 1 4.0
1 Bob 2 NaN
2 Bob 3 2.0
3 Jane 1 1.0
4 Jane 2 7.0
5 Jane 3 NaN
With pandas 0.24+, it is possible to use the nullable integer data type so the column stays integer:
df1 = (df.set_index(['Name','Hour'])
.reindex(mux).astype('Int64')
.reset_index())
print (df1)
Name Hour Open
0 Bob 1 4
1 Bob 2 NaN
2 Bob 3 2
3 Jane 1 1
4 Jane 2 7
5 Jane 3 NaN
And to replace the missing values with 0, add the fill_value parameter:
df1 = (df.set_index(['Name','Hour'])
.reindex(mux, fill_value=0)
.reset_index())
print (df1)
Name Hour Open
0 Bob 1 4
1 Bob 2 0
2 Bob 3 2
3 Jane 1 1
4 Jane 2 7
5 Jane 3 0
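For reference, here is a self-contained sketch of the reindex approach. The DataFrame construction is assumed from the question's table, and instead of a range it takes the hours from the values observed in the data, which only works if every hour appears for at least one name.

```python
import pandas as pd

# Sample data reconstructed from the question (assumed construction)
df = pd.DataFrame({
    'Name': ['Bob', 'Bob', 'Jane', 'Jane'],
    'Hour': [1, 3, 1, 2],
    'Open': [4, 2, 1, 7],
})

# All (Name, Hour) combinations, using the hours that occur anywhere in the data
mux = pd.MultiIndex.from_product(
    [df['Name'].unique(), sorted(df['Hour'].unique())],
    names=['Name', 'Hour'],
)

# Reindex against the full grid; missing combinations get Open = 0
filled = (df.set_index(['Name', 'Hour'])
            .reindex(mux, fill_value=0)
            .reset_index())
print(filled)
```

If some hour is missing for every name, the range-based index from the answer above is the safer choice.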

Some calculations in pandas with the addition of a column

I have a table that has a column "Col1" that looks something like this:
| Col1 |
| 2 |
| 2 |
| 4 |
| 4 |
| 4 |
| 4 |
| 3 |
| 3 |
| 3 |
| 3 |
| 3 |
| 3 |
I need to create a new column "Col2". The table after this should look like this:
| Col1 | Col2 |
| 2 | 1 |
| 2 | 2 |
| 4 | 1 |
| 4 | 2 |
| 4 | 3 |
| 4 | 4 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
Is it possible to make the count start again from 1 when the same value keeps repeating past its own size? For example, with 3:
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
Let's try this pandas solution without looping:
df2 = df.assign(Col2=df.groupby('Col1')['Col1'].cumcount().mod(df['Col1']).add(1))
print(df2)
Output:
Col1 Col2
0 2 1
1 2 2
2 4 1
3 4 2
4 4 3
5 4 4
6 3 1
7 3 2
8 3 3
9 3 1
10 3 2
11 3 3
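Alternatively, a loop-based version that builds Col2 one block at a time:
import pandas as pd
df = pd.DataFrame({'Col1':[2,2,4,4,4,4,3,3,3,3,3,3]})
i = 0
Col2 = []
Col1 = df.Col1
# Construct Col2: append 1..Col1[i], then jump to the start of the next block
while i < len(Col1):
    Col2.extend(range(1, Col1[i] + 1))
    i = len(Col2)
# Add Col2 to the DataFrame
df['Col2'] = Col2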
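Both answers above assume that every block is complete. If the counter should also restart whenever the value in Col1 changes, one option is to group by consecutive runs instead of by value. The sketch below is my variation on the cumcount answer, and `run_id` is a name I introduce for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [2, 2, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3]})

# A new run starts whenever the value differs from the previous row;
# the cumulative sum of those change flags labels each run
run_id = df['Col1'].ne(df['Col1'].shift()).cumsum()

# Count within each run, wrap around at the run's own value, and start at 1
df['Col2'] = df.groupby(run_id).cumcount().mod(df['Col1']).add(1)
print(df)
```

On the question's data this gives the same Col2 as the answers above; it differs only when the same value shows up again later after an incomplete block.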