How to add a column into Pandas with a condition - python

Here is a simple pandas DataFrame:
data={'Name': ['John', 'Dav', 'Ann', 'Mike', 'Dany'],
'Number': ['2', '3', '2', '4', '2']}
df = pd.DataFrame(data, columns=['Name', 'Number'])
df
I would like to add a third column named "color" where the value is 'Red' if Number is 2 and 'Blue' if Number is 3.
This dataframe has just 5 rows; in reality it has thousands of rows, so I cannot just add the column manually.

You can use .map:
dct = {2: "Red", 3: "Blue"}
df["color"] = df["Number"].astype(int).map(dct) # remove .astype(int) if the values are already integer
print(df)
Prints:
Name Number color
0 John 2 Red
1 Dav 3 Blue
2 Ann 2 Red
3 Mike 4 NaN
4 Dany 2 Red
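If numbers without a mapping (like 4 above) should get a default label instead of NaN, one option is to chain .fillna — a small sketch building on the dct above; the "Unknown" label is just a placeholder:
df["color"] = df["Number"].astype(int).map(dct).fillna("Unknown")  # "Unknown" is an arbitrary default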

Related

Pandas merge dataframe with conditions depends on value in a column

Any help will be appreciated.
I have 2 DataFrames.
The first DataFrame, schedule, contains a person's activity schedule, as follows:
PersonID Person Origin Destination
3-1 1 A B
3-1 1 B A
13-1 1 C D
13-1 1 D C
13-2 2 A B
13-2 2 B A
And I have another DataFrame, household, containing the details of the person/agent.
PersonID1 Age1 Gender1 PersonID2 Age2 Gender2
3-1 20 M NaN NaN NaN
13-1 45 F 13-2 17 M
I want to perform a VLOOKUP on these two using pd.merge. Since the lookup (merge) depends on the person's ID, I tried to do it with a condition.
def merging(row):
    if row['Person'] == 1:
        row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age1', 'Gender1'])
    else:
        row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age2', 'Gender2'])
    return row

schedule_merged = schedule.apply(merging, axis=1)
However, for some reason, it just doesn't work. The error says ValueError: len(right_on) must equal len(left_on). I'm aiming to end up with this kind of data:
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M
I think I messed up the pd.merge lines. While it might be more convenient to use VLOOKUP in Excel, it's just too heavy for my PC, since I have to apply this to hundreds of thousands of rows. How could I do this properly? Thanks!
This is how I would do it if the real dataset is not more complicated than the given example. Otherwise I would suggest looking at pd.melt() for more complex unpivoting.
import pandas as pd
import numpy as np
# Create Dummy schedule DataFrame
d = {'PersonID': ['3-1', '3-1', '13-1', '13-1', '13-2', '13-2'], 'Person': ['1', '1', '1', '1', '2', '2'], 'Origin': ['A', 'B', 'C', 'D', 'A', 'B'], 'Destination': ['B', 'A', 'D', 'C', 'B', 'A']}
schedule = pd.DataFrame(data=d)
schedule
# Create Dummy household DataFrame
d = {'PersonID1': ['3-1', '13-1'], 'Age1': ['20', '45'], 'Gender1': ['M', 'F'], 'PersonID2': [np.nan, '13-2'], 'Age2': [np.nan, '17'], 'Gender2': [np.nan, 'M']}
household = pd.DataFrame(data=d)
household
# Select columns for PersonID1 and rename columns
household1 = household[['PersonID1', 'Age1', 'Gender1']]
household1.columns = ['PersonID', 'Age', 'Gender']
# Select columns for PersonID2 and rename columns
household2 = household[['PersonID2', 'Age2', 'Gender2']]
household2.columns = ['PersonID', 'Age', 'Gender']
# Concat them together
household_new = pd.concat([household1, household2])
# Merge household and schedule df together on PersonID
schedule = schedule.merge(household_new, how='left', left_on='PersonID', right_on='PersonID', validate='many_to_one')
Output
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M
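For the more complex case, a sketch of the pd.wide_to_long route (an alternative to the pd.melt() suggestion above), assuming the dummy household and schedule frames built here:
# Unpivot PersonID1/Age1/Gender1 and PersonID2/Age2/Gender2 into long form
hh = household.reset_index()  # wide_to_long needs an id column; use the row index
household_long = (pd.wide_to_long(hh, stubnames=['PersonID', 'Age', 'Gender'],
                                  i='index', j='member')
                  .dropna(subset=['PersonID'])  # drop the empty second-member slot
                  .reset_index(drop=True))
schedule.merge(household_long, on='PersonID', how='left')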

Label Encoder and Inverse_Transform on SOME Columns

Suppose I have a dataframe like the following
df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
'color': ['Black', 'Blue', 'Brown', 'Black'],
'age': [1, 10, 3, 6],
'pet': [1, 0, 1, 1],
'sex': ['m', 'm', 'f', 'f'],
'name': ['Rex', 'Gizmo', 'Suzy', 'Boo']})
I want to use label encoder to encode "animal", "color", "sex" and "name", but I don't need to encode the other two columns. I also want to be able to inverse_transform the columns afterwards.
I have tried the following, and although encoding works as I'd expect it to, reversing does not.
to_encode = ["animal", "color", "sex", "name"]
le = LabelEncoder()
for col in to_encode:
df[col] = fit_transform(df[col])
## to inverse:
for col in to_encode:
df[col] = inverse_transform(df[col])
The inverse_transform function results in the following dataframe:
  animal  color  age  pet    sex   name
0    Rex    Boo    1    1  Gizmo    Rex
1    Boo  Gizmo   10    0  Gizmo  Gizmo
2    Rex    Rex    3    1    Boo   Suzy
3  Gizmo    Boo    6    1    Boo    Boo
It's obviously not right, but I'm not sure how else I'd accomplish this.
Any advice would be appreciated!
As you can see in your output, when you try to inverse_transform, the code is only using the information it obtained for the last column, "name". You can tell because all the columns now contain name values. You should have one LabelEncoder() for each column.
The key here is to have one LabelEncoder fitted for each different column. To do this, I recommend you save them in a dictionary:
to_encode = ["animal", "color", "sex", "name"]
d = {}
for col in to_encode:
    d[col] = preprocessing.LabelEncoder().fit(df[col])  # one entry per column; note we are only fitting for now
If we print the dictionary now, we will obtain something like this:
{'animal': LabelEncoder(),
'color': LabelEncoder(),
'sex': LabelEncoder(),
'name': LabelEncoder()}
As we can see, for each column we want to transform, we have its fitted LabelEncoder(). This means, for example, that the animal LabelEncoder records that 0 corresponds to Bird, 1 to Cat, and so on, and the same holds for each column.
Once we have every column fitted, we can transform and, later, inverse_transform. The only thing to be aware of is that every transform/inverse_transform has to use the corresponding LabelEncoder of its column.
Here we transform:
for col in to_encode:
    df[col] = d[col].transform(df[col])  # be aware we are using the dictionary
df
animal color age pet sex name
0 2 0 1 1 1 2
1 0 1 10 0 1 1
2 2 2 3 1 0 3
3 1 0 6 1 0 0
And, once the df is transformed, we can inverse_transform:
for col in to_encode:
    df[col] = d[col].inverse_transform(df[col])
df
animal color age pet sex name
0 Dog Black 1 1 m Rex
1 Bird Blue 10 0 m Gizmo
2 Dog Brown 3 1 f Suzy
3 Cat Black 6 1 f Boo
One interesting idea could be using ColumnTransformer, but unfortunately, it doesn't support inverse_transform().
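For reference, the same per-column-encoder idea can be written more compactly — a sketch assuming the df and to_encode from above:
from sklearn.preprocessing import LabelEncoder

# one fitted encoder per column
d = {col: LabelEncoder().fit(df[col]) for col in to_encode}
# encode, then decode, looking up each column's own encoder by name
df[to_encode] = df[to_encode].apply(lambda s: d[s.name].transform(s))
df[to_encode] = df[to_encode].apply(lambda s: d[s.name].inverse_transform(s))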

subset pandas df columns with partial string match OR match before "?" using lists of names

I hope someone might help me.
I have a dataframe that includes columns with similar names (see the example data).
I also have 3 additional lists of column names, which contain the original names of the columns, i.e. the string occurring before the question mark (see the lists of column names below).
I need to subset the df dataframe into 3 separate dataframes, based on matching the first part of the column names against the 3 lists. The expected output is at the bottom.
It has to work with lists (or something programmatic), as I have lots and lots of columns like this. I tried pattern matching, but because some names are very similar, they matched multiple lists.
thank you in advance!
example data
df = {'id': ['1','2','3','4'],
'ab? op': ['green', 'red', 'blue', 'None'],
'ab? 1': ['red', 'yellow', 'None', 'None'],
'cd': ['L', 'XL', 'M','L'],
'efab? cba' : ['husband', 'wife', 'husband', 'None'],
'efab? 1':['son', 'grandparent', 'son', 'None'],
'efab? 2':['None', 'son', 'None', 'None'],
'fab? 4':['9', '10', '5', '3'],
'fab? po':['England', 'Scotland', 'Wales', 'NA'] }
df = pd.DataFrame(df, columns = ['id','ab? op', 'ab? 1', 'cd', 'efab? cba', 'efab? 1', 'efab? 2', 'fab? 4', 'fab? po'])
lists of column names for the other 3 data frames
df1_lst = ['ab', 'cd']
df2_lst = ['efab']
df3_lst = ['fab']
desired output
df1 = ['ab? op', 'ab? 1', 'cd']
df2 = ['efab? cba', 'efab? 1', 'efab? 2']
df3 = ['fab? 4', 'fab? po']
You can form a dynamic regex for each of the df lists:
df_lists = [df1_lst, df2_lst, df3_lst]
result = [df.filter(regex=fr"\b({'|'.join(names)})\??") for names in df_lists]
e.g., for the first list the regex is \b(ab|cd)\??, i.e. look for either ab or cd, anchored on the left by a word boundary (\b), optionally followed by a ?.
The desired entries are in the result list e.g.
>>> result[1]
efab? cba efab? 1 efab? 2
0 husband son None
1 wife grandparent son
2 husband son None
3 None None None
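One caveat, as an aside: if the list entries could ever contain regex metacharacters, escaping them first is safer — a small sketch, not part of the original answer:
import re

result = [df.filter(regex=fr"\b({'|'.join(map(re.escape, names))})\??")
          for names in df_lists]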
Split the column names by ?, keep the first part, and check whether it is in each list:
df1 = df.loc[:, df.columns.str.split('?').str[0].isin(df1_lst)]
df2 = df.loc[:, df.columns.str.split('?').str[0].isin(df2_lst)]
df3 = df.loc[:, df.columns.str.split('?').str[0].isin(df3_lst)]
>>> df1
ab? op ab? 1 cd
0 green red L
1 red yellow XL
2 blue None M
3 None None L
>>> df2
efab? cba efab? 1 efab? 2
0 husband son None
1 wife grandparent son
2 husband son None
3 None None None
>>> df3
fab? 4 fab? po
0 9 England
1 10 Scotland
2 5 Wales
3 3 NA
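The same split-on-? idea can also be looped over the three lists so it scales to many lists — a sketch assuming the df and the *_lst names from the question:
base = df.columns.str.split('?').str[0]  # column name before the '?'
df1, df2, df3 = (df.loc[:, base.isin(lst)] for lst in (df1_lst, df2_lst, df3_lst))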

Identifying the columns having duplicate column value with Different column name in python

How can I identify the columns in a data frame that have the same values but different column names? I need to list both columns; here I am able to list only one of them.
from pandas import DataFrame
import numpy as np
import pandas as pd
raw_data = {
'id': ['1', '2', '2', '3', '3'],
'name': ['A', 'B', 'B', 'C', 'D'],
'age' : [1, 2, 2, 3, 3],
'name_dup': ['A', 'B', 'B', 'C', 'D'],
'age_dup': [1, 2, 2, 3, 3]}
df = pd.DataFrame(raw_data, columns = ['id', 'name','age','name_dup','age_dup'])
One can observe that name and name_dup have the same column values but different column names. With the function below I am able to get only name as output, whereas the expected output also includes name_dup.
def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            iv = vs.iloc[:, i].tolist()
            for j in range(i + 1, lcs):
                jv = vs.iloc[:, j].tolist()
                if iv == jv:
                    dups.append(cs[i])
                    break
    return dups

duplicate_columns(df)
The output of the above code is ['name', 'age']. The expected output lists both columns of each duplicate pair: name and name_dup, age and age_dup.
Further to this, I want to drop one column of each pair and rename the kept column from list_check, given a list of column names:
list_check = ['name','age']
Expected DataFrame
Note: the duplicate will not always be named colname_dup; it can also be something like lname.
Do you mean this:
s = df.T.duplicated().reset_index()
vals = s.loc[s[0], 'index'].tolist()
colk = df.columns.drop(vals)
print(vals)
print(colk)
print(df.drop(vals, axis=1))
Output:
['name_dup', 'age_dup']
['id', 'name', 'age']
id name age
0 1 A 1
1 2 B 2
2 2 B 2
3 3 C 3
4 3 D 3
You can try this:
df.T.drop_duplicates().T
output:
id name age
0 1 A 1
1 2 B 2
2 2 B 2
3 3 C 3
4 3 D 3
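If the goal is to list the duplicate pairs themselves (name together with name_dup), one option — a sketch, not taken from either answer — is to group the column names by their column values; sort=False avoids comparing mixed-type group keys:
groups = df.T.groupby(df.T.apply(tuple, axis=1), sort=False).groups
dup_pairs = [list(cols) for cols in groups.values() if len(cols) > 1]
print(dup_pairs)  # [['name', 'name_dup'], ['age', 'age_dup']]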

Python Pandas Groupby isin

I have a dataframe that lists different teams (green, blue, yellow, orange; there are hundreds of teams) along with their revenue on a monthly basis. I want to create a list of the top 10 teams based on revenue and then feed that into a groupby statement, so that I am only looking at those teams as I work through various dataframes. These are the statements I have created and am having trouble with:
Rev = df['Revenue'].head(10)
and I have also used
Rev = df.nlargest(10, ['Revenue'])
grpby = df.groupby([df['team'].isin(rev), 'team'], as_index=False)['Revenue'].sum().sort_values('Revenue', ascending=False).reset_index()
*Edit: Other code leading up to this request:
*Edit: df = pd.read_excel('c:/Test.xlsx', sheet_name="Sheet1", index_col = 'Date', parse_dates=True)
*Edit: df = pd.DataFrame(df)
I can make the groupby statement work, but I cannot feed in the 'Rev' list to the groupby statement that limits/filters which groups to look at.
Also, when using a groupby statement to create a dataframe, how do I add back in other columns that are not being grouped? For example, in my statement above I use 'team' and 'revenue', but if I also wanted to add other columns like 'location' or 'team lead', what is the syntax for that?
*Edit
Sample input via excel file:
Teams Revenue
Green 10
Blue 15
Red 20
Orange 5
In the above example, I would like a statement that takes the top three teams, saves them as a list, and then feeds that into the groupby statement. Right now it looks like I have not filled the actual dataframe:
*from the console:
Empty DataFrame
Columns: [Team, Revenue]
Index: []
You need to filter as a first step, by boolean indexing:
Sample:
df = pd.DataFrame({'Teams': ['Green', 'Blue', 'Red', 'Orange', 'Green', 'Blue', 'Grey', 'Purple'],
'Revenue': [18, 15, 20, 5, 10, 15, 2, 5],
'Location': ['A', 'B', 'V', 'G', 'A', 'D', 'B', 'C']})
print (df)
Teams Revenue Location
0 Green 18 A
1 Blue 15 B
2 Red 20 V
3 Orange 5 G
4 Green 10 A
5 Blue 15 D
6 Grey 2 B
7 Purple 5 C
First get top values and select column Teams:
Rev = df.nlargest(3,'Revenue')['Teams']
print (Rev)
2 Red
0 Green
1 Blue
Name: Teams, dtype: object
Then filter by boolean indexing:
print (df[df['Teams'].isin(Rev)])
Teams Revenue Location
0 Green 18 A
1 Blue 15 B
2 Red 20 V
4 Green 10 A
5 Blue 15 D
df1 = (df[df['Teams'].isin(Rev)]
         .groupby('Teams', as_index=False)['Revenue']
         .sum()
         .sort_values('Revenue', ascending=False))
print (df1)
Teams Revenue
0 Blue 30
1 Green 28
2 Red 20
If you need multiple columns in the output, it is necessary to set an aggregation function for each of them, like:
df2 = (df[df['Teams'].isin(Rev)]
         .groupby('Teams', as_index=False)
         .agg({'Revenue': 'sum', 'Location': ', '.join, 'Another col': 'mean'}))
print (df2)
Teams Revenue Location
0 Blue 30 B, D
1 Green 28 A, A
2 Red 20 V
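One caveat worth noting: nlargest(3, 'Revenue') picks the top 3 rows, not the top 3 teams by total revenue. If the latter is wanted, the sums can be computed first — a sketch assuming the same df as above:
top = df.groupby('Teams')['Revenue'].sum().nlargest(3).index  # top 3 teams by summed revenue
df1 = (df[df['Teams'].isin(top)]
         .groupby('Teams', as_index=False)['Revenue']
         .sum()
         .sort_values('Revenue', ascending=False))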
