Add indicator columns to show where the data came from - python

Many thanks for reading.
I have a pandas data frame which is the result of a concatenation of multiple smaller data frames. What I want to do is add multiple indicator columns to my final data frame, so that I can see which smaller data frame each row came from.
This would be my desired result:
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
jon smith 0 0 0 1
charlie jim 1 0 0 1
ian james 0 1 0 0
For example, "Jon Smith" came from data frame 4, and "Charlie Jim" came from data frames 1 and 4 (duplicate rows).
I have been able to achieve this for rows that only came from one data frame (e.g. rows 1 and 3) but not for duplicate rows that came from multiple data frames (e.g. row 2).
Many thanks for any help.

You can use:
first, concat with the parameter keys to identify the DataFrames
reset_index to turn the MultiIndex levels into columns
groupby and join the indicator keys
create the indicator columns with str.get_dummies
reindex if you need to append 0 columns for missing categories
reset_index to turn the Index back into columns
import pandas as pd

df1 = pd.DataFrame({'Forename':['charlie'], 'Surname':['jim']})
df2 = pd.DataFrame({'Forename':['ian'], 'Surname':['james']})
df3 = pd.DataFrame()
df4 = pd.DataFrame({'Forename':['charlie', 'jon'], 'Surname':['jim', 'smith']})
#list of DataFrames
dfs = [df1, df2, df3, df4]
#generate indicators
inds = ['Ind_{}'.format(x+1) for x in range(len(dfs))]
df = (pd.concat(dfs, keys=inds)
        .reset_index()
        .groupby(['Forename','Surname'])['level_0']
        .apply('|'.join)
        .str.get_dummies()
        .reindex(columns=inds, fill_value=0)
        .reset_index())
print (df)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1
A more general solution groups by all columns:
df = pd.concat(dfs, keys=inds)
print (df)
Forename Surname
Ind_1 0 charlie jim
Ind_2 0 ian james
Ind_4 0 charlie jim
1 jon smith
df1 = (df.reset_index()
         .groupby(df.columns.tolist())['level_0']
         .apply('|'.join)
         .str.get_dummies()
         .reindex(columns=inds, fill_value=0)
         .reset_index())
print (df1)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1
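An alternative sketch of the same idea, using the sample frames above: tag each row with its source frame, then build the indicators with pd.crosstab instead of str.get_dummies. clip keeps duplicate rows inside a single source frame from producing counts above 1.

```python
import pandas as pd

df1 = pd.DataFrame({'Forename': ['charlie'], 'Surname': ['jim']})
df2 = pd.DataFrame({'Forename': ['ian'], 'Surname': ['james']})
df3 = pd.DataFrame()
df4 = pd.DataFrame({'Forename': ['charlie', 'jon'], 'Surname': ['jim', 'smith']})

dfs = [df1, df2, df3, df4]
inds = ['Ind_{}'.format(x + 1) for x in range(len(dfs))]

# tag every row with the frame it came from, then cross-tabulate
stacked = (pd.concat(dfs, keys=inds)
             .reset_index(level=0)
             .rename(columns={'level_0': 'source'}))
out = (pd.crosstab([stacked['Forename'], stacked['Surname']], stacked['source'])
         .clip(upper=1)                        # counts -> 0/1 indicators
         .reindex(columns=inds, fill_value=0)  # add Ind_3 for the empty frame
         .reset_index())
```

Same output as above, just counted rather than string-joined; either way the empty df3 only appears via the reindex step.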

Related

Python: filter dataframe rows so that values are in columns of other dataframe

Consider that I have one dataframe that looks like this:
Gender Employed
1 1
0 0
1 0
0 1
And I have another dataframe that looks like:
Name Area
Andy Gender
Ally HR
Chris Employed
Tom Food
I only want to keep the rows whose Area value corresponds to a column name of my first dataframe. These are example dataframes; my actual dataframe has hundreds of columns, so very specific answers involving 'Gender' and 'Employed' won't work.
The end result should be
Name Area
Andy Gender
Chris Employed
You can filter a DataFrame like this (here df is the Name/Area frame and df2 is the frame whose columns you want to match):
df[df['Area'].isin(df2.columns)]
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics for DataFrame slicing basics.
You want to slice the 2nd dataframe based on the columns found in the first dataframe.
import pandas as pd

df1 = pd.DataFrame(
    [(1, 1), (0, 0), (1, 0), (0, 1)],
    columns=["Gender", "Employed"]
)
# Gender Employed
# 0 1 1
# 1 0 0
# 2 1 0
# 3 0 1
df2 = pd.DataFrame(
    [("Andy", "Gender"), ("Ally", "HR"),
     ("Chris", "Employed"), ("Tom", "Food")],
    columns=["Name", "Area"]
)
# Name Area
# 0 Andy Gender
# 1 Ally HR
# 2 Chris Employed
# 3 Tom Food
df2[df2.Area.isin(df1.columns)]
# Name Area
# 0 Andy Gender
# 2 Chris Employed

Joining two tables on fields having different names but same meaning

I have two datasets, df1 having columns
Date Name Text Label
John 1
Jack 0
Jim 1
(I only filled those fields that I need)
and df2 having columns
NickName Label
John 1
John 1
Wes 0
Jim 0
Jim 0
Jim 0
Martin 0
Name and NickName indicate the same thing; however, some observations might be included in only one of the two columns. Label in df1 is not the same as Label in df2 (an unfortunate name choice), so I will need to rename Label in df2, for example to Index.
I would like df2 to also have the Label column from df1 for those NickName values that are in df1 and, for those not in df1, the value -1.
The expected output should be
NickName Label Index
John 1 1
John 1 1
Wes 0 -1
Jim 0 0
Jim 0 0
Jim 0 0
Martin 0 0
...
Please note that all Name in df1 are in df2.
Renaming the column is no problem (using rename in pandas), but I need to understand how to merge the two datasets to get the three columns and corresponding values shown in the expected output. I am not familiar with merging/joining, but I would guess I need something like
df1.append(df2)
You can use pd.DataFrame.merge and add suffixes to the columns so you can see which original DataFrame they came from.
df1.merge(
    df2,
    left_on='Name',
    right_on='NickName',
    suffixes=('_left', '_right'),
    how="outer",
)
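The merge by itself still leaves NaN where a NickName has no match in df1. A sketch of the full recipe the question describes (rename df2's own Label to Index, pull in df1's Label on the name columns, and fill the misses with -1), using small frames that mimic the question's data; this follows the stated rule rather than the sample output literally:

```python
import pandas as pd

# hypothetical frames mirroring the question's columns
df1 = pd.DataFrame({'Name': ['John', 'Jack', 'Jim'], 'Label': [1, 0, 1]})
df2 = pd.DataFrame({'NickName': ['John', 'John', 'Wes', 'Jim', 'Jim', 'Jim', 'Martin'],
                    'Label': [1, 1, 0, 0, 0, 0, 0]})

out = (df2.rename(columns={'Label': 'Index'})   # keep df2's own label under a new name
          .merge(df1, left_on='NickName', right_on='Name', how='left')
          .drop(columns='Name'))
out['Label'] = out['Label'].fillna(-1).astype(int)  # -1 for NickNames absent from df1
```

A left merge keeps every df2 row; appending df1 with df1.append(df2) would stack rows instead of matching them, which is why merge is the right tool here.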

Inserting new columns by splitting each column, iterating over many columns in a python pandas DataFrame

Here I have an example dataframe:
import pandas as pd

dfx = pd.DataFrame({
    'name': ['alex','bob','jack'],
    'age': ["0,26,4","1,25,4","5,30,2"],
    'job': ["x,abc,0","y,xyz,1","z,pqr,2"],
    'gender': ["0,1","0,1","0,1"]
})
I want to first split column dfx['age'] and insert 3 separate columns, one for each substring in the age value, naming them dfx['age1'], dfx['age2'], dfx['age3']. I used the following code for this:
dfx = dfx.assign(**{'age1': dfx['age'].str.split(',', expand=True)[0],
                    'age2': dfx['age'].str.split(',', expand=True)[1],
                    'age3': dfx['age'].str.split(',', expand=True)[2]})
dfx = dfx[['name', 'age','age1', 'age2', 'age3', 'job', 'gender']]
dfx
So far so good!
Now, I want to repeat the same operations to other columns job and gender.
Desired Output
name age age1 age2 age3 job job1 job2 job3 gender gender1 gender2
0 alex 0,26,4 0 26 4 x,abc,0 x abc 0 0,1 0 1
1 bob 1,25,4 1 25 4 y,xyz,1 y xyz 1 0,1 0 1
2 jack 5,30,2 5 30 2 z,pqr,2 z pqr 2 0,1 0 1
I have no problem doing this individually for a small data frame like this one. But the actual data file has many such columns, so I need iteration.
I found it difficult to iterate over the columns and to name the individual new columns.
I would be very glad to have a better solution for it.
Thanks !
Use a list comprehension to split each column listed in cols into its own DataFrame and rename the new columns, join everything together with concat (sorting the column names), and then prepend the unmatched columns with DataFrame.join:
cols = ['age','job','gender']
L = [dfx[x].str.split(',',expand=True).rename(columns=lambda y: f'{x}{y+1}') for x in cols]
df1 = dfx[dfx.columns.difference(cols)]
df = df1.join(pd.concat([dfx[cols]] + L, axis=1).sort_index(axis=1))
print (df)
name age age1 age2 age3 gender gender1 gender2 job job1 job2 job3
0 alex 0,26,4 0 26 4 0,1 0 1 x,abc,0 x abc 0
1 bob 1,25,4 1 25 4 0,1 0 1 y,xyz,1 y xyz 1
2 jack 5,30,2 5 30 2 0,1 0 1 z,pqr,2 z pqr 2
Thanks again @jezrael for your answer. Inspired by the use of f-strings, I solved the problem using iteration as follows:
for col in dfx.columns[1:]:
    for i in range(len(dfx[col][0].split(','))):
        dfx[f'{col}{i+1}'] = dfx[col].str.split(',', expand=True)[i]
dfx = dfx[['name', 'age', 'age1', 'age2', 'age3', 'job', 'job1', 'job2', 'job3',
           'gender', 'gender1', 'gender2']]
dfx
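If hard-coding cols = ['age','job','gender'] is impractical for the real file, here is a sketch that detects the splittable columns automatically. The detection rule is an assumption: every column to split is a string column whose values all contain commas.

```python
import pandas as pd

dfx = pd.DataFrame({
    'name': ['alex', 'bob', 'jack'],
    'age': ["0,26,4", "1,25,4", "5,30,2"],
    'job': ["x,abc,0", "y,xyz,1", "z,pqr,2"],
    'gender': ["0,1", "0,1", "0,1"],
})

# assumed rule: split every string column whose values all contain a comma
cols = [c for c in dfx.columns
        if dfx[c].dtype == object and dfx[c].str.contains(',').all()]

parts = [dfx[c].str.split(',', expand=True)
               .rename(columns=lambda i, c=c: f'{c}{i + 1}')  # 0,1,2 -> age1, age2, age3
         for c in cols]
out = pd.concat([dfx] + parts, axis=1)
```

Note that the new columns hold strings ('26', not 26); convert with astype or pd.to_numeric if numbers are needed.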

Merge datasets with certain priority

I have 3 datasets, all with the same shape:
CustomerNumber, Name, Status
A customer can appear in 1, 2 or all 3.
Each dataset is a gold/silver/bronze list.
example data:
Dataframe 1:
100,James,Gold
Dataframe 2:
100,James,Silver
101,Paul,Silver
Dataframe 3:
100,James,Bronze
101,Paul,Bronze
102,Fred,Bronze
Expected output/aggregated list:
100,James,Gold
101,Paul,Silver
102,Fred,Bronze
So for a customer that is captured in all 3, I want to keep the Status as Gold.
Have been playing with join and merge and just can’t get it right.
Use concat, convert the column to an ordered categorical so the priorities are respected when sorting by multiple columns, and finally remove duplicates with DataFrame.drop_duplicates:
print (df1)
print (df2)
print (df3)
a b c
0 100 James Gold
a b c
0 100 James Silver
1 101 Paul Silver
a b c
0 101 Paul Bronze
1 102 Fred Bronze
df = pd.concat([df1, df2, df3], ignore_index=True)
df['c'] = pd.Categorical(df['c'], ordered=True, categories=['Gold','Silver','Bronze'])
df = df.sort_values(['a','b','c']).drop_duplicates(['a','b'])
print (df)
a b c
0 100 James Gold
2 101 Paul Silver
4 102 Fred Bronze
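An equivalent sketch that avoids the categorical entirely: map each status to a numeric priority and keep, per customer, the row with the smallest number. It uses the answer's assumed column names a/b/c and its sample data.

```python
import pandas as pd

df1 = pd.DataFrame([[100, 'James', 'Gold']], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([[100, 'James', 'Silver'], [101, 'Paul', 'Silver']],
                   columns=['a', 'b', 'c'])
df3 = pd.DataFrame([[100, 'James', 'Bronze'], [101, 'Paul', 'Bronze'],
                    [102, 'Fred', 'Bronze']], columns=['a', 'b', 'c'])

priority = {'Gold': 0, 'Silver': 1, 'Bronze': 2}   # lower number = higher priority
df = pd.concat([df1, df2, df3], ignore_index=True)

# per (CustomerNumber, Name) group, the index of the best-priority row
best = df['c'].map(priority).groupby([df['a'], df['b']]).idxmin()
out = df.loc[best.values].sort_values('a').reset_index(drop=True)
```

The explicit priority dict makes the ranking obvious; the categorical version in the answer encodes the same ordering through the categories list.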

Creating new records in dataframe based on character

I have fields in a pandas dataframe like the sample data below. The values in one of the fields are fractions of the form something/count(something). I would like to split the values as in the example output below and create new records, basically one for the numerator and one for the denominator. Some of the values even have multiple /, like count(something)/count(thing)/count(dog), so I'd want to split that value into 3 records. Any tips on how to do this would be greatly appreciated.
Sample Data:
SampleDf=pd.DataFrame([['tom','sum(stuff)/count(things)'],['bob','count(things)/count(stuff)']],columns=['ReportField','OtherField'])
Example Output:
OutputDf=pd.DataFrame([['tom1','sum(stuff)'],['tom2','count(things)'],['bob1','count(things)'],['bob2','count(stuff)']],columns=['ReportField','OtherField'])
There might be a better way, but try this:
df = df.set_index('ReportField')
df = pd.DataFrame(df.OtherField.str.split('/', expand = True).stack().reset_index(-1, drop = True)).reset_index()
You get
ReportField 0
0 tom sum(stuff)
1 tom count(things)
2 bob count(things)
3 bob count(stuff)
One possible way might be as following:
# split and stack
new_df = pd.DataFrame(SampleDf.OtherField.str.split('/').tolist(), index=SampleDf.ReportField).stack().reset_index()
print(new_df)
Output:
ReportField level_1 0
0 tom 0 sum(stuff)
1 tom 1 count(things)
2 bob 0 count(things)
3 bob 1 count(stuff)
Now, combine ReportField with level_1:
# combine strings for tom1, tom2 ,.....
new_df['ReportField'] = new_df.ReportField.str.cat((new_df.level_1+1).astype(str))
# remove level column
del new_df['level_1']
# rename columns
new_df.columns = ['ReportField', 'OtherField']
print (new_df)
Output:
ReportField OtherField
0 tom1 sum(stuff)
1 tom2 count(things)
2 bob1 count(things)
3 bob2 count(stuff)
You can use:
split with expand=True to create a new DataFrame
reshape with stack and reset_index
append a counter to the ReportField column, converted to str by astype
remove the helper column level_1 with drop
OutputDf = (SampleDf.set_index('ReportField')['OtherField'].str.split('/', expand=True)
            .stack().reset_index(name='OtherField'))
OutputDf['ReportField'] = OutputDf['ReportField'] + OutputDf['level_1'].add(1).astype(str)
OutputDf = OutputDf.drop('level_1', axis=1)
print (OutputDf)
ReportField OtherField
0 tom1 sum(stuff)
1 tom2 count(things)
2 bob1 count(things)
3 bob2 count(stuff)
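On newer pandas (1.1+, which added the ignore_index parameter to explode) the same result can be sketched with DataFrame.explode plus a cumcount-based suffix, with no helper column to drop:

```python
import pandas as pd

SampleDf = pd.DataFrame([['tom', 'sum(stuff)/count(things)'],
                         ['bob', 'count(things)/count(stuff)']],
                        columns=['ReportField', 'OtherField'])

# split each value into a list, then one row per list element
out = (SampleDf.assign(OtherField=SampleDf['OtherField'].str.split('/'))
               .explode('OtherField', ignore_index=True))
# number the repeats: tom -> tom1, tom2; bob -> bob1, bob2
out['ReportField'] += (out.groupby('ReportField').cumcount() + 1).astype(str)
```

cumcount numbers the rows within each ReportField group starting at 0, so adding 1 gives the 1-based suffixes the expected output uses.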
