Split a column into multiple columns / cleaning a data set - Python

So I have read a table from a PDF into a pandas DataFrame; it looks like the following:
df_current = pd.DataFrame({'Country': ['NaN', 'NaN', 'NaN', 'NaN', 'Denmark', 'Sweden', 'Germany'],
                           'Explained Part': ['Personal and job characteristics',
                                              'Education Occupation Job Employment', 'experience contract',
                                              'Employment contract', '20 -7 2 0', '4 6 2 0', '-9 -6 -1 :']})
The expected output (the result I aim for in the end):
df_expected = pd.DataFrame({'Country': ['Denmark', 'Sweden', 'Germany'],
                            'Personal and job characteristics': [20, 4, -9],
                            'Education Occupation Job Employment': [-7, 6, -6],
                            'experience contract': [2, 2, -1],
                            'Employment contract': [0, 0, ':']})
The problem is that the column 'Explained Part' holds 4 columns' worth of data, and some of the data is shown as symbols, like ':'.
I was thinking about using
df[['Personal and job characteristics',
'Education Occupation Job Employment',
'experience contract',
'experience contract']] = df['Explained part'].str.split(" ",expand=True,)
But I cannot get it to work.
I want to split the column into separate columns, but some cells hold the header text while others hold the space-separated numbers.
Any ideas?
Thanks in advance ~
PS: I have updated the question as I think my first post was too hard to understand. I have now added some of the data from the actual problem and an expected output. Thanks for the feedback so far!

If the NaNs are missing values, first remove the rows containing them with DataFrame.dropna, then apply your solution, using DataFrame.pop to extract the column:
import numpy as np
import pandas as pd

df_current = pd.DataFrame({'Country': [np.nan, np.nan, np.nan, np.nan, 'Denmark', 'Sweden', 'Germany'],
                           'Explained Part': ['Personal and job characteristics',
                                              'Education Occupation Job Employment', 'experience contract',
                                              'Employment contract', '20 -7 2 0', '4 6 2 0', '-9 -6 -1 :']})
print(df_current)
Country Explained Part
0 NaN Personal and job characteristics
1 NaN Education Occupation Job Employment
2 NaN experience contract
3 NaN Employment contract
4 Denmark 20 -7 2 0
5 Sweden 4 6 2 0
6 Germany -9 -6 -1 :
df = df_current.dropna(subset=['Country']).copy()
cols = ['Personal and job characteristics','Education Occupation Job Employment',
'experience contract','Employment contract']
df[cols] = df.pop('Explained Part').str.split(expand=True)
print (df)
Country Personal and job characteristics \
4 Denmark 20
5 Sweden 4
6 Germany -9
Education Occupation Job Employment experience contract Employment contract
4 -7 2 0
5 6 2 0
6 -6 -1 :
Or without pop:
df = df_current.dropna(subset=['Country']).copy()
cols = ['Personal and job characteristics','Education Occupation Job Employment',
'experience contract','Employment contract']
df[cols] = df['Explained Part'].str.split(expand=True)
df = df.drop('Explained Part', axis=1)
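The split columns are strings, so values such as '20' and ':' stay as text. If you want real numbers (rather than keeping ':' literally, as in your expected output), a minimal sketch using pd.to_numeric with errors='coerce', which turns ':' into NaN:
for c in cols:
    df[c] = pd.to_numeric(df[c], errors='coerce')  # non-numeric symbols like ':' become NaN
print(df.dtypes)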

Related

Replace column value in one Panda Dataframe with column in another Panda Dataframe with conditions

I have the following 3 pandas DataFrames. I want to replace the company and division columns with the IDs from their respective company and division dataframes.
pd_staff:
id name company division
P001 John Sunrise Headquarter
P002 Jane Falcon Digital Research & Development
P003 Joe Ashford Finance
P004 Adam Falcon Digital Sales
P004 Barbara Sunrise Human Resource
pd_company:
id name
1 Sunrise
2 Falcon Digital
3 Ashford
pd_division:
id name
1 Headquarter
2 Research & Development
3 Finance
4 Sales
5 Human Resource
This is the end result that I am trying to produce
id name company division
P001 John 1 1
P002 Jane 2 2
P003 Joe 3 3
P004 Adam 2 4
P004 Barbara 1 5
I have tried to combine Staff and Company using this code
pd_staff.loc[pd_staff['company'].isin(pd_company['name']), 'company'] = pd_company.loc[pd_company['name'].isin(pd_staff['company']), 'id']
which produces
id name company
P001 John 1.0
P002 Jane NaN
P003 Joe NaN
P004 Adam NaN
P004 Barbara NaN
You can do:
pd_staff['company'] = pd_staff['company'].map(pd_company.set_index('name')['id'])
pd_staff['division'] = pd_staff['division'].map(pd_division.set_index('name')['id'])
print(pd_staff)
id name company division
0 P001 John 1 1
1 P002 Jane 2 2
2 P003 Joe 3 3
3 P004 Adam 2 4
4 P004 Barbara 1 5
This will achieve the desired result:
df_merge = df.merge(df2, how = 'inner', right_on = 'name', left_on = 'company', suffixes=('', '_y'))
df_merge = df_merge.merge(df3, how = 'inner', left_on = 'division', right_on = 'name', suffixes=('', '_z'))
df_merge = df_merge[['id', 'name', 'id_y', 'id_z']]
df_merge.columns = ['id', 'name', 'company', 'division']
df_merge.sort_values('id')
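For reference, here is a self-contained sketch of the same merge approach written against the question's frame names (the answer's df, df2 and df3 are assumed to be pd_staff, pd_company and pd_division):
import pandas as pd

pd_staff = pd.DataFrame({'id': ['P001', 'P002', 'P003', 'P004', 'P004'],
                         'name': ['John', 'Jane', 'Joe', 'Adam', 'Barbara'],
                         'company': ['Sunrise', 'Falcon Digital', 'Ashford', 'Falcon Digital', 'Sunrise'],
                         'division': ['Headquarter', 'Research & Development', 'Finance', 'Sales', 'Human Resource']})
pd_company = pd.DataFrame({'id': [1, 2, 3], 'name': ['Sunrise', 'Falcon Digital', 'Ashford']})
pd_division = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                            'name': ['Headquarter', 'Research & Development', 'Finance', 'Sales', 'Human Resource']})

# merge staff with the two lookup tables, keeping the lookup ids under suffixed names
out = (pd_staff
       .merge(pd_company, left_on='company', right_on='name', how='inner', suffixes=('', '_c'))
       .merge(pd_division, left_on='division', right_on='name', how='inner', suffixes=('', '_d')))
out = out[['id', 'name', 'id_c', 'id_d']]
out.columns = ['id', 'name', 'company', 'division']
print(out.sort_values('id'))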
First, let's modify the company and division dataframes a little bit:
df2.rename(columns={'name':'company'},inplace=True)
df3.rename(columns={'name':'division'},inplace=True)
Then
df1=df1.merge(df2,on='company',how='left').merge(df3,on='division',how='left')
df1=df1[['id_x','name','id_y','id']]
df1.rename(columns={'id_x':'id','id_y':'company','id':'division'},inplace=True)
You can also use apply with a function that replaces the values; the field to look up and its replacement would come from the second Excel file. For demonstration, 'Sunrise' is replaced by 1 because it appears in the second Excel file.
import pandas as pd

df = pd.read_excel('teste.xlsx')
df2 = pd.read_excel('ids.xlsx')

def altera(value, field='Sunrise', new_field='1'):
    # default values shown for demonstration; in practice pass them from the second Excel file
    return value.replace(field, new_field)

df.loc[:, 'company'] = df['company'].apply(altera)
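A hedged generalization of the same idea: instead of hard-coding 'Sunrise' and '1', build the replacement mapping from the lookup table (assuming df2 has 'name' and 'id' columns like pd_company in the question) and let Series.replace handle every company at once:
company_map = dict(zip(df2['name'], df2['id']))
df['company'] = df['company'].replace(company_map)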

Pandas: How to match / filter same key / id values (duplicates) from 2 different dataframes and replace values?

I have 2 dataframes of different sizes. The first dataframe (df1) has 4 columns, but two of those columns have the same names as the columns in the second dataframe (df2), which has only 2 columns. The columns in common are ['ID'] and ['Department'].
I want to check if any ID from df2 are in df1. If so, I want to replace df1['Department'] value with df2['Department'] value.
For instance, df1 looks something like this:
ID Department Yrs Experience Education
1234 Science 1 Bachelors
2356 Art 3 Bachelors
2456 Math 2 Masters
4657 Science 4 Masters
And df2 looks something like this:
ID Department
1098 P.E.
1234 Technology
2356 History
I want to check if the ID from df2 is in df1 and if so, update Department. The output should look something like this:
ID Department Yrs Experience Education
1234 **Technology** 1 Bachelors
2356 **History** 3 Bachelors
2456 Math 2 Masters
4657 Science 4 Masters
The expected updates to df1 are in bold
Is there an efficient way to do this?
Thank you for taking the time to read this and help.
You can map the ID column of df1 against the Series formed by setting ID as the index of df2 and taking its Department column (this acts as a mapping table).
Then, where an ID from df1 has no match in df2, we fill in the original Department value from df1:
df1['Department'] = (df1['ID'].map(df2.set_index('ID')['Department'])
.fillna(df1['Department'])
)
Result:
print(df1)
ID Department Yrs Experience Education
0 1234 Technology 1 Bachelors
1 2356 History 3 Bachelors
2 2456 Math 2 Masters
3 4657 Science 4 Masters
Try:
df1["Department"].update(
df1[["ID"]].merge(df2, on="ID", how="left")["Department"]
)
print(df1)
Prints:
ID Department Yrs Experience Education
0 1234 Technology 1 Bachelors
1 2356 History 3 Bachelors
2 2456 Math 2 Masters
3 4657 Science 4 Masters
df_1 = pd.DataFrame(data={'ID':[1234, 2356, 2456, 4657], 'Department':['Science', 'Art', 'Math', 'Science']})
df_2 = pd.DataFrame(data={'ID':[1234, 2356], 'Department':['Technology', 'History']})
df_1.loc[df_1['ID'].isin(df_2['ID']), 'Department'] = df_2['Department']
Output:
ID Department
0 1234 Technology
1 2356 History
2 2456 Math
3 4657 Science
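Note that this assignment relies on index alignment: it works here because the matching rows of df_1 and the rows of df_2 happen to share index labels 0 and 1. A sketch that matches explicitly on ID instead (same sample frames as above):
mapper = df_2.set_index('ID')['Department']
mask = df_1['ID'].isin(df_2['ID'])
df_1.loc[mask, 'Department'] = df_1.loc[mask, 'ID'].map(mapper)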

How to count Pandas df elements with dynamic condition per row (=countif)

I am trying to do some equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby, but I am struggling because my logical grouping condition is dynamic.
Say I have a list of customers, and the day on which they visited. I want to identify new customers based on 2 logical conditions
They must be the same customer (same Guest ID)
They must have been there on the previous day
If both conditions are met, they are a returning customer. If not, they are new (hence newby = 1 - ... to identify new customers).
I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.
How can I wrap the following code into something smarter than a loop?
for i in range(0, len(df)):
    newby = 1 - np.sum((df["Day"] == df.iloc[i]["Day"] - 1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))
This post does not help, as the condition there is static. I would like to avoid introducing dummy columns, such as by transposing the df, because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns.
I have the following input
df
Day Guest ID
0 3230 Tom
1 3230 Peter
2 3231 Tom
3 3232 Peter
4 3232 Peter
and expect this output
df
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
Note that elements 3 and 4 are not necessarily duplicates - given there might be additional, varying columns (such as their order).
Do:
# ensure the df is sorted by date
df = df.sort_values('Day')
# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
UPDATE
If multiple visits are allowed per day, you could do:
# only keep unique visits per day
uniques = df.drop_duplicates()
# ensure the df is sorted by date
uniques = uniques.sort_values('Day')
# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])
print(res)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1
As an alternative, without sorting or merging, you could do:
lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)
Output
Day Guest ID newby
0 3230 Tom 1
1 3230 Peter 1
2 3231 Tom 0
3 3232 Peter 1
4 3232 Peter 1

Pandas: Group people into households to generate descriptives

My problem can be simplified as having two dataframes;
Dataframe 1 contains people and the household that they live in:
Person ID | Household ID
1 1
2 2
3 2
4 3
5 1
Dataframe 2 contains individual characteristics of people:
Person ID | Age | Workstatus | Education
1 20 Working High
2 29 Working Medium
3 31 Unemployed Low
4 45 Unemployed Medium
5 30 Working Medium
The goal is to group people belonging to the same Household ID together, in order to generate descriptives about the family, e.g. 'average age of persons in household", "average education level", etc.
I tried:
df1.groupby['Household ID']
but I'm not sure where to go from there, how to do it the 'pandas' way. The 'real' dataset is very large so working with lists takes too long.
The ideal output would be:
Household ID | Avg Age of persons | Education
1 25 High/Medium
2 30 Medium/Low
3 45 Medium
We can use .map to get the household IDs and groupby with named aggregations:
df3 = (
df2.assign(houseID=df2["Person ID"].map(df1.set_index("Person ID")["Household ID"]))
.groupby("houseID")
.agg(avgAgeOfPerson=("Age", "mean"), Education=("Education", "/".join))
)
print(df3)
avgAgeOfPerson Education
houseID
1 25 High/Medium
2 30 Medium/Low
3 45 Medium
You can merge both datasets and then group by Household ID:
df1 = pd.DataFrame([[1,1],[2,2],[3,2],[4,3],[5,1]],columns = ['Person ID', 'Household ID'])
df2 = pd.DataFrame([[1,20,'Working', 'High'],[2,29,'Working','Medium'],[3,31,'Unemployed','Low'],[4,45,'Unemployed','Medium'],[5,30,'Working','Medium']],columns = ['Person ID','Age','Workstatus','Education'])
merged = pd.merge(df1,df2, on = 'Person ID', how = 'left')
merged.groupby('Household ID').agg({'Age':'mean', 'Education':list})
Result:
Age Education
Household ID
1 25 [High, Medium]
2 30 [Medium, Low]
3 45 [Medium]
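The same merged frame can feed whatever other household descriptives you need. A short sketch adding a few more named aggregations (the share of working members is just an assumed example of a "descriptive about the family"):
summary = merged.groupby('Household ID').agg(
    avg_age=('Age', 'mean'),
    n_members=('Person ID', 'size'),
    share_working=('Workstatus', lambda s: (s == 'Working').mean()),
    education=('Education', '/'.join),
)
print(summary)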

Cleaning up & filling in categorical variables for Data Science analysis

I'm taking on my very first machine learning problem, and I'm struggling with cleaning my categorical features in my dataset. My goal is to build a rock climbing recommendation system.
PROBLEM 1:
I have three related columns that contain erroneous information:
(Screenshots of the current table and the desired table are omitted; sample data and the expected result are given below.)
If you groupby the location name, there are different location_id numbers and countries associated with that one name. However, there is a clear winner/clear majority for each of these discrepancies. I have a data set of 2 million entries, and the mode of location_id & location_country GIVEN the location_name overwhelmingly points to one answer (example: "300" & "USA" for clear_creek).
Using pandas/python, how do I group my dataset by the location_name, compute the mode of location_id & location_country based on that location name, and then replace the entire id & country columns with these mode calculations based on location_name to clean up my data?
I've played around with groupby, replace, and duplicated, but I think ultimately I will need to create a function that will do this, and I honestly have no idea where to start. (I apologize in advance for my coding naivety.) I know there's got to be a solution; I just need to be pointed in the right direction.
PROBLEM 2:
Also, any one have suggestions on filling in NaN values in my location_name category (42,012/2 million) and location_country(46,890/2 million) columns? Is it best to keep as an unknown value? I feel like filling in these features based on frequency would be a horrible bias to my data set.
data = {'index': [1,2,3,4,5,6,7,8,9],
'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
'clear_creek', 'clear_creek', 'clear_creek',
'clear_creek', 'clear_creek'],
'location_id': [100,100,0,100,300,625,300,300,300],
'location_country': ['GRC', 'GRC', 'ESP', 'GRC', 'USA', 'IRE',
'USA', 'USA', 'USA']}
df = pd.DataFrame.from_dict(data)
Looking for it to return:
improved_data = {'index': [1,2,3,4,5,6,7,8,9],
'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
'clear_creek', 'clear_creek', 'clear_creek',
'clear_creek', 'clear_creek'],
'location_id': [100,100,100,100,300,300,300,300,300],
'location_country': ['GRC', 'GRC', 'GRC', 'GRC', 'USA', 'USA',
'USA', 'USA', 'USA']}
new_df = pd.DataFrame.from_dict(improved_data)
We can use .agg in combination with pd.Series.mode and cast that back to your dataframe with map:
m1 = df.groupby('location_name')['location_id'].agg(pd.Series.mode)
m2 = df.groupby('location_name')['location_country'].agg(pd.Series.mode)
df['location_id'] = df['location_name'].map(m1)
df['location_country'] = df['location_name'].map(m2)
print(df)
index location_name location_id location_country
0 1 kalaymous 100 GRC
1 2 kalaymous 100 GRC
2 3 kalaymous 100 GRC
3 4 kalaymous 100 GRC
4 5 clear_creek 300 USA
5 6 clear_creek 300 USA
6 7 clear_creek 300 USA
7 8 clear_creek 300 USA
8 9 clear_creek 300 USA
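One caveat: pd.Series.mode returns every tied value, so if a location ever has a tie the mapped cells would hold arrays rather than scalars. A hedged variant that always keeps the first mode:
m1 = df.groupby('location_name')['location_id'].agg(lambda s: s.mode().iat[0])
m2 = df.groupby('location_name')['location_country'].agg(lambda s: s.mode().iat[0])
df['location_id'] = df['location_name'].map(m1)
df['location_country'] = df['location_name'].map(m2)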
You can use transform, taking the first value of the mode with .iat[0]:
df = (df[['location_name']]
      .join(df.groupby('location_name').transform(lambda x: x.mode().iat[0]))
      .reindex(df.columns, axis=1))
print(df)
index location_name location_id location_country
0 1 kalaymous 100 GRC
1 1 kalaymous 100 GRC
2 1 kalaymous 100 GRC
3 1 kalaymous 100 GRC
4 5 clear_creek 300 USA
5 5 clear_creek 300 USA
6 5 clear_creek 300 USA
7 5 clear_creek 300 USA
8 5 clear_creek 300 USA
As Erfan mentions it would be helpful to have a view on your expected output for the first question.
For the second problem, pandas has a fillna method that you can use to fill the NaN values. For example, to fill them with 'UNKNOWN_LOCATION' you could do the following:
df.fillna('UNKNOWN_LOCATION')
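If you only want to touch the two location columns rather than every column in the frame, a minimal variant:
df[['location_name', 'location_country']] = df[['location_name', 'location_country']].fillna('UNKNOWN_LOCATION')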
See potential solution for first question:
df.groupby('location_name')[['location_id', 'location_country']].apply(lambda x: x.mode())
