So I have a use case where I have a few tables with different types of events in a time series, plus another table with base information. The events are of different types with different columns; for example, a "marriage" event could have the columns "husband name" and "wife name", and a table of "jobs" events can have the columns "hired on" and "fired on", but can also have "husband name". The base info table is not time series data, and has things like "case ID" and "city of case".
The goal would be to: 1. have all the different time series tables in one table with all possible columns, where it's okay to have NaN wherever a column has no data, and 2. have every entry in the time series carry all available data from the base data table.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['Dave', 1, 'call'], ['Josh', 2, 'rejection'], ['Greg', 3, 'call']]), columns=['husband name', 'casenum', 'event'])
df2 = pd.DataFrame(np.array([['Dave', 'Mona', 1, 'new lamp'], ['Max', 'Lisa', 1, 'big increase'], ['Pete', 'Esther', 3, 'call'], ['Josh', 'Moana', 2, 'delivery']]), columns=['husband name', 'wife name', 'casenum', 'event'])
df3 = pd.DataFrame(np.array([[1, 'new york'], [3, 'old york'], [2, 'york']]), columns=['casenum', 'city'])
I'm trying a concat:
concat = pd.concat([df, df2, df3])
This doesn't work, because it just stacks the tables on top of each other and the base information never gets attached to the event rows, even though we already know that for case num 1 the city is 'new york'.
I'm trying a join:
innerjoin = pd.merge(df, df2, on='casenum', how='inner')
innerjoin = pd.merge(innerjoin, df3, on='casenum', how='inner')
This also isn't right, as I want to keep a record of all the events from both tables. Also, interestingly enough, the result is the same for both inner and outer joins on the dummy data; however, on my actual data an inner join results in more rows than the sum of both event tables, which I don't quite understand.
Basically, my desired outcome would be:
husband name casenum event wife name city
0 Dave 1 call NaN new york
1 Josh 2 rejection NaN york
2 Greg 3 call NaN old york
0 Dave 1 new lamp Mona new york
1 Max 1 big increase Lisa new york
2 Pete 3 call Esther old york
3 Josh 2 delivery Moana york
I've tried inner joins, outer joins, concats, none seem to work. Maybe I'm just too tired, but what do I need to do to get this output? Thank you!
I think you can merge twice with outer option:
(df.merge(df2,on=['husband name', 'casenum', 'event'], how='outer')
.merge(df3, on='casenum')
)
Output:
husband name casenum event wife name city
0 Dave 1 call NaN new york
1 Dave 1 new lamp Mona new york
2 Max 1 big increase Lisa new york
3 Josh 2 rejection NaN york
4 Josh 2 delivery Moana york
5 Greg 3 call NaN old york
6 Pete 3 call Esther old york
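For the same result you could also stay closer to the original concat idea: stack the two event tables first (missing columns just become NaN) and then merge the base table on once. A minimal sketch, assuming the df, df2 and df3 from the question:
# stack the event tables; columns missing from one table are filled with NaN
events = pd.concat([df, df2], ignore_index=True)

# attach the base information (city) to every event via the shared case number
result = events.merge(df3, on='casenum', how='left')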
I have already sorted the two dataframes:
city_future:
City Future_50
7 Atlanta 1
9 Bal Harbour 1
1 Chicago 8
6 Coalinga 1
independents_future:
City independents_100
14 Amarillo 1
10 Atlanta 2
18 Atlantic City 1
20 Austin 1
This is what I got so far:
city_future = future.loc[:,"City"].value_counts().rename_axis('City').reset_index(name='Future_50').sort_values('City')
city_independents = independents.loc[:,"City"].value_counts().rename_axis('City').reset_index(name='independents_100').sort_values('City')
hot_cities = pd.merge(city_independents,city_future)
hot_cities
I need to show all the cities in both dataframes, which have different lengths, and mark the cities that are not in the other dataframe with 0.
I have no idea why my current output only shows 20 rows... which is in the form of:
City independents_100 Future_50
0 Atlanta 2 1
1 Bal Harbour 1 1
2 Chicago 15 8
Thank you for helping!
I believe you can do this without creating the two helper dataframes using the merge method.
setting indicator=True will create a new column in the resulting dataframe that will tell you if the row appears in the left dataframe only (city_future), the right dataframe only (independents_future), or both
merged_df = city_future.merge(right=independents_future,
left_on='City',
right_on='City',
how='outer',
indicator=True
)
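If you then want to mark the cities that only appear in one of the dataframes with 0, as described in the question, one possible follow-up (a sketch, assuming the merged_df from above) is:
# a city missing from one side gets NaN in that side's count column; replace with 0
merged_df[['Future_50', 'independents_100']] = merged_df[['Future_50', 'independents_100']].fillna(0)

# the _merge column added by indicator=True can be dropped once it has served its purpose
merged_df = merged_df.drop(columns='_merge')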
Here is the pandas.DataFrame.merge reference page.
hope this helps :)
I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset where I have two columns: one with the unique values from the Name and Customer columns, and the Value column. Values in the Value column were assigned per Name (meaning that multiple rows with the same Name have the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so the Customer elements should have empty values in the Value column in the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
For the unique values in the All column I take unique().tolist() from both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers=list(dict.fromkeys(all_with_dupl))
df= pd.DataFrame(columns=['All','Value'])
df['All']= customers
I do not know how to assign the values in the new dataset after creating the list with all names and customers with no duplicates.
Any help would be great.
Split the columns, use .drop_duplicates on the data frame to remove duplicates, and then append the customers back:
(df.drop('Customer', 1)
.drop_duplicates()
.rename(columns={'Name': 'All'})
.append(
df[['Customer']].rename(columns={'Customer': 'All'})
.drop_duplicates(),
ignore_index=True
))
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop('Customer', 1).drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
names.append(customers, ignore_index=True)
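Note that DataFrame.append (and passing the axis to drop positionally) was removed in pandas 2.0; an equivalent sketch for newer pandas versions, using pd.concat for the same logic, would be:
names = df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
pd.concat([names, customers], ignore_index=True)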
Another way (this assumes the two columns were read in as a single whitespace-separated column called 'Name Customer'):
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # Create dict of Name -> Value
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates(keep='first').assign(Value='')  # Explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # Map values back
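If Name and Customer are separate columns, as in the sample data, a similar sketch without the string splitting could be:
# unique values from Name first, then Customer, preserving order of first appearance
all_vals = pd.unique(pd.concat([df['Name'], df['Customer']], ignore_index=True))
out = pd.DataFrame({'All': all_vals})

# values are defined per Name, so only names get a value; customers stay empty
out['Value'] = out['All'].map(dict(zip(df['Name'], df['Value']))).fillna('')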
I have a df with US citizens' states and I would like to use that as a lookup table for world citizens.
df1=
[Sam, New York;
Nick, California;
Sarah, Texas]
df2 =
[Sam;
Phillip;
Will;
Sam]
I would like to either use df2.replace() with the states or create a df3 where my output is:
[New York;
NaN;
NaN;
New York]
I have tried mapping with set_index and dict(zip()) but have had no luck so far.
Thank you.
How about this method:
import pandas as pd
df1 = pd.DataFrame([['Sam','New York'],['Nick','California'],['Sarah','Texas']],\
columns = ['name','state'])
display(df1)
df2 = pd.DataFrame(['Sam','Phillip','Will','Sam'],\
columns = ['name'])
display(df2)
df2.merge(right=df1,left_on='name',right_on='name',how='left')
resulting in
name state
0 Sam New York
1 Nick California
2 Sarah Texas
name
0 Sam
1 Phillip
2 Will
3 Sam
name state
0 Sam New York
1 Phillip NaN
2 Will NaN
3 Sam New York
you can then filter for just the state column in the merged dataframe
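For example (a sketch, assuming the df1 and df2 from above), keeping just the looked-up states:
# NaN appears wherever the name has no US state in df1
states = df2.merge(df1, on='name', how='left')['state']

# the set_index / mapping approach from the question works as well
states = df2['name'].map(df1.set_index('name')['state'])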
I am relatively new to Python. Suppose I have the following two dataframes, let's say df1 and df2 respectively:
df1:
Id  Name  Job
1   Jim   Tester
2   Bob   Developer
3   Sam   Support
df2:
Name  Salary  Location
Jim   100     Japan
Bob   200     US
Si    300     UK
Sue   400     France
I want to compare the 'Name' column in df2 to df1 such that if the name of the person (in df2) does not exist in df1, then that row of df2 would be output to another dataframe. So for the example above the output would be:
Name Salary Location
Si 300 UK
Sue 400 France
Si and Sue are output because they do not exist in the 'Name' column in df1.
You can use Boolean indexing:
res = df2[~df2['Name'].isin(df1['Name'].unique())]
We use hashing via pd.Series.unique as an optimization in case you have duplicate names in df1.
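A quick sketch with the sample data, just to confirm the behaviour (assuming df1 and df2 are built to match the tables above):
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3],
                    'Name': ['Jim', 'Bob', 'Sam'],
                    'Job': ['Tester', 'Developer', 'Support']})
df2 = pd.DataFrame({'Name': ['Jim', 'Bob', 'Si', 'Sue'],
                    'Salary': [100, 200, 300, 400],
                    'Location': ['Japan', 'US', 'UK', 'France']})

# keep only the rows of df2 whose Name does not appear in df1
res = df2[~df2['Name'].isin(df1['Name'].unique())]
print(res)  # Si and Sue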
I'm trying to fill in country names in my dataframe where they are null, based on the city/country pairs that do exist. For example, in the dataframe below I want to replace the NaN for the city Bangalore with the country India, since Bangalore already appears elsewhere in the dataframe with its country.
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe
df1['Country'] = df1.groupby('City')['Country'].fillna(method='ffill')
should resolve your issue by forward filling missing values within each city group (note that, like any plain forward fill, this only works if the non-null country appears before the NaN within its group).
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one nasty way to do it.
First use forward fill and then backward fill (for the case where the NaN occurs first within a group):
df = df.groupby('City')[['City','Country']].fillna(method = 'ffill').groupby('City')[['City','Country']].fillna(method = 'bfill')
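A slightly cleaner sketch of the same idea (forward fill then backward fill within each city group) could use transform:
df['Country'] = df.groupby('City')['Country'].transform(lambda s: s.ffill().bfill())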