Joining a pandas table with multi-index - python

I have two tables that I want to join: the main table has a single index, SourceID; the sub-table comes from a pivot table, so its columns form a MultiIndex and its key column is labelled (SourceID, sourceid).
How can I join a table with a single index to one with a MultiIndex (or change the multi-indexed table back to a singular one)?
The sub-table is created as follows:
import pandas as pd

d = {'SourceID': [1, 1, 2, 2, 3, 3, 3],
     'Year': [0, 1, 0, 1, 1, 2, 3],
     'Sales': [100, 200, 300, 400, 500, 600, 700],
     'Profit': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data=d)

df_sub = (
    df
    .pivot_table(
        index=['SourceID'],
        columns=['Year'],
        values=['Sales', 'Profit'],
        fill_value=0,
        aggfunc='mean'
    )
    # .add_prefix('sales_')
    .reset_index()
)

L = [(a, f'{a.lower()}{b}') for a, b in df_sub.columns]
df_sub.columns = pd.MultiIndex.from_tuples(L)
df_sub = df_sub.reset_index()
I'm then trying to join it with the main table df_main:
df_all = df_sub.join(df_main.set_index('SourceID'), on='SourceID.sourceid')
but this fails because of the MultiIndex. The index of the sub-table could stay single-level, as long as I don't lose the MultiIndex on the other fields.

It is possible if you pass the key as a tuple, but then the MultiIndex columns are flattened to plain tuples in the output:
df_all = df_sub.join(df.set_index('SourceID'), on=[('SourceID', 'sourceid')])
print(df_all)
If you want a MultiIndex in the output, it is necessary to convert the df columns to a MultiIndex too, e.g. with MultiIndex.from_product:
df1 = df.copy()
df1.columns = pd.MultiIndex.from_product([['orig'], df1.columns])
df_all = df_sub.join(df1.set_index([('orig', 'SourceID')]), on=[('SourceID', 'sourceid')])
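The question also asks about changing the multi-indexed table back to a singular one. That works too: flatten df_sub's columns to a single level and then do a plain join. A minimal sketch, assuming the lowercase second-level names produced by the renaming step above (df_sub_flat is my own name, not from the original):

# keep the second level of each column label where it exists, else the first
df_sub_flat = df_sub.copy()
df_sub_flat.columns = [b if b else a for a, b in df_sub_flat.columns]

# with single-level columns, an ordinary join works
df_all = df_sub_flat.join(df.set_index('SourceID'), on='sourceid')
print(df_all)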


Pandas: for matching row indices - update dataframe values with values from other dataframe with a different column size

I'm struggling to update values in a dataframe with values from another dataframe, using the row index as the key. The dataframes are not identical in terms of number of columns, so updating can only occur for matching columns. With the code below, I would expect df3 to yield the same result as df4; however, df3 returns a None object.
Can anyone point me in the right direction? It doesn't seem very complicated, but I can't seem to get it right.
PS: in reality the two dataframes are a lot larger than the ones in this example (both in terms of rows and columns).
import pandas as pd

data1 = {'A': [1, 2, 3, 4], 'B': [4, 5, 6, 7], 'C': [7, 8, 9, 10]}
df1 = pd.DataFrame(data1, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df1)

data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2, index=['I_1', 'I_4'])
print(df2)

df3 = df1.update(df2)
print(df3)

data4 = {'A': [10, 2, 3, 40], 'B': [40, 5, 6, 70], 'C': [7, 8, 9, 10]}
df4 = pd.DataFrame(data4, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df4)
pandas.DataFrame.update returns None; the method changes the calling object in place.
Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.update.html
For your example this means two things:
update returns None, hence df3 is None.
df1 is changed when df3 = df1.update(df2) is called; in your case df1 looks like df4 from that point on.
To build df3 while leaving df1 untouched, update a copy instead:
import pandas as pd

data1 = {'A': [1, 2, 3, 4], 'B': [4, 5, 6, 7], 'C': [7, 8, 9, 10]}
df1 = pd.DataFrame(data1, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df1)

data2 = {'A': [10, 40], 'B': [40, 70]}
df2 = pd.DataFrame(data2, index=['I_1', 'I_4'])
print(df2)

# take a copy (deep by default) so df1 is not affected by the in-place update;
# a shallow copy (deep=False) can share data with df1 and is not safe here
df3 = df1.copy()
df3.update(df2)
print(df3)

data4 = {'A': [10, 2, 3, 40], 'B': [40, 5, 6, 70], 'C': [7, 8, 9, 10]}
df4 = pd.DataFrame(data4, index=['I_1', 'I_2', 'I_3', 'I_4'])
print(df4)

Merge a dataframe only when the column values are identical

I have two data frames df and df_copy. I would like to copy the data from df_copy, but only if the data is also identical. How do I do that?
import pandas as pd

d = {'Nameid': [100, 200, 300, 100],
     'Name': ['Max', 'Michael', 'Susan', 'Max'],
     'Projectid': [100, 200, 200, 100]}
df = pd.DataFrame(data=d)
display(df.head(5))

df['nameid_index'] = df['Nameid'].astype('category').cat.codes
df['projectid_index'] = df['Projectid'].astype('category').cat.codes
display(df.head(5))

df_copy = df.copy()
df.drop(['Nameid', 'Name', 'Projectid'], axis=1, inplace=True)
df = df.drop([1, 3])
display(df.head(5))
What I want is each matching row to appear only once.
I looked at Pandas Merging 101 and tried:
df.merge(df_copy, on=['nameid_index', 'projectid_index'])
But in the result the same rows appear twice, and I only want them once.
Use DataFrame.drop_duplicates first:
df1 = (df.drop_duplicates(['nameid_index', 'projectid_index'])
         .merge(df_copy, on=['nameid_index', 'projectid_index']))
If you need to merge by the intersection of column names in both DataFrames, the on parameter should be removed:
df1 = df.drop_duplicates(['nameid_index', 'projectid_index']).merge(df_copy)
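Note that in the sample data above the duplicated key pair actually sits in df_copy (Max occurs twice with the same nameid_index/projectid_index values), so if the duplicates live on the lookup side, you may need to deduplicate that side instead. A sketch under that assumption:

# drop duplicate key rows from df_copy before merging
df1 = df.merge(df_copy.drop_duplicates(['nameid_index', 'projectid_index']),
               on=['nameid_index', 'projectid_index'])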

How to efficiently retrieve groupby objects as a function of a pd.Series

I have two dataframes: main_df (cols=['Technology', 'Condition1', 'Condition2']) and database_df (cols=['Technology', 'Values1', 'Values2', 'Character']).
I have grouped database_df by the Technology column:
grouped = database_df.groupby(['Technology'])
Now, what I would like to do, given the pd.Series main_df['Technology'], is: for every row, retrieve the relevant group, filter it according to conditions that depend on other column values of main_df, and return the 'Character' value (from database_df) of the first row that fulfils the conditions.
I.e. I would like to do something like:
grouped = database_df.groupby(['Technology'])
main_df['New column'] = (
    grouped.get_group(main_df['Technology']).loc[
        (grouped.get_group(main_df['Technology'])['Values1'] > main_df['Condition1'])
        & (grouped.get_group(main_df['Technology'])['Values2'] > main_df['Condition2'])
    ]['Character'][0]
)
However, I cannot pass a pd.Series as an argument to the get_group method. I realise I could pass main_df['Technology'] as a str for every entry by applying a lambda function, but I would like to perform this operation in a vectorised way... Is there any way?
MINIMAL VIABLE EXAMPLE:
main_df = pd.DataFrame({'Technology': ['A', 'A', 'B'],
                        'Condition1': [20, 10, 10],
                        'Condition2': [100, 200, 100]})
database_df = pd.DataFrame({'Technology': ['A', 'A', 'A', 'B', 'B', 'B'],
                            'Values1': [10, 20, 30, 10, 20, 30],
                            'Values2': [100, 200, 300, 100, 200, 300],
                            'Character': [1, 2, 3, 1, 2, 3]})
I would like the outcome of the above mentioned operation with these dfs to be:
main_df['New column'] = [3, 3, 2]
To compare across the two DataFrames, use an outer join with the index converted to a column, then filter by the conditions, and finally keep the first matched value per original row:
df = main_df.reset_index().merge(database_df, on='Technology', how='outer')
m = (df['Values1'] > df['Condition1']) & (df['Values2'] > df['Condition2'])
main_df['New column'] = df[m].groupby('index')['Character'].first()
print(main_df)

  Technology  Condition1  Condition2  New column
0          A          20         100           3
1          A          10         200           3
2          B          10         100           2
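Since only the rows of main_df matter for the result, a left join gives the same answer here and skips the unmatched database_df rows that an outer join would carry along. A variant sketch, not from the original answer:

# left join: one row per (main_df row, matching database_df row) pair
df = main_df.reset_index().merge(database_df, on='Technology', how='left')
m = (df['Values1'] > df['Condition1']) & (df['Values2'] > df['Condition2'])
main_df['New column'] = df[m].groupby('index')['Character'].first()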

Finding missing transactions in two Excel files using Python

I have 2 Excel CSV files, loaded as below:

df1 = {'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'],
       'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'],
       'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())

In df1 I can see that there is one extra transaction, SC-003_Homepage, which is not in df2. Can someone help me find only the transactions that are missing in df2?
So far I have done the below work to merge the transactions:
merged_df = pd.merge(df1, df2, on='Transaction_Name', suffixes=('_df1', '_df2'), how='inner')
Maybe a simple set difference will do the job:
set(df1['Transaction_Name']) - set(df2['Transaction_Name'])
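For the sample frames above, this difference should come out as follows (my own quick check, not shown in the original):

print(set(df1['Transaction_Name']) - set(df2['Transaction_Name']))
# expected: {'SC-003_Homepage'}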
Add a merge indicator column, then filter the missing data based on it; see the example below.
For more information see the merge documentation.
import pandas as pd

df1 = {'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'],
       'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name': ['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'],
       'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())

# create a merged df with an indicator column
merge_df = df1.merge(df2, on='Transaction_Name', how='outer', suffixes=['', '_'], indicator=True)

# filter rows which are missing in df2
missing_df2_rows = merge_df[merge_df['_merge'] == 'left_only'][df1.columns]
# filter rows which are missing in df1
missing_df1_rows = merge_df[merge_df['_merge'] == 'right_only'][df2.columns]

print(missing_df2_rows)
print(missing_df1_rows)
Output:

   Count   Transaction_Name
2    2.0    SC-003_Homepage

   Count    Transaction_Name
4    NaN  SC-002_Signinlink

Selecting columns from dataframe based on the name of other dataframe

I have 3 dataframes. The first is df:

df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
                   'AA_ID': [22, 22, 2, 2, 2],
                   'BB_ID': [4, 5, 6, 8, 9],
                   'CC_ID': [2, 2, 3, 3, 3],
                   'DD_RE': [4, 7, 8, 9, 0],
                   'EE_RE': [5, 8, 9, 9, 10]})
and df_ID,
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
and the other one is df_RE. Both of these data frames have the column Name, so I need to merge them into data frame df and then select columns based on the last part of each data frame's name. That is, for example, if the data frame is df_ID then I need all columns ending with "ID", plus "Name", for all matching rows of Name from data frame df; and if the data frame is df_RE then I need all columns ending with "RE", plus "Name", from df. I want to save each result separately.
I know I could do this inside a loop, as:

for dfs in dataframes:
    ID = [col for col in df.columns if '_ID' in col]
    df_ID = pd.merge(df, df_ID, on='Name')
    df_ID = df_ID[ID]

But here ID has to change again when the data frame's name ends with RE, and so on. I have a couple of files with different suffix strings, so any better solution would be great.
So at the end, for df_ID I need all the columns ending with ID:

df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16'],
                      'AA_ID': [22, 22],
                      'BB_ID': [4, 5],
                      'CC_ID': [2, 2]})
Any help would be great
Assuming your columns in df are Name plus anything with a suffix such as the examples you have listed (e.g. _ID, _RE), you can parse the column names to first extract all unique suffixes:

# since the suffixes follow a pattern of `_*`, look for the `_` character
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))
Now, with the list of suffixes, you next want to create a dictionary of your existing dataframes, where the keys in the dictionary are suffixes, and the values are the dataframes with the suffix names (e.g. df_ID, df_RE):
dfs = {}
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE
... # and so forth
Now you can loop through your suffixes list to extract the appropriate columns with each suffix in the list and do the merges and column extractions:
for suffix in suffixes:
    cols = [col for col in df.columns if suffix in col]
    dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
    dfs[suffix] = dfs[suffix][cols]
Now you have your dictionary of suffixed dataframes. If you want your dataframes as separate variables instead of keeping them in your dictionary, you can now set them back as individual objects:
df_ID = dfs['_ID']
df_RE = dfs['_RE']
... # and so forth
Putting it all together in an example:

import pandas as pd

df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
                   'AA_ID': [22, 22, 2, 2, 2],
                   'BB_ID': [4, 5, 6, 8, 9],
                   'CC_ID': [2, 2, 3, 3, 3],
                   'DD_RE': [4, 7, 8, 9, 0],
                   'EE_RE': [5, 8, 9, 9, 10]})

# Get unique suffixes
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))

dfs = {}  # dataframes dictionary

df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
df_RE = pd.DataFrame({'Name': ['AC007']})
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE

for suffix in suffixes:
    cols = [col for col in df.columns if suffix in col]
    dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
    dfs[suffix] = dfs[suffix][cols]

df_ID = dfs['_ID']
df_RE = dfs['_RE']
print(df_ID)
print(df_RE)
Result:

   AA_ID  BB_ID  CC_ID
0     22      4      2
1     22      5      2

   DD_RE  EE_RE
0      8      9
1      9      9
2      0     10
You can first merge df with df_ID and then take the columns ending with ID:

pd.merge(df, df_ID, on='Name')[[e for e in df.columns if e.endswith('ID') or e == 'Name']]
Out[121]:
   AA_ID  BB_ID  CC_ID   Name
0     22      4      2  CTA15
1     22      5      2  CTA16
Similarly, this can be done for the df_RE frame as well:
pd.merge(df, df_RE, on='Name')[[e for e in df.columns if e.endswith('RE') or e == 'Name']]
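For reference, with the sample data this second expression should produce the three AC007 rows (my own run-through, not shown in the original; column order may vary with the pandas version):

   DD_RE  EE_RE   Name
0      8      9  AC007
1      9      9  AC007
2      0     10  AC007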
