Joining multiple data frames with pandas - python

I have the two data frames shown below.
The df1 dataframe has SaleDate as its unique key column; its shape is (12, 11).
The second data frame, df2, also has SaleDate as its unique key column; its shape is (2, 19).
So the dimensions of the two data frames are different.
Somehow I need to join the 2 data frames on a new [month-year] column, which can be derived from SaleDate, and apply the same urea price to the whole month of the respective year.
The expected output is shown below: the df3 data frame holds the monthly urea price for each row of the data frame.
The shape of the new dataframe is (13, 11).
***The actual df1 has 2 million records and df2 has 360 records.
I tried to join the two data frames with a left join to get the above output, but was unable to achieve it.
import pandas as pd # Import Pandas for data manipulation using dataframes
df1['month_year']=pd.to_datetime(df1['SaleDate']).dt.to_period('M')
df2['month_year'] = pd.to_datetime(df2['SaleDate']).dt.to_period('M')
df1 = pd.DataFrame({'Factory': ['MF0322','MF0657','MF0300','MF0790'],
                    'SaleDate': ['2013-02-07','2013-03-07','2013-06-07','2013-05-07'],
                    'month-year': ['2013-02','2013-03','2013-06','2013-05']})
df2 = pd.DataFrame({'Price': ['398.17','425.63','398.13','363','343.33','325.13'],
                    'Month': ['2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01','2013-06-01'],
                    'month-year': ['2013-01','2013-02','2013-03','2013-04','2013-05','2013-06']})
Final data frame
s1 = pd.merge(df1, df2, how='left', on=['month_year'])
All values pertaining to the urea price were "NaN".
Hope to get expert advice in this regard.

Assuming your SaleDate columns are string dtypes, you could just do:
df1['month_year'] = df1['SaleDate'].apply(lambda x: x[:7])
df2['month_year'] = df2['SaleDate'].apply(lambda x: x[:7])
And I think the rest should work!
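A runnable sketch of this slicing approach, assuming ISO-formatted 'YYYY-MM-DD' strings and using a subset of the question's sample values (`.str[:7]` is the vectorised equivalent of the `apply` above):

```python
import pandas as pd

# Both date columns are ISO 'YYYY-MM-DD' strings, so the first
# 7 characters are exactly the 'YYYY-MM' month key.
df1 = pd.DataFrame({'Factory': ['MF0322', 'MF0657'],
                    'SaleDate': ['2013-02-07', '2013-03-07']})
df2 = pd.DataFrame({'Price': ['425.63', '398.13'],
                    'Month': ['2013-02-01', '2013-03-01']})

df1['month_year'] = df1['SaleDate'].str[:7]   # vectorised slice
df2['month_year'] = df2['Month'].str[:7]

s1 = pd.merge(df1, df2, how='left', on='month_year')
print(s1)
```

Because both keys are plain strings, no datetime parsing is needed at all for the join itself.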

I copied your code, without the month_year column:
df1 = pd.DataFrame({'Factory': ['MF0322','MF0657','MF0300','MF0790'],
'SaleDate': ['2013-02-07','2013-03-07','2013-06-07','2013-05-07']})
df2 = pd.DataFrame({'Price': ['398.17','425.63','398.13','363','343.33','325.13'],
'Month': ['2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01',
'2013-06-01']})
Then I created month_year column in both DataFrames:
df1['month_year'] = pd.to_datetime(df1['SaleDate']).dt.to_period('M')
df2['month_year'] = pd.to_datetime(df2['Month']).dt.to_period('M')
and merged them:
s1 = pd.merge(df1, df2, how='left', on=['month_year'])
When I executed print(s1) I got:
Factory SaleDate month_year Price Month
0 MF0322 2013-02-07 2013-02 425.63 2013-02-01
1 MF0657 2013-03-07 2013-03 398.13 2013-03-01
2 MF0300 2013-06-07 2013-06 325.13 2013-06-01
3 MF0790 2013-05-07 2013-05 343.33 2013-05-01
As you can see, the Price column is correct: it equals the Price for the
respective month (according to SaleDate).
So generally your code is OK.
Check for other sources of errors. E.g. in your code snippet:
you first set month_year in each DataFrame,
then you create both DataFrames again, destroying the previous content.
Copy my code (and nothing more) and confirm that it gives the same result.
Maybe the source of your problem is in some totally other place?
Note that e.g. your df2 has Month column, not SaleDate.
Maybe this is the root cause?

Related

Can't combine 2 dataframes without formatting issues?

I have 2 data frames that have been scraped from the web, and I need to combine them into one data frame to export to Excel. But I'm running into formatting issues and need someone to help me solve this, please.
Dataframe 1=
df1= pd.DataFrame(table_contents)
df1= df1.replace(r'\n','',regex=True)
print(df1)
results:
0 1 2
0 Order Number Manager Order Date
1 Z57-808456-9 Victor Tully 01/13/2022
Dataframe2=
order_list.append(order_info)
df2 = pd.DataFrame(order_list)
df2.head()
print(df2)
results:
Order Number Location Zip Code
0 Z57-808456-9 Department 28 48911
I've tried using a few different alternatives but still not getting proper results.
combined_dfs= pd.concat([df1,df2],axis=1,join="inner")
print (combined_dfs)
results:
Order Number Location Zip Code 0 1 2
0 Z57-808456-9 Department 28 48911 Order Number Manager Order Date
I was trying to get them all together on 2 rows and possibly remove the duplicate Order Number that shows up on both. If not I can still live with it altogether and a duplicate.
expected results:
Order Number Location Zip Code Manager Order Date
Z57-808456-9 Department 28 48911 Victor Tully 01/13/2022
You can set the column names from the first row with DataFrame.set_axis, remove the first row with iloc[1:], and then merge with df2:
df = df1.set_axis(df1.iloc[0], axis=1).iloc[1:]
combined_dfs = df2.merge(df, on='Order Number')
print (combined_dfs)
Order Number Location Zip Code Manager Order Date
0 Z57-808456-9 Department 28 48911 Victor Tully 01/13/2022
It seems in your first data frame the column names are in the first row. You can drop the first row, rename the columns, then merge the two dataframes.
# remove first row of data (it holds the header values)
df1 = df1.iloc[1:].reset_index(drop=True)
# set column names to match the header row
df1.columns = ['Order Number', 'Manager', 'Order Date']
# merge dataframes on order number
combined_df = pd.merge(df1, df2, on='Order Number', how='inner')
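For reference, the whole header-promotion pipeline can be run end-to-end on the question's sample values (reconstructed here as small frames):

```python
import pandas as pd

# df1 arrives with its header values stored in row 0, as in the question.
df1 = pd.DataFrame([['Order Number', 'Manager', 'Order Date'],
                    ['Z57-808456-9', 'Victor Tully', '01/13/2022']])
df2 = pd.DataFrame({'Order Number': ['Z57-808456-9'],
                    'Location': ['Department 28'],
                    'Zip Code': ['48911']})

# Promote row 0 of df1 to its header, then drop that row.
df1.columns = df1.iloc[0]
df1 = df1.iloc[1:].reset_index(drop=True)

combined = df2.merge(df1, on='Order Number', how='inner')
print(combined)
```

Merging on 'Order Number' also removes the duplicate key column automatically, which was the other concern in the question.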

Insert colA into DF1 with vals from DF2['colB'] by matching colC in both DFs

I have two CSV files, CSV_A.csv and CSV_B.csv. I must insert the column (Category) from CSV_B into CSV_A.
The two CSVs share a common column: StockID, and I must add the correct category onto each row by matching the StockID columns.
This can be done using merge, like this:
dfa.merge(dfb, how='left', on='StockID')
but I only want to add the one column, not join the two dataframes.
CSV_A (indexed on StockID):
StockID,Brand,ToolName,Price
ABC123,Maxwell,ToolA,1.25
BCD234,Charton,ToolB,2.22
CDE345,Bingley,ToolC,3.33
DEF789,Charton,ToolD,1.44
CSV_B:
PurchDate,Supplier,StockID,Category
20201005,Sigmat,BCD234,CatShop
20210219,Zorbak,AAA111,CatWares
20210307,Phillips
20210417,Tandey,CDE345,CatThings
20210422,Stapek,BBB222,CatElse
20210502,Zorbak,ABC123,CatThis
20210512,Zorbak,CCC999,CatThings
20210717,Phillips,DEF789,CatShop
My task is to insert a Cat field into CSV_A, matching each inserted Category with its correct StockID.
Note1: CSV_A is indexed on the StockID column. CSV_B has the default indexing.
Note2: There are some rows in CSV_B (e.g. row 3) that do not have complete information.
Note3: Adding column "Category" from CSV_B into CSV_A but calling it "Cat" in CSV_A
Use Series.map to map 'Category' based on 'StockID'
df_a['Cat'] = df_a['StockID'].map(dict(zip(df_b['StockID'], df_b['Category'])))
Note that for this specific question (i.e. with CSV_A indexed on StockID), the code must map over the index instead:
df_a['Cat'] = df_a.index.map(dict(zip(df_b['StockID'], df_b['Category'])))
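A minimal runnable sketch of the index-based map, using a subset of the question's data (including an incomplete row, which the dict simply never matches):

```python
import pandas as pd

# CSV_A equivalent, indexed on StockID as in the question.
df_a = pd.DataFrame({'StockID': ['ABC123', 'BCD234'],
                     'Brand': ['Maxwell', 'Charton']}).set_index('StockID')
# CSV_B equivalent; the last row is incomplete (missing StockID/Category).
df_b = pd.DataFrame({'StockID': ['BCD234', 'ABC123', None],
                     'Category': ['CatShop', 'CatThis', None]})

# Build a StockID -> Category lookup and map it over df_a's index.
mapping = dict(zip(df_b['StockID'], df_b['Category']))
df_a['Cat'] = df_a.index.map(mapping)
print(df_a)
```

StockIDs present only in df_b are ignored, and df_a rows with no match would get NaN, which matches the left-join semantics the question needs.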
While creating the question I discovered the solution, so decided to post it rather than just deleting the question.
import pandas as pd

dfa = pd.read_csv('csv_a.csv')
dfa.set_index('StockID', inplace=True)
dfb = pd.read_csv('csv_b.csv')

# remove incomplete rows (i.e. without Category/StockID values)
dfb_tmp = dfb[dfb['StockID'].notnull()]

def myfunc(row):
    # NB: use row.name because 'StockID' is the index of dfa
    if row.name in list(dfb_tmp['StockID']):
        return dfb_tmp.loc[dfb_tmp['StockID'] == row.name, 'Category'].values[0]

dfa['Cat'] = dfa.apply(myfunc, axis=1)
print(dfa)
Result:
StockID Brand ToolName Price Cat
ABC123 Maxwell ToolA 1.25 CatThis
BCD234 Charton ToolB 2.22 CatShop
CDE345 Bingley ToolC 3.33 CatThings
DEF789 Charton ToolD 1.44 CatShop

Divide two dataframes with multiple columns (column specific)

I have two identical sized dataframes (df1 & df2). I would like to create a new dataframe with values that are df1 column1 / df2 column1.
So essentially df3 = df1(c1)/df2(c1), df1(c2)/df2(c2), df1(c3)/df2(c3)...
I've tried the code below; however, both attempts give a dataframe filled with NaN:
#attempt 1
df3 = df2.divide(df1, axis='columns')
#attempt 2
df3= df2/df1
You can try dividing by the underlying NumPy array, which bypasses index alignment:
df3 = df2.div(df1.to_numpy(), axis='columns')
To use the divide function, the indexes of the dataframes need to match. In this situation, df1 held beginning-of-month values and df2 end-of-month values. The question can be solved by:
df3 = df2.reset_index(drop=True)/df1.reset_index(drop=True)
df3.set_index(df2.index, inplace=True)  # set the index back to the original (i.e. end of month)
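A small runnable sketch of the reset_index alignment, with hypothetical begin- and end-of-month frames standing in for the question's data:

```python
import pandas as pd

# Same shape, different indexes: direct division would give all NaN.
df1 = pd.DataFrame({'A': [2.0, 4.0]},
                   index=pd.to_datetime(['2020-01-01', '2020-02-01']))
df2 = pd.DataFrame({'A': [4.0, 16.0]},
                   index=pd.to_datetime(['2020-01-31', '2020-02-29']))

# Align by position instead of by label, then restore df2's index.
df3 = df2.reset_index(drop=True) / df1.reset_index(drop=True)
df3.index = df2.index
print(df3)
```

The positional alignment only makes sense because both frames cover the same months in the same order; that assumption is what the original all-NaN result violated.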

Match 2 data frames by date and column name to get values

I have two data frames (they are already in a data frame format but for illustration, I created them as a dictionary first):
first = {
'Date':['2013-02-14','2013-03-03','2013-05-02','2014-10-31'],
'Name':['Felix','Felix','Peter','Paul']}
df1 = pd.DataFrame(first)
And
second = {
'Date':['2013-02-28','2013-03-31','2013-05-30','2014-10-31'],
'Felix':['Value1_x','Value2_x','Value3_x','Value4_x'],
'Peter':['Value1_y','Value2_y','Value3_y','Value4_y']}
df2 = pd.DataFrame(second)
Now, I'd like to add an additional column to df1 containing the values of df2 where df1.Date matches df2.Date by year and month (the day does not usually match, since df2 contains end-of-month dates) AND the column name of df2 matches the corresponding df1.Name value.
So the result should look like this:
df_new = {
'Date':['2013-02-14','2013-03-03','2013-05-02','2014-10-31'],
'Name':['Felix','Felix','Peter','Paul'],
'Values':['Value1_x','Value2_x','Value3_y','NaN']}
df_new = pd.DataFrame(df_new)
Do you have any suggestions how to solve this problem?
I considered creating additional columns for year and month (df1['year'] = df1['Date'].dt.year) and then matching with df1[(df1['year'] == df2['year']) & (df1['month'] == df2['month'])] and looking up the df2 column, but I can't figure out how to put everything together.
In general, try not to post your data sets as images, b/c it's hard to help you out then.
I think the easiest thing to do would be to create a column in each data frame where the Date is rounded down to the first day of its month (parse Date with pd.to_datetime first if it is still a string):
df1['Date_round'] = pd.to_datetime(df1['Date']) - pd.offsets.MonthBegin(1)
df2['Date_round'] = pd.to_datetime(df2['Date']) - pd.offsets.MonthBegin(1)
Then reshape df2 using melt.
df2_reshaped = df2.melt(id_vars=['Date','Date_round'], var_name='Name', value_name='Values')
And then you can join the data frames on Date_round and Name using pd.merge.
df = pd.merge(df1, df2_reshaped.drop('Date', axis=1), how='left', on=['Date_round', 'Name'])
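Putting the three steps together, the pipeline can be run end-to-end on the question's sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'Date': ['2013-02-14', '2013-03-03', '2013-05-02', '2014-10-31'],
                    'Name': ['Felix', 'Felix', 'Peter', 'Paul']})
df2 = pd.DataFrame({'Date': ['2013-02-28', '2013-03-31', '2013-05-30', '2014-10-31'],
                    'Felix': ['Value1_x', 'Value2_x', 'Value3_x', 'Value4_x'],
                    'Peter': ['Value1_y', 'Value2_y', 'Value3_y', 'Value4_y']})

# Step 1: round each Date down to the first day of its month.
for df in (df1, df2):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Date_round'] = df['Date'] - pd.offsets.MonthBegin(1)

# Step 2: reshape df2 so stock names become a 'Name' column.
df2_reshaped = df2.melt(id_vars=['Date', 'Date_round'],
                        var_name='Name', value_name='Values')

# Step 3: left-join on the rounded date and the name.
df = pd.merge(df1, df2_reshaped.drop('Date', axis=1),
              how='left', on=['Date_round', 'Name'])
print(df[['Date', 'Name', 'Values']])
```

Paul has no column in df2, so his row correctly ends up NaN, matching the expected df_new.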

Filter elements from 2 pandas dataframes

I have two dataframes which represent stock prices over time and stock related information over time (e.g. fundamental data on the company).
Both dataframes contain monthly data, however they are over different time spans. One is 5 years, the other is 10 years. Also, both do not have the same number of stocks, there is only an 80% overlap.
Below is an example of the dataframes:
days1 = pd.date_range(start='1/1/1980', end='7/1/1980', freq='M')
df1 = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'),index=days1)
days2 = pd.date_range(start='1/1/1980', end='5/1/1980', freq='M')
df2 = pd.DataFrame(np.random.randn(4, 6), columns=list('ABCDEF'),index=days2)
My goal is to reduce both dataframes to the inner joint. That is, so both cover the same time period and contain the same stocks. My index contains the dates, and the column names are the stock names.
I have tried multiple variations of merge() etc, but those recreate a merged dataframe, I want to keep both dataframes. I have also tried isin() but I am struggling with accessing the index of each dataframe. For instance:
df3=df1[df1.isin(df2)].dropna()
Does someone have any suggestions?
For the column intersection:
column_intersection = df1.columns.intersection(df2.columns)
For the row intersection:
row_intersection = df1.index.intersection(df2.index)
Then just subset each dataframe:
df1 = df1.loc[row_intersection, column_intersection]
df2 = df2.loc[row_intersection, column_intersection]
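A runnable sketch of this intersection-and-subset approach, with explicit month-end dates standing in for the question's date_range calls:

```python
import numpy as np
import pandas as pd

# Six months of 4 stocks vs four months of 6 stocks, as in the question.
days1 = pd.to_datetime(['1980-01-31', '1980-02-29', '1980-03-31',
                        '1980-04-30', '1980-05-31', '1980-06-30'])
df1 = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'), index=days1)
days2 = pd.to_datetime(['1980-01-31', '1980-02-29', '1980-03-31', '1980-04-30'])
df2 = pd.DataFrame(np.random.randn(4, 6), columns=list('ABCDEF'), index=days2)

# Intersect labels, then subset both frames, keeping them separate.
cols = df1.columns.intersection(df2.columns)
rows = df1.index.intersection(df2.index)
df1, df2 = df1.loc[rows, cols], df2.loc[rows, cols]
print(df1.shape, df2.shape)   # both frames reduced to (4, 4)
```

Unlike merge, this keeps the two dataframes as separate objects, which is exactly what the question asked for.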
