Can't combine 2 dataframes without formatting issues? - python

I have 2 data frames that were scraped from the web, and I need to combine them into one data frame to export to Excel. But I'm running into formatting issues and hope someone can help me solve this, please.
Dataframe 1=
df1= pd.DataFrame(table_contents)
df1= df1.replace(r'\n','',regex=True)
print(df1)
results:
0 1 2
0 Order Number Manager Order Date
1 Z57-808456-9 Victor Tully 01/13/2022
Dataframe2=
order_list.append(order_info)
df2 = pd.DataFrame(order_list)
df2.head()
print(df2)
results:
Order Number Location Zip Code
0 Z57-808456-9 Department 28 48911
I've tried using a few different alternatives but still not getting proper results.
combined_dfs= pd.concat([df1,df2],axis=1,join="inner")
print (combined_dfs)
results:
Order Number Location Zip Code 0 1 2
0 Z57-808456-9 Department 28 48911 Order Number Manager Order Date
I was trying to get them all together on 2 rows (header plus data) and possibly remove the duplicate Order Number that shows up in both. If not, I can still live with having it all together with a duplicate.
expected results:
Order Number Location Zip Code Manager Order Date
Z57-808456-9 Department 28 48911 Victor Tully 01/13/2022

You can promote the first row to column names with DataFrame.set_axis, drop that row with iloc[1:], and then merge with df2:
df = df1.set_axis(df1.iloc[0], axis=1).iloc[1:]
combined_dfs = df2.merge(df, on='Order Number')
print (combined_dfs)
Order Number Location Zip Code Manager Order Date
0 Z57-808456-9 Department 28 48911 Victor Tully 01/13/2022

It seems your first data frame has the column names as its first row. You can drop that row, rename the columns, and then merge the two dataframes.
# remove the header row of data and reset the index
df1 = df1.iloc[1:].reset_index(drop=True)
# set column names (the values from df1's first row)
df1.columns = ['Order Number', 'Manager', 'Order Date']
# merge dataframes on order number
combined_df = pd.merge(df1, df2, on='Order Number', how='inner')

pd.merge(df1, df2, on='Order Number')
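Putting the pieces together, a minimal self-contained sketch (built from the sample rows in the question) that fixes df1's header row and merges away the duplicate key column:

```python
import pandas as pd

# df1 as scraped: the header names are stored as data row 0
df1 = pd.DataFrame([
    ['Order Number', 'Manager', 'Order Date'],
    ['Z57-808456-9', 'Victor Tully', '01/13/2022'],
])

# df2 already has proper column names
df2 = pd.DataFrame([{'Order Number': 'Z57-808456-9',
                     'Location': 'Department 28',
                     'Zip Code': '48911'}])

# promote the first row of df1 to column names, then drop it
df1 = df1.set_axis(df1.iloc[0], axis=1).iloc[1:]

# merge on the shared key so the duplicate column collapses to one
combined = df2.merge(df1, on='Order Number')
print(combined)
```

This yields a single row with Order Number, Location, Zip Code, Manager, and Order Date, matching the expected result.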

Related

Insert colA into DF1 with vals from DF2['colB'] by matching colC in both DFs

I have two CSV files, CSV_A.csv and CSV_B.csv. I must insert the column (Category) from CSV_B into CSV_A.
The two CSVs share a common column: StockID, and I must add the correct category onto each row by matching the StockID columns.
This can be done using merge, like this:
dfa.merge(dfb, how='left', on='StockID')
but I only want to add the one column, not join the two dataframes.
CSV_A (indexed on StockID):
StockID,Brand,ToolName,Price
ABC123,Maxwell,ToolA,1.25
BCD234,Charton,ToolB,2.22
CDE345,Bingley,ToolC,3.33
DEF789,Charton,ToolD,1.44
CSV_B:
PurchDate,Supplier,StockID,Category
20201005,Sigmat,BCD234,CatShop
20210219,Zorbak,AAA111,CatWares
20210307,Phillips
20210417,Tandey,CDE345,CatThings
20210422,Stapek,BBB222,CatElse
20210502,Zorbak,ABC123,CatThis
20210512,Zorbak,CCC999,CatThings
20210717,Phillips,DEF789,CatShop
My task is to insert a Cat field into CSV_A, matching each inserted Category with its correct StockID.
Note1: CSV_A is indexed on the StockID column. CSV_B has the default indexing.
Note2: There are some rows in CSV_B (e.g. row 3) that do not have complete information.
Note3: Column "Category" from CSV_B should be added to CSV_A, but named "Cat" in CSV_A.
Use Series.map to map 'Category' based on 'StockID'
df_a['Cat'] = df_a['StockID'].map(dict(zip(df_b['StockID'], df_b['Category'])))
Note that for this specific question (i.e. with CSV_A indexed on StockID), the code must be:
df_a['Cat'] = df_a.index.map(dict(zip(df_b['StockID'], df_b['Category'])))
While creating the question I discovered the solution, so decided to post it rather than just deleting the question.
import pandas as pd

dfa = pd.read_csv('csv_a.csv')
dfa.set_index('StockID', inplace=True)
dfb = pd.read_csv('csv_b.csv')

# remove incomplete rows (i.e. without Category/StockID values)
dfb_tmp = dfb[dfb['StockID'].notnull()]

def myfunc(row):
    # NB: use row.name because row['StockID'] is the index
    if row.name in list(dfb_tmp['StockID']):
        return dfb_tmp.loc[dfb_tmp['StockID'] == row.name, 'Category'].values[0]

dfa['Cat'] = dfa.apply(myfunc, axis=1)
print(dfa)
Result:
StockID Brand ToolName Price Cat
ABC123 Maxwell ToolA 1.25 CatThis
BCD234 Charton ToolB 2.22 CatShop
CDE345 Bingley ToolC 3.33 CatThings
DEF789 Charton ToolD 1.44 CatShop
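For comparison, the `map`-based answer above can be checked end-to-end with a trimmed version of the sample data (only a couple of CSV_A's columns are reproduced here):

```python
import pandas as pd

# CSV_A, indexed on StockID (Brand column kept as a stand-in for the rest)
df_a = pd.DataFrame({'StockID': ['ABC123', 'BCD234', 'CDE345', 'DEF789'],
                     'Brand': ['Maxwell', 'Charton', 'Bingley', 'Charton']}).set_index('StockID')

# CSV_B, default index; one row is incomplete, as in the question
df_b = pd.DataFrame({'StockID': ['BCD234', 'AAA111', None, 'CDE345', 'BBB222', 'ABC123', 'CCC999', 'DEF789'],
                     'Category': ['CatShop', 'CatWares', None, 'CatThings', 'CatElse', 'CatThis', 'CatThings', 'CatShop']})

# build a StockID -> Category lookup and map it over the index
lookup = dict(zip(df_b['StockID'], df_b['Category']))
df_a['Cat'] = df_a.index.map(lookup)
print(df_a)
```

StockIDs in CSV_B that don't occur in CSV_A are simply ignored, and any CSV_A row without a match would get NaN.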

Compare data of two columns of one dataframe with two columns of another dataframe and find mismatch data

I have dataframe df1 as follows:
The second dataframe df2 is as follows:
and I want the resulting dataframe to be as follows:
Dataframes df1 & df2 contain a large number of columns and rows, but here I am showing sample data. My goal is to compare the Customer and ID columns of df1 with the Customer and Part Number columns of df2, and find the rows of df1 whose (Customer, ID) pair does not appear in df2. The mismatched rows should be stored in another dataframe df3. For example: Customer (rishab) with ID (89ab) is present in df1 but not in df2, so its Customer, Order#, and Part are stored in df3.
I am using the isin() method to find mismatches between df1 and df2, but it only works for one column at a time, not for a comparison across two columns.
df3 = df1[~df1['ID'].isin(df2['Part Number'].values)]
#here I am only able to find mismatch based upon only 1 column ID but I want to include Customer also
I could use a loop, but the data is very large (the time complexity would increase), and I am sure there is a one-liner to achieve this task. I have also tried merge, but was not able to produce the exact output.
So, how do I produce this exact output? I don't think isin() can be used across two columns.
The easiest way to achieve this is:
df3 = df1.merge(df2, left_on = ['Customer', 'ID'],right_on= ['Customer', 'Part Number'], how='left', indicator=True)
df3.reset_index(inplace = True)
df3 = df3[df3['_merge'] == 'left_only']
Here, you first do a left join on the two column pairs with indicator=True, which adds a _merge column indicating which side each row exists on, and then you keep only the left_only rows.
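A small self-contained sketch of the indicator approach; the sample values here are made up, since the question's tables are only shown as images:

```python
import pandas as pd

# hypothetical sample data standing in for the question's screenshots
df1 = pd.DataFrame({'Customer': ['rishab', 'rishab', 'neha'],
                    'ID': ['89ab', '12cd', '34ef'],
                    'Order#': [1, 2, 3]})
df2 = pd.DataFrame({'Customer': ['rishab', 'neha'],
                    'Part Number': ['12cd', '34ef']})

# left join on both key pairs; _merge marks which side each row came from
df3 = df1.merge(df2, left_on=['Customer', 'ID'],
                right_on=['Customer', 'Part Number'],
                how='left', indicator=True)
df3 = df3[df3['_merge'] == 'left_only']
print(df3)
```

Only the (rishab, 89ab) row survives the filter, because it has no matching (Customer, Part Number) pair in df2.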
You can try outer join to get non matching rows. Something like df3 = df1.merge(df2, left_on = ['Customer', 'ID'],right_on= ['Customer', 'Part Number'], how = "outer")

Divide two dataframes with multiple columns (column specific)

I have two identical sized dataframes (df1 & df2). I would like to create a new dataframe with values that are df1 column1 / df2 column1.
So essentially df3 = df1(c1)/df2(c1), df1(c2)/df2(c2), df1(c3)/df2(c3)...
I've tried the code below; however, both attempts give a dataframe filled with NaN.
#attempt 1
df3 = df2.divide(df1, axis='columns')
#attempt 2
df3= df2/df1
You can try the following code:
df3 = df2.div(df1.iloc[0], axis='columns')
To use the divide function, the indexes of the dataframes need to match. In this situation, df1 held beginning-of-month values and df2 end-of-month values, so the indexes never align. The question can be solved by:
df3 = df2.reset_index(drop=True)/df1.reset_index(drop=True)
df3.set_index(df2.index, inplace=True)  # set the index back to the original (i.e. end of month)
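A short sketch of why the indexes matter (the dates and values here are made up for illustration):

```python
import pandas as pd

# same columns, but different (e.g. start-of-month vs end-of-month) indexes
df1 = pd.DataFrame({'a': [2.0, 4.0], 'b': [10.0, 20.0]},
                   index=['2021-01-01', '2021-02-01'])
df2 = pd.DataFrame({'a': [4.0, 8.0], 'b': [20.0, 60.0]},
                   index=['2021-01-31', '2021-02-28'])

# division aligns on index labels, and no labels match -> all NaN
nan_result = df2 / df1

# dropping both indexes lines the rows up positionally
aligned = df2.reset_index(drop=True) / df1.reset_index(drop=True)
aligned.set_index(df2.index, inplace=True)
print(aligned)
```

With the indexes reset, row 0 divides by row 0 and row 1 by row 1, which is what the question intended.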

Retrieve multiple lookup values in large dataset?

I have two dataframes:
import pandas as pd
data = [['138249','Cat']
,['103669','Cat']
,['191826','Cat']
,['196655','Cat']
,['103669','Cat']
,['116780','Dog']
,['184831','Dog']
,['196655','Dog']
,['114333','Dog']
,['123757','Dog']]
df1 = pd.DataFrame(data, columns = ['Hash','Name'])
print(df1)
data2 = [
'138249',
'103669',
'191826',
'196655',
'116780',
'184831',
'114333',
'123757',]
df2 = pd.DataFrame(data2, columns = ['Hash'])
I want to write a code that will take the item in the second dataframe, scan the leftmost values in the first dataframe, then return all matching values from the first dataframe into a single cell in the second dataframe.
Here's the result I am aiming for:
Here's what I have tried:
#attempt one: use groupby to squish up the dataset. No results
past = df1.groupby('Hash')
print(past)
#attempt two: use merge. Result: empty dataframe
past1 = pd.merge(df1, df2, right_index=True, left_on='Hash')
print(past1)
#attempt three: use pivot. Result: not the right format.
past2 = df1.pivot(index = None, columns = 'Hash', values = 'Name')
print(past2)
I can do this in Excel with the VBA code here, but it crashes when I apply it to my real dataset (likely because it is too big: approximately 30,000 rows).
IIUC, first aggregate Name with ','.join per Hash, then reindex using df2.Hash:
df1.groupby('Hash')['Name'].agg(','.join).reindex(df2.Hash).reset_index()
Hash Name
0 138249 Cat
1 103669 Cat,Cat
2 191826 Cat
3 196655 Cat,Dog
4 116780 Dog
5 184831 Dog
6 114333 Dog
7 123757 Dog
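The same result can also be reached with a merge instead of reindex; an equivalent sketch using the question's data (not the answer's exact code):

```python
import pandas as pd

df1 = pd.DataFrame([['138249', 'Cat'], ['103669', 'Cat'], ['191826', 'Cat'],
                    ['196655', 'Cat'], ['103669', 'Cat'], ['116780', 'Dog'],
                    ['184831', 'Dog'], ['196655', 'Dog'], ['114333', 'Dog'],
                    ['123757', 'Dog']], columns=['Hash', 'Name'])
df2 = pd.DataFrame(['138249', '103669', '191826', '196655',
                    '116780', '184831', '114333', '123757'], columns=['Hash'])

# collapse duplicate hashes into one comma-joined cell, then attach to df2
names = df1.groupby('Hash')['Name'].agg(','.join)
out = df2.merge(names.reset_index(), on='Hash', how='left')
print(out)
```

The left merge keeps df2's row order, and hashes with multiple matches come out as a single comma-joined cell, as requested.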

Joining multiple data frames with pandas

I have the two data frames mentioned below.
The df1 dataframe has a SaleDate column as its unique key.
df1's shape is (12, 11).
The 2nd data frame is mentioned below.
The df2 dataframe also has a SaleDate column as its unique key.
df2's shape is (2, 19).
So the dimensions of the two data frames are different.
Somehow I need to join the 2 data frames on a new [month-year] column, which can be derived from SaleDate, and apply the same urea price to every row of the respective month and year.
The expected output is mentioned below.
The df3 data frame consists of the monthly urea price for each row of the data frame.
The shape of the new dataframe is (13, 11).
***The actual df1 consists of 2 million records and df2 consists of 360 records.
I tried to join the two data frames with a left join to get the above output, but was unable to achieve it.
import pandas as pd # Import Pandas for data manipulation using dataframes
df1['month_year']=pd.to_datetime(df1['SaleDate']).dt.to_period('M')
df2['month_year'] = pd.to_datetime(df2['SaleDate']).dt.to_period('M')
df1 = pd.DataFrame({'Factory': ['MF0322','MF0657','MF0300','MF0790'],
                    'SaleDate': ['2013-02-07','2013-03-07','2013-06-07','2013-05-07'],
                    'month-year': ['2013-02','2013-03','2013-06','2013-05']})
df2 = pd.DataFrame({'Price': ['398.17','425.63','398.13','363','343.33','325.13'],
                    'Month': ['2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01','2013-06-01'],
                    'month-year': ['2013-01','2013-02','2013-03','2013-04','2013-05','2013-06']})
Final data frame
s1 = pd.merge(df1, df2, how='left', on=['month_year'])
All values pertaining to the urea price were "NaN".
Hope to get expert advice in this regard.
Assuming your SaleDate columns are string dtypes, you could just do:
df1['month_year'] = df1['SaleDate'].apply(lambda x: x[:7])
df2['month_year'] = df2['SaleDate'].apply(lambda x: x[:7])
And I think the rest should work!
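Assuming both date columns are 'YYYY-MM-DD' strings (and noting that df2's date column is actually Month, not SaleDate), the slicing idea can be checked with a trimmed sample:

```python
import pandas as pd

df1 = pd.DataFrame({'Factory': ['MF0322', 'MF0657'],
                    'SaleDate': ['2013-02-07', '2013-03-07']})
df2 = pd.DataFrame({'Price': ['425.63', '398.13'],
                    'Month': ['2013-02-01', '2013-03-01']})

# 'YYYY-MM-DD'[:7] -> 'YYYY-MM', so both frames get a comparable key
df1['month_year'] = df1['SaleDate'].apply(lambda x: x[:7])
df2['month_year'] = df2['Month'].apply(lambda x: x[:7])

s1 = df1.merge(df2, how='left', on='month_year')
print(s1)
```

Each sale row picks up the price for its month, because the string keys now match exactly.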
I copied your code, without month_year column:
df1 = pd.DataFrame({'Factory': ['MF0322','MF0657','MF0300','MF0790'],
'SaleDate': ['2013-02-07','2013-03-07','2013-06-07','2013-05-07']})
df2 = pd.DataFrame({'Price': ['398.17','425.63','398.13','363','343.33','325.13'],
'Month': ['2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01',
'2013-06-01']})
Then I created month_year column in both DataFrames:
df1['month_year'] = pd.to_datetime(df1['SaleDate']).dt.to_period('M')
df2['month_year'] = pd.to_datetime(df2['Month']).dt.to_period('M')
and merged them:
s1 = pd.merge(df1, df2, how='left', on=['month_year'])
When I executed print(s1) I got:
Factory SaleDate month_year Price Month
0 MF0322 2013-02-07 2013-02 425.63 2013-02-01
1 MF0657 2013-03-07 2013-03 398.13 2013-03-01
2 MF0300 2013-06-07 2013-06 325.13 2013-06-01
3 MF0790 2013-05-07 2013-05 343.33 2013-05-01
As you can see, the Price column is correct: it equals the Price for the respective month (according to SaleDate).
So generally your code is OK.
Check for other sources of errors. E.g. in your code snippet:
you first set month_year in each DataFrame,
then you create both DataFrames again, destroying the previous content.
Copy my code (and nothing more) and confirm that it gives the same result.
Maybe the source of your problem is in some totally other place?
Note that e.g. your df2 has Month column, not SaleDate.
Maybe this is the root cause?
