How to merge two data frames in Pandas without losing values - python

I have two data frames that I imported as spreadsheets into Pandas and cleaned up. They share a key column called 'PurchaseOrders' that I am using to match product numbers to shipment numbers. When I attempt to merge them, I only end up with a df of 34 rows, but I have over 400 pairs of matching product and shipment numbers.
This is the closest I've gotten, but I have also tried using join()
ShipSheet = pd.merge(new_df, orders, how='inner')
ShipSheet.shape
Here is my orders df: [screenshot of orders df]
And here is my new_df that I want to add to my orders df using the 'PurchaseOrders' key: [screenshot of new_df]
In the end, I want them to look like this: [screenshot of end goal df]
I am not sure whether I'm using the merge function improperly, but my end product should have around 300+ rows. I will note that the new_df data frame's 'PurchaseOrders' values had to be delimited from a single column and split into rows, so I guess this could have something to do with it.

Use the merge method on the dataframe and specify the key:
merged_inner = pd.merge(left=df_left, right=df_right, left_on='PurchaseOrders', right_on='PurchaseOrders')
Learn more in the pandas merge documentation.
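If the inner merge drops rows you expect to match, one way to diagnose it is an outer merge with indicator=True. This is a sketch, assuming the key column is named 'PurchaseOrders' in both frames:
import pandas as pd

# An outer merge keeps every row; the _merge column shows which frame each row came from.
check = pd.merge(new_df, orders, on='PurchaseOrders', how='outer', indicator=True)
print(check['_merge'].value_counts())

# Mismatched key dtypes (e.g. int vs str) are a common cause of "lost" matches.
print(new_df['PurchaseOrders'].dtype, orders['PurchaseOrders'].dtype)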

Use the concat method in pandas and specify the axis:
final_df = pd.concat([new_df, orders], axis=1)
Be careful when you specify the axis: axis=0 places the second data frame under the first one (stacking rows), while axis=1 places the second data frame to the right of the first one (appending columns).
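A tiny illustration with two hypothetical frames a and b:
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

stacked = pd.concat([a, b], axis=0)       # 4 rows: b is placed under a, with NaNs in non-shared columns
side_by_side = pd.concat([a, b], axis=1)  # 2 rows: b is placed to the right of a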

Related

Append percentage next to value counts in a Dataframe

I'm trying to create an Excel file with value counts and percentages. I'm almost finished, but when I run my for loop the percentage is added as a new df.to_frame with two more columns, and I only want one. This is how it looks in Excel: [screenshot]
I want the blue square not to appear in the Excel file or in the df, and the music percentage to sit next to the counts of the music column; I would also like to display the percentage in percentage format, i.e. 0.81 --> 81%. Below is my code.
li = []
for i in range(0, len(df.columns)):
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    value_percentage = df.iloc[:, i].value_counts(normalize=True).to_frame().reset_index()  # .style.format('{:.2%}')
    li.append(value_counts)
    li.append(value_percentage)
data = pd.concat(li, axis=1)
The .reset_index() function creates a column in your dataframe called index. So you are appending two-column dataframes each time, one of which is the index. You could add .drop(columns='index') after .reset_index() to drop the index column at each step and therefore also in your final dataframe.
However, depending on your application you may want to be careful with resetting the index, because it looks like you are appending in a way where your rows do not align (i.e. your index columns are not all the same).
To display the percentage values as strings with a percent sign you can use:
value_percentage = (value_percentage * 100).astype(str) + '%'
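Putting it together, a minimal sketch (assuming df is the original dataframe; the output filename is made up) that builds one frame per column with the counts and a formatted percentage side by side:
import pandas as pd

li = []
for col in df.columns:
    counts = df[col].value_counts()             # absolute counts, sorted descending
    pct = df[col].value_counts(normalize=True)  # fractions in the same order
    combined = pd.DataFrame({
        col: counts.index,
        'count': counts.to_numpy(),
        'percentage': [f'{p:.0%}' for p in pct],  # 0.81 -> '81%'
    })
    li.append(combined)

data = pd.concat(li, axis=1)
data.to_excel('value_counts.xlsx', index=False)  # hypothetical output filename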

How to reshape a dataframe with pandas?

I have a data frame that contains product sales for each day from 2018 through 2021. The dataframe contains four columns (Date, Place, ProductCategory and Sales). For the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once the data is added, I would like to delete rows that have no data in ProductCategory. I would like to do this in Python with pandas.
A sample of my data set looks like this: [screenshot]
I would like the dataframe to look like this: [screenshot]
Use fillna with method='ffill', which propagates the last valid observation forward to the next valid one. Then drop the rows that contain NAs.
df['Date'].fillna(method='ffill', inplace=True)
df['Place'].fillna(method='ffill', inplace=True)
df.dropna(inplace=True)
You can use the forward-fill method to replace null values with the value of the nearest one above it: df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill'). Next, drop the rows with missing values: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
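An end-to-end sketch of the fill-then-drop approach on a small made-up frame:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Date': ['2018-01-01', np.nan, np.nan, '2018-01-02'],
    'Place': ['London', np.nan, np.nan, 'Paris'],
    'ProductCategory': ['Food', 'Drinks', np.nan, 'Food'],
    'Sales': [10, 5, 0, 7],
})

df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill')  # propagate Date/Place downward
df = df.dropna(subset=['ProductCategory'])                            # drop rows without a category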
Compute the frequency of categories in the column by plotting; from the plot you can see bars representing the most repeated values:
df['column'].value_counts().plot.bar()
Then get the most frequent value using the index: index[0] gives the most repeated value, index[1] gives the 2nd most repeated, and you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
Then fill missing values with that value:
df['column'].fillna(most_frequent_attribute, inplace=True)
To fill multiple columns with the same method, just define this as a function, like this:
def impute_nan(df, column):
    most_frequent_category = df[column].mode()[0]
    df[column].fillna(most_frequent_category, inplace=True)

for feature in ['column1', 'column2']:
    impute_nan(df, feature)

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT).
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR whose 'ID' values appear in DT['ID'], knowing that values in 'ID' can occur several times in the same column?
DT has 1538 rows and MR has 2060 rows.
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and my goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
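A small self-contained illustration of both approaches, with hypothetical data:
import pandas as pd

MR = pd.DataFrame({'ID': [1, 1, 2, 3], 'mr_col': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [1, 3], 'dt_col': ['x', 'y']})

filtered = MR.loc[MR['ID'].isin(DT['ID'])]  # keeps the MR rows whose ID also appears in DT
combined = pd.merge(MR, DT, on='ID')        # MR and DT columns side by side, one row per ID match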

How do I merge data between two pandas data frames where one data frame has duplicate index values

I have two data frames loaded into Pandas. Each data frame holds property information indexed by a 'pin' unique to a particular parcel of land.
The first data frame (df1) represents historic sales data. Because properties can be sold multiple times, index values (the 'pin') repeat: for each time a property was sold there is a row with the parcel's 'pin' as the index. If the property was sold once in the data set, the index/'pin' is unique; if it was sold 5 times, the index/'pin' occurs 5 times in the data set.
The second data frame (df2) is a property record. Again they are indexed by the unique parcel pin, but because this data frame is a record of each property, the value_counts() for each index value is 1 (i.e. index values do not repeat).
I would like to add data to df1 or create a new data frame which keeps all data from df1 intact, but adds values from df2 based upon matching index values.
For example: df1 has columns ['SALE_YEAR', 'SALE_VALUE'], where there can be multiple rows with the same index value. df2 has columns ['Address', 'SQFT'], where the index values are all unique within the data frame. I want to add 'Address' & 'SQFT' data points to df1 by matching the index values.
merge() and concat() do not seem to work. I believe this is because the syntax is having a hard time matching df2 values to multiple df1 rows.
Visual example: [screenshot]
Thank you for the help.
Try this:
import pandas as pd
merged_df = pd.merge(left=df1, right=df2, on='PIN', how='left')
If that still isn't working, maybe the PIN columns' datatypes do not match:
df1['PIN'] = df1['PIN'].astype(int)
df2['PIN'] = df2['PIN'].astype(int)
merged_df = pd.merge(left=df1, right=df2, on='PIN', how='left')
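If the pin really is the index rather than a column in both frames, join() is another option: it aligns on the index and repeats df2's values for every duplicated pin in df1. A sketch with made-up data:
import pandas as pd

df1 = pd.DataFrame(
    {'SALE_YEAR': [2015, 2019, 2020], 'SALE_VALUE': [100000, 150000, 90000]},
    index=pd.Index([111, 111, 222], name='pin'),  # the same pin can repeat
)
df2 = pd.DataFrame(
    {'Address': ['1 Elm St', '9 Oak Ave'], 'SQFT': [1200, 2400]},
    index=pd.Index([111, 222], name='pin'),       # each pin appears once
)

merged_df = df1.join(df2, how='left')  # Address and SQFT repeat for each sale of pin 111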

How to join two different dataframes with different indexes

Good morning, I want to join two different DataFrames, but they have different indexes (as you can see in the picture below). In fact, the first is the result of a train_test_split and the second is an array converted into a DataFrame. The first (new_features) is a 1700x21 DataFrame and the second (y_test_pred_new) is a 1700x1 DataFrame. How can I add the second one (1700x1) to the first DataFrame without paying attention to the index? So simply take the 1700x1 and add it as the 22nd column of new_features.
new_features = pd.concat([X_test3, features_post_test], axis=1)
y_test_pred_new = pd.DataFrame(y_test_pred, columns=['Soot_EO_pred'])
I tried to do it this way but it doesn't work.
new_dataset = pd.concat([new_features, y_test_pred_new], axis=1)
You can use append instead of concat, but you have to reset the index of the big dataframe.
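Whether you use append or concat (append was removed in pandas 2.0), the key step is resetting the indexes so the rows align by position instead of by label. A minimal sketch, assuming the rows of the two frames already correspond positionally:
import pandas as pd

# Drop both indexes so concat pairs rows positionally rather than by the old labels.
new_features = new_features.reset_index(drop=True)
y_test_pred_new = y_test_pred_new.reset_index(drop=True)

new_dataset = pd.concat([new_features, y_test_pred_new], axis=1)  # adds Soot_EO_pred as the 22nd column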
