How to join two different DataFrames with different indexes - python

Good morning, I want to join two different DataFrames, but they have different indexes (as you can see in the picture below). In fact, the first is the result of a train_test_split and the second is an array converted into a DataFrame. The first (new_features) is a 1700x21 DataFrame and the second (y_test_pred_new) is a 1700x1 DataFrame. How can I add the second one (1700x1) to the first DataFrame without paying attention to the index? So, simply take the 1700x1 and add it as the 22nd column of new_features.
new_features = pd.concat([X_test3, features_post_test], axis=1)
y_test_pred_new = pd.DataFrame(y_test_pred, columns=['Soot_EO_pred'])
I tried doing it this way, but it doesn't work:
new_dataset = pd.concat([new_features, y_test_pred_new], axis=1)

You can use append instead of concat, but you have to reset the index of the big dataframe.
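For example, a minimal sketch of that idea (note that DataFrame.append was removed in pandas 2.0, so this uses reset_index plus concat, which achieves the same positional alignment):
import pandas as pd
# Reset both indexes so the 1700 rows align by position instead of by label
new_features = new_features.reset_index(drop=True)
y_test_pred_new = y_test_pred_new.reset_index(drop=True)
new_dataset = pd.concat([new_features, y_test_pred_new], axis=1)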

Related

MultiIndex (multilevel) column names from DataFrame rows

I have a rather messy dataframe in which I need to assign the first 3 rows as multilevel column names.
This is my dataframe, and I need indexes 3, 4 and 5 to be my MultiIndex column names.
For example, 'MINERAL TOTAL' should be level 0 until the next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I actually need is to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas will read in the top row as the sole header row. You can pass the header argument to pandas.read_excel() to indicate which rows are to be used as headers. This can be either an int or a list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned, you are unable to use pandas.read_excel(). However, if you already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to specify an array of the header rows, which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is that this includes the NaN values in the new MultiIndex header. To get around this, you could create a function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
    return pd.Series(iterable).ffill().to_list()

zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = forward_fill(df.iloc[2].to_list())

array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0, 1 and 2) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.ffill().set_index([3, 4, 5]).T

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR whose 'ID' is equal to an 'ID' in dataframe DT? Note that values in 'ID' can appear several times in the same column.
(DT is 1538 rows and MR is 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want to get a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
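As a small, made-up illustration of how merge handles repeated IDs (the frames and values below are hypothetical):
import pandas as pd
MR = pd.DataFrame({'ID': [1, 1, 2, 3], 'mr_col': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [1, 2], 'dt_col': ['x', 'y']})
# Inner merge keeps only IDs present in both frames; ID 1 appears twice in MR,
# so it yields two rows in the result
new_df = pd.merge(MR, DT, on='ID')
print(new_df)
#    ID mr_col dt_col
# 0   1      a      x
# 1   1      b      x
# 2   2      c      y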

Pandas: when merging two dataframes, values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, but when I do that, the merged data doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header. The second df is a 'key' document to match the first output with its correct id/tastant/etc, and they share the same date/subject/procedure/etc.
And this is my code that's trying to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
with output (the id, ts and tastant columns should match correctly with the first dataframe, but don't).
Check your dtypes and make sure they match between the 2 dataframes. Pandas makes assumptions about data types when it imports; it could be assuming numbers are int in one dataframe and object in another.
For the string columns, check for additional whitespace. It can appear in datasets, and since you can't see it but pandas can, it results in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
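A short sketch of both checks, reusing the frame and key-column names from the question (casting to str before stripping is an assumption; adjust it to whatever dtype the keys should share):
# Compare the dtypes of the join keys side by side
keys = ['trial', 'experiment', 'subject', 'date', 'procedure']
print(first[keys].dtypes)
print(second[keys].dtypes)
# Strip stray whitespace from the join keys in both frames
# (astype(str) is an assumption; drop it if the columns are already strings)
for col in keys:
    first[col] = first[col].astype(str).str.strip()
    second[col] = second[col].astype(str).str.strip()
combined = first.merge(second, on=keys, how='left')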

How to merge two data frames in Pandas without losing values

I have two data frames that I imported as spreadsheets into Pandas and cleaned up. They have a similar key value called 'PurchaseOrders' that I am using to match product numbers to a shipment number. When I attempt to merge them, I only end up with a df of 34 rows, but I have over 400 pairs of matching product to shipment numbers.
This is the closest I've gotten, but I have also tried using join()
ShipSheet = pd.merge(new_df, orders, how='inner')
ShipSheet.shape
Here is my orders df [image].
And here is my new_df that I want to add to my orders df using the 'PurchaseOrders' key [image].
In the end, I want them to look like this [image: end goal df].
I am not sure if I'm using the merge function improperly, but my end product should have around 300+ rows. I will note that the new_df data frame's 'PurchaseOrders' values had to be delimited from a single column and split into rows, so I guess this could have something to do with it.
Use the merge method on the dataframe and specify the key
merged_inner = pd.merge(left=df_left, right=df_right, left_on='PurchaseOrders', right_on='PurchaseOrders')
Use the concat method in pandas and specify the axis:
final_df = pd.concat([new_df, orders], axis=1)
Be careful when you specify the axis: axis=0 places the second data frame under the first one, and axis=1 places the second data frame to the right of the first one.
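A tiny illustration of the difference, with made-up frames:
import pandas as pd
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})
# axis=0 stacks b under a: 4 rows, with NaN where columns don't overlap
print(pd.concat([a, b], axis=0).shape)  # (4, 2)
# axis=1 places b to the right of a: rows aligned side by side
print(pd.concat([a, b], axis=1).shape)  # (2, 2)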

Concatenating Two Pandas DataFrames with the same length extends the length of the resulting DataFrame

I have two DataFrames that I am trying to concat together. df_output_norm is the dataframe I am trying to get. X_test_minmax has 81732 rows and 6 columns, y_test has 81732 rows and 1 column. This should be an easy concatenation, but when I concatenate it, the resulting size is (147158, 7)
df_output_norm = pd.DataFrame()
df_output_norm = pd.concat([pd.DataFrame(X_test_minmax), pd.DataFrame(y_test)], axis=1)
print(df_output_norm.shape)
print(df_output_norm['label'].shape)
print(X_test_minmax.shape)
print(y_test.shape)
The output is
(147158, 7)
(147158,)
(81732, 6)
(81732,)
The number of columns is correct; it's the number of rows that is wrong. I've looked at the data, and only the last column, 'label' (the one that comes from y_test), is extended. The first 6 columns that come from X_test_minmax have the correct row length. Why is this happening?
Pretty old question, but I landed here looking for a solution to the same problem. I figured out that it's because of a mismatch in the row indices, as the function will be trying to align on them (very likely you have chunked your dfs from a larger one by sampling or similar).
Try
X_test_minmax.reset_index(inplace=True,drop=True)
y_test.reset_index(inplace=True,drop=True)
df_output_norm = pd.concat([pd.DataFrame(X_test_minmax), pd.DataFrame(y_test)], axis=1)
If ignore_index=True is not working, this is possibly caused by duplicated column names: one of the column names in the first dataframe is the same as a column name in the second dataframe. Changing the column name might help.
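A quick sketch of that workaround, assuming the duplicated name is 'label' (the column names here are just examples):
# Rename the clashing column in one frame before concatenating
y_test_df = pd.DataFrame(y_test).rename(columns={'label': 'label_pred'})
df_output_norm = pd.concat([pd.DataFrame(X_test_minmax), y_test_df], axis=1)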
Perhaps the index is preventing the result you are seeking. Try
df_output_norm = pd.concat([pd.DataFrame(X_test_minmax), pd.DataFrame(y_test)],
                           axis=1,
                           ignore_index=True)
to ignore indexes on the concatenation axis.
