Concat data frame not working - python

I have two data frames, df1 and df2. Both have the same number of rows but different columns.
I want to concat all columns of df1 with the 2nd and 3rd columns of df2.
df1 has 119 columns and df2 has 3, of which I want the 2nd and 3rd.
The code I am using is:
data_train_test = pd.concat([df1, df2.iloc[:, [2, 3]]], axis=1, ignore_index=False)
The error I am getting is:
ValueError: Shape of passed values is (121, 39880), indices imply (121, 28898)
My analysis:
39880 - 28898 = 10982
df1 is a TF-IDF data frame made by concatenating two other data frames, with 17916 + 10982 = 28898 rows.
Here is how I made df2:
frames = [data, prediction_data]
df2 = pd.concat(frames)
I am not able to find the exact reason for this problem. Can someone please help?

I think I solved it by resetting the index while creating df2.
frames = [data, prediction_data]
df2 = pd.concat(frames).reset_index()
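For reference, a minimal self-contained sketch (with made-up toy frames) of why the leftover duplicated index breaks an axis=1 concat, and why reset_index(drop=True) may be preferable to plain reset_index(), which keeps the old index as an extra column:
import pandas as pd

df1 = pd.DataFrame({'a': [10, 20, 30]})

# simulate a frame built by concatenating two frames without resetting the index:
# the resulting index is 0, 1, 0 -- duplicated labels
df2 = pd.concat([pd.DataFrame({'b': [1, 2]}), pd.DataFrame({'b': [3]})])

# pd.concat([df1, df2], axis=1) would try to align on the duplicated labels and
# raise a ValueError (the exact message varies by pandas version)

# resetting the index restores a plain 0..n-1 RangeIndex and lines the rows up again
print(pd.concat([df1, df2.reset_index(drop=True)], axis=1))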

I am not sure I understood your question correctly, but I think what you want to do is:
data_train_test = pd.concat([df1, df2[[1, 2]]], axis=1)
.iloc[] is used for positional selection (the ith row in the index of your dataframe), so you don't really need it there.

import pandas as pd
df1 = pd.DataFrame(data={'a': [0]})
df2 = pd.DataFrame(data={'b1': [1], 'b2': [2], 'b3': [3]})
# select the 2nd and 3rd columns of df2 by label and concat side by side
data_train_test = pd.concat([df1, df2[df2.columns[1:3]]], axis=1)
# or
data_train_test = pd.concat([df1, df2.loc[:, df2.columns[1:3]]], axis=1)
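For the shapes in the original question, the same selection can also be written positionally; a small sketch, assuming the 2nd and 3rd columns sit at 0-based positions 1 and 2:
data_train_test = pd.concat([df1, df2.iloc[:, [1, 2]]], axis=1)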

Related

Compare columns of different dataframes

I have two DataFrames I would like to merge, but I would first prefer to check whether the one column that exists in both dfs has the exact same values in each row.
For general merging I tried several solutions; in the comments you can see the resulting shape:
df = pd.concat([df_b, df_c], axis=1, join='inner') # (245131, 40)
df = pd.concat([df_b, df_c], axis=1).reindex(df_b.index) # (245131, 40)
df = pd.merge(df_b, df_c, on=['client_id'], how='inner') # (420707, 39)
df = pd.concat([df_b, df_c], axis=1) # (245131, 40)
The original df_c is (245131, 14) and df_b is (245131, 26)
From that I assume that the column client_id has the exact same values, since three of the approaches give a shape of 245131 rows.
I would like to compare the client_ids in a new_df. I tried it with .loc, but it did not work out. I also tried df.rename(columns={ df.columns[20]: "client_id_1" }, inplace=True), but it renamed both columns.
I tried
df_test = df_c.client_id
df_test.append(df_b.client_id, ignore_index=True)
but I only receive one index and one client_id column, and the shape still says 245131 rows.
If I can be sure that the values are exactly the same, should I drop the client_id in one df and do the concat/merge after that, so that I get the correct shape of (245131, 39)?
Is there a mangle_dupe_cols option for merge or compare, like there is for read_csv?
Chris, if you wish to check whether two columns of two separate dataframes are exactly the same, you can try the following:
tuple(df1['col'].values) == tuple(df2['col'].values)
This should return a bool value.
If you want to merge two dataframes, ensure all the rows in your column of interest have unique values, as duplicates will cause additional rows.
Otherwise, use concat if you want to join the dataframes along an axis.
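Two other common ways to make the same check, sketched here with the df_b / df_c names from the question:
# element-wise comparison; True only if every row matches (NaN never compares equal)
(df_b['client_id'] == df_c['client_id']).all()
# built-in check that also treats NaNs in the same positions as equal
df_b['client_id'].equals(df_c['client_id'])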

Pandas Dataframes - Combine two Dataframes but leave out entries with the same column value

I'm trying to create a DataFrame out of two existing ones. I read the titles of some articles on the web; the first column is the title and the ones after are timestamps.
I want to concat both data frames but leave out the rows with the same title (column one).
I tried
df = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
but because the other columns may not be exactly the same every time, I need to leave out every record that has the same first column. How would I do this?
Btw, sorry for not knowing all the right terms for my problem.
You should first remove the duplicate rows from df2 and then concat it with df1:
df = pd.concat([df1, df2[~df2.title.isin(df1.title)]]).reset_index(drop=True)
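An equivalent one-liner, assuming the first column is literally named title as above: concat everything, then keep only the first occurrence of each title.
df = pd.concat([df1, df2]).drop_duplicates(subset='title', keep='first').reset_index(drop=True)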
This probably solves your problem:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(2*5).reshape(2, 5))
df2 = pd.DataFrame(np.arange(2*5).reshape(2, 5))
df.columns = ['blah1', 'blah2', 'blah3', 'blah4', 'blah']
df2.columns = ['blah5', 'blah6', 'blah7', 'blah8', 'blah']

# drop every column of df2 whose name already appears in df
for i in range(len(df.columns)):
    for j in range(len(df2.columns)):
        if df.columns[i] == df2.columns[j]:
            df2 = df2.drop(df2.columns[j], axis=1)
            break  # positions shift after the drop, so move on to the next df column
        else:
            continue

print(pd.concat([df, df2], axis=1))
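A more compact sketch of the same idea, dropping the shared column names from df2 before concatenating:
# keep only the df2 columns that do not already appear in df
df2_unique = df2.drop(columns=df2.columns.intersection(df.columns))
print(pd.concat([df, df2_unique], axis=1))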

Filling a dataframe with multiple dataframe values

I have some 100 dataframes that need to be filled into another big dataframe. I'll present the question with two dataframes:
import pandas as pd
df1 = pd.DataFrame([1,1,1,1,1], columns=["A"])
df2 = pd.DataFrame([2,2,2,2,2], columns=["A"])
Please note that both dataframes have the same column names.
I have a master dataframe that has repeated index values, as follows:
master_df=pd.DataFrame(index=df1.index)
master_df= pd.concat([master_df]*2)
Expected output:
master_df['A']=[1,1,1,1,1,2,2,2,2,2]
I am using a for loop to replace every n rows of master_df with df1, df2, ..., df100.
Please suggest a better way of doing it.
In fact, df1, df2, ..., df100 are the output of a function whose input is the column A value (1, 2). I was wondering if there is something like
another_df=master_df['A'].apply(lambda x: function(x))
Thanks in advance.
If you want to concatenate the dataframes, you could just use pandas concat with a list, as the code below shows.
First you can add df1 and df2 to a list:
df_list = [df1, df2]
Then you can concat the dfs:
master_df = pd.concat(df_list)
I used the default value of 0 for 'axis' in the concat function (which is what I think you are looking for), but if you want to concatenate the different dfs side by side you can just set axis=1.
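If the repeated index carried over from master_df is not wanted in the result, ignore_index=True gives a fresh 0..n-1 index and matches the expected output directly:
master_df = pd.concat(df_list, ignore_index=True)
# master_df['A'] is now [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]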

pandas concat adding as columns with nans?

I have two dataframes, each with the same number of columns:
print(df1.shape)
(54, 35238)
print(df2.shape)
(64, 35238)
And both don't have any index set
print(df1.index.name)
None
print(df2.index.name)
None
However, whenever I try to vertically concat them (so as to get a third dataframe with shape (118, 35238)), it produces a new df full of NaNs:
df3 = pandas.concat([df1, df2], ignore_index=True)
print(df3)
The resulting df has the correct number of rows, but it has decided to concat them as new columns. Setting the axis flag to 1 results in the same number of (inappropriate) columns (e.g. a shape of (63, 70476)).
Any ideas on how to fix this?
They have the same number of columns, but are the column names different? The documentation on concat suggests to me that you need identical column names to have them stack the way you want.
If this is the problem, you could probably fix it by changing one dataframe's column names to match the other before concatenating:
df2.columns = df1.columns
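A minimal sketch of the mismatch described above, with made-up column names:
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# different column names: the frames end up side by side, padded with NaN
print(pd.concat([a, b], ignore_index=True))

# aligning the column names first makes them stack vertically as intended
b.columns = a.columns
print(pd.concat([a, b], ignore_index=True))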
This might be because your df2 is a Series; you can try:
pd.concat([df1, pd.DataFrame([df2])], axis=0, ignore_index=True)

Concat pandas dataframes without following a certain sequence

I have data files which are converted to pandas dataframes; some share column names while others share a time-series index, and I wish to combine them all into one dataframe, aligning on both column and index whenever they match. Since there is no sequence in the naming, they appear in random order for concatenation. If two dataframes with different columns are concatenated along axis=1 it works well, but if the resulting dataframe is then combined with a new df whose column name matches one of the earlier merged dataframes, the concat fails. For example, with these data files:
import pandas as pd

df1 = pd.read_csv('0.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df2 = pd.read_csv('1.csv', index_col=0, parse_dates=True, infer_datetime_format=True)
df3 = pd.read_csv('2.csv', index_col=0, parse_dates=True, infer_datetime_format=True)

data1 = pd.DataFrame()
file_list = [df1, df2, df3]  # fails
# file_list = [df2, df3, df1]  # works

for fn in file_list:
    if data1.empty or fn.columns[1] in data1.columns:
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)
I get ValueError: Plan shapes are not aligned when I try to do that. In my case there is no way to first load all the DataFrames and check their column names; if I could, I would combine all dfs with the same column names first and only later concat the resulting dataframes with different column names along axis=1, which I know always works. However, a solution which requires preloading all the DataFrames and rearranging the sequence of concatenation is not possible in my case (it was only done for the working example above). I need the flexibility that, in whichever sequence the information comes, it can be concatenated with the larger dataframe data1. Please let me know if you have a suitable approach to suggest.
If you go through the loop step by step, you can see that in the first iteration it goes into the if branch, so data1 is equal to df1. In the second iteration it goes to the else branch, since data1 is not empty and 'Temperature product barrel ValueY' is not in data1.columns.
After the else branch, data1 has some duplicated column names, and in every row one of the two duplicated columns is NaN while the other holds a float. This is the reason why pd.concat() fails.
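A minimal sketch (with made-up frames a and b) of how the duplicated, NaN-padded columns appear after the else branch; a later row-wise concat against such a frame then fails, with the exact exception depending on the pandas version:
import pandas as pd

a = pd.DataFrame({'t': [1, 2], 'c': [10, 20]})
b = pd.DataFrame({'t': [3, 4], 'c': [30, 40]}, index=[2, 3])

wide = pd.concat([a, b], axis=1)   # non-overlapping indexes -> NaN padding
print(wide.columns)                # 't', 'c', 't', 'c' -- duplicated names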
You can aggregate the duplicate columns before you try to concatenate, to get rid of them:
import numpy as np

for fn in file_list:
    if data1.empty or fn.columns[1] in data1.columns:
        # new: collapse duplicated column names before the row-wise concat
        data1 = data1.groupby(data1.columns, axis=1).agg(np.nansum)
        data1 = pd.concat([data1, fn])
    else:
        data1 = pd.concat([data1, fn], axis=1)
After that, you would get
data1.shape
(30, 23)
