Pandas concat yields ValueError: Plan shapes are not aligned - python

In pandas, I am attempting to concatenate a set of dataframes and I am getting this error:
ValueError: Plan shapes are not aligned
My understanding of .concat() is that it will join where columns are the same, and for columns it can't match it will fill with NaN. That doesn't seem to be the case here.
Here's the concat statement:
dfs = [npo_jun_df, npo_jul_df, npo_may_df, npo_apr_df, npo_feb_df]
alpha = pd.concat(dfs)

In case it helps, I have also hit this error when I tried to concatenate two data frames (and, as of the time of writing, this is the only related hit I can find on Google other than the source code).
I don't know whether this answer would have solved the OP's problem (since they didn't post enough information), but for me, this was caused when I tried to concat dataframe df1 with columns ['A', 'B', 'B', 'C'] (see the duplicate column headings?) with dataframe df2 with columns ['A', 'B']. Understandably, the duplication caused pandas to throw a wobbly. Change df1 to ['A', 'B', 'C'] (i.e. drop one of the duplicate columns) and everything works fine.
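For reference, a minimal sketch reproducing the failure and the fix (the values here are made up; on newer pandas versions the duplicate-column concat may raise a different error):
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4]], columns=['A', 'B', 'B', 'C'])  # note the duplicate 'B'
df2 = pd.DataFrame([[5, 6]], columns=['A', 'B'])
# pd.concat([df1, df2])  # raises ValueError: Plan shapes are not aligned
df1 = df1.loc[:, ~df1.columns.duplicated()]  # keep the first of each duplicated column
result = pd.concat([df1, df2], sort=False)   # now concatenates fine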

I recently got this message, too, and I found, like users #jason and #user3805082 above, that I had duplicate columns in several of the hundreds of dataframes I was trying to concat, each with dozens of enigmatic varnames. Manually searching for duplicates was not practical.
In case anyone else has the same problem, I wrote the following function which might help out.
def duplicated_varnames(df):
    """Return a dict of all variable names that
    are duplicated in a given dataframe."""
    repeat_dict = {}
    var_list = list(df)  # list of varnames as strings
    for varname in var_list:
        # make a list of all instances of that varname
        test_list = [v for v in var_list if v == varname]
        # if more than one instance, report duplications in repeat_dict
        if len(test_list) > 1:
            repeat_dict[varname] = len(test_list)
    return repeat_dict
Then you can iterate over that dict to report how many duplicates there are, delete the duplicated variables, or rename them in some systematic way.
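For instance, a possible follow-up, assuming you want to keep the first occurrence of each duplicated name and drop the rest:
repeats = duplicated_varnames(df)
for varname, count in repeats.items():
    print(varname, 'appears', count, 'times')
if repeats:
    df = df.loc[:, ~df.columns.duplicated()]  # keep only the first occurrence of each name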

Wrote a small function that merges duplicated columns by concatenating their values.
The function cares about sorting: if the original dataframe is unsorted, the output will be a sorted one.
import re

def concat_duplicate_columns(df):
    dupli = {}
    # populate dictionary with column names and a count for duplicates
    for column in df.columns:
        dupli[column] = dupli[column] + 1 if column in dupli.keys() else 1
    # rename duplicated keys with a °°° number suffix
    for key, val in dict(dupli).items():
        del dupli[key]
        if val > 1:
            for i in range(val):
                dupli[key + '°°°' + str(i)] = val
        else:
            dupli[key] = 1
    # rename columns so that we can now access ambiguous column names;
    # insertion order in the dict is the same as in the original table
    df.columns = list(dupli.keys())
    # for each duplicated column name
    for i in set(re.sub('°°°(.*)', '', j) for j in dupli.keys() if '°°°' in j):
        i = str(i)
        # for each duplicate of a column name
        for k in range(dupli[i + '°°°0'] - 1):
            # concatenate values in duplicated columns
            df[i + '°°°0'] = df[i + '°°°0'].astype(str) + df[i + '°°°' + str(k + 1)].astype(str)
            # drop duplicated columns from which we have acquired data
            df = df.drop(i + '°°°' + str(k + 1), axis=1)
    # re-sort column names for proper mapping (reindex_axis was removed from pandas; use reindex)
    df = df.reindex(sorted(df.columns), axis=1)
    # rename columns
    df.columns = sorted(set(re.sub('°°°(.*)', '', i) for i in dupli.keys()))
    return df
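A quick illustration of what the function does, on a hypothetical frame with two adjacent 'B' columns:
import pandas as pd

df = pd.DataFrame([[1, 'x', 'y']], columns=['A', 'B', 'B'])
df = concat_duplicate_columns(df)
print(df)
#    A   B
# 0  1  xy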

You need to have the same header names for all the dataframes you want to concat.
Do it, for example, with:
headername = list(df)
Data = Data.filter(headername)
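Applied to the frames from the question, that could look like the following sketch, assuming npo_jun_df carries the reference set of headers:
headername = list(npo_jun_df)
dfs = [df.filter(headername) for df in [npo_jun_df, npo_jul_df, npo_may_df, npo_apr_df, npo_feb_df]]
alpha = pd.concat(dfs)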

How to reproduce the above error from pandas.concat(...):
ValueError: Plan shapes are not aligned
The Python (3.6.8) code:
import pandas as pd
df = pd.DataFrame({"foo": [3] })
print(df)
df2 = pd.concat([df, df], axis="columns")
print(df2)
df3 = pd.concat([df2, df], sort=False) #ValueError: Plan shapes are not aligned
which prints:
   foo
0    3
   foo  foo
0    3    3
ValueError: Plan shapes are not aligned
Explanation of error
If the first dataframe passed to pd.concat (here df2) has a duplicated column name, and the second dataframe isn't of the same dimension as the first, then you get this error.
Solution
Make sure there are no duplicate named columns:
df_onefoo = pd.DataFrame({"foo": [3] })
print(df_onefoo)
df_onebar = pd.DataFrame({"bar": [3] })
print(df_onebar)
df2 = pd.concat([df_onefoo, df_onebar], axis="columns")
print(df2)
df3 = pd.concat([df2, df_onefoo], sort=False)
print(df3)
prints:
   foo
0    3
   bar
0    3
   foo  bar
0    3    3
   foo  bar
0    3  3.0
0    3  NaN
Pandas concat could have been more helpful with that error message. It's straight-up bubble-up-implementation-itis, which is textbook Python.

I was receiving the ValueError: Plan shapes are not aligned when adding dataframes together. I was trying to loop over Excel sheets and, after cleaning, concatenating them together.
The error was being raised because there were multiple None columns, which I dropped with the code below:
df = df.loc[:, df.columns.notnull()] # found on stackoverflow
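For reference, a minimal sketch of the situation: a blank header cell comes back from Excel as a None column name, and notnull() filters it out.
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', None, 'b'])
df = df.loc[:, df.columns.notnull()]  # keeps only the named columns 'a' and 'b'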

The error is the result of having duplicate columns. Use the following function to remove the duplicate columns without impacting your data.
def duplicated_varnames(df):
    repeat_dict = {}
    var_list = list(df)  # list of varnames as strings
    for varname in var_list:
        test_list = [v for v in var_list if v == varname]
        if len(test_list) > 1:
            repeat_dict[varname] = len(test_list)
    if len(repeat_dict) > 0:
        df = df.loc[:, ~df.columns.duplicated()]
    return df
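Hypothetical usage, cleaning each frame in a list before concatenating, reusing the dfs list from the question:
dfs = [duplicated_varnames(df) for df in dfs]
alpha = pd.concat(dfs)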

Related

Pandas set column names by position

I have the following code:
df1 = pd.read_excel(f, sheet_name=0, header=6)
# Drop Columns by position
df1 = df1.drop([df1.columns[5],df1.columns[8],df1.columns[10],df1.columns[14],df1.columns[15],df1.columns[16],df1.columns[17],df1.columns[18],df1.columns[19],df1.columns[21],df1.columns[22],df1.columns[23],df1.columns[24],df1.columns[25]], axis=1)
# rename cols
This is where I am struggling: each time I attempt to rename the cols by position, df1 becomes "None", which is a <class 'NoneType'> (when I use print(type(df1))). Note that df1 is the expected dataframe after dropping the columns.
I get this with everything I have tried below:
column_indices = [0,1,2,3,4,5,6,7,8,9,10,11]
new_names = ['AWG Item Code','Description','UPC','PK','Size','Regular Case Cost','Unit Scan','AMAP','Case Bill Back','Monday Start Date','Sunday End Date','Net Unit']
old_names = df1.columns[column_indices]
df1 = df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
And with:
df1 = df1.rename({df1.columns[0]:"AWG Item Code",df1.columns[1]:"Description",df1.columns[2]:"UPC",df1.columns[3]:"PK",df1.columns[4]:"Size",df1.columns[5]:"Regular Case Cost",df1.columns[6]:"Unit Scan",df1.columns[7]:"AMAP",df1.columns[8]:"Case Bill Back",df1.columns[9]:"Monday Start Date",df1.columns[10]:"Sunday End Date",df1.columns[11]:"Net Unit"}, inplace = True)
When I remove the inplace=True, essentially setting it to False, it returns the dataframe but without any of the changes I want.
The tricky part is that in this program my column headers will change each time, but the columns the data is in will not. Otherwise I would just use df = df.rename(columns={"a": "newname"}).
One simpler version of your code could be:
df1.columns = new_names
It should work as intended, i.e. renaming columns in index order.
Otherwise, in your own code: if you print df1.columns[column_indices], you do not get a list but a pandas.core.indexes.base.Index. Note also that rename(..., inplace=True) returns None, so assigning its result back to df1 is exactly what leaves you with a NoneType. To correct your code, change the last two lines to:
old_names = df1.columns[column_indices].tolist()
df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
Have a nice day
I was dumb and missing columns=
df1.rename(columns={df1.columns[0]:"AWG Item Code",df1.columns[1]:"Description",df1.columns[2]:"UPC",df1.columns[3]:"PK",df1.columns[4]:"Size",df1.columns[5]:"Regular Case Cost",df1.columns[6]:"Unit Scan",df1.columns[7]:"AMAP",df1.columns[8]:"Case Bill Back",df1.columns[9]:"Monday Start Date",df1.columns[10]:"Sunday End Date",df1.columns[11]:"Net Unit"}, inplace = True)
works fine
I am not sure whether this answers your question:
There is a simple way to rename the columns.
If I have a dataframe, say df, I can see the column names using the following code:
df.columns.to_list()
which gives me, say, the following column names:
['A', 'B', 'C', 'D']
And I want to keep the first three columns and rename them 'E', 'F' and 'G' respectively. The following code gives me the desired outcome:
df = df[['A', 'B', 'C']]
df.columns = ['E', 'F', 'G']
new outcome:
df.columns.to_list()
output: ['E','F','G']
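The same idea as a one-liner using set_axis, if you prefer not to assign columns in place (a sketch; set_axis returns a new frame by default in recent pandas):
df = df.iloc[:, :3].set_axis(['E', 'F', 'G'], axis=1)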

Best way to move an unexpected column in a Pandas DF to a new DF?

Wondering what the best way to tackle this issue is. If I have a DF with the following columns:
df1
type_of_fruit  name_of_fruit  price
.....          .....          .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the columns could have either more values than expected, or fewer. In cases where there are fewer values than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then on how to handle the issue where there are more columns than expected.
You can use set difference using -:
Assuming df1 has these cols:
In [539]: df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
In [540]: expected_cols = ['name_of_fruit', 'price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(unwanted_cols, axis=1, inplace=True)
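One caveat: set() does not preserve column order. An order-preserving variant of the same split could look like this:
unwanted_cols = [c for c in df1.columns if c not in expected_cols]
df2 = df1[unwanted_cols]
df1 = df1.drop(columns=unwanted_cols)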
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether each column is in your list to form the grouper. You can then store the results in a dict: the True key gets the DataFrame with the columns in the list, and the False key gets the columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
#    name_of_fruit  price
# 0              2      3
d[False]
#    type_of_fruit
# 0              1
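Note that groupby(..., axis=1) is deprecated in recent pandas (2.1+), so a boolean-mask version of the same split may age better; a sketch:
mask = df.columns.isin(expected_cols)
d = {True: df.loc[:, mask], False: df.loc[:, ~mask]}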

How to iterate over a list of dataframes in pandas?

I have multiple dataframes, on which I want to run this function which mainly drops unnecessary columns from the dataframe and returns a dataframe:
def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed' or with NaN names.
    Args:
        df (dataframe): dataframe whose columns are to be dropped
    """
    # first drop NaN columns
    df = df.loc[:, df.columns.notnull()]
    # then search for columns starting with 'Unnamed'
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    return df
Now I iterate over the list of dataframes: [df1, df2, df3]
dfsublist = [df1, df2, df3]
for index in enumerate(dfsublist):
    dfsublist[index] = dropunnamednancols(dfsublist[index])
Whereas the items of dfsublist have been changed, the original dataframes df1, df2, df3 still retain the unnecessary columns. How could I achieve this?
If I understand correctly, you want to apply a function to multiple dataframes separately.
The underlying issue is that in your function you return a new dataframe, and the loop replaces the stored dataframe in the list with a new one instead of modifying the original.
If you want to modify the originals you have to use the inplace=True parameters of the pandas functions. This is possible, but not recommended, as seen here.
Your code could therefore look like this:
def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed' or with NaN names.
    Args:
        df (dataframe): dataframe whose columns are to be dropped
    """
    # `or` short-circuits, so startswith is only called when col is not None
    cols = [col for col in df.columns if (col is None) or col.startswith('Unnamed')]
    df.drop(cols, axis=1, inplace=True)
As example on sample data:
import pandas as pd
df_1 = pd.DataFrame({'a':[0,1,2,3], 'Unnamed':[9,8,7,6]})
df_2 = pd.DataFrame({'Unnamed':[9,8,7,6], 'b':[0,1,2,3]})
lst_dfs = [df_1, df_2]
[dropunnamednancols(df) for df in lst_dfs]
# df_1
# Out[55]:
# a
# 0 0
# 1 1
# 2 2
# 3 3
# df_2
# Out[56]:
# b
# 0 0
# 1 1
# 2 2
# 3 3
The reason is probably that you are using enumerate incorrectly. In your case, you just want the index, so what you should do is:
for index in range(len(dfsublist)):
    ...
enumerate() returns a tuple of an index and the actual value in your list. So in your code, the loop variable index will actually be assigned:
(0, df1) # First iteration
(1, df2) # Second iteration
(2, df3) # Third iteration
So either you use enumerate correctly and unpack the tuple:
for index, df in enumerate(dfsublist):
    ...
or you get rid of it altogether, since you access the values by index either way.
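Putting the pieces together, one way to rewrite the loop, assuming you keep the original dropunnamednancols that returns a new frame:
dfsublist = [dropunnamednancols(df) for df in dfsublist]
df1, df2, df3 = dfsublist  # rebind the names, since the originals are never modified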

Find the difference between data frames based on specific columns and output the entire record

I want to compare 2 csv files (A and B) and find the rows that are present in B but not in A, based only on specific columns.
I found a few answers, but they still don't give the result I expect.
Answer 1 :
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for a single column but not for multiple columns.
Answer 2 :
df = pd.concat([old, new])  # concat dataframes
df = df.reset_index(drop=True)  # reset the index
df_gpby = df.groupby(list(df.columns))  # group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # reindex
final = df.reindex(idx)
This takes specific columns as input and also outputs only those columns. I want to print the whole record, not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
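As an alternative to the merge template above, a sketch using merge's indicator flag (assuming old is A, new is B, and columns holds the key columns): rows tagged 'left_only' are present in B but not in A, and the full record from B is kept.
diff = new.merge(old[columns].drop_duplicates(), on=columns, how='left', indicator=True)
result = diff[diff['_merge'] == 'left_only'].drop(columns='_merge')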

Python Pandas: How to remove all columns from dataframe that contains the values in a list?

include_cols_path = sys.argv[5]
with open(include_cols_path) as f:
    include_cols = f.read().splitlines()
include_cols is a list of strings
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True).toPandas()
df1 is a dataframe of a large file. I would like to only retain the columns with names that contain any of the strings in include_cols.
final_cols = [col for col in df.columns.values if col in include_cols]
df = df[final_cols]
Doing this in pandas is certainly a dupe. However, it seems that you are converting a spark DataFrame to a pandas DataFrame.
Instead of performing the (expensive) collect operation and then filtering the columns you want, it's better to just filter on the spark side using select():
df1 = sqlContext.read.csv(input_path + '/' + lot_number +'.csv', header=True)
pandas_df = df1.select(include_cols).toPandas()
You should also think about whether or not converting to a pandas DataFrame is really what you want to do. Just about anything you can do in pandas can also be done in spark.
EDIT
I misunderstood your question originally. Based on your comments, I think this is what you're looking for:
selected_columns = [c for c in df1.columns if any([x in c for x in include_cols])]
pandas_df = df1.select(selected_columns).toPandas()
Explanation:
Iterate through the columns in df1 and keep only those for which at least one of the strings in include_cols is contained in the column name. The any() function returns True if at least one of the conditions is True.
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
For example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=np.random.random((5, 5)), columns=list('ABCDE'))
include_cols = ['A', 'C', 'Z']
df1.loc[:, df1.columns.str.contains('|'.join(include_cols))]
          A         C
0  0.247271  0.761153
1  0.390240  0.050055
2  0.333401  0.823384
3  0.821196  0.929520
4  0.210226  0.406168
The '|'.join(include_cols) part creates an or-condition with all elements of the input list, in the above example A|C|Z. The condition is True if one of the elements is contained in the column name, using the .contains() method on the column names.
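One caveat with the join approach: if any name in include_cols contains regex metacharacters, escape them first, e.g.:
import re

pattern = '|'.join(map(re.escape, include_cols))
df1.loc[:, df1.columns.str.contains(pattern)]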
