Generate Pivoted Pandas Dataframes by Changing 'Values' Argument - python

I have an empty list I want to populate with pivoted dataframes with the intention of looping over the list to generate heatmaps using seaborn.
The original dataframes look something like:
x y ds_ic ele1 ele2 ele3 ele4
0 0 0.394888 18.8099 25.468 7.03E-15 0.417225
0 1 0.3990888 20.5525 23.54 0 0.331358
0 2 0.3901616 22.6762 19.5485 3.63E-11 0.448073
0 3 0.3838604 24.4072 27.781 0 0.406801
0 4 0.387536 21.6036 23.8371 0 0.263638
0 5 0.387536 23.4229 22.542 4.30E-14 0.395689
I'm using the following code to reshape the data and make it suitable for plotting:
def mapShape(dataframe_list):
    plotList = []
    for df in dataframe_list:
        df = df.pivot(index = 'y', columns = 'x', values = 'ds_ic')
        plotList.append(df)
    return plotList
shaped_dataframes = mapShape(simplified_dataframes)
Where simplified_dataframes is a list of dataframes that have the same shape as the original dataframe. This works fine for pivoting a single column of my choosing (i.e. whenever I manually set values).
The goal is to make a reshaped/pivoted dataframe for all columns after x-y. I thought of passing a column-header string to values of df.pivot(), resembling something like the following:
columns = ['ds_ic', 'ele1', 'ele2', 'ele3', 'ele4']
def mapShape(dataframe_list):
    plotList = []
    for df in dataframe_list:
        for c in columns:
            df = df.pivot(index = 'y', columns = 'x', values = c)
            plotList.append(df)
    return plotList
shaped_dataframes = mapShape(simplified_dataframes)
When I try this, df.pivot() throws a KeyError for 'y'. I tried substituting df.pivot() with df.pivot_table(), but that throws a KeyError for 'ele2'. I have a feeling there is an easier way to do this and look forward to your suggestions. Thanks in advance for the help!

The problem is that you're assigning the pivoted result back to df in this line: df = df.pivot(index = 'y', columns = 'x', values = c). In the next iteration, it then tries to pivot that pivot table, which no longer has a y column. If you assign the pivot to df2 and append that to your plot list instead, it works like a charm :)
On another note: I don't know what you're plotting, or how and why, but I feel there may be a more straightforward way. If you show your intended output, I could have a look.
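A minimal sketch of that fix, assuming a small made-up dataframe in place of the real data. The names map_shape, value_columns and sample are illustrative and not from the question. The point is that the pivoted result goes into its own variable, so df still has its x and y columns on the next pass of the inner loop.

```python
import pandas as pd


def map_shape(dataframe_list, value_columns):
    """Pivot every value column of every dataframe into a y-by-x grid."""
    plot_list = []
    for df in dataframe_list:
        for c in value_columns:
            # Use a separate name so the original df is not overwritten.
            pivoted = df.pivot(index="y", columns="x", values=c)
            plot_list.append(pivoted)
    return plot_list


# Hypothetical sample data with the same layout as the question's frames.
sample = pd.DataFrame({
    "x": [0, 0, 1, 1],
    "y": [0, 1, 0, 1],
    "ds_ic": [0.39, 0.40, 0.41, 0.42],
    "ele1": [18.8, 20.5, 22.6, 24.4],
})

shaped = map_shape([sample], ["ds_ic", "ele1"])
print(len(shaped))
```

Each entry of the returned list is one grid that could be passed to a heatmap call in turn.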

Related

Best way to move an unexpected column in a Pandas DF to a new DF?

Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1()
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But this seems problematic depending on column ordering, and also in cases where there are more or fewer columns than expected. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and on how to handle the case where there are more columns than expected.
You can use set difference with the - operator.
Assuming df1 has these columns:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(columns=unwanted_cols, inplace=True)
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
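To my knowledge, grouping along axis=1 is deprecated in recent pandas releases, so a plain boolean mask over the columns may be a safer way to get the same split. This is only a sketch; the names mask, expected and unexpected are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]],
                  columns=["type_of_fruit", "name_of_fruit", "price"])
expected_cols = ["name_of_fruit", "price"]

# True for every column that appears in expected_cols.
mask = df.columns.isin(expected_cols)

# Columns that were not expected, kept in their original order.
unexpected = df.loc[:, ~mask]

# Expected columns; reindex adds any missing ones as NaN and fixes the order.
expected = df.loc[:, mask].reindex(columns=expected_cols)

print(list(unexpected.columns))
print(list(expected.columns))
```

Unlike the set difference, the mask keeps the original column order, and the reindex step covers the case where df has fewer columns than expected.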

Trying to replace the values of a dataframe with the values of another dataframe

I have two dataframes. One has two important labels that have some associated columns for each label. The second one has the same labels and more useful data for those same labels. I'm trying to replace the values in the first with the values of the second for each appropriate label. For example:
df = {'a':['x','y','z','t'], 'b':['t','x','y','z'], 'a_1':[1,2,3,4], 'a_2':[4,2,4,1], 'b_1':[1,2,3,4], 'b_2':[4,2,4,2]}
df_2 = {'n':['x','y','z','t'], 'n_1':[1,2,3,4], 'n_2':[1,2,3,4]}
I want to replace the values for n_1 and n_2 in a_1 and a_2 for a and b that are the same as n. So far I tried using the replace and map functions, and they work when I use them like this:
df.iloc[0] = df.iloc[0].replace({'a_1':df['a_1']}, df_2['n_1'].loc(df['a'].iloc[0])
I can make the substitution for one specific line, but if I try to put that in a for loop and change the numbers I get the error Cannot index by location index with a non-integer key. If I remove the ilocs I get the original df unchanged and without any error messages. I get the same behavior when I use the map function. This is how I tried to write the for loop:
for i in df:
    df.iloc[i] = df.iloc[i].replace{'a_1':df['a_1']}, df_2['n_1'].loc(df['a'].iloc[i])
    df.iloc[i] = df.iloc[i].replace{'b_1':df['b_1']}, df_2['n_1'].loc(df['b'].iloc[i])
And so on. And for the map function:
for i in df:
    df = df.map(df['b_1']}: df_2['n_1'].loc(df['b'].iloc[i])
    df = df.map(df['a_1']}: df_2['n_1'].loc(df['a'].iloc[i])
I would like the resulting dataframe to have the same format as the first but with the values of the second, something like this:
df = {'a':['x','y','z','t'], 'b':['t','x','y','z'], 'an_1':[1,2,3,4], 'an_2':[1,2,3,4], 'bn_1':[1,2,3,4], 'bn_2':[1,2,3,4]}
where an and bn are the values for a and b when n is equal to a or b in the second dataframe.
Hope this is comprehensible.
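One possible reading of the goal is a per-label lookup: for each label in a (and in b), fetch the n_1 and n_2 values from the row of the second dataframe whose n matches. The sketch below follows that reading. It assumes both inputs are real DataFrames rather than dicts, and the names lookup and result are illustrative. Under this reading the b-based columns come out as [4, 1, 2, 3], which differs from the sample output in the question, so treat it as an interpretation rather than the intended answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'z', 't'], 'b': ['t', 'x', 'y', 'z'],
                   'a_1': [1, 2, 3, 4], 'a_2': [4, 2, 4, 1],
                   'b_1': [1, 2, 3, 4], 'b_2': [4, 2, 4, 2]})
df_2 = pd.DataFrame({'n': ['x', 'y', 'z', 't'],
                     'n_1': [1, 2, 3, 4], 'n_2': [1, 2, 3, 4]})

# Index the second frame by label so each column can act as a lookup table.
lookup = df_2.set_index('n')

result = df[['a', 'b']].copy()
for label in ['a', 'b']:
    for suffix in ['1', '2']:
        # Series.map looks up each label in the indexed lookup column.
        result[f'{label}n_{suffix}'] = df[label].map(lookup[f'n_{suffix}'])

print(result)
```

Mapping whole columns this way avoids the row-by-row loop with iloc, which is where the "non-integer key" error came from: iterating over a DataFrame yields column names, not row positions.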

Pandas: best way to replicate df and fill with new values

Suppose I have df1:
import numpy as np
import pandas as pd
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
I would like to create df2 with the same shape, index and columns. I often find myself doing something like this:
df2 = pd.DataFrame(np.ones(df1.shape), index=df1.index, columns=df1.columns)
This is less than ideal. What's the pythonic way?
How about this:
df2 = df1.copy()
df2[:] = 1 # Or any other value, for that matter
The last line is not even necessary if all you want is to preserve the shape and the row/column headers.
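Another option, sketched here as an alternative rather than as the answer's method, is to pass a scalar to the DataFrame constructor together with the index and columns of df1. This builds the filled frame in one step without copying the original data first. The name filled is illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170101", periods=20)
df1 = pd.DataFrame(np.random.randint(10, size=(20, 3)),
                   index=dates, columns=["foo", "bar", "see"])

# A scalar is broadcast to every cell of the given index/columns grid.
filled = pd.DataFrame(1, index=df1.index, columns=df1.columns)

print(filled.shape)
```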
You can also use the dataframe method "where" which will allow you to keep data based on condition and preserve the shape/index of the original df.
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
df2= df1.where(df1['foo'] % 2 == 0, 9999)
df2

Pandas concat yields ValueError: Plan shapes are not aligned

In pandas, I am attempting to concatenate a set of dataframes and I am getting this error:
ValueError: Plan shapes are not aligned
My understanding of .concat() is that it will join where columns are the same, but for those that it can't find it will fill with NA. This doesn't seem to be the case here.
Here's the concat statement:
dfs = [npo_jun_df, npo_jul_df,npo_may_df,npo_apr_df,npo_feb_df]
alpha = pd.concat(dfs)
In case it helps, I have also hit this error when I tried to concatenate two data frames (and as of the time of writing this is the only related hit I can find on google other than the source code).
I don't know whether this answer would have solved the OP's problem (since he/she didn't post enough information), but for me, this was caused when I tried to concat dataframe df1 with columns ['A', 'B', 'B', 'C'] (see the duplicate column headings?) with dataframe df2 with columns ['A', 'B']. Understandably the duplication caused pandas to throw a wobbly. Change df1 to ['A', 'B', 'C'] (i.e. drop one of the duplicate columns) and everything works fine.
I recently got this message, too, and I found like user #jason and #user3805082 above that I had duplicate columns in several of the hundreds of dataframes I was trying to concat, each with dozens of enigmatic varnames. Manually searching for duplicates was not practical.
In case anyone else has the same problem, I wrote the following function which might help out.
def duplicated_varnames(df):
    """Return a dict of all variable names that
    are duplicated in a given dataframe."""
    repeat_dict = {}
    var_list = list(df) # list of varnames as strings
    for varname in var_list:
        # make a list of all instances of that varname
        test_list = [v for v in var_list if v == varname]
        # if more than one instance, report duplications in repeat_dict
        if len(test_list) > 1:
            repeat_dict[varname] = len(test_list)
    return repeat_dict
Then you can iterate over that dict to report how many duplicates there are, delete the duplicated variables, or rename them in some systematic way.
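A shorter route to the same information, offered as a sketch rather than a replacement for the function above, uses the Index methods duplicated and value_counts. The sample frame and the names repeat_dict and deduped are illustrative.

```python
import pandas as pd

# Hypothetical frame with a repeated column name.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["A", "B", "B", "C"])

# Names that occur more than once.
dup_names = df.columns[df.columns.duplicated()].unique().tolist()

# How many times each repeated name occurs.
counts = df.columns.value_counts()
repeat_dict = counts[counts > 1].to_dict()

# Keep only the first occurrence of every column name.
deduped = df.loc[:, ~df.columns.duplicated()]

print(dup_names, repeat_dict, list(deduped.columns))
```

Running a check like this over each dataframe before calling pd.concat makes it easier to spot which inputs carry the duplicated headers.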
I wrote a small function to concatenate the values of columns that share a name.
Note that the function sorts the columns: if the original dataframe's columns are unsorted, the output's columns will be sorted.
import re

def concat_duplicate_columns(df):
    dupli = {}
    # populate dictionary with column names and count for duplicates
    for column in df.columns:
        dupli[column] = dupli[column] + 1 if column in dupli else 1
    # rename duplicated keys with °°° number suffix
    for key, val in dict(dupli).items():
        del dupli[key]
        if val > 1:
            for i in range(val):
                dupli[key+'°°°'+str(i)] = val
        else:
            dupli[key] = 1
    # rename columns so that we can now access ambiguous column names
    # sorting in dict is the same as in original table
    df.columns = list(dupli.keys())
    # for each duplicated column name
    for i in set(re.sub('°°°(.*)','',j) for j in dupli.keys() if '°°°' in j):
        i = str(i)
        # for each duplicate of a column name
        for k in range(dupli[i+'°°°0']-1):
            # concatenate values in duplicated columns
            df[i+'°°°0'] = df[i+'°°°0'].astype(str) + df[i+'°°°'+str(k+1)].astype(str)
            # drop duplicated columns from which we have acquired data
            # (axis passed by keyword; the positional form was removed in pandas 2.0)
            df = df.drop(columns=i+'°°°'+str(k+1))
    # re-sort column names for proper mapping
    # (reindex replaces the removed reindex_axis method)
    df = df.reindex(sorted(df.columns), axis=1)
    # rename columns
    df.columns = sorted(set(re.sub('°°°(.*)','',i) for i in dupli.keys()))
    return df
You need to have the same header names in all the dataframes you want to concat.
You can do that, for example, with:
headername = list(df)
Data = Data.filter(headername)
How to reproduce above error from pandas.concat(...):
ValueError: Plan shapes are not aligned
The Python (3.6.8) code:
import pandas as pd
df = pd.DataFrame({"foo": [3] })
print(df)
df2 = pd.concat([df, df], axis="columns")
print(df2)
df3 = pd.concat([df2, df], sort=False) #ValueError: Plan shapes are not aligned
which prints:
foo
0 3
foo foo
0 3 3
ValueError: Plan shapes are not aligned
Explanation of error
If the first pandas dataframe (here df2) has a duplicate named column and is sent to pd.concat and the second dataframe isn't of the same dimension as the first, then you get this error.
Solution
Make sure there are no duplicate named columns:
df_onefoo = pd.DataFrame({"foo": [3] })
print(df_onefoo)
df_onebar = pd.DataFrame({"bar": [3] })
print(df_onebar)
df2 = pd.concat([df_onefoo, df_onebar], axis="columns")
print(df2)
df3 = pd.concat([df2, df_onefoo], sort=False)
print(df3)
prints:
foo
0 3
bar
0 3
foo bar
0 3 3
foo bar
0 3 3.0
0 3 NaN
Pandas concat could have been more helpful with that error message. It's a straight up bubbleup-implementation-itis, which is textbook python.
I was receiving the ValueError: Plan shapes are not aligned when adding dataframes together. I was trying to loop over Excel sheets and, after cleaning them, concatenate them together.
The error was being raised because there were multiple columns with no name (None), which I dropped with the code below:
df = df.loc[:, df.columns.notnull()] # found on stackoverflow
The error is the result of having duplicate columns. Use the following function to remove the duplicate columns without affecting the rest of the data.
def duplicated_varnames(df):
    repeat_dict = {}
    var_list = list(df) # list of varnames as strings
    for varname in var_list:
        test_list = [v for v in var_list if v == varname]
        if len(test_list) > 1:
            repeat_dict[varname] = len(test_list)
    if len(repeat_dict)>0:
        df = df.loc[:,~df.columns.duplicated()]
    return df

"Expanding" pandas dataframe by using cell-contained list

I have a dataframe in which third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nest and create more rows with identical values of first and second column.
The end result should be something like:
pd.DataFrame([[1,2,'a'],[1,2,'b'],[1,2,'c']])
Note, this is a simplified example. In reality I have multiple rows that I would like to "expand".
As for my progress, I have no idea how to solve this. I imagine I could take each member of the nested list, keeping the other column values in mind, use a list comprehension to build more lists, and then combine those lists into a new dataframe. But this seems a bit too complex. Is there a simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print(df)
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
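For the multi-row case mentioned in the question, DataFrame.explode (available since pandas 0.25) may be a more direct fit. The sketch below assumes the list lives in a column named data and uses a second made-up row; the column names and the name expanded are illustrative.

```python
import pandas as pd

# Hypothetical two-row frame whose third column holds lists.
df = pd.DataFrame([[1, 2, ['a', 'b', 'c']],
                   [3, 4, ['d', 'e']]],
                  columns=['col1', 'col2', 'data'])

# One output row per list element; the other columns are repeated.
expanded = df.explode('data').reset_index(drop=True)

print(expanded)
```

The reset_index call is optional. Without it, every expanded row keeps the index label of the row it came from.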
Not exactly the same issue that the OP described, but related (and more pandas-like) is the situation where you have a dict of lists of unequal lengths. In that case, you can create a DataFrame in long format like this.
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack() # to format it in long form
df = df.dropna() # to drop nan values which were generated by having lists of unequal length
df.index = df.index.droplevel(level=0) # if you don't want to store the index in the list
# NOTE this last step results duplicate indexes
