Inserting a tuple into an empty pandas dataframe - python

I would like to insert a row into an empty DataFrame. However, this seems to fail for a DataFrame with predefined indices when the inserted elements include a tuple or list, raising the error:
ValueError: setting an array element with a sequence.
The example code is as follows:
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = [3,4,(5,6)]
print(df)

With a plain list, it is not clear to pandas that the elements correspond to values in different columns. You can first convert the list to a Series indexed by the DataFrame's columns:
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = pd.Series([3,4,(5,6)], index=df.columns)
print(df)

I tried something like this:
def with_return(row):
    t = [5,6]
    return t
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = [3,4,5] #dummy element
df['col3'] = df.apply(with_return, axis=1)
print(df)
Or simply use a Series:
df.loc[(1,2),:] = pd.Series([3,4,(5,6)], index=df.columns)
This still does not directly insert a tuple as an element into an empty DataFrame; it is just another way to do it. Still, loc should be able to handle it.
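If loc keeps refusing the tuple, one possible workaround (a sketch, not something proposed in the answers above) is to build the new row as a one-row DataFrame and concatenate it onto the empty frame. Wrapping the tuple in a list inside the dict makes it a single cell of col3. The names row and out are only illustrative, and recent pandas versions may emit a FutureWarning when one of the concatenated frames is empty.

```python
import pandas as pd

# Empty frame with a two-level index, as in the question.
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df = df.set_index(['idx1', 'idx2'])

# One-row frame; the tuple (5, 6) is wrapped in a list so it fills one cell.
row = pd.DataFrame(
    {'col1': [3], 'col2': [4], 'col3': [(5, 6)]},
    index=pd.MultiIndex.from_tuples([(1, 2)], names=['idx1', 'idx2']),
)

# Append the row to the (still empty) frame.
out = pd.concat([df, row])
print(out)
```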

Related

Pandas set column names by position

I have the following code:
df1 = pd.read_excel(f, sheet_name=0, header=6)
# Drop Columns by position
df1 = df1.drop([df1.columns[5],df1.columns[8],df1.columns[10],df1.columns[14],df1.columns[15],df1.columns[16],df1.columns[17],df1.columns[18],df1.columns[19],df1.columns[21],df1.columns[22],df1.columns[23],df1.columns[24],df1.columns[25]], axis=1)
# rename cols
This is where I am struggling: each time I attempt to rename the columns by position, df1 ends up as None (print(type(df1)) shows <class 'NoneType'>). Note that df1 is the expected dataframe after dropping the columns.
I get this with everything I have tried below:
column_indices = [0,1,2,3,4,5,6,7,8,9,10,11]
new_names = ['AWG Item Code','Description','UPC','PK','Size','Regular Case Cost','Unit Scan','AMAP','Case Bill Back','Monday Start Date','Sunday End Date','Net Unit']
old_names = df1.columns[column_indices]
df1 = df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
And with:
df1 = df1.rename({df1.columns[0]:"AWG Item Code",df1.columns[1]:"Description",df1.columns[2]:"UPC",df1.columns[3]:"PK",df1.columns[4]:"Size",df1.columns[5]:"Regular Case Cost",df1.columns[6]:"Unit Scan",df1.columns[7]:"AMAP",df1.columns[8]:"Case Bill Back",df1.columns[9]:"Monday Start Date",df1.columns[10]:"Sunday End Date",df1.columns[11]:"Net Unit"}, inplace = True)
When I remove inplace=True, essentially setting it to False, it returns the dataframe but without any of the changes I want.
The tricky part is that in this program my column headers will change each time, but the positions of the columns holding the data will not. Otherwise I would just use df = df.rename(columns={"a":"newname"})
A simpler version of your code could be:
df1.columns = new_names
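It should work as intended, i.e. renaming the columns in positional order, provided new_names has exactly one entry per column.
Otherwise, in your own code, if you print df1.columns[column_indices]
you do not get a list but a pandas.core.indexes.base.Index. More importantly, rename(..., inplace=True) modifies df1 in place and returns None, so its result must not be assigned back to df1.
So to correct your code you just need to replace the last two lines with: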
old_names = df1.columns[column_indices].tolist()
df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
Have a nice day
I was dumb and missing columns=
df1.rename(columns={df1.columns[0]:"AWG Item Code",df1.columns[1]:"Description",df1.columns[2]:"UPC",df1.columns[3]:"PK",df1.columns[4]:"Size",df1.columns[5]:"Regular Case Cost",df1.columns[6]:"Unit Scan",df1.columns[7]:"AMAP",df1.columns[8]:"Case Bill Back",df1.columns[9]:"Monday Start Date",df1.columns[10]:"Sunday End Date",df1.columns[11]:"Net Unit"}, inplace = True)
works fine
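The behaviour discussed above can be checked on a toy frame. This is a minimal sketch: the header names h0, h1, h2 and the shortened new_names list are made up for illustration and stand in for the Excel data.

```python
import pandas as pd

# Toy frame standing in for the Excel sheet; the header names are made up.
df1 = pd.DataFrame([[1, 'x', 2.5]], columns=['h0', 'h1', 'h2'])

# Map the current labels at positions 0 and 2 to their new names.
column_indices = [0, 2]
new_names = ['AWG Item Code', 'Size']
mapping = dict(zip(df1.columns[column_indices], new_names))

# Without inplace, rename returns a new frame that must be assigned.
renamed = df1.rename(columns=mapping)

# With inplace=True, df1 itself is changed and the return value is None.
result = df1.rename(columns=mapping, inplace=True)

print(list(renamed.columns))
print(result)
```

Assigning the result of the inplace call back to df1 is what turned df1 into None in the question.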
I am not sure whether this answers your question, but there is a simple way to rename the columns.
Say I have a data frame df. I can see the column names using the following code:
df.columns.to_list()
Suppose this gives the following column names:
['A', 'B', 'C','D']
I want to keep the first three columns and rename them 'E', 'F' and 'G' respectively. The following code gives the desired outcome:
df = df[['A','B','C']]
df.columns = ['E','F','G']
The new outcome:
df.columns.to_list()
output: ['E','F','G']

Add columns to dataframe that are not already in another dataframe

I am trying to add empty columns to a dataframe df1 for every column that exists in a second dataframe df2 but not yet in df1. So, given
df2.columns = ['a', 'b', 'c', 'd']
df1.columns = ['a', 'b']
I would like to add columns with names 'c' and 'd' to dataframe df1.
For performance reasons, I would like to avoid a loop with multiple withColumn() statements:
for col_name in df2.columns:
    if col_name not in df1.columns:
        df1 = df1.withColumn(col_name, lit(None).cast(StringType()))
My first attempt
df1 = df1.select(col('*'),
                 (lit(None).alias(col_name) for col_name in df2.columns if col_name not in df1.columns))
is throwing an error
TypeError: Invalid argument, not a string or column: <generator object
myfunction.. at 0x7f60e2bcc8e0> of type <class
'generator'>. For column literals, use 'lit', 'array', 'struct' or
'create_map' function.
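select() expects column arguments, not a generator object, so the generator has to be unpacked into separate arguments with *. Converting it to a list with list() first is optional, since * unpacks any iterable: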
df1.select(col('*'), *list(lit(None).alias(col_name) for col_name in df2.columns if col_name not in df1.columns))

Get the missing columns from one dataframe and append it to another dataframe

I have a DataFrame df1 with some columns. I need to compare its column headers with the list of headers from df2:
df1 =['a','b','c','d','f']
df2 =['a','b','c','d','e','f']
I need to compare df1 with df2 and, if any columns are missing, add them to df1 with blank values.
I tried concat and also append, and neither worked. With concat, I am not able to add the column e, and with append, it appends all the columns from df1 and df2. How can I add only the missing column to df1, in the same order?
df1_cols = df1.columns
df2_cols = df2._combine_match_columns
if (df1_cols == df2_cols).all():
    df1.to_csv(path + file_name, sep='|')
else:
    print("something is missing, continuing")
    #pd.concat([my_df,flat_data_frame], ignore_index=False, sort=False)
    all_list = my_df.append(flat_data_frame, ignore_index=False, sort=False)
I wanted to see the results as
a|b|c|d|e|f - > headers
1|2|3|4||5 -> values
pandas.DataFrame.align
df1.align(df2, axis=1)[0]
By default this does an 'outer' join
By specifying axis=1 we focus on columns
This returns a tuple of both an aligned df1 and df2 with the calling dataframe being the first element. So I grab the first element with [0]
pandas.DataFrame.reindex
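df1.reindex(columns=df1.columns.union(df2.columns))
pandas.Index objects support set-like operations, so df1.columns.union(df2.columns) is the union of those two index objects. (Older pandas versions also accepted df1.columns | df2.columns for this, but from pandas 2.0 the | operator on an Index is an element-wise logical OR, so union() is the method to use.) I then reindex using the result.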
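A minimal sketch of the reindex approach on toy frames, assuming df2 only needs to supply column labels (the single data row is made up to mirror the expected output in the question). The second variant, which reuses the column order of df2 directly, is an addition for illustration and only makes sense when df2 already lists every wanted column.

```python
import pandas as pd

# df1 lacks column 'e'; df2 only contributes its column labels here.
df1 = pd.DataFrame([[1, 2, 3, 4, 5]], columns=['a', 'b', 'c', 'd', 'f'])
df2 = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])

# Union of both sets of labels (sorted by default), then reindex df1 onto it.
all_cols = df1.columns.union(df2.columns)
out = df1.reindex(columns=all_cols)

# If df2 already lists every wanted column, its order can be reused directly.
out_df2_order = df1.reindex(columns=df2.columns)

# Missing columns are filled with NaN, which to_csv writes as empty fields.
print(out.to_csv(sep='|', index=False))
```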
Let's first create the two dataframes:
import pandas as pd, numpy as np
df1 = pd.DataFrame(np.random.random((5,5)), columns = ['a','b','c','d','f'])
df2 = pd.DataFrame(np.random.random((5,7)), columns = ['a','b','c','d','e','f','g'])
Now add to df1 (filled with NaN values) the columns of df2 that are not already in df1:
for i in list(df2):
    if i not in list(df1):
        df1[i] = np.nan
Now arrange the columns of df1 alphabetically:
df1 = df1[sorted(list(df1))]

Sort column for each index for multi-index data frame

I have a dataframe which has multi-level indexing. Here is a snippet of it:
import pandas as pd
data = {'EVENT_ID': [112335580,112335580,112335580,112335580,112335580,112335580,112335580,112335580, 112335582,
112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,
112335582,112335582,112335582],
'SELECTION_ID': [6356576,2554439,2503211,6297034,4233251,2522967,5284417,7660920,8112876,7546023,8175276,8145908,
8175274,7300754,8065540,8175275,8106158,8086265,2291406,8065533,8125015],
'BSP': [5.080818565,6.651493872,6.374683435,24.69510797,7.776082305,11.73219964,270.0383021,4,8.294425408,335.3223613,
14.06040142,2.423340019,126.7205863,70.53780982,21.3328554,225.2711962,92.25113066,193.0151362,3.775394142,
95.3786641,17.86333041],
'WIN_LOSE':[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0]}
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
df.sort_index(level=0, ascending=True, sort_remaining=True)
I want to sort the BSP column for each EVENT_ID index separately.
I have tried the following:
data.assign(BSP=data.groupby(level=0).rank(ascending=False))
This does not work as it messes up the indexing and doesn't seem to sort the column anyway.
I have also tried just sorting in terms of the column but this also clearly just messes up the indexing.
This sorts by BSP ascending for each event ID:
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df = df.sort_values(["EVENT_ID","BSP"])
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
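As a variation on the answer above, sort_values also accepts index level names in by, so a frame that already has the MultiIndex can be sorted without rebuilding it. This is a sketch on a shortened, made-up data set; the name sorted_df is illustrative.

```python
import pandas as pd

# Small made-up sample with the same layout as the question's data.
df = pd.DataFrame({
    'EVENT_ID': [1, 1, 2, 2, 2],
    'SELECTION_ID': [10, 11, 20, 21, 22],
    'BSP': [6.6, 5.0, 14.0, 2.4, 8.2],
}).set_index(['EVENT_ID', 'SELECTION_ID'])

# 'EVENT_ID' is an index level and 'BSP' a column; both are allowed in by.
sorted_df = df.sort_values(['EVENT_ID', 'BSP'])
print(sorted_df)
```

Each SELECTION_ID stays attached to its BSP value, so the index is reordered together with the column rather than being scrambled.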

Resetting column headings after concat

I concatenated df (shape (6000, 13)) with sampleDF (shape (6000, 1)) and found that the column index of the resulting pandas DataFrame runs from 0 to 12 for the columns that came from df, as expected, but then shows 0 again for the column that came from sampleDF.
df = pd.concat([df, sampleDF], axis=1)
I am trying to reset this and have tried the following but nothing seems to have any effect. Any other methods I can try or any thoughts on why this may be happening?
df = df.reset_index(drop=True)
df = df.reindex()
df.index = range(len(df.index))
df.index = pd.RangeIndex(len(df.index))
I have also tried to append .reset_index(drop=True) to my original concat.
The only thing I can think of is that my data frame is 1d in length after processing and should be a pandas series perhaps?
Edit
I found a workaround if I transpose and then transpose again. There has to be a better way than this.
df = pd.concat([df, sampleDF], axis=1)
df = df.transpose()
df.index = range(len(df.index))
df = df.transpose()
You can simply rename your columns directly:
df = pd.concat([df, sampleDF], axis=1)
df.columns = range(len(df.columns))
This will be more efficient than repeatedly transposing df.
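A small sketch of the situation on toy frames (the shapes and values are made up). It also shows an alternative not mentioned above: passing ignore_index=True to concat, which for axis=1 discards the original column labels and numbers the columns 0..n-1.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])    # columns labelled 0, 1
sample_df = pd.DataFrame([[9], [8]])   # single column labelled 0

# Plain concat keeps the original labels, so 0 appears twice.
combined = pd.concat([df, sample_df], axis=1)
labels_before = list(combined.columns)

# Relabel the columns directly, as in the answer above.
combined.columns = range(len(combined.columns))

# Alternative: let concat renumber the columns itself.
alt = pd.concat([df, sample_df], axis=1, ignore_index=True)

print(labels_before, list(combined.columns), list(alt.columns))
```

The reset_index attempts in the question had no visible effect because they act on the row index, not on the column labels.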
