Inserting a tuple into an empty pandas dataframe

I would like to insert a row into an empty DataFrame. However, this fails for a DataFrame with a predefined index when the row's elements include a tuple or list, raising the error:
ValueError: setting an array element with a sequence.
The example code is as follows:
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = [3,4,(5,6)]
print(df)

It is not clear to pandas that the elements of the list correspond to values in different columns. You can first convert the list to a Series indexed by the DataFrame's columns:
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = pd.Series([3,4,(5,6)], index=df.columns)
print(df)

I tried something like this.
def with_return(row):
    t = [5,6]
    return t
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = [3,4,5] #dummy element
df['col3'] = df.apply(with_return, axis=1)
print(df)
Or simply use a Series:
df.loc[(1,2),:] = pd.Series([3,4,(5,6)], index=df.columns)
This still does not insert a tuple directly as an element of an empty DataFrame; it is just another way to do it. Still, loc should be able to handle it.
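A different route, not taken from the answers above, is to avoid enlarging the empty frame altogether: collect the rows as plain dicts and build the DataFrame in one step. This is only a minimal sketch, and the `rows` list is a hypothetical stand-in for however the data is actually gathered.

```python
import pandas as pd

# Hypothetical rows collected before the frame exists; col3 holds a tuple.
rows = [
    {"idx1": 1, "idx2": 2, "col1": 3, "col2": 4, "col3": (5, 6)},
]

# Build the frame once, then move the two key columns into the index.
df = pd.DataFrame(rows).set_index(["idx1", "idx2"])

# The tuple is stored as a single object in one cell.
cell = df["col3"].iloc[0]
print(df)
print(type(cell), cell)
```

Building the frame from complete rows sidesteps the question of how loc should interpret a sequence on the right-hand side.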

Related

Pandas set column names by position

I have the following code:
df1 = pd.read_excel(f, sheet_name=0, header=6)
# Drop Columns by position
df1 = df1.drop([df1.columns[5],df1.columns[8],df1.columns[10],df1.columns[14],df1.columns[15],df1.columns[16],df1.columns[17],df1.columns[18],df1.columns[19],df1.columns[21],df1.columns[22],df1.columns[23],df1.columns[24],df1.columns[25]], axis=1)
# rename cols
This is where I am struggling: each time I attempt to rename the columns by position, df1 becomes None (print(type(df1)) shows <class 'NoneType'>). Note that df1 holds the dataframe as expected after dropping the columns.
I get this with everything I have tried below:
column_indices = [0,1,2,3,4,5,6,7,8,9,10,11]
new_names = ['AWG Item Code','Description','UPC','PK','Size','Regular Case Cost','Unit Scan','AMAP','Case Bill Back','Monday Start Date','Sunday End Date','Net Unit']
old_names = df1.columns[column_indices]
df1 = df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
And with:
df1 = df1.rename({df1.columns[0]:"AWG Item Code",df1.columns[1]:"Description",df1.columns[2]:"UPC",df1.columns[3]:"PK",df1.columns[4]:"Size",df1.columns[5]:"Regular Case Cost",df1.columns[6]:"Unit Scan",df1.columns[7]:"AMAP",df1.columns[8]:"Case Bill Back",df1.columns[9]:"Monday Start Date",df1.columns[10]:"Sunday End Date",df1.columns[11]:"Net Unit"}, inplace = True)
When I remove inplace=True (so it defaults to False), it returns the dataframe but without any of the changes I want.
The tricky part is that in this program my column headers will change each time, but the positions of the data columns will not. Otherwise I would just use df = df.rename(columns={"a":"newname"})

A simpler version of your code could be:
df1.columns = new_names
It should work as intended, i.e. renaming the columns in positional order.
Otherwise, in your own code: if you print df1.columns[column_indices], you get a pandas.core.indexes.base.Index rather than a list.
To correct your code, replace the last two lines with the following. Note that rename with inplace=True returns None, so its result should not be assigned back to df1:
old_names = df1.columns[column_indices].tolist()
df1.rename(columns=dict(zip(old_names, new_names)), inplace=True)
Have a nice day
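The behaviour behind the NoneType result can be reproduced on a small frame. The sketch below uses a made-up three-column frame and made-up target names (all hypothetical); it renames by position and shows what rename returns with and without inplace=True.

```python
import pandas as pd

# Hypothetical frame standing in for the spreadsheet data.
df1 = pd.DataFrame([[1, 2, 3]], columns=["x", "y", "z"])

# Positions to rename and the names they should receive.
column_indices = [0, 2]
new_names = ["first", "third"]
old_names = df1.columns[column_indices]
mapping = dict(zip(old_names, new_names))

# Without inplace, rename returns a new, renamed frame.
renamed = df1.rename(columns=mapping)

# With inplace=True, df1 itself is modified and the call returns None.
returned = df1.rename(columns=mapping, inplace=True)

print(list(renamed.columns))
print(returned)
print(list(df1.columns))
```

Either style works on its own; assigning the result of an inplace call back to df1 is what replaces the frame with None.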

I was dumb and missing columns=
df1.rename(columns={df1.columns[0]:"AWG Item Code",df1.columns[1]:"Description",df1.columns[2]:"UPC",df1.columns[3]:"PK",df1.columns[4]:"Size",df1.columns[5]:"Regular Case Cost",df1.columns[6]:"Unit Scan",df1.columns[7]:"AMAP",df1.columns[8]:"Case Bill Back",df1.columns[9]:"Monday Start Date",df1.columns[10]:"Sunday End Date",df1.columns[11]:"Net Unit"}, inplace = True)
works fine

I am not sure whether this answers your question, but there is a simple way to rename the columns.
Say I have a data frame df. I can see the column names using the following code:
df.columns.to_list()
which gives, say, the following column names:
['A', 'B', 'C','D']
Suppose I want to keep the first three columns and rename them 'E', 'F' and 'G' respectively. The following code gives the desired outcome:
df = df[['A','B','C']]
df.columns = ['E','F','G']
New outcome:
df.columns.to_list()
output: ['E','F','G']

Add columns to dataframe that are not already in another dataframe

I am trying to add empty columns to a dataframe df1 for every column of a second dataframe df2 that df1 does not already have. So, given
df2.columns = ['a', 'b', 'c', 'd']
df1.columns = ['a', 'b']
I would like to add columns with names 'c' and 'd' to dataframe df1.
For performance reasons, I would like to avoid a loop with multiple withColumn() statements:
for col in df2.columns:
    if col not in df1.columns:
        df1 = df1.withColumn(col, lit(None).cast(StringType()))
My first attempt
df1 = df1.select(col('*'),
lit(None).alias(col_name) for col_name in df1.columns if col_name not in df2.columns)
is throwing an error
TypeError: Invalid argument, not a string or column: <generator object
myfunction.. at 0x7f60e2bcc8e0> of type <class
'generator'>. For column literals, use 'lit', 'array', 'struct' or
'create_map' function.

select() does not accept a generator object as an argument. First convert the generator to a list with list(), then unpack that list into select() with *:
df1.select(col('*'), *list(lit(None).alias(col_name) for col_name in df2.columns if col_name not in df1.columns))

Get the missing columns from one dataframe and append it to another dataframe

I have a DataFrame df1, and I need to compare its column headers with a list of headers from df2:
df1 =['a','b','c','d','f']
df2 =['a','b','c','d','e','f']
I need to compare df1 with df2 and, if any columns are missing, add them to df1 with blank values.
I tried concat and also append, and neither worked. With concat, I'm not able to add the column e, and append appends all the columns from df1 and df2. How can I add only the missing column to df1, in the same order?
df1_cols = df1.columns
df2_cols = df2._combine_match_columns
if (df1_cols == df2_cols).all():
    df1.to_csv(path + file_name, sep='|')
else:
    print("something is missing, continuing")
    #pd.concat([my_df,flat_data_frame], ignore_index=False, sort=False)
    all_list = my_df.append(flat_data_frame, ignore_index=False, sort=False)
I wanted to see the results as
a|b|c|d|e|f - > headers
1|2|3|4||5 -> values

pandas.DataFrame.align
df1.align(df2, axis=1)[0]
By default this does an 'outer' join.
Specifying axis=1 restricts the alignment to columns.
This returns a tuple of the aligned df1 and df2, with the calling dataframe as the first element, so I grab the first element with [0].
pandas.DataFrame.reindex
df1.reindex(columns=df1.columns.union(df2.columns))
pandas.Index objects support set operations such as union, so df1.columns.union(df2.columns) is the union of the two index objects (older pandas versions also accepted df1.columns | df2.columns for this). I then reindex using the result.
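As a small worked example of the reindex approach, the sketch below rebuilds the one-row case from the question (the concrete values are taken from the desired output; the variable names are illustrative). It uses Index.union, which returns the labels sorted here, and also shows reindexing directly on df2.columns, which keeps df2's column order when df2 already contains every column of df1.

```python
import pandas as pd

# One data row with column 'e' missing, as in the question.
df1 = pd.DataFrame([[1, 2, 3, 4, 5]], columns=["a", "b", "c", "d", "f"])
# Only the headers of df2 matter here, so an empty frame is enough.
df2 = pd.DataFrame(columns=["a", "b", "c", "d", "e", "f"])

# Union of both sets of column labels.
all_columns = df1.columns.union(df2.columns)
# Columns missing from df1 are created and filled with NaN.
aligned = df1.reindex(columns=all_columns)

# Alternative: take df2's column order as-is.
in_df2_order = df1.reindex(columns=df2.columns)

# NaN is written as an empty field by default.
csv_text = aligned.to_csv(sep="|", index=False)
print(csv_text)
```

reindex returns a new frame, so df1 itself is unchanged unless the result is assigned back to it.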

Let's first create the two dataframes:
import pandas as pd, numpy as np
df1 = pd.DataFrame(np.random.random((5,5)), columns = ['a','b','c','d','f'])
df2 = pd.DataFrame(np.random.random((5,7)), columns = ['a','b','c','d','e','f','g'])
Now add to df1 (as NaN values) those columns of df2 that are not in df1:
for i in list(df2):
    if i not in list(df1):
        df1[i] = np.nan
Now display the columns of df1 alphabetically:
df1 = df1[sorted(list(df1))]

Sort column for each index for multi-index data frame

I have a dataframe which has multi-level indexing. Here is a snippet of it:
import pandas as pd
data = {'EVENT_ID': [112335580,112335580,112335580,112335580,112335580,112335580,112335580,112335580, 112335582,
112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,
112335582,112335582,112335582],
'SELECTION_ID': [6356576,2554439,2503211,6297034,4233251,2522967,5284417,7660920,8112876,7546023,8175276,8145908,
8175274,7300754,8065540,8175275,8106158,8086265,2291406,8065533,8125015],
'BSP': [5.080818565,6.651493872,6.374683435,24.69510797,7.776082305,11.73219964,270.0383021,4,8.294425408,335.3223613,
14.06040142,2.423340019,126.7205863,70.53780982,21.3328554,225.2711962,92.25113066,193.0151362,3.775394142,
95.3786641,17.86333041],
'WIN_LOSE':[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0]}
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
df.sort_index(level=0, ascending=True, sort_remaining=True)
I want to sort the BSP column for each EVENT_ID index separately.
I have tried the following:
data.assign(BSP=data.groupby(level=0).rank(ascending=False))
This does not work as it messes up the indexing and doesn't seem to sort the column anyway.
I have also tried just sorting in terms of the column but this also clearly just messes up the indexing.

This sorts by BSP ascending for each event ID:
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df = df.sort_values(["EVENT_ID","BSP"])
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
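The same idea on a four-row frame makes the result easy to check by eye. This is a sketch with made-up values; the extra BSP_RANK column is an assumption about what the rank attempt in the question was aiming for, not something the answer itself provides.

```python
import pandas as pd

# Small made-up sample with two events.
df = pd.DataFrame({
    "EVENT_ID": [2, 1, 2, 1],
    "SELECTION_ID": [10, 11, 12, 13],
    "BSP": [5.0, 3.0, 1.0, 2.0],
})

# Sort by event first, then by BSP within each event, then build the MultiIndex.
sorted_df = df.sort_values(["EVENT_ID", "BSP"]).set_index(["EVENT_ID", "SELECTION_ID"])

# Optional: rank of BSP within each event (1.0 = smallest BSP).
sorted_df["BSP_RANK"] = sorted_df.groupby(level="EVENT_ID")["BSP"].rank()

print(sorted_df)
```

Sorting before set_index keeps each SELECTION_ID attached to its own BSP value, so the index is not scrambled.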

Resetting column headings after concat

I concatenated a df of shape (6000, 13) with a sampleDF of shape (6000, 1). As expected, the column index of the result ranges from 0 to 12 for the df columns, but then shows 0 again for the concatenated sampleDF column.
df = pd.concat([df, sampleDF], axis=1)
I am trying to reset this and have tried the following but nothing seems to have any effect. Any other methods I can try or any thoughts on why this may be happening?
df = df.reset_index(drop=True)
df = df.reindex()
df.index = range(len(df.index))
df.index = pd.RangeIndex(len(df.index))
I have also tried to append .reset_index(drop=True) to my original concat.
The only explanation I can think of is that my data frame is one-dimensional after processing and should perhaps be a pandas Series?
Edit
I found a workaround: transpose, reset the index, and transpose again. There has to be a better way than this.
df = pd.concat([df, sampleDF], axis=1)
df = df.transpose()
df.index = range(len(df.index))
df = df.transpose()

You can simply rename your columns directly:
df = pd.concat([df, sampleDF], axis=1)
df.columns = range(len(df.columns))
This will be more efficient than repeatedly transposing df.
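reset_index and assignments to df.index only affect the row labels, which is why the attempts in the question left the column labels untouched. The sketch below reproduces the duplicate label on two tiny made-up frames and renumbers the columns; the ignore_index=True variant is an addition that does not appear in the answer above.

```python
import pandas as pd

# Tiny stand-ins for the (6000, 13) and (6000, 1) frames.
df = pd.DataFrame([[1, 2], [3, 4]])
sampleDF = pd.DataFrame([[9], [8]])

# Column labels are carried over from both inputs, so 0 appears twice.
combined = pd.concat([df, sampleDF], axis=1)
labels_before = list(combined.columns)

# Renumber the columns directly.
combined.columns = range(len(combined.columns))

# Alternative: let concat discard the labels along the concatenation axis.
renumbered = pd.concat([df, sampleDF], axis=1, ignore_index=True)

print(labels_before)
print(list(combined.columns))
print(list(renumbered.columns))
```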
