Insert list inside pandas dataframe column names - python

I have a dataframe that I initiate like this:
df = pd.DataFrame(columns=('col_A', 'col_B', 'col_C', 'col_D'))
I want to insert a list of column names in this dataframe, but this does not work:
list_col_names = ['aa', 'bb']
df = pd.DataFrame(columns=('col_A', 'col_B', list_col_names, 'col_C', 'col_D'))
I get this error: *** TypeError: unhashable type: 'list'
How do I fix it? I want all the items in list_col_names to become column names in the pandas dataframe

You are effectively passing ('col_A', 'col_B', ['aa', 'bb'], 'col_C', 'col_D') as the columns argument, so the list ends up nested as a single element. Concatenate the lists instead, for example:
df = pd.DataFrame(columns=['col_A', 'col_B'] + list_col_names + ['col_C', 'col_D'])
You got the error because pandas tried to use the list ['aa', 'bb'] itself as a single column label, and column labels must be hashable.
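As a small variation on the answer above, the list can also be unpacked in place with the * operator. This is only a sketch of the same idea; the variable name columns is introduced here for illustration.

```python
import pandas as pd

list_col_names = ['aa', 'bb']

# Unpack the list in place so its items become individual column labels
columns = ['col_A', 'col_B', *list_col_names, 'col_C', 'col_D']
df = pd.DataFrame(columns=columns)

print(list(df.columns))
```

Either form produces a flat sequence of strings, which is what the columns argument needs.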

Related

Unable to update Pandas dataframe element with dictionary

I have a Pandas dataframe with just 2 columns: the first is a name, and the second is a dictionary of information relevant to the name. Adding new rows works fine, but if I try to update the dictionary column by assigning a new dictionary in place, I get
ValueError: Incompatible indexer with Series
So, to be exact, this is what I was doing to produce the error:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[pos, 'attributes'] = {'c':2}
I was able to find another solution that seems to work:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[:,'attributes'].at[pos] = {'c':2}
but I was hoping to get an answer as to why the first method doesn't work, or if there was something wrong with how I had it initially.
Since you are accessing the dataframe by the position 'pos', you can use iloc to select the row. Changing your last line as follows is one option, although this is chained indexing, so it may assign to a temporary copy rather than to df itself:
df.iloc[pos]['attributes'] = {'c':2}
For me, DataFrame.at works:
df.at[pos, 'attributes'] = {'c':2}
print(df)
  name attributes
0    a   {'c': 2}
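A self-contained sketch of the DataFrame.at approach is shown here. It assumes the 'attributes' column has object dtype, as it does when built from dictionaries; the way pos is computed is just one possible option and is not taken from the answers above.

```python
import pandas as pd

df = pd.DataFrame(data=[['a', {'b': 1}]], columns=['name', 'attributes'])

# Find the index label of the first row whose name is 'a'
pos = df.index[df['name'] == 'a'][0]

# .at targets exactly one cell, so the dict is stored as a single object
df.at[pos, 'attributes'] = {'c': 2}

print(df.at[pos, 'attributes'])
```

Because .at addresses a single cell, the dictionary is stored as one value rather than being interpreted as something to spread over the row.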

Python, Remove duplicate values from dataframe column of lists

I've got a dataframe column containing lists, and I want to remove duplicate values from the individual lists.
d = {'colA': [['UVB', 'NER', 'GGR', 'NER'], ['KO'], ['ERK1', 'ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)
I want to remove the duplicate 'NER' and 'ERK1' from the lists.
I've tried:
df['colA'] = set(tuple(df['colA']))
I get the error message:
TypeError: unhashable type: 'list'
You can remove duplicate values from each list using the apply() method of the pandas Series, as follows. Note that set() does not preserve the original order of the items.
import pandas as pd
d = {'colA': [['UVB', 'NER', 'GGR', 'NER'], ['KO'], ['ERK1', 'ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)
df['colA'].apply(lambda x: list(set(x)))
#output
0    [NER, UVB, GGR]
1               [KO]
2       [ERK2, ERK1]
3                 []
Name: colA, dtype: object
The problem is that tuple(df['colA']) is a tuple of lists, and lists are unhashable, which is why set() fails on it. You should iterate over the tuple and deduplicate each list instead:
ans = tuple(df['colA'])
df['colA'] = [list(set(ans[i])) for i in range(len(ans))]
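If the original order of the items matters, a possible alternative to set() is dict.fromkeys(), which keeps the first occurrence of each item in insertion order (Python 3.7+). This is a sketch of a swapped-in technique, not the method from the answers above.

```python
import pandas as pd

d = {'colA': [['UVB', 'NER', 'GGR', 'NER'], ['KO'], ['ERK1', 'ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)

# dict keys are unique and keep insertion order, so duplicates are
# dropped while the first occurrence of each item stays in place
df['colA'] = df['colA'].apply(lambda items: list(dict.fromkeys(items)))

print(df['colA'].tolist())
```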

Add columns to dataframe that are not already in another dataframe

I am trying to add to a dataframe df1 empty columns for every column that exists in a second dataframe df2 but not yet in df1. So, given
df2.columns = ['a', 'b', 'c', 'd']
df1.columns = ['a', 'b']
I would like to add columns with names 'c' and 'd' to dataframe df1.
For performance reasons, I would like to avoid a loop with multiple withColumn() statements:
for col_name in df2.columns:
    if col_name not in df1.columns:
        df1 = df1.withColumn(col_name, lit(None).cast(StringType()))
My first attempt
df1 = df1.select(col('*'),
    lit(None).alias(col_name) for col_name in df2.columns if col_name not in df1.columns)
is throwing an error
TypeError: Invalid argument, not a string or column: <generator object
myfunction.. at 0x7f60e2bcc8e0> of type <class
'generator'>. For column literals, use 'lit', 'array', 'struct' or
'create_map' function.
select() takes column expressions as separate arguments, not a generator object. Build the expressions as a list and unpack it with * when calling select():
df1 = df1.select(col('*'), *[lit(None).alias(col_name) for col_name in df2.columns if col_name not in df1.columns])
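The difference between passing a generator and unpacking it can be illustrated in plain Python, without Spark. In this sketch, missing_columns and describe_args are hypothetical helpers introduced only for illustration; describe_args merely stands in for a function that takes variable arguments and is not part of PySpark.

```python
def missing_columns(target_cols, reference_cols):
    """Return names in reference_cols that are absent from target_cols, in order."""
    existing = set(target_cols)
    return [name for name in reference_cols if name not in existing]


def describe_args(*args):
    """Hypothetical stand-in for a varargs API: report the type of each argument."""
    return [type(arg).__name__ for arg in args]


df1_columns = ['a', 'b']
df2_columns = ['a', 'b', 'c', 'd']

to_add = missing_columns(df1_columns, df2_columns)
print(to_add)

gen = (name.upper() for name in to_add)
print(describe_args('*', gen))      # the generator arrives as one argument
print(describe_args('*', *to_add))  # unpacking passes each item separately
```

A function that expects each column as its own argument receives a single generator object in the first call, which is the situation the TypeError in the question describes.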

remove first 2 rows in a dataframe based on the value in another column

I have a df with stock tickers in a column and the next column is called 'Fast Add' which will either be populated with the value 'Add' or be empty.
I want to remove the first 2 stock tickers, but only where the 'Fast Add' column equals 'Add'. The code below removes the first 2 rows, but I need to add a condition so that it only removes the first 2 rows where 'Fast Add' is 'Add'. Can someone help, please?
new_df = df_obj[2:]
You can use the drop function in Pandas to remove specific indices from a DataFrame. Here's a code example for your use case:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Ticker': ['A', 'B', 'C', 'D'],
    'Fast Add': ['Add', np.nan, 'Add', 'Add']
})
new_df = df.drop(df[df['Fast Add'] == 'Add'][:2].index)
new_df is a DataFrame with the following contents:
  Ticker Fast Add
1      B      NaN
3      D      Add
The approach here is to select all the rows you want to remove and then pass their indices into DataFrame.drop() to remove them.
References:
https://showmecode.info/pandas/DataFrame/remove-rows/ (personal site)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
IIUC, something like this should work (it assumes the column is named FastAdd and holds the value 'ADD'):
df_obj["record_idx"] = df_obj.groupby('FastAdd').cumcount()
new_df = df_obj.query("not (record_idx < 2 and FastAdd == 'ADD')")
You can also use a cheap hack like below:
df_obj.sort_values("FastAdd", inplace = True)
new_df = df_obj.iloc[2:].copy()
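Another way to express the same requirement, sketched here with the sample data from the first answer, is a boolean mask combined with a running count of matches. It keeps the original row order and leaves the non-matching rows untouched. The names is_add and to_drop are introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Ticker': ['A', 'B', 'C', 'D'],
    'Fast Add': ['Add', np.nan, 'Add', 'Add']
})

# True where the row is flagged; NaN compares as False
is_add = df['Fast Add'].eq('Add')

# The running count of flagged rows marks the first two matches
to_drop = is_add & (is_add.cumsum() <= 2)

new_df = df[~to_drop]
print(new_df)
```

Changing the 2 in the comparison adjusts how many flagged rows are removed.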

Inserting a tuple into an empty pandas dataframe

I would like to insert a row into an empty DataFrame. However, this seems to fail for a DataFrame with predefined indices and when the elements include a tuple or list prompting the error:
ValueError: setting an array element with a sequence.
The example code is as follows:
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = [3,4,(5,6)]
print(df)
It is not clear that the elements in the list correspond to values in different columns. You can convert the list first to a Series indexed by the DataFrame's columns:
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = pd.Series([3,4,(5,6)], index=df.columns)
print(df)
I tried something like this.
def with_return(row):
    t = [5,6]
    return t
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = [3,4,5] #dummy element
df['col3'] = df.apply(with_return, axis=1)
print(df)
Or simply use a Series:
df.loc[(1,2),:] = pd.Series([3,4,(5,6)], index=df.columns)
This is still not directly inserting a tuple as an element in an empty DataFrame, just another way to do it. Still, loc should be able to handle it.
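If the row is known up front, one possible way around the problem is to build the DataFrame from a record instead of inserting into an empty one. This is a sketch of a different approach from the answers above; the variable name row is introduced for illustration.

```python
import pandas as pd

# Describe the row as a dict so each value is tied to a column name;
# the tuple is then kept as one object in 'col3'
row = {'idx1': 1, 'idx2': 2, 'col1': 3, 'col2': 4, 'col3': (5, 6)}

df = pd.DataFrame([row]).set_index(['idx1', 'idx2'])
print(df)
```

Because every value is paired with its column name, pandas does not have to guess whether the tuple should be spread across several columns.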
