Insert list inside pandas dataframe column names

I have a dataframe that I initiate like this:
df = pd.DataFrame(columns=('col_A', 'col_B', 'col_C', 'col_D'))
I want to insert a list of column names in this dataframe, but this does not work:
list_col_names = ['aa', 'bb']
df = pd.DataFrame(columns=('col_A', 'col_B', list_col_names, 'col_C', 'col_D'))
I get this error: *** TypeError: unhashable type: 'list'
How do I fix it? I want all the items in list_col_names to become column names in the pandas dataframe.

You are effectively passing ('col_A', 'col_B', ['aa', 'bb'], 'col_C', 'col_D') as the columns argument, so the third element is a list rather than a column name. Concatenate the pieces into one flat list instead, for example:
df = pd.DataFrame(columns=['col_A', 'col_B'] + list_col_names + ['col_C', 'col_D'])
You got the error because pandas tried to use the list ['aa', 'bb'] as a single column label. Column labels must be hashable, and a list is not.
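For completeness, here is a minimal sketch of two ways to build the flat list of labels: list concatenation, as in the answer, and unpacking with *. The variable names df_unpacked and columns are only illustrative.

```python
import pandas as pd

list_col_names = ['aa', 'bb']

# Option 1: concatenate plain lists so that every label is a string.
columns = ['col_A', 'col_B'] + list_col_names + ['col_C', 'col_D']
df = pd.DataFrame(columns=columns)

# Option 2: unpack the list in place with * inside a list literal.
df_unpacked = pd.DataFrame(
    columns=['col_A', 'col_B', *list_col_names, 'col_C', 'col_D']
)

print(list(df.columns))
print(list(df_unpacked.columns))
```

Both frames end up with the same six column labels, in the order they were written.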

Related

Unable to update Pandas dataframe element with dictionary

I have a Pandas dataframe with just 2 columns: the first is a name, and the second is a dictionary of information relevant to the name. Adding new rows works fine, but if I try to update the dictionary column by assigning a new dictionary in place, I get
ValueError: Incompatible indexer with Series
So, to be exact, this is what I was doing to produce the error:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[pos, 'attributes'] = {'c':2}
I was able to find another solution that seems to work:
import pandas as pd
df = pd.DataFrame(data=[['a', {'b':1}]], columns=['name', 'attributes'])
pos = df[df.loc[:,'name']=='a'].index[0]
df.loc[:,'attributes'].at[pos] = {'c':2}
But I was hoping to learn why the first method doesn't work, or whether there was something wrong with how I had it initially.

Since pos is also a row position here, you can go through iloc by changing your last line as follows:
df.iloc[pos]['attributes'] = {'c':2}
Note that this is chained indexing, so the assignment may land on a temporary copy instead of df, depending on the pandas version.

DataFrame.at works for me:
df.at[pos, 'attributes'] = {'c':2}
print(df)
  name attributes
0    a   {'c': 2}
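Putting the question's setup together with the at-based fix gives the short sketch below. One plausible reading (an assumption, not something confirmed in this thread) is that loc treats a dict as a collection of values to align against the selection, whereas at always writes exactly one cell. The name stored is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(data=[['a', {'b': 1}]], columns=['name', 'attributes'])

# Label of the first row whose name is 'a'.
pos = df.index[df['name'] == 'a'][0]

# .at addresses a single cell by row label and column label,
# so the dict is stored as one object.
df.at[pos, 'attributes'] = {'c': 2}

stored = df.at[pos, 'attributes']
print(stored)
```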

Python, Remove duplicate values from dataframe column of lists

I've got a dataframe column containing lists, and I want to remove duplicate values from the individual lists.
d = {'colA': [['UVB', 'NER', 'GGR', 'NER'], ['KO'], ['ERK1', 'ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)
I want to remove the duplicate 'NER' and 'ERK1' from the lists.
I've tried:
df['colA'] = set(tuple(df['colA']))
I get the error message:
TypeError: unhashable type: 'list'

You can remove duplicate values from each list with the Series apply() method, as follows.
import pandas as pd
d = {'colA': [['UVB', 'NER', 'GGR', 'NER'], ['KO'], ['ERK1', 'ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)
df['colA'].apply(lambda x: list(set(x)))
# output (set() does not preserve order, so the element order may differ)
0    [NER, UVB, GGR]
1               [KO]
2       [ERK2, ERK1]
3                 []
Name: colA, dtype: object
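If the original order of the items matters, a possible variant (a sketch, not part of the answer above) is to deduplicate with dict.fromkeys, which keeps the first occurrence of each item in order, and to assign the result back to the column.

```python
import pandas as pd

d = {'colA': [['UVB', 'NER', 'GGR', 'NER'], ['KO'], ['ERK1', 'ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)

# dict keys are unique and keep insertion order, so this drops
# later duplicates while preserving the order of first appearance.
df['colA'] = df['colA'].apply(lambda items: list(dict.fromkeys(items)))

print(df['colA'].tolist())
# [['UVB', 'NER', 'GGR'], ['KO'], ['ERK1', 'ERK2'], []]
```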

The problem is that tuple(df['colA']) is a tuple of lists, and lists are unhashable, which is why set() fails. You should instead iterate over the tuple and convert each list separately:
ans = tuple(df['colA'])
df['colA'] = [set(lst) for lst in ans]

Add columns to dataframe that are not already in another dataframe

I am trying to add empty columns to a dataframe df1 for every column of a second dataframe df2 that df1 does not already have. So, given
df2.columns = ['a', 'b', 'c', 'd']
df1.columns = ['a', 'b']
I would like to add columns with names 'c' and 'd' to dataframe df1.
For performance reasons, I would like to avoid a loop with multiple withColumn() statements:
for col_name in df2.columns:
    if col_name not in df1.columns:
        df1 = df1.withColumn(col_name, lit(None).cast(StringType()))
My first attempt
df1 = df1.select(col('*'),
    (lit(None).alias(col_name) for col_name in df2.columns if col_name not in df1.columns))
is throwing an error
TypeError: Invalid argument, not a string or column: <generator object
myfunction.. at 0x7f60e2bcc8e0> of type <class
'generator'>. For column literals, use 'lit', 'array', 'struct' or
'create_map' function.

select() expects the columns as separate arguments, not as a generator. Convert the generator to a list with list() and unpack it with * when passing it to select():
df1 = df1.select(col('*'), *list(lit(None).alias(col_name) for col_name in df2.columns if col_name not in df1.columns))

remove first 2 rows in a dataframe based on the value in another column

I have a df with stock tickers in one column, and the next column, called 'Fast Add', is either populated with the value 'Add' or empty.
I want to remove the first 2 stock tickers, but only among the rows where the 'Fast Add' column equals 'Add'. The code below removes the first 2 rows, but I need to add a condition so that it only removes the first 2 rows where 'Fast Add' is 'Add'. Can someone help, please?
new_df = df_obj[2:]

You can use the drop function in Pandas to remove specific indices from a DataFrame. Here's a code example for your use case:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Ticker': ['A', 'B', 'C', 'D'],
    'Fast Add': ['Add', np.nan, 'Add', 'Add']
})
new_df = df.drop(df[df['Fast Add'] == 'Add'][:2].index)
new_df is a DataFrame with the following contents:
  Ticker Fast Add
1      B      NaN
3      D      Add
The approach here is to select all the rows you want to remove and then pass their indices into DataFrame.drop() to remove them.
References:
https://showmecode.info/pandas/DataFrame/remove-rows/ (personal site)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

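IIUC, something like this should work. It numbers the rows within each 'Fast Add' value and then drops the first two 'Add' rows:
df_obj["record_idx"] = df_obj.groupby("Fast Add").cumcount()
new_df = df_obj[~((df_obj["Fast Add"] == "Add") & (df_obj["record_idx"] < 2))]
You can also use a cheap hack like the one below. It reorders the rows and assumes the empty cells are NaN (which sort last by default), so that the 'Add' rows come first:
df_obj.sort_values("Fast Add", inplace=True)
new_df = df_obj.iloc[2:].copy()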
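Another way to express "the first 2 rows where 'Fast Add' is 'Add'" without a helper column or sorting is a running count over a boolean mask. This is only a sketch: it reuses the sample data from the first answer, and the names is_add and to_drop are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data borrowed from the first answer.
df_obj = pd.DataFrame({
    'Ticker': ['A', 'B', 'C', 'D'],
    'Fast Add': ['Add', np.nan, 'Add', 'Add'],
})

# True for rows whose 'Fast Add' value is 'Add' (NaN compares as False).
is_add = df_obj['Fast Add'].eq('Add')

# cumsum() gives a running count of 'Add' rows, so this marks
# only the first two of them for removal.
to_drop = is_add & (is_add.cumsum() <= 2)

new_df = df_obj[~to_drop]
print(new_df)
```

The original row order and index labels are kept, and rows without 'Add' are never touched.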

Inserting a tuple into an empty pandas dataframe

I would like to insert a row into an empty DataFrame. However, this seems to fail for a DataFrame with predefined indices when the elements include a tuple or list, raising the error:
ValueError: setting an array element with a sequence.
The example code is as follows:
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = [3,4,(5,6)]
print(df)

It is not clear to pandas that the elements in the list correspond to values in different columns, because one of them is itself a sequence. You can first convert the list to a Series indexed by the DataFrame's columns:
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = pd.Series([3,4,(5,6)], index=df.columns)
print(df)

I tried something like this:
def with_return(row):
    t = [5, 6]
    return t
df = pd.DataFrame(columns=['idx1', 'idx2', 'col1', 'col2', 'col3'])
df.set_index(['idx1', 'idx2'], inplace=True)
df.loc[(1,2),:] = [3,4,5]  # dummy element
df['col3'] = df.apply(with_return, axis=1)
print(df)
Or simply use a Series:
df.loc[(1,2),:] = pd.Series([3,4,(5,6)], index=df.columns)
This still does not insert a tuple directly as an element of an empty DataFrame; it is just another way to do it. Still, loc should be able to handle it.
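If the row values are known up front, a different route (named plainly: this builds the frame with the row already in it rather than inserting into an empty one) is to pass the data and a MultiIndex to the constructor. A minimal sketch, with index and cell as illustrative names:

```python
import pandas as pd

# Build the two-level index for the single row (1, 2).
index = pd.MultiIndex.from_tuples([(1, 2)], names=['idx1', 'idx2'])

# Each column gets a one-element list; the tuple is kept as a single object.
df = pd.DataFrame(
    {'col1': [3], 'col2': [4], 'col3': [(5, 6)]},
    index=index,
)

cell = df['col3'].iloc[0]
print(df)
print(cell)
```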
