Remove duplicate values from dataframe column of lists - python

I've got a dataframe column containing lists, and I want to remove duplicate values from the individual lists.
d = {'colA': [['UVB', 'NER', 'GGR', 'NER'], ['KO'], ['ERK1', 'ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)
I want to remove the duplicate 'NER' and 'ERK1' from the lists.
I've tried:
df['colA'] = set(tuple(df['colA']))
I get the error message:
TypeError: unhashable type: 'list'

You can remove duplicate values from each list using pandas' apply() method, as follows.
import pandas as pd
d = {'colA': [['UVB', 'NER', 'GGR', 'NER'], ['KO'], ['ERK1', 'ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)
df['colA'].apply(lambda x: list(set(x)))
#output
0 [NER, UVB, GGR]
1 [KO]
2 [ERK2, ERK1]
3 []
Name: colA, dtype: object
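Note that set() does not preserve the original order of each list. If order matters, a sketch using dict.fromkeys (insertion-ordered in Python 3.7+) keeps the first occurrence of each value:

```python
import pandas as pd

d = {'colA': [['UVB', 'NER', 'GGR', 'NER'], ['KO'], ['ERK1', 'ERK1', 'ERK2'], []]}
df = pd.DataFrame(data=d)

# dict.fromkeys removes duplicates while keeping first-seen order
df['colA'] = df['colA'].apply(lambda x: list(dict.fromkeys(x)))
print(df['colA'].tolist())
# [['UVB', 'NER', 'GGR'], ['KO'], ['ERK1', 'ERK2'], []]
```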

The problem is that you have a tuple of lists, and lists are unhashable, which is why the set call doesn't work. You should iterate over the entire tuple instead:
ans = tuple(df['colA'])
for i in range(len(ans)):
    df.at[i, 'colA'] = set(ans[i])

Related

Create array from dataframe columns in python - error when iterating

I have created this dataframe
d = {'col1': [1], 'col2': [3]}
df = pd.DataFrame(data=d)
print(df)
I have then created a field called "columnA" which is supposed to be an array made of the two elements contained in col1 and col2:
filter_col = [col for col in df if col.startswith('col')]
df["columnA"] = df[filter_col].values.tolist()
print(df)
Now, I was expecting the columnA to be a list (or an array), but when I check the length of that field I get 1 (not 2, as I expected):
print("Length: ", str(len(df['columnA'])))
Length: 1
What do I need to do to get a value of 2 and therefore be able to iterate through that array?
For example, I would be able to do this iteration:
for i in range(len(df['columnA'])):
print(i)
Result:
0
1
Can anyone help me, please?
You are on the right track. Instead of using len() directly on the column (which counts rows), take the cell's value first and then apply len():
print("Length: ", len(df['columnA'].values[0]))
for item in df["columnA"]:
    for num in item:
        print(num)
This will iterate directly over the column
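To spell out the distinction: len() on the column counts rows, while len() on a single cell counts the elements inside that row's list. A small sketch using the asker's setup:

```python
import pandas as pd

d = {'col1': [1], 'col2': [3]}
df = pd.DataFrame(data=d)
filter_col = [col for col in df if col.startswith('col')]
df["columnA"] = df[filter_col].values.tolist()

print(len(df['columnA']))          # 1 -> number of rows in the column
print(len(df['columnA'].iloc[0]))  # 2 -> number of elements in the first cell's list

for i in df['columnA'].iloc[0]:
    print(i)  # prints 1, then 3
```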

How to correctly append values of a dictionary to an empty dataframe?

Hi I am trying to create a dataframe that will have rows added to it in a for loop. So I decided to first create an empty version of the dataframe, and then create a dictionary for a new row and append that dictionary to the dataframe in each iteration of the loop. The problem is that the values in the dataframe do not properly match the values that were in the dictionary:
I create an empty dataframe as follows:
import pandas
df = pandas.DataFrame({"a":[], "b":[], "c":[]})
I would then create a dictionary and append it to the dataframe like:
dict = {"a":1, "b":2, "c":True}
df = df.append(dict, ignore_index=True)
The problem is that instead of getting a=1, b=2, c=True the dataframe has a=1.0, b=2.0, c=1.0
So how can I make the columns of a and b integers, and the column of c a boolean value?
You should not convert the dictionary to a dataframe in the first line; it is better to convert it at the end:
import pandas as pd
dicts = {"a": [], "b": [], "c": []}  # keep a plain dict of lists; do not convert to a df yet
new_dict = {"a": 1, "b": 2, "c": True}
for k, v in new_dict.items():
    dicts[k].append(v)
df = pd.DataFrame(dicts)
Output:
df
a b c
0 1 2 True
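Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A common pattern that also keeps per-column dtypes is to collect the row dicts in a list and build the frame once at the end (the loop below is a hypothetical stand-in for the asker's loop):

```python
import pandas as pd

rows = []
for a in range(2):  # hypothetical loop producing one row dict per iteration
    rows.append({"a": a, "b": a * 2, "c": bool(a)})

df = pd.DataFrame(rows)
print(df.dtypes)  # a and b stay int64, c stays bool
```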

How to query a Pandas Dataframe based on column values

I have a dataframe:
ID  Name
1   A
2   B
3   C
I defined a list:
mylist =[A,C]
If I want to extract only the rows where Name is equal to A and C (namely, mylist), I am trying to use the following code:
df_new = df[(df['Name'].isin(mylist))]
>>> df_new
As a result, I get an empty table.
Any suggestion as to why I get this empty result?
Just remove the additional opening bracket before df['Name']:
df_new = df[df['Name'].isin(mylist)]
Found the solution. It was a problem with the list that caused the empty table.
The format of the list should be:
mylist =['A','C']
instead of
mylist =[A,C]
You could use .loc and a lambda, as it's more readable:
import pandas as pd
dataf = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
names = ['A', 'C']
# select rows where column Name is in names
df = dataf.loc[lambda d: d['Name'].isin(names)]
print(df)
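Another option, assuming the same list of names, is DataFrame.query, which can reference local Python variables with the @ prefix:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
mylist = ['A', 'C']

# @mylist refers to the local variable inside the query string
df_new = df.query('Name in @mylist')
print(df_new)
```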

Dictionary to Dataframe Error: "If using all scalar values, you must pass an index"

Currently, I am using a for loop to read csv files from a folder.
After reading the csv file, I am storing the data into one row of a dictionary.
When I print the data types using "print(list_of_dfs.dtypes)" I receive:
dtype: object
DATETIME : object
VALUE : float64
ID : int64
ID Name: object.
Note that this is a nested dictionary with thousands of values stored in each of these data fields. I have 26 rows of the structure listed above. I am trying to append the dictionary rows into a dataframe where I will have only 1 row consisting of the datafields:
Index DATETIME VALUE ID ID Name.
Note: I am learning python as I go.
I tried using an array to store the data and then convert the array to a dataframe but I could not append the rows of the dataframe.
Using the dictionary method I attempted "df = pd.DataFrame(list_of_dfs)"
This throws an error.
list_of_dfs = {}
for I in range(0, len(regionLoadArray)):
    list_of_dfs[I] = pd.read_csv(regionLoadArray[I])
    # regionLoadArray contains my file names from the list directory.
dataframe = pd.DataFrame(list_of_dfs)
# this method was suggested at thispoint.com for nested dictionaries.
# This is where my error occurs ^
ValueError: If using all scalar values, you must pass an index
I appreciate any assistance with this issue as I am new to python.
My current goals is to simply produce a dataframe with my Headers that I can then send to a csv.
Depending on your needs, a simple workaround could be:
import pandas as pd
dct = {'col1': 'abc', 'col2': 123}
dct = {k: [v] for k, v in dct.items()}  # WORKAROUND: wrap each scalar in a list
df = pd.DataFrame(dct)
which results in
print(df)
col1 col2
0 abc 123
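An equivalent one-liner is to pass an explicit index instead of wrapping each value in a list:

```python
import pandas as pd

dct = {'col1': 'abc', 'col2': 123}
df = pd.DataFrame(dct, index=[0])  # the explicit index makes the scalar values valid
print(df)
```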
This error occurs because pandas needs an index when every value you pass is a scalar. At first this seems confusing because you may think of list indexing; what pandas is actually asking for is a row label for each record. You can set it like so:
import pandas as pd
data = ['a', 'b', 'c', 'd']
df = pd.DataFrame(data, index=[0, 1, 2, 3])
The data frame then yields:
   0
0  a
1  b
2  c
3  d
For you specifically, this might look something like the following using numpy (not tested):
import numpy as np
list_of_dfs = {}
for I in range(0, len(regionLoadArray)):
    list_of_dfs[I] = pd.read_csv(regionLoadArray[I])
ind = np.arange(len(list_of_dfs))
dataframe = pd.DataFrame(list_of_dfs, index=ind)
Pandas unfortunately needs an index whenever all the values it is given are scalars.
You can either set it yourself, or use an object with the following structure so pandas can determine the index itself:
data= {'a':[1],'b':[2]}
Since it won't be easy to edit the data in your case,
A hacky solution is to wrap the data into a list
dataframe = pd.DataFrame([list_of_dfs])
import pandas as pd
d = [{"a": 1, "b":2, "c": 3},
{"a": 4, "b":5, "c": 6},
{"a": 7, "b":8, "c": 9}
]
pd.DataFrame(d, index=list(range(len(d))))
returns:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
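Since each pd.read_csv call already returns a DataFrame, stacking them with pd.concat is likely closer to what the asker actually wants (regionLoadArray is the asker's list of file paths; the sketch below fakes the files with in-memory buffers):

```python
import io
import pandas as pd

# in-memory stand-ins for the asker's CSV files
csv_files = [io.StringIO("DATETIME,VALUE\n2020-01-01,1.5\n"),
             io.StringIO("DATETIME,VALUE\n2020-01-02,2.5\n")]

# read each file, then stack all rows into one frame
frames = [pd.read_csv(f) for f in csv_files]
dataframe = pd.concat(frames, ignore_index=True)
print(dataframe)
```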

"Expanding" pandas dataframe by using cell-contained list

I have a dataframe in which third column is a list:
import pandas as pd
pd.DataFrame([[1,2,['a','b','c']]])
I would like to separate that nest and create more rows with identical values of first and second column.
The end result should be something like:
pd.DataFrame([[1, 2, 'a'], [1, 2, 'b'], [1, 2, 'c']])
Note, this is simplified example. In reality I have multiple rows that I would like to "expand".
Regarding my progress: I have no idea how to solve this. I imagine I could take each member of the nested list while keeping the other column values in mind, then use a list comprehension to build more lists and assemble them into a new dataframe. But that seems too complex. Is there a simpler solution?
Create the dataframe with a single column, then add columns with constant values:
import pandas as pd
df = pd.DataFrame({"data": ['a', 'b', 'c']})
df['col1'] = 1
df['col2'] = 2
print(df)
This prints:
data col1 col2
0 a 1 2
1 b 1 2
2 c 1 2
Not exactly the same issue the OP described, but related, and more pandas-like, is the situation where you have a dict of lists with lists of unequal lengths. In that case, you can create a DataFrame in long format like this:
import pandas as pd
my_dict = {'a': [1,2,3,4], 'b': [2,3]}
df = pd.DataFrame.from_dict(my_dict, orient='index')
df = df.unstack()  # reshape to long form
df = df.dropna()  # drop the NaN values that were generated by the unequal list lengths
df.index = df.index.droplevel(level=0)  # drop the outer (position) level of the index
# NOTE this last step results in duplicate indexes
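In recent pandas versions (0.25+) the expansion the OP asked for is built in as DataFrame.explode, which replicates the other columns for every element of the list column:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, ['a', 'b', 'c']]], columns=['col1', 'col2', 'data'])

# explode turns each element of the list column into its own row
out = df.explode('data', ignore_index=True)
print(out)
```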
