counting unique elements in lists - python

I have a dataframe containing one column of lists.
names                                  unique_values
[B-PER, I-PER, I-PER, B-PER]           2
[I-PER, N-PER, B-PER, I-PER, A-PER]    4
[B-PER, A-PER, I-PER]                  3
[B-PER, A-PER, A-PER, A-PER]           2
I need to count the distinct values in each list in the column; if a value appears more than once, it should only be counted once. How can I achieve this?
Thanks

Combine explode with groupby and nunique:
df["unique_values"] = df.names.explode().groupby(level=0).nunique()

You can use the built-in set data type to do this -
df['unique_values'] = df['names'].apply(lambda a: len(set(a)))
This works because sets do not allow duplicate elements, so converting a list to a set strips all duplicates; all you then need is the length of the resulting set.
To ignore NaN values in a list, you can do the following -
df['unique_values'] = df['names'].apply(lambda a: len([x for x in set(a) if str(x) != 'nan']))
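As a quick check, the set-based approach can be run end to end on the sample data from the question (a minimal sketch; the dataframe construction is assumed):

```python
import pandas as pd

# Sample data reproduced from the question
df = pd.DataFrame({
    "names": [
        ["B-PER", "I-PER", "I-PER", "B-PER"],
        ["I-PER", "N-PER", "B-PER", "I-PER", "A-PER"],
        ["B-PER", "A-PER", "I-PER"],
        ["B-PER", "A-PER", "A-PER", "A-PER"],
    ]
})

# len(set(...)) counts each distinct value in a row exactly once
df["unique_values"] = df["names"].apply(lambda a: len(set(a)))
print(df["unique_values"].tolist())  # [2, 4, 3, 2]
```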

Try:
df["unique_values"] = df.names.explode().groupby(level=0).unique().str.len()
Output
df
names unique_values
0 [B-PER, I-PER, I-PER, B-PER] 2
1 [I-PER, N-PER, B-PER, I-PER, A-PER] 4
2 [B-PER, A-PER, I-PER] 3
3 [B-PER, A-PER, A-PER, A-PER] 2

Related

Manipulate row values based on lists

I have a problem and do not know how to solve it.
I have two lists, which always have the same length:
max_values = [333,30,10]
min_values = [30,10,0]
Every index of the lists represents the cluster number of the range between the corresponding min and max values, so:
Index/Cluster 0: 0-10
Index/Cluster 1: 10-30
Index/Cluster 2: 30-333
Furthermore I have one dataframe as follows:
Dataframe
Within the df, I have a column called "AVG_MPH_AREA"
For each value, it should be checked which cluster range it falls into; the "Cluster" column should then be set to the corresponding index of the lists. The old values should be dropped.
In this case it's a list of 3 clusters, but it could also be more or less...
Any idea how to switch that or with which functions?
Came up with a small function that can do the task:
max_values = [333,30,10]
min_values = [30,10,0]
Make a dictionary that maps each cluster number (key) to its (min_value, max_value) pair (value):
def temp_func(x):
    # Construct the dict inside, so this function can be applied
    # to the AVG_MPH_AREA column of the dataframe
    dt = {}
    cluster_list = list(zip(min_values, max_values))
    for i in range(len(cluster_list)):
        dt[i] = cluster_list[i]
    x = int(round(x))
    for key, value in dt.items():
        if value[0] <= x < value[1]:
            return key
Now apply the function to the AVG_MPH_AREA column:
df["Cluster"] = df["AVG_MPH_AREA"].apply(temp_func)
Output:
   AVG_MPH_AREA  Cluster
0        10.770        1
1        10.770        1
2        10.780        1
3         5.780        2
4        24.960        1
5       267.865        0
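As an alternative (not part of the original answer), pd.cut can do the binning in a vectorized way; the bin edges and label order below are assumptions derived from the question's min/max lists and the answer's cluster numbering:

```python
import pandas as pd

# Hypothetical sample matching the answer's output column
df = pd.DataFrame({"AVG_MPH_AREA": [10.770, 10.770, 10.780, 5.780, 24.960, 267.865]})

# Bin edges from min_values/max_values, sorted ascending: 0-10, 10-30, 30-333.
# Labels follow the answer's cluster numbering (0 -> 30-333, 1 -> 10-30, 2 -> 0-10).
df["Cluster"] = pd.cut(df["AVG_MPH_AREA"], bins=[0, 10, 30, 333], labels=[2, 1, 0])
print(df["Cluster"].tolist())  # [1, 1, 1, 2, 1, 0]
```

Unlike the row-by-row function, this needs no rounding and scales to any number of clusters by extending the bins and labels lists.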

only keep part of a list in pandas series

I have a Pandas series containing a list of strings like so:
series_of_list.head()
0 ['hello','there','my','name']
1 ['hello','hi','my','name']
2 ['hello','howdy','my','name']
3 ['hello','mate','my','name']
4 ['hello','hello','my','name']
type(series_of_list)
pandas.core.series.Series
I would like to keep only the first two entries of each list, like so:
series_of_list.head()
0 ['hello','there']
1 ['hello','hi']
2 ['hello','howdy']
3 ['hello','mate']
4 ['hello','hello']
I have tried slicing it, series_of_list = series_of_list[:2], but doing so just returns the first two rows of the Series...
series_of_list.head()
0 ['hello','there','my','name']
1 ['hello','hi','my','name']
I have also tried .drop and other slicing, but the outcome is not what I want.
How can I keep only the first two items of each list across the entire pandas Series?
Thank you!
Use pandas.Series.apply() to apply the slicing to each element:
series_of_list = series_of_list.apply(lambda x: x[:2])
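A minimal runnable sketch of that one-liner, with sample data reconstructed from the question:

```python
import pandas as pd

series_of_list = pd.Series([
    ["hello", "there", "my", "name"],
    ["hello", "hi", "my", "name"],
])

# Slice each list element, not the Series itself
trimmed = series_of_list.apply(lambda x: x[:2])
print(trimmed.tolist())  # [['hello', 'there'], ['hello', 'hi']]
```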

column of list values to one flat list in Python

I have a pandas column with multiple string values in each row's list; I want to flatten them into one list so that I can count the values.
df.columnX
Row 1 ['A','B','A','C']
Row 2 ['A','C']
Row 3 ['D','A']
I want output like
Tag Count
A 4
B 1
C 2
D 1
When I try to pull them into a list, they come through as quoted strings -
df.columnX.values = ["'A','B',,,,,,,,,'A'"]
Thanks in advance
What about this?
df.explode('columnX').columnX.value_counts().to_frame()
Note that you need pandas >= 0.25.0 for explode to work.
If your lists are in fact strings, you can first convert them to lists (as suggested by @Jon Clements):
import ast
df.columnX = df.columnX.map(ast.literal_eval)
I got it:
import ast
flatList = [item for sublist in df.ColumnX.map(ast.literal_eval) for item in sublist]
dict((x, flatList.count(x)) for x in set(flatList))
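For comparison, collections.Counter performs the same count in a single pass (a sketch assuming the column already holds real Python lists rather than strings):

```python
from collections import Counter

import pandas as pd

# Sample data reproduced from the question
df = pd.DataFrame({"columnX": [["A", "B", "A", "C"], ["A", "C"], ["D", "A"]]})

# Flatten the column of lists and count every occurrence of each tag
counts = Counter(item for sublist in df["columnX"] for item in sublist)
print(dict(counts))  # {'A': 4, 'B': 1, 'C': 2, 'D': 1}
```

This avoids the quadratic cost of calling flatList.count(x) once per distinct tag.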

How to collapse the values of a Series where the values are a list into a unique list

Given a Pandas Series like the one below:
0 [ID01]
1 [ID02]
2 [ID05, ID08]
3 [ID09, ID56, ID32]
4 [ID03]
The objective is to get a single list like the one below:
[ID01, ID02, ID05, ID08, ID09, ID56, ID32, ID03]
How do you achieve that in a Pythonic way?
Assuming that is a pandas.Series object
Option 1
Full list
np.concatenate(s).tolist()
Option 1.1
Unique list
np.unique(np.concatenate(s)).tolist()
Option 2
Works if elements are lists. Doesn't work if they are numpy arrays.
Full list
s.sum()
Option 2.1
Unique list
pd.unique(s.sum()).tolist()
Option 3
Full list
[x for y in s for x in y]
Option 3.1
Unique list (thanks @pault)
list({x for y in s for x in y})
@Wen's Option
list(set.union(*map(set, s)))
Setup
s = pd.Series([
    ['ID01'],
    ['ID02'],
    ['ID05', 'ID08'],
    ['ID09', 'ID56', 'ID32'],
    ['ID03']
])
s
0 [ID01]
1 [ID02]
2 [ID05, ID08]
3 [ID09, ID56, ID32]
4 [ID03]
dtype: object
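With that setup, the list-comprehension options can be sanity-checked (a small sketch reproducing the setup so it runs standalone):

```python
import pandas as pd

s = pd.Series([
    ['ID01'],
    ['ID02'],
    ['ID05', 'ID08'],
    ['ID09', 'ID56', 'ID32'],
    ['ID03'],
])

full = [x for y in s for x in y]          # Option 3: flat list, order preserved
unique = list({x for y in s for x in y})  # Option 3.1: unique, set order not guaranteed
print(full)
# ['ID01', 'ID02', 'ID05', 'ID08', 'ID09', 'ID56', 'ID32', 'ID03']
```

Since every ID in this sample appears once, the unique list contains the same eight elements, just in arbitrary order.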

Remove empty lists in pandas series

I have a long series like the following:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
In [151]: series
Out[151]:
0 [(1, 2)]
1 [(3, 5)]
2 []
3 [(3, 5)]
dtype: object
I want to remove all entries with an empty list. For some reason, boolean indexing does not work.
The following tests both give the same error:
series == [[(1,2)]]
series == [(1,2)]
ValueError: Arrays were different lengths: 4 vs 1
This is very strange, because in the simple example below, indexing works just like above:
In [146]: pd.Series([1,2,3]) == [3]
Out[146]:
0 False
1 False
2 True
dtype: bool
P.S. Ideally, I'd also like to split the tuples in the series into a DataFrame of two columns.
You could check to see if the lists are empty using str.len():
series.str.len() == 0
and then use this boolean series to remove the rows containing empty lists.
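Putting both steps together, a brief sketch on the sample series from the question:

```python
import pandas as pd

series = pd.Series([[(1, 2)], [(3, 5)], [], [(3, 5)]])

# .str.len() gives the length of each list; keep only non-empty rows
filtered = series[series.str.len() > 0]
print(filtered.tolist())  # [[(1, 2)], [(3, 5)], [(3, 5)]]
```

Note that the row with the empty list (index 2) is dropped while the original index labels of the remaining rows are kept.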
If each of your entries is a list containing a two-tuple (or else empty), you could create a two-column DataFrame by using the str accessor twice (once to select the first element of the list, then to access the elements of the tuple):
pd.DataFrame({'a': series.str[0].str[0], 'b': series.str[0].str[1]})
Missing entries default to NaN with this method.
Using the built in apply you can filter by the length of the list:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
series = series[series.apply(len) > 0]
Your series is in a bad state -- having a Series of lists of tuples of ints
buries the useful data, the ints, inside too many layers of containers.
However, to form the desired DataFrame, you could use
df = series.apply(lambda x: pd.Series(x[0]) if x else pd.Series(dtype=float)).dropna()
which yields
     0    1
0  1.0  2.0
1  3.0  5.0
3  3.0  5.0
A better way would be to avoid building the malformed series altogether and
form df directly from the data:
data = [[(1,2)],[(3,5)],[],[(3,5)]]
data = [pair for row in data for pair in row]
df = pd.DataFrame(data)
