I have a dataset of 60+ computers, where each column is a computer and the rows are the software installed on that PC. I want to count each unique value (software) so I can see how many copies of each are currently installed.
import pandas as pd

data = [['a','a','c'],['a','b','d'],['a','c','c']]
df = pd.DataFrame(data, columns=['col1','col2','col3'])
df
  col1 col2 col3
0    a    a    c
1    a    b    d
2    a    c    c
I expect the following output (d also appears once):
a 4
b 1
c 3
d 1
value_counts after melt — melt reshapes every column into one long value column, so a single value_counts covers all cells:
df.melt().value.value_counts()
Out[648]:
a 4
c 3
b 1
d 1
Name: value, dtype: int64
numpy.unique for a speed-up:

import numpy as np

# np.unique returns (values, counts); reversing the tuple gives pd.Series(counts, index=values)
pd.Series(*np.unique(df.values.ravel(), return_counts=True)[::-1])
Out[653]:
a 4
b 1
c 3
d 1
dtype: int64
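If you want to verify the speed-up on data shaped like yours, a quick timeit sketch (big is a hypothetical 1000-row, 60-column frame; timings vary by machine and dtype):

import timeit
import numpy as np
import pandas as pd

big = pd.DataFrame(np.random.choice(list('abcd'), size=(1000, 60)))

print(timeit.timeit(lambda: big.melt()['value'].value_counts(), number=100))
print(timeit.timeit(lambda: pd.Series(*np.unique(big.values.ravel(), return_counts=True)[::-1]), number=100))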
Related
I have a sorted Series. Is there a simple way to change it from
A 0.064467
B 0.042283
C 0.037581
D 0.017410
dtype: float64
to
A 1
B 2
C 3
D 4
You can just use rank (cast to int, since rank returns floats):
df.rank(ascending=False).astype(int)
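A minimal sketch on the values above (assuming the Series is named s):

s = pd.Series([0.064467, 0.042283, 0.037581, 0.017410], index=list('ABCD'))
s.rank(ascending=False).astype(int)
#A    1
#B    2
#C    3
#D    4
#dtype: int64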
I have the following Pandas dataframe:
name1 name2
A B
A A
A C
A A
B B
B A
I want to add a column named new that counts occurrences across name1 and name2 combined, keeping the distinct values from both columns. Hence, the expected output is the following dataframe:
name new
A 7
B 4
C 1
I've tried
df.groupby(["name1"]).count().groupby(["name2"]).count(), among many other things, but although that last one seems to give me the correct results, I can't get the joined datasets.
You can use value_counts with df.stack():
df[['name1','name2']].stack().value_counts()
#df.stack().value_counts() for all cols
A 7
B 4
C 1
Specifically:
(df[['name1','name2']].stack().value_counts()
   .to_frame('new').rename_axis('name').reset_index())
name new
0 A 7
1 B 4
2 C 1
Let us try melt
df.melt().value.value_counts()
Out[17]:
A 7
B 4
C 1
Name: value, dtype: int64
Alternatively,
df.name1.value_counts().add(df.name2.value_counts(), fill_value=0).astype(int)
gives you
A 7
B 4
C 1
dtype: int64
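The fill_value=0 matters here: C never appears in name2, so without it the index-aligned addition would leave NaN for C, and the astype(int) cast would then raise.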
Using Series.append with Series.value_counts (note: Series.append was removed in pandas 2.0; see the pd.concat equivalent below):
df['name1'].append(df['name2']).value_counts()
A 7
B 4
C 1
dtype: int64
value_counts converts the aggregated column to index. To get your desired output, use rename_axis with reset_index:
df['name1'].append(df['name2']).value_counts().rename_axis('name').reset_index(name='new')
name new
0 A 7
1 B 4
2 C 1
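In pandas 2.0 and later, the same result with pd.concat:

pd.concat([df['name1'], df['name2']]).value_counts().rename_axis('name').reset_index(name='new')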
Python's Counter is another solution:
from collections import Counter
s = pd.Series(Counter(df.to_numpy().flatten()))
In [1325]: s
Out[1325]:
A 7
B 4
C 1
dtype: int64
Given the following df
id val1 val2 val3
0 1 A A B
1 1 A B B
2 1 B C NaN
3 1 NaN B D
4 2 A D NaN
I would like to sum the value counts within each id group across all columns; however, values repeated within the same row should be counted only once, so the expected output is:
id
1 B 4
A 2
C 1
D 1
2 A 1
D 1
I can accomplish this with
import pandas as pd
df.set_index('id').apply(lambda x: list(set(x)), axis=1).apply(pd.Series).stack().groupby(level=0).value_counts()
but the apply(... axis=1) (and perhaps the apply(pd.Series)) really kills performance on large DataFrames. Since I have a small number of columns, I guess I could check for all pairwise duplicates, replace one with np.nan, and then just use df.set_index('id').stack().groupby(level=0).value_counts(), but that doesn't seem like the right approach when the number of columns gets large.
Any ideas on a faster way around this?
Here are the missing steps that remove row duplicates from your dataframe:
nodups = df.stack().reset_index(level=0).drop_duplicates()
nodups = nodups.set_index(['level_0', nodups.index]).unstack()
nodups.columns = nodups.columns.levels[1]
# id val1 val2 val3
#level_0
#0 1 A None B
#1 1 A B None
#2 1 B C None
#3 1 None B D
#4 2 A D None
Now you can follow with:
nodups.set_index('id').stack().groupby(level=0).value_counts()
Perhaps you can further optimize the code.
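For reference, the same per-row deduplication can be written as one vectorized chain (a sketch along the same lines; level_0 is the original row label that reset_index produces):

stacked = df.set_index('id', append=True).stack().reset_index(name='val')
stacked = stacked.drop_duplicates(subset=['level_0', 'val'])
stacked.groupby('id')['val'].value_counts()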
I am using get_dummies:
s = (df.set_index('id', append=True)
       .stack()
       .str.get_dummies()
       .sum(level=[0, 1])  # in pandas >= 2.0: .groupby(level=[0, 1]).sum()
       .gt(0)
       .sum(level=1)       # in pandas >= 2.0: .groupby(level=1).sum()
       .stack()
       .astype(int))
s[s.gt(0)]
Out[234]:
id
1 A 2
B 4
C 1
D 1
2 A 1
D 1
dtype: int32
From a simple dataframe like this in PySpark:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
C 1 6
I would like to add rows so that every value of col1 is paired with every value of col2, with the count column filled with 0 for combinations missing from the original data. The result would look like this:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
B 2 0
B 3 0
C 1 6
C 2 0
C 3 0
Do you have any idea how to do that efficiently?
You're looking for crossJoin.

from pyspark.sql.functions import col

# all distinct combinations of col1 and col2
all_combinations = df.select('col1').distinct().crossJoin(df.select('col2').distinct())

# left-join the original counts back; combinations missing from df get 0
result = (all_combinations.alias('a')
          .join(df.alias('b'),
                on=(col('a.col1') == col('b.col1')) & (col('a.col2') == col('b.col2')),
                how='left')
          .select('a.col1', 'a.col2', 'b.count')
          .fillna(0, subset=['count']))
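Taking distinct values of each column before the crossJoin keeps the combination table at |col1| x |col2| rows instead of squaring the whole dataset, which is what keeps this workable on larger frames.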
In pandas and Python:
I have a large dataset of health records where patients have records of diagnoses.
How can I display the most frequent diagnoses, but count at most one occurrence of the same diagnosis per patient?
Example ('pid' is patient id. 'code' is the code of a diagnosis):
IN:
pid code
1 A
1 B
1 A
1 A
2 A
2 A
2 B
2 A
3 B
3 C
3 D
4 A
4 A
4 A
4 B
OUT:
B 4
A 3
C 1
D 1
I would like to be able to use .isin and .index if possible.
Example:
Remove all rows with less than 3 in frequency count in column 'code'
s = df['code'].value_counts().ge(3)
df = df[df['code'].isin(s[s].index)]
You can use groupby + nunique:
df.groupby(by='code').pid.nunique().sort_values(ascending=False)
Out[60]:
code
B 4
A 3
D 1
C 1
Name: pid, dtype: int64
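An equivalent route (a sketch along the same lines) is to drop duplicate (pid, code) pairs first and then take a plain value_counts:

df.drop_duplicates(['pid', 'code'])['code'].value_counts()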
To remove all rows whose code appears for fewer than 3 distinct patients:
df.groupby(by='code').filter(lambda x: x.pid.nunique()>=3)
Out[55]:
pid code
0 1 A
1 1 B
2 1 A
3 1 A
4 2 A
5 2 A
6 2 B
7 2 A
8 3 B
11 4 A
12 4 A
13 4 A
14 4 B
Since you mention value_counts:
df.groupby('code').pid.value_counts().count(level=0)
Out[42]:
code
A 3
B 4
C 1
D 1
Name: pid, dtype: int64
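Note that the level= keyword of count was removed in pandas 2.0; there the equivalent spelling is:

df.groupby('code').pid.value_counts().groupby(level=0).count()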
You should be able to use groupby and nunique() to obtain a distinct count of the patients that had each diagnosis. This should give you the result you need:
df[['pid', 'code']].groupby(['code']).nunique()
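On the sample data this returns the distinct-patient count per code as a one-column frame (add .sort_values('pid', ascending=False) if you want the order shown above):

      pid
code
A       3
B       4
C       1
D       1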