pandas - number of unique rows occurrences in dataframe - python

How can I count the number of occurrences of each unique row in a DataFrame?
data = {'x1': ['A','B','A','A','B','A','A','A'], 'x2': [1,3,2,2,3,1,2,3]}
df = pd.DataFrame(data)
df
x1 x2
0 A 1
1 B 3
2 A 2
3 A 2
4 B 3
5 A 1
6 A 2
7 A 3
And I would like to obtain
x1 x2 count
0 A 1 2
1 A 2 3
2 A 3 1
3 B 3 2

IIUC you can group on both columns and take the group sizes, naming the resulting column count:
In [100]:
df.groupby(['x1','x2']).size().reset_index(name='count')
Out[100]:
x1 x2 count
0 A 1 2
1 A 2 3
2 A 3 1
3 B 3 2
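Since pandas 1.1 you can also get this in a single call with DataFrame.value_counts (my addition, not part of the original answer); it sorts by count descending, so re-sort on the keys to match the output above:
df.value_counts().reset_index(name='count').sort_values(['x1', 'x2'], ignore_index=True)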

You could also count the distinct rows by dropping duplicates:
In [4]: df.shape[0]
Out[4]: 8
In [5]: df.drop_duplicates().shape[0]
Out[5]: 4

There are two ways you can find the unique rows in your dataframe.
1st: Using drop_duplicates
df.drop_duplicates().sort_values('x1', ignore_index=True)
2nd: Using groupby.nunique
df.groupby(['x1','x2'], as_index=False).nunique()
For counting the number of occurrences, the answer from @EdChum above works precisely.

Related

How to check if all possible combinations of columns exist in dataframe (Pandas)?

I have the following dataframe
A B ...
0 1 1
1 1 2
2 1 3
0 2 1
1 2 2
2 2 3
And I would like to check if the dataframe is a complete combination of the entries in each column. In the above dataframe this is the case: A = {1,2}, B = {1,2,3}, and the dataframe contains all possible combinations. The following example would result in False.
A B
0 1 1
1 1 2
0 2 1
The number of columns should be flexible.
Many thanks for your help!
df = pd.DataFrame({'A': [1,1,1,2,2,2],
                   'B': [1,2,3,1,2,3]})
Create a data frame with all combinations of unique values in all columns (product comes from itertools):
from itertools import product

uniques = [df[i].unique().tolist() for i in df.columns]
df_combo = pd.DataFrame(product(*uniques), columns=df.columns)
print(df_combo)
A B
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
Test if the two dataframes are identical (note that equals is order-sensitive, so the rows must line up in the same order):
df.equals(df_combo)
True
For the False scenario,
df = pd.DataFrame({'A': [1,1,2],
                   'B': [1,2,1]})
df_combo
A B
0 1 1
1 1 2
2 2 1
3 2 2
df.equals(df_combo)
False
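An alternative sketch of my own (the helper name is hypothetical): a frame contains every combination exactly when its number of distinct rows equals the product of the per-column unique counts, so you can skip building df_combo entirely:
import math

def is_full_combination(df):
    # Every row draws its values from the per-column unique sets, so the
    # distinct rows are always a subset of the full Cartesian product;
    # equal counts therefore mean the frame covers every combination.
    expected = math.prod(df[col].nunique() for col in df.columns)
    return len(df.drop_duplicates()) == expected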

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is to get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match column A against column B from df_2.
Columns A and B repeat their content after reaching 4. The order matters here, and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (pd.merge_asof(left=df_1, right=df_2,
                           left_on='idx', right_on='idx',    # match on position
                           left_by='A', right_by='B',        # within equal key values
                           direction='backward', tolerance=2)
             .dropna()                                       # drop rows with no match within tolerance
             .drop(labels='idx', axis='columns')
             .reset_index(drop=True))
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
                       left_on=['idx', 'A'], right_on=['idx', 'B'])
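One caveat worth spelling out (my reading, not from the answerer): this plain merge only pairs rows whose keys coincide at the exact same idx, so on the sample frames it returns two rows rather than the four in df_result; the merge_asof from the question's edit is what tolerates the positional offset:
import pandas as pd

df_1 = pd.DataFrame({'idx': range(6), 'A': [1, 2, 3, 4, 1, 2], 'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': range(6), 'B': [1, 2, 4, 2, 3, 1], 'Y': list('HIJKLM')})

# Only idx 0 and idx 1 have A == B at the same position:
df_1.merge(df_2, left_on=['idx', 'A'], right_on=['idx', 'B'])
#    idx  A  X  B  Y
# 0    0  1  A  1  H
# 1    1  2  B  2  I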

delete pandas dataframe row if every value is equal

If I have a pandas dataframe which has a row containing float values and all the values are equal in the row, how do I delete that row from the dataframe?
Use DataFrame.nunique to count the number of unique values per row, then Series.ne to filter out the constant rows with boolean indexing:
df1 = df[df.nunique(axis=1).ne(1)]
Or compare all values against the first column and keep rows with at least one mismatch, using DataFrame.any:
df1 = df[df.ne(df.iloc[:, 0], axis=0).any(axis=1)]
EDIT: If you want to remove all rows and all columns with identical values, the solution should also test the columns, using loc and axis=0:
df = pd.DataFrame({
    'B': [4,4,4,4,4,4],
    'C': [4,4,9,4,2,3],
    'D': [4,4,5,7,1,0],
})
print (df)
B C D
0 4 4 4
1 4 4 4
2 4 9 5
3 4 4 7
4 4 2 1
5 4 3 0
df2 = df.loc[df.nunique(axis=1).ne(1), df.nunique(axis=0).ne(1)]
And for the second solution:
df2 = df.loc[df.ne(df.iloc[:, 0], axis=0).any(axis=1),
             df.ne(df.iloc[0], axis=1).any(axis=0)]
print (df2)
C D
2 9 5
3 4 7
4 2 1
5 3 0
You can use DataFrame.diff over axis=1 (per row):
# Example dataframe:
df = pd.DataFrame({'Col1': [1,2,3],
                   'Col2': [2,2,5],
                   'Col3': [4,2,9]})
Col1 Col2 Col3
0 1 2 4
1 2 2 2 # <-- row with all same values
2 3 5 9
df[df.diff(axis=1).fillna(0).ne(0).any(axis=1)]
Col1 Col2 Col3
0 1 2 4
2 3 5 9
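A limitation worth flagging (my note, not from the original answer): diff only works on numeric data, so on frames with string columns the nunique approach above is the safer choice:
# df_mixed.diff(axis=1) would raise a TypeError on these object columns,
# but nunique(axis=1) handles them fine:
df_mixed = pd.DataFrame({'a': ['x', 'x'], 'b': ['x', 'y']})
df_mixed[df_mixed.nunique(axis=1).ne(1)]   # keeps only the second row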

Pandas Series with column names for each value above a minimum

I am trying to get a new series from a DataFrame. This series should contain, for each row of the DataFrame, the name of the left-most column whose value is above some minimum, like this:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 6)), columns=list('ABCDEF'))
>>> df
A B C D E F
0 2 4 6 8 8 4
1 2 0 9 7 7 1
2 1 7 7 7 3 0
3 5 4 4 0 1 7
4 9 6 1 5 1 5
min = 3
Expected Output:
0 B
1 C
2 B
3 A
4 A
dtype: object
Here the output's row 0 is "B" because in the DataFrame row with index 0, column "B" is the left-most column whose value is greater than or equal to min = 3.
I know that I can use df.idxmin(axis=1) to get the column names of the minimum for each row, but I have no clue at all how to tackle this more complex problem.
Thanks for any help or hints!
UPDATE - index of the first element in each row satisfying the condition:
A more elegant and more efficient version from @DSM:
In [156]: (df>=3).idxmax(1)
Out[156]:
0 B
1 C
2 B
3 A
4 A
dtype: object
my version:
In [149]: df[df>=3].apply(lambda x: x.first_valid_index(), axis=1)
Out[149]:
0 B
1 C
2 B
3 A
4 A
dtype: object
Old answer - index of the minimum element for each row:
In [27]: df[df>=3].idxmin(1)
Out[27]:
0 E
1 A
2 C
3 C
4 F
dtype: object
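One edge case worth noting (my addition, not from the answers above): if a row has no value meeting the threshold, (df>=3).idxmax(1) still returns the first column label, because the maximum of an all-False row sits at position 0. A minimal sketch that masks such rows instead:
import numpy as np

mask = df >= 3
result = mask.idxmax(axis=1)
result[~mask.any(axis=1)] = np.nan   # rows with no qualifying value get NaN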

How to select cells greater than a value in a multi-index Pandas dataframe?

Try 1:
df[df > 1.0]: this returned all cells as NaN.
Try 2:
df.loc[df > 1.0]: this returned KeyError: 0
df[df['A'] > 1.0]: this works, but I want to apply the filter condition to all columns.
If what you are trying to do is to select only the rows where any one column meets the condition, you can use DataFrame.any() with axis=1 (to reduce across the columns of each row). Example -
In [3]: df
Out[3]:
A B C
0 1 2 3
1 3 4 5
2 3 1 4
In [6]: df[(df <= 2).any(axis=1)]
Out[6]:
A B C
0 1 2 3
2 3 1 4
Alternatively, if you are trying to filter rows where all columns meet the condition, use .all() in place of .any(). Example of all -
In [8]: df = pd.DataFrame([[1,2,3],[3,4,5],[3,1,4],[1,2,1]],columns=['A','B','C'])
In [9]: df
Out[9]:
A B C
0 1 2 3
1 3 4 5
2 3 1 4
3 1 2 1
In [11]: df[(df <= 2).all(axis=1)]
Out[11]:
A B C
3 1 2 1
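Tying this back to the question's original condition df > 1.0 (the sample frame here is my own):
df = pd.DataFrame([[1.5, 2.0, 3.0],
                   [0.5, 4.0, 5.0],
                   [3.0, 1.5, 4.0]], columns=['A', 'B', 'C'])

df[(df > 1.0).all(axis=1)]   # rows where every column exceeds 1.0
df[(df > 1.0).any(axis=1)]   # rows where at least one column exceeds 1.0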
