All possibilities with groupby and value_counts(), issue with MultiIndex - python

I have a table that looks like this:
ACCOUNT_ID | OPTION
1          | A
2          | A
2          | B
2          | B
2          | C
I want to count the occurrences of each OPTION for every account, so I ran df.groupby(['ACCOUNT_ID'])['OPTION'].value_counts() and the result looks like this:
ACCOUNT_ID  OPTION
1           A    1
2           A    1
2           B    2
2           C    1
This works well, but I want every possible option to be shown (so A, B and C counts for each ACCOUNT_ID), like:
ACCOUNT_ID  OPTION
1           A    1
1           B    0
1           C    0
2           A    1
2           B    2
2           C    1
I found this response, using .sort_index().reindex(uniques, fill_value = 0), which looks great, but doesn't work since I am using a MultiIndex.
Any tips would be amazing!!

One solution is to unstack the inner level of the MultiIndex into columns. This gives you a DataFrame whose columns have float64 dtype, with NaN values for missing combinations of ACCOUNT_ID and OPTION. Fill NaNs with 0, convert back to integer dtype with astype, and stack the columns back into the index to recreate the MultiIndex:
counts = df.groupby(['ACCOUNT_ID'])['OPTION'].value_counts()
counts.unstack().fillna(0).astype(int).stack()
ACCOUNT_ID OPTION
1 A 1
B 0
C 0
2 A 1
B 2
C 1
dtype: int64
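For a self-contained version with the sample data: unstack also accepts a fill_value argument, which avoids the NaN/float round-trip and the astype call entirely (a minimal sketch; the variable names counts and result are just illustrative):

import pandas as pd

df = pd.DataFrame({'ACCOUNT_ID': [1, 2, 2, 2, 2],
                   'OPTION': ['A', 'A', 'B', 'B', 'C']})

counts = df.groupby('ACCOUNT_ID')['OPTION'].value_counts()
# fill_value=0 fills the missing (ACCOUNT_ID, OPTION) combinations directly,
# so the result stays integer and no fillna/astype is needed
result = counts.unstack(fill_value=0).stack()
print(result)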

Related

How to drop conflicted rows in a DataFrame?

I have a classification task, so conflicting rows (the same feature but different labels) harm performance.
idx feature label
0 a 0
1 a 1
2 b 0
3 c 1
4 a 0
5 b 0
How could I get a formatted dataframe like the one below?
idx feature label
2 b 0
3 c 1
5 b 0
DataFrame.duplicated() only flags the duplicated rows, and it seems that logical operations between df["feature"].duplicated() and df.duplicated() do not return the result I want.
I think you need the rows with only one unique label per group, so use GroupBy.transform with DataFrameGroupBy.nunique, compare to 1, and filter with boolean indexing:
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print (df)
idx feature label
2 2 b 0
3 3 c 1
5 5 b 0
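A self-contained sketch of the same idea, assuming the sample data from the question:

import pandas as pd

df = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5],
                   'feature': ['a', 'a', 'b', 'c', 'a', 'b'],
                   'label': [0, 1, 0, 1, 0, 0]})

# keep only features whose label is consistent across all of their rows
mask = df.groupby('feature')['label'].transform('nunique').eq(1)
print(df[mask])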

Pandas - Create a symmetric matrix that counts the number of records

I have a dataframe that looks like the one below
ID | Value
1 | A
1 | B
1 | C
2 | B
2 | C
I want to create a symmetric matrix based on Value:
A B C
A 1 1 1
B 1 2 2
C 1 2 2
Each entry (v1, v2) indicates how many IDs have both values. I am currently using for loops to scan the dataframe for every combination, but I was wondering if there is an easier way to do it using pandas.
Use merge for a self-join on the ID column (a cross join within each ID), then crosstab, and DataFrame.rename_axis to remove the index and column names:
df = pd.merge(df, df, on='ID')
df = pd.crosstab(df['Value_x'], df['Value_y']).rename_axis(None).rename_axis(None, axis=1)
print (df)
A B C
A 1 1 1
B 1 2 2
C 1 2 2
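A minimal runnable sketch with the sample data from the question:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Value': ['A', 'B', 'C', 'B', 'C']})

# self-join on ID: every pair of values sharing an ID becomes one row
pairs = pd.merge(df, df, on='ID')
# count co-occurrences and drop the axis names for a clean matrix
matrix = (pd.crosstab(pairs['Value_x'], pairs['Value_y'])
            .rename_axis(None)
            .rename_axis(None, axis=1))
print(matrix)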

Is there an elegant way to remap group values into an incremental series in a pandas DataFrame?

Is there an elegant way to reassign group values to increasing ones?
I have a table which is already in order:
X = pandas.DataFrame([['a',2],['b',4],['ba',4],['c',8]],columns=['value','group'])
X
Out[18]:
value group
0 a 2
1 b 4
2 ba 4
3 c 8
But I would like to remap the group values so that they increase one by one. The end result would look like:
value group
0 a 1
1 b 2
2 ba 2
3 c 3
Using category or factorize
X.group.astype('category').cat.codes+1 # pd.factorize(X.group)[0]+1
Out[105]:
0 1
1 2
2 2
3 3
dtype: int8
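If you want to write the new numbering back into the frame, a small sketch with the same data (the factorize variant is shown; the category-codes version behaves the same here because the table is already sorted by group):

import pandas as pd

X = pd.DataFrame([['a', 2], ['b', 4], ['ba', 4], ['c', 8]],
                 columns=['value', 'group'])

# factorize assigns 0-based codes in order of first appearance, so +1
# yields the 1, 2, 2, 3 numbering from the question
X['group'] = pd.factorize(X['group'])[0] + 1
print(X)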

Is there a faster way to count values across multiple columns, excluding duplicated values on the same row?

Given the following df
id val1 val2 val3
0 1 A A B
1 1 A B B
2 1 B C NaN
3 1 NaN B D
4 2 A D NaN
I would like to sum the value counts within each id group across all columns; however, a value that appears more than once on the same row should only be counted once, so the expected output is:
id
1 B 4
A 2
C 1
D 1
2 A 1
D 1
I can accomplish this with
import pandas as pd
df.set_index('id').apply(lambda x: list(set(x)), axis=1).apply(pd.Series).stack().groupby(level=0).value_counts()
but the apply(...axis=1) (and perhaps apply(pd.Series)) really kills performance on large DataFrames. Since I have a small number of columns, I guess I could check all pairwise duplicates, replace one with np.NaN, and then just use df.set_index('id').stack().groupby(level=0).value_counts(), but that doesn't seem like the right approach when the number of columns gets large.
Any ideas on a faster way around this?
Here are the missing steps that remove row duplicates from your dataframe:
nodups = df.stack().reset_index(level=0).drop_duplicates()
nodups = nodups.set_index(['level_0', nodups.index]).unstack()
nodups.columns = nodups.columns.levels[1]
# id val1 val2 val3
#level_0
#0 1 A None B
#1 1 A B None
#2 1 B C None
#3 1 None B D
#4 2 A D None
Now you can follow with:
nodups.set_index('id').stack().groupby(level=0).value_counts()
Perhaps you can further optimize the code.
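Put together as a self-contained sketch with the sample data from the question (this relies on df.stack() dropping the NaN entries, which is the default behavior in the pandas versions this answer targets):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 1, 1, 2],
                   'val1': ['A', 'A', 'B', np.nan, 'A'],
                   'val2': ['A', 'B', 'C', 'B', 'D'],
                   'val3': ['B', 'B', np.nan, 'D', np.nan]})

# long form: one row per (original row, value); drop_duplicates then removes
# values repeated within the same original row
nodups = df.stack().reset_index(level=0).drop_duplicates()
nodups = nodups.set_index(['level_0', nodups.index]).unstack()
nodups.columns = nodups.columns.levels[1]

print(nodups.set_index('id').stack().groupby(level=0).value_counts())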
I am using get_dummies
s=df.set_index('id',append=True).stack().str.get_dummies().sum(level=[0,1]).gt(0).sum(level=1).stack().astype(int)
s[s.gt(0)]
Out[234]:
id
1 A 2
B 4
C 1
D 1
2 A 1
D 1
dtype: int32
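Note that the level argument of sum was later deprecated and removed from pandas, so on recent versions the same idea needs explicit groupby calls. A hedged, self-contained equivalent with the sample data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 1, 1, 2],
                   'val1': ['A', 'A', 'B', np.nan, 'A'],
                   'val2': ['A', 'B', 'C', 'B', 'D'],
                   'val3': ['B', 'B', np.nan, 'D', np.nan]})

s = (df.set_index('id', append=True).stack()
       .str.get_dummies()
       .groupby(level=[0, 1]).sum()  # collapse duplicates within each original row
       .gt(0)
       .groupby(level=1).sum()       # rows per id in which each value appears
       .stack()
       .astype(int))
print(s[s.gt(0)])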

Df.drop/delete duplicate rows

How can I drop the exact duplicates of a row? Say I have a data frame that looks like this:
A B C
1 2 3
3 2 2
1 2 3
My data frame is a lot larger than this, but is there a way to have Python look at every row and, if its values are exactly the same as another row's, drop or delete that row? I want to take the whole data frame into account; I don't want to specify the columns to get unique values for.
You can use the DataFrame.drop_duplicates() method:
In [23]: df
Out[23]:
A B C
0 1 2 3
1 3 2 2
2 1 2 3
In [24]: df.drop_duplicates()
Out[24]:
A B C
0 1 2 3
1 3 2 2
You can get a de-duplicated dataframe with the inverse of .duplicated:
df[~df.duplicated(['A','B','C'])]
Returns:
>>> df[~df.duplicated(['A','B','C'])]
A B C
0 1 2 3
1 3 2 2
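For completeness, a small sketch showing that drop_duplicates with no arguments already compares every column (keep='first' is the default), so the column list is only needed when you want to restrict the check to a subset:

import pandas as pd

df = pd.DataFrame({'A': [1, 3, 1], 'B': [2, 2, 2], 'C': [3, 2, 3]})

# all columns are compared by default; the first occurrence of each row is kept
print(df.drop_duplicates())
# equivalent to keeping the rows that .duplicated() does not flag
print(df[~df.duplicated()])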
