nunique compare two Pandas dataframe with duplicates and pivot them

nunique compare two Pandas dataframe with duplicates and pivot them - python

My input:
df1 = pd.DataFrame({'frame':[ 1,1,1,2,3,0,1,2,2,2,3,4,4,5,5,5,8,9,9,10,],
'label':['GO','PL','ICV','CL','AO','AO','AO','ICV','PL','TI','PL','TI','PL','CL','CL','AO','TI','PL','ICV','ICV'],
'user': ['user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1']})
df2 = pd.DataFrame({'frame':[ 1, 1, 2, 3, 4,0,1,2,2,2,4,4,5,6,6,7,8,9,10,11],
'label':['ICV','GO', 'CL','TI','PI','AO','GO','ICV','TI','PL','ICV','TI','PL','CL','CL','CL','AO','AO','PL','ICV'],
'user': ['user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2']})
df_c = pd.concat([df1,df2])
I trying compare two df, frame by frame, and check if label in df1 existing in same frame in df2. And make some calucation with result (pivot for example)
That my code:
m_df = df1.merge(df2,on=['frame'],how='outer' )
m_df['cross']=m_df.apply(lambda row: 'Matched'
if row['label_x']==row['label_y']
else 'Mismatched', axis='columns')
pv_m_unq= pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.nunique,fill_value=0,margins=True)
pv_mc = pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.count,fill_value=0,margins=True)
but this creates a some problem:
first, I can calqulate "simple" total (column All) of matched and missmatched as descipted in picture, or its "duplicated" as AO in pv_m or wrong number as in CL in pv_m_unq
and second, I think merge method as I use int not clever way, because I get if frame+label repetead in df(its happens often), in merged df I get number row in df1 X number of rows in df2 for this specific frame+label
I think maybe there is a smarter way to compare df and pivot them?

You got the unexpected result on margin total because the margin is making use the same function passed to aggfunc (i.e. pd.Series.nunique in this case) for its calculation and the values of Matched and Mismatched in these 2 rows are both the same as 1 (hence only one unique value of 1). (You are currently getting the unique count of frame id's)
Probably, you can achieve more or less what you want by taking the count on them (including margin, Matched and Mismatched) instead of the unique count of frame id's, by using pd.Series.count instead in the last line of codes:
pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.count, margins=True, fill_value=0)
Result
cross Matched Mismatched All
label_x
AO 0 1 1
CL 1 0 1
GO 1 1 2
ICV 1 1 2
PL 0 2 2
All 3 5 8
Edit
If all you need is to have the All column being the sum of Matched and Mismatched, you can do it as follows:
Change your code of generating pv_m_unq without building margin:
pv_m_unq= pd.pivot_table(m_df,
columns='cross',
index='label_x',
values='frame',
aggfunc=pd.Series.nunique,fill_value=0)
Then, we create the column All as the sum of Matched and Mismatched for each row, as follows:
pv_m_unq['All'] = pv_m_unq['Matched'] + pv_m_unq['Mismatched']
Finally, create the row All as the sum of Matched and Mismatched for each column and append it as the last row, as follows:
row_All = pd.Series({'Matched': pv_m_unq['Matched'].sum(),
'Mismatched': pv_m_unq['Mismatched'].sum(),
'All': pv_m_unq['All'].sum()},
name='All')
pv_m_unq = pv_m_unq.append(row_All)
Result:
print(pv_m_unq)
Matched Mismatched All
label_x
AO 1 3 4
CL 1 2 3
GO 1 1 2
ICV 2 4 6
PL 1 5 6
TI 2 3 5
All 8 18 26

You can use isin() function like this:
df3 =df1[df1.label.isin(df2.label)]

Related

python / pandas: How to count each cluster of unevenly distributed distinct values in each row

I am transitioning from excel to python and finding the process a little daunting. I have a pandas dataframe and cannot find how to count the total of each cluster of '1's' per row and group by each ID (example data below).
ID 20-21 19-20 18-19 17-18 16-17 15-16 14-15 13-14 12-13 11-12
0 335344 0 0 1 1 1 0 0 0 0 0
1 358213 1 1 0 1 1 1 1 0 1 0
2 358249 0 0 0 0 0 0 0 0 0 0
3 365663 0 0 0 1 1 1 1 1 0 0
The result of the above in the format
ID
LastColumn Heading a '1' occurs: count of '1's' in that cluster
would be:
335344
16-17: 3
358213
19-20: 2
14-15: 4
12-13: 1
365663
13-14: 5
There are more than 11,000 rows of data I would like to output the result to a txt file. I have been unable to find any examples of how the same values are clustered by row, with a count for each cluster, but I am probably not using the correct python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.

First step is use DataFrame.set_index with DataFrame.stack for reshape. Then create consecutive groups by compare for not equal Series.shifted values with cumulative sum by Series.cumsum to new column g. Then filter rows with only 1 and aggregate by named aggregation by GroupBy.agg with GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
.agg(a=('level_1','last'), b=('level_1','size'))
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
ID a b
0 335344 16-17 3
1 358213 19-20 2
2 358213 14-15 4
3 358213 12-13 1
4 365663 13-14 5
Last for write to txt use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If need your custom format in text file use:
with open("file.txt","w") as f:
for i, g in df1.groupby('ID'):
f.write(f"{i}\n")
for a, b in g[['a','b']].to_numpy():
f.write(f"\t{a}: {b}\n")

You just need to use the sum method and then specify which axis you would like to sum on. To get the sum of each row, create a new series equal to the sum of the row.
# create new series equal to sum of values in the index row
df['sum'] = df.sum(axis=1) # specifies index (row) axis
The best method for getting the sum of each column is dependent on how you want to use that information but in general the core is just to use the sum method on the series and assign it to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add them to a dictionary. Here is a quick function I put together that should get the sum of each column and return a dictionary of the results so you know which total belongs to which column. The two inputs are 1) the dataframe 2) a list of any column names you would like to ignore
def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
"""Get the sum of each column in a dataframe in a dictionary"""
# get list of headers in dataframe
dfcols = frame.columns.tolist()
# create a blank dictionary to store results
dfsums = {}
# loop through each column and append sum to list
for dfcol in dfcols:
if dfcol not in ignore:
dfsums.update({dfcol: frame[dfcol].sum()})
return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16':
2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
Sources: Sum by row, Pandas Sum, Add pairs to dictionary

Pandas: add number of unique values to other dataset (as shown in picture):

I need to add the number of unique values in column C (right table) to the related row in the left table based on the values in common column A (as shown in the picture):
thank you in advance

Groupby column A in second dataset and calculate count of each unique value in column C. merge it with first dataset on column A. Rename column C to C-count if needed:
>>> count_df = df2.groupby('A', as_index=False).C.nunique()
>>> output = pd.merge(df1, count_df, on='A')
>>> output.rename(columns={'C':'C-count'}, inplace=True)
>>> output
A B C-count
0 2 22 3
1 3 23 2
2 5 21 1
3 1 24 1
4 6 21 1

Use DataFrameGroupBy.nunique with Series.map for new column in df1:
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())

This may not be the most effective way of doing this, so if your databases are too big be careful.
Define the following function:
def c_value(a_value, right_table):
c_ids = []
for index, row in right_table.iterrows():
if row['A'] == a_value:
if row['C'] not in c_ids:
c_ids.append(row['C'])
return len(c_ids)
For this function I'm supposing that the right_table is a pandas.Dataframe.
Now, you do the following to build the new column (assuming that the left table is a pandas.Dataframe):
new_column = []
for index, row in left_table.iterrows():
new_column.append(c_value(row['A'],right_table))
left_table["C-count"] = new_column
After this, the left_table Dataframe should be the one dessired (as far as I understand what you need).

python panda new column with order of values

I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 ( index of the lowest numbers ) but I would like to have a new column with 2,1,4,0,3,5 ( so if I look at a row i get the list and I get in what order this number comes in the total list. what am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)

What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine. df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?

The code will take care of having the columns retained and getting the sum.
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')

Setup:
X = np.random.choice([0,1,2], 20)
Y = np.random.uniform(2,10,20)
df = pd.DataFrame({'task_id':X, 'duration':Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg(sum).reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y'])*100
df.drop('duration_y', axis=1) # Drops sum duration, remove this line if you want to see it.
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
disclaimer: All data is randomly generated in setup, however, calculations are straightforward and should be correct for any case.

I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)
# find each task_id as relative percentage of whole
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
my df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, 1 more column.
something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
this last one works with some aggregating functions but not others. the following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
in R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!

df.groupby(['track', 'type'])['id'].transform(nunique)
Implies that there is a name nunique in the name space that performs some function. transform will take a function or a string that it knows a function for. nunique is definitely one of those strings.
As pointed out by #root, often the method that pandas will utilize to perform a transformation indicated by these strings are optimized and should generally be preferred to passing your own functions. This is True even for passing numpy functions in some cases.
For example transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

nunique compare two Pandas dataframe with duplicates and pivot them - python

You can use isin() function like this: df3 =df1[df1.label.isin(df2.label)]

Related

python / pandas: How to count each cluster of unevenly distributed distinct values in each row

Pandas: add number of unique values to other dataset (as shown in picture):

python panda new column with order of values

How can I keep all columns in a dataframe, plus add groupby, and sum?

Create a column based on multiple column distinct count pandas [duplicate]

Categories

Resources