Group data, count unique values and append this value to row [duplicate] - python

I am trying to count the unique items within each group. In the code below I want the number of unique id_match values (101, 201, 26) for each demographic (A, B, C).
import pandas as pd
tst = pd.DataFrame({'demographic' : ['A', 'B', 'B', 'A', 'C', 'C'],
                    'id_match' : ['101', '101', '201', '201', '26', '26']})
tst['num_unq'] = tst.groupby('demographic')['id_match'].nunique()
Expected output
demographic id_match num_unq
1 A 101 2
2 B 101 2
3 B 201 2
4 A 201 2
5 C 26 1
6 C 26 1
However, instead of the expected output I simply get a column of NaNs. Does anyone know why this happens, and is there an alternative method?
Thanks J

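Use transform, which returns a result aligned to the original rows. The plain nunique() result is indexed by demographic (A, B, C), so assigning it as a column aligns on index labels that never match the row labels 0 to 5, and every value becomes NaN: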
import pandas as pd
tst = pd.DataFrame({'demographic' : ['A', 'B', 'B', 'A', 'C', 'C'],
                    'id_match' : ['101', '101', '201', '201', '26', '26']})
tst['num_unq'] = tst.groupby('demographic')['id_match'].transform('nunique')
print(tst)
Output
demographic id_match num_unq
0 A 101 2
1 B 101 2
2 B 201 2
3 A 201 2
4 C 26 1
5 C 26 1
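As a minimal sketch of why the direct assignment fails and how the group result can be broadcast back to the rows, the snippet below compares the two. The names per_group, misaligned, via_transform and via_map are illustrative and not from the original post.

```python
import pandas as pd

tst = pd.DataFrame({
    "demographic": ["A", "B", "B", "A", "C", "C"],
    "id_match": ["101", "101", "201", "201", "26", "26"],
})

# One value per group, indexed by the group key (A, B, C).
per_group = tst.groupby("demographic")["id_match"].nunique()

# Assigning per_group as a column aligns on index labels;
# the row labels 0..5 never match A/B/C, so everything is NaN.
misaligned = tst.assign(num_unq=per_group)["num_unq"]

# Two equivalent ways to broadcast the group result to every row.
via_transform = tst.groupby("demographic")["id_match"].transform("nunique")
via_map = tst["demographic"].map(per_group)

print(per_group)
print(via_transform)
```

Mapping the group key through per_group is useful when the per-group result is needed on its own as well; transform is the shorter route when only the row-aligned column is wanted.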

Related

Drop duplicate IDs keeping if value = certain value , otherwise keep first duplicate

>>> df = pd.DataFrame({'id': ['1', '1', '2', '2', '3', '4', '4', '5', '5'],
... 'value': ['keep', 'y', 'x', 'keep', 'x', 'Keep', 'x', 'y', 'x']})
>>> print(df)
id value
0 1 keep
1 1 y
2 2 x
3 2 keep
4 3 x
5 4 Keep
6 4 x
7 5 y
8 5 x
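In this example, the idea would be to keep index values 0, 3, 4, 5 since they are associated with a duplicate id with a particular value == 'Keep' and 7 (since it is the first of the duplicates for id 5).
In your case, try idxmax. On a boolean Series it returns, per group, the index of the first True, or the first row of the group when no row is True: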
out = df.loc[df['value'].eq('keep').groupby(df.id).idxmax()]
Out[24]:
id value
0 1 keep
3 2 keep
4 3 x
5 4 Keep
7 5 y
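Note that eq('keep') is case-sensitive, so the 'Keep' in row 5 does not match; that row is still selected only because it is the first row of its group. A hedged variant, assuming the match should ignore case, lowercases the values first. The names is_keep and keep_idx are illustrative, and the flag is cast to int only to keep idxmax on a plain numeric dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "1", "2", "2", "3", "4", "4", "5", "5"],
    "value": ["keep", "y", "x", "keep", "x", "Keep", "x", "y", "x"],
})

# 1 where the value is "keep" in any casing, 0 otherwise.
is_keep = df["value"].str.lower().eq("keep").astype(int)

# idxmax returns the label of the first maximum in each group:
# the first flagged row if there is one, otherwise the group's first row.
keep_idx = is_keep.groupby(df["id"]).idxmax()

out = df.loc[keep_idx]
print(out)
```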

How to label duplicated groups in a pandas dataframe

Based on this problem: find duplicated groups in dataframe and this dataframe
df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
'value2': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
'value3': ['1', '2', '3', '4', '1', '2', '1', '2', '3', '4', '1', '2', '3'],
})
How can I mark the different duplicate groups (in the value columns) in an additional column, duplicated, with a unique label, like "1" for one duplicated group, "2" for the next, and so on? I found examples here on Stack Overflow that flag them as True or False, and only one that uses ngroup, but it did not work.
My real example has 20+ columns, with NaNs in between. I created the wide format with pivot_table from the original long format, since I thought finding duplicated entries would be easier in wide format. Duplicates should be found across the N-1 columns other than the identifier column; I collect their names in subset with a list comprehension that excludes the identifier.
This is what I have so far:
df = df_long.pivot_table(index="Y",columns="Z",values="value").reset_index()
subset = [c for c in df.columns if not c=="id"]
df = df.loc[df.duplicated(subset=subset,keep=False)].copy()
We use pandas 0.22, if that matters.
The problem is that when I use
for i, group in df.groupby(subset):
    print(group)
I basically don't get any groups back.
Use groupby(...).ngroup(), as suggested by @Chris:
df['duplicated'] = df.groupby(df.filter(like='value').columns.tolist()).ngroup()
print(df)
# Output:
id value1 value2 value3 duplicated
0 A 1 1 1 0 # Group 0 (all 1)
1 A 2 2 2 1
2 A 3 3 3 2
3 A 4 4 4 3
4 B 1 1 1 0 # Group 0 (all 1)
5 B 2 2 2 1
6 C 1 1 1 0 # Group 0 (all 1)
7 C 2 2 2 1
8 C 3 3 3 2
9 C 4 4 4 3
10 D 1 1 1 0 # Group 0 (all 1)
11 D 2 2 2 1
12 D 3 3 3 2
OK, the last comment above was the correct hint: the NaNs in my real data are the problem, because groupby does not form groups from keys that contain NaN. Filling them with fillna() before calling groupby lets the groups be identified, and ngroup() then adds the group numbers.
df['duplicated'] = df.fillna(-1).groupby(df.filter(like='value').columns.tolist()).ngroup()
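A small self-contained sketch of the fillna-then-ngroup idea, on made-up data with NaNs (the frame below is illustrative, not the asker's real data, and -1 is assumed to be a sentinel that never occurs as a real value):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "B", "C", "D"],
    "value1": [1.0, 1.0, 2.0, np.nan],
    "value2": [np.nan, np.nan, 2.0, np.nan],
})

# All columns except the identifier take part in the comparison.
value_cols = [c for c in df.columns if c != "id"]

# Replace NaN with the sentinel so rows with missing values can form groups.
filled = df[value_cols].fillna(-1)

# ngroup numbers the groups in sorted key order, one label per row.
df["duplicated"] = filled.groupby(value_cols).ngroup()
print(df)
```

Rows A and B share a label because their value columns are identical once the NaNs are filled. On pandas 1.1 or later, groupby also accepts dropna=False, which may remove the need for a sentinel; that option is not available in pandas 0.22.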

Pandas match name to id

I have a data set where there are name and id columns. In theory the name should always correspond to the same id, but due to some system errors and data quality issues in practice this is not always the case.
Generally, the wrong ids occur at a negligible rate compared to the right ids. For example, there will be 1000 rows where the name 'a' and id '1' match, but 2 rows where the name is 'a' and the id is '7'.
So the logic to resolve the proper id would simply be to find the most frequently occurring id for each name.
d = {'id': ['1', '1', '2', '2',], 'name': ['a', 'a', 'a', 'b'], 'value': ['1', '2', '3', '4']}
df = pd.DataFrame(data=d)
print(df)
id name value
0 1 a 1
1 1 a 2
2 2 a 3
3 2 b 4
The first question is what is the best way to find the proper id for each name and drop the rows where the proper id does not occur, the result being the following:
id name value
0 1 a 1
1 1 a 2
2 2 b 4
The second part is, in the scenarios where the mismatched id is actually the id of another name, then fix the name to match the proper id, example output:
id name value
0 1 a 1
1 1 a 2
2 2 b 3
3 2 b 4
The actual data has thousands of names/ids, the example is just a simplification.
Here is my solution. It's a bit of a makeshift job, but it should work as a temporary solution.
d = {'id': ['1', '1', '2', '2', '2', '3','3', '4', '4'],
'name': ['a', 'a', 'a', 'b', 'b', 'b','c', 'c', 'c'],
'value': ['1', '2', '3', '4', '5', '6', '7', '8', '9']}
df = pd.DataFrame(data=d)
Following the raw DataFrame, without id changes:
id name value
0 1 a 1
1 1 a 2
2 2 a 3
3 2 b 4
4 2 b 5
5 3 b 6
6 3 c 7
7 4 c 8
8 4 c 9
Workflow:
# convert id and value from string to float
df['id'] = [float(id) for id in df['id']]
df['value'] = [float(value) for value in df['value']]
# extract the most frequent id for one name
def most_common(lst):
    return max(set(lst), key=lst.count)
count = dict()
for name in pd.unique(df['name']):
    temp = {name: most_common(list(df[df['name'] == name]['id']))}
    count.update(temp)
# replace every id that differs from the most frequent id for its name
replace = [[count[name], name] if id != count[name] else [id, name] for id, name in zip(df['id'],df['name'])]
df['id'] = [item[0] for item in replace]
df['name'] = [item[1] for item in replace]
output:
In [3]: count
Out[3]: {'a': 1.0, 'b': 2.0, 'c': 4.0}
In [1]: df
Out[1]:
id name value
0 1.0 a 1.0
1 1.0 a 2.0
2 1.0 a 3.0
3 2.0 b 4.0
4 2.0 b 5.0
5 2.0 b 6.0
6 4.0 c 7.0
7 4.0 c 8.0
8 4.0 c 9.0
This solution might not work if two different ids occur exactly the same number of times for the same name.
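The same idea can be sketched with groupby instead of explicit loops. This is an alternative formulation rather than the answer's code; the names proper_id, dropped and corrected are illustrative. Series.mode() returns the tied values in sorted order, so taking the first one at least makes ties deterministic, though not necessarily right.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "1", "2", "2", "2", "3", "3", "4", "4"],
    "name": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "value": ["1", "2", "3", "4", "5", "6", "7", "8", "9"],
})

# Most frequent id per name; on a tie, the smallest id is taken.
proper_id = df.groupby("name")["id"].agg(lambda s: s.mode().iloc[0])

# Option 1: drop the rows whose id is not the proper id for their name.
dropped = df[df["id"] == df["name"].map(proper_id)]

# Option 2: keep every row and overwrite the id with the proper one.
corrected = df.assign(id=df["name"].map(proper_id))

print(proper_id)
print(corrected)
```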

how to reorder of rows of a dataframe based on values in a column

I have a dataframe like this:
A B C D
b 3 3 4
a 1 2 1
a 1 2 1
d 4 4 1
d 1 2 1
c 4 5 6
Now I hope to reorder the rows based on values in column A.
I don't want to sort the values but reorder them with a specific order like ['b', 'd', 'c', 'a']
what I expect is:
A B C D
b 3 3 4
d 4 4 1
d 1 2 1
c 4 5 6
a 1 2 1
a 1 2 1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')
If you want to keep your column as is, you can just use loc and the indexes.
df.loc[pd.Series(pd.Categorical(df.A,
                                categories=['b', 'd', 'c', 'a'],
                                ordered=True))
       .sort_values()
       .index
      ]
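The categorical approach can be sketched end to end on the question's data as follows. The variable names order and result are illustrative, and kind="mergesort" is an added assumption: it keeps rows with the same category in their original relative order, which the default sort algorithm does not guarantee.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["b", "a", "a", "d", "d", "c"],
    "B": [3, 1, 1, 4, 1, 4],
    "C": [3, 2, 2, 4, 2, 5],
    "D": [4, 1, 1, 1, 1, 6],
})

order = ["b", "d", "c", "a"]

# An ordered categorical sorts by category position, not alphabetically.
df["A"] = pd.Categorical(df["A"], categories=order, ordered=True)

# mergesort is stable, so ties keep their original relative order.
result = df.sort_values("A", kind="mergesort")
print(result)
```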
Use a dictionary to map each string to its position in the desired order, then sort the mapped values and reindex:
order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)
A B C D
0 b 3 3 4
3 d 4 4 1
4 d 1 2 1
5 c 4 5 6
1 a 1 2 1
2 a 1 2 1
Without changing datatype of A, you can set 'A' as index and select elements in the desired order defined by sk.
sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()
Or use a temp column for sorting:
sk = ['b', 'd', 'c', 'a']
(
    df.assign(S=df.A.map({v:k for k,v in enumerate(sk)}))
      .sort_values(by='S')
      .drop('S', axis=1)
)
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:
df = (
    df
    .assign(A = lambda x: pd.Categorical(x['A'], categories = ['b', 'd', 'c', 'a'], ordered = True))
    .sort_values('A')
)

Python - How to sort except one index [duplicate]

import pandas as pd
columns=['NAME', 'AB', 'H']
df = pd.DataFrame([['Harper', '10', '5'], ['Trout', '10', '5'], ['Ohtani', '10', '5'], ['TOTAL', '30', '15']], columns=columns)
df1 = df.sort_values(by='NAME')
print(df1)
the result is
NAME AB H
0 Harper 10 5
2 Ohtani 10 5
3 TOTAL 30 15
1 Trout 10 5
I want to sort the dataframe but leave the 'TOTAL' row out of the sort.
Try the following code to sort the df by 'NAME' while excluding the 'TOTAL' row:
df1 = df[df.NAME!='TOTAL'].sort_values(by='NAME')
Output:
NAME AB H
0 Harper 10 5
2 Ohtani 10 5
1 Trout 10 5
You can add the 'TOTAL' row back after sorting (DataFrame.append was removed in pandas 2.0, so use pd.concat):
df1 = pd.concat([df1, df[df.NAME=='TOTAL']])
Output:
NAME AB H
0 Harper 10 5
2 Ohtani 10 5
1 Trout 10 5
3 TOTAL 30 15
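Both steps can be combined into one expression with a boolean mask. This is a minimal sketch; the names is_total and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["Harper", "10", "5"], ["Trout", "10", "5"],
     ["Ohtani", "10", "5"], ["TOTAL", "30", "15"]],
    columns=["NAME", "AB", "H"],
)

# True only for the summary row that must stay out of the sort.
is_total = df["NAME"].eq("TOTAL")

# Sort the regular rows, then put the summary row back at the bottom.
result = pd.concat([df[~is_total].sort_values(by="NAME"), df[is_total]])
print(result)
```

The original index labels are preserved; append .reset_index(drop=True) if a fresh 0..n-1 index is preferred.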
