I have an interesting problem and thought I would share it here. Let's assume we have a pandas DataFrame like this (dummy data):
Category   Samples
A,B,123    6
A,B,456    3
A,B,789    1
X,Y,123    18
X,Y,456    7
X,Y,789    2
P,Q,123    1
P,Q,456    2
P,Q,789    2
L,M,123    1
L,M,456    3
S,T,123    5
S,T,456    5
S,T,789    3
The values in Category are hierarchical in nature: think of A as a country, B as a state, and 123 as a zip code. What I want is to greedily find each category that has fewer than 5 samples and merge it with the nearest one. The final DataFrame should look like:
Category   Samples
A,B,123    10
X,Y,123    18
X,Y,456    9
P,Q,456    5
L,M,456    4
S,T,123    8
S,T,456    5
These are the possible rules I see that will be needed:
Case A,B: sub-categories 456 and 789 have fewer than 5 samples, so we merge them, but the merged result still has only 4, which is less than 5, so it is merged further with 123 and we finally get A,B,123 with 10.
Case X,Y: sub-category 789 is the only one with fewer than 5, so it merges with 456 (the one closest to 5 samples) to become X,Y,456 with 9; X,Y,123 always had more than 5, so it remains as is.
Case P,Q: here all the sub-categories have fewer than 5, but the idea is to merge one at a time, and it has nothing to do with the sequence. 123 has one sample, so it merges with 789 to reach a sample size of 3, which is still less than 5, so it merges with 456 to form P,Q,456 with a sample size of 5; it could also end up as P,Q,789 with 5. Either is fine.
Case L,M: only two sub-categories, and even merged they stay below 5, but that's the best we can do, so the result should be L,M,456 with 4.
Case S,T: only 789 has fewer than 5, so it can go with either 123 or 456 (both have the same count), but not both. The answer should be either S,T,123 with 5 and S,T,456 with 8, or S,T,123 with 8 and S,T,456 with 5.
What happens if there is a third column of values that we want merged by the same logic: summed if the values are integers, concatenated if they are strings?
I have been trying to split the Category column and then add up the Samples, but so far no luck. Any help is greatly appreciated.
Very tricky question, especially with the structure of your data, because your grouper (really the prefixes "A,B", "X,Y", etc.) is not in a separate column. But I think you can do the following.
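First, a minimal setup recreating the question's sample data, so the snippets below run as-is:

import functools
import pandas as pd

df = pd.DataFrame({
    'Category': ['A,B,123', 'A,B,456', 'A,B,789',
                 'X,Y,123', 'X,Y,456', 'X,Y,789',
                 'P,Q,123', 'P,Q,456', 'P,Q,789',
                 'L,M,123', 'L,M,456',
                 'S,T,123', 'S,T,456', 'S,T,789'],
    'Samples': [6, 3, 1, 18, 7, 2, 1, 2, 2, 1, 3, 5, 5, 3]})

Then sort by Samples so the reduce below always folds the smallest counts first: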
df.sort_values(by='Samples', inplace=True, ignore_index=True)
# grouper containing the groupby keys ('A,B', 'X,Y', etc.)
g = df['Category'].str.extract("(.*),+")[0]
# create a column keeping the sample count and category together
df['sample_category'] = list(zip(df['Samples'], df['Category']))
Then use functools.reduce to fold the list, absorbing the next tuple as long as the running sample count is below 5:
df2 = df.groupby(g, as_index=False).agg(
    {'sample_category': lambda s: functools.reduce(
        lambda x, y: (x[0] + y[0], y[1]) if x[0] < 5 else (x, y), s)})
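For the A,B group, for example, the ascending sort yields the tuples (1, 'A,B,789'), (3, 'A,B,456'), (6, 'A,B,123'); the reduce folds them as 1 + 3 → (4, 'A,B,456'), then 4 + 6 → (10, 'A,B,123').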
Then do some munging to normalize each element to a list of tuples (a fully merged group is a single tuple, while a group that split leaves a nested pair):
df2['sample_category'] = df2['sample_category'].apply(
lambda x: [x] if isinstance(x[0], int) else list(x))
Then explode, extract the columns, and finally drop the intermediate column 'sample_category':
df2 = df2.explode('sample_category', ignore_index=True)
df2['Sample'] = df2['sample_category'].str[0]
df2['Category'] = df2['sample_category'].str[1]
df2.drop('sample_category', axis=1, inplace=True)
print(df2)
Sample Category
0 10 A,B,123
1 4 L,M,456
2 5 P,Q,789
3 8 S,T,123
4 5 S,T,456
5 9 X,Y,456
6 18 X,Y,123
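The follow-up about a third column is not handled above, but the same reduce can carry it along. A sketch only, assuming a hypothetical third column 'Extra'; Python's + both sums integers and concatenates strings, which covers the two cases in the question:

# Sketch: 'Extra' is a hypothetical third column to be merged alongside Samples.
df['sample_category'] = list(zip(df['Samples'], df['Category'], df['Extra']))
df2 = df.groupby(g, as_index=False).agg(
    {'sample_category': lambda s: functools.reduce(
        lambda x, y: (x[0] + y[0], y[1], x[2] + y[2]) if x[0] < 5 else (x, y), s)})
# The munging/explode steps then extract three fields instead of two.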
My first dataframe contains a column with the unique ID of each card (card_id):
treino.head(5)
card_id feature_1 feature_2 feature_3
0 C_ID_92a2005557 5 2 1
1 C_ID_3d0044924f 4 1 0
2 C_ID_d639edf6cd 2 2 0
3 C_ID_186d6a6901 4 3 0
4 C_ID_cdbd2c0db2 1 3 0
my second dataframe is the history of where these cards were passed:
df2.head(5)
authorized_flag card_id city_id category_1 merchant_id
0 Y C_ID_92a2005557 88 N M_ID_e020e9b302
1 Y C_ID_d639edf6cd 88 N M_ID_86ec983688
2 Y C_ID_92a2005557 88 N M_ID_979ed661fc
3 Y C_ID_92a2005557 88 N M_ID_e6d5ae8ea6
4 Y C_ID_92a2005557 88 N M_ID_e020e9b302
5 Y C_ID_4e6213e9bc 333 N M_ID_50af771f8d
6 Y C_ID_92a2005557 88 N M_ID_5e8220e564
7 Y C_ID_4e6213e9bc 3 N M_ID_9d41786a50
8 Y C_ID_d639edf6cd 88 N M_ID_979ed661fc
when using:
merged_left = pd.merge(left=df1, right=df2, how='left', left_on='card_id', right_on='card_id')
it multiplies the card_id rows, because a card_id appears several times in the second dataframe. I already made it a left join to keep only the card_ids of the first dataframe, but my problem continues.
I already understand that it multiplies the rows because df2 is a purchase history and the card_ids appear several times, but I cannot solve it.
I already tried something like:
df2.groupby(['card_id', 'merchant_id']).size().reset_index()
but I still have several rows for the same card_id. Could you help me create a dataframe with only one row per unique card_id and merchant_id? Will I have to create a new variable with their data summarized?
If you want just a list of card_id / merchant_id pairs (which user has bought something from which merchant), it is enough to draw the data from df2:
df2[['card_id', 'merchant_id']].drop_duplicates()
As you can see, no groupby is needed; just read the columns in question and drop the duplicates.
A little more complex case is when you want, e.g., how many times a particular card_id has bought something from a particular merchant_id.
Then groupby is needed, and you get the value you want using the size() function:
df2.groupby(['card_id', 'merchant_id']).size()
possibly completed with .reset_index() as you did.
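For the sample history above, this would give a count of 2 for the pair C_ID_92a2005557 / M_ID_e020e9b302 (it appears twice) and 1 for each of the other pairs.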
Of course, a particular card_id occurs in several output rows, but each time with a different merchant_id (and the relevant number of transactions between these two parties).
So make up your mind what information you want besides card_id and merchant_id.
This is necessary to decide what code is needed to generate the answer.
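If the end goal is the merge itself, one option (a sketch; the column name n_transactions is made up here) is to first collapse df2 to one row per card_id and then merge, so the rows of the left frame are not multiplied:

# Sketch: aggregate df2 to one row per card_id before merging.
counts = df2.groupby('card_id').size().reset_index(name='n_transactions')
merged = pd.merge(df1, counts, how='left', on='card_id')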
I'm new to Python and I'm trying to find the entries in the first column whose second-column values include all the entries I'm searching for. For example, if I search for {155, 137}, I expect to get 5 and 6 back from column id1.
id1 id2
----------
1. 10
2. 10
3. 10
4. 9
5. 137
5. 150
5. 155
6. 10
6. 137
6. 155
....
I've searched a lot on Google but couldn't solve it. I read these entries from an Excel file; I tried multiple for loops, but it doesn't look nice because I'm searching for a lot of entries.
I tried this:
df = pd.read_excel('path/temp.xlsx') #now I have two Columns and many rows
d1 = df.values.T[0].tolist()
d2 = df.values.T[1].tolist()
d1[d2.index(115) & d2.index(187)& d2.index(276) & d2.index(239) & d2.index(200) & d2.index(24) & d2.index(83)]
and it returned 1
I started working this week, so I'm very new.
Assuming you have two lists for the IDs (one for id1 and one for id2), and the lists correspond to each other (that is, the value at index i in list1 corresponds to the value at index i of list2), then you simply have to find the index of the element you are searching for, and the corresponding index in the other list will give the answer to your query.
To get the index of the element, you can use Python's inbuilt feature to get an index, namely:
list.index(<element>)
It will return the zero-based index of the element you wanted in the list.
To get the corresponding ID from id1, you can simply use this index (because of the one-to-one correspondence). In your case, it can be written as:
id1[id2.index(137)] #it will return 5
NOTE:
The index() method returns the index of the first matching entry in the list.
It's best to use pandas:
import pandas as pd
import numpy as np
Random data
n = 10
I = [i for i in range(1, 7)]
df1 = pd.DataFrame({'Id1': [I[np.random.randint(len(I))] for i in range(n)],
                    'Id2': np.random.randint(0, 1000, n)})
df1.head(5)
Id1 Id2
0 4 170
1 6 170
2 6 479
3 4 413
4 6 52
Query using:
df1.loc[~df1.Id2.isin([170,479])].Id1
Out[344]:
3 4
4 6
5 6
6 3
7 1
8 5
9 6
Name: Id1, dtype: int64
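Note that isin answers a row-level question; the original ask was for the id1 values whose id2 values include all of the search targets. A sketch of that requirement, assuming a frame df with the id1/id2 columns from the question:

# Sketch: keep only id1 groups whose id2 values contain every target.
targets = {155, 137}
mask = df.groupby('id1')['id2'].apply(lambda s: targets.issubset(set(s)))
print(mask[mask].index.tolist())  # [5, 6] for the sample data in the question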
I want to add an aggregated, grouped, nunique column to my pandas DataFrame, but not aggregate the entire DataFrame. I'm trying to do this in one line and avoid creating a new aggregated object and merging it back, etc.
My df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (without collapsing the track/type combos in the resulting df): same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
In R this is easily done in data.table with:
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
implies that there is a name nunique in the namespace that performs some function. transform will take a function, or a string that it knows a function for, and nunique is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred over passing your own functions. This is true even for passing numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead:
df.groupby(['track', 'type'])['id'].transform('nunique')
Demo:
df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
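Assigning the result back gives the one-liner the question asked for:

df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')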
I am very sorry if this is a very basic question, but unfortunately I'm failing miserably at figuring out the solution.
I need to subtract the first value within a column (in this case column 8 of my df) from the last value and divide the result by a number (e.g. 60), after having applied groupby to my pandas df to get one value per id. The final output would ideally look something like this:
id
1 1523
2 1644
I have the actual equation, which works on its own when applied to the entire column of the df:
(df.iloc[-1,8] - df.iloc[0,8])/60
However, I fail to combine this part with the groupby function. Among others, I tried apply, which doesn't work:
df.groupby(['id']).apply((df.iloc[-1,8] - df.iloc[0,8])/60)
I also tried creating a function with the equation part and then doing apply(func), but so far none of my attempts have worked. Any help is much appreciated, thank you!
Demo:
In [204]: df
Out[204]:
id val
0 1 12
1 1 13
2 1 19
3 2 20
4 2 30
5 2 40
In [205]: df.groupby(['id'])['val'].agg(lambda x: (x.iloc[-1] - x.iloc[0])/60)
Out[205]:
id
1 0.116667
2 0.333333
Name: val, dtype: float64
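Applied to the original problem, where the target column sits at positional index 8 (its name is not given in the question, so it is looked up by position here), the same pattern would be:

# Sketch: same last-minus-first logic on the column at position 8.
col = df.columns[8]  # assumes the target column is the 9th column
result = df.groupby('id')[col].agg(lambda x: (x.iloc[-1] - x.iloc[0]) / 60)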
The Situation
I'm classifying the rows in a DataFrame using a certain classifier based on the values in a particular column. My goal is to append the results to one new column or another depending on certain conditions. The code, as it stands looks something like this:
df = pd.DataFrame({'A': [list with classifier ids],      # Only 3 ids, one-word strings
                   'B': [list of text to be classified],  # Millions of unique rows, lines of text around 5-25 words long
                   'C': [list of the old classes]})       # Hundreds of possible classes, four-digit integers stored as strings
df.sort_values('A', inplace=True)
new_col1, new_col2 = [], []
for name, group in df.groupby('A', sort=False):
classifier = classy_dict[name]
vectors = vectorize(group.B.values)
preds = classifier.predict(vectors)
scores = classifier.decision_function(vectors)
for tup in zip(preds, scores, group.C.values):
if tup[2] == tup[0]:
new_col1.append(np.nan)
new_col2.append(tup[2])
else:
new_col1.append(str(classifier.classes_[tup[1].argsort()[-5:]]))
new_col2.append(np.nan)
df['D'] = new_col1
df['E'] = new_col2
The Issue
I am concerned that groupby will not iterate in the top-down, order-of-appearance manner I expect. Iteration order when sort=False is not covered in the docs.
My Expectations
All I'm looking for here is some affirmation that groupby('col', sort=False) does iterate in the top-down order-of-appearance way that I expect. If there is a better way to make all of this work, suggestions are appreciated.
Here is the code I used to test my theory on sort=False iteration order:
from numpy.random import randint
import pandas as pd
from string import ascii_lowercase as lowers
df = pd.DataFrame({'A': [lowers[randint(3)] for _ in range(100)],
'B': randint(10, size=100)})
print(df.A.unique()) # unique values in order of appearance per the docs
for name, group in df.groupby('A', sort=False):
print(name)
Edit: The above code makes it appear as though it acts in the manner that I expect, but I would like some more undeniable proof, if it is available.
Yes, when you pass sort=False the order of first appearance is preserved. The groupby source code is a little opaque, but there is one method, groupby.ngroup, whose docstring fully answers this question, as it directly tells you the order in which iteration occurs:
def ngroup(self, ascending=True):
"""
Number each group from 0 to the number of groups - 1.
This is the enumerative complement of cumcount. Note that the
numbers given to the groups match the order in which the groups
would be seen when iterating over the groupby object, not the
order they are first observed.
""
Data from @coldspeed:
df['sort=False'] = df.groupby('col', sort=False).ngroup()
df['sort=True'] = df.groupby('col', sort=True).ngroup()
Output:
col sort=False sort=True
0 16 0 7
1 1 1 0
2 10 2 5
3 20 3 8
4 3 4 2
5 13 5 6
6 2 6 1
7 5 7 3
8 7 8 4
When sort=False, you iterate based on first appearance; when sort=True, it sorts the groups and then iterates.
Let's do a little empirical test. You can iterate over groupby and see the order in which groups are iterated over.
df
col
0 16
1 1
2 10
3 20
4 3
5 13
6 2
7 5
8 7
for c, g in df.groupby('col', sort=False):
print(c)
16
1
10
20
3
13
2
5
7
It appears that the order is preserved.
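For the "more undeniable proof" requested above, a quick programmatic check (a sketch against the same df) is to compare the iteration order with the order of first appearance:

# Sketch: iteration order under sort=False should equal first-appearance order.
first_seen = df['col'].unique().tolist()
iterated = [name for name, _ in df.groupby('col', sort=False)]
assert iterated == first_seen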