How to create a dataframe from a python groupby

My first dataframe contains a column consisting of a unique ID of a card (card_id):
treino.head(5)
card_id feature_1 feature_2 feature_3
0 C_ID_92a2005557 5 2 1
1 C_ID_3d0044924f 4 1 0
2 C_ID_d639edf6cd 2 2 0
3 C_ID_186d6a6901 4 3 0
4 C_ID_cdbd2c0db2 1 3 0
My second dataframe is the history of where these cards were used:
df2.head(5)
authorized_flag card_id city_id category_1 merchant_id
0 Y C_ID_92a2005557 88 N M_ID_e020e9b302
1 Y C_ID_d639edf6cd 88 N M_ID_86ec983688
2 Y C_ID_92a2005557 88 N M_ID_979ed661fc
3 Y C_ID_92a2005557 88 N M_ID_e6d5ae8ea6
4 Y C_ID_92a2005557 88 N M_ID_e020e9b302
5 Y C_ID_4e6213e9bc 333 N M_ID_50af771f8d
6 Y C_ID_92a2005557 88 N M_ID_5e8220e564
7 Y C_ID_4e6213e9bc 3 N M_ID_9d41786a50
8 Y C_ID_d639edf6cd 88 N M_ID_979ed661fc
when using:
merged_left = pd.merge(left=df1, right=df2, how='left', left_on='card_id', right_on='card_id')
the merge multiplies the card_id rows, because each card_id appears several times in the second dataframe. I already set the join to a left join so that only the card_ids of the first dataframe remain, but the problem persists.
I understand that the rows are multiplied because df2 is a purchase history in which each card_id appears several times, but I cannot solve it.
I already tried something like:
df2.groupby(['card_id', 'merchant_id']).size().reset_index()
but I still get several rows for the same card_id. Could you help me create a dataframe with only one row per unique card_id and merchant_id? Will I have to create a new variable that summarizes their data?

If you want just a list of card_id / merchant_id pairs (which card has bought
something from which merchant), it is enough to take the data from df2:
df2[['card_id', 'merchant_id']].drop_duplicates()
As you can see, no groupby is needed, just read the columns in question and
drop duplicates.
A slightly more complex case is when you want, for example, how many times a particular
card_id has bought something from a particular merchant_id.
Then groupby is needed, and you get the desired value with the size() function:
df2.groupby(['card_id', 'merchant_id']).size()
optionally followed by .reset_index(), as you did.
Of course, a particular card_id still occurs in several output rows, but each time
with a different merchant_id (and the corresponding number of transactions
between those two).
So decide what information you want besides card_id and merchant_id; that determines what code is needed to produce it.
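If the goal is instead to end up with one row per card_id after the merge, one option is to aggregate df2 per card_id first and only then merge. A minimal sketch, assuming the frames df1/treino and df2 shown above (the aggregated column names are made up for illustration):
# One row per card_id: count transactions and distinct merchants, then left-merge.
summary = (df2.groupby('card_id')
              .agg(n_transactions=('merchant_id', 'size'),
                   n_merchants=('merchant_id', 'nunique'))
              .reset_index())
merged_left = df1.merge(summary, on='card_id', how='left')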

Related

How to call a column by combining a string and another variable in a python dataframe?

Imagine I have a dataframe with these variables and values:
ID  Weight  LR Weight  UR Weight  Age  LS Age  US Age  Height  LS Height  US Height
1   63      50         80         20   18      21      165     160        175
2   75      50         80         22   18      21      172     160        170
3   49      45         80         17   18      21      180     160        180
I want to create the following additional variables:
ID  Flag_Weight  Flag_Age  Flag_Height
1   1            1         1
2   1            0         0
3   1            0         1
These flags symbolize that the main variable values (e.g. Weight, Age and Height) are between the corresponding Lower and Upper limits. The limit columns may start with different two-letter prefixes (in this dataframe I gave four examples: LR, UR, LS, US, but in my real dataframe I have more), and their limit values sometimes differ from ID to ID.
Can you help me create these flags, please?
Thank you in advance.
You can reshape using a temporary MultiIndex:
(df.set_index('ID')
   .pipe(lambda d: d.set_axis(pd.MultiIndex.from_frame(
       d.columns.str.extract(r'(^[LU]?).*?\s*(\S+)$')),
       axis=1))
   .stack()
   .assign(flag=lambda d: d[''].between(d['L'], d['U']).astype(int))
   ['flag'].unstack().add_prefix('Flag_').reset_index()
)
Output:
ID Flag_Age Flag_Height Flag_Weight
0 1 1 1 1
1 2 0 0 1
2 3 0 1 1
So, if I understood correctly, you want to add columns with these new variables. The simplest solution to this would be df.insert().
You could make it something like this:
df.insert(loc, column, value)  # loc: integer position for the new column, column: its name, value: its values
You can build the new values in pretty much any way you can imagine: copying a column, simple arithmetic with +, -, *, /, or applying a whole function that returns the flags based on your conditions as the values of the new column.
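For example, a minimal sketch for one of the flags (using the column names from the question's table; the insert position and the flag name are just illustrative):
# Compute Flag_Weight as 1 when Weight lies between its lower and upper limits,
# and insert it right after the ID column (position 1).
flag_weight = df['Weight'].between(df['LR Weight'], df['UR Weight']).astype(int)
df.insert(1, 'Flag_Weight', flag_weight)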
If the new columns can simply be appended, you can also create a new column like this:
df['new column name'] = any values you want
I hope this helped.

Aggregate a dataframe column based on a hierarchical condition from another column

I have an interesting problem and thought I will share it here for everyone. Let's assume we have a pandas DataFrame like this (dummy data):
Category  Samples
A,B,123   6
A,B,456   3
A,B,789   1
X,Y,123   18
X,Y,456   7
X,Y,789   2
P,Q,123   1
P,Q,456   2
P,Q,789   2
L,M,123   1
L,M,456   3
S,T,123   5
S,T,456   5
S,T,789   3
The values in Category are hierarchical in nature: think of A as country, B as state, and 123 as zip code. What I want is to greedily find each category that has fewer than 5 samples and merge it with the nearest one. The final example DataFrame should look like:
Category  Samples
A,B,123   10
X,Y,123   18
X,Y,456   9
P,Q,456   5
L,M,456   4
S,T,123   8
S,T,456   5
These are the possible rules I see that will be needed:
Case A,B: Sub-categories 456 and 789 each have fewer than 5 samples, so we merge them; the merged group still has only 4, which is less than 5, so it gets merged further with 123, and we finally get A,B,123 with 10.
Case X,Y: Sub-category 789 is the only one with fewer than 5, so it merges with 456 (the one closest to 5 samples) to become X,Y,456 with 9; X,Y,123 always had more than 5, so it remains as is.
Case P,Q: Here all the sub-categories have fewer than 5, but the idea is to merge them one at a time, regardless of sequence. 123 has one sample, so it merges with 789 to give a sample size of 3, which is still less than 5, so it merges with 456 to form P,Q,456 with a sample size of 5; it could equally end up as P,Q,789 with 5. Either is fine.
Case L,M: There are only two sub-categories, and even merged they remain below 5, but that is the best we can do, so it should be L,M,456 with 4.
Case S,T: Only 789 has fewer than 5, so it can go with either 123 or 456 (both have the same number of samples), but not both. The answer should be either S,T,123 with 5 and S,T,456 with 8, or S,T,123 with 8 and S,T,456 with 5.
What happens if there is a third column with values that we want merged by the same logic: added up if it is an integer, concatenated if it is a string, depending on whatever condition we use on these columns?
I have been trying to split the Category column and then add up the Samples, but so far no luck. Any help is greatly appreciated.
Very tricky question, especially with the structure of your data (the grouper, which is really the "A,B", "X,Y", etc. parts, is not in a separate column). But I think you can do:
df.sort_values(by='Samples', inplace=True, ignore_index=True)
# grouper containing the groupby keys ('A,B', 'X,Y', etc.)
g = df['Category'].str.extract("(.*),+")[0]
#create a column to keep the category together
df['sample_category'] = list(zip(df['Samples'], df['Category']))
Then use functools.reduce (remember to import functools) to reduce the list by iteratively grabbing the next tuple while the accumulated sample count is less than 5:
import functools

df2 = df.groupby(g, as_index=False).agg(
    {'sample_category': lambda s:
        functools.reduce(lambda x, y: (x[0] + y[0], y[1]) if x[0] < 5 else (x, y), s)})
Then do some munging to modify the elements to a list type:
df2['sample_category'] = df2['sample_category'].apply(
    lambda x: [x] if isinstance(x[0], int) else list(x))
Then explode, extract columns and finally drop the intermediate column 'sample_category'
df2 = df2.explode('sample_category', ignore_index=True)
df2['Sample'] = df2['sample_category'].str[0]
df2['Category'] = df2['sample_category'].str[1]
df2.drop('sample_category', axis=1, inplace=True)
print(df2):
Sample Category
0 10 A,B,123
1 4 L,M,456
2 5 P,Q,789
3 8 S,T,123
4 5 S,T,456
5 9 X,Y,456
6 18 X,Y,123

Conditionally dropping columns in a pandas dataframe

I have this dataframe and my goal is to remove any columns that have fewer than 1000 entries.
Prior to pivoting the df, I know I have 880 unique well_ids with entries ranging from 4 to 60k+. I should end up with 102 well_ids.
I tried to accomplish this in a naïve way by collecting the wells I want to remove in an array and deleting them in a loop, but I keep getting a 'TypeError: Level type mismatch'. When I just use del without a for loop, it works.
#this works
del df[164301.0]
del df['TB-0071']
# this doesn't work
for id in unwanted_id:
    del df[id]
Any help is appreciated, Thanks.
You can use the dropna method:
df.dropna(axis=1, thresh=1000)  # keep only columns with at least 1000 non-NA values
The advantage of this method is that you don't need to create a list.
Also don't forget to add the usual inplace=True if you want the changes made in place.
You can use pandas drop method:
df.drop(columns=['colName'], inplace=True)
You can actually pass a list of column names:
unwanted_ids = [164301.0, 'TB-0071']
df.drop(columns=unwanted_ids, inplace=True)
Sample:
df[:5]
from to freq
0 A X 20
1 B Z 9
2 A Y 2
3 A Z 5
4 A X 8
df.drop(columns=['from', 'to'])
freq
0 20
1 9
2 2
3 5
4 8
And to get those column names with more than 1000 unique values, you can use something like this:
counts = df.nunique()[df.nunique()>1000].to_frame('uCounts').reset_index().rename(columns={'index':'colName'})
counts
colName uCounts
0 to 1001
1 freq 1050
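If the goal is specifically to drop every column with fewer than 1000 non-null entries (rather than unique values), a minimal sketch is:
# Keep only the columns whose non-null count is at least 1000.
keep = df.columns[df.count() >= 1000]
df = df[keep]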

Accessing Pandas groupby() function

I have the below data frame with me after doing the following:
train_X = icon[['property', 'room', 'date', 'month', 'amount']]
train_frame = train_X.groupby(['property', 'month', 'date', 'room']).median()
print(train_frame)
amount
property month date room
1 6 6 2 3195.000
12 3 2977.000
18 2 3195.000
24 3 3581.000
36 2 3146.000
3 3321.500
42 2 3096.000
3 3580.000
54 2 3195.000
3 3580.000
60 2 3000.000
66 3 3810.000
78 2 3000.000
84 2 3461.320
3 2872.800
90 2 3461.320
3 3580.000
96 2 3534.000
3 2872.800
102 3 3581.000
108 3 3580.000
114 2 3195.000
My objective is to track the median amount based on the (property, month, date, room)
I did this:
big_list = [[property, month, date, room], ...]
test_list = [property, month, date, room]
if test_list in big_list:
    # I want to get the median amount for the row that matches test_list
How do I do this?
What I did is try the below:
count = 0
test_list = [2, 6, 36, 2]
for j in big_list:
    if test_list == j:
        break
    count += 1
Now, after getting the count, how do I access the median amount at that position in the dataframe? Is there a way to access a dataframe by index?
Please note:
big_list is the list of lists where each inner list is [property, month, date, room] from the above dataframe
test_list is an incoming list to be matched against big_list
Answering the last question:
Is there a way to access a dataframe by index?
Of course there is: you should use df.iloc or df.loc.
It depends on whether you want to access purely by integer position (I guess this is the situation), in which case you use iloc, or by label (for example a string index), in which case you use loc.
Documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
Edit:
Coming back to the question.
I assume that 'amount' is the median you are looking for, then.
You can use the reset_index() method on the grouped dataframe, like
train_frame_reset = train_frame.reset_index()
and then you can access your column names again, so you should be able to do the following (assuming j is the index of the found row):
train_frame_reset.iloc[j]['amount']  # gives you the median
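Putting that together, a minimal sketch (my own illustration, assuming big_list is rebuilt in the same row order as the grouped frame; the example values come from the data shown above):
# Reset the grouped frame so (property, month, date, room) become columns again.
train_frame_reset = train_frame.reset_index()
# Rebuild big_list from the reset frame (hypothetical, matching the question's description).
big_list = train_frame_reset[['property', 'month', 'date', 'room']].values.tolist()
test_list = [1, 6, 36, 2]
if test_list in big_list:
    j = big_list.index(test_list)  # row position of the first match
    median_amount = train_frame_reset.iloc[j]['amount']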
If I understand your problem correctly, you don't need to count at all; you can access the values via loc directly.
Look at:
A = pd.DataFrame([[5,6,9],[5,7,10],[6,3,11],[6,5,12]], columns=['lev0','lev1','val'])
Then you did:
test=A.groupby(['lev0','lev1']).median()
Accessing, say, the median for the group lev0=6 and lev1=5 can be done via:
test.loc[6,5]
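Applied to the question's data, a minimal sketch (assuming train_frame keeps its (property, month, date, room) MultiIndex from the groupby) would be:
# Median amount for property=1, month=6, date=36, room=2, looked up directly on the MultiIndex.
median_amount = train_frame.loc[(1, 6, 36, 2), 'amount']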

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine:
df1 = df[['task_id', 'duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now) and after (what I am trying to achieve) were shown as screenshots.
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The code will take care of having the columns retained and getting the sum.
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import numpy as np
import pandas as pd

X = np.random.choice([0,1,2], 20)
Y = np.random.uniform(2,10,20)
df = pd.DataFrame({'task_id':X, 'duration':Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y'])*100
df = df.drop('duration_y', axis=1)  # drops the summed duration; remove this line if you want to see it
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
disclaimer: All data is randomly generated in setup, however, calculations are straightforward and should be correct for any case.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)
# find each task_id as relative percentage of whole
df1['pct'] = df1['duration']/(df1['duration'].sum())
df1 = pd.DataFrame(df1)
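An alternative that keeps every original row, shown here as a sketch rather than the poster's code, is groupby transform, which broadcasts the per-task sum back onto the original index:
# Per-row percentage while keeping all original columns.
df1['total'] = df1.groupby('task_id')['duration'].transform('sum')
df1['pct'] = df1['duration'] / df1['total'] * 100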
