lookup within filtered range - python

I have a dataframe with data from an ecommerce panel.
It has orders and returns mixed together.
Each row has an orderID; it's the same number for a normal order and for the corresponding return that comes back from the customer.
My data looks like this:
orderID  Shop  Revenue  Note
44       0     -32      Return
45       0     -100     Return
44       1     14
45       3     20       Something else
46       2     50
47       1     80       Something
48       2     222
For each return I want to find the 'Shop' value that corresponds to the original order.
For example, 'orderID' == 44 appears twice: once as a return (with 'Shop' == 0) and once as a normal order (with 'Shop' == 1).
I want to replace all the 0 values in the 'Shop' column with the values from the matching normal orders.
My desired output looks like this:
orderID  Shop  Revenue  Note
44       1     -32      Return
45       3     -100     Return
44       1     14
45       3     20       Something else
46       2     50
47       1     80       Something
48       2     222
I know how to do it in Google Sheets: first I filter the table to remove the 'Shop' == 0 rows, and then I VLOOKUP the order numbers in that filtered range.
I know how to filter this table using Pandas, but I don't know how to write the lookup part.
I assume I will need a temporary column first, where I store both kinds of values: copied as-is for normal orders, and looked up for returns.
The original dataframe has 1,000,000+ rows.
My data in .csv is available here:
https://docs.google.com/spreadsheets/d/e/2PACX-1vQAJ4tMc_Bcvv-4FsUy3E7sG0m9hm-nLTVLj-LwlSEns-YJ1pbq6gSKp5mj5lZqRI2EgHOsOutwnn1I/pub?gid=0&single=true&output=csv
Thank you for any advice!

IIUC, using map:
m = df.query('Shop != 0').set_index('orderID')['Shop']  # lookup Series: orderID -> non-zero Shop
df['Shop'] = df['orderID'].map(m)                        # overwrite Shop for every row via the lookup
print(df)
Output:
orderID Shop Revenue Note
0 44 1 -32 Return
1 45 3 -100 Return
2 44 1 14 NaN
3 45 3 20 Something else
4 46 2 50 NaN
5 47 1 80 Something
6 48 2 222 NaN
Create a pd.Series using query to filter out the zero shops, then set_index and map shops to orderID.
This works if there is a 1-to-1 shop-to-order mapping. If you have multiple shops per order, you'll need logic to determine which shop is the valid one.
If the same order appears more than once with the same shop, you need to drop_duplicates first.
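A minimal sketch of that deduplication step, assuming df is the original dataframe and the first non-zero shop per order is the one to keep:
m = (df.query('Shop != 0')
       .drop_duplicates(subset='orderID')   # keep one non-zero row per orderID
       .set_index('orderID')['Shop'])
df['Shop'] = df['orderID'].map(m)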

Related

how to combine the first 2 columns in pandas/python with n/a values

I have some questions about combining the first 2 columns in pandas/python when they contain n/a values.
Long story short: I need to read an Excel file and make the changes there. I cannot change anything in Excel itself, so every change has to be done in Python.
Here is the Excel input,
and the expected output will be as shown.
I manage to read it in, but when I try to combine the first 2 columns I run into problems: the cells in the first column are merged in Excel, so once it is read in, only one row has a value and the rest of the rows are all N/A,
such as below:
Year   number  2016
Month          Jan
Month          2016-01
Grade  1       100
NaN    2       99
NaN    3       98
NaN    4       96
NaN    5       92
NaN    Total   485
Is there any function that can easily help me to combine the first two columns and make it as below:
Year 2016
Month Jan
Month 2016-01
Grade 1 100
Grade 2 99
Grade 3 98
Grade 4 96
Grade 5 92
Grade Total 485
Any help would be really appreciated.
I searched and googled the keywords for a long time but did not find an answer that fits my situation.
import pandas as pd
from io import StringIO

d = '''
Year,number,2016
Month,,Jan
Month,,2016-01
Grade,1, 100
NaN,2, 99
NaN,3, 98
NaN,4, 96
NaN,5, 92
NaN,Total,485
'''
df = pd.read_csv(StringIO(d))
df

# forward-fill the merged first column so every row gets a label
df['Year'] = df.Year.ffill()
df = df.fillna('')  # skip this step if your data from excel does not have nan in col 2
# combine the first two columns into one, then drop the helper column
df['Year'] = df.Year + ' ' + df.number.astype('str')
df = df.drop('number', axis=1)
df
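If the data comes straight from Excel rather than a CSV string, the same forward-fill idea applies, assuming the merged cells land as NaN when read; a rough sketch (the file name here is a placeholder):
import pandas as pd

df = pd.read_excel('report.xlsx')       # placeholder file name
df.iloc[:, 0] = df.iloc[:, 0].ffill()   # fill the merged first column downwards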

How to group by a df in Python by a column with the difference between the max value of one column and the min of another column?

I have a data frame which looks like this:
student_id  session_id  reading_level_id  st_week  end_week
1           3334        3                 3        3
1           3335        2                 4        4
2           3335        2                 2        2
2           3336        2                 2        3
2           3337        2                 3        3
2           3339        2                 3        4
...
There are multiple session_ids, st_weeks and end_weeks for every student_id. I'm trying to group the data by 'student_id' and calculate the difference between the maximum (end_week) and the minimum (st_week) for each student.
Aiming for an output that would look something like this:
Student_id  Diff
1           1
2           2
....
I am relatively new to Python as well as Stack Overflow and have been trying to find an appropriate solution - any help is appreciated.
Using the data you shared, a simpler solution is possible:
Group by student_id, passing False to the as_index parameter (this works on a dataframe and returns a dataframe);
Next, use a named aggregation to get the max of end_week and the min of st_week for each group;
Get the difference between max_wk and min_wk;
Finally, keep only the required columns.
(
    df.groupby("student_id", as_index=False)
      .agg(max_wk=("end_week", "max"), min_wk=("st_week", "min"))
      .assign(Diff=lambda x: x["max_wk"] - x["min_wk"])
      .loc[:, ["student_id", "Diff"]]
)
student_id Diff
0 1 1
1 2 2
There's probably a more efficient way to do this, but I broke it into separate steps: grouping to get the max and min values for each id, and then creating a new column for the difference. I used numpy's randint() function in this example because I didn't have access to a sample dataframe.
import pandas as pd
import numpy as np
# generate dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(1200, 4)), columns=['student_id', 'session_id', 'st_week', 'end_week'])
# use groupby to get max and min for each student_id
max_vals = df.groupby(['student_id'], sort=False)['end_week'].max().to_frame()
min_vals = df.groupby(['student_id'], sort=False)['st_week'].min().to_frame()
# use join to put max and min back together in one dataframe
merged = min_vals.join(max_vals)
# use assign() to calculate difference as new column
merged = merged.assign(difference=lambda x: x.end_week - x.st_week).reset_index()
merged
student_id st_week end_week difference
0 40 2 99 97
1 23 5 74 69
2 78 9 93 84
3 11 1 97 96
4 97 24 88 64
... ... ... ... ...
95 54 0 96 96
96 18 0 99 99
97 8 18 97 79
98 75 21 97 76
99 33 14 93 79
You can create a custom function and apply it to a group-by over students:
def week_diff(g):
    return g.end_week.max() - g.st_week.min()

df.groupby("student_id").apply(week_diff)
Result:
student_id
1 1
2 2
dtype: int64

Why is my groupby returning incorrect values for the 'product' column?

I am trying to get the maximum price paid by each user, as well as which product was purchased, into a DataFrame. When I run the code below, the result has the structure I expect, but the 'product' column is incorrect.
Original data:
df = pd.DataFrame([[123, 'xt23', 20],
                   [123, 'q45', 2],
                   [123, 'a89', 25],
                   [77, 'q45', 3],
                   [77, 'a89', 30],
                   [92, 'xt23', 24],
                   [92, 'm33', 60],
                   [92, 'a89', 28]], columns=['userid', 'product', 'price'])
df
which generates this original dataFrame:
userid product price
0 123 xt23 20
1 123 q45 2
2 123 a89 25
3 77 q45 3
4 77 a89 30
5 92 xt23 24
6 92 m33 60
7 92 a89 28
This is what's not working:
df.groupby('userid').max()
EXPECTED OUTPUT:
userid product price
77 a89 30
92 m33 60
123 a89 25
ACTUAL OUTPUT:
userid product price
77 q45 30
92 xt23 60
123 xt23 25
The values in the product column are incorrect. If I add 'product' to the groupby, the max prices are still correct, but I only want to see one price & product per user. I also tried setting numeric_only=True but that did not solve the issue.
Does anyone know why the product values don't align with the original data?
groupby().max() takes the maximum of each column independently, so the 'product' you see is just the alphabetically largest product name per user, not the product that goes with the highest price. Use this to get the desired output efficiently:
idx = df.groupby(['userid'])['price'].transform('max') == df['price']
print(df[idx])
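An alternative sketch, if you want exactly one row per user even when prices tie (and assuming the price column has no NaN values), is to index with idxmax:
# keep the single row holding each user's maximum price
print(df.loc[df.groupby('userid')['price'].idxmax()])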

How to summarize only certain columns of dataframe (python pandas)

I want to get a new dataframe in which certain columns are summed over the rows that share the same values in the 'index' columns (campaign_id and group_name in my example).
This is sample (example) of my dataframe:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 20 5 50 bar 12
102 red 7 3 25 bar 12
102 brown 5 0 18 bar 12
this is what I want to get:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 27 8 75 bar 12
102 brown 5 0 18 bar 12
I tried:
df = df.groupby(['campaign_id','group_name'])[['clicks','conversions','cost']].sum().reset_index()
but this gives my only mentioned (summarized) columns (and Index), like this:
campaign_id group_name clicks conversions cost
101 blue 40 15 100
102 red 27 8 75
102 brown 5 0 18
I could try to add the leftover columns back after this operation, but I'm not sure that would be an optimal or adequate way to solve the problem.
Is there a simple way to sum certain columns and leave the other columns untouched? (I don't care if they differ, because in my data all leftover columns hold the same values for rows with matching values in the 'index' columns, which are campaign_id and group_name.)
When I finished my post I saw the answer right away: since all columns except the ones I want to sum have matching values, I just need to include those columns in the group-by keys for this operation. Like this:
df = df.groupby(['campaign_id','group_name','label','city_id'])[['clicks','conversions','cost']].sum().reset_index()
In this case I got exactly what I wanted.
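Another sketch of the same idea, assuming the leftover columns really are constant within each group, is to aggregate them with 'first' instead of adding them to the group-by keys:
out = (df.groupby(['campaign_id', 'group_name'], as_index=False)
         .agg({'clicks': 'sum', 'conversions': 'sum', 'cost': 'sum',
               'label': 'first', 'city_id': 'first'}))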

Performing calculations on subset of data frame subset in Python

user_id char_id rating
100 33 3
100 44 2
100 33 1
100 44 4
111 55 5
111 44 4
111 55 5
I have a data frame formatted similarly to this one and am trying to perform calculations on the ratings after they have been grouped by user_id and char_id.
It doesn't work, but I need to do something like data.groupby('user_id', 'char_id') and then calculate the moving average of the ratings for each char_id within each user_id. Any help? I have several thousand user_ids, so I can't go through and select them one at a time for the calculations.
I need to somehow iterate over the user_id column and group all rows with the same user_id together, keeping the user_ids separate. Then I need to do the same thing with char_id inside each user_id subset, so that I can finally perform calculations on the subsets of subsets of ratings. So far all my attempts have been unsuccessful. The closest I came was:
def divide_by_user(data):
    for user in data['user_id']:
        user_data = data.where(data['user_id'] == user)
    return user_data
There's no need to do this manually, creating and summarizing subsets like this is exactly what DataFrame.groupby() is for. Create your groupby:
grouped = df.groupby(['user_id', 'char_id'])
Then you can apply a function to each subset. It sounds like you want either a rolling or an expanding mean, both of which are built into pandas (as the .rolling() and .expanding() window methods in current versions, formerly pd.rolling_mean / pd.expanding_mean):
df['cum_average'] = grouped['rating'].transform(lambda s: s.expanding().mean())
# New column now contains the average rating for each subset,
# including all values that have been seen so far.
df
Out[43]:
user_id char_id rating cum_average
0 100 33 3 3
1 100 44 2 2
2 100 33 1 2
3 100 44 4 3
4 111 55 5 5
5 111 44 4 4
6 111 55 5 5
Using a larger randomly-generated dataset to demonstrate a rolling mean:
import random
import pandas as pd

n_rows = 100  # example size
df = pd.DataFrame({
    'user_id': [random.choice([100, 111, 112]) for n in range(n_rows)],
    'char_id': [random.choice([33, 44, 55]) for n in range(n_rows)],
    'rating': [random.choice([1, 2, 3, 4, 5]) for n in range(n_rows)]
})
grouped = df.groupby(['user_id', 'char_id'])
df['cum_average'] = grouped['rating'].transform(lambda s: s.rolling(window=7).mean())
# Output. The rolling average will be NaN until enough values have been
# observed for that subset; you can change this using the
# min_periods argument to rolling()
df.sort_values(by=['user_id', 'char_id'])
char_id rating user_id cum_average
3 33 1 100 NaN
19 33 2 100 NaN
22 33 5 100 NaN
34 33 1 100 NaN
47 33 1 100 NaN
48 33 1 100 NaN
49 33 1 100 1.714286
51 33 4 100 2.142857
55 33 2 100 2.142857
60 33 2 100 1.714286
66 33 2 100 1.857143
...
etc.
Try this ("df" is the DataFrame):
mean = df.rating.rolling(7).mean()
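Note that this computes the rolling mean over the whole rating column; a rough sketch of the grouped version, assuming you want it per user_id/char_id pair as in the question:
# rolling mean computed within each (user_id, char_id) group
mean = df.groupby(['user_id', 'char_id'])['rating'].transform(lambda s: s.rolling(7).mean())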
