Groupby 2 columns and find .min of multiple other columns (python pandas)

My data frame looks like this:
| Months | Places | Sales_X | Sales_Y | Sales_Z |
|--------|--------|---------|---------|---------|
| month1 | Place1 | 10000   | 12000   | 13000   |
| month1 | Place2 | 300     | 200     | 1000    |
| month1 | Place3 | 350     | 1000    | 1200    |
| month2 | Place2 | 1400    | 12300   | 14000   |
| month2 | Place3 | 9000    | 8500    | 150     |
| month2 | Place1 | 90      | 4000    | 3000    |
| month3 | Place2 | 12350   | 8590    | 4000    |
| month3 | Place1 | 4500    | 7020    | 8800    |
| month3 | Place3 | 351     | 6500    | 4567    |
I need to find the highest number from the three sales columns by month and show the name of the place with the highest number.
I have been trying to solve it by using pandas.DataFrame.idxmax and groupby but it does not seem to work.
I created a new df with the highest number per row, which may help:
| Months | Places | Highest_sales |
|--------|--------|---------------|
| month1 | Place1 | 10000         |
| month1 | Place2 | 200           |
| month1 | Place3 | 350           |
| month2 | Place2 | 1400          |
| month2 | Place3 | 150           |
| month2 | Place1 | 90            |
| month3 | Place2 | 4000          |
| month3 | Place1 | 4500          |
| month3 | Place3 | 351           |
Now I just need the highest number per month and the name of the place. When I use groupby on the two columns and take the max of Highest_sales,
df.groupby(['Months', 'Places'])['Highest_sales'].max()
I get
Months Places Highest Sales
1 Place1 1549.0
Place2 2214.0
Place3 2074.0
...
12 Place1 1500.0
Place2 8090.0
Place3 2074.0
the format I am looking for would be:

| Months  | Places                   | Highest Sales |
|---------|--------------------------|---------------|
| Month1  | Place (of highest sales) | 100000        |
| Month2  | Place (of highest sales) | 900000        |
| Month3  | Place (of highest sales) | 3232000       |
| Month4  | Place (of highest sales) | 1300833       |
| ...     |                          |               |
| Month12 | Place (of highest sales) |               |

12 rows and 3 columns

Use DataFrame.filter to select the Sales columns, create a Highest column from the row-wise minimum (the same values as your Highest_sales column), then take DataFrameGroupBy.idxmax per Months and select the rows and output columns by list with DataFrame.loc:
#columns with substring Sales
df1 = df.filter(like='Sales')
#or all columns from the third position
#df1 = df.iloc[:, 2:]
df['Highest'] = df1.min(axis=1)
df = df.loc[df.groupby('Months')['Highest'].idxmax(), ['Months', 'Places', 'Highest']]
print(df)
Months Places Highest
0 month1 Place1 10000
3 month2 Place2 1400
7 month3 Place1 4500
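An equivalent approach, as a sketch (it assumes the same sample frame and that the row-wise minimum is the value you want to rank): sort by that value and keep the best row per month.
out = (df.assign(Highest=df.filter(like='Sales').min(axis=1))
         .sort_values('Highest', ascending=False)
         .drop_duplicates('Months')
         .sort_values('Months')
         [['Months', 'Places', 'Highest']])
print(out)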

Related

pd.MultiIndex: How do I add 1 more level (0) to a multi-index column?

This sounds trivial, but I just can't add 1 more level of index to the columns of a multi-level column df.
Current State
Category |         Cat1 |  Cat2 |
         | Total Assets |  AUMs |
Firm 1   |          100 |   300 |
Firm 2   |          200 |  3400 |
Firm 3   |          300 |   800 |
Firm 4   |          NaN |   800 |
Desired State
Importance |            H |     H |
Category   |         Cat1 |  Cat2 |
           | Total Assets |  AUMs |
Firm 1     |          100 |   300 |
Firm 2     |          200 |  3400 |
Firm 3     |          300 |   800 |
Firm 4     |          NaN |   800 |
When I use the code below:
Code 1: error "isnull is not defined for MultiIndex":
df.columns = pd.MultiIndex.from_arrays([['H', 'H'], df.columns])
Code 2: the first-level names become a combination (tuple) of the two original levels:
df.columns = pd.MultiIndex.from_arrays([['H', 'H'], df.columns.values])
Importance |                    H |            H |
Category   | (Cat1, Total Assets) | (Cat2, AUMs) |
Firm 1     |                  100 |          300 |
Firm 2     |                  200 |         3400 |
Firm 3     |                  300 |          800 |
Firm 4     |                  NaN |          800 |
Use concat():
df = pd.concat([df], keys=['H'], names=['Importance'], axis=1)
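For completeness, a minimal runnable sketch (with made-up data matching the layout above) showing the effect:
import pandas as pd

# hypothetical data matching the layout above
df = pd.DataFrame(
    [[100, 300], [200, 3400], [300, 800], [None, 800]],
    index=['Firm 1', 'Firm 2', 'Firm 3', 'Firm 4'],
    columns=pd.MultiIndex.from_tuples([('Cat1', 'Total Assets'), ('Cat2', 'AUMs')],
                                      names=['Category', None]))

# prepend a new top level named 'Importance' with value 'H' for every column
df = pd.concat([df], keys=['H'], names=['Importance'], axis=1)
print(df.columns)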

Replicate rows in Pandas and add a new (month) column

I am struggling with what I am sure is a simple problem. I have a dataframe that has around 1000 rows that are unique.
This shows expenses for the year by category and by location. Each location has the same group of categories.
I want to create a monthly budget column for each expense per site.
I also want to create a year-to-date budget column that takes the total budget for the year and divides it by 12 to give a monthly figure. This is then multiplied by the month number (April = month 1) to give a year-to-date value, e.g. May will be the monthly figure * 2, etc.
I am trying to use pandas to do this. I have tried
import numpy as np
pd.DataFrame(np.repeat(budget.values, 12, axis=0))  # replicate each row 12 times
My plan was then to iterate through each row in each group to add the month but I am struggling to achieve anything.
Any help would be appreciated.
Current
+------------+-------------+--------+
| Location | Expense | Amount |
+------------+-------------+--------+
| Sheffield | Electricity | 10000 |
| Sheffield | Gas | 12000 |
| Manchester | Electricity | 15000 |
| Manchester | Electricity | 13000 |
+------------+-------------+--------+
Desired
+------------+-------------+--------+--------+---------+-------+
| Location | Expense | Amount | Budget | Month | YTD |
+------------+-------------+--------+--------+---------+-------+
| Sheffield | Electricity | 10000 | 10000 | April | 1000 |
| Sheffield | Electricity | 10000 | 10000 | May | 2000 |
| Sheffield | Electricity | 10000 | 10000 | June | 3000 |
| Sheffield | Electricity | 10000 | 10000 | July | 4000 |
| Sheffield | Electricity | 10000 | 10000 | August | 5000 |
| Sheffield | Electricity | 10000 | 10000 | Sep | 6000 |
| Sheffield | Electricity | 10000 | 10000 | Oct | 7000 |
| Sheffield | Electricity | 10000 | 10000 | Dec | 8000 |
| Sheffield | Electricity | 10000 | 10000 | Jan | 9000 |
| Sheffield | Electricity | 10000 | 10000 | Feb | 10000 |
| Sheffield | Electricity | 10000 | 10000 | March | 11000 |
| Sheffield | Gas | 12000 | 20000 | April | 2000 |
| Sheffield | Gas | 12000 | 20000 | May | 4000 |
| Sheffield | Gas | 12000 | 20000 | June... | 6000 |
| Sheffield | Gas | 12000 | 20000 | ..March | 8000 |
| Manchester | Electricity | 15000 | 36000 | April | 4000 |
| Manchester | Electricity | 15000 | 36000 | May | 8000 |
+------------+-------------+--------+--------+---------+-------+
You can create a month table with April as number 1, the starting month of the fiscal year:
import pandas as pd

# initialise data from lists
data = {'Month': ['April', 'May', 'June', 'July', 'August', 'September',
                  'October', 'November', 'December', 'January', 'February', 'March'],
        'Number': range(1, 13),
        'key': [1] * 12}

# create DataFrame
df_months = pd.DataFrame(data)
Resulting in a table like this:
+------------+--------+-----+
| Month | Number | key |
+------------+--------+-----+
| April | 1 | 1 |
| May | 2 | 1 |
| June | 3 | 1 |
| July | 4 | 1 |
| August | 5 | 1 |
| September | 6 | 1 |
| October | 7 | 1 |
| November | 8 | 1 |
| December | 9 | 1 |
| January | 10 | 1 |
| February | 11 | 1 |
| March | 12 | 1 |
+------------+--------+-----+
Now adapt your second table, let's call it df_amounts, to have a dummy key column (key), which will ensure every month gets joined to every location/expense combination:
df_amounts['key'] = 1
df_amounts:
+------------+-------------+--------+-----+
| Location | Expense | Amount | key |
+------------+-------------+--------+-----+
| Sheffield | Electricity | 10000 | 1 |
| Sheffield | Gas | 12000 | 1 |
| Manchester | Electricity | 15000 | 1 |
| Manchester | Electricity | 13000 | 1 |
+------------+-------------+--------+-----+
Then join the tables on key:
df = pd.merge(df_amounts, df_months, on="key", how="left")
To get the following table:
+------------+-------------+--------+-----+-------+--------+
| Location | Expense | Amount | key | Month | Number |
+------------+-------------+--------+-----+-------+--------+
Now divide the Number column by 12 and multiply by Amount to get your new YTD column:
df['YTD']= df['Amount'] * (df['Number'] / 12)
Your monthly Budget column works similarly:
df['Budget']= df['Amount'] / 12
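Side note: on pandas 1.2 or newer the helper key column is not needed, because a cross join can be requested directly (a sketch using the same df_amounts and df_months as above):
# pandas >= 1.2: cross join without the helper key column
df = df_amounts.merge(df_months[['Month', 'Number']], how='cross')
df['YTD'] = df['Amount'] * (df['Number'] / 12)
df['Budget'] = df['Amount'] / 12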

transform pandas dataframe and combine rows

I have a pandas dataframe which looks like this:
| student-id | subject-id | grade |
|------------|------------|-------|
| 1          | 1234       | 4     |
| 1          | 2234       | 3     |
| 1          | 3234       | 3     |
| 2          | 1234       | 2     |
| 2          | 2234       | 1     |
| 2          | 3234       | 4     |
Now I want to transform it so that I get only one row for every student-id, with every grade from that student in the row, like this:
| student-id | grade 1 | grade 2 | grade 3 |
|------------|---------|---------|---------|
| 1          | 4       | 3       | 3       |
| 2          | 2       | 1       | 4       |
thx for help!
You may drop subject-id with del df['subject-id'], and then df.groupby('student-id') will give the grades with respect to student-id.
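A sketch of one way to produce the desired wide layout (assuming the subject order within each student defines grade 1, 2, 3) is to number the rows per student with cumcount and then pivot:
import pandas as pd

# sample data matching the table above
df = pd.DataFrame({'student-id': [1, 1, 1, 2, 2, 2],
                   'subject-id': [1234, 2234, 3234, 1234, 2234, 3234],
                   'grade': [4, 3, 3, 2, 1, 4]})

# number the grades 1..n within each student, then pivot them into columns
out = (df.assign(n=df.groupby('student-id').cumcount() + 1)
         .pivot(index='student-id', columns='n', values='grade')
         .add_prefix('grade ')
         .reset_index())
print(out)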

Merging two dataframes where one side has duplicated 'on' values

I have two dataframes, and the standard (first) dataframe has some repeated values in the column (id) that I have to use as the merge key.
+----+------------+------------+------------+
| id | res_number | type | payment |
+----+------------+------------+------------+
| a | 1 | toys | 20000 |
| a | 2 | clothing | 30000 |
| a | 3 | food | 40000 |
| b | 4 | food | 40000 |
| c | 5 | laptop | 30000 |
+----+------------+------------+------------+
I want to merge this dataframe with the dataframe below.
+----+------------+------------+
| id | group | unique_num |
+----+------------+------------+
| a | 1 | 1231 |
| b | 2 | 1234 |
| c | 1 | 1241 |
+----+------------+------------+
and I want to produce a dataframe like this:
+----+------------+------------+------------+------------+------------+
| id | res_number | type | payment | group | unique_num |
+----+------------+------------+------------+------------+------------+
| a | 1 | toys | 20000 | 1 | 1231 |
| a | 2 | clothing | 30000 | 1 | 1231 |
| a | 3 | food | 40000 | 1 | 1231 |
| b | 4 | food | 40000 | 2 | 1234 |
| c | 5 | laptop | 30000 | 3 | 1241 |
+----+------------+------------+------------+------------+------------+
As you can see, I want to merge the dataframes on 'id', but the standard dataframe has repeated values in 'id'. My goal is simply to paste the group and unique_num values onto every row with a matching 'id'.
Can you give me a good example for this problem?
I think you need merge with left join:
df = pd.merge(df1, df2, how='left')
Or, in case there are more common column names in both DataFrames, specify the key explicitly:
df = pd.merge(df1, df2, how='left', on='id')
print (df)
id payment res_number type group unique_num
0 a 20000 1 toys 1 1231
1 a 30000 2 clothing 1 1231
2 a 40000 3 food 1 1231
3 b 40000 4 food 2 1234
4 c 30000 5 laptop 1 1241
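If you also want pandas to verify that 'id' is unique on the right-hand side (i.e. that the merge really is many-to-one), you can pass the validate argument:
df = pd.merge(df1, df2, how='left', on='id', validate='m:1')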

Discount for repeated rows

A user creates one plan to purchase some items on N (here, five) different dates.
+--------------+---------+------------+
| plan_date_id | plan_id | ad_date |
+--------------+---------+------------+
| 1 | 1 | 2015-09-13 |
| 2 | 1 | 2015-09-15 |
| 3 | 1 | 2015-09-17 |
| 4 | 1 | 2015-09-21 |
| 5 | 1 | 2015-09-24 |
+--------------+---------+------------+
Week span: for each product, the week span is calculated from the date on which the product was sold for the first time, plus 6 days.
For example, for product_id 10 the first purchase was made on 2015-09-13, so
the week span runs from 2015-09-13 to 2015-09-19 (2015-09-13 + 6).
Discount logic for a product: (total number of repetitions in the week span - 1) * 10%,
but the maximum discount can be 30%.
+-----------------+--------------+------------+
| plan_product_id | plan_date_id | product_id |
+-----------------+--------------+------------+
| 1 | 1 | 10 |
| 2 | 2 | 5715 |
| 3 | 2 | 10 |
| 4 | 3 | 10 |
| 5 | 3 | 128900 |
| 6 | 4 | 10 |
| 7 | 5 | 10 |
+-----------------+--------------+------------+
So in my example I want the discount as follows:
+-----------------+--------------+------------+------------+
| plan_product_id | plan_date_id | product_id | discount |
+-----------------+--------------+------------+------------+
| 1 | 1 | 10 | 0% |
| 2 | 2 | 5715 | 0% |
| 3 | 2 | 10 | 10% |
| 4 | 3 | 10 | 20% |
| 5 | 3 | 128900 | 0% |
| 6 | 4 | 10 | 0% |
| 7 | 5 | 10 | 0% |
+-----------------+--------------+------------+------------+
Please note there will be a 0% discount for plan_product_id 6 and 7, because those purchases fall outside the first week span.
Currently, I am doing the discount calculation in Python.
First I get all the required records. Then I create a dict with product_id as the key,
whose value is another dict holding the base date and the number of repetitions within the week. Then I loop over all the records.
What would be the best way to do this?
Is it possible to do it only from MySQL or the Django ORM?
Would looping in MySQL be more performance-efficient?
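No MySQL/ORM answer here, but as a sketch of the same week-span logic in pandas (hypothetical frames plan_dates and plan_products mirroring the tables above):
import pandas as pd

# hypothetical frames mirroring the tables above
plan_dates = pd.DataFrame({
    'plan_date_id': [1, 2, 3, 4, 5],
    'plan_id': 1,
    'ad_date': pd.to_datetime(['2015-09-13', '2015-09-15', '2015-09-17',
                               '2015-09-21', '2015-09-24'])})
plan_products = pd.DataFrame({
    'plan_product_id': [1, 2, 3, 4, 5, 6, 7],
    'plan_date_id': [1, 2, 2, 3, 3, 4, 5],
    'product_id': [10, 5715, 10, 10, 128900, 10, 10]})

df = plan_products.merge(plan_dates, on='plan_date_id').sort_values('ad_date')

# week span: first purchase date of each product plus 6 days
first_date = df.groupby('product_id')['ad_date'].transform('min')
in_span = df['ad_date'] <= first_date + pd.Timedelta(days=6)

# repetitions within the span before the current row, capped at 3 (max 30%)
reps = df[in_span].groupby('product_id').cumcount()
df['discount'] = (reps.clip(upper=3) * 10).reindex(df.index, fill_value=0)
print(df[['plan_product_id', 'plan_date_id', 'product_id', 'discount']])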
