Replicate rows in Pandas and add a new (month) column - python

I am struggling with what I am sure is a simple problem. I have a dataframe with around 1000 unique rows showing expenses for the year by category and location. Each location has the same group of categories.
I want to create a monthly budget column for each expense per site.
I also want to create a year-to-date budget column that takes the total budget for the year and divides it by 12 to give a monthly figure. This is then multiplied by the month number (April = month 1) to give a year-to-date value - e.g. May will be the monthly figure * 2, etc.
I am trying to use pandas to do this. I have tried
import numpy as np
pd.DataFrame(np.repeat(budget.values, 12, axis=0))  # replicate each row 12 times
My plan was then to iterate through each row in each group to add the month, but I am struggling to achieve anything.
Any help would be appreciated.
Current
+------------+-------------+--------+
| Location | Expense | Amount |
+------------+-------------+--------+
| Sheffield | Electricity | 10000 |
| Sheffield | Gas | 12000 |
| Manchester | Electricity | 15000 |
| Manchester | Electricity | 13000 |
+------------+-------------+--------+
Desired
+------------+-------------+--------+--------+---------+-------+
| Location | Expense | Amount | Budget | Month | YTD |
+------------+-------------+--------+--------+---------+-------+
| Sheffield | Electricity | 10000 | 10000 | April | 1000 |
| Sheffield | Electricity | 10000 | 10000 | May | 2000 |
| Sheffield | Electricity | 10000 | 10000 | June | 3000 |
| Sheffield | Electricity | 10000 | 10000 | July | 4000 |
| Sheffield | Electricity | 10000 | 10000 | August | 5000 |
| Sheffield | Electricity | 10000 | 10000 | Sep | 6000 |
| Sheffield | Electricity | 10000 | 10000 | Oct | 7000 |
| Sheffield | Electricity | 10000 | 10000 | Nov | 8000 |
| Sheffield | Electricity | 10000 | 10000 | Jan | 9000 |
| Sheffield | Electricity | 10000 | 10000 | Feb | 10000 |
| Sheffield | Electricity | 10000 | 10000 | March | 11000 |
| Sheffield | Gas | 12000 | 20000 | April | 2000 |
| Sheffield | Gas | 12000 | 20000 | May | 4000 |
| Sheffield | Gas | 12000 | 20000 | June... | 6000 |
| Sheffield | Gas | 12000 | 20000 | ..March | 8000 |
| Manchester | Electricity | 15000 | 36000 | April | 4000 |
| Manchester | Electricity | 15000 | 36000 | May | 8000 |
+------------+-------------+--------+--------+---------+-------+

You can create a month table with April numbered 1, as the starting month of the fiscal year.
import pandas as pd
# initialise data from lists
data = {'Month': ['April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December', 'January', 'February', 'March'],
        'Number': range(1, 13),
        'key': [1] * 12}
# Create DataFrame
df_months = pd.DataFrame(data)
Resulting in a table like this:
+------------+--------+-----+
| Month | Number | key |
+------------+--------+-----+
| April | 1 | 1 |
| May | 2 | 1 |
| June | 3 | 1 |
| July | 4 | 1 |
| August | 5 | 1 |
| September | 6 | 1 |
| October | 7 | 1 |
| November | 8 | 1 |
| December | 9 | 1 |
| January | 10 | 1 |
| February | 11 | 1 |
| March | 12 | 1 |
+------------+--------+-----+
Now adapt your second table, let's call it df_amounts, to have an artificial key column (key), which will ensure every month gets joined to every location/expense combination:
df_amounts['key'] = 1
df_amounts:
+------------+-------------+--------+-----+
| Location | Expense | Amount | key |
+------------+-------------+--------+-----+
| Sheffield | Electricity | 10000 | 1 |
| Sheffield | Gas | 12000 | 1 |
| Manchester | Electricity | 15000 | 1 |
| Manchester | Electricity | 13000 | 1 |
+------------+-------------+--------+-----+
Then join the tables on key:
df = pd.merge(df_amounts, df_months, on="key", how="left")
To get the following table:
+------------+-------------+--------+-----+-------+--------+
| Location | Expense | Amount | key | Month | Number |
+------------+-------------+--------+-----+-------+--------+
Now divide the Number column by 12 and multiply by Amount to get your new YTD column.
df['YTD']= df['Amount'] * (df['Number'] / 12)
Your monthly Budget column works similarly:
df['Budget']= df['Amount'] / 12
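Putting the steps together, a minimal runnable sketch using the question's sample rows (on pandas 1.2+ you could also pass how='cross' to merge and skip the key column entirely):

```python
import pandas as pd

# Fiscal months, numbered with April as month 1
df_months = pd.DataFrame({
    'Month': ['April', 'May', 'June', 'July', 'August', 'September',
              'October', 'November', 'December', 'January', 'February', 'March'],
    'Number': range(1, 13),
    'key': [1] * 12,
})

# Sample amounts table from the question
df_amounts = pd.DataFrame({
    'Location': ['Sheffield', 'Sheffield', 'Manchester'],
    'Expense': ['Electricity', 'Gas', 'Electricity'],
    'Amount': [10000, 12000, 15000],
})
df_amounts['key'] = 1

# Cross join: every month is paired with every location/expense row
df = pd.merge(df_amounts, df_months, on='key', how='left').drop(columns='key')

# Monthly budget and cumulative year-to-date figure
df['Budget'] = df['Amount'] / 12
df['YTD'] = df['Amount'] * df['Number'] / 12
```

Each of the 3 input rows is repeated 12 times, once per fiscal month, with Budget and YTD derived per row.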

Related

Groupby 2 columns and find .min of multiple other columns (python pandas)

My data frame looks like this:
|Months | Places | Sales_X | Sales_Y | Sales_Z |
|----------------------------------------------------|
|month1  | Place1 | 10000   | 12000   | 13000   |
|month1  | Place2 | 300     | 200     | 1000    |
|month1  | Place3 | 350     | 1000    | 1200    |
|month2 | Place2 | 1400 | 12300 | 14000 |
|month2 | Place3 | 9000 | 8500 | 150 |
|month2 | Place1 | 90 | 4000 | 3000 |
|month3 | Place2 | 12350 | 8590 | 4000 |
|month3 | Place1 | 4500 | 7020 | 8800 |
|month3 | Place3 | 351 | 6500 | 4567 |
I need to find the highest number from the three sales columns by month and show the name of the place with the highest number.
I have been trying to solve it by using pandas.DataFrame.idxmax and groupby but it does not seem to work.
I created a new df with the highest number per row, which may help:
|Months | Places | Highest_sales|
|-----------------------------------|
|month1  | Place1 | 10000        |
|month1  | Place2 | 200          |
|month1  | Place3 | 350          |
|        |        |              |
|month2  | Place2 | 1400         |
|month2  | Place3 | 150          |
|month2  | Place1 | 90           |
|        |        |              |
|month3  | Place2 | 4000         |
|month3  | Place1 | 4500         |
|month3  | Place3 | 351          |
|-----------------------------------|
Now I just need the highest number per month and the name of the place. When I group by the two columns and take the max of Highest_sales:
df.groupby(['Months', 'Places'])['Highest_sales'].max()
I get:
Months Places Highest Sales
1 Place1 1549.0
Place2 2214.0
Place3 2074.0
...
12 Place1 1500.0
Place2 8090.0
Place3 2074.0
the format I am looking for would be
|Months  |Places                    |Highest Sales  |
|--------|--------------------------|---------------|
|Month1 |Place(*of highest sales*) |100000 |
|Month2 |Place(*of highest sales*) |900000 |
|Month3 |Place(*of highest sales*) |3232000 |
|Month4 |Place(*of highest sales*) |1300833 |
|.... | | |
|Month12 |Place(*of highest sales*) | |
-----------------------------------------------------
12 rows and 3 columns
Use DataFrame.filter to select the Sales columns, create a Highest column, then aggregate with DataFrameGroupBy.idxmax per Months, and select rows and columns by list with DataFrame.loc:
#columns with substring Sales
df1 = df.filter(like='Sales')
#or all columns from the third position
#df1 = df.iloc[:, 2:]
df['Highest'] = df1.min(axis=1)
df = df.loc[df.groupby('Months')['Highest'].idxmax(), ['Months','Places','Highest']]
print (df)
Months Places Highest
0 month1 Place1 10000
3 month2 Place2 1400
7 month3 Place1 4500
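A runnable version of the answer, reproducing the question's sample frame (note the per-row value mirrors the Highest_sales table in the question, which takes the minimum across the Sales columns for each row):

```python
import pandas as pd

df = pd.DataFrame({
    'Months': ['month1', 'month1', 'month1', 'month2', 'month2', 'month2',
               'month3', 'month3', 'month3'],
    'Places': ['Place1', 'Place2', 'Place3', 'Place2', 'Place3', 'Place1',
               'Place2', 'Place1', 'Place3'],
    'Sales_X': [10000, 300, 350, 1400, 9000, 90, 12350, 4500, 351],
    'Sales_Y': [12000, 200, 1000, 12300, 8500, 4000, 8590, 7020, 6500],
    'Sales_Z': [13000, 1000, 1200, 14000, 150, 3000, 4000, 8800, 4567],
})

# Per-row value across the Sales columns (min, matching the question's table)
df['Highest'] = df.filter(like='Sales').min(axis=1)

# For each month, keep the row whose Highest value is largest
out = df.loc[df.groupby('Months')['Highest'].idxmax(),
             ['Months', 'Places', 'Highest']]
```

The result has one row per month: the place with the largest Highest value and that value, i.e. 12 rows and 3 columns for a full year of data.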

How can I Group By Month from a Date field

I have a data frame similar to this one
| date | Murders | State |
|-----------|--------- |------- |
| 6/2/2017 | 100 | Ags |
| 5/23/2017 | 200 | Ags |
| 5/20/2017 | 300 | BC |
| 6/22/2017 | 400 | BC |
| 6/21/2017 | 500 | Ags |
I would like to group the above data by month and state to get an output as:
| date | Murders(SUM) | State |
|-----------|--------- |------- |
| January | 100 | Ags |
| February | 200 | Ags |
| March | 300 | Ags |
| .... | .... | Ags |
| January | 400 | BC |
| February | 500 | BC |
.... .... ..
I tried with this:
dg = DF.groupby(pd.Grouper(key='date', freq='1M')).sum() # groupby each 1 month
dg.index = dg.index.strftime('%B')
But these lines only sum the murders by month, without taking the State into account.
We can do
df.groupby([pd.to_datetime(df.date).dt.strftime('%B'),df.State]).Murders.sum().reset_index()
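A runnable version with the question's sample frame (dates are parsed as month/day/year, pandas' default for this format):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['6/2/2017', '5/23/2017', '5/20/2017', '6/22/2017', '6/21/2017'],
    'Murders': [100, 200, 300, 400, 500],
    'State': ['Ags', 'Ags', 'BC', 'BC', 'Ags'],
})

# Group by month name AND State, then sum the murders per group
out = (df.groupby([pd.to_datetime(df.date).dt.strftime('%B'), df.State])
         .Murders.sum()
         .reset_index())
```

With this data, the Ags rows in June (100 and 500) collapse into a single row of 600, while each other month/state pair keeps its own total.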

sqlalchemy how to divide 2 columns from different table

I have 2 tables named as company_info and company_income:
company_info :
| id | company_name | staff_num | year |
|----|--------------|-----------|------|
| 0 | A | 10 | 2010 |
| 1 | A | 10 | 2011 |
| 2 | A | 20 | 2012 |
| 3 | B | 20 | 2010 |
| 4 | B | 5 | 2011 |
company_income :
| id | company_name | income | year |
|----|--------------|--------|------|
| 0 | A | 10 | 2010 |
| 1 | A | 20 | 2011 |
| 2 | A | 30 | 2012 |
| 3 | B | 20 | 2010 |
| 4 | B | 15 | 2011 |
Now I want to calculate the average income per staff member for each company; the result looks like this:
result :
| id | company_name | avg_income | year |
|----|--------------|------------|------|
| 0 | A | 1 | 2010 |
| 1 | A | 2 | 2011 |
| 2 | A | 1.5 | 2012 |
| 3 | B | 1 | 2010 |
| 4 | B | 3 | 2011 |
How can I get this result using Python SQLAlchemy? The database is MySQL.
Join the tables and do the division. You'd want to either set up a view in MySQL with this query or build it straight into your program.
SELECT
    a.company_name,
    a.year,
    (b.income / a.staff_num) AS avg_income
FROM
    company_info AS a
LEFT JOIN
    company_income AS b
ON
    a.company_name = b.company_name
    AND a.year = b.year
You'd want a few WHERE clauses as well (such as WHERE staff_num IS NOT NULL AND staff_num != 0, and the same for income). Also, if you can have multiple rows for the same company/year in either table, you'll want to SUM the values, then group by company_name and year.
Try this:
SELECT
    info.company_name,
    (inc.income / info.staff_num) AS avg,
    info.year
FROM
    company_info info
JOIN
    company_income inc
ON
    info.company_name = inc.company_name
    AND info.year = inc.year
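Since the question asks for SQLAlchemy specifically, the SQL above can be expressed with SQLAlchemy Core roughly like this (a minimal sketch using the 1.4+ select() style; an in-memory SQLite database stands in for MySQL, and the table definitions are assumed from the sample data):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, select)

engine = create_engine('sqlite://')  # swap for your MySQL connection URL
metadata = MetaData()

company_info = Table('company_info', metadata,
    Column('id', Integer, primary_key=True),
    Column('company_name', String(50)),
    Column('staff_num', Integer),
    Column('year', Integer))

company_income = Table('company_income', metadata,
    Column('id', Integer, primary_key=True),
    Column('company_name', String(50)),
    Column('income', Integer),
    Column('year', Integer))

metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(company_info.insert(), [
        {'company_name': 'A', 'staff_num': 10, 'year': 2010},
        {'company_name': 'A', 'staff_num': 20, 'year': 2012},
    ])
    conn.execute(company_income.insert(), [
        {'company_name': 'A', 'income': 10, 'year': 2010},
        {'company_name': 'A', 'income': 30, 'year': 2012},
    ])

# income / staff_num per company and year; * 1.0 forces float division
stmt = (select(company_info.c.company_name,
               company_info.c.year,
               (company_income.c.income * 1.0 /
                company_info.c.staff_num).label('avg_income'))
        .select_from(company_info.join(
            company_income,
            (company_info.c.company_name == company_income.c.company_name)
            & (company_info.c.year == company_income.c.year))))

with engine.connect() as conn:
    rows = conn.execute(stmt).all()
```

With the full sample data this yields avg_income 1, 2, 1.5 for company A and 1, 3 for company B, matching the desired result table.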

How do I get the change from the same quarter in the previous year in a pandas datatable grouped by more than 1 column

I have a datatable that looks like this (but with more than 1 country and many more years worth of data):
| Country | Year | Quarter | Amount |
-------------------------------------------
| UK | 2014 | 1 | 200 |
| UK | 2014 | 2 | 250 |
| UK | 2014 | 3 | 200 |
| UK | 2014 | 4 | 150 |
| UK | 2015 | 1 | 230 |
| UK | 2015 | 2 | 200 |
| UK | 2015 | 3 | 200 |
| UK | 2015 | 4 | 160 |
-------------------------------------------
I want to get the change for each row from the same quarter in the previous year. So for the first 4 rows in the example the change would be null (because there is no previous data for that quarter). For 2015 quarter 1, the difference would be 30 (because quarter 1 for the previous year is 200, so 230 - 200 = 30). So the data table I'm trying to get is:
| Country | Year | Quarter | Amount | Change |
---------------------------------------------------|
| UK | 2014 | 1 | 200 | NaN |
| UK | 2014 | 2 | 250 | NaN |
| UK | 2014 | 3 | 200 | NaN |
| UK | 2014 | 4 | 150 | NaN |
| UK | 2015 | 1 | 230 | 30 |
| UK | 2015 | 2 | 200 | -50 |
| UK | 2015 | 3 | 200 | 0 |
| UK | 2015 | 4 | 160 | 10 |
---------------------------------------------------|
From looking at other questions I've tried using the .diff() method but I'm not quite sure how to get it to do what I want (or if I'll actually need to do something more brute force to work this out), e.g. I've tried:
df.groupby(by=["Country", "Year", "Quarter"]).sum().diff().head(10)
This yields the difference from the previous row in the table as a whole though, rather than the difference from the same quarter for the previous year.
Since you want the change within each Country and Quarter combination across years, remove Year from the group keys:
df['Change'] = df.groupby(['Country', 'Quarter']).Amount.diff()
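A runnable sketch with the question's sample data (this assumes rows are sorted by Year within each group, as in the sample):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['UK'] * 8,
    'Year': [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
    'Quarter': [1, 2, 3, 4, 1, 2, 3, 4],
    'Amount': [200, 250, 200, 150, 230, 200, 200, 160],
})

# Group by Country and Quarter (not Year) so diff() compares each
# quarter with the same quarter of the previous year
df['Change'] = df.groupby(['Country', 'Quarter'])['Amount'].diff()
```

The first year of each Country/Quarter group has no predecessor, so its Change is NaN; the 2015 rows get 30, -50, 0 and 10 as in the desired table.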

Discount for repeated rows

User creates one plan to purchase some items on N (five) diff dates.
+--------------+---------+------------+
| plan_date_id | plan_id | ad_date |
+--------------+---------+------------+
| 1 | 1 | 2015-09-13 |
| 2 | 1 | 2015-09-15 |
| 3 | 1 | 2015-09-17 |
| 4 | 1 | 2015-09-21 |
| 5 | 1 | 2015-09-24 |
+--------------+---------+------------+
Week Span: for each product, the week span is calculated from the date on which the product was first sold, plus 6 days.
i.e., for product_ID 10, first purchase was made on 2015-09-13, so
the week span will be from 2015-09-13 to 2015-09-19(2015-09-13 + 6).
Discount logic for a product: (Total No. of Repetition in a Week Span - 1) * 10%.
But maximum discount can be 30%.
+-----------------+--------------+------------+
| plan_product_id | plan_date_id | product_id |
+-----------------+--------------+------------+
| 1 | 1 | 10 |
| 2 | 2 | 5715 |
| 3 | 2 | 10 |
| 4 | 3 | 10 |
| 5 | 3 | 128900 |
| 6 | 4 | 10 |
| 7 | 5 | 10 |
+-----------------+--------------+------------+
So in my example I want discount as follow.
+-----------------+--------------+------------+------------+
| plan_product_id | plan_date_id | product_id | discount |
+-----------------+--------------+------------+------------+
| 1 | 1 | 10 | 0% |
| 2 | 2 | 5715 | 0% |
| 3 | 2 | 10 | 10% |
| 4 | 3 | 10 | 20% |
| 5 | 3 | 128900 | 0% |
| 6 | 4 | 10 | 0% |
| 7 | 5 | 10 | 0% |
+-----------------+--------------+------------+------------+
Please note there will be a 0% discount for plan_product_id 6 and 7, because those plan dates fall outside the week span.
Currently, I am doing the discount calculation in Python.
First I get all the required records. Then I create a dict with product_id as key; the value is another dict holding the base date and the number of repetitions within the week span. Then I loop over all the records.
What will be the best way to do it?
Is it possible to do it purely in MySQL or the Django ORM?
Would looping in MySQL be more efficient?
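The dict-based Python approach described in the question can be sketched like this (a minimal sketch reproducing the sample tables; it assumes a product's week span is fixed by its first purchase date and that purchases outside the span simply earn no discount, which matches the expected output for plan_product_id 6 and 7):

```python
from datetime import date, timedelta

# Sample data from the question
plan_dates = {1: date(2015, 9, 13), 2: date(2015, 9, 15),
              3: date(2015, 9, 17), 4: date(2015, 9, 21),
              5: date(2015, 9, 24)}
plan_products = [  # (plan_product_id, plan_date_id, product_id)
    (1, 1, 10), (2, 2, 5715), (3, 2, 10), (4, 3, 10),
    (5, 3, 128900), (6, 4, 10), (7, 5, 10),
]

first_seen = {}   # product_id -> first purchase date (start of week span)
repeats = {}      # product_id -> repetitions counted so far in the span
discounts = {}    # plan_product_id -> discount percent

for pp_id, pd_id, prod_id in plan_products:
    d = plan_dates[pd_id]
    if prod_id not in first_seen:
        # First purchase of this product: starts the week span, no discount
        first_seen[prod_id] = d
        repeats[prod_id] = 1
        discounts[pp_id] = 0
    elif d <= first_seen[prod_id] + timedelta(days=6):
        # Inside the week span: (repetitions - 1) * 10%, capped at 30%
        discounts[pp_id] = min(repeats[prod_id] * 10, 30)
        repeats[prod_id] += 1
    else:
        # Outside the week span: no discount
        discounts[pp_id] = 0
```

For the sample data this produces 0%, 0%, 10%, 20%, 0%, 0%, 0% in plan_product_id order, matching the desired table. Doing the same in pure SQL is possible with a self-join on the first purchase date per product, but the cap and repetition count are considerably easier to express in Python or the ORM layer.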
