Python: Increment if a name and year appear in a dataframe

I am wondering if there is an efficient way in Python to count the occurrences of each horse's name based on the years in which it raced.
For example, consider the dataframe:
| | horse | year_race |
|---:|:--------------|------------:|
| 0 | Hackney | 2012 |
| 1 | Orlov Trotter | 2016 |
| 2 | Marwari | 2011 |
| 3 | Hackney | 2012 |
| 4 | Marwari | 2018 |
| 5 | Hackney | 2015 |
| 6 | Marwari | 2014 |
I would like the result to show the following:
{
"Hackney": 1,
"Orlov Trotter": 0,
"Marwari": 2
}
If a horse only raced once, then let the occurrence be 0. Hackney only has an occurrence of 1 because there is a duplicate entry in 2012. Marwari has an occurrence of 2 because the horse raced in 3 different years.
Is there a python way to solve this?
Thanks in advance

Use DataFrameGroupBy.nunique per horse, subtract 1 and convert to dictionary:
d = df.groupby('horse', sort=False)['year_race'].nunique().sub(1).to_dict()
print(d)
{'Hackney': 1, 'Orlov Trotter': 0, 'Marwari': 2}
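For anyone who wants to run the answer end to end, here is a minimal self-contained sketch that rebuilds the sample dataframe from the question. The variable `alt` is only an illustration of an equivalent route (drop exact duplicate rows, then count rows per horse) and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "horse": ["Hackney", "Orlov Trotter", "Marwari", "Hackney",
              "Marwari", "Hackney", "Marwari"],
    "year_race": [2012, 2016, 2011, 2012, 2018, 2015, 2014],
})

# Count distinct race years per horse, then subtract 1 so that a horse
# seen in a single year maps to 0. sort=False keeps first-appearance order.
d = df.groupby("horse", sort=False)["year_race"].nunique().sub(1).to_dict()
print(d)

# Equivalent idea: remove duplicate (horse, year) rows, then count rows per horse.
alt = df.drop_duplicates()["horse"].value_counts().sub(1).to_dict()
```

Both dictionaries hold the same counts; only the key order may differ.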

Related

Reorganize data using Dataframe

Hi! I received a very disorganized and messy Excel file and I need to reorganize it into a more presentable format. However, I am stuck on how to proceed. :(
Data received:
| 2020 | 2021 | 2022 | 2022 % Total | 2023E | 2024E | 2025E | ... |
| ---- | ---- | ---- | ------------ | ----- | ----- | ----- | --- |
| 0 | 3 | 6 | 9 | 12 | 15 | 18 | ... |
| 1 | 4 | 7 | 10 | 13 | 16 | 19 | ... |
| 2 | 5 | 8 | 11 | 14 | 17 | 20 | ... |
Expected output:
| Year | Value |
| ---- | ----- |
| 2020 | 0 |
| 2020 | 1 |
| 2020 | 2 |
| 2021 | 3 |
| 2021 | 4 |
| 2021 | 5 |
| 2022 | 6 |
| 2022 | 7 |
| 2022 | 8 |
The headers of the received file contain various years, starting from 2020.
I only need the data from the oldest year (2020) to the latest valid year (2022). Any data that comes after the header with the latest valid year is not required (e.g. everything starting with the header containing " % Total"). The latest valid year advances every year, so next year I will expect a new "2023" column as the 4th column.
After that I need to append the data from "2020", "2021" and "2022" into a new "Value" column. A new "Year" column should also be created holding the corresponding year header.
I am not sure whether it is something that can be achieved using Dataframe.
Any suggestions will be greatly appreciated!
Regards,
Shan
If you know the keys you want (i.e. 2020, 2021, 2022), you can do this:
import pandas as pd
# Create a dummy dataframe
df = pd.DataFrame({
    "2020": [0, 1, 2],
    "2021": [3, 4, 5],
    "2022": [6, 7, 8]
})
keys = ["2020", "2021", "2022"]
arr = []  # list of [year, value] rows holding the restructured data
for key in keys:
    # Append one [year, value] row per value in this year's column
    arr.extend([[key, v] for v in df[key]])
new_df = pd.DataFrame(arr, columns=["Year", "Value"])
new_df.head()
You could generate the years list in code too, instead of hard coding that:
start = 2020
end = 2022
keys = [str(i) for i in range(start, end + 1)]
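If you would rather not hard code the range at all, one possible approach is to detect the year columns from the headers and let `DataFrame.melt` do the reshaping. This is only a sketch: it assumes the valid year headers are exactly four digits and come first, and that everything from the first non-matching header onward (such as "2022 % Total") can be dropped. The extra columns in the dummy frame are taken from the question's sample.

```python
import pandas as pd

# Dummy frame mimicking the received file, including unwanted trailing columns
df = pd.DataFrame({
    "2020": [0, 1, 2],
    "2021": [3, 4, 5],
    "2022": [6, 7, 8],
    "2022 % Total": [9, 10, 11],
    "2023E": [12, 13, 14],
})

# Collect leading columns whose header is exactly four digits; stop at the
# first header that is not. str() also copes with integer headers from Excel.
year_cols = []
for col in df.columns:
    name = str(col)
    if not (len(name) == 4 and name.isdigit()):
        break
    year_cols.append(col)

# melt stacks the selected columns into (Year, Value) rows, column by column
new_df = df[year_cols].melt(var_name="Year", value_name="Value")
print(new_df)
```

With this version a new "2023" column would be picked up next year without editing the code.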

Pandas Dataframe keep rows where values of 2 columns are in a list of couples

I have a list of couples :
year_month = [(2020,8), (2021,1), (2021,6)]
and a dataframe df
| ID | Year | Month |
| 1 | 2020 | 1 |
| ... |
| 1 | 2020 | 12 |
| 1 | 2021 | 1 |
| ... |
| 1 | 2021 | 12 |
| 2 | 2020 | 1 |
| ... |
| 2 | 2020 | 12 |
| 2 | 2021 | 1 |
| ... |
| 2 | 2021 | 12 |
| 3 | 2021 | 1 |
| ... |
I want to select rows where Year and Month are corresponding to one of the couples in the year_month list :
Output df :
| ID | Year | Month |
| 1 | 2020 | 8 |
| 1 | 2021 | 1 |
| 1 | 2021 | 6 |
| 2 | 2020 | 8 |
| 2 | 2021 | 1 |
| 2 | 2021 | 6 |
| 3 | 2020 | 8 |
| ... |
Any idea how to automate this, so that I only have to change the year_month couples?
I want to put many couples in year_month, so I want to keep a list of couples rather than spell out every combination in the filter.
I don't want to do this:
df = df[((df['Year'] == 2020) & (df['Month'] == 8)) |
((df['Year'] == 2021) & (df['Month'] == 1)) | ((df['Year'] == 2021) & (df['Month'] == 6))]
You can use a list comprehension and filter your dataframe with your list of tuples as below:
year_month = [(2020,8), (2021,1), (2021,6)]
df[[i in year_month for i in zip(df.Year,df.Month)]]
Which gives only the paired values back:
ID Year Month
2 1 2021 1
6 2 2021 1
8 3 2021 1
One way using pandas.DataFrame.merge:
df.merge(pd.DataFrame(year_month, columns=["Year", "Month"]))
Output:
ID Year Month
0 1 2021 1
1 2 2021 1
2 3 2021 1
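Here is a small runnable sketch comparing two vectorized ways to do this filter. The sample frame is hypothetical (two IDs with every month of 2020 and 2021), and the `MultiIndex.isin` variant is an alternative to the list comprehension above rather than something from the answers.

```python
import pandas as pd

year_month = [(2020, 8), (2021, 1), (2021, 6)]

# Hypothetical sample: IDs 1 and 2, all twelve months of 2020 and 2021
df = pd.DataFrame(
    [(i, y, m) for i in (1, 2) for y in (2020, 2021) for m in range(1, 13)],
    columns=["ID", "Year", "Month"],
)

# Option 1: build a MultiIndex from the two columns and test membership.
# This keeps the original row index of df.
mask = pd.MultiIndex.from_frame(df[["Year", "Month"]]).isin(year_month)
by_mask = df[mask]

# Option 2: inner merge against a small lookup frame of the wanted pairs.
# merge joins on the shared column names and returns a fresh 0..n-1 index.
lookup = pd.DataFrame(year_month, columns=["Year", "Month"])
by_merge = df.merge(lookup)

print(by_mask)
print(by_merge)
```

Either way, only `year_month` has to change when the set of wanted pairs changes.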

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like: [ id | year | month | product_id | sales ]. I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
also I would like to add prev_year_mean_sale and prev_year_id_sale
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use DataFrame.merge():
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales": "prev_month_id_sale"}),
                  how="left",
                  left_on=["year", "prev_month", "product_id"],
                  right_on=["year", "month", "product_id"])
The result will have more columns than you need, so you should drop() some of them and/or rename() others.
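The merge idea can be extended to produce both requested columns. The sketch below makes one assumption that differs from `month - 1`: "previous month" is taken to mean the previous (year, month) period that actually occurs in the data, which is what the question's expected table suggests for January 2019. The `period` helper column is introduced here for illustration only, and the code assumes each product appears at most once per period.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": range(1, 11),
    "year": [2018] * 8 + [2019] * 2,
    "month": [1, 1, 1, 2, 2, 3, 3, 3, 1, 1],
    "product_id": [123, 234, 345, 123, 345, 123, 234, 567, 234, 567],
    "sales": [5, 4, 2, 3, 2, 4, 6, 7, 4, 3],
})

# Number the (year, month) periods 0, 1, 2, ... in chronological order
df["period"] = df.groupby(["year", "month"]).ngroup()

# Mean sales of each period, looked up for the period before each row
period_mean = df.groupby("period")["sales"].mean()
df["prev_month_mean_sale"] = (df["period"] - 1).map(period_mean)

# Same product's sales one period earlier: shift the period forward by one
# on a copy and left-merge it back on (period, product_id)
prev = (df[["period", "product_id", "sales"]]
        .rename(columns={"sales": "prev_month_id_sale"})
        .assign(period=lambda x: x["period"] + 1))
df = df.merge(prev, on=["period", "product_id"], how="left")
print(df)
```

Products with no sale in the previous period come out as NaN here; if 0 is preferred for those rows, they can be filled afterwards, taking care not to fill the first period.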

sqlalchemy how to divide 2 columns from different table

I have 2 tables named as company_info and company_income:
company_info :
| id | company_name | staff_num | year |
|----|--------------|-----------|------|
| 0 | A | 10 | 2010 |
| 1 | A | 10 | 2011 |
| 2 | A | 20 | 2012 |
| 3 | B | 20 | 2010 |
| 4 | B | 5 | 2011 |
company_income :
| id | company_name | income | year |
|----|--------------|--------|------|
| 0 | A | 10 | 2010 |
| 1 | A | 20 | 2011 |
| 2 | A | 30 | 2012 |
| 3 | B | 20 | 2010 |
| 4 | B | 15 | 2011 |
Now I want to calculate average staff income of each company, the result looks like this:
result :
| id | company_name | avg_income | year |
|----|--------------|------------|------|
| 0 | A | 1 | 2010 |
| 1 | A | 2 | 2011 |
| 2 | A | 1.5 | 2012 |
| 3 | B | 1 | 2010 |
| 4 | B | 3 | 2011 |
How can I get this result using Python and SQLAlchemy? The database is MySQL.
Join the tables and do a standard division. You'd want to either set up a view in MySQL with this query or build it directly in your program.
SELECT
a.company_name,
a.year,
(b.income / a.staff_num) AS avg_income
FROM
company_info AS a
LEFT JOIN
company_income AS b
ON
a.company_name = b.company_name
AND
a.year = b.year
You'd want a few WHERE conditions as well (such as staff_num not being null or equal to 0, and the same for income). Also, if you can have multiple rows for the same company and year in either table, you'll want to SUM the values in each column and then group by company_name and year.
Try this:
SELECT
info.company_name,
(inc.income / info.staff_num) as avg,
info.year
FROM
company_info info JOIN company_income inc
ON
info.company_name = inc.company_name
AND
info.year = inc.year
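Since the question asks for SQLAlchemy specifically, here is a rough sketch of the same join written with SQLAlchemy Core (1.4 or later style `select`). To keep it runnable without a server, it uses an in-memory SQLite database as a stand-in for MySQL, which is an assumption of this example; the table definitions are reconstructed from the question, and the explicit cast to Float is there so the division is not truncated on backends that do integer division.

```python
from sqlalchemy import (Column, Float, Integer, MetaData, String, Table,
                        and_, cast, create_engine, select)

# In-memory SQLite stands in for the MySQL database (assumption for the demo)
engine = create_engine("sqlite:///:memory:")
metadata = MetaData()

company_info = Table(
    "company_info", metadata,
    Column("id", Integer, primary_key=True),
    Column("company_name", String(50)),
    Column("staff_num", Integer),
    Column("year", Integer),
)
company_income = Table(
    "company_income", metadata,
    Column("id", Integer, primary_key=True),
    Column("company_name", String(50)),
    Column("income", Integer),
    Column("year", Integer),
)

# Join on company and year, then divide income by staff count
on_clause = and_(company_info.c.company_name == company_income.c.company_name,
                 company_info.c.year == company_income.c.year)
stmt = (
    select(
        company_info.c.company_name,
        company_info.c.year,
        (cast(company_income.c.income, Float)
         / company_info.c.staff_num).label("avg_income"),
    )
    .select_from(company_info.join(company_income, on_clause))
    .order_by(company_info.c.company_name, company_info.c.year)
)

sample = [("A", 10, 10, 2010), ("A", 10, 20, 2011), ("A", 20, 30, 2012),
          ("B", 20, 20, 2010), ("B", 5, 15, 2011)]  # (name, staff, income, year)

with engine.begin() as conn:
    metadata.create_all(conn)
    conn.execute(company_info.insert(),
                 [{"company_name": n, "staff_num": s, "year": y} for n, s, _, y in sample])
    conn.execute(company_income.insert(),
                 [{"company_name": n, "income": i, "year": y} for n, _, i, y in sample])
    rows = [tuple(r) for r in conn.execute(stmt)]

print(rows)
```

Against MySQL only the `create_engine` URL would change; rows with a zero or NULL staff_num would still need filtering as noted in the first answer.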

How do I get the change from the same quarter in the previous year in a pandas datatable grouped by more than 1 column

I have a datatable that looks like this (but with more than 1 country and many more years worth of data):
| Country | Year | Quarter | Amount |
-------------------------------------------
| UK | 2014 | 1 | 200 |
| UK | 2014 | 2 | 250 |
| UK | 2014 | 3 | 200 |
| UK | 2014 | 4 | 150 |
| UK | 2015 | 1 | 230 |
| UK | 2015 | 2 | 200 |
| UK | 2015 | 3 | 200 |
| UK | 2015 | 4 | 160 |
-------------------------------------------
I want to get the change for each row from the same quarter in the previous year. So for the first 4 rows in the example the change would be null (because there is no previous data for that quarter). For 2015 quarter 1, the difference would be 30 (because quarter 1 for the previous year is 200, so 230 - 200 = 30). So the data table I'm trying to get is:
| Country | Year | Quarter | Amount | Change |
---------------------------------------------------|
| UK | 2014 | 1 | 200 | NaN |
| UK | 2014 | 2 | 250 | NaN |
| UK | 2014 | 3 | 200 | NaN |
| UK | 2014 | 4 | 150 | NaN |
| UK | 2015 | 1 | 230 | 30 |
| UK | 2015 | 2 | 200 | -50 |
| UK | 2015 | 3 | 200 | 0 |
| UK | 2015 | 4 | 160 | 10 |
---------------------------------------------------|
From looking at other questions I've tried using the .diff() method but I'm not quite sure how to get it to do what I want (or if I'll actually need to do something more brute force to work this out), e.g. I've tried:
df.groupby(by=["Country", "Year", "Quarter"]).sum().diff().head(10)
This yields the difference from the previous row in the table as a whole though, rather than the difference from the same quarter for the previous year.
Since you want the change over Country and quarter and not the year, you have to remove the year from the group.
df['Change'] = df.groupby(['Country', 'Quarter']).Amount.diff()
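A self-contained sketch of this answer on the question's sample data follows. The explicit sort is an addition: `diff()` compares each row with the previous row of its group, so the rows must be in year order within each (Country, Quarter) group, and the result only equals the year-over-year change when no year is missing for that quarter.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Country": ["UK"] * 8,
    "Year": [2014] * 4 + [2015] * 4,
    "Quarter": [1, 2, 3, 4] * 2,
    "Amount": [200, 250, 200, 150, 230, 200, 200, 160],
})

# Make sure years are ascending inside each (Country, Quarter) group
df = df.sort_values(["Country", "Quarter", "Year"])

# Difference from the previous row of the same country and quarter
df["Change"] = df.groupby(["Country", "Quarter"])["Amount"].diff()

# Restore the original row order for display
df = df.sort_index()
print(df)
```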
