Reorganize data using DataFrame - Python

Hi! I received a very disorganized and messy Excel file and I need to reorganize it into a more presentable format. However, I am stuck on how to proceed. :(
Data received:
| 2020 | 2021 | 2022 | 2022 % Total | 2023E | 2024E | 2025E | ... |
| ---- | ---- | ---- | ------------ | ----- | ----- | ----- | --- |
| 0 | 3 | 6 | 9 | 12 | 15 | 18 | ... |
| 1 | 4 | 7 | 10 | 13 | 16 | 19 | ... |
| 2 | 5 | 8 | 11 | 14 | 17 | 20 | ... |
Expected output:
| Year | Value |
| ---- | ----- |
| 2020 | 0 |
| 2020 | 1 |
| 2020 | 2 |
| 2021 | 3 |
| 2021 | 4 |
| 2021 | 5 |
| 2022 | 6 |
| 2022 | 7 |
| 2022 | 8 |
The headers of the received file contain various years, starting from 2020.
Here is how it works: I only need the data from the oldest year (2020) up to the latest valid year (2022); any column after the one with the latest valid year is not required (e.g. starting with the header containing " % Total"). A new valid year is added every year, so next year I expect a new "2023" column in the 4th position.
After that, I need to append the data from the "2020", "2021" and "2022" columns into a new "Value" column, with a "Year" column created from the corresponding year headers.
I am not sure whether this is something that can be achieved using a DataFrame.
Any suggestions will be greatly appreciated!
Regards,
Shan

If you know the keys you want (i.e. 2020, 2021, 2022), you can do this:
import pandas as pd

# Create a dummy dataframe
df = pd.DataFrame({
    "2020": [0, 1, 2],
    "2021": [3, 4, 5],
    "2022": [6, 7, 8]
})

keys = ["2020", "2021", "2022"]

arr = []  # 2D list to hold the restructured data
for key in keys:
    arr.extend([[key, v] for v in df[key]])

new_df = pd.DataFrame(arr, columns=["Year", "Value"])
new_df.head()
You could also generate the list of years in code, instead of hard-coding it:
start = 2020
end = 2022
keys = [str(i) for i in range(start, end + 1)]
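Since the set of year columns grows every year, a melt-based sketch can avoid hard-coding the keys entirely. This assumes the wanted headers are exactly the plain 4-digit year columns, so headers like "2022 % Total" and "2023E" are skipped automatically:

```python
import pandas as pd

# Dummy frame mimicking the received file, including the unwanted columns
df = pd.DataFrame({
    "2020": [0, 1, 2],
    "2021": [3, 4, 5],
    "2022": [6, 7, 8],
    "2022 % Total": [9, 10, 11],
    "2023E": [12, 13, 14],
})

# Keep only headers that are plain digit strings ("2022 % Total", "2023E" drop out)
year_cols = [c for c in df.columns if c.isdigit()]

# melt stacks the year columns into (Year, Value) pairs, column by column
new_df = df[year_cols].melt(var_name="Year", value_name="Value")
```

Because melt walks the columns in order, the rows come out grouped by year (all 2020 values first, then 2021, then 2022), matching the expected output.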

Related

Python: Increment if a name and year appears in a dataframe

I am wondering if there is an efficient way in Python to count the occurrences of each horse's name based on the years it raced.
For example, consider the dataframe:
| | horse | year_race |
|---:|:--------------|------------:|
| 0 | Hackney | 2012 |
| 1 | Orlov Trotter | 2016 |
| 2 | Marwari | 2011 |
| 3 | Hackney | 2012 |
| 4 | Marwari | 2018 |
| 5 | Hackney | 2015 |
| 6 | Marwari | 2014 |
I would like the result to show the following:
{
    "Hackney": 1,
    "Orlov Trotter": 0,
    "Marwari": 2
}
If a horse only raced once, then let the occurrence be 0. Hackney only has an occurrence of 1 because there is a duplicate entry in 2012. Marwari has occurrences of 2 because the horse has raced 3 different years.
Is there a python way to solve this?
Thanks in advance
Use DataFrameGroupBy.nunique per horse, subtract 1 and convert the result to a dictionary:
d = df.groupby('horse', sort=False)['year_race'].nunique().sub(1).to_dict()
print(d)
{'Hackney': 1, 'Orlov Trotter': 0, 'Marwari': 2}
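An equivalent sketch, if you prefer to avoid nunique, drops duplicate (horse, year) pairs first and counts what remains per horse; the data below rebuilds the sample table from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "horse": ["Hackney", "Orlov Trotter", "Marwari", "Hackney",
              "Marwari", "Hackney", "Marwari"],
    "year_race": [2012, 2016, 2011, 2012, 2018, 2015, 2014],
})

# Drop duplicate (horse, year) pairs, count the remaining rows per horse,
# then subtract 1 so a horse that raced in a single year gets 0
d = (df.drop_duplicates(["horse", "year_race"])
       .groupby("horse", sort=False)["year_race"]
       .size()
       .sub(1)
       .to_dict())
```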

Groupby to compare, identify, and make notes on max date

I am working with the following table:
+------+------+------+------+---------------+---------+-------+
| ID 1 | ID 2 | Date | Type | Marked_Latest | Updated | Notes |
+------+------+------+------+---------------+---------+-------+
| 1 | 100 | 2001 | SMT | | | |
| 1 | 101 | 2005 | SMT | | | |
| 1 | 102 | 2020 | SMT | Latest | | |
| 1 | 103 | 2020 | SMT | | | |
| 1 | 103 | 2020 | ABT | | | |
| 2 | 201 | 2009 | CMT | Latest | | |
| 2 | 202 | 2022 | SMT | | | |
| 2 | 203 | 2022 | SMT | | | |
+------+------+------+------+---------------+---------+-------+
I am trying to perform the following steps using a df.query() but since there are so many caveats I am not sure how to fit them all in.
Step 1: Looking only at Type == "SMT" or Type == "CMT", group by ID 1 and identify the latest date, then compare this (grouped ID 1 data) to the date of the row with Marked_Latest == "Latest" (essentially, just verifying that the date is correct)
Step 2: If the date values are the same, do nothing. If they differ, write the ID 2 of the actual latest row next to the original Marked_Latest == "Latest" row, in Updated
Step 3: If multiple Latest have the same max Date, put a note in Notes that says "multiple".
This will result in the following table:
+------+------+------+------+---------------+---------+----------+
| ID 1 | ID 2 | Date | Type | Marked_Latest | Updated | Notes |
+------+------+------+------+---------------+---------+----------+
| 1 | 100 | 2001 | SMT | | | |
| 1 | 101 | 2005 | SMT | | | |
| 1 | 102 | 2020 | SMT | Latest | | multiple |
| 1 | 103 | 2020 | SMT | | | multiple |
| 1 | 103 | 2020 | ABT | | | |
| 2 | 201 | 2009 | CMT | Latest | 203 | |
| 2 | 202 | 2022 | SMT | | | multiple |
| 2 | 203 | 2022 | SMT | | | multiple |
+------+------+------+------+---------------+---------+----------+
To summarize: check that the latest date is actually marked as latest date. If it is not marked as latest date, write the updated ID 2 next to the original (incorrect) latest date. And when there are multiple cases of latest date, inputting "multiple" for each ID of latest date.
I have gotten only as far as identifying the actual latest date, using
q = df.query('Type == "SMT" or Type == "CMT"').groupby('ID 1').last()
q
This will return a subset with the latest dates marked, but I am not sure how to proceed from here, i.e. how to now compare this dataframe with the date field corresponding to Marked_Latest.
All help appreciated.
Use:
import numpy as np

# ID from ID 1 only if Type matches the conditions
df['ID'] = df['ID 1'].where(df['Type'].isin(['SMT','CMT']))
# get last Date, ID 2 per `ID` into columns Notes, Updated
df[['Notes', 'Updated']] = df.groupby('ID')[['Date', 'ID 2']].transform('last')
# compare latest date in Notes with original Date
m1 = df['Notes'].ne(df['Date'])
# if no match, set an empty string
df['Updated'] = df['Updated'].where(m1 & df['Marked_Latest'].eq('Latest'), '')
# if the latest date is duplicated, set the value 'multiple'
df['Notes'] = np.where(df.duplicated(['ID 1','Date'], keep=False) & ~m1, 'multiple', '')
df = df.drop('ID', axis=1)
print(df)
   ID 1  ID 2  Date Type Marked_Latest Updated     Notes
0     1   100  2001  SMT           NaN
1     1   101  2005  SMT           NaN
2     1   102  2020  SMT        Latest          multiple
3     1   103  2020  SMT           NaN          multiple
4     1   103  2020  ABT           NaN
5     2   201  2009  CMT        Latest   203.0
6     2   202  2022  SMT           NaN          multiple
7     2   203  2022  SMT           NaN          multiple
Try:
import pandas as pd

cols = ['ID 1', 'ID 2', 'Date', 'Type', 'Marked_Latest', 'Updated', 'Notes']
data = [[1, 100, 2001, 'SMT', '', '', ''],
        [1, 101, 2005, 'SMT', '', '', ''],
        [1, 102, 2020, 'SMT', 'Latest', '', ''],
        [1, 103, 2020, 'SMT', '', '', ''],
        [1, 103, 2020, 'ABT', '', '', '']]
df = pd.DataFrame(data, columns=cols)

temp = df[(df['Type'] == "SMT") | (df['Type'] == "CMT")]
new = temp.groupby('ID 1')['ID 2'].last().values[0]
latest = temp[temp['Marked_Latest'] == 'Latest']
nind = temp[temp['ID 2'] == new].index
if new != latest['ID 2'].values[0]:
    df.loc[latest.index, 'Updated'] = new
    df.loc[latest.index, 'Notes'] = 'multiple'
    df.loc[nind, 'Notes'] = 'multiple'

Pandas Dataframe keep rows where values of 2 columns are in a list of couples

I have a list of (year, month) couples:
year_month = [(2020,8), (2021,1), (2021,6)]
and a dataframe df
| ID | Year | Month |
| 1 | 2020 | 1 |
| ... |
| 1 | 2020 | 12 |
| 1 | 2021 | 1 |
| ... |
| 1 | 2021 | 12 |
| 2 | 2020 | 1 |
| ... |
| 2 | 2020 | 12 |
| 2 | 2021 | 1 |
| ... |
| 2 | 2021 | 12 |
| 3 | 2021 | 1 |
| ... |
I want to select rows where Year and Month are corresponding to one of the couples in the year_month list :
Output df :
| ID | Year | Month |
| 1 | 2020 | 8 |
| 1 | 2021 | 1 |
| 1 | 2021 | 6 |
| 2 | 2020 | 8 |
| 2 | 2021 | 1 |
| 2 | 2021 | 6 |
| 3 | 2020 | 8 |
| ... |
Any idea on how to automate this, so that I only have to change the year_month couples?
I want to put many couples in year_month, so I want to keep a list of couples rather than spelling out every combination in the filter. I don't want to do this:
df = df[((df['Year'] == 2020) & (df['Month'] == 8)) |
((df['Year'] == 2021) & (df['Month'] == 1)) | ((df['Year'] == 2021) & (df['Month'] == 6))]
You can use a list comprehension and filter your dataframe with your list of tuples as below:
year_month = [(2020,8), (2021,1), (2021,6)]
df[[i in year_month for i in zip(df.Year,df.Month)]]
Which gives only the paired values back:
ID Year Month
2 1 2021 1
6 2 2021 1
8 3 2021 1
One way using pandas.DataFrame.merge:
df.merge(pd.DataFrame(year_month, columns=["Year", "Month"]))
Output:
ID Year Month
0 1 2021 1
1 2 2021 1
2 3 2021 1
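For a large frame, the list comprehension can be replaced with a vectorized membership test on a Series of (Year, Month) tuples via Series.isin; here is a sketch with a small dummy frame (the data values are assumptions, not the question's full table):

```python
import pandas as pd

year_month = [(2020, 8), (2021, 1), (2021, 6)]

df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 3],
    "Year":  [2020, 2020, 2021, 2021, 2021, 2021],
    "Month": [1, 8, 1, 1, 6, 12],
})

# Build a Series of (Year, Month) tuples aligned with df, then test membership
# against the list of couples in one vectorized call
mask = pd.Series(list(zip(df["Year"], df["Month"])), index=df.index).isin(year_month)
out = df[mask]
```

Only rows whose (Year, Month) pair appears in year_month survive, so extending the filter means editing the list only.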

pull row with max date from groupby in python pandas

I'm trying to pull the max date from a df in the below format
columns: index1, index2, col1 (index1 and index2 form a MultiIndex)
| index1 | index2 | col1 |
| ------ | ------ | ---- |
| place1 | 2018 | 5 |
| place1 | 2019 | 4 |
| place1 | 2020 | 2 |
| place2 | 2016 | 9 |
| place2 | 2017 | 8 |
| place3 | 2018 | 6 |
| place3 | 2019 | 1 |
I'm trying to pull rows out for the maximum years available for each place. In the above example the final df would be:
| place1 | 2020 | 2 |
| place2 | 2017 | 8 |
| place3 | 2019 | 1 |
You can use DataFrame.sort_values() followed by groupby().last() to pick the row with the maximum year in each group. In your case:
df.sort_values("index2").groupby("index1").last()
I think it may work for you.
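As a concrete sketch of that one-liner, assuming the places and years are rebuilt as ordinary columns named index1 and index2:

```python
import pandas as pd

# Rebuild the sample as a flat frame (index1/index2 as columns)
df = pd.DataFrame({
    "index1": ["place1", "place1", "place1", "place2", "place2", "place3", "place3"],
    "index2": [2018, 2019, 2020, 2016, 2017, 2018, 2019],
    "col1":   [5, 4, 2, 9, 8, 6, 1],
})

# After sorting by year, .last() within each place keeps the max-year row
result = df.sort_values("index2").groupby("index1").last()
```

The result is indexed by place, with index2 and col1 holding the latest year and its value.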
I am a newbie in Python, but maybe this can help:
import pandas as pd

data = [['place1', '2018', '5'],
        ['place1', '2019', '4'],
        ['place1', '2020', '2'],
        ['place2', '2016', '9'],
        ['place2', '2017', '8'],
        ['place3', '2018', '6'],
        ['place3', '2019', '1']]
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df.set_index(['index1', 'index2'], inplace=True)

# Flatten the year level back out, keep the newest year per place, then restore the index
df.reset_index(level=1, inplace=True)
df = df.sort_values(['index1', 'index2'], ascending=False).groupby('index1').first()
df.set_index('index2', append=True, inplace=True)

How do I get the change from the same quarter in the previous year in a pandas datatable grouped by more than 1 column

I have a datatable that looks like this (but with more than 1 country and many more years worth of data):
| Country | Year | Quarter | Amount |
-------------------------------------------
| UK | 2014 | 1 | 200 |
| UK | 2014 | 2 | 250 |
| UK | 2014 | 3 | 200 |
| UK | 2014 | 4 | 150 |
| UK | 2015 | 1 | 230 |
| UK | 2015 | 2 | 200 |
| UK | 2015 | 3 | 200 |
| UK | 2015 | 4 | 160 |
-------------------------------------------
I want to get the change for each row from the same quarter in the previous year. So for the first 4 rows in the example the change would be null (because there is no previous data for that quarter). For 2015 quarter 1, the difference would be 30 (because quarter 1 for the previous year is 200, so 230 - 200 = 30). So the data table I'm trying to get is:
| Country | Year | Quarter | Amount | Change |
---------------------------------------------------|
| UK | 2014 | 1 | 200 | NaN |
| UK | 2014 | 2 | 250 | NaN |
| UK | 2014 | 3 | 200 | NaN |
| UK | 2014 | 4 | 150 | NaN |
| UK | 2015 | 1 | 230 | 30 |
| UK | 2015 | 2 | 200 | -50 |
| UK | 2015 | 3 | 200 | 0 |
| UK | 2015 | 4 | 160 | 10 |
---------------------------------------------------|
From looking at other questions I've tried using the .diff() method but I'm not quite sure how to get it to do what I want (or if I'll actually need to do something more brute force to work this out), e.g. I've tried:
df.groupby(by=["Country", "Year", "Quarter"]).sum().diff().head(10)
This yields the difference from the previous row in the table as a whole though, rather than the difference from the same quarter for the previous year.
Since you want the change over Country and Quarter rather than between consecutive rows, remove the year from the grouping so that diff() compares the same quarter across years:
df['Change'] = df.groupby(['Country', 'Quarter']).Amount.diff()
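A quick sketch with the sample data, to show the pairing; note this assumes the rows are already ordered by Year within each (Country, Quarter) group, since diff() works in row order:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["UK"] * 8,
    "Year":    [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
    "Quarter": [1, 2, 3, 4, 1, 2, 3, 4],
    "Amount":  [200, 250, 200, 150, 230, 200, 200, 160],
})

# Grouping by Country and Quarter pairs each row with the same quarter a
# year earlier, so diff() yields the year-over-year change (NaN for the
# first year, which has nothing to compare against)
df["Change"] = df.groupby(["Country", "Quarter"])["Amount"].diff()
```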
