Conditional subtraction of Pandas Columns - python

I have a dataframe with stock returns and I would like to create a new column that contains the difference between that stock return and the return of the sector ETF it belongs to:
import numpy as np
import pandas as pd

dict0 = {'date': ['1/1/2020', '1/1/2020', '1/1/2020', '1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020', '1/2/2020', '1/2/2020',
'1/2/2020', '1/3/2020', '1/3/2020', '1/3/2020', '1/3/2020', '1/3/2020'],
'ticker': ['SPY', 'AAPL', 'AMZN', 'XLK', 'XLY', 'SPY', 'AAPL', 'AMZN', 'XLK', 'XLY', 'SPY', 'AAPL', 'AMZN', 'XLK', 'XLY'],
'returns': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5],
'sector': [np.nan, 'Tech', 'Cons Disc', np.nan, np.nan, np.nan, 'Tech', 'Cons Disc', np.nan, np.nan, np.nan, 'Tech', 'Cons Disc', np.nan, np.nan]}
df = pd.DataFrame(dict0)
df = df.set_index(['date', 'ticker'])
That is, for instance, for AAPL on 1/1/2020 the return is 2. Since AAPL belongs to the Tech sector, the relevant benchmark is the ETF XLK (I have a dictionary that maps sectors to ETF tickers). Then the new column would contain AAPL's return of 2 minus XLK's return on that day of 4.
I asked a similar question in the post below, where I wanted to compute the difference of each stock return to a single ticker, namely SPY.
Computing excess returns
The solution presented there was this:
def func(row):
    date, asset = row.name
    return df.loc[(date, asset), 'returns'] - df.loc[(date, 'SPY'), 'returns']
dict0 = {'date': ['1/1/2020', '1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020',
'1/2/2020', '1/3/2020', '1/3/2020', '1/3/2020'],
'ticker': ['SPY', 'AAPL', 'MSFT', 'SPY', 'AAPL', 'MSFT', 'SPY', 'AAPL', 'MSFT'],
'returns': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(dict0)
df = df.set_index(['date', 'ticker'])
df['excess_returns'] = df.apply(func, axis=1)
But I haven't been able to modify it so that I can do this sector based. I appreciate any suggestions.

You are almost there:
def func(row):
    date, asset = row.name
    index = sector_to_index_mapping[row.sector]
    return df.loc[(date, asset), 'returns'] - df.loc[(date, index), 'returns']
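Putting it together, a minimal runnable sketch. The sector_to_index_mapping dict below is a stand-in for the asker's own sector-to-ETF dictionary, and rows without a sector (the ETFs themselves) get NaN rather than a lookup error:

```python
import numpy as np
import pandas as pd

# Stand-in for the asker's own sector -> ETF ticker dictionary
sector_to_index_mapping = {'Tech': 'XLK', 'Cons Disc': 'XLY'}

dict0 = {'date': ['1/1/2020'] * 5 + ['1/2/2020'] * 5 + ['1/3/2020'] * 5,
         'ticker': ['SPY', 'AAPL', 'AMZN', 'XLK', 'XLY'] * 3,
         'returns': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5],
         'sector': [np.nan, 'Tech', 'Cons Disc', np.nan, np.nan] * 3}
df = pd.DataFrame(dict0).set_index(['date', 'ticker'])

def func(row):
    date, asset = row.name
    if pd.isna(row['sector']):      # ETFs and indices carry no sector
        return np.nan
    etf = sector_to_index_mapping[row['sector']]
    return df.loc[(date, asset), 'returns'] - df.loc[(date, etf), 'returns']

df['excess_returns'] = df.apply(func, axis=1)
```

For AAPL on 1/1/2020 this yields 2 - 4 = -2, matching the example in the question.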

Related

How to find a total year sales from a dictionary?

I have this data, and my current code only produces totals for June, May, and September. How would I handle the months that are not present in the data? Obviously, their total should be zero.
{'account': 'Amazon', 'amount': 300, 'day': 3, 'month': 'June'}
{'account': 'Facebook', 'amount': 550, 'day': 5, 'month': 'May'}
{'account': 'Google', 'amount': -200, 'day': 21, 'month': 'June'}
{'account': 'Amazon', 'amount': -300, 'day': 12, 'month': 'June'}
{'account': 'Facebook', 'amount': 130, 'day': 7, 'month': 'September'}
{'account': 'Google', 'amount': 250, 'day': 27, 'month': 'September'}
{'account': 'Amazon', 'amount': 200, 'day': 5, 'month': 'May'}
The method I used for the months present in the data:
year_balance = sum(d["amount"] for d in my_dict)
print(f"The total year balance is {year_balance} $.")
import calendar
months = calendar.month_name[1:]
results = dict(zip(months, [0]*len(months)))
for d in data:
    results[d["month"]] += d["amount"]
# then you have results dict with monthly amounts
# sum everything to get yearly total
total = sum(results.values())
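The same zero-initialized tally can be written compactly with the standard library's collections.Counter; this sketch inlines the records from the question as a list:

```python
from collections import Counter
import calendar

data = [
    {'account': 'Amazon', 'amount': 300, 'day': 3, 'month': 'June'},
    {'account': 'Facebook', 'amount': 550, 'day': 5, 'month': 'May'},
    {'account': 'Google', 'amount': -200, 'day': 21, 'month': 'June'},
    {'account': 'Amazon', 'amount': -300, 'day': 12, 'month': 'June'},
    {'account': 'Facebook', 'amount': 130, 'day': 7, 'month': 'September'},
    {'account': 'Google', 'amount': 250, 'day': 27, 'month': 'September'},
    {'account': 'Amazon', 'amount': 200, 'day': 5, 'month': 'May'},
]

# Seed every month with 0 so absent months still appear, then tally
totals = Counter({m: 0 for m in calendar.month_name[1:]})
for d in data:
    totals[d['month']] += d['amount']

year_balance = sum(totals.values())
```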
This might help:
from collections import defaultdict
mydict = defaultdict(int)  # missing keys default to 0
print(mydict["January"])
Also, given the comments you have written, is this what you are looking for?
your_list_of_dicts = [
{"January": 3, "March": 5},
{"January": 3, "April": 5}
]
import calendar
months = calendar.month_name[1:]
month_totals = dict()
for month in months:
    month_totals[month] = 0
    for d in your_list_of_dicts:
        month_totals[month] += d[month] if month in d else 0
print(month_totals)
{'January': 6, 'February': 0, 'March': 5, 'April': 5, 'May': 0, 'June': 0, 'July': 0, 'August': 0, 'September': 0, 'October': 0, 'November': 0, 'December': 0}
You can read the following blog regarding the usage of dictionaries and how to perform calculations.
5 best ways to sum dictionary values in python
This is one of the examples given in the blog.
wages = {'01': 910.56, '02': 1298.68, '03': 1433.99, '04': 1050.14, '05': 877.67}
total = sum(wages.values())
print('Total Wages: ${0:,.2f}'.format(total))

How do I rearrange nested Pandas DataFrame columns?

In the DataFrame below, I want to rearrange the nested columns, i.e. to have 'region_sea' appear before 'region_inland':
df = pd.DataFrame( {'state': ['WA', 'CA', 'NY', 'NY', 'CA', 'CA', 'WA' ]
, 'region': ['region_sea', 'region_inland', 'region_sea', 'region_inland', 'region_sea', 'region_sea', 'region_inland',]
, 'count': [1, 3, 4, 6, 7, 8, 4]
, 'income': [100, 200, 300, 400, 600, 400, 300]
}
)
df = df.pivot_table(index='state', columns='region', values=['count', 'income'], aggfunc={'count': 'sum', 'income': 'mean'})
df
I tried the code below, but it doesn't work. Any idea how to do this? Thanks
df[['count']]['region_sea', 'region_inland']
You can use sort_index to sort it. However, because the columns are nested, sorting in descending order reverses the order of income and count too:
df.sort_index(axis='columns', level=0, ascending=False, inplace=True)
If you don't want income/count reversed, sort only on the region level; note that count and income will then no longer share a common header group:
df.sort_index(axis='columns', level='region', ascending=False, inplace=True)
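If the goal is only to swap the inner region level while keeping count and income grouped, a reindex on that level should also work; this is a sketch reusing the question's data:

```python
import pandas as pd

df = pd.DataFrame({'state': ['WA', 'CA', 'NY', 'NY', 'CA', 'CA', 'WA'],
                   'region': ['region_sea', 'region_inland', 'region_sea', 'region_inland',
                              'region_sea', 'region_sea', 'region_inland'],
                   'count': [1, 3, 4, 6, 7, 8, 4],
                   'income': [100, 200, 300, 400, 600, 400, 300]})
df = df.pivot_table(index='state', columns='region', values=['count', 'income'],
                    aggfunc={'count': 'sum', 'income': 'mean'})

# Reorder only the inner 'region' level; the outer count/income grouping is kept
df = df.reindex(columns=['region_sea', 'region_inland'], level='region')
```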

Python Pandas Group by and add new row with number sequence

d = {'ID': ['H1', 'H1', 'H2', 'H2', 'H3', 'H3'], 'Year': ['2012', '2013', '2014', '2013', '2014', '2015'], 'Unit': [5, 10, 15, 7, 15, 20]}
df_input= pd.DataFrame(data=d)
df_input
Group by the above df_input to get the 'lag' and 'lag_u' columns. 'lag' is the row sequence number within each 'ID' and 'Year' group.
'lag_u' is simply the first Unit value within each 'ID' and 'Year' group.
Expected Output:
d = {'ID': ['H1', 'H1', 'H2', 'H2', 'H3', 'H3'], 'Year': ['2012', '2013', '2014', '2013', '2014', '2015'], 'Unit': [5, 10, 15, 7, 15, 20], 'lag': [0, 1, 2, 0, 1, 2], 'lag_u': [5, 5, 5, 7, 7, 7]}
df_output= pd.DataFrame(data=d)
df_output
If I understand correctly, you need GroupBy.cumcount together with GroupBy.transform and 'first':
g = df_input.groupby('ID')
#if need group by both columns
#g = df_input.groupby(['ID','Year'])
df_input['lag'] = g.cumcount()
df_input['lag_u'] = g['Unit'].transform('first')

How do I use pandas to add rows to a data frame based on a date column and number of days column

I would like to take a start date from one dataframe column and, using the number of days in another column, add rows to the dataframe: one new date per day.
Essentially, I am trying to turn this data frame:
df = pd.DataFrame({
'Name':['Peter', 'Peter', 'Peter', 'Peter'],
'Planned_Start':['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
'Duration':[2, 3, 5, 6],
'Hrs':[0.6, 1, 1.2, 0.3]})
...into this data frame:
df_2 = pd.DataFrame({
'Name':['Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter'],
'Date':['1/1/2019', '1/2/2019', '1/2/2019', '1/3/2019', '1/4/2019','1/10/2019', '1/15/2019', '1/16/2019'],
'Hrs':[0.6, 0.6, 1, 1, 1, 1.2, 0.3, 0.3]})
I'm new to programming in general and have tried the following:
df_2 = pd.DataFrame({
'date': pd.date_range(
start = df.Planned_Start,
end = pd.to_timedelta(df.Duration, unit='D'),
freq = 'D'
)
})
... and ...
df["date"] = df.Planned_Start + timedelta(int(df.Duration))
with no luck.
I am not entirely sure what you are trying to achieve as your df_2 looks a bit wrong from what I can see.
If you want the take the duration column as days and add this many dates to a Date column, then the below code achieves that:
You can also drop any columns you don't need with the pd.DataFrame.drop() method:
df = pd.DataFrame({
'Name':['Peter', 'Peter', 'Peter', 'Peter'],
'Planned_Start':['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
'Duration':[2, 3, 5, 6],
'Hrs':[0.6, 1, 1.2, 0.3]})
from datetime import datetime, timedelta

rows = []
for i, row in df.iterrows():
    for duration in range(row.Duration):
        newrow = row.copy()
        # pd.datetime and DataFrame.append were removed from pandas;
        # use datetime.strptime and collect rows for a single DataFrame call
        newrow['date'] = datetime.strptime(row.Planned_Start, '%m/%d/%Y') + timedelta(days=duration)
        rows.append(newrow)
df_new = pd.DataFrame(rows)
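A vectorized alternative, sketched with the question's data: repeat each row Duration times with Index.repeat, then offset the start date by a per-row day counter:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start': ['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration': [2, 3, 5, 6],
    'Hrs': [0.6, 1, 1.2, 0.3]})

# Repeat each row Duration times, then add 0..Duration-1 days to the start date
out = df.loc[df.index.repeat(df['Duration'])].copy()
out['Date'] = (pd.to_datetime(out['Planned_Start'], format='%m/%d/%Y')
               + pd.to_timedelta(out.groupby(level=0).cumcount(), unit='D'))
out = out.reset_index(drop=True)[['Name', 'Date', 'Hrs']]
```

The cumcount over the (still duplicated) original index numbers the copies of each source row 0, 1, 2, ..., which becomes the day offset.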

Creating an empty longitudinal country dataset

I would like to create an empty longitudinal country-week dataset in which every country is represented 52 times (once per week of the year) and all other variables are initially filled with 0s. It should look like this:
countries = ['Albania', 'Belgium', ... 'Zimbabwe']
df_weekly = [{'country': 'Albania', 'week': 1},
             {'country': 'Albania', 'week': 2},
             ...
             {'country': 'Albania', 'week': 52},
             ...
             {'country': 'Zimbabwe', 'week': 52}]
My question therefore: how do I get from a list of countries to such a longitudinal country-week dataset?
Turned out to be quite simple:
country_list = ['Albania', 'Belgium', 'China', 'Denmark']
country = sorted(country_list * 52)  # each country repeated 52 times
week = list(range(1, 53)) * len(country_list)  # weeks 1-52, once per country
weekly = pd.DataFrame(
    {'country': country,
     'week': week
    })
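The same frame can be built directly with pd.MultiIndex.from_product, which also makes it easy to zero-fill the remaining variables; the 'cases' and 'deaths' column names below are hypothetical placeholders for whatever variables the dataset needs:

```python
import pandas as pd

countries = ['Albania', 'Belgium', 'Zimbabwe']  # stand-in for the full country list

# One row per (country, week) pair; extra columns initialised to 0
idx = pd.MultiIndex.from_product([countries, range(1, 53)],
                                 names=['country', 'week'])
df_weekly = pd.DataFrame(0, index=idx, columns=['cases', 'deaths']).reset_index()
```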
