Below is the script I am working with. For practice, I've created two sets of dataframes: one set of df1, df2, and df3, and another set of dv1, dv2, and dv3. I then created two lists, test and test2, and combined them into zipped_list. Now I am trying to write a loop that does the following: 1. Set the index and create the keys '2022' and '2021'. 2. Swap the column levels so the matching columns sit next to each other. The loop works, but it only gets applied to the first dataframe. Without calling each dataframe one by one, how can I apply it to all of the dataframes in zipped_list?
import pandas as pd
#Creating a set of dataframes
data = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'logitech', 'samsung', 'lg', 'lenovo'],
'price': [1200, 150, 300, 450, 200]}
df1 = pd.DataFrame(data)
data2 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'mac', 'fujitsu', 'lg', 'asus'],
'price': [2200, 200, 300, 450, 200]}
df2 = pd.DataFrame(data2)
data3 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['microsoft', 'logitech', 'samsung', 'lg', 'asus'],
'price': [1500, 100, 200, 350, 400]}
df3 = pd.DataFrame(data3)
#Creating another set of dataframes
data = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'logitech', 'samsung', 'lg', 'lenovo'],
'price': [10, 20, 30, 40, 50]}
dv1 = pd.DataFrame(data)
data2 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['hp', 'mac', 'fujitsu', 'lg', 'asus'],
'price': [10, 20, 30, 50, 50]}
dv2 = pd.DataFrame(data2)
data3 = {'product_name': ['laptop', 'printer', 'tablet', 'desk', 'chair'],'item_name': ['microsoft', 'logitech', 'samsung', 'lg', 'asus'],
'price': [1, 2, 3, 4, 5]}
dv3 = pd.DataFrame(data3)
#Creating lists of the dataframes
test=[df1,df2,df3]
test2=[dv1,dv2,dv3]
#combining two lists
zipped = zip(test, test2)
zipped_list = list(zipped)
#Looping through the zipped_list
for x,y in zipped_list:
    z = pd.concat([zipped_list[0][0].set_index(['product_name','item_name']), zipped_list[0][1].set_index(['product_name','item_name'])],
                  axis='columns', keys=['2022', '2021'])
    z = z.swaplevel(axis='columns')[zipped_list[0][0].columns[2:]]
    print(z)
In addition to this dataframe, there should be two more.
The reason is that you only ever access the first element of zipped_list (zipped_list[0]) and never use the loop variables (x and y). You can create a new list and append each modified dataframe to it:
new_list = []
for x in zipped_list:
    z = pd.concat([x[0].set_index(['product_name','item_name']), x[1].set_index(['product_name','item_name'])],
                  axis='columns', keys=['2022', '2021'])
    z = z.swaplevel(axis='columns')[x[0].columns[2:]]
    new_list.append(z)
new_list
Output:
[ price
2022 2021
product_name item_name
laptop hp 1200 10
printer logitech 150 20
tablet samsung 300 30
desk lg 450 40
chair lenovo 200 50,
price
2022 2021
product_name item_name
laptop hp 2200 10
printer mac 200 20
tablet fujitsu 300 30
desk lg 450 50
chair asus 200 50,
price
2022 2021
product_name item_name
laptop microsoft 1500 1
printer logitech 100 2
tablet samsung 200 3
desk lg 350 4
chair asus 400 5]
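As a side note, the same logic can also be written as a list comprehension that unpacks each (x, y) pair directly, a minimal sketch equivalent to the loop above:
# Same transformation as the loop, unpacking each (x, y) pair directly
new_list = [
    pd.concat(
        [x.set_index(['product_name', 'item_name']),
         y.set_index(['product_name', 'item_name'])],
        axis='columns', keys=['2022', '2021']
    ).swaplevel(axis='columns')[x.columns[2:]]
    for x, y in zipped_list
]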
I have a large dataset (circa 200,000 rows x 30 columns) as a CSV. I need to use pandas to pre-process this data. I have included a dummy dataset below to help visualise the problem.
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
}
df = pd.DataFrame(data)
df
The goal is to have individual columns that show the probability of each outcome for a batsman & bowler. By way of an example from the dummy dataset, Tom would have a 50% chance of an outcome of '1' or 'Out'
This is calculated by:
1. Batsman column - count the total number of rows for batsman 'X';
2. Outcome column - count how many times each outcome occurs for batsman 'X';
3. Divide point 2 by point 1 to get the probability of each outcome;
4. Repeat the above to determine the Bowler probabilities.
The final dataframe from the dummy data should look similar to:
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'zero_prob_bat':[0,0.4,0.4,0.4,0,0.4,0.4,0.5,0.5],
'one_prob_bat':[0.5,0.4,0.4,0.4,0.5,0.4,0.4,0,0],
'two_prob_bat':[0,0,0,0,0,0,0,0.5,0.5],
'three_prob_bat':[0,0,0,0,0,0,0,0,0],
'four_prob_bat':[0,0.2,0.2,0.2,0,0.2,0.2,0,0],
'six_prob_bat':[0,0,0,0,0,0,0,0,0],
'out_prob_bat':[0.5,0,0,0,0.5,0,0,0,0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben'],
'zero_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.5,0.5],
'one_prob_bowl':[0.4285,0.4285,0.4285,0.4285,0.4285,0.4285,0.4285,0,0],
'two_prob_bowl':[0,0,0,0,0,0,0,0.5,0.5],
'three_prob_bowl':[0,0,0,0,0,0,0,0,0],
'four_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
'six_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
'out_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0]
}
One issue is that with my original dataset there are over 600 unique names. I could manually .groupby each unique name in the batsman/bowler columns, but this is not a scaleable solution as new names will continually be added.
I am tempted to:
.count the number of instances of each unique name for batsman/bowler;
.count the number of different outcomes for each unique batsman/bowler;
Perform a lookup to match the probability next to each batsman/bowler;
However, I am cautious about implementing a lookup function as detailed in the answer here, because my dataset will continue to grow. Lookups like this have also created numerous issues for me in the past when working with Excel/CSVs, so I do not want to fall into any similar traps.
If someone could explain how they would go about solving this problem, so that I have something to aim towards, then it would be much appreciated.
Not sure how much this scales with your actual dataset, but I find it hard to think of a better solution than using groupby on the "Batsman" column and then value_counts on the grouped "Outcome" column. Example:
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
}
df = pd.DataFrame(data)

grouped_data = df.groupby('Batsman')['Outcome'].value_counts(normalize=True)
print(grouped_data)
Output:
Batsman Outcome
Nick 1 0.4
0 0.2
4 0.2
Wide 0.2
Pete 0 0.5
2 0.5
Tom 1 0.5
Out 0.5
Name: Outcome, dtype: float64
Note that we did not need to groupby over each unique name manually, since groupby already does that for us.
The same logic can be applied to the "Bowler" column by simply replacing the "Batsman" string in the groupby call.
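If you also need those probabilities broadcast back onto every row as separate columns (as in your desired output), one way is to unstack the grouped result and merge it back on the name column. A minimal sketch, assuming column names are built from the raw outcome labels rather than the exact zero_prob_bat/one_prob_bat spellings:
# Sketch: turn the per-batsman outcome probabilities into columns and join them back.
# Column names come out as e.g. '1_prob_bat' or 'Wide_prob_bat'; rename them if you
# need the exact zero_prob_bat/one_prob_bat style from the desired output.
bat_probs = (df.groupby('Batsman')['Outcome']
               .value_counts(normalize=True)
               .unstack(fill_value=0)
               .add_suffix('_prob_bat'))
df = df.merge(bat_probs, left_on='Batsman', right_index=True, how='left')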
I think this answers your question...
import pandas as pd
import numpy as np
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
}
df = pd.DataFrame(data)
display(df)
batsman = df['Batsman'].unique()
bowler = df['Bowler'].unique()
print(sorted(batsman))
print(sorted(bowler))
final_df = pd.DataFrame()
for man in batsman:
    df1 = df[df['Batsman'] == man]
    count_man = len(df1)
    outcome = df['Outcome'].unique()
    count_outcome = len(outcome)
    batsman_prob = np.array(count_man/count_outcome)
    batsman_df = pd.DataFrame(data=[batsman_prob], columns=[man], index=['Batsman'])
    final_df = pd.concat([final_df, batsman_df], axis=1)
for man in bowler:
    df1 = df[df['Bowler'] == man]
    count_man = len(df1)
    outcome = df['Outcome'].unique()
    count_outcome = len(outcome)
    bowler_prob = np.array(count_man/count_outcome)
    bowler_df = pd.DataFrame(data=[bowler_prob], columns=[man], index=['Bowler'])
    final_df = pd.concat([final_df, bowler_df], axis=1)
display(final_df)
Here is the output:
              Tom      Nick      Pete      Bill       Ben
Batsman  0.333333  0.833333  0.333333       NaN       NaN
Bowler        NaN       NaN       NaN  1.166667  0.333333
I have a dataframe with stock returns and I would like to create a new column that contains the difference between that stock return and the return of the sector ETF it belongs to:
import pandas as pd
import numpy as np

dict0 = {'date': ['1/1/2020', '1/1/2020', '1/1/2020', '1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020', '1/2/2020', '1/2/2020',
'1/2/2020', '1/3/2020', '1/3/2020', '1/3/2020', '1/3/2020', '1/3/2020'],
'ticker': ['SPY', 'AAPL', 'AMZN', 'XLK', 'XLY', 'SPY', 'AAPL', 'AMZN', 'XLK', 'XLY', 'SPY', 'AAPL', 'AMZN', 'XLK', 'XLY'],
'returns': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5],
'sector': [np.NaN, 'Tech', 'Cons Disc', np.NaN, np.NaN, np.NaN, 'Tech', 'Cons Disc', np.NaN, np.NaN, np.NaN, 'Tech', 'Cons Disc', np.NaN, np.NaN,]}
df = pd.DataFrame(dict0)
df = df.set_index(['date', 'ticker'])
That is, for instance, for AAPL on 1/1/2020 the return is 2. Since it belongs to the Tech sector, the relevant benchmark is the ETF XLK (I have a dictionary that maps sectors to ETF tickers). The new column would then contain AAPL's return of 2 minus XLK's return on that day of 4.
I have asked a similar question in the post below, where I wanted to simply compute the difference between each stock return and a single ticker, namely SPY.
Computing excess returns
The solution presented there was this:
def func(row):
    date, asset = row.name
    return df.loc[(date, asset), 'returns'] - df.loc[(date, 'SPY'), 'returns']
dict0 = {'date': ['1/1/2020', '1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020',
'1/2/2020', '1/3/2020', '1/3/2020', '1/3/2020'],
'ticker': ['SPY', 'AAPL', 'MSFT', 'SPY', 'AAPL', 'MSFT', 'SPY', 'AAPL', 'MSFT'],
'returns': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(dict0)
df = df.set_index(['date', 'ticker'])
df['excess_returns'] = df.apply(func, axis=1)
But I haven't been able to modify it so that I can do this sector based. I appreciate any suggestions.
You are almost there:
def func(row):
    date, asset = row.name
    index = sector_to_index_mapping[row.sector]
    return df.loc[(date, asset), 'returns'] - df.loc[(date, index), 'returns']
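For completeness, here is a minimal sketch of how this could be wired up end to end. The sector_to_index_mapping dict below is a hypothetical example of the sector-to-ETF dictionary you mentioned, and rows without a sector (the ETF/benchmark rows themselves) are skipped:
import numpy as np

# Hypothetical mapping from sector name to its sector-ETF ticker
sector_to_index_mapping = {'Tech': 'XLK', 'Cons Disc': 'XLY'}

def func(row):
    date, asset = row.name
    if pd.isna(row['sector']):  # SPY/XLK/XLY rows have no sector, so no excess return
        return np.nan
    etf = sector_to_index_mapping[row['sector']]
    return df.loc[(date, asset), 'returns'] - df.loc[(date, etf), 'returns']

df['excess_returns'] = df.apply(func, axis=1)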
I would like to know how to take a start date from a dataframe column and add rows to the dataframe based on the number of days given in another column, with a new date per day.
Essentially, I am trying to turn this data frame:
df = pd.DataFrame({
'Name':['Peter', 'Peter', 'Peter', 'Peter'],
'Planned_Start':['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
'Duration':[2, 3, 5, 6],
'Hrs':[0.6, 1, 1.2, 0.3]})
...into this data frame:
df_2 = pd.DataFrame({
'Name':['Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter'],
'Date':['1/1/2019', '1/2/2019', '1/2/2019', '1/3/2019', '1/4/2019','1/10/2019', '1/15/2019', '1/16/2019'],
'Hrs':[0.6, 0.6, 1, 1, 1, 1.2, 0.3, 0.3]})
I'm new to programming in general and have tried the following:
df_2 = pd.DataFrame({
'date': pd.date_range(
start = df.Planned_Start,
end = pd.to_timedelta(df.Duration, unit='D'),
freq = 'D'
)
})
... and ...
df["date"] = df.Planned_Start + timedelta(int(df.Duration))
with no luck.
I am not entirely sure what you are trying to achieve as your df_2 looks a bit wrong from what I can see.
If you want to take the Duration column as a number of days and add that many dated rows per original row, the code below achieves that. You can also drop any columns you don't need afterwards with the .drop() method:
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame({
    'Name':['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start':['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration':[2, 3, 5, 6],
    'Hrs':[0.6, 1, 1.2, 0.3]})

new_rows = []
for i, row in df.iterrows():
    for duration in range(row.Duration):
        # one output row per day of the task, offset from the planned start
        date = pd.Series([datetime.strptime(row.Planned_Start, '%m/%d/%Y') + timedelta(days=duration)], index=['date'])
        new_rows.append(pd.concat([row, date]))
df_new = pd.DataFrame(new_rows).reset_index(drop=True)
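For larger frames, iterating row by row can be slow. Below is a vectorized sketch of the same expansion (an alternative I am adding here, not part of the answer above), using index.repeat and a per-row day counter:
# Repeat each row Duration times, then add 0..Duration-1 days to the planned start
df_new = df.loc[df.index.repeat(df['Duration'])].copy()
offsets = df_new.groupby(level=0).cumcount()  # 0, 1, ... within each original row
df_new['Date'] = pd.to_datetime(df_new['Planned_Start']) + pd.to_timedelta(offsets, unit='D')
df_new = df_new.drop(columns=['Planned_Start', 'Duration']).reset_index(drop=True)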
I have a list of dictionaries with the following keys: country, points, price. I need to get an average of points and price for each country. Here is the list:
0: {country: "US", points: 96, price: 235}
1: {country: "Spain", points: 96, price: 110}
2: {country: "US", points: 96, price: 90}
3: {country: "US", points: 96, price: 65}
And I need a list of dictionaries back with country and their averages.
I have gotten to a point where I have a list of dictionaries with the sum of price and points:
[{'country': 'Albania', 'points': 176, 'price': 40.0}, {'country': 'Argentina', 'points': 480488, 'price': 116181.0}, {'country': 'Australia', 'points': 430092, 'price': 152979.0}
Now I need to get the averages. I was thinking of creating another key holding the count for each country and then performing a basic calculation in the for loop, but I'm not sure if this is the right approach... Thanks for the help!
My code below:
count_dict = country_count.to_dict()
# Output
{'US': 62139,
'Italy': 18784,
'France': 14785,
'Spain': 8160}
# Get the sum of points and price for each country
grouped_data = wine_data.groupby('country').agg({'points':'sum', 'price':'sum'})
# Reset the index in order to convert df into a list of dictionaries
country_data = grouped_data.reset_index()
country_list = country_data.to_dict('records')
# Output
[{'country': 'Albania', 'points': 176, 'price': 40.0}, {'country': 'Argentina', 'points': 48048 etc]
Have you tried passing your data into a Pandas DataFrame and working with it there?
You can do it as follows. First, make a DataFrame:
import pandas as pd
import numpy as np
d = {
0: {'country': "US", 'points': 96, 'price': 235},
1: {'country': "Spain", 'points': 96, 'price': 110},
2: {'country': "US", 'points': 96, 'price': 90},
3: {'country': "US", 'points': 96, 'price': 65}
}
# transpose so each record becomes a row
df = pd.DataFrame(d).transpose()
Out:
country points price
0 US 96 235
1 Spain 96 110
2 US 96 90
3 US 96 65
Then group by country:
# just to make sure they are numeric
df[['points','price']] = df[['points','price']].astype('float64')
df.groupby('country').mean()
Out:
points price
country
Spain 96.0 110.0
US 96.0 130.0
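Since the original goal was a list of dictionaries with each country and its averages, the grouped result can be converted back, continuing from the code above:
# Convert the grouped means back into a list of dicts, one per country
averages = df.groupby('country').mean().reset_index().to_dict('records')
print(averages)
# [{'country': 'Spain', 'points': 96.0, 'price': 110.0},
#  {'country': 'US', 'points': 96.0, 'price': 130.0}]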