How to create a groupby dataframe without a multi-level index - python

I have the following pandas groupby object, and I'd like to turn the result into a new dataframe.
The following code computes the conditional probability:
bin_probs = data.groupby('season')['bin'].value_counts()/data.groupby('season')['bin'].count()
I've tried the following code, but the season is not repeated on each row of the result.
I'd like the season to be filled in on every row. How can I do that?
a = pd.DataFrame(data_5.groupby('season')['bin'].value_counts()/data_5.groupby('season')['bin'].count())

a is a DataFrame, but with a two-level index, so my interpretation is that you want a DataFrame without a multi-level index.
The index can't be reset when a name in the index is the same as a column name.
Use pandas.Series.reset_index, and set name='normalized_bin' to rename the bin column.
This would not work with the implementation in the OP, because that is a DataFrame.
It works with the following implementation, because .groupby(...).value_counts() creates a pandas.Series.
The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random # for test data
import numpy as np # for test data
# setup a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=(rows)),
'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)
# display(df.head())
bin season
0 2 summer
1 4 winter
2 1 summer
3 5 winter
4 2 spring
# groupby, normalize and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
# display(a.head(15))
season bin normalized_bin
0 fall 2 0.15600
1 fall 9 0.11600
2 fall 3 0.10800
3 fall 4 0.10400
4 fall 6 0.10000
5 fall 0 0.09600
6 fall 8 0.09600
7 fall 5 0.08400
8 fall 7 0.08000
9 fall 1 0.06000
10 spring 0 0.11524
11 spring 8 0.11524
12 spring 9 0.11524
13 spring 3 0.11152
14 spring 1 0.10037
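Because the values are conditional probabilities of bin given season, they should sum to 1 within each season. The following is a minimal sketch of that check on a small made-up frame (the name toy and its values are assumptions for illustration, not the data above):

```python
import pandas as pd

# Small made-up frame, for illustration only
toy = pd.DataFrame({
    'season': ['fall', 'fall', 'fall', 'winter', 'winter'],
    'bin': [1, 1, 2, 3, 3],
})

# Normalized counts per season, flattened into ordinary columns
flat = (toy.groupby('season')['bin']
           .value_counts(normalize=True)
           .reset_index(name='normalized_bin'))

# Each season's probabilities should add up to 1
season_totals = flat.groupby('season')['normalized_bin'].sum()
print(flat)
print(season_totals)
```

If a season's total is not 1, the normalization was probably applied over the whole column rather than per group.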
Using the OP code for a
As already noted above, use normalize=True to get normalized values
The solution in the OP creates a DataFrame, because the .groupby expression is wrapped with the DataFrame constructor, pandas.DataFrame.
To reset the index, you must first rename the bin column with pandas.DataFrame.rename, and then use pandas.DataFrame.reset_index.
a = pd.DataFrame(df.groupby('season')['bin'].value_counts()/df.groupby('season')['bin'].count()).rename(columns={'bin': 'normalized_bin'}).reset_index()
Other Resources
See Pandas unable to reset index because name exist to reset by a level.
Plotting
It is easier to plot from the multi-index Series by using pandas.Series.unstack(), and then pandas.DataFrame.plot.bar.
For side-by-side bars, set stacked=False.
Each stacked bar has a total height of 1, because this is normalized data.
import matplotlib.pyplot as plt # needed for plt.legend below
s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()
# plot a stacked bar
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')

You are looking for parameter normalize:
bin_probs = data.groupby('season')['bin'].value_counts(normalize=True)

Related

Pandas: how to adapt one dataframe to another based on the date?

I have two dataframes that hold the historical price series of two different stocks. Applying describe(), I noticed that the first stock has 1291 elements while the second has 1275. The difference arises because the two securities are listed on different stock exchanges and therefore trade on somewhat different dates. I would like to keep the two dataframes separate, but delete from the first dataframe all rows whose dates are not present in the second, so that the two dataframes match exactly for the analysis. I have read that there are functions such as merge() or join(), but I have not been able to understand how to use them (if these are the correct functions). Thank you to anyone who takes the time to answer my question.
"ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1275 and the array at index 1 has size 1291"
Thank you
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as web
from scipy import stats
import seaborn as sns
pd.options.display.min_rows= None
pd.options.display.max_rows= None
tickers = ['DISW.MI','IXJ','NRJ.PA','SGOL','VDC','VGT']
wts= [0.19,0.18,0.2,0.08,0.09,0.26]
price_data = web.get_data_yahoo(tickers,
start = '2016-01-01',
end = '2021-01-01')
price_data = price_data['Adj Close']
ret_data = price_data.pct_change()[1:]
port_ret = (ret_data * wts).sum(axis = 1)
benchmark_price = web.get_data_yahoo('ACWE.PA',
start = '2016-01-01',
end = '2021-01-01')
benchmark_ret = benchmark_price["Adj Close"].pct_change()[1:].dropna()
# From here on I get the error
sns.regplot(benchmark_ret.values,
port_ret.values)
plt.xlabel("Benchmark Returns")
plt.ylabel("Portfolio Returns")
plt.title("Portfolio Returns vs Benchmark Returns")
plt.show()
(beta, alpha) = stats.linregress(benchmark_ret.values,
port_ret.values)[0:2]
print("The portfolio beta is", round(beta, 4))
Let's consider a toy example.
df1 consists of 6 days of data and df2 consists of 5 days of data.
As I understand it, you want df1 to also have 5 days of data, with dates matching those in df2.
df1
df1 = pd.DataFrame({
'date':pd.date_range('2021-05-17', periods=6),
'px':np.random.rand(6)
})
df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
5 2021-05-22 0.127086
df2
df2 = pd.DataFrame({
'date':pd.date_range('2021-05-17', periods=5),
'px':np.random.rand(5)
})
df2
date px
0 2021-05-17 0.650976
1 2021-05-18 0.393061
2 2021-05-19 0.985700
3 2021-05-20 0.879786
4 2021-05-21 0.463206
Code
Keep only the rows of df1 whose dates also appear in df2.
df1 = df1[df1.date.isin(df2.date)]
Output df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
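In the question the dates are the index of two return Series rather than a date column. One way to apply the same idea there is pandas.Series.align with join='inner', which keeps only the dates present in both. A minimal sketch, using made-up stand-ins for port_ret and benchmark_ret (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the question's port_ret and benchmark_ret
dates = pd.date_range('2021-05-17', periods=6)
port_ret = pd.Series(np.linspace(0.01, 0.06, 6), index=dates)
# The benchmark is missing one trading day
benchmark_ret = (port_ret * 0.5).drop(dates[2])

# Keep only the dates that appear in both Series
port_aligned, bench_aligned = port_ret.align(benchmark_ret, join='inner')

print(len(port_aligned), len(bench_aligned))
```

The aligned Series have the same length and the same index, so their .values can be passed together to functions that expect equally sized arrays.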

Python dataframe interpolation - adding a new row to a dataframe

I have a dataframe to which I would like to add a new row where EVM equals a specific value (-30), and fill the other columns by linear interpolation.
Index PwrOut EVM PwrGain Vout
0 -0.760031 -58.322902 32.239969 134.331851
1 3.242575 -58.073389 32.242575 134.332376
2 7.246203 -57.138122 32.246203 134.343538
3 11.251078 -54.160870 32.251078 134.383609
4 15.257129 -48.624869 32.257129 134.487430
5 17.260618 -45.971596 32.260618 134.586753
6 18.263079 -44.319692 32.263079 134.656616
7 19.266674 -41.532695 32.266674 134.743599
8 20.271934 -37.546253 32.271934 134.849050
9 21.278990 -33.239208 32.278990 134.972439
10 22.286989 -29.221786 32.286989 135.111068
11 23.293533 -25.652448 32.293533 135.261357
For example, EVM = -30 (in the 3rd column) lies between rows 9 and 10 above. How can I insert a new row between rows 9 and 10 that has EVM = -30, and then fill the other columns (in this new row only) by linear interpolation based on where -30 falls between the EVM values in rows 9 and 10?
It would be great to be able to search for the rows that EVM = -30 lies between.
Is it possible to apply linear interpolation to some columns but nonlinear interpolation to others?
Thanks!
Interpolation is by far the easiest part. Here is one approach.
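First, find where the missing rows belong and insert them one by one:
import numpy as np
import pandas as pd
targets = (-50, -40, -30) # Arbitrary
idxs = df.EVM.searchsorted(targets) # Row positions in the original data (EVM must be sorted)
arr = df.values
# Insert from the highest position down, so the remaining positions stay valid
for idx, target in sorted(zip(idxs, targets), reverse=True):
    arr = np.insert(arr, idx, [np.nan, target, np.nan, np.nan], axis=0)
df1 = pd.DataFrame(arr, columns=df.columns)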
Then you can actually interpolate:
df2 = df1.interpolate('linear')
Output:
PwrOut EVM PwrGain Vout
0 -0.760031 -58.322902 32.239969 134.331851
1 3.242575 -58.073389 32.242575 134.332376
2 7.246203 -57.138122 32.246203 134.343538
3 11.251078 -54.160870 32.251078 134.383609
4 13.254103 -50.000000 32.254103 134.435519
5 15.257129 -48.624869 32.257129 134.487430
6 17.260618 -45.971596 32.260618 134.586753
7 18.263079 -44.319692 32.263079 134.656616
8 19.266674 -41.532695 32.266674 134.743599
9 19.769304 -40.000000 32.269304 134.796324
10 20.271934 -37.546253 32.271934 134.849050
11 21.278990 -33.239208 32.278990 134.972439
12 21.782989 -30.000000 32.282989 135.041753
13 22.286989 -29.221786 32.286989 135.111068
14 23.293533 -25.652448 32.293533 135.261357
If you want a different interpolation method for some columns, handle them individually, for example:
df2.PwrOut = df1.PwrOut.interpolate('cubic')
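Note that interpolate('linear') treats the rows as equally spaced, so each inserted row receives the midpoint of its neighbours. If the new values should instead be weighted by where the target falls between the neighbouring EVM values, one option is to interpolate against the EVM values themselves with method='index'. A minimal sketch, assuming the EVM values are unique and using only rows 9 and 10 from the question (the names by_evm and result are illustrative):

```python
import pandas as pd

# Rows 9 and 10 from the question's table
df = pd.DataFrame({
    'PwrOut': [21.278990, 22.286989],
    'EVM': [-33.239208, -29.221786],
    'PwrGain': [32.278990, 32.286989],
    'Vout': [134.972439, 135.111068],
})
target = -30.0

# Use EVM as the index and add the target value as a new, empty row
by_evm = df.set_index('EVM')
by_evm = by_evm.reindex(by_evm.index.union([target]))

# method='index' interpolates using the numeric index values (here: EVM)
result = by_evm.interpolate(method='index').rename_axis('EVM').reset_index()
print(result)
```

With this approach the new row's PwrOut lies about 80% of the way from row 9 to row 10, because -30 is about 80% of the way between the two EVM values.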

how to construct an index from percentage change time series?

Consider the values below:
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object:
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as:
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base=100)? My desired output is:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve this with an iterative (loop) solution, but that may not be practical if the data is large in depth and breadth. Secondly, is there a way to do this in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the relative change of a series in relation to its first element. So there's no need to take a detour through relative changes and rebuild the index from them when you can get it directly with
df = pd.Series(array1)/array1[0]*100
As far as I know, there is still no off-the-shelf expanding_window version for pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# compute percentage change with respect to first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64
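For the second part of the question (several columns at once), the same division works on a whole DataFrame, and an index can also be rebuilt from percentage changes with a cumulative product. A minimal sketch, where column 'a' uses the first values from the question and column 'b' is made up for illustration:

```python
import pandas as pd

prices = pd.DataFrame({
    'a': [526.59, 528.88, 536.19, 536.18],  # first values from the question
    'b': [100.0, 110.0, 99.0, 121.0],       # made-up second column
})

# Direct route: divide every column by its own first value
index_direct = prices / prices.iloc[0] * 100

# Route via percentage changes: compound the growth factors
growth = 1 + prices.pct_change()
index_from_returns = growth.fillna(1).cumprod() * 100

print(index_direct.round(2))
```

Both routes give the same result up to floating-point error; the cumulative product is mainly useful when only the percentage changes are available.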

Is there a pandas way of getting the averages between consecutive rows?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(30,3))
df.head()
which gives:
0 1 2
0 0.741955 0.913681 0.110109
1 0.079039 0.662438 0.510414
2 0.469055 0.201658 0.259958
3 0.371357 0.018394 0.485339
4 0.850254 0.808264 0.469885
Say I want to add another column holding the averages of consecutive values in column 2: between indices (0,1), (1,2), ..., (28,29).
I imagine this is a common task as column 2 are the x axis positions and I want the categorical labels on a plot to appear in the middle between the 2 points on the x axis.
So I was wondering if there is a pandas way for this:
averages = []
for index, item in enumerate(df[2]):
    if index < df[2].shape[0] - 1:
        averages.append((item + df[2].iloc[index + 1]) / 2)
df["averages"] = pd.Series(averages)
df.head()
which gives:
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333138
3 0.996487 0.272300 0.334554 0.586686
As you can see, 0.31 is the average of 0.21 and 0.42.
Thanks!
I think that you can do this with pandas.DataFrame.rolling. Using your dataframe head as an example:
df['averages'] = df[2].rolling(2).mean().shift(-1)
returns:
>>> df
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333139
3 0.996487 0.272300 0.334554 NaN
The NaN at the end is there because there is no row indexed 4; in your full dataframe, the averages would continue down to the second-to-last row (the average of the values at indices 28 and 29, i.e. your 29th and 30th values). I used the exact data you provided to show that this gives the same values as your desired output. (For future reference, if you want to provide a reproducible dataframe built from random numbers, set and show a random seed such as np.random.seed(42) before creating the df, so that we all get the same one.)
Breaking it down:
df[2] is there because you're interested in column 2.
.rolling(2) is there because you want the mean of 2 values (for the mean of 3 values, use .rolling(3), and so on).
.mean() is the function applied to each window (in your case, the mean).
.shift(-1) moves the result up one row, so that each row shows the mean of its value in column 2 and the value below it; by default it would be the mean with the value above.
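For a window of exactly two rows, the same result can also be written without rolling, by adding the column to a copy of itself shifted up by one row. A minimal sketch using the four column-2 values from the question (the names s, pairwise and rolled are illustrative):

```python
import pandas as pd

# Column 2 values from the question's example
s = pd.Series([0.211980, 0.425583, 0.331723, 0.334554])

# Average of each value and the one below it
pairwise = (s + s.shift(-1)) / 2

# The rolling version from the answer, for comparison
rolled = s.rolling(2).mean().shift(-1)

print(pd.DataFrame({'pairwise': pairwise, 'rolled': rolled}))
```

Both columns match, and both end in NaN because the last row has no following value to average with.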
This is one way, though slightly loopy. But @sacul's solution is better. I leave this here for reference only.
import pandas as pd
import numpy as np
from itertools import zip_longest
df = pd.DataFrame(np.random.rand(30, 3))
v = df.values[:, -1]
df = df.join(pd.DataFrame(np.array([np.mean([i, j], axis=0) for i, j in \
zip_longest(v, v[1:], fillvalue=v[-1])]), columns=['2_pair_avg']))
# 0 1 2 2_pair_avg
# 0 0.382656 0.228837 0.053199 0.373678
# 1 0.812690 0.255277 0.694156 0.697738
# 2 0.040521 0.211511 0.701320 0.491044
# 3 0.558739 0.697916 0.280768 0.615398
# 4 0.262771 0.912669 0.950029 0.489550
# 5 0.217489 0.405125 0.029071 0.101794
# 6 0.577929 0.933565 0.174517 0.214530
# 7 0.067030 0.452027 0.254544 0.613225
# 8 0.580869 0.556112 0.971907 0.582547
# 9 0.483528 0.951537 0.193188 0.175215
# 10 0.481141 0.589833 0.157242 0.159363
# 11 0.087057 0.823691 0.161485 0.108634
# 12 0.319516 0.161386 0.055784 0.285276
# 13 0.901529 0.365992 0.514768 0.386599
# 14 0.270118 0.454583 0.258430 0.245463
# 15 0.379739 0.299569 0.232497 0.214943
# 16 0.017621 0.182647 0.197389 0.538386
# 17 0.720688 0.147093 0.879383 0.732239
# 18 0.859594 0.538390 0.585096 0.503846
# 19 0.360718 0.571567 0.422596 0.287384
# 20 0.874800 0.391535 0.152171 0.239078
# 21 0.935150 0.379871 0.325984 0.294485
# 22 0.269607 0.891331 0.262986 0.212050
# 23 0.140976 0.414547 0.161115 0.542682
# 24 0.851434 0.059209 0.924250 0.801210
# 25 0.389025 0.774885 0.678170 0.388856
# 26 0.679247 0.982517 0.099542 0.372649
# 27 0.670354 0.279138 0.645756 0.336031
# 28 0.393414 0.970737 0.026307 0.343947
# 29 0.479611 0.349401 0.661587 0.661587

Plotting dataframes on same plot

I have two dataframes:
a.head()
AAPL SPY date
0 1000000.000000 1000000.000000 2010-01-04
1 921613.643818 969831.805642 2010-02-04
2 980649.393244 1000711.933790 2010-03-04
3 980649.393244 1000711.933790 2010-04-04
4 1232535.257461 1059090.504583 2010-05-04
and
b.head()
date test
0 2010-01-26 22:17:44 990482.664854
1 2010-03-09 22:37:17 998565.699784
2 2010-03-12 02:11:23 989957.374785
3 2010-04-05 18:01:37 994315.860439
4 2010-04-06 11:06:50 987887.723816
After I set the index for a and b (set_index('date')), I can use the pandas plot() function to create a nice plot with the date as the x-axis and the various columns as y-values. What I want to do is plot two dataframes with different indices on the same figure. As you can see from a and b, the indices are different, and I want to plot them on the same figure.
I tried merge and concat to join the dataframes together, but the resulting plot is not what I'd like because those functions insert numpy.NaN in places where the date is not the same, which makes discontinuities in my plots. I can use pd.fillna() but this is not what I'd like, since I'd rather it just connect the points together rather than drop down to 0.
Assuming you want the same time scale on the x-axis, you will need timestamps as the index for both a and b before concatenating the columns.
You can then use interpolation to fill in the missing data, optionally with ffill() as an additional operation if you want to fill forward past the last observed data point.
df = pd.concat([a, b.set_index('date')], axis=1)
df.interpolate(method='time').plot() # interpolate(method='time').ffill()
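To see what the time-based interpolation does before plotting, here is a minimal sketch on two tiny made-up frames with different dates (the values and dates are assumptions for illustration, and both frames are assumed to already have a DatetimeIndex):

```python
import pandas as pd

# Two made-up frames indexed by different timestamps
a = pd.DataFrame({'AAPL': [100.0, 110.0]},
                 index=pd.to_datetime(['2010-01-04', '2010-01-08']))
b = pd.DataFrame({'test': [50.0, 70.0]},
                 index=pd.to_datetime(['2010-01-05', '2010-01-07']))

# Outer-join the columns on the union of dates, sorted by time
combined = pd.concat([a, b], axis=1).sort_index()

# Fill the gaps in proportion to the elapsed time between known points
filled = combined.interpolate(method='time')
print(filled)
```

The AAPL gaps on 2010-01-05 and 2010-01-07 are filled with 102.5 and 107.5, so a line plot of filled connects the known points instead of breaking at the missing dates.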
