Is there a pandas way of getting the averages between consecutive rows?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(30,3))
df.head()
which gives:
          0         1         2
0  0.741955  0.913681  0.110109
1  0.079039  0.662438  0.510414
2  0.469055  0.201658  0.259958
3  0.371357  0.018394  0.485339
4  0.850254  0.808264  0.469885
Say I want to add another column that holds the averages of consecutive values in column 2: between indices (0,1), (1,2), ..., (28,29).
I imagine this is a common task, as column 2 holds the x-axis positions and I want the categorical labels on a plot to appear midway between each pair of points on the x axis.
So I was wondering if there is a pandas way to do this:
averages = []
for index, item in enumerate(df[2]):
    if index < df[2].shape[0] - 1:
        averages.append((item + df[2].iloc[index + 1]) / 2)
df["averages"] = pd.Series(averages)
df.head()
which gives:
          0         1         2  averages
0  0.997044  0.965708  0.211980  0.318781
1  0.716349  0.724811  0.425583  0.378653
2  0.729991  0.985072  0.331723  0.333138
3  0.996487  0.272300  0.334554  0.586686
As you can see, 0.318781 is the average of 0.211980 and 0.425583.
Thanks!

I think that you can do this with pandas.DataFrame.rolling. Using your dataframe head as an example:
df['averages'] = df[2].rolling(2).mean().shift(-1)
returns:
>>> df
          0         1         2  averages
0  0.997044  0.965708  0.211980  0.318781
1  0.716349  0.724811  0.425583  0.378653
2  0.729991  0.985072  0.331723  0.333139
3  0.996487  0.272300  0.334554       NaN
The NaN at the end is there because there is no row with index 4 in this 4-row head; in your full dataframe, the averages would run through the second-to-last row (the average of the values at indices 28 and 29, i.e. your 29th and 30th values). I just wanted to show that this gives the same values as your desired output, so I used the exact data you provided. (For future reference, if you want to provide a reproducible dataframe built from random numbers, set and show a random seed such as np.random.seed(42) before creating the df; that way, we'll all have the same one.)
Breaking it down: df[2] is there because you're interested in column 2; .rolling(2) is there because you want the mean of 2 values (for the mean of 3 values you'd use .rolling(3), and so on); .mean() is whatever aggregation you want (in your case, the mean); finally, .shift(-1) makes sure the new column is in the proper place (i.e., it shows the mean of each value in column 2 and the value below it, whereas the default would pair each value with the one above).
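If it helps to see why the .shift(-1) is needed: the same result can be written as plain arithmetic with .shift, which should match the rolling version exactly (a small equivalent sketch, using the df from above):
# rolling(2).mean() labels each window mean with the lower row of the window;
# shifting up by one aligns it with the upper row, which is the same as:
df['averages'] = (df[2] + df[2].shift(-1)) / 2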

This is one way, though slightly loopy. @sacul's solution is better; I leave this here for reference only.
import pandas as pd
import numpy as np
from itertools import zip_longest
df = pd.DataFrame(np.random.rand(30, 3))
v = df.values[:, -1]
df = df.join(pd.DataFrame(np.array([np.mean([i, j], axis=0) for i, j in
                                    zip_longest(v, v[1:], fillvalue=v[-1])]),
                          columns=['2_pair_avg']))
#            0         1         2  2_pair_avg
# 0   0.382656  0.228837  0.053199    0.373678
# 1   0.812690  0.255277  0.694156    0.697738
# 2   0.040521  0.211511  0.701320    0.491044
# 3   0.558739  0.697916  0.280768    0.615398
# 4   0.262771  0.912669  0.950029    0.489550
# 5   0.217489  0.405125  0.029071    0.101794
# 6   0.577929  0.933565  0.174517    0.214530
# 7   0.067030  0.452027  0.254544    0.613225
# 8   0.580869  0.556112  0.971907    0.582547
# 9   0.483528  0.951537  0.193188    0.175215
# 10  0.481141  0.589833  0.157242    0.159363
# 11  0.087057  0.823691  0.161485    0.108634
# 12  0.319516  0.161386  0.055784    0.285276
# 13  0.901529  0.365992  0.514768    0.386599
# 14  0.270118  0.454583  0.258430    0.245463
# 15  0.379739  0.299569  0.232497    0.214943
# 16  0.017621  0.182647  0.197389    0.538386
# 17  0.720688  0.147093  0.879383    0.732239
# 18  0.859594  0.538390  0.585096    0.503846
# 19  0.360718  0.571567  0.422596    0.287384
# 20  0.874800  0.391535  0.152171    0.239078
# 21  0.935150  0.379871  0.325984    0.294485
# 22  0.269607  0.891331  0.262986    0.212050
# 23  0.140976  0.414547  0.161115    0.542682
# 24  0.851434  0.059209  0.924250    0.801210
# 25  0.389025  0.774885  0.678170    0.388856
# 26  0.679247  0.982517  0.099542    0.372649
# 27  0.670354  0.279138  0.645756    0.336031
# 28  0.393414  0.970737  0.026307    0.343947
# 29  0.479611  0.349401  0.661587    0.661587

Related

Python dataframe interpolation - adding a new row to a dataframe

I have a dataframe to which I would like to add a new row where EVM equals a specific value (-30), updating the other columns with linear interpolation.
Index     PwrOut        EVM    PwrGain        Vout
0      -0.760031 -58.322902  32.239969  134.331851
1       3.242575 -58.073389  32.242575  134.332376
2       7.246203 -57.138122  32.246203  134.343538
3      11.251078 -54.160870  32.251078  134.383609
4      15.257129 -48.624869  32.257129  134.487430
5      17.260618 -45.971596  32.260618  134.586753
6      18.263079 -44.319692  32.263079  134.656616
7      19.266674 -41.532695  32.266674  134.743599
8      20.271934 -37.546253  32.271934  134.849050
9      21.278990 -33.239208  32.278990  134.972439
10     22.286989 -29.221786  32.286989  135.111068
11     23.293533 -25.652448  32.293533  135.261357
For example, EVM = -30 lies between rows 9 and 10 above. How can I include a new row (between rows 9 and 10) that has EVM = -30, and then fill the other columns (in this new row only) with linear interpolation based on where -30 falls between the EVM values in rows 9 and 10?
It would be great to be able to search and find the rows that EVM = -30 lies between.
Is it possible to apply linear interpolation to some columns but nonlinear interpolation to others?
Thanks!
Interpolation is by far the easiest part. Here is one approach.
First, find the missing rows and add them one by one:
targets = (-50, -40, -30)  # Arbitrary
idxs = df.EVM.searchsorted(targets)  # Find the rows location
arr = df.values
for idx, target in zip(idxs, targets):
    arr = np.insert(arr, idx, [np.nan, target, np.nan, np.nan], axis=0)
# each insertion shifts the later positions, so restore EVM order afterwards
df1 = pd.DataFrame(arr, columns=df.columns).sort_values('EVM')
Then you can actually interpolate:
df2 = df1.interpolate('linear')
Output:
       PwrOut        EVM    PwrGain        Vout
0   -0.760031 -58.322902  32.239969  134.331851
1    3.242575 -58.073389  32.242575  134.332376
2    7.246203 -57.138122  32.246203  134.343538
3   11.251078 -54.160870  32.251078  134.383609
4   13.254103 -50.000000  32.254103  134.435519
5   15.257129 -48.624869  32.257129  134.487430
6   17.260618 -45.971596  32.260618  134.586753
7   18.263079 -44.319692  32.263079  134.656616
9   19.266674 -41.532695  32.266674  134.743599
8   19.769304 -40.000000  32.269304  134.796324
11  20.271934 -37.546253  32.271934  134.849050
12  21.278990 -33.239208  32.278990  134.972439
10  21.782989 -30.000000  32.282989  135.041753
13  22.286989 -29.221786  32.286989  135.111068
14  23.293533 -25.652448  32.293533  135.261357
If you want custom interpolation methods for particular columns, go column by column, e.g.:
df2.PwrOut = df1.PwrOut.interpolate('cubic')
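Note that interpolate('linear') on a default RangeIndex treats the rows as equally spaced, so each inserted row simply gets the midpoint of its neighbours. If you instead want the new values weighted by where the target EVM falls between the neighbouring EVM values (as the question asks), one possible sketch is to interpolate on the EVM values themselves via method='index' (this assumes EVM is strictly increasing, as it is after the sort above):
df3 = df1.set_index('EVM').interpolate(method='index').reset_index()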

How to create a groupby dataframe without a multi-level index

I have the following pandas groupby object, and I'd like to turn the result into a new dataframe.
Following, is the code to get the conditional probability:
bin_probs = data.groupby('season')['bin'].value_counts()/data.groupby('season')['bin'].count()
I've tried the following code, but it returns the result shown below.
I'd like the season to fill in each row. How can I do that?
a = pd.DataFrame(data_5.groupby('season')['bin'].value_counts()/data_5.groupby('season')['bin'].count())
a is a DataFrame, but with a 2-level index, so my interpretation is that you want a dataframe without a multi-level index.
The index can't be reset when a name in the index and a column name are the same.
Use pandas.Series.reset_index, and set name='normalized_bin' to rename the bin column.
This would not work with the implementation in the OP, because that is a dataframe.
It works with the following implementation, because .groupby(...)['bin'].value_counts() creates a pandas.Series.
The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random  # for test data
import numpy as np  # for test data

# set up a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=rows),
        'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)
# display(df.head())
   bin  season
0    2  summer
1    4  winter
2    1  summer
3    5  winter
4    2  spring
# groupby, normalize, and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
# display(a.head(15))
    season  bin  normalized_bin
0     fall    2         0.15600
1     fall    9         0.11600
2     fall    3         0.10800
3     fall    4         0.10400
4     fall    6         0.10000
5     fall    0         0.09600
6     fall    8         0.09600
7     fall    5         0.08400
8     fall    7         0.08000
9     fall    1         0.06000
10  spring    0         0.11524
11  spring    8         0.11524
12  spring    9         0.11524
13  spring    3         0.11152
14  spring    1         0.10037
Using the OP code for a
As already noted above, use normalize=True to get normalized values
The solution in the OP creates a DataFrame, because the .groupby result is wrapped in the DataFrame constructor, pandas.DataFrame.
To reset the index, you must first rename the bin column with pandas.DataFrame.rename, and then use pandas.DataFrame.reset_index:
a = (pd.DataFrame(df.groupby('season')['bin'].value_counts()/df.groupby('season')['bin'].count())
       .rename(columns={'bin': 'normalized_bin'})
       .reset_index())
Other Resources
See the question "Pandas unable to reset index because name exist" for how to reset by a level.
Plotting
It is easier to plot from the multi-index Series by using pandas.Series.unstack, and then pandas.DataFrame.plot.bar.
For side-by-side bars, set stacked=False.
Each stacked bar sums to 1, because the data is normalized.
import matplotlib.pyplot as plt

s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()
# plot a stacked bar
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')
You are looking for parameter normalize:
bin_probs = data.groupby('season')['bin'].value_counts(normalize=True)
Read more about it here:

Pandas: How to round the values which are closest to the whole number for the values more than 1?

Following is the dataframe. I would like to round the values in 'period' which are closest to a whole number. For example: 1.005479452 rounds to 1.0000, 2.002739726 to 2.0000, 3.002739726 to 3.0000, 5.005479452 to 5.0000, 12.01369863 to 12.0000, and so on. I have a big list. I'm doing this because later in the program I have to concatenate this dataframe with other dataframes on the 'period' column.
df =
      period          rate
 0.931506849     -0.001469
 0.994520548      0.008677
 1.005479452    0.11741125
 1.008219178      0.073975
 1.010958904   0.147474833
 1.994520548  -0.007189219
 2.002739726     0.1160815
 2.005479452       0.06995
 2.008219178      0.026808
 2.010958904     0.1200695
 2.980821918  -0.007745727
 3.002739726   0.192208333
 3.010958904   0.119895833
 3.019178082   0.151857267
 3.021917808      0.016165
 3.863013699   0.005405321
 4                 0.06815
 4.002739726     0.1240695
 4.016438356     0.2410323
 4.019178082     0.0459375
 4.021917808       0.03161
 4.997260274        0.0682
 5.005479452     0.1249955
 5.01369863     0.03260875
 5.016438356   0.238069083
 5.019178082    0.04590625
 5.021917808     0.0120625
12.01369863       0.136991
12.01643836    0.053327917
12.01917808      0.2309365
I am trying to do something like the below, but couldn't get any further:
df['period'] = np.where(df.period>1, df.period.round(), df.period.round(decimals = 4))
You can apply a lambda function. This one checks if the value is greater than one before rounding it to a whole number, and otherwise rounds to 4 decimal places. I think that's what you seem to want?
df['period'] = df['period'].apply(lambda x: round(x, 0) if x > 1 else round(x, 4))
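If you only want to snap the single value that is closest to each whole number, rather than rounding everything above 1, here is a sketch of one possible reading of the requirement (it reproduces the examples in the question; the bucketing by np.floor is my assumption):
import numpy as np

p = df['period']
frac = p - np.floor(p)                                    # distance above the whole number
mask = p >= 1                                             # leave values below 1 alone
snap = frac[mask].groupby(np.floor(p[mask])).idxmin()     # row just above each whole number
df['period'] = p.round(4)
df.loc[snap, 'period'] = df.loc[snap, 'period'].round()   # snap those rows to the whole number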
I built a function that basically iterates from 1 to whatever the max whole value should be in the dataframe. This should be faster than a solution that just iterates row-by-row, though it does assume that the dataframe is sorted (like in your example).
import pandas as pd
df = pd.DataFrame(
    {
        "period": [0.931506849, 0.994520548, 1.005479452, 1.008219178, 1.010958904, 1.994520548, 2.002739726, 2.005479452, 2.008219178, 2.010958904, 2.980821918, 3.002739726, 3.010958904, 3.019178082, 3.021917808, 3.863013699, 4, 4.002739726, 4.016438356, 4.019178082, 4.021917808, 4.997260274, 5.005479452, 5.01369863, 5.016438356, 5.019178082, 5.021917808, 12.01369863, 12.01643836, 12.01917808]
    }
)
print(df.head())
"""
period
0 0.931507
1 0.994521
2 1.005479
3 1.008219
4 1.010959
"""
def process_df(df: pd.DataFrame) -> pd.DataFrame:
    df_range_vals = [round(period) for period in df['period'].tolist()]
    out_df = df.loc[df['period'] < 1]
    for base in range(1, max(df_range_vals) + 1):
        # only keep the rows in the current whole-number range
        temp_df = df.loc[(df['period'] >= base) & (df['period'] < base + 1)].copy()
        # if there's nothing to change, then just skip
        if temp_df.empty:
            continue
        # round the first (smallest) value in the range to a whole number
        temp_df.loc[temp_df.first_valid_index(), 'period'] = temp_df.loc[temp_df.first_valid_index(), 'period'].round(0)
        # DataFrame.append was removed in pandas 2.0, so use pd.concat
        out_df = pd.concat([out_df, temp_df], ignore_index=True)
    return out_df
df = process_df(df)
print(df.head())
"""
period
0 0.931507
1 0.994521
2 1.000000
3 1.008219
4 1.010959
"""
Try:
# Sort so that we know which value is closest to each whole number
df = df.sort_values(by=['period'])
# Create a new column with everything rounded; this is done to
# partition the values effectively
df['round_period'] = df['period'].round()
df_of_values_close_to_whole_number = list(df.groupby('round_period').tail(1)['period'])

def round_func(x, df_of_val_close_to_whole_number):
    return '{:.5f}'.format(round(x)) if x in df_of_val_close_to_whole_number and x > 1 else x

# Apply rounding only to the values closest to a whole number.
df['period'].apply(round_func, args=(df_of_values_close_to_whole_number,))
Output
0      0.931507
1      0.994521
2       1.00548
3       1.00822
4       1.00000
5       1.99452
6       2.00274
7       2.00548
8       2.00822
9       2.00000
10      2.98082
11      3.00274
12      3.01096
13      3.01918
14      3.00000
15      3.86301
16            4
17      4.00274
18      4.01644
19      4.01918
20      4.00000
21      4.99726
22      5.00548
23       5.0137
24      5.01644
25      5.01918
26      5.00000
27      12.0137
28      12.0164
29     12.00000
Name: period, dtype: object

Calculate row-wise dot products based on previous row and next row in pandas

I have a pandas dataframe like below:
         Coordinate
1   (1150.0,1760.0)
28  (1260.0,1910.0)
6   (1030.0,2070.0)
12  (1170.0,2300.0)
9    (790.0,2260.0)
5    (750.0,2030.0)
26   (490.0,2130.0)
29   (360.0,1980.0)
3     (40.0,2090.0)
2    (630.0,1660.0)
20   (590.0,1390.0)
Now, I want to create a new column 'dotProduct' by applying the formula
np.dot((b - a), (b - c)), where b is the coordinate (1260.0, 1910.0) at index 28, a is the previous row's coordinate, and c is the next row's coordinate (i.e. (1030.0, 2070.0) at index 6). The calculated product is for the second row. So, in a way, I have to get the previous row's value and the next row's value for each row, and calculate this over the entire 'Coordinate' column. I am quite new to pandas, hence still on the learning path. Please guide me a bit.
Thanks a lot for the help.
I assume that your 'Coordinate' column elements are already tuples of float values.
# Convert elements of 'Coordinate' into numpy array
df.Coordinate = df.Coordinate.apply(np.array)
# Subtract +/- 1 shifted values from original 'Coordinate'
a = df.Coordinate - df.Coordinate.shift(1)
b = df.Coordinate - df.Coordinate.shift(-1)
# take row-wise dot product based on the arrays a, b
df['dotProduct'] = [np.dot(x, y) for x, y in zip(a, b)]
# make 'Coordinate' tuple again (if you want)
df.Coordinate = df.Coordinate.apply(tuple)
Now I get this as df:
          Coordinate  dotProduct
1   (1150.0, 1760.0)         NaN
28  (1260.0, 1910.0)      1300.0
6   (1030.0, 2070.0)     -4600.0
12  (1170.0, 2300.0)     62400.0
9    (790.0, 2260.0)    -24400.0
5    (750.0, 2030.0)     12600.0
26   (490.0, 2130.0)    -18800.0
29   (360.0, 1980.0)    -25100.0
3     (40.0, 2090.0)    236100.0
2    (630.0, 1660.0)    -92500.0
20   (590.0, 1390.0)         NaN
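If the frame is large and the list comprehension becomes a bottleneck, the same row-wise dot products can be computed fully vectorized; a sketch, assuming 'Coordinate' again holds the (x, y) tuples:
coords = np.array(df.Coordinate.tolist())      # shape (n, 2)
ba = coords[1:-1] - coords[:-2]                # b - a (each row minus the previous one)
bc = coords[1:-1] - coords[2:]                 # b - c (each row minus the next one)
df['dotProduct'] = np.concatenate(([np.nan], np.einsum('ij,ij->i', ba, bc), [np.nan]))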

how to construct an index from percentage change time series?

consider the values below
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base=100)? My desired output is:
0     100.00
1     100.43
2     101.82
3     101.82
4     101.82
5     101.43
6     102.19
7     101.68
8     101.07
9     101.02
10    101.01
11    101.01
12    100.88
13    100.54
14     99.95
15     99.45
I can achieve this with an iterative (loop) solution, but that may not be practical if the data is large in depth and breadth. Secondly, is there a way to do this in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the change of a series relative to its first element. So there's no need to take a detour through the relative changes and recalculate the index from them when you can get it directly by
df = pd.Series(array1)/array1[0]*100
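The same one-liner extends to several columns at once, since dividing a DataFrame by its first row broadcasts column-wise; a sketch with a hypothetical second column:
df2 = pd.DataFrame({'a': array1, 'b': array1[::-1]})  # hypothetical multi-column data
indexed = df2.div(df2.iloc[0]).mul(100)               # each column rebased to 100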
As far as I know, there is still no off-the-shelf expanding_window version for pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# compute percentage change with respect to the first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0     100.000000
1     100.434873
2     101.823050
3     101.821151
4     101.821151
5     101.433753
6     102.193357
7     101.680624
8     101.067244
9     101.015971
10    101.006476
11    101.006476
12    100.881141
13    100.535521
14     99.946828
15     99.445489
dtype: float64
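And if you do want to build the index from the relative changes you already computed (df = 1 + df.pct_change()), a cumulative product gets you back; the first element is NaN after pct_change, so fill it with 1 to anchor the base (this matches the direct rebase above up to floating-point error):
rel = 1 + pd.Series(array1).pct_change()
index_series = 100 * rel.fillna(1).cumprod()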
