I have a large pandas dataframe. I want to average the first 12 rows, then the next 12 rows, and so on. I wrote a for loop for this task:
df_list = []
for i in range(0, len(df), 12):
    print(i, i + 12)
    df_list.append(df.iloc[i:i+12].mean())
pd.concat(df_list, axis=1).T
Is there an efficient way to do this without a for loop?
You can integer-divide the index by N (12 in your case), then group the dataframe by the quotient, and finally call mean on these groups:
# Random dataframe of shape (120, 4)
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(10, 100, (120, 4)), columns=list('ABCD'))
>>> df.groupby(df.index // 12).mean()
A B C D
0 49.416667 52.583333 63.833333 47.833333
1 60.166667 61.666667 53.750000 34.583333
2 49.916667 54.500000 50.583333 64.750000
3 51.333333 51.333333 56.333333 60.916667
4 51.250000 51.166667 50.750000 50.333333
5 56.333333 50.916667 51.416667 59.750000
6 53.750000 57.000000 45.916667 59.250000
7 48.583333 59.750000 49.250000 50.750000
8 53.750000 48.750000 51.583333 68.000000
9 54.916667 48.916667 57.833333 43.333333
I believe you want to split your dataframe into separate chunks of 12 rows. Then you can use np.arange inside groupby to take the mean of each separate chunk:
df.groupby(np.arange(len(df)) // 12).mean()
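Note that the np.arange version keys the groups to row position rather than to index labels, so it still produces blocks of 12 when the index is not the default RangeIndex. A minimal sketch (the datetime-indexed frame here is a made-up example):
import numpy as np
import pandas as pd

df_dt = pd.DataFrame({'A': range(24)}, index=pd.date_range('2023-01-01', periods=24))
# df_dt.index // 12 would fail on a DatetimeIndex, but position-based keys still work:
df_dt.groupby(np.arange(len(df_dt)) // 12).mean()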
I have a dataframe to which I would like to add a new row where EVM equals a specific value (-30), updating the other columns with linear interpolation.
Index PwrOut EVM PwrGain Vout
0 -0.760031 -58.322902 32.239969 134.331851
1 3.242575 -58.073389 32.242575 134.332376
2 7.246203 -57.138122 32.246203 134.343538
3 11.251078 -54.160870 32.251078 134.383609
4 15.257129 -48.624869 32.257129 134.487430
5 17.260618 -45.971596 32.260618 134.586753
6 18.263079 -44.319692 32.263079 134.656616
7 19.266674 -41.532695 32.266674 134.743599
8 20.271934 -37.546253 32.271934 134.849050
9 21.278990 -33.239208 32.278990 134.972439
10 22.286989 -29.221786 32.286989 135.111068
11 23.293533 -25.652448 32.293533 135.261357
For example, EVM = -30 lies between rows 9 and 10 above. How can I insert a new row (between rows 9 and 10) that has EVM = -30 and fill the other columns (in this new row only) with linear interpolation based on where -30 falls between the EVM values of rows 9 and 10?
It would be great to be able to search and find the rows that EVM = -30 lies between.
Is it possible to apply linear interpolation to some columns but nonlinear interpolation to others?
Thanks!
Interpolation is by far the easiest part. Here is one approach.
First, find where the missing rows belong and insert them one by one (note that searchsorted assumes the EVM column is sorted ascending, which it is here):
import numpy as np
import pandas as pd

targets = (-50, -40, -30)  # Arbitrary, in ascending order
idxs = df.EVM.searchsorted(targets)  # Positions in the original frame
arr = df.values
# Each earlier insert shifts later positions by one, hence the + i offset
for i, (idx, target) in enumerate(zip(idxs, targets)):
    arr = np.insert(arr, idx + i, [np.nan, target, np.nan, np.nan], axis=0)
df1 = pd.DataFrame(arr, columns=df.columns)
Then you can actually interpolate:
df2 = df1.interpolate('linear')
Output:
PwrOut EVM PwrGain Vout
0 -0.760031 -58.322902 32.239969 134.331851
1 3.242575 -58.073389 32.242575 134.332376
2 7.246203 -57.138122 32.246203 134.343538
3 11.251078 -54.160870 32.251078 134.383609
4 13.254103 -50.000000 32.254103 134.435519
5 15.257129 -48.624869 32.257129 134.487430
6 17.260618 -45.971596 32.260618 134.586753
7 18.263079 -44.319692 32.263079 134.656616
8 19.266674 -41.532695 32.266674 134.743599
9 19.769304 -40.000000 32.269304 134.796324
10 20.271934 -37.546253 32.271934 134.849050
11 21.278990 -33.239208 32.278990 134.972439
12 21.782989 -30.000000 32.282989 135.041753
13 22.286989 -29.221786 32.286989 135.111068
14 23.293533 -25.652448 32.293533 135.261357
If you want custom interpolation methods for particular columns, interpolate them individually, e.g.:
df2.PwrOut = df1.PwrOut.interpolate('cubic')
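More generally, you can loop over a per-column mapping of methods. A small sketch (the method choices here are assumptions for illustration, and 'cubic' requires SciPy to be installed):
methods = {'PwrOut': 'cubic', 'PwrGain': 'linear', 'Vout': 'linear'}  # hypothetical mapping
df2 = df1.copy()
for col, method in methods.items():
    df2[col] = df1[col].interpolate(method=method)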
Consider the values below:
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44, 532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base = 100)? My desired output should be:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve the objective through an iterative (loop) solution, but that may not be practical if the data's depth and breadth are large. Secondly, is there a way I can get this done in a single step on multiple columns? Thank you all for any guidance.
An index (base = 100) is the relative change of a series in relation to its first element. So there's no need to take a detour through period-to-period changes and recalculate the index from them when you can get it directly:
df = pd.Series(array1)/array1[0]*100
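The same one-liner extends to a whole DataFrame, which also answers the multiple-columns part of the question. A minimal sketch (the second column is made up for illustration):
df2 = pd.DataFrame({'A': array1, 'B': array1[::-1]})  # hypothetical two-column frame
rebased = df2.div(df2.iloc[0]).mul(100)  # each column rebased to 100 at its first row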
As far as I know, there is still no off-the-shelf expanding-window version of pct_change(). You can avoid the for loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# compute percentage change with respect to the first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64
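For what it's worth, the apply is not really needed here: since ((x / x0) - 1) * 100 + 100 is just (x / x0) * 100, plain vectorized arithmetic gives the same output:
(series / series.iloc[0]) * 100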
I have 2 time series.
df=pd.DataFrame([
['1/10/12',10],
['1/11/12',11],
['1/12/12',13],
['1/14/12',12],
],
columns=['Time','n'])
df.index=pd.to_datetime(df['Time'])
df1=pd.DataFrame([
['1/13/12',88],
],columns=['Time','n']
)
df1.index=pd.to_datetime(df1['Time'])
I am trying to align the time series so the index is in order. I am guessing reindex_like is what I need, but I'm not sure how to use it.
Here is my desired output
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Here is what you need:
df.append(df1).sort_index().reset_index(drop=True)
If you need to combine more pieces, it is more efficient to call pd.concat once with all your dataframes in a list, as sketched below. (Note that DataFrame.append was removed in pandas 2.0, so on current pandas the pd.concat form is the one to use.)
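A minimal sketch, assuming a hypothetical third frame df2:
df2 = pd.DataFrame([['1/15/12', 7]], columns=['Time', 'n'])  # made-up extra piece
combined = pd.concat([df, df1, df2]).sort_values('Time').reset_index(drop=True)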
P.S. Your code is a bit redundant: you don't need to cast Time into the index if you don't need it there. You can sort values based on any column, like this:
import pandas as pd
df=pd.DataFrame([
['1/10/12',10],
['1/11/12',11],
['1/12/12',13],
['1/14/12',12],
],
columns=['Time','n'])
df1=pd.DataFrame([
['1/13/12',88],
],columns=['Time','n']
)
df.append(df1).sort_values('Time')
You can use concat, sort_index and reset_index:
df = pd.concat([df,df1]).sort_index().reset_index(drop=True)
print(df)
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Or you can use merge_ordered (called ordered_merge in older pandas):
print(pd.merge_ordered(df, df1))
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the percentage change of the 'questions' column and then to sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
questions seconds change
0 9005591 751479 NaN
1 9207129 751539 0.022379
2 9208994 751599 0.000203
6 9214916 751839 0.000177
4 9211944 751719 0.000164
3 9210429 751659 0.000156
5 9213287 751779 0.000146
7 9215924 751899 0.000109
9 9217533 752019 0.000093
8 9216676 751959 0.000082
pct_change can be performed on Series as well as DataFrames and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first...
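For instance, a quick sketch of the periods argument (the 3-row window is an arbitrary choice):
status['change_3'] = status['questions'].pct_change(periods=3)  # change over 3 rows instead of 1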
I have an enormous timeseries of functions stored in a pandas dataframe in an HDF5 store and I want to make plots of a certain transform of every function in the timeseries. Since the number of plots is so large, and plotting them takes so long, I've used fork() and numpy.array_split() to break the indices up and run several plots in parallel.
Doing things this way means that every process has a copy of the whole timeseries. Since the total amount of memory I use is what limits how many processes I can run, I would like each process to store only its own chunk of the dataframe.
How can I split up a pandas dataframe?
np.array_split works pretty well for this use case.
In [40]: df = DataFrame(np.random.randn(5,10))
In [41]: df
Out[41]:
0 1 2 3 4 5 6 7 8 9
0 -1.998163 -1.973708 0.461369 -0.575661 0.862534 -1.326168 1.164199 -1.004121 1.236323 -0.339586
1 -0.591188 -0.162782 0.043923 0.101241 0.120330 -1.201497 -0.108959 -0.033221 0.145400 -0.324831
2 0.114842 0.200597 2.792904 0.769636 -0.698700 -0.544161 0.838117 -0.013527 -0.623317 -1.461193
3 1.309628 -0.444961 0.323008 -1.409978 -0.697961 0.132321 -2.851494 1.233421 -1.540319 1.107052
4 0.436368 0.627954 -0.942830 0.448113 -0.030464 0.764961 -0.241905 -0.620992 1.238171 -0.127617
Just pretty-printing below, since you get back a list of 3 DataFrames here:
In [43]: for dfs in np.array_split(df,3,axis=1):
   ....:     print(dfs, "\n")
   ....:
0 1 2 3
0 -1.998163 -1.973708 0.461369 -0.575661
1 -0.591188 -0.162782 0.043923 0.101241
2 0.114842 0.200597 2.792904 0.769636
3 1.309628 -0.444961 0.323008 -1.409978
4 0.436368 0.627954 -0.942830 0.448113
4 5 6
0 0.862534 -1.326168 1.164199
1 0.120330 -1.201497 -0.108959
2 -0.698700 -0.544161 0.838117
3 -0.697961 0.132321 -2.851494
4 -0.030464 0.764961 -0.241905
7 8 9
0 -1.004121 1.236323 -0.339586
1 -0.033221 0.145400 -0.324831
2 -0.013527 -0.623317 -1.461193
3 1.233421 -1.540319 1.107052
4 -0.620992 1.238171 -0.127617
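Since the question splits rows across worker processes, note that the row-wise split is just the default axis. A minimal sketch (the chunk count of 4 is arbitrary):
chunks = np.array_split(df, 4)  # axis=0 is the default: a list of row blocks
print([len(c) for c in chunks])  # [2, 1, 1, 1] for the 5-row frame above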