I have 2 time series.
df=pd.DataFrame([
['1/10/12',10],
['1/11/12',11],
['1/12/12',13],
['1/14/12',12],
],
columns=['Time','n'])
df.index=pd.to_datetime(df['Time'])
df1=pd.DataFrame([
['1/13/12',88],
],columns=['Time','n']
)
df1.index=pd.to_datetime(df1['Time'])
I am trying to align the time series so the index is in order. I am guessing reindex_like is what I need but not sure how to use it.
Here is my desired output
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Here is what you need:
df.append(df1).sort_index().reset_index(drop=True)
If you need to compile more pieces together, it is more efficient to use pd.concat(<names of all your dataframes as a list>).
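For example, with the two frames defined above (note that DataFrame.append was removed in pandas 2.0, so pd.concat is the current spelling even when you only have one extra piece):
import pandas as pd
# combine any number of frames in one call, then order by the datetime index
combined = pd.concat([df, df1]).sort_index().reset_index(drop=True)
print(combined)
#       Time   n
# 0  1/10/12  10
# 1  1/11/12  11
# 2  1/12/12  13
# 3  1/13/12  88
# 4  1/14/12  12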
P.S. Your code is a bit redundant: you don't need to cast Time into the index if you don't need it there. You can sort values based on any column, like this:
import pandas as pd
df=pd.DataFrame([
['1/10/12',10],
['1/11/12',11],
['1/12/12',13],
['1/14/12',12],
],
columns=['Time','n'])
df1=pd.DataFrame([
['1/13/12',88],
],columns=['Time','n']
)
df.append(df1).sort_values('Time')
You can use concat, sort_index and reset_index:
df = pd.concat([df,df1]).sort_index().reset_index(drop=True)
print(df)
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Or you can use merge_ordered (called ordered_merge in older pandas versions):
print(pd.merge_ordered(df, df1))
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Related
Using Python with pandas to export data from a database to csv. The data looks like this when exported (around 100 logs/day, so this is purely for visualisation purposes):
time                 Buf1  Buf2
12/12/2022 19:15:56    12     3
12/12/2022 18:00:30     5    18
11/12/2022 15:15:08    12     3
11/12/2022 15:15:08    10     9
Now I only export the "raw" data to a csv, but I need to generate a min, max, and avg value for each day. What's the best way to do that? I've been trying some min() / max() functions, but the problem is that I have multiple days in these csv files. I have also tried manipulating the data in Python itself, but I'm worried I'll miss some rows and the data will no longer be correct.
I would like to end up with something like this:
time        buf1_max  buf_min
12/12/2022        12        3
12/12/2022        12       10
Here you go, step by step.
In [27]: df['time'] = df['time'].astype("datetime64").dt.date
In [28]: df
Out[28]:
time Buf1 Buf2
0 2022-12-12 12 3
1 2022-12-12 5 18
2 2022-11-12 12 3
3 2022-11-12 10 9
In [29]: df = df.set_index("time")
In [30]: df
Out[30]:
Buf1 Buf2
time
2022-12-12 12 3
2022-12-12 5 18
2022-11-12 12 3
2022-11-12 10 9
In [31]: df.groupby(df.index).agg(['min', 'max', 'mean'])
Out[31]:
Buf1 Buf2
min max mean min max mean
time
2022-11-12 10 12 11.0 3 9 6.0
2022-12-12 5 12 8.5 3 18 10.5
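If you want flat headers like buf1_max from the desired output instead of the two-level columns, one possible extra step (a sketch, continuing from the session above) is to join the column levels after aggregating:
daily = df.groupby(df.index).agg(['min', 'max', 'mean'])
# collapse the MultiIndex columns, e.g. ('Buf1', 'max') -> 'Buf1_max'
daily.columns = ['_'.join(col) for col in daily.columns]
daily = daily.reset_index()
print(daily)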
Another approach is to use pivot_table to simplify grouping the data (keep in mind to convert the 'time' column to datetime64 format as suggested above):
import pandas as pd
import numpy as np
df.pivot_table(
index='time',
values=['Buf1', 'Buf2'],
aggfunc={'Buf1':[min, max, np.mean], 'Buf2':[min, max, np.mean]}
)
You can add any aggfunc as you wish.
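Another hedged option, if you want the flat buf1_max / buf1_min style headers from the desired output in a single step, is named aggregation with groupby (a sketch, assuming 'time' is still a regular column already reduced to dates):
out = df.groupby('time').agg(
    buf1_min=('Buf1', 'min'),
    buf1_max=('Buf1', 'max'),
    buf1_mean=('Buf1', 'mean'),
    buf2_min=('Buf2', 'min'),
    buf2_max=('Buf2', 'max'),
    buf2_mean=('Buf2', 'mean'),
).reset_index()
print(out)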
I have a large pandas dataframe. I want to average the first 12 rows, then the next 12 rows, and so on. I wrote a for loop for this task:
df_list=[]
for i in range(0,len(df),12):
print(i,i+12)
df_list.append(df.iloc[i:i+12].mean())
pd.concat(df_list, axis=1).T
Is there an efficient way to do this without a for loop?
You can floor-divide the index by N (12 in your case), then group the dataframe by the quotient, and finally call mean on these groups:
# Random dataframe of shape 120,4
>>> df=pd.DataFrame(np.random.randint(10,100,(120,4)), columns=list('ABCD'))
>>> df.groupby(df.index//12).mean()
A B C D
0 49.416667 52.583333 63.833333 47.833333
1 60.166667 61.666667 53.750000 34.583333
2 49.916667 54.500000 50.583333 64.750000
3 51.333333 51.333333 56.333333 60.916667
4 51.250000 51.166667 50.750000 50.333333
5 56.333333 50.916667 51.416667 59.750000
6 53.750000 57.000000 45.916667 59.250000
7 48.583333 59.750000 49.250000 50.750000
8 53.750000 48.750000 51.583333 68.000000
9 54.916667 48.916667 57.833333 43.333333
I believe you want to split your dataframe into separate chunks of 12 rows. Then you can use np.arange inside groupby to take the mean of each separate chunk:
import numpy as np
df.groupby(np.arange(len(df)) // 12).mean()
Consider the values below:
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base=100)? My desired output should be:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve the objective through an iterative (loop) solution, but that may not be practical if the data depth and breadth are large. Secondly, is there a way I can get this done in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the relative change of a series in relation to its first element. So there's no need to take a detour through relative changes and recalculate the index from them when you can get it directly by:
df = pd.Series(array1)/array1[0]*100
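The same idea extends to multiple columns in one step, because dividing a DataFrame by its first row broadcasts column by column; a sketch with a hypothetical two-column frame built from array1:
import numpy as np
import pandas as pd
# hypothetical frame; any numeric columns are rebased the same way
prices = pd.DataFrame({'a': array1, 'b': array1[::-1]})
rebased = prices / prices.iloc[0] * 100   # each column divided by its own first value
print(rebased.head())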
As far as I know, there is still no off-the-shelf expanding_window version for pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# compute percentage change with respect to the first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64
I have two columns in my dataset:
1) Supplier_code
2) Item_code
I have grouped them using:
data.groupby(['supplier_code', 'item_code']).size()
I get a result like this:
supplier_code item_code
591495 127018419 9
547173046 1
3024466 498370473 1
737511044 1
941755892 1
6155238 875189969 1
13672569 53152664 1
430351453 1
573603000 1
634275342 1
18510135 362522958 6
405196476 6
441901484 12
29222428 979575973 1
31381089 28119319 2
468441742 3
648079349 18
941387936 1
I get my top 15 suppliers using:
supCounter = collections.Counter(datalist[3])
supDic = dict(sorted(supCounter.iteritems(), key=operator.itemgetter(1), reverse=True)[:15])
print supDic.keys()
This is my list of top 15 suppliers:
[723223131, 687164888, 594473706, 332379250, 203288669, 604236177,
533512754, 503134099, 982883317, 147405879, 151212120, 737780569, 561901243,
786265866, 79886783]
Now I want to join the two, i.e. group by and get only the top 15 suppliers and their item counts.
Kindly help me in figuring this out.
IIUC, you can groupby supplier_code and then sum and sort_values. Take the top 15 and you're done.
For example, with:
gb_size = data.groupby(['supplier_code', 'item_code']).size()
Then:
N = 3 # change to 15 for actual data
gb_size.groupby("supplier_code").sum().sort_values(ascending=False).head(N)
Output:
supplier_code
31381089 24
18510135 24
591495 10
dtype: int64
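To also recover the item-level counts for just those suppliers (the other half of the question), one possible follow-up, reusing gb_size and N from above, is to filter the MultiIndexed result by the selected supplier codes:
top_suppliers = gb_size.groupby("supplier_code").sum().sort_values(ascending=False).head(N)
# keep only the (supplier_code, item_code) rows whose supplier is in the top N
top_items = gb_size[gb_size.index.get_level_values("supplier_code").isin(top_suppliers.index)]
print(top_items)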
This question follows up on my question about linear interpolation between two data points.
I built the following function from it:
def inter(colA, colB):
s = pd.Series([colA, np.nan, colB], index= [95, 100, 102.5])
s = s.interpolate(method='index')
return s.iloc[1]
Now I have a data frame that looks like this:
on95 on102.5 on105
Index
1 5 17 20
2 7 15 25
3 6 16 23
I would like to create a new column df['new'] that uses the function inter with on95 and on102.5 as inputs.
I tried it like this:
df['new'] = inter(df['on95'],df['on102.5'])
but this resulted in NaNs.
I also tried with apply(inter) but did not find a way to make it work without an error message.
Any hints how I can solve this?
You need to vectorize your self-defined function with np.vectorize, because as written the parameters are passed in as whole pandas Series rather than scalars:
inter = np.vectorize(inter)
df['new'] = inter(df['on95'],df['on102.5'])
df
on95 on102.5 on105 new
#Index
# 1 5 17 20 13.000000
# 2 7 15 25 12.333333
# 3 6 16 23 12.666667
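If you would rather avoid np.vectorize, a row-wise apply is an alternative sketch; it calls the original (unvectorized) inter once per row, which is slower on large frames but stays within pandas:
# pass the two scalar values of each row to inter
df['new'] = df.apply(lambda row: inter(row['on95'], row['on102.5']), axis=1)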