Python pandas: get daily MIN, MAX, AVG results of datasets

Using Python with pandas to export data from a database to CSV. The data looks like this when exported. I get about 100 logs per day, so this is purely for visualisation purposes:
time                 Buf1  Buf2
12/12/2022 19:15:56  12    3
12/12/2022 18:00:30  5     18
11/12/2022 15:15:08  12    3
11/12/2022 15:15:08  10    9
Right now I only dump the raw data to CSV, but I need to generate a min, max, and average value for each day. What's the best way to do that? I've been trying min() and max() functions, but the problem is that these CSV files span multiple days. I've also tried manipulating the data in Python itself, but I'm worried I'll miss some rows and the data will no longer be correct.
I would like to end up with something like this:
time        buf1_max  buf1_min
12/12/2022  12        3
11/12/2022  12        10

Here you go, step by step.
In [27]: df['time'] = pd.to_datetime(df['time']).dt.date
In [28]: df
Out[28]:
         time  Buf1  Buf2
0  2022-12-12    12     3
1  2022-12-12     5    18
2  2022-11-12    12     3
3  2022-11-12    10     9
In [29]: df = df.set_index("time")
In [30]: df
Out[30]:
            Buf1  Buf2
time
2022-12-12    12     3
2022-12-12     5    18
2022-11-12    12     3
2022-11-12    10     9
In [31]: df.groupby(df.index).agg(['min', 'max', 'mean'])
Out[31]:
           Buf1           Buf2
            min max  mean  min max  mean
time
2022-11-12   10  12  11.0    3   9   6.0
2022-12-12    5  12   8.5    3  18  10.5
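If you also want flat column names like buf1_max for the CSV (as in your desired output), you can join the two column levels before exporting. A minimal sketch, assuming the grouped frame above is stored in a variable; the output file name is just an example:
daily = df.groupby(df.index).agg(['min', 'max', 'mean'])
# flatten the MultiIndex columns: ('Buf1', 'max') -> 'buf1_max'
daily.columns = [f"{col.lower()}_{stat}" for col, stat in daily.columns]
# write with the date index as a regular 'time' column
daily.rename_axis('time').to_csv('daily_stats.csv')  # hypothetical file name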

Another approach is to use pivot_table to simplify the grouping (remember to convert the 'time' column to a date first, as suggested above):
import pandas as pd

df.pivot_table(
    index='time',
    values=['Buf1', 'Buf2'],
    aggfunc={'Buf1': ['min', 'max', 'mean'], 'Buf2': ['min', 'max', 'mean']},
)
You can add any aggfunc you wish.
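For reference, a self-contained sketch of the pivot_table approach; the DataFrame construction here is assumed, mirroring the rows posted in the question:
import pandas as pd

df = pd.DataFrame({
    'time': ['12/12/2022 19:15:56', '12/12/2022 18:00:30',
             '11/12/2022 15:15:08', '11/12/2022 15:15:08'],
    'Buf1': [12, 5, 12, 10],
    'Buf2': [3, 18, 3, 9],
})
df['time'] = pd.to_datetime(df['time']).dt.date

print(df.pivot_table(
    index='time',
    values=['Buf1', 'Buf2'],
    aggfunc={'Buf1': ['min', 'max', 'mean'], 'Buf2': ['min', 'max', 'mean']},
))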

Related

Rolling average based on another column

I have a dataframe df which looks like
time(float)  value (float)
10.45        10
10.50        20
10.55        25
11.20        30
11.44        20
12.30        30
I need help calculating a new column called rolling_average_value, which is the average of that row's value and all values within the hour before it, so that the new dataframe looks like this:
time(float)  value (float)  rolling_average_value
10.45        10             10
10.50        20             15
10.55        25             18.33
11.20        30             21.25
11.44        20             21
12.30        30             25
Note: This time column is a float column
You can temporarily set a datetime index and apply rolling.mean:
import numpy as np
import pandas as pd

# extract hours/minutes from the float times
minutes, hours = np.modf(df['time(float)'])
hours = hours.astype(int)
minutes = minutes.mul(100).round().astype(int)  # round to avoid float truncation (e.g. 0.2*100 -> 19)
dt = pd.to_datetime(hours.astype(str) + minutes.astype(str).str.zfill(2),
                    format='%H%M')

# perform the rolling computation on a temporary datetime axis
df['rolling_mean'] = (df.set_axis(dt)
                        .rolling('1h')['value (float)']
                        .mean()
                        .set_axis(df.index)
                      )
output:
   time(float)  value (float)  rolling_mean
0        10.45             10     10.000000
1        10.50             20     15.000000
2        10.55             25     18.333333
3        11.20             30     21.250000
4        11.44             20     21.000000
5        12.30             30     25.000000
Alternative way to compute dt, formatting the floats with two decimal places so that e.g. 10.5 parses as 10:50:
dt = pd.to_datetime(df['time(float)'].map('{:05.2f}'.format), format='%H.%M')
Assuming your data frame is sorted by time, you can also use a simple list comprehension. Iterate over the times and get all indices where the distance from the previous time values to the current value is less than one (meaning less than one hour), then slice the value column (converted to an array) by those indices and compute the mean of the sliced array:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {"time": [10.45, 10.5, 10.55, 11.2, 11.44, 12.3],
     "value": [10, 20, 25, 30, 20, 30]}
)

times = df["time"].values
values = df["value"].values
df["rolling_mean"] = [
    round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2)
    for i in range(len(times))
]
If your data frame is large, you can JIT-compile this loop with numba to make it significantly faster:
import numpy as np
from numba import njit

@njit
def compute_rolling_mean(times, values):
    return [round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2)
            for i in range(len(times))]

df["rolling_mean"] = compute_rolling_mean(df["time"].values, df["value"].values)
Output:
    time  value  rolling_mean
0  10.45     10         10.00
1  10.50     20         15.00
2  10.55     25         18.33
3  11.20     30         21.25
4  11.44     20         21.00
5  12.30     30         25.00

Pandas Dataframe ... how to incrementally add values of rows?

Is there an easy way to sum the values of all the rows above the current row into an adjacent column?
Text explanation: I'm trying to create a chart where column B is either the running sum or the percent of total of all the rows in A above it. That way I can quickly visualize where the quartile, third, etc. fall in the dataframe. I'm familiar with the percentile function
How to calculate 1st and 3rd quartiles?
but I'm not sure I can get it to do exactly what I want. Text version:
Text Version
1--1%
1--2%
4--6%
4--10%
2--12%
...
and so on to 100 percent.
Do I need to write a for loop to do this?
You can use cumsum for this:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=dict(x=[13,22,34,21,33,41,87,24,41,22,18,12,13]))
df["percent"] = (100*df.x.cumsum()/df.x.sum()).round(1)
output:
     x  percent
0   13      3.4
1   22      9.2
2   34     18.1
3   21     23.6
4   33     32.3
5   41     43.0
6   87     65.9
7   24     72.2
8   41     82.9
9   22     88.7
10  18     93.4
11  12     96.6
12  13    100.0
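If you want the raw running total rather than the percent of total, the same cumsum call gives it directly. A one-line sketch on the df above; the column name running_total is made up:
df["running_total"] = df.x.cumsum()  # each row = itself plus all rows above it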

how to construct an index from percentage change time series?

Consider the values below:
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,
                   532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
From here, how do I construct an index (base=100)? My desired output is:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve this with an iterative (loop) solution, but that may not be practical if the data is deep and wide. Secondly, is there a way to do this in a single step on multiple columns? Thank you all for any guidance.
An index (base=100) is the relative change of a series in relation to its first element. So there's no need to take a detour through percentage changes and recalculate the index from them when you can get it directly:
df = pd.Series(array1)/array1[0]*100
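The same one-liner extends to multiple columns in a single step, since dividing a DataFrame by its first row broadcasts column-wise. A minimal sketch with a hypothetical two-column frame:
import pandas as pd

prices = pd.DataFrame({'a': [526.59, 528.88, 536.19],
                       'b': [100.0, 101.0, 99.5]})  # made-up second column
indexed = prices.div(prices.iloc[0]).mul(100)  # every column rebased to 100 at row 0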
As far as I know, there is still no off-the-shelf expanding_window version for pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd

series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,
                    532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])

# compute percentage change with respect to the first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64

Need to fill NA values with the mean of the past three values in Python

I need to fill the NA values with the mean of the three values before each NA.
This is my dataset:
RECEIPT_MONTH_YEAR NET_SALES
0 2014-01-01 818817.20
1 2014-02-01 362377.20
2 2014-03-01 374644.60
3 2014-04-01 NA
4 2014-05-01 NA
5 2014-06-01 NA
6 2014-07-01 NA
7 2014-08-01 46382.50
8 2014-09-01 55933.70
9 2014-10-01 292303.40
10 2014-10-01 382928.60
Is this dataset a .csv file or a dataframe? Is the NA a NaN or a string?
import pandas as pd
import numpy as np

df = pd.read_csv('your dataset', sep=' ')
df = df.replace('NA', np.nan)  # replace() returns a new frame, so assign it back
df = df.ffill()                # forward-fill the NaNs
You mention taking the mean of 3 values; the above simply forward-fills the last observation before the NaNs begin. This is often a good approach for forecasting (better than taking means in certain cases, when persistence is important).
# indices of the NaN rows
ind = df['NET_SALES'].index[df['NET_SALES'].apply(np.isnan)]
# mean of the three values just before the first NaN
mean_of_3 = df['NET_SALES'].iloc[ind[0] - 3:ind[0]].mean(skipna=True)
df['NET_SALES'] = df['NET_SALES'].fillna(mean_of_3)
Maybe the answer can be generalised and improved if more is known about the dataset, e.g. whether you always want the mean of the last 3 measurements before any NA. The above lets you check which indices are NaN and then take the mean of the 3 values before them, ignoring any NaNs.
This is simple but it works:
df_data.fillna(0, inplace=True)
for i in range(len(df_data)):
    if df_data['NET_SALES'][i] == 0.00:
        condtn = (df_data['NET_SALES'][i - 1]
                  + df_data['NET_SALES'][i - 2]
                  + df_data['NET_SALES'][i - 3])
        df_data.loc[i, 'NET_SALES'] = condtn / 3
You could use fillna (assuming that your NA is already np.nan) and a rolling mean:
import pandas as pd
import numpy as np

df = pd.DataFrame([818817.2, 362377.2, 374644.6, np.nan, np.nan, np.nan, np.nan,
                   46382.5, 55933.7, 292303.4, 382928.6], columns=["NET_SALES"])
df["NET_SALES"] = df["NET_SALES"].fillna(
    df["NET_SALES"].shift(1).rolling(3, min_periods=1).mean()
)
Out:
NET_SALES
0 818817.2
1 362377.2
2 374644.6
3 518613.0
4 368510.9
5 374644.6
6 NaN
7 46382.5
8 55933.7
9 292303.4
10 382928.6
If you want to include the imputed values I guess you'll need to use a loop.
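A minimal sketch of such a loop, assuming the NaNs sit in NET_SALES and each gap should use values imputed earlier in the same pass:
import numpy as np

vals = df["NET_SALES"].to_numpy(copy=True)
for i in range(len(vals)):
    if np.isnan(vals[i]) and i >= 3:
        # mean of the three preceding entries, which may themselves be imputed
        vals[i] = vals[i - 3:i].mean()
df["NET_SALES_filled"] = vals  # hypothetical new column name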

align timeseries in pandas

I have 2 time series.
df = pd.DataFrame([
    ['1/10/12', 10],
    ['1/11/12', 11],
    ['1/12/12', 13],
    ['1/14/12', 12],
], columns=['Time', 'n'])
df.index = pd.to_datetime(df['Time'])

df1 = pd.DataFrame([
    ['1/13/12', 88],
], columns=['Time', 'n'])
df1.index = pd.to_datetime(df1['Time'])
I am trying to align the time series so the index is in order. I am guessing reindex_like is what I need but not sure how to use it.
Here is my desired output
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Here is what you need (append and sort have since been removed from pandas, so use concat and sort_index):
pd.concat([df, df1]).sort_index().reset_index(drop=True)
If you need to combine more pieces together, it is more efficient to do it in one call: pd.concat(<all your dataframes as a list>).
P.S. Your code is a bit redundant: you don't need to cast Time into the index if you don't need it there. You can sort values by any column, like this:
import pandas as pd

df = pd.DataFrame([
    ['1/10/12', 10],
    ['1/11/12', 11],
    ['1/12/12', 13],
    ['1/14/12', 12],
], columns=['Time', 'n'])

df1 = pd.DataFrame([
    ['1/13/12', 88],
], columns=['Time', 'n'])

pd.concat([df, df1]).sort_values('Time')
You can use concat, sort_index and reset_index:
df = pd.concat([df, df1]).sort_index().reset_index(drop=True)
print(df)
      Time   n
0  1/10/12  10
1  1/11/12  11
2  1/12/12  13
3  1/13/12  88
4  1/14/12  12
Or you can use merge_ordered (called ordered_merge in old pandas versions):
print(pd.merge_ordered(df, df1))
      Time   n
0  1/10/12  10
1  1/11/12  11
2  1/12/12  13
3  1/13/12  88
4  1/14/12  12
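Since both frames already carry a DatetimeIndex (set via pd.to_datetime above), you can also keep it instead of resetting to a RangeIndex; a minimal sketch:
combined = pd.concat([df, df1]).sort_index()  # rows now in chronological order, dates as index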
