Compute the median of dynamic time series - python

If I have a pandas series [a1,a2,a3,a4,...] with length = T. Each value corresponds to one day. For each day, I would like to compute the historical median. For example, the first day compute the median of [a1]; the second day compute the median of [a1,a2]; the nth day compute the median of [a1,a2,...,an]. Finally I would like to get a series with length = T as well. Do we have an efficient way to do this in pandas? Thanks!

For a Series ser:
import numpy as np
import pandas as pd

ser = pd.Series(np.random.randint(0, 100, 10))
If your pandas version is 0.18.0 or above, use:
ser.expanding().median()
Out:
0 0.0
1 25.0
2 50.0
3 36.5
4 33.0
5 36.0
6 33.0
7 36.0
8 33.0
9 36.0
dtype: float64
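For the day-by-day setting described in the question, the same call works on a date-indexed Series. A minimal sketch with made-up dates and values:
import numpy as np
import pandas as pd

# made-up daily data of length T = 10
dates = pd.date_range("2017-01-01", periods=10, freq="D")
daily = pd.Series(np.random.randint(0, 100, 10), index=dates)

# day n gets the median of the values from day 1 through day n
hist_median = daily.expanding().median()
print(hist_median)   # same length T as the input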
The following works in earlier versions but is deprecated:
pd.expanding_median(ser)
FutureWarning: pd.expanding_median is deprecated for Series and will be removed in a future version, replace with
    Series.expanding(min_periods=1).median()
Out:
0 0.0
1 25.0
2 50.0
3 36.5
4 33.0
5 36.0
6 33.0
7 36.0
8 33.0
9 36.0
dtype: float64


Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID

I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surface and volume values for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and volume per apartment. There are two conditions for the original dataframe:
- the dataframe can contain empty cells
- when the values of surface or volume are equal for all rows within an ID (all the same values for the same ID), the data (surface, volume) is not summed; instead a single value/row is passed to the new summary column (example: 'ID 4'), as this could be a mistake in the original dataframe where the total surface/volume was entered for every room by the government employee
Initial dataframe 'data':
print(data)
ID Surface Volume
0 2 10.0 25.0
1 2 12.0 30.0
2 2 24.0 60.0
3 2 8.0 20.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 NaN NaN
8 52 96.0 240.0
9 95 8.0 20.0
10 95 6.0 15.0
11 95 12.0 30.0
12 95 30.0 75.0
13 95 12.0 30.0
Desired output from 'df':
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0 #-> as the values are the same for each row of this ID in the original data, the sum is not taken, but only one of the rows is passed (see the second condition)
2 52 96.0 240.0
3 95 68.0 170.0
Tried code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2,4,52,95]})

data = pd.DataFrame({"ID": [2,2,2,2,4,4,4,52,52,95,95,95,95,95],
"Surface": [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],
"Volume": [25,30,60,20,200,200,200,np.nan,240,20,15,30,75,30]})

print(data)

#Tried something, but no idea how to do this actually:
df["Surface"] = data.groupby("ID").agg(sum)
df["Volume"] = data.groupby("ID").agg(sum)
print(df)
Two conditions are needed here. First, test whether each column has only one unique value per ID group, using GroupBy.transform with DataFrameGroupBy.nunique and comparing to 1 with eq. Second, mark rows that duplicate a value within their column together with the ID column, using DataFrame.duplicated per column.
Chain both masks with & (bitwise AND), replace the matched values with NaN via DataFrame.mask, and finally aggregate with sum:
cols = ['Surface','Volume']
# True where a column has a single unique value within its ID group
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
# True for rows that repeat a (value, ID) pair already seen in that column
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
# blank out the repeated values and sum what is left per ID
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0
2 52 96.0 240.0
3 95 68.0 170.0
If you need new columns filled with the aggregated sum values instead, use GroupBy.transform:
cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
data[cols] = data[cols].mask(m1 & m2).groupby(data["ID"]).transform('sum')
print(data)
ID Surface Volume
0 2 54.0 135.0
1 2 54.0 135.0
2 2 54.0 135.0
3 2 54.0 135.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 96.0 240.0
8 52 96.0 240.0
9 95 68.0 170.0
10 95 68.0 170.0
11 95 68.0 170.0
12 95 68.0 170.0
13 95 68.0 170.0
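For comparison, a hypothetical groupby/agg sketch that encodes the same rule per group (keep one value when a group's values are all identical, otherwise sum); the helper name summarise is made up, and this is typically slower than the mask-based approach above:
import numpy as np
import pandas as pd

data = pd.DataFrame({"ID": [2,2,2,2,4,4,4,52,52,95,95,95,95,95],
                     "Surface": [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],
                     "Volume": [25,30,60,20,200,200,200,np.nan,240,20,15,30,75,30]})

def summarise(s):
    # if every non-NaN value in the group is the same, keep a single value;
    # otherwise sum the group (sum skips NaN by default)
    return s.dropna().iloc[0] if s.nunique(dropna=True) == 1 else s.sum()

df = data.groupby("ID", as_index=False)[["Surface", "Volume"]].agg(summarise)
print(df)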

Python: Array-based equation

I have a dataframe 500 rows long by 4 columns. I need to find the proper Python code that would divide the current row by the row below it and then multiply that value by the value in the last row, for every value in each column. Basically, I need to replicate this Excel formula.
It's not clear whether your data is stored in a NumPy array. If it is, with the original data contained in a, you'd write:
b = a[-1]*(a[:-1]/a[1:])
a[-1] is the last row, a[:-1] is the array without the last row, and a[1:] is the array without the first row (index zero).
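A minimal runnable sketch of that one-liner, using a small made-up array in place of the real 500 x 4 data:
import numpy as np

# small stand-in for the real data
a = np.array([[10., 78., 27., 23.],
              [72., 42., 77., 86.],
              [82., 12., 58., 98.],
              [60., 57., 89., 100.]])

# each row divided by the row below it, scaled by the last row
b = a[-1] * (a[:-1] / a[1:])
print(b)   # one row shorter than the input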
Assuming you are talking about a pandas DataFrame:
import pandas as pd
import random

# sample DataFrame object
df = pd.DataFrame((float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)))
                  for _ in range(10))

def function(col):
    # divide each value by the one below it, scaled by the column's last value
    col = col.copy()
    for i in range(len(col) - 1):
        col.iloc[i] = (col.iloc[i] / col.iloc[i + 1]) * col.iloc[-1]
    return col

print(df)              # before the formula is applied
df = df.apply(function)
print(df)              # after the formula is applied
>>>
0 1 2 3
0 10.0 78.0 27.0 23.0
1 72.0 42.0 77.0 86.0
2 82.0 12.0 58.0 98.0
3 27.0 92.0 19.0 86.0
4 48.0 83.0 14.0 43.0
5 55.0 18.0 58.0 77.0
6 20.0 58.0 20.0 22.0
7 76.0 19.0 63.0 82.0
8 23.0 99.0 58.0 15.0
9 60.0 57.0 89.0 100.0
0 1 2 3
0 8.333333 105.857143 31.207792 26.744186
1 52.682927 199.500000 118.155172 87.755102
2 182.222222 7.434783 271.684211 113.953488
3 33.750000 63.180723 120.785714 200.000000
4 52.363636 262.833333 21.482759 55.844156
5 165.000000 17.689655 258.100000 350.000000
6 15.789474 174.000000 28.253968 26.829268
7 198.260870 10.939394 96.672414 546.666667
8 23.000000 99.000000 58.000000 15.000000
9 60.000000 57.000000 89.000000 100.000000
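For completeness, a vectorized pandas sketch of the same formula (not part of the original answer); shift(-1) aligns each row with the row below it, so this should match the loop above without iterating:
import pandas as pd

# small made-up frame standing in for the real data
df = pd.DataFrame({0: [10.0, 72.0, 82.0, 60.0],
                   1: [78.0, 42.0, 12.0, 57.0],
                   2: [27.0, 77.0, 58.0, 89.0],
                   3: [23.0, 86.0, 98.0, 100.0]})

result = (df / df.shift(-1)) * df.iloc[-1]
result.iloc[-1] = df.iloc[-1].values   # the last row has no row below; keep it unchanged
print(result)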

Strange behavior with Pandas median

Consider the following dataframe:
b c d e f g h
0 6.25 2018-04-01 True NaN 7 54.0 64.0
1 32.50 2018-04-01 True NaN 7 54.0 64.0
2 16.75 2018-04-01 True NaN 7 54.0 64.0
3 29.25 2018-04-01 True NaN 7 54.0 64.0
4 21.75 2018-04-01 True NaN 7 54.0 64.0
5 21.75 2018-04-01 True True 7 54.0 64.0
6 7.75 2018-04-01 True True 7 54.0 64.0
7 23.25 2018-04-01 True True 7 54.0 64.0
8 12.25 2018-04-01 True True 7 54.0 64.0
9 30.50 2018-04-01 True NaN 7 54.0 64.0
(copy and paste and use df = pd.read_clipboard() to create the dataframe)
Finding the medians initially works with no problem:
df.median()
b 21.75
d 1.00
e 1.00
f 7.00
g 54.00
h 64.00
dtype: float64
However, if a column is dropped and then the median is found, the median for column e disappears:
new_df = df.drop(columns=['b'])
new_df.median()
d 1.0
f 7.0
g 54.0
h 64.0
dtype: float64
This behavior is a little unexpected and finding the median for column e by itself still works:
new_df['e'].median()
1.0
Using skipna=False does not make a difference:
new_df.median(skipna=False)
d 1.0
f 7.0
g 54.0
h 64.0
dtype: float64
(it does for the original dataframe):
df.median(skipna=False)
b 21.75
d 1.00
e NaN
f 7.00
g 54.00
h 64.00
dtype: float64
The datatype of column e is object in both df and new_df and the only difference between the two dataframes is new_df does not have column b. Adding the column back into new_df does not resolve the issue. This only occurs when the first column b is dropped. It does not occur if column e is a float or integer datatype.
This behavior is present in both pandas==0.22.0 and pandas==0.24.1
There is now an open GitHub issue for anyone to try and solve this!
This appears to be a bug. When we call median on a DataFrame, it maps to the internal _reduce function. With numeric_only set to None, this computes the median column by column, ignores failures (for column c, for example, the median computation fails), and accumulates the results (see _reduce in pandas core/frame.py). So far so good. But while stitching the results together it checks whether the results are scalars or Series (for median they are of course scalars). To do this check, it always uses the first column (see wrap_results in pandas core/apply.py). So if the first column's calculation failed and was skipped, this check fails and raises an exception. That triggers the fallback path within _reduce, which forces the DataFrame to numeric columns only (dropping object-dtype columns such as e, which is object because it mixes True and NaN) and re-computes the medians.
So in your case, whenever column c (or any other column whose median computation fails, such as text) sits in the first position, object columns like e are also dropped from the median results. Setting skipna makes no difference, because the bug lies in how a non-numeric first column triggers the forced numeric-only computation. I do not see any fix short of patching the pandas codebase, or ensuring the first column will always succeed in the median computation.
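A possible workaround sketch (not from the original answer), assuming e is meant to be a boolean flag: cast it to a numeric dtype before calling median, so the forced numeric-only fallback no longer drops it.
# object column mixing True and NaN -> float (True becomes 1.0, NaN stays NaN)
new_df['e'] = new_df['e'].astype(float)
print(new_df.median())   # column e is now included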

Interpolation of a dataframe with immediate data appearing before and after it - Pandas

Let's say I have a dataframe like this -
ID Weight Height
1 80.0 180.0
2 60.0 170.0
3 NaN NaN
4 NaN NaN
5 82.0 185.0
I want the dataframe to be transformed to -
ID Weight Height
1 80.0 180.0
2 60.0 170.0
3 71.0 177.5
4 76.5 181.25
5 82.0 185.0
It takes the average of the immediate data available before and after a NaN and updates the missing/NaN value accordingly.
You can use interpolation from the pandas library as follows:
df['Weight'], df['Height'] = df.Weight.interpolate(), df.Height.interpolate()
Check the arguments on the documentation for the method of interpolation to tune this to your problem case: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html
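A runnable sketch on the example data; note that interpolate's default linear method spaces the filled values evenly between the surrounding points (67.33/74.67 and 175.0/180.0 here), which is close to but not identical to the neighbour-averaging shown in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                   "Weight": [80.0, 60.0, np.nan, np.nan, 82.0],
                   "Height": [180.0, 170.0, np.nan, np.nan, 185.0]})

# linear interpolation fills each NaN from the nearest valid values on either side
df[["Weight", "Height"]] = df[["Weight", "Height"]].interpolate()
print(df)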

Plotting incompatible numpy arrays

I have two numpy arrays of identical length (398 rows), with the first 5 values for each as follows:
y_predicted =
[[-0.85908649]
[-1.19176482]
[-0.93658361]
[-0.83557211]
[-0.80681243]]
y_norm =
mpg
0 -0.705551
1 -1.089379
2 -0.705551
3 -0.961437
4 -0.833494
That is, the first has square brackets around each value, and the second has indexing and no square brackets.
The data is a normalised version of the first column (MPG) of the Auto-MPG dataset. The y_predicted values are results of a linear regression.
https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
Would anyone know how I might convert these arrays to the same type so I can plot a scatter plot of them?
Both have shape: (398, 1)
Both have type: class 'numpy.ndarray', dtype float64
Data from the link provided
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
The second of these looks like a pandas object (a Series or single-column DataFrame) to me. If so, you can use y_norm.values to get the underlying numpy array.
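A minimal plotting sketch, assuming y_predicted is the (398, 1) ndarray and y_norm is the pandas object shown above; ravelling both gives flat 1-D arrays that matplotlib's scatter accepts:
import numpy as np
import matplotlib.pyplot as plt

x = np.ravel(y_predicted)          # (398, 1) -> (398,)
y = np.asarray(y_norm).ravel()     # works for a Series, DataFrame or ndarray

plt.scatter(x, y)
plt.xlabel("predicted (normalised mpg)")
plt.ylabel("actual (normalised mpg)")
plt.show()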
