I'd like to generate a series that is the incremental mean of a timeseries: starting from the first date (index 0), the mean stored in row x is the average of values [0:x].
data

index  value  mean         formula
0      4
1      5
2      6
3      7      5.5          average(0-3)
4      4      5.2          average(0-4)
5      5      5.166666667  average(0-5)
6      6      5.285714286  average(0-6)
7      7      5.5          average(0-7)
I'm hoping there's a way to do this without looping to take advantage of pandas.
Here's an update for newer versions of Pandas (starting with 0.18.0)
df['value'].expanding().mean()
or
s.expanding().mean()
As @TomAugspurger points out, you can use expanding_mean:
In [11]: s = pd.Series([4, 5, 6, 7, 4, 5, 6, 7])
In [12]: pd.expanding_mean(s, 4)
Out[12]:
0 NaN
1 NaN
2 NaN
3 5.500000
4 5.200000
5 5.166667
6 5.285714
7 5.500000
dtype: float64
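pd.expanding_mean was deprecated in 0.18 and removed in later releases, so on a modern install the same output (including the leading NaNs) should come from the expanding window API:

s.expanding(min_periods=4).mean()  # min_periods=4 mirrors pd.expanding_mean(s, 4)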
Another approach is to use cumsum() and divide by the cumulative number of items, for example:
In [1]: import numpy as np

In [2]: s = pd.Series([4, 5, 6, 7, 4, 5, 6, 7])

In [3]: s.cumsum() / pd.Series(np.arange(1, len(s) + 1), s.index)
Out[3]:
0 4.000000
1 4.500000
2 5.000000
3 5.500000
4 5.200000
5 5.166667
6 5.285714
7 5.500000
dtype: float64
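The two approaches should agree up to floating-point noise; a quick sanity check:

import numpy as np
manual = s.cumsum() / pd.Series(np.arange(1, len(s) + 1), s.index)
# compare the manual running mean against the built-in expanding mean
assert np.allclose(manual, s.expanding().mean())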
I am fairly new to Python and trying to figure out how to generate dataframes for multiple arrays. I have a list where the arrays are currently stored:
list = [[1, 2, 3, 4], [12, 19, 30, 60, 95, 102]]
What I want to do is take each array from this list and put them into separate dataframes, with the array contents populating a column of the dataframe like so:
Array2_df
1 12
2 19
3 30
4 60
I have found several answers involving the use of dictionaries, but am not sure how that would actually solve my problem... I also don't understand how naming the dataframes dynamically would work. I have tried playing around with for loops, but that just overwrote the same dataframe repeatedly. Please help!! Thanks :)
As mentioned in the comments, dynamically created variables are a bad idea. Why not use a single dataframe, like so:
In [1]: zlist = [[1, 2, 3, 4], [12, 19, 30, 60, 95, 102], [1, 2, 4, 5, 1, 6, 1, 7, 8, 21]]
In [2]: pd.DataFrame({f"array_{i}": pd.Series(z) for i, z in enumerate(zlist)})
Out[2]:
array_0 array_1 array_2
0 1.0 12.0 1
1 2.0 19.0 2
2 3.0 30.0 4
3 4.0 60.0 5
4 NaN 95.0 1
5 NaN 102.0 6
6 NaN NaN 1
7 NaN NaN 7
8 NaN NaN 8
9 NaN NaN 21
If you really insist on separate dataframes, then you should store them in a dictionary:
df_dict = {f"array_{i}": pd.DataFrame({f"array_{i}": z}) for i, z in enumerate(zlist)}
Then, you can access a specific dataframe by name:
In [8]: df_dict["array_2"]
Out[8]:
array_2
0 1
1 2
2 4
3 5
4 1
5 6
6 1
7 7
8 8
9 21
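Iterating over the dictionary is then ordinary dict iteration, for example:

for name, frame in df_dict.items():
    print(name, frame.shape)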
I have a dataframe in which the index is a datetime and columns A and B are objects. I need to see the unique values of A and B per week.
I managed to get the unique value count per week (I am using pd.Grouper for that), but I am struggling to get the unique values themselves per week.
This code gives me the unique value counts per week:
df_unique = pd.DataFrame(df.groupby(pd.Grouper(freq="W"))[['A', 'B']].nunique())
However, the code below does not give me the unique values themselves per week:
df_unique_list = pd.DataFrame(df.groupby(pd.Grouper(freq="W"))[['A', 'B']].unique())
This code gives me the following error message:
AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'
Use a lambda function with Series.unique, converting the result to a list:
import numpy as np
import pandas as pd

np.random.seed(123)
rng = pd.date_range('2017-04-03', periods=20)
df = pd.DataFrame({'A': np.random.choice([1, 2, 3, 4, 5, 6], size=20),
                   'B': np.random.choice([1, 2, 3, 4, 5, 6, 7, 8], size=20)},
                  index=rng)
print(df)
A B
2017-04-03 6 1
2017-04-04 3 5
2017-04-05 5 2
2017-04-06 3 8
2017-04-07 2 4
2017-04-08 4 3
2017-04-09 3 5
2017-04-10 4 8
2017-04-11 2 3
2017-04-12 2 5
2017-04-13 1 8
2017-04-14 2 1
2017-04-15 2 6
2017-04-16 1 1
2017-04-17 1 8
2017-04-18 2 2
2017-04-19 4 4
2017-04-20 6 5
2017-04-21 5 5
2017-04-22 1 5
df_unique_list = df.groupby(pd.Grouper(freq="W"))[['A', 'B']].agg(lambda x: list(x.unique()))
print(df_unique_list)
A B
2017-04-09 [6, 3, 5, 2, 4] [1, 5, 2, 8, 4, 3]
2017-04-16 [4, 2, 1] [8, 3, 5, 1, 6]
2017-04-23 [1, 2, 4, 6, 5] [8, 2, 4, 5]
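If you later need one row per (week, value) pair, Series.explode (available from pandas 0.25) should undo the lists; a minimal sketch:

# one row per week/value combination for column A
df_unique_list['A'].explode()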
So I have a dataframe that looks like this:
test = pd.DataFrame([[1, 10, 14], [1, 12, 14], [1, 20, 12], [1, 25, 12], [2, 18, 12], [2, 30, 14], [2, 4, 12], [2, 10, 14]], columns=['A', 'B', 'C'])
A B C
0 1 10 14
1 1 12 14
2 1 20 12
3 1 25 12
4 2 18 12
5 2 30 14
6 2 4 12
7 2 10 14
My goal is to get the z-scores of column B, relative to their groups by columns A and C. I know I can calculate the mean and standard deviation of each group:
test.groupby(['A', 'C']).mean()
B
A C
1 12 22.5
14 11.0
2 12 11.0
14 20.0
test.groupby(['A', 'C']).std()
B
A C
1 12 3.535534
14 1.414214
2 12 9.899495
14 14.142136
Now for every item in column B I want to calculate its z-score based on these means and standard deviations. So the first result would be (10 - 11) / 1.41. I feel like there has to be a way to do this without too much complexity, but I've been stuck on how to proceed. Let me know if anyone can point me in the right direction or if I need to clarify anything!
Do it with transform:
Mean = test.groupby(['A', 'C']).B.transform('mean')
Std = test.groupby(['A', 'C']).B.transform('std')
Then
(test.B - Mean) / Std
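Put together, this fits in one line; here the result is written to a new column (the name B_z is just illustrative):

# groupwise z-score; g.std() defaults to ddof=1, matching .std() above
test['B_z'] = test.groupby(['A', 'C'])['B'].transform(lambda g: (g - g.mean()) / g.std())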
Or use the zscore function from scipy:
from scipy.stats import zscore
test.groupby(['A', 'C']).B.transform(lambda x: zscore(x, ddof=1))
Out[140]:
0 -0.707107
1 0.707107
2 -0.707107
3 0.707107
4 0.707107
5 0.707107
6 -0.707107
7 -0.707107
Name: B, dtype: float64
And to show that the numbers tie out:
(test.B - Mean) / Std == test.groupby(['A', 'C']).B.transform(lambda x: zscore(x, ddof=1))
Out[148]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
Name: B, dtype: bool
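The exact == comparison happens to hold here, but with floating-point results it is generally safer to compare with a tolerance:

import numpy as np
# allclose tolerates tiny rounding differences between the two computations
np.allclose((test.B - Mean) / Std,
            test.groupby(['A', 'C']).B.transform(lambda x: zscore(x, ddof=1)))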
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_quantile.html
I cannot see how best to ignore NaNs in the rolling quantile function. Would anyone know?
seriestest = pd.Series([1, 5, 7, 2, 4, 6, 9, 3, 8, 10])
and insert NaNs:
seriestest2 = pd.Series([1, 5, np.nan, 2, 4, np.nan, 9, 3, 8, 10])
Now, on the first series, I get the expected output using:
seriestest.rolling(window = 3).quantile(.5)
But I wish to do the same on the test2 series while ignoring NaNs.
seriestest2.rolling(window = 3).quantile(.5)
Gives:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 8.0
9 8.0
dtype: float64
But I would expect something like this if I could pass a skipna=True, which doesn't work for me:
0 NaN
1 NaN
2 5.0
3 2.0
4 4.0
5 4.0
6 4.0
7 3.0
8 8.0
9 8.0
dtype: float64
The issue is that NaN values leave you with fewer than the required number of elements (3) in your rolling window. You can lower the minimum number of valid observations required by setting the min_periods parameter.
seriestest2.rolling(window=3, min_periods=1).quantile(.5)
Alternatively, if you simply want to replace NaN values with, say, 0, you can use fillna:
seriestest2.fillna(value=0).rolling(window=3).quantile(.5)
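If the goal is to genuinely skip the NaNs, so that every window spans three valid observations, one option (a sketch, not the only way) is to roll over the valid values and align the result back to the original index:

# drop NaNs, compute the rolling median, then restore the original index
seriestest2.dropna().rolling(window=3).quantile(.5).reindex(seriestest2.index)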
I am struggling to get the right index (restricted to the selection) when using the pandas method xs to select specific data in my dataframe. Let me demonstrate what I am doing:
print(df)
value
idx1 idx2 idx3 idx4 idx5
10 2.0 0.0010 1 2 6.0 ...
2 3 6.0 ...
...
7 8 6.0 ...
8 9 6.0 ...
20 2.0 0.0010 1 2 6.0 ...
2 3 6.0 ...
...
18 19 6.0 ...
19 20 6.0 ...
# get dataframe for idx1 = 10, idx2 = 2.0, idx3 = 0.0010
print(df.xs([10,2.0,0.0010]))
value
idx4 idx5
1 2 6.0 ...
2 3 6.0 ...
3 4 6.0 ...
4 5 6.0 ...
5 6 6.0 ...
6 7 6.0 ...
7 8 6.0 ...
8 9 6.0 ...
# get the first index list of this part of the dataframe
print(df.xs([10,2.0,0.0010]).index.levels[0])
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
So I do not understand why the full list of values occurring in idx4 is returned, even though we restricted the dataframe to a part where idx4 only takes values from 1 to 8. Am I using the index method in a wrong way?
This is a known feature, not a bug. pandas preserves all of the index information. You can determine which of the levels are expressed, and at what locations, via the labels attribute (renamed codes in newer pandas versions).
If you are looking to create an index that is fresh and just contains the information relevant to the slice you just made, you can do this:
df_new = df.xs([10, 2.0, 0.0010])
idx_new = pd.MultiIndex.from_tuples(df_new.index.to_series(),
                                    names=df_new.index.names)
df_new.index = idx_new
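On pandas 0.20 or newer, MultiIndex.remove_unused_levels should do the same thing more directly:

df_new = df.xs([10, 2.0, 0.0010])
# drop index values that no longer occur in the sliced frame
df_new.index = df_new.index.remove_unused_levels()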