DataFrame.rolling().mean() not calculating moving average - python

I've been trying to calculate a moving average using pandas, but when I use DataFrame.rolling().mean(), it just copies the values instead.
stock_info['stock'].head()
Fecha Open High Low Close Volume
0 04-05-2007 00:00:00 234,4593 255,5703 234,3532 246,8906 6044574
1 07-05-2007 00:00:00 246,8906 254,7023 247,855 252,1563 2953869
2 08-05-2007 00:00:00 252,1562 250,7482 244,9617 250,1695 2007217
3 09-05-2007 00:00:00 250,1695 249,7838 245,9261 248,3757 2329078
4 10-05-2007 00:00:00 248,8194 248,9158 244,9617 245,6368 2138002
stock_info['stock']['MA'] = stock_info['stock']['Close'].rolling(window=2).mean()
Fecha Open High Low Close Volume MA
0 04-05-2007 00:00:00 234,4593 255,5703 234,3532 246,8906 6044574 246,8906
1 07-05-2007 00:00:00 246,8906 254,7023 247,855 252,1563 2953869 252,1563
2 08-05-2007 00:00:00 252,1562 250,7482 244,9617 250,1695 2007217 250,1695
3 09-05-2007 00:00:00 250,1695 249,7838 245,9261 248,3757 2329078 248,3757
4 10-05-2007 00:00:00 248,8194 248,9158 244,9617 245,6368 2138002 245,6368

My first thought is that the values in stock_info['stock']['Close'] are stored as strings, rather than as a numeric type. Attempting
df['MA'] = df['Close'].rolling(window=2).mean()
on
df = pd.DataFrame({'Close': ['246,8906', '252,1563', '250,1695']})
gives
df
Out[38]:
Close MA
0 246,8906 246,8906
1 252,1563 252,1563
2 250,1695 250,1695
as happened for you.
Converting this to a numeric value first, say with
df['MA'] = df['Close'].str.replace(',', '.').astype(float).rolling(window=2).mean()
gives
df
Out[40]:
Close MA
0 246,8906 NaN
1 252,1563 249.52345
2 250,1695 251.16290
as desired.
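If the data comes from a CSV file, you can also avoid the string problem at the source by telling the parser about the comma decimal separator. A minimal sketch, assuming the data lives in a file (the name 'stock.csv' is hypothetical):
import pandas as pd
# decimal=',' makes pandas parse '246,8906' as the float 246.8906
stock = pd.read_csv('stock.csv', decimal=',')
stock['MA'] = stock['Close'].rolling(window=2).mean()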

According to the latest pandas docs http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html you have to use the on parameter of the rolling function.
df1 = pd.DataFrame({'val': range(10, 30)})
df1['avg'] = df1.val.mean()
# with on='avg', the 'avg' column is used as the window axis and is
# excluded from the aggregation, so select 'val' from the result
df1['rolling'] = df1.rolling(window=2, on='avg').mean()['val']
instead of using df1['avg'].rolling()
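For what it's worth, a minimal sketch of the case on is actually designed for: rolling over a datetime column with an offset window, without making that column the index (the data below is made up):
import pandas as pd
# 'on' selects the column used as the window axis, which is what
# allows an offset-based window like '2D' on a plain RangeIndex
df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=5),
                   'val': [1.0, 2.0, 3.0, 4.0, 5.0]})
df['MA'] = df.rolling('2D', on='date')['val'].mean()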

you can use pd.rolling_mean to calculate it
example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([np.random.randint(-10, 10) for _ in range(100)], columns=['val'])
val
0 4
1 -3
2 -7
3 3
4 -10
df1['MA'] = pd.rolling_mean(df1.val,2)
val MA
0 4 NaN
1 -3 0.5
2 -7 -5.0
3 3 -2.0
4 -10 -3.5
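Note that pd.rolling_mean was deprecated in pandas 0.18 and removed in later versions, so on a current install the equivalent is the method form:
df1['MA'] = df1['val'].rolling(window=2).mean()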

Related

How to create a new column in my data frame by conditions

I'm trying to create a new column in my data frame according to the following conditions:
If the value in Date_of_basket_entry is NaN, then respond 0.
If the value in Date_of_basket_entry is greater than month_year (the date is still in the future), then respond 1.
If the value in Date_of_basket_entry is lower than month_year (the date is in the past), then respond 0.
month_year Date_of_basket_entry
0 03/2017 01.04.2005
1 02/2019 01.01.1995
2 07/2017 None
4 02/2017 None
5 04/2017 01.01.2020
it should be something like this:
month_year Date_of_basket_entry Date_of_basket_boolean
0 03/2017 01.04.2005 0
1 02/2019 01.01.1995 0
2 07/2017 None 0
4 02/2017 None 0
5 04/2017 01.01.2020 1
@Danielhab I like np.where for this situation.
import numpy as np
import pandas as pd
# if the dtypes are wrong the comparison won't work correctly,
# so convert both columns to datetime first
df['month_year'] = pd.to_datetime(df['month_year'], format='%m/%Y')
df['Date_of_basket_entry'] = pd.to_datetime(df['Date_of_basket_entry'], format='%d.%m.%Y')
df.loc[:, 'Date_of_basket_boolean'] = np.where(df.Date_of_basket_entry.isna() | (df.Date_of_basket_entry < df.month_year), 0, 1)
I think this should work, just check your logic.
I think it may be difficult to compare month/year to month.day.year. I would start by converting columns to have the same structure. Then you can use numpy's np.where function.
import pandas as pd
import numpy as np
df = pd.DataFrame({'month_year': ['03/2017', '02/2019', '07/2017', '02/2017', '04/2017'],
                   'Date_of_basket_entry': ['1.04.2005', '01.01.1995', None, None, '01.01.2020']})
df['new1'] = pd.to_datetime(df['month_year'], infer_datetime_format=True)
df['new2'] = pd.to_datetime(df['Date_of_basket_entry'], infer_datetime_format=True)
print(df)
month_year Date_of_basket_entry new1 new2
0 03/2017 1.04.2005 2017-03-01 2005-01-04
1 02/2019 01.01.1995 2019-02-01 1995-01-01
2 07/2017 None 2017-07-01 NaT
3 02/2017 None 2017-02-01 NaT
4 04/2017 01.01.2020 2017-04-01 2020-01-01
df['Date_of_basket_boolean'] = np.where(df['new2']>df['new1'],1,0)
print(df)
month_year Date_of_basket_entry new1 new2 Date_of_basket_boolean
0 03/2017 1.04.2005 2017-03-01 2005-01-04 0
1 02/2019 01.01.1995 2019-02-01 1995-01-01 0
2 07/2017 None 2017-07-01 NaT 0
3 02/2017 None 2017-02-01 NaT 0
4 04/2017 01.01.2020 2017-04-01 2020-01-01 1

Handle multiple date formats in pandas dataframe

I have a dataframe (imported from Excel) which looks like this:
Date Period
0 2017-03-02 2017-03-01 00:00:00
1 2017-03-02 2017-04-01 00:00:00
2 2017-03-02 2017-05-01 00:00:00
3 2017-03-02 2017-06-01 00:00:00
4 2017-03-02 2017-07-01 00:00:00
5 2017-03-02 2017-08-01 00:00:00
6 2017-03-02 2017-09-01 00:00:00
7 2017-03-02 2017-10-01 00:00:00
8 2017-03-02 2017-11-01 00:00:00
9 2017-03-02 2017-12-01 00:00:00
10 2017-03-02 Q217
11 2017-03-02 Q317
12 2017-03-02 Q417
13 2017-03-02 Q118
14 2017-03-02 Q218
15 2017-03-02 Q318
16 2017-03-02 Q418
17 2017-03-02 2018
I am trying to convert the whole 'Period' column into a consistent format. Some elements already look like datetimes, others were read as strings (e.g. Q217), others as ints (e.g. 2018). What is the fastest way to convert everything to datetime? I was trying with some masking, like this:
mask = df['Period'].str.startswith('Q', na = False)
list_quarter = df_final[mask]['Period'].tolist()
quarter_convert = {'1':'31/03', '2':'30/06', '3':'31/08', '4':'30/12'}
counter = 0
for element in list_quarter:
    element = element[1:]
    quarter = element[0]
    year = element[1:]
    daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
    final = daymonth + '/' + year
    list_quarter[counter] = final
    counter += 1
However it fails when I try to substitute the modified elements in the original column:
df_nwe_final['Period'] = np.where(mask, pd.Series(list_quarter), df_nwe_final['Period'])
Of course I would need to do more or less the same with the 2018 type formats. However, I am sure I am missing something here, and there should be a much faster solution. Some fresh ideas from you would help! Thank you.
Reusing the code you show, let's first write a function that converts the Q-string to a datetime format (I adjusted the final format a little bit):
def convert_q_string(element):
    quarter_convert = {'1': '03-31', '2': '06-30', '3': '08-31', '4': '12-30'}
    element = element[1:]
    quarter = element[0]
    year = element[1:]
    daymonth = ''.join(str(quarter_convert.get(word, word)) for word in quarter)
    final = '20' + year + '-' + daymonth
    return final
We can now use this to first convert all 'Q'-strings, and then pd.to_datetime to convert all elements to proper datetime values:
In [2]: s = pd.Series(['2017-03-01 00:00:00', 'Q217', '2018'])
In [3]: mask = s.str.startswith('Q')
In [4]: s[mask] = s[mask].map(convert_q_string)
In [5]: s
Out[5]:
0 2017-03-01 00:00:00
1 2017-06-30
2 2018
dtype: object
In [6]: pd.to_datetime(s)
Out[6]:
0 2017-03-01
1 2017-06-30
2 2018-01-01
dtype: datetime64[ns]
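Putting it together on the original frame, a sketch (assuming the column is df['Period'] as in the question; on pandas >= 2.0 you may need pd.to_datetime(s, format='mixed') to handle the mixed formats):
# cast everything to string, fix the Q-strings, then parse the lot
s = df['Period'].astype(str)
mask = s.str.startswith('Q')
s[mask] = s[mask].map(convert_q_string)
df['Period'] = pd.to_datetime(s)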

Merge 2 data frames in pandas

I have 2 data frames: GPS coordinates
Time X Y Z
2013-06-01 00:00:00 13512.466575 -12220.845913 19279.970720
2013-06-01 00:00:00 -13529.778408 -14013.560399 -18060.112972
2013-06-01 00:00:00 25108.907276 8764.536182 1594.215305
2013-06-01 00:00:00 -8436.586675 -22468.562354 -11354.726511
2013-06-01 00:05:00 13559.288748 -11476.738832 19702.063737
2013-06-01 00:05:00 -13500.120049 -14702.564328 -17548.488127
2013-06-01 00:05:00 25128.357948 8883.802142 664.732379
2013-06-01 00:05:00 -8346.854582 -22878.993160 -10544.640975
and Glonass coordinates
Time X Y Z
2013-06-01 00:00:00 0.248752905273E+05 -0.557450976562E+04 -0.726176757812E+03
2013-06-01 00:15:00 0.148314306641E+05 0.510153710938E+04 0.201156157227E+05
2013-06-01 00:15:00 0.242346674805E+05 -0.562089208984E+04 0.561714257812E+04
2013-06-01 00:15:00 0.195601284180E+05 -0.122148081055E+05 -0.108823476562E+05
2013-06-01 00:15:00 0.336192968750E+04 -0.122589394531E+05 -0.220986958008E+05
and I need to merge them on the Time column, so that I get the coordinates of the satellites from the same time only (all GPS coordinates and all Glonass coordinates from a particular time). The result for the above example should look like this:
Time X_gps Y_gps Z_gps X_glonass Y_glonass Z_glonass
0 2013-06-01 00:00:00 13512.466575 -12220.845913 19279.970720 0.248752905273E+05 -0.557450976562E+04 -0.726176757812E+03
1 2013-06-01 00:00:00 -13529.778408 -14013.560399 -18060.112972
2 2013-06-01 00:00:00 25108.907276 8764.536182 1594.215305
3 2013-06-01 00:00:00 -8436.586675 -22468.562354 -11354.726511
What I ended up doing is coord = pd.merge(d_gps, d_glonass, on = 'Time', how = 'inner', suffixes = ('_gps','_glonass')), but it copies the Glonass coordinates to fill the empty spaces in the data frame. What should I change to get the result I want?
I'm new to pandas so I really need your help.
After merging (I took the liberty of renaming the columns first), you can iterate over the columns, test for duplicates, and set those to NaN. You can't set them to blank, as the column dtype is float and assigning a blank string would raise an invalid literal error:
In [272]:
df1 = df1.rename(columns={'X':'X_glonass', 'Y':'Y_glonass', 'Z':'Z_glonass'})
df = df.rename(columns={'X':'X_gps', 'Y':'Y_gps', 'Z':'Z_gps'})
merged = df.merge(df1, on='Time')
In [278]:
for col in merged.columns[1:]:
    merged.loc[merged[col].duplicated(), col] = np.NaN
merged
Out[278]:
Time X_gps Y_gps Z_gps X_glonass \
0 2013-06-01 13512.466575 -12220.845913 19279.970720 24875.290527
1 2013-06-01 -13529.778408 -14013.560399 -18060.112972 NaN
2 2013-06-01 25108.907276 8764.536182 1594.215305 NaN
3 2013-06-01 -8436.586675 -22468.562354 -11354.726511 NaN
Y_glonass Z_glonass
0 -5574.509766 -726.176758
1 NaN NaN
2 NaN NaN
3 NaN NaN
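An alternative sketch (my own suggestion, not part of the answer above): instead of blanking duplicates after an inner join, merge on Time plus a per-time row number, so each Glonass row pairs with exactly one GPS row and the leftovers come out as NaN by themselves:
# number the rows within each Time group in both frames, then
# left-merge on (Time, row number): unmatched GPS rows get NaN
df['n'] = df.groupby('Time').cumcount()
df1['n'] = df1.groupby('Time').cumcount()
merged = df.merge(df1, on=['Time', 'n'], how='left').drop(columns='n')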

How to access last element of a multi-index dataframe

I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_table(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
using reset_index followed by groupby
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
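If the index were not already sorted by timestamp within each ID, note that tail(1) is positional; a small sketch of the extra step (my addition):
# tail(1) takes the positionally-last row per group, so sort the
# MultiIndex (IDs, then timestamp) before calling it
latest = df.sort_index().groupby(level='IDs').tail(1)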

Pandas resample dataframe

I have a resampling (downsampling) problem that should be straightforward, but I can't get it to work!
Here is a simplified example:
df:
Time A
0 0.01591 0.108929
1 0.27973 0.411764
2 0.55044 0.064253
3 0.81386 0.317394
4 1.07983 0.722707
5 1.35051 1.154193
6 1.61495 1.151492
7 1.88035 0.123389
8 2.15462 0.093583
9 2.41534 0.260944
10 2.67992 1.007564
11 2.95148 0.325353
12 3.21364 0.555593
13 3.47980 0.740621
15 4.01519 1.619669
16 4.28679 0.477371
17 4.55482 0.432049
18 4.81570 0.194224
19 5.07992 0.331936
The Time column is in seconds. I would like to make the Time column the index and downsample the dataframe to 1s. Help please?
You can use reindex and choose a fill method:
In [37]: df.set_index('Time').reindex(range(0,6), method='bfill')
Out[37]:
A
0 0.108929
1 0.722707
2 0.093583
3 0.555593
4 1.619669
5 0.331936
First convert your index to datetime format:
df.index = pd.to_datetime(df.Time, unit='s')
Then resample by second, taking the mean of each second (this can be changed to sum etc. - e.g. use .sum() instead of .mean()):
df.resample('S').mean()
Time A
Time
1970-01-01 00:00:00 0.414985 0.225585
1970-01-01 00:00:01 1.481410 0.787945
1970-01-01 00:00:02 2.550340 0.421861
1970-01-01 00:00:03 3.346720 0.648107
1970-01-01 00:00:04 4.418125 0.680828
1970-01-01 00:00:05 5.079920 0.331936
The year/date can be changed if important.
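If you would rather end up with plain seconds again instead of 1970-based timestamps, a small sketch of the extra step (my own addition, not part of the answer):
# a DatetimeIndex stores nanoseconds since the epoch, so integer
# division by 10**9 recovers the starting second of each bin
out = df.resample('S').mean()
out.index = out.index.astype('int64') // 10**9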
