I have a pandas dataframe that I filled with this:
import pandas.io.data as web
test = web.get_data_yahoo('QQQ')
The DataFrame looks like this in IPython:
In [13]: test
Out[13]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 729 entries, 2010-01-04 00:00:00 to 2012-11-23 00:00:00
Data columns:
Open 729 non-null values
High 729 non-null values
Low 729 non-null values
Close 729 non-null values
Volume 729 non-null values
Adj Close 729 non-null values
dtypes: float64(5), int64(1)
When I divide one column by another, I get a float64 result with a satisfactory number of decimal places. I can even divide one column by another column offset by one, for instance test.Open[1:]/test.Close[:], and still get a satisfactory number of decimal places. When I divide a column by an offset version of itself, however, I get just 1:
In [83]: test.Open[1:] / test.Close[:]
Out[83]:
Date
2010-01-04 NaN
2010-01-05 0.999354
2010-01-06 1.005635
2010-01-07 1.000866
2010-01-08 0.989689
2010-01-11 1.005393
...
In [84]: test.Open[1:] / test.Open[:]
Out[84]:
Date
2010-01-04 NaN
2010-01-05 1
2010-01-06 1
2010-01-07 1
2010-01-08 1
2010-01-11 1
I'm probably missing something simple. What do I need to do in order to get a useful value out of that sort of calculation? Thanks in advance for the assistance.
If you're looking to do operations between the column and lagged values, you should be doing something like test.Open / test.Open.shift().
shift realigns the data and takes an optional number of periods.
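For instance, here is a minimal sketch (toy data, not the QQQ frame above) of how shift lines a Series up against its own lagged values:
>>> import pandas as pd
>>> s = pd.Series([10.0, 11.0, 12.0, 9.0])
>>> s / s.shift()   # each value divided by the previous one
0         NaN
1    1.100000
2    1.090909
3    0.750000
dtype: float64
>>> s / s.shift(2)  # or lagged by two periods
0         NaN
1         NaN
2    1.200000
3    0.818182
dtype: float64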
You may not be getting what you think you are when you do test.Open[1:]/test.Close. Pandas matches up the rows based on their index, so you're still getting each element of one column divided by its corresponding element in the other column (not the element one row back). Here's an example:
>>> print d
A B C
0 1 3 7
1 -2 1 6
2 8 6 9
3 1 -5 11
4 -4 -2 0
>>> d.A / d.B
0 0.333333
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
>>> d.A[1:] / d.B
0 NaN
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
Notice that the values returned are the same for both operations. The second one just has NaN for the first row, since there was no corresponding value in the first operand.
If you really want to operate on offset rows, you'll need to dig down to the numpy arrays that underpin the pandas DataFrame, to bypass pandas's index-aligning features. You can get at these innards with the values attribute of a column.
>>> d.A.values[1:] / d.B.values[:-1]
array([-0.66666667, 8. , 0.16666667, 0.8 ])
Now you really are getting each value divided by the one before it in the other column. Note that here you have to explicitly slice the second operand to leave off the last element, to make them equal in length.
So you can do the same to divide a column by an offset version of itself:
>>> d.A.values[1:] / d.A.values[:-1]
array([-2.   , -4.   ,  0.125, -4.   ])
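Equivalently, you can stay inside pandas and keep the index by using shift, as suggested above:
>>> d.A / d.A.shift()
0      NaN
1   -2.000
2   -4.000
3    0.125
4   -4.000
Name: A, dtype: float64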
Related
I have a Pandas series of random numbers from -1 to +1:
from pandas import Series
from random import random
x = Series([random() * 2 - 1. for i in range(1000)])
x
Output:
0 -0.499376
1 -0.386884
2 0.180656
3 0.014022
4 0.409052
...
995 -0.395711
996 -0.844389
997 -0.508483
998 -0.156028
999 0.002387
Length: 1000, dtype: float64
I can get the rolling standard deviation of the full Series easily:
x.rolling(30).std()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 0.575365
996 0.580220
997 0.580924
998 0.577202
999 0.576759
Length: 1000, dtype: float64
But what I would like to do is get the standard deviation of only the positive numbers within the rolling window. In our example the window length is 30; if there are, say, only 15 positive numbers in a window, I want the standard deviation of only those 15 numbers.
One could remove all negative numbers from the Series and calculate the rolling standard deviation:
x[x > 0].rolling(30).std()
Output:
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
...
988 0.286056
990 0.292455
991 0.283842
994 0.291798
999 0.291824
Length: 504, dtype: float64
...But this isn't the same thing, as there will always be 30 positive numbers in the window here, whereas for what I want, the number of positive numbers will change.
I want to avoid iterating over the Series; I was hoping there might be a more Pythonic way to solve my problem. Can anyone help?
Mask the non-positive values with NaN, then calculate the rolling std with min_periods=1, and optionally set the first 29 values to NaN.
import numpy as np

w = 30
s = x.mask(x <= 0).rolling(w, min_periods=1).std()
s.iloc[:w - 1] = np.nan
Note
Passing min_periods=1 is important here because some windows contain fewer non-null values than the window length, and without it those windows would return NaN.
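As a quick sanity check (a sketch reusing the x and s defined above), you can compare the last window against a direct computation:
w_last = x.iloc[-30:]            # the last rolling window
print(w_last[w_last > 0].std())  # should match s.iloc[-1]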
Another possible solution:
pd.Series(np.where(x >= 0, x, np.nan)).rolling(30, min_periods=1).std()
Output:
0 NaN
1 NaN
2 NaN
3 0.441567
4 0.312562
...
995 0.323768
996 0.312461
997 0.304077
998 0.308342
999 0.301742
Length: 1000, dtype: float64
You may first turn non-positive values into np.nan, then apply np.nanstd to each window. (Note that np.nanstd defaults to ddof=0 while pandas' std() uses ddof=1, which is why these values differ slightly from the other answers, and why single-value windows give 0.0 rather than NaN.) So
import numpy as np

x[x.values <= 0] = np.nan
rolling_list = [np.nanstd(window.to_list()) for window in x.rolling(window=30)]
will return
[0.0,
0.0,
0.38190115685808856,
0.38190115685808856,
0.38190115685808856,
0.3704840425749437,
0.33234158296550925,
0.33234158296550925,
0.3045579286056045,
0.2962826377559198,
0.275920580105683,
0.29723758167880554,
0.29723758167880554,
0.29723758167880554,
0.29723758167880554,
0.29723758167880554
...]
IIUC, you want to calculate the std of only the positive values in each rolling window:
out = x.rolling(30).apply(lambda w: w[w>0].std())
print(out)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 0.324031
996 0.298276
997 0.294917
998 0.304506
999 0.308050
Length: 1000, dtype: float64
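Note that this apply-based version computes the same quantity as the mask-based answer above (pandas' ddof=1 std over the positive values in each window), so the two should agree up to floating-point noise. A quick check, assuming the original x from the question and out as defined above:
import numpy as np

b = x.mask(x <= 0).rolling(30, min_periods=1).std()
b.iloc[:29] = np.nan
print((out - b).abs().max())  # expected to be ~0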
Comparing two Series objects of different sizes:
In [248]: df['Series value 1']
Out[249]:
0 70
1 66.5
2 68
3 60
4 100
5 12
Name: Stu_perc, dtype: int64
In [250]: benchmark_value
#benchmark is a subset of data from df2, based on certain filters only
Out[251]:
0 70
Name: Stu_perc, dtype: int64
Basically I wish to compare df['Series value 1'] with benchmark_value and return, in a column Matching list, the values which are greater than 95% of the benchmark value. Both of these are pandas Series, but their sizes differ, so the comparison fails.
Input given:
In [252]: df['Matching list'] = (df2['Series value 1'] >= 0.95*benchmark_value)
Out[253]: ValueError: Can only compare identically-labeled Series objects
Output wanted:
[IN]:
df['Matching list']=(df2['Stu_perc']>=0.95*benchmark_value)
#0.95*Benchmark value is 66.5 in this case.
df['Matching list']
[OUT]:
0 70
1 66.5
2 68
3 NULL
4 100
5 NULL
Because benchmark_value is a Series, you need to select its first value as a scalar with Series.iat, and then set non-matching values to NaN with Series.where:
benchmark_value = pd.Series([70], index=[0])
val = benchmark_value.iat[0]
df2['Matching list']= df2['Stu_perc'].where(df2['Stu_perc']>=0.95*val)
print (df2)
Stu_perc Matching list
0 70.0 70.0
1 66.5 66.5
2 68.0 68.0
3 60.0 NaN
4 100.0 100.0
5 12.0 NaN
A more general solution, which also works if benchmark_value is empty, uses next with iter to return the first value of the Series, falling back to a default value (here 0) if none exists:
benchmark_value = pd.Series([], dtype='float64')
val = next(iter(benchmark_value), 0)
df2['Matching list']= df2['Stu_perc'].where(df2['Stu_perc']>=0.95*val)
print (df2)
Stu_perc Matching list
0 70.0 70.0
1 66.5 66.5
2 68.0 68.0
3 60.0 60.0
4 100.0 100.0
5 12.0 12.0
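The next(iter(...), default) pattern on its own, for clarity (toy examples):
import pandas as pd

next(iter(pd.Series([7])), 0)                  # -> 7
next(iter(pd.Series([], dtype='float64')), 0)  # -> 0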
Is your benchmark value a single value?
If yes, you might need to convert benchmark_value, which is a Series, to a number (without an index) by using df['Matching list'] = (df['Stu_perc'] >= 0.95*benchmark_value.values[0])
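A minimal sketch of that idea on toy data mirroring the question (note that the scalar is needed: pandas comparisons raise "Lengths must match" for an array of a different length):
import pandas as pd

s = pd.Series([70, 66.5, 68, 60, 100, 12])
benchmark_value = pd.Series([70])
mask = s >= 0.95 * benchmark_value.values[0]  # 0.95 * 70 = 66.5, a plain scalar
print(s.where(mask))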
It seems benchmark_value is a Series with a single row rather than an actual number, so I believe you need to access the value first.
But this will return a Series of Booleans. To get just the values that you want, you can use the where function.
Try this:
df['Matching list'] = df2['Stu_perc'].where(df2['Stu_perc'] >= 0.95*benchmark_value[0])
I need to fill the NA values with the mean of the three values preceding each NA.
This is my dataset:
RECEIPT_MONTH_YEAR NET_SALES
0 2014-01-01 818817.20
1 2014-02-01 362377.20
2 2014-03-01 374644.60
3 2014-04-01 NA
4 2014-05-01 NA
5 2014-06-01 NA
6 2014-07-01 NA
7 2014-08-01 46382.50
8 2014-09-01 55933.70
9 2014-10-01 292303.40
10 2014-10-01 382928.60
Is this dataset a .csv file or a DataFrame? Is the NA a NaN or a string?
import pandas as pd
import numpy as np

df = pd.read_csv('your dataset', sep=' ')
df = df.replace('NA', np.nan)  # assign the result back; replace is not in-place by default
df.fillna(method='ffill', inplace=True)
You mention something about the mean of 3 values. The above simply forward-fills the last observation before the NaNs begin, which is often a good approach for forecasting (better than taking means in certain cases, if persistence is important).
ind = df['NET_SALES'].index[df['NET_SALES'].apply(np.isnan)]
mean_of_3 = df['NET_SALES'].iloc[ind[0]-3:ind[0]].mean(skipna=True)
df['NET_SALES'] = df['NET_SALES'].fillna(mean_of_3)
Maybe the answer can be generalised and improved if more is known about the dataset, like whether you always want to take the mean of the last 3 measurements before any NA. The above lets you check which indices are NaN and then take the mean of the 3 values before, while ignoring any NaNs.
This is simple but it works:
df_data.fillna(0, inplace=True)
for i in range(len(df_data)):
    if df_data['NET_SALES'][i] == 0.00:
        condtn = (df_data['NET_SALES'][i-1] + df_data['NET_SALES'][i-2]
                  + df_data['NET_SALES'][i-3])
        df_data.loc[i, 'NET_SALES'] = condtn / 3  # .loc avoids chained assignment
You could use fillna (assuming that your NA is already np.nan) and rolling mean:
import pandas as pd
import numpy as np
df = pd.DataFrame([818817.2,362377.2,374644.6,np.nan,np.nan,np.nan,np.nan,46382.5,55933.7,292303.4,382928.6], columns=["NET_SALES"])
df["NET_SALES"] = df["NET_SALES"].fillna(df["NET_SALES"].shift(1).rolling(3, min_periods=1).mean())
Out:
NET_SALES
0 818817.2
1 362377.2
2 374644.6
3 518613.0
4 368510.9
5 374644.6
6 NaN
7 46382.5
8 55933.7
9 292303.4
10 382928.6
If you want to include the imputed values I guess you'll need to use a loop.
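A minimal sketch of such a loop, assuming the NAs are already np.nan, so that each imputed value feeds into the mean for the next gap:
vals = df["NET_SALES"].copy()
for i in range(len(vals)):
    if pd.isna(vals.iloc[i]):
        # mean of (up to) the three preceding values, imputed ones included
        vals.iloc[i] = vals.iloc[max(i - 3, 0):i].mean()
df["NET_SALES"] = vals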
I've got a dataframe, and I'm trying to append a column of sequential differences to it. I have found a method that I like a lot (and generalizes well for my use case). But I noticed one weird thing along the way. Can you help me make sense of it?
Here is some data that has the right structure (code modeled on an answer here):
import pandas as pd
import numpy as np
import random
from itertools import product
random.seed(1) # so you can play along at home
np.random.seed(2) # ditto
# make a list of dates for a few periods
dates = pd.date_range(start='2013-10-01', periods=4).strftime('%Y-%m-%d')
# make a list of tickers
tickers = ['ticker_%d' % i for i in range(3)]
# make a list of all the possible (date, ticker) tuples
pairs = list(product(dates, tickers))
# put them in a random order
random.shuffle(pairs)
# exclude a few possible pairs
pairs = pairs[:-3]
# make some data for all of our selected (date, ticker) tuples
values = np.random.rand(len(pairs))
mydates, mytickers = zip(*pairs)
data = pd.DataFrame({'date': mydates, 'ticker': mytickers, 'value':values})
Ok, great. This gives me a frame like so:
date ticker value
0 2013-10-03 ticker_2 0.435995
1 2013-10-04 ticker_2 0.025926
2 2013-10-02 ticker_1 0.549662
3 2013-10-01 ticker_0 0.435322
4 2013-10-02 ticker_2 0.420368
5 2013-10-03 ticker_0 0.330335
6 2013-10-04 ticker_1 0.204649
7 2013-10-02 ticker_0 0.619271
8 2013-10-01 ticker_2 0.299655
My goal is to add a new column to this dataframe that will contain sequential changes. The data needs to be sorted to do this, but the ordering and the differencing need to be done "ticker-wise" so that gaps in another ticker don't cause NAs for a given ticker. I want to do this without perturbing the dataframe in any other way (i.e. I do not want the resulting DataFrame to be reordered based on what was necessary to do the differencing). The following code works:
data1 = data.copy() #let's leave the original data alone for later experiments
data1.sort_values(['ticker', 'date'], inplace=True)
data1['diffs'] = data1.groupby(['ticker'])['value'].transform(lambda x: x.diff())
data1.sort_index(inplace=True)
data1
and returns:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0.015627
1 2013-10-04 ticker_2 0.025926 -0.410069
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 0.120713
5 2013-10-03 ticker_0 0.330335 -0.288936
6 2013-10-04 ticker_1 0.204649 -0.345014
7 2013-10-02 ticker_0 0.619271 0.183949
8 2013-10-01 ticker_2 0.299655 NaN
So far, so good. If I replace the middle line above with the more concise code shown here, everything still works:
data2 = data.copy()
data2.sort_values(['ticker', 'date'], inplace=True)
data2['diffs'] = data2.groupby('ticker')['value'].diff()
data2.sort_index(inplace=True)
data2
A quick check shows that, in fact, data1 is equal to data2. However, if I do this:
data3 = data.copy()
data3.sort_values(['ticker', 'date'], inplace=True)
data3['diffs'] = data3.groupby('ticker')['value'].transform(np.diff)
data3.sort_index(inplace=True)
data3
I get a strange result:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0
1 2013-10-04 ticker_2 0.025926 NaN
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 NaN
5 2013-10-03 ticker_0 0.330335 0
6 2013-10-04 ticker_1 0.204649 NaN
7 2013-10-02 ticker_0 0.619271 NaN
8 2013-10-01 ticker_2 0.299655 0
What's going on here? When you call the .diff method on a Pandas object, is it not just calling np.diff? I know there's a diff method on the DataFrame class, but I couldn't figure out how to pass that to transform without the lambda syntax I used to make data1 work. Am I missing something? Why is the diffs column in data3 screwy? How can I call the Pandas diff method within transform without needing to write a lambda to do it?
Nice, easy-to-reproduce example!! More questions should be like this!
Just pass the function object, e.g. Series.diff, to transform directly (this is tantamount to passing a lambda such as lambda x: x.diff()). So this is equivalent to data1/data2:
In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(Series.diff)
In [34]: data3.sort_index(inplace=True)
In [25]: data3
Out[25]:
date ticker value diffs
0 2013-10-03 ticker_2 0.435995 0.015627
1 2013-10-04 ticker_2 0.025926 -0.410069
2 2013-10-02 ticker_1 0.549662 NaN
3 2013-10-01 ticker_0 0.435322 NaN
4 2013-10-02 ticker_2 0.420368 0.120713
5 2013-10-03 ticker_0 0.330335 -0.288936
6 2013-10-04 ticker_1 0.204649 -0.345014
7 2013-10-02 ticker_0 0.619271 0.183949
8 2013-10-01 ticker_2 0.299655 NaN
[9 rows x 4 columns]
I believe that np.diff doesn't follow numpy's own ufunc guidelines for processing array inputs (whereby it tries various methods to coerce the input and wrap the output, e.g. __array__ on input, __array_wrap__ on output). I am not really sure why, see a bit more info here. So the bottom line is that np.diff is not dealing with the index properly and is doing its own calculation (which in this case is wrong).
Pandas has a lot of methods that don't just call the numpy function, mainly because they handle different dtypes and handle nans; in this case they also handle 'special' diffs, e.g. you can pass a time frequency for a date-like index, where it calculates how many periods to actually diff.
You can see that the Series .diff() method is different to np.diff():
In [11]: data.value.diff() # Note the NaN
Out[11]:
0 NaN
1 -0.410069
2 0.523736
3 -0.114340
4 -0.014955
5 -0.090033
6 -0.125686
7 0.414622
8 -0.319616
Name: value, dtype: float64
In [12]: np.diff(data.value.values) # the values array of the column
Out[12]:
array([-0.41006867, 0.52373625, -0.11434009, -0.01495459, -0.09003298,
-0.12568619, 0.41462233, -0.31961629])
In [13]: np.diff(data.value) # on the column (Series)
Out[13]:
0 NaN
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 NaN
Name: value, dtype: float64
In [14]: np.diff(data.value.index) # er... on the index
Out[14]: Int64Index([8], dtype=int64)
In [15]: np.diff(data.value.index.values)
Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1])
I have a large Pandas DataFrame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3425100 entries, 2011-12-01 00:00:00 to 2011-12-31 23:59:59
Data columns:
sig_qual 3425100 non-null values
heave 3425100 non-null values
north 3425099 non-null values
west 3425097 non-null values
dtypes: float64(4)
I select a subset of that DataFrame using .ix[start_datetime:end_datetime] and I pass this to a peakdetect function which returns the index and value of the local maxima and minima in two separate lists. I extract the index positions of the maxima and, using DataFrame.index, I get a list of pandas TimeStamps.
I then attempt to extract the relevant subset of the large DataFrame by passing the list of TimeStamps to .ix[] but it always seems to return an empty DataFrame. I can loop over the list of TimeStamps and get the relevant rows from the DataFrame but this is a lengthy process and I thought that ix[] should accept a list of values according to the docs?
(Although I see that the example for Pandas 0.7 uses a numpy.ndarray of numpy.datetime64)
Update:
A small 8 second subset of the DataFrame is selected below, # lines show some of the values:
y = raw_disp['heave'].ix[datetime(2011,12,30,0,0,0):datetime(2011,12,30,0,0,8)]
#csv representation of y time-series
2011-12-30 00:00:00,-310.0
2011-12-30 00:00:01,-238.0
2011-12-30 00:00:01.500000,-114.0
2011-12-30 00:00:02.500000,60.0
2011-12-30 00:00:03,185.0
2011-12-30 00:00:04,259.0
2011-12-30 00:00:04.500000,231.0
2011-12-30 00:00:05.500000,139.0
2011-12-30 00:00:06.500000,55.0
2011-12-30 00:00:07,-49.0
2011-12-30 00:00:08,-144.0
index = y.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-12-30 00:00:00, ..., 2011-12-30 00:00:08]
Length: 11, Freq: None, Timezone: None
#_max returned from the peakdetect function, one local maxima for this 8 seconds period
_max = [[5, 259.0]]
indexes = [x[0] for x in _max]
#[5]
timestamps = [index[z] for z in indexes]
#[<Timestamp: 2011-12-30 00:00:04>]
print raw_disp.ix[timestamps]
#Empty DataFrame
#Columns: array([sig_qual, heave, north, west, extrema], dtype=object)
#Index: <class 'pandas.tseries.index.DatetimeIndex'>
#Length: 0, Freq: None, Timezone: None
for timestamp in timestamps:
print raw_disp.ix[timestamp]
#sig_qual 0
#heave 259
#north 27
#west 132
#extrema 0
#Name: 2011-12-30 00:00:04
Update 2:
I created a gist, which actually works because when the data is loaded in from csv, the index column of timestamps is stored as a numpy array of objects which appear to be strings. In my own code, by contrast, the index is of type <class 'pandas.tseries.index.DatetimeIndex'> and each element is of type <class 'pandas.lib.Timestamp'>. I thought passing a list of pandas.lib.Timestamp would work the same as passing individual timestamps; would this be considered a bug?
If I create the original DataFrame with the index as a list of strings, querying with a list of strings works fine. It does increase the byte size of the DataFrame significantly though.
Update 3:
The error only appears to occur with very large DataFrames, I reran the code on varying sizes of DataFrame ( some detail in a comment below ) and it appears to occur on a DataFrame above 2.7 million records. Using strings as opposed to TimeStamps resolves the issue but increases memory usage.
Fixed
In the latest github master (18/09/2012); see the comment from Wes at the bottom of the page.
df.ix[my_list_of_dates] should work just fine.
In [193]: df
Out[193]:
A B C D
2012-08-16 2 1 1 7
2012-08-17 6 4 8 6
2012-08-18 8 3 1 1
2012-08-19 7 2 8 9
2012-08-20 6 7 5 8
2012-08-21 1 3 3 3
2012-08-22 8 2 3 8
2012-08-23 7 1 7 4
2012-08-24 2 6 0 6
2012-08-25 4 6 8 1
In [194]: row_pos = [2, 6, 9]
In [195]: df.ix[row_pos]
Out[195]:
A B C D
2012-08-18 8 3 1 1
2012-08-22 8 2 3 8
2012-08-25 4 6 8 1
In [196]: dates = [df.index[i] for i in row_pos]
In [197]: df.ix[dates]
Out[197]:
A B C D
2012-08-18 8 3 1 1
2012-08-22 8 2 3 8
2012-08-25 4 6 8 1
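For reference, .ix was later deprecated and then removed from pandas; the equivalent selections today are .iloc for positional lists and .loc for lists of labels:

In [198]: df.iloc[row_pos]  # positional, same rows as df.ix[row_pos] above

In [199]: df.loc[dates]     # label-based, same rows as df.ix[dates] above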