Fancy Index in Pandas Panels - python

Suppose we have the following Panel:
from pandas_datareader import data  # assuming pandas-datareader, where DataReader lives in recent pandas
companies = ["GOOG", "YHOO", "AMZN", "MSFT", "AAPL"]
p = data.DataReader(name=companies, data_source="google", start="2013-01-01", end="2017-02-22")
I want to extract the "Low" values for "MSFT" on two dates, "2013-01-02" and "2013-01-08". I tried several approaches; some work and some do not. Here are those methods:
Using .ix[] method
p.ix["Low", [0,4], "MSFT"] and the result is:
Date
2013-01-02 27.15
2013-01-08 26.46
Name: MSFT, dtype: float64
So it works, no problem at all.
Using .iloc[] method
p.iloc[2, [0,4], 3] and it also works.
Date
2013-01-02 27.15
2013-01-08 26.46
Name: MSFT, dtype: float64
Using the .ix[] method again, but in a different way
p.ix["Low", ["2013-01-02", "2013-01-08"], "MSFT"] and it returns a weird result:
Date
2013-01-02 NaN
2013-01-08 NaN
Name: MSFT, dtype: float64
Using .loc[] method
p.loc["Low", ["2013-01-02", "2013-01-08"], "MSFT"] and this time an error raised
KeyError: "None of [['2013-01-02', '2013-01-08']] are in the [index]"
Methods 1 and 2 work and are pretty straightforward. However, I don't understand why the 3rd method returns NaN values and the 4th raises an error.

Using the following Panel:
In [116]: wp = pd.Panel(np.random.randn(2, 5, 4), items=['Low', 'High'],
.....: major_axis=pd.date_range('1/1/2000', periods=5),
.....: minor_axis=['A', 'B', 'C', 'D'])
It's good to remember that:
.loc uses the labels in the index
.iloc uses integer positioning in the index
.ix tries acting like .loc but falls back to .iloc if it fails, so we can focus only on the .ix version
source
If I do wp.ix['Low'] I get
A B C D
2000-01-01 -0.864402 0.559969 1.226582 -1.090447
2000-01-02 0.288341 -0.786711 -0.662960 0.613778
2000-01-03 1.712770 1.393537 -2.230170 -0.082778
2000-01-04 -1.297067 1.076110 -1.384226 1.824781
2000-01-05 1.268253 -2.185574 0.090986 0.464095
Now if you want to access the data for 2000-01-01 through 2000-01-03, you need to use the syntax
wp.loc['Low','2000-01-01':'2000-01-03']
which returns
A B C D
2000-01-01 -0.864402 0.559969 1.226582 -1.090447
2000-01-02 0.288341 -0.786711 -0.662960 0.613778
2000-01-03 1.712770 1.393537 -2.230170 -0.082778
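To tie this back to the question: methods 3 and 4 fail because a plain Python list of date strings is not parsed into Timestamps for list-based lookups on the DatetimeIndex major axis (a string slice is parsed, which is why the slice syntax above works). Converting the strings explicitly makes the .loc call succeed. A minimal sketch, assuming an old pandas where Panel still exists (Panel was removed in pandas 0.25):
import numpy as np
import pandas as pd

wp = pd.Panel(np.random.randn(2, 5, 4), items=['Low', 'High'],
              major_axis=pd.date_range('1/1/2000', periods=5),
              minor_axis=['A', 'B', 'C', 'D'])

# Convert the date strings to Timestamps before the list-based lookup
dates = pd.to_datetime(['2000-01-01', '2000-01-03'])
print(wp.loc['Low', dates, 'A'])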

Related

Why does date_range give a result different from indexing [] for DataFrame Pandas dates?

Here is some simple code using date_range and [] indexing with Pandas:
period_start = '2013-01-01'
period_end = '2019-12-24'
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns},
                   index=pd.date_range(start=period_start, periods=6)))
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns})[period_start:'20130110'])
date_range gives NaN results:
close returns
2013-01-01 NaN NaN
2013-01-02 NaN NaN
2013-01-03 NaN NaN
2013-01-04 NaN NaN
Indexing gives correct results
close returns
2013-01-02 00:00:00+00:00 68.732 0.028322
2013-01-03 00:00:00+00:00 68.032 -0.010184
2013-01-04 00:00:00+00:00 66.091 -0.028531
Based on how the dates are displayed by date_range, I suppose the date format of date_range does not match the date format in the Pandas DataFrame.
1) Can you please explain why it gives NaN?
2) What would you suggest to get a specific time range from the Pandas DataFrame?
As I'm a beginner in Python and its libraries, I didn't understand that this question refers to the Quantopian library, not to Pandas.
I got a solution on their forum. All the times returned by methods on Quantopian are timezone-aware with a timezone of 'UTC'. By default, the date_range method returns timezone-naive dates, so simply supply the timezone information to the date_range method, like this:
pd.DataFrame({'close': aapl_close,
              'returns': aapl_returns},
             index=pd.date_range(start=period_start, periods=6, tz='UTC'))
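To see why the mismatch produces NaN rather than an error, note that aligning a timezone-aware Series against a timezone-naive DatetimeIndex matches no labels at all, so every row comes out NaN. A minimal self-contained sketch (the prices are stand-ins for aapl_close, which isn't shown in the question):
import pandas as pd

# Timezone-aware source data, as Quantopian returns it
aware = pd.Series([68.732, 68.032],
                  index=pd.to_datetime(['2013-01-02', '2013-01-03']).tz_localize('UTC'))

# Aligning against a tz-naive date_range matches no labels -> all NaN
print(pd.DataFrame({'close': aware}, index=pd.date_range('2013-01-01', periods=4)))

# Supplying tz='UTC' makes the labels comparable again
print(pd.DataFrame({'close': aware}, index=pd.date_range('2013-01-01', periods=4, tz='UTC')))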
To get a specific date or time range in pandas, perhaps the easiest way is simple bracket notation. For example, to get dates between 2013-01-04 and 2013-01-08 (inclusive), simply enter this:
df = pd.DataFrame({'close': aapl_close, 'returns': aapl_returns})
my_selected_dates = df['2013-01-04':'2013-01-08']
This bracket notation is really shorthand for using the loc method
my_selected_dates = df.loc['2013-01-04':'2013-01-08']
Both work the same but the loc method has a bit more flexibility. This notation also works with datetimes if desired.

Selecting single row as dataframe with DatetimeIndex

I have a time series in a dataframe with a DatetimeIndex, like this:
import pandas as pd
dates = ["2015-10-01 00:00:00",
         "2015-10-01 01:00:00",
         "2015-10-01 02:00:00",
         "2015-10-01 03:00:00",
         "2015-10-01 04:00:00"]
df = pd.DataFrame(index=pd.DatetimeIndex(dates))
df["values"] = range(0,5)
Out[]:
values
2015-10-01 00:00:00 0
2015-10-01 01:00:00 1
2015-10-01 02:00:00 2
2015-10-01 03:00:00 3
2015-10-01 04:00:00 4
I would like to select a single row, as simply and cleanly as possible, using the date as the key, e.g. "2015-10-01 02:00:00", so that the result looks like this:
Out[]:
values
2015-10-01 02:00:00 2
Simply using indexing results in a key error:
df["2015-10-01 02:00:00"]
Out[]:
KeyError: '2015-10-01 02:00:00'
Similarly this:
df.loc[["2015-10-01 02:00:00"]]
Out[]:
KeyError: "None of [['2015-10-01 02:00:00']] are in the [index]"
These surprisingly (?) result in the same series as follows:
df.loc["2015-10-01 02:00:00"]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
df.loc["2015-10-01 02:00:00",:]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
print(type(df.loc["2015-10-01 02:00:00"]))
print(type(df.loc["2015-10-01 02:00:00",:]))
print(df.loc["2015-10-01 02:00:00"].shape)
print(df.loc["2015-10-01 02:00:00",:].shape)
Out[]:
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
(1,)
(1,)
I could wrap any of those in a DataFrame like this:
slize = pd.DataFrame(df.loc["2015-10-01 02:00:00",:])
Out[]:
2015-10-01 02:00:00
values 2
Of course I could do this to reach my result:
slize.T
Out[]:
values
2015-10-01 02:00:00 2
But at this point I could just as well receive a column as a Series, so it is hard to test whether the result is a row Series or a column Series in order to apply the .T automatically.
Did I miss a way of selecting what I want?
I recommend generating your index using pd.date_range for convenience, and then using .loc with a Timestamp or datetime object.
from datetime import datetime
import pandas as pd
start = datetime(2015, 10, 1, 0, 0, 0)
end = datetime(2015, 10, 1, 4, 0, 0)
dates = pd.date_range(start, end, freq='H')
df = pd.DataFrame(index=pd.DatetimeIndex(dates))
df["values"] = range(0,5)
Then you can use .loc with a Timestamp or datetime object.
In [2]: df.loc[[start]]
Out[2]:
values
2015-10-01 0
Further details
Simply using indexing results in a key error:
df["2015-10-01 02:00:00"]
Out[]:
KeyError: '2015-10-01 02:00:00'
The KeyError occurs because plain bracket indexing with a single string looks for a column with that name, and there is no column named "2015-10-01 02:00:00".
Similarly this:
df.loc[["2015-10-01 02:00:00"]]
Out[]:
KeyError: "None of [['2015-10-01 02:00:00']] are in the [index]"
Your second option cannot work with string-based list indexing; use exact indexing with a Timestamp instead, as mentioned.
These surprisingly (?) result in the same series as follows:
df.loc["2015-10-01 02:00:00"]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
If you use .loc on a single row, the result is coerced to a Series, as you noticed. Hence you should convert it back to a DataFrame and then transpose the result.
You can convert string to datetime - using exact indexing:
print(df.loc[[pd.to_datetime("2015-10-01 02:00:00")]])
values
2015-10-01 02:00:00 2
Or convert Series to DataFrame and transpose:
print (df.loc["2015-10-01 02:00:00"].to_frame().T)
values
2015-10-01 02:00:00 2
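One more option, not in the answers above: a string slice is parsed to timestamps (unlike a list of strings), and a slice always returns a DataFrame, so a degenerate one-element slice also gives the row as a DataFrame:
import pandas as pd

df = pd.DataFrame({"values": range(5)},
                  index=pd.date_range("2015-10-01", periods=5, freq="H"))

# Slicing preserves the DataFrame shape, even for a single row
print(df["2015-10-01 02:00:00":"2015-10-01 02:00:00"])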
df[df[time_series_row] == "data_to_match"]
Sorry for the formatting. On my phone, will update when I’m back at a computer.
Edit:
I would generally write it like this:
bitmask = df[time_series_row] == "data_to_match"
row = df[bitmask]
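Adapted to the DatetimeIndex in the question (an adaptation on my part, since there is no data column to match on), the same boolean-mask idea works against the index itself, and a boolean mask always returns a DataFrame:
import pandas as pd

df = pd.DataFrame({"values": range(5)},
                  index=pd.date_range("2015-10-01", periods=5, freq="H"))

row = df[df.index == pd.Timestamp("2015-10-01 02:00:00")]
print(row)  # one-row DataFrame, not a Series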

Retrieve date from column index position in pandas and paste in PyQt

I want to retrieve the date of one index position of a Pandas data frame and paste it into the LineEdit of a PyQt Application.
What I have so far is:
purchase = sales[['Total', 'Date']]
pandas_value = purchase.iloc[-1:]['Date'] # last position of the "Date" column
pyqt_value = str(pandas_value)
# This returns :
67 2016-10-20
Name: Data, dtype: datetime64[ns]
The entire output appears in the LineEdit as : 67 2016-10-20 Name: Data, dtype: datetime64[ns]
I have also tried converting the date, to no avail:
pandas_value.strftime('%Y-%m-%d')
'Series' object has no attribute 'strftime'
Is there a way to retrieve and paste just the date, like: 2016-10-20?
Or better: is there a way to retrieve any value as a string from any index position in pandas?
Thanks in advance for any help.
You can do it this way:
In [37]: df
Out[37]:
Date a
0 2016-01-01 0.228208
1 2016-01-02 0.695593
2 2016-01-03 0.493608
3 2016-01-04 0.728678
4 2016-01-05 0.369823
5 2016-01-06 0.336615
6 2016-01-07 0.012200
7 2016-01-08 0.481646
8 2016-01-09 0.773467
9 2016-01-10 0.550114
In [38]: df.iloc[-1, df.columns.get_loc('Date')].strftime('%Y-%m-%d')
Out[38]: '2016-01-10'
pandas returns it as a Series, which is like a list (normally it holds one row or one column of data), so you have to index into it to get the value. Your Series has only one element; use positional access with .iloc[0] (label-based access with [67] would also work here, since your output shows 67 as the index):
pyqt_value = str(pandas_value.iloc[0])
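A minimal sketch combining the two answers (the one-row frame below is a stand-in for the question's purchase frame):
import pandas as pd

purchase = pd.DataFrame({'Total': [10.0],
                         'Date': pd.to_datetime(['2016-10-20'])},
                        index=[67])

# .iloc[-1] on the column returns a scalar Timestamp (unlike .iloc[-1:],
# which returns a one-element Series), so strftime works directly:
pyqt_value = purchase['Date'].iloc[-1].strftime('%Y-%m-%d')
print(pyqt_value)  # 2016-10-20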

Pandas concat: ValueError: Shape of passed values is blah, indices imply blah2

I'm trying to merge a (Pandas 0.14.1) dataframe and a series. The series should form a new column, with some NAs (since the index values of the series are a subset of the index values of the dataframe).
This works for a toy example, but not with my data (detailed below).
Example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6, 4), columns=['A', 'B', 'C', 'D'], index=pd.date_range('1/1/2011', periods=6, freq='D'))
df1
A B C D
2011-01-01 -0.487926 0.439190 0.194810 0.333896
2011-01-02 1.708024 0.237587 -0.958100 1.418285
2011-01-03 -1.228805 1.266068 -1.755050 -1.476395
2011-01-04 -0.554705 1.342504 0.245934 0.955521
2011-01-05 -0.351260 -0.798270 0.820535 -0.597322
2011-01-06 0.132924 0.501027 -1.139487 1.107873
s1 = pd.Series(np.random.randn(3), name='foo', index=pd.date_range('1/1/2011', periods=3, freq='2D'))
s1
2011-01-01 -1.660578
2011-01-03 -0.209688
2011-01-05 0.546146
Freq: 2D, Name: foo, dtype: float64
pd.concat([df1, s1],axis=1)
A B C D foo
2011-01-01 -0.487926 0.439190 0.194810 0.333896 -1.660578
2011-01-02 1.708024 0.237587 -0.958100 1.418285 NaN
2011-01-03 -1.228805 1.266068 -1.755050 -1.476395 -0.209688
2011-01-04 -0.554705 1.342504 0.245934 0.955521 NaN
2011-01-05 -0.351260 -0.798270 0.820535 -0.597322 0.546146
2011-01-06 0.132924 0.501027 -1.139487 1.107873 NaN
The situation with my data (see below) seems basically identical: concatenating a series with a DatetimeIndex whose values are a subset of the dataframe's. But it gives the ValueError in the title, with blah1 = (5, 286) and blah2 = (5, 276). Why doesn't it work?
In[187]: df.head()
Out[188]:
high low loc_h loc_l
time
2014-01-01 17:00:00 1.376235 1.375945 1.376235 1.375945
2014-01-01 17:01:00 1.376005 1.375775 NaN NaN
2014-01-01 17:02:00 1.375795 1.375445 NaN 1.375445
2014-01-01 17:03:00 1.375625 1.375515 NaN NaN
2014-01-01 17:04:00 1.375585 1.375585 NaN NaN
In [186]: df.index
Out[186]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 271, Freq: None, Timezone: None
In [189]: hl.head()
Out[189]:
2014-01-01 17:00:00 1.376090
2014-01-01 17:02:00 1.375445
2014-01-01 17:05:00 1.376195
2014-01-01 17:10:00 1.375385
2014-01-01 17:12:00 1.376115
dtype: float64
In [187]:hl.index
Out[187]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 89, Freq: None, Timezone: None
In: pd.concat([df, hl], axis=1)
Out: [stack trace] ValueError: Shape of passed values is (5, 286), indices imply (5, 276)
I had a similar problem (join worked, but concat failed).
Check for duplicate index values in df1 and s1 (e.g. df1.index.is_unique).
Removing the duplicate index values (e.g. df.drop_duplicates(inplace=True)) or using one of the methods at https://stackoverflow.com/a/34297689/7163376 should resolve it.
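A toy reproduction of that failure mode (my own construction; the exact error message varies across pandas versions):
import pandas as pd

# df has a duplicated timestamp; s1 does not
df = pd.DataFrame({'a': [1, 2, 3]},
                  index=pd.to_datetime(['2014-01-01 17:00',
                                        '2014-01-01 17:00',
                                        '2014-01-01 17:02']))
s1 = pd.Series([10.0, 30.0], name='hl',
               index=pd.to_datetime(['2014-01-01 17:00', '2014-01-01 17:02']))

print(df.index.is_unique)      # False -> concat(axis=1) will fail
print(df.join(s1))             # join tolerates duplicate labels
# pd.concat([df, s1], axis=1)  # raises: ValueError on old pandas,
                               # InvalidIndexError on newer versions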
My problem was different indices; the following code solved my problem:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
df = pd.concat([df1, df2], axis=1)
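Note that reset_index aligns rows purely by position and discards the original labels, so this only makes sense when both frames already have the same length and row order.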
To drop duplicate indices, use df = df.loc[df.index.drop_duplicates()]. C.f. pandas.pydata.org/pandas-docs/stable/generated/… – BallpointBen Apr 18 at 15:25
This is wrong, but I can't reply directly to BallpointBen's comment due to low reputation. The reason it's wrong is that df.index.drop_duplicates() returns the list of unique index labels, but when you index back into the dataframe using those unique labels it still returns all records, because looking up a duplicated label returns every row carrying that label.
Instead, use df.index.duplicated(), which returns a boolean list (add the ~ to get the not-duplicated records):
df = df.loc[~df.index.duplicated()]
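A quick demonstration of the difference, with toy data:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 0, 1])

# Label-based lookup expands duplicated labels again: 3 rows
print(df.loc[df.index.drop_duplicates()])

# The boolean mask keeps exactly the first occurrence of each label: 2 rows
print(df.loc[~df.index.duplicated()])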
Aus_lacy's post gave me the idea of trying related methods, of which join does work:
In [196]:
hl.name = 'hl'
Out[196]:
'hl'
In [199]:
df.join(hl).head(4)
Out[199]:
high low loc_h loc_l hl
2014-01-01 17:00:00 1.376235 1.375945 1.376235 1.375945 1.376090
2014-01-01 17:01:00 1.376005 1.375775 NaN NaN NaN
2014-01-01 17:02:00 1.375795 1.375445 NaN 1.375445 1.375445
2014-01-01 17:03:00 1.375625 1.375515 NaN NaN NaN
Some insight into why concat works on the example but not this data would be nice though!
Your indexes probably contain duplicated values.
import pandas as pd
T1_INDEX = [
0,
1,    # <= !!! if I write e.g. 0 here (duplicating the first label), it fails
0.2,
]
T1_COLUMNS = [
'A', 'B', 'C', 'D'
]
T1 = [
[1.0, 1.1, 1.2, 1.3],
[2.0, 2.1, 2.2, 2.3],
[3.0, 3.1, 3.2, 3.3],
]
T2_INDEX = [
1.2,
2.11,
]
T2_COLUMNS = [
'D', 'E', 'F',
]
T2 = [
[54.0, 5324.1, 3234.2],
[55.0, 14.5324, 2324.2],
# [3.0, 3.1, 3.2],
]
df1 = pd.DataFrame(T1, columns=T1_COLUMNS, index=T1_INDEX)
df2 = pd.DataFrame(T2, columns=T2_COLUMNS, index=T2_INDEX)
print(pd.concat([pd.DataFrame({})] + [df2, df1], axis=1))
Try sorting the index after concatenating them:
result = pd.concat([df1, df2]).sort_index()
Maybe it is simple: if you have a DataFrame, make sure that both matrices or vectors that you're trying to combine have the same row names/index. I had the same issue and changed the row indices to make them match each other. For example, I had a matrix (principal components) and a vector (target) that needed the same row indices. When it was not working, the matrix had normal row indices (0, 1, 2, 3) while the vector had row indices (ID0, ID1, ID2, ID3); changing the vector's row indices to (0, 1, 2, 3) made it work.
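A minimal sketch of that fix (the names are assumed, since the original data was only shown in an image):
import pandas as pd

pcs = pd.DataFrame({'pc1': [0.1, 0.2, 0.3, 0.4]})  # row indices 0..3
target = pd.Series([1, 0, 1, 0], name='y',
                   index=['ID0', 'ID1', 'ID2', 'ID3'])

# Replace the vector's labels so they match the matrix's before combining
target.index = range(len(target))
print(pd.concat([pcs, target], axis=1))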

Python pandas resample added dates not present in the original data

I am using pandas to convert intraday data, stored in data_m, to daily data. For some reason resample added rows for days that were not present in the intraday data. For example, 1/8/2000 is not in the intraday data, yet the daily data contains a row for that date with NaN as the value. DatetimeIndex has more entries than the actual data. Am I doing anything wrong?
data_m.resample('D', how='mean').head()
Out[13]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
2000-01-08 NaN
data_m.resample('D', how='mean')
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4729 entries, 2000-01-04 00:00:00 to 2012-12-14 00:00:00
Freq: D
Data columns:
x 3241 non-null values
dtypes: float64(1)
What you are doing looks correct, it's just that pandas gives NaN for the mean of an empty array.
In [1]: Series().mean()
Out[1]: nan
resample converts to a regular time interval, so if there are no samples that day you get NaN.
Most of the time having NaN isn't a problem. If it is, we can either use fill_method (for example 'ffill'), or if you really want to remove the NaNs you can use dropna (not recommended):
data_m.resample('D', how='mean', fill_method='ffill')
data_m.resample('D', how='mean').dropna()
Update: The modern equivalent seems to be:
In [21]: s.resample("D").mean().ffill()
Out[21]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
2000-01-08 8780.037433
In [22]: s.resample("D").mean().dropna()
Out[22]:
x
2000-01-04 8803.879581
2000-01-05 8765.036649
2000-01-06 8893.156250
2000-01-07 8780.037433
See resample docs.
Prior to 0.10.0, pandas labeled resample bins with the right-most edge, which for daily resampling, is the next day. Starting with 0.10.0, the default binning behavior for daily and higher frequencies changed to label='left', closed='left' to minimize this confusion. See http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#api-changes for more information.
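A quick illustration of those labeling options on a modern pandas (arbitrary toy data):
import pandas as pd

s = pd.Series(range(48),
              index=pd.date_range('2000-01-04', periods=48, freq='H'))

# Current default for daily bins: label='left', closed='left'
print(s.resample('D').mean())

# The old pre-0.10.0 behavior can still be requested explicitly
print(s.resample('D', label='right', closed='right').mean())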
