concat pandas DataFrame along timeseries indexes - python

I have two largish (snippets provided) pandas DataFrames with unequal date indexes that I wish to concat into one:
NAB.AX
             Close    Volume
Date
2009-06-05   36.51   4962900
2009-06-04   36.79   5528800
2009-06-03   36.80   5116500
2009-06-02   36.33   5303700
2009-06-01   36.16   5625500
2009-05-29   35.14  13038600
2009-05-28   33.95   7917600
2009-05-27   35.13   4701100
2009-05-26   35.45   4572700
2009-05-25   34.80   3652500

--AND--

CBA.AX
             Close    Volume
Date
2009-06-08   21.95         0
2009-06-05   21.95   8917000
2009-06-04   22.21  18723600
2009-06-03   23.11  11643800
2009-06-02   22.80  14249900
2009-06-01   22.52  11687200
2009-05-29   22.02  22350700
2009-05-28   21.63   9679800
2009-05-27   21.74   9338200
2009-05-26   21.64   8502900
Problem is, if I run this:
keys = ['CBA.AX','NAB.AX']
mv = pandas.concat([data['CBA.AX'][650:660], data['NAB.AX'][650:660]], axis=1, keys=keys)
the following DataFrame is produced:
CBA.AX NAB.AX
Close Volume Close Volume
Date
2200-08-16 04:24:21.460041 NaN NaN NaN NaN
2203-05-13 04:24:21.460041 NaN NaN NaN NaN
2206-02-06 04:24:21.460041 NaN NaN NaN NaN
2208-11-02 04:24:21.460041 NaN NaN NaN NaN
2211-07-30 04:24:21.460041 NaN NaN NaN NaN
2219-10-16 04:24:21.460041 NaN NaN NaN NaN
2222-07-12 04:24:21.460041 NaN NaN NaN NaN
2225-04-07 04:24:21.460041 NaN NaN NaN NaN
2228-01-02 04:24:21.460041 NaN NaN NaN NaN
2230-09-28 04:24:21.460041 NaN NaN NaN NaN
2238-12-15 04:24:21.460041 NaN NaN NaN NaN
Does anybody have any idea why this might be the case?
On another point: are there any Python libraries around that pull data from Yahoo and normalise it?
Cheers.
EDIT: For reference:
data = {
'CBA.AX': <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2313 entries, 2011-12-29 00:00:00 to 2003-01-01 00:00:00
Data columns:
Close 2313 non-null values
Volume 2313 non-null values
dtypes: float64(1), int64(1),
'NAB.AX': <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2329 entries, 2011-12-29 00:00:00 to 2003-01-01 00:00:00
Data columns:
Close 2329 non-null values
Volume 2329 non-null values
dtypes: float64(1), int64(1)
}

It is possible to read the data with pandas and to concatenate it.
First, import the data:
In [449]: import pandas.io.data as web
In [450]: nab = web.get_data_yahoo('NAB.AX', start='2009-05-25',
end='2009-06-05')[['Close', 'Volume']]
In [451]: cba = web.get_data_yahoo('CBA.AX', start='2009-05-26',
end='2009-06-08')[['Close', 'Volume']]
In [453]: nab
Out[453]:
Close Volume
Date
2009-05-25 21.15 9685100
2009-05-26 21.64 8541900
2009-05-27 21.74 9042900
2009-05-28 21.63 9701000
2009-05-29 22.02 14665700
2009-06-01 22.52 6782000
2009-06-02 22.80 10473400
2009-06-03 23.11 9931400
2009-06-04 22.21 17869000
2009-06-05 21.95 8214300
In [454]: cba
Out[454]:
Close Volume
Date
2009-05-26 35.45 4529600
2009-05-27 35.13 4521500
2009-05-28 33.95 7945400
2009-05-29 35.14 12548500
2009-06-01 36.16 4509400
2009-06-02 36.33 4304900
2009-06-03 36.80 4845400
2009-06-04 36.79 4592300
2009-06-05 36.51 4417500
2009-06-08 36.51 0
Then concatenate it:
In [455]: keys = ['CBA.AX','NAB.AX']
In [456]: pd.concat([cba, nab], axis=1, keys=keys)
Out[456]:
CBA.AX NAB.AX
Close Volume Close Volume
Date
2009-05-25 NaN NaN 21.15 9685100
2009-05-26 35.45 4529600 21.64 8541900
2009-05-27 35.13 4521500 21.74 9042900
2009-05-28 33.95 7945400 21.63 9701000
2009-05-29 35.14 12548500 22.02 14665700
2009-06-01 36.16 4509400 22.52 6782000
2009-06-02 36.33 4304900 22.80 10473400
2009-06-03 36.80 4845400 23.11 9931400
2009-06-04 36.79 4592300 22.21 17869000
2009-06-05 36.51 4417500 21.95 8214300
2009-06-08 36.51 0 NaN NaN

Try joining on outer (an outer join is the default for pd.concat, aligning on the union of the indexes).
When I am working with a number of stocks, I usually keep one frame per field (open, high, low, close, etc.) with one column per ticker. If you want a single data structure, I would use a Panel for this.
For Yahoo data, you can use pandas:
import pandas.io.data as data
spy = data.DataReader("SPY", "yahoo", "1991/1/1")
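The column-per-ticker layout suggested above can be derived from the concat result by swapping the column levels; a minimal sketch with made-up prices:

```python
import pandas as pd

# Two small frames with partially overlapping date indexes
idx_cba = pd.date_range("2009-06-01", periods=3, freq="D")
idx_nab = pd.date_range("2009-06-02", periods=3, freq="D")
cba = pd.DataFrame({"Close": [36.16, 36.33, 36.80]}, index=idx_cba)
nab = pd.DataFrame({"Close": [22.80, 23.11, 22.21]}, index=idx_nab)

# concat with keys builds MultiIndex columns (ticker, field),
# aligned on the union of the two date indexes (NaN where absent)
combined = pd.concat([cba, nab], axis=1, keys=["CBA.AX", "NAB.AX"])

# Swap the levels to (field, ticker), then select one field:
# a per-field frame with one column per ticker
close = combined.swaplevel(axis=1)["Close"]
print(close)
```

Since the dates only partially overlap, close has NaN wherever one ticker has no row for that date.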

Related

yfinance shows 2 rows for the same day with NaN values

I'm using the yfinance library with 2 tickers (^BVSP and BRL=X), but when I display the dataframe it shows 2 rows per day, where each row has the information of only one ticker; the information for the other ticker is NaN. I want to put all the information in one row.
How can I solve this?
I tried this
import datetime
import yfinance as yf

dados_bolsa = ["^BVSP", "BRL=X"]
today = datetime.datetime.now()
one_year = today - datetime.timedelta(days=365)
print(one_year)
dados_mercado = yf.download(dados_bolsa, one_year, today)
display(dados_mercado)
i get
2022-02-06 13:27:29.158181
[*********************100%***********************] 2 of 2 completed
Adj Close Close High Low Open Volume
BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP
Date
2022-02-07 00:00:00+00:00 5.3269 NaN 5.3269 NaN 5.3430 NaN 5.276800 NaN 5.326200 NaN 0.0 NaN
2022-02-07 03:00:00+00:00 NaN 111996.00000 NaN 111996.00000 NaN 112517.000000 NaN 111490.00000 NaN 112247.000000 NaN 10672800.0
2022-02-08 00:00:00+00:00 5.2626 NaN 5.2626 NaN 5.2849 NaN 5.251000 NaN 5.262800 NaN 0.0 NaN
2022-02-08 03:00:00+00:00 NaN 112234.00000 NaN 112234.00000 NaN 112251.000000 NaN 110943.00000 NaN 111995.000000 NaN 10157500.0
2022-02-09 00:00:00+00:00 5.2584 NaN 5.2584 NaN 5.2880 NaN 5.232774 NaN 5.256489 NaN 0.0 NaN
Note that we have 2 rows for the same day, with NaN. I want just one row with all the information.
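One possible fix, assuming the split comes from the two tickers carrying different intraday timestamps (00:00 vs 03:00 UTC): normalize the index to midnight and collapse each day with the first non-NaN value per column. A sketch on a hand-built frame mimicking the output above:

```python
import numpy as np
import pandas as pd

# Recreate the symptom: each calendar day is split into two rows,
# one per ticker, because the timestamps differ by a few hours.
idx = pd.to_datetime([
    "2022-02-07 00:00:00+00:00", "2022-02-07 03:00:00+00:00",
    "2022-02-08 00:00:00+00:00", "2022-02-08 03:00:00+00:00",
])
df = pd.DataFrame({
    ("Close", "BRL=X"): [5.3269, np.nan, 5.2626, np.nan],
    ("Close", "^BVSP"): [np.nan, 111996.0, np.nan, 112234.0],
}, index=idx)

# Normalize every timestamp to midnight of its day, then take the
# first non-NaN value per day and column: one row per day.
merged = df.groupby(df.index.normalize()).first()
print(merged)
```

The same groupby(...).first() applied to dados_mercado should collapse each pair of rows into one.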

Generate dates for the next 10 weekdays in a dataframe, after the last date in the dataframe

I have a dataframe df whose last date is 2022-04-29. I want to generate the next 10 weekday dates (excluding Saturday and Sunday) in this dataframe; the other columns can have NaN values for the generated dates.
df has a DatetimeIndex, set via df = df.set_index('Date').
df
Open High Low Close Volume Currency
Date
2021-04-26 14449.45 14557.50 14421.30 14485.00 448533331968 INR
2021-04-27 14493.80 14667.55 14484.85 14653.05 442211696640 INR
2021-04-28 14710.50 14890.25 14694.95 14864.55 453990809600 INR
2021-04-29 14979.00 15044.35 14814.45 14894.90 511466668032 INR
2021-04-30 14747.35 14855.45 14601.70 14631.10 594744508416 INR
... ... ... ... ... ... ...
2022-04-25 17006.10 17052.10 16889.75 16953.95 275571 INR
2022-04-26 17121.30 17223.85 17064.45 17200.80 261066000 INR
2022-04-27 17073.35 17110.70 16958.45 17038.40 265140000 INR
2022-04-28 17153.40 17314.45 17071.20 17245.05 312794 INR
2022-04-29 17329.25 17377.65 17053.25 17102.55 336244000 INR
Expected Output-
df
Open High Low Close Volume Currency
Date
2021-04-26 14449.45 14557.50 14421.30 14485.00 448533331968 INR
2021-04-27 14493.80 14667.55 14484.85 14653.05 442211696640 INR
2021-04-28 14710.50 14890.25 14694.95 14864.55 453990809600 INR
2021-04-29 14979.00 15044.35 14814.45 14894.90 511466668032 INR
2021-04-30 14747.35 14855.45 14601.70 14631.10 594744508416 INR
... ... ... ... ... ... ...
2022-05-02 NaN NaN NaN NaN NaN NaN
2022-05-03 NaN NaN NaN NaN NaN NaN
.....
Try with pd.date_range and reindex:
df = df.reindex(df.index.union(pd.date_range(df.index.max(),periods=10,freq="B")))
>>> df
Open High Low Close Volume Currency
2022-04-25 17006.10 17052.10 16889.75 16953.95 275571.0 INR
2022-04-26 17121.30 17223.85 17064.45 17200.80 261066000.0 INR
2022-04-27 17073.35 17110.70 16958.45 17038.40 265140000.0 INR
2022-04-28 17153.40 17314.45 17071.20 17245.05 312794.0 INR
2022-04-29 17329.25 17377.65 17053.25 17102.55 336244000.0 INR
2022-05-02 NaN NaN NaN NaN NaN NaN
2022-05-03 NaN NaN NaN NaN NaN NaN
2022-05-04 NaN NaN NaN NaN NaN NaN
2022-05-05 NaN NaN NaN NaN NaN NaN
2022-05-06 NaN NaN NaN NaN NaN NaN
2022-05-09 NaN NaN NaN NaN NaN NaN
2022-05-10 NaN NaN NaN NaN NaN NaN
2022-05-11 NaN NaN NaN NaN NaN NaN
2022-05-12 NaN NaN NaN NaN NaN NaN
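Note that pd.date_range here starts at the existing last date (2022-04-29 is itself a weekday), so the snippet adds only 9 new weekdays. If exactly 10 new weekdays are required, start the range one day after the last date, e.g. with pd.bdate_range; a sketch on a two-row toy frame:

```python
import pandas as pd

# Toy frame standing in for the real df
df = pd.DataFrame(
    {"Close": [17245.05, 17102.55]},
    index=pd.to_datetime(["2022-04-28", "2022-04-29"]),
)

# Start one day after the last date so the existing last row is not
# counted among the generated weekdays; bdate_range skips weekends.
new_days = pd.bdate_range(df.index.max() + pd.Timedelta(days=1), periods=10)
df = df.reindex(df.index.union(new_days))
print(df)
```

Since 2022-04-30 is a Saturday, the first generated date is Monday 2022-05-02, and the reindexed rows hold NaN in every column.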

Python datetime dataframe: add dates if there are fewer dates than I want

I have two data files and both have different periods of datetime.
As you can see below, the first 'Date' is from 2013-10-14 to 2015-11-25, and the second 'Date' is from 2014-01-01 to 2015-11-27.
If I want to make the date range span 2013-10-14 to 2015-11-27 and fill the blanks with np.nan, what do I have to do in the code?
If you know how to do it or any idea, please let me know.
dvv : Date
2013-10-14 -0.038875
2013-10-15 -0.038875
2013-10-16 -0.038875
2013-10-17 -0.038875
2013-10-18 -0.038875
2015-11-21 0.081939
2015-11-22 -0.097986
2015-11-23 -0.096201
2015-11-24 -0.033913
2015-11-25 -0.050553
Name: dvv, Length: 773, dtype: float64
Stations Sensor EL GL Pressure Temp EC Barometa
Date
2014-01-01 JRee3 S11 NaN NaN NaN NaN NaN NaN
2014-01-02 JRee3 S11 NaN NaN NaN NaN NaN NaN
2014-01-02 JRee3 S11 NaN NaN NaN NaN NaN NaN
2014-01-04 JRee3 S11 NaN NaN NaN NaN NaN NaN
2014-01-05 JRee3 S11 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
2015-11-23 JRee3 S11 213.46 202.21 99.83 14.22 105.0 1008.13
2015-11-24 JRee3 S11 213.36 202.31 99.73 14.22 105.0 1008.36
2015-11-25 JRee3 S11 213.34 202.33 99.71 14.22 105.0 1004.40
2015-11-26 JRee3 S11 213.30 202.37 99.67 14.22 105.0 1003.13
2015-11-27 JRee3 S11 213.24 202.44 99.61 14.21 105.0 1011.00
[696 rows x 8 columns]
You can generate new dates this way (replace periods with a sufficient number):
days = pd.date_range('14/10/2013', periods=365, freq='D')
You will get something like this which you can add to your dataframe:
DatetimeIndex(['2013-10-14', '2013-10-15', '2013-10-16', '2013-10-17',
'2013-10-18', '2013-10-19', '2013-10-20', '2013-10-21',
'2013-10-22', '2013-10-23',
...
'2014-10-04', '2014-10-05', '2014-10-06', '2014-10-07',
'2014-10-08', '2014-10-09', '2014-10-10', '2014-10-11',
'2014-10-12', '2014-10-13'],
dtype='datetime64[ns]', length=365, freq='D')
Assuming you have no missing values in the dates, then you can simply exploit pandas.date_range and an outer join.
Toy example below:
import pandas as pd
dates1 = pd.date_range('2013-10-14', '2015-11-25', freq='D')
dates2 = pd.date_range('2014-01-01', '2015-11-27', freq='D')
df1 = pd.DataFrame(data=[1]*len(dates1), index=dates1, columns=['var'])
df2 = pd.DataFrame(data=[2]*len(dates2), index=dates2, columns=['var'])
df1.merge(df2, left_index=True, right_index=True, how='outer')
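The same idea without a merge: reindex both frames onto the union of the two date ranges, so dates missing from either frame become NaN. A sketch with dummy columns standing in for the real ones:

```python
import pandas as pd

dates1 = pd.date_range("2013-10-14", "2015-11-25", freq="D")
dates2 = pd.date_range("2014-01-01", "2015-11-27", freq="D")
df1 = pd.DataFrame({"dvv": 1.0}, index=dates1)
df2 = pd.DataFrame({"Pressure": 2.0}, index=dates2)

# Union of both date ranges: 2013-10-14 through 2015-11-27
full = dates1.union(dates2)

# Dates a frame does not cover are filled with NaN
df1, df2 = df1.reindex(full), df2.reindex(full)
print(df1.tail(3))
```

After the reindex both frames share the same 775-day index, with NaN on the dates each originally lacked.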

Filling NaN rows in a big datetime-indexed pandas dataframe using the values of non-NaN rows

I have a big weather CSV dataframe containing several hundred thousand rows, as well as many columns. The rows are time series sampled every 10 minutes over many years. The datetime index consists of year, month, day, hour, minute and second. Unfortunately, there are several thousand missing rows containing only NaNs. The goal is to fill them using the values of other rows collected at the same time of year, but in other years, provided those are not NaN.
I wrote a Python for-loop, but it seems like a very time-consuming solution. I need your help for a more efficient and faster solution.
The raw dataframe is as follows:
print(df)
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:10:00 996.52 -8.02 265.40 -8.90 93.30
2004-01-01 00:20:00 996.57 -8.41 265.01 -9.28 93.40
2004-01-01 00:40:00 996.51 -8.31 265.12 -9.07 94.20
2004-01-01 00:50:00 996.51 -8.27 265.15 -9.04 94.10
2004-01-01 01:00:00 996.53 -8.51 264.91 -9.31 93.90
... ... ... ... ... ...
2020-12-31 23:20:00 1000.07 -4.05 269.10 -8.13 73.10
2020-12-31 23:30:00 999.93 -3.35 269.81 -8.06 69.71
2020-12-31 23:40:00 999.82 -3.16 270.01 -8.21 67.91
2020-12-31 23:50:00 999.81 -4.23 268.94 -8.53 71.80
2021-01-01 00:00:00 999.82 -4.82 268.36 -8.42 75.70
[820551 rows x 5 columns]
For whatever reason, there are missing rows in the df dataframe. To identify them, it is possible to apply the function below:
findnanrows(df.groupby(pd.Grouper(freq='10T')).mean())
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 NaN NaN NaN NaN NaN
2009-10-08 09:50:00 NaN NaN NaN NaN NaN
2009-10-08 10:00:00 NaN NaN NaN NaN NaN
2013-05-16 09:00:00 NaN NaN NaN NaN NaN
2014-07-30 08:10:00 NaN NaN NaN NaN NaN
... ... ... ... ... ...
2016-10-28 12:00:00 NaN NaN NaN NaN NaN
2016-10-28 12:10:00 NaN NaN NaN NaN NaN
2016-10-28 12:20:00 NaN NaN NaN NaN NaN
2016-10-28 12:30:00 NaN NaN NaN NaN NaN
2016-10-28 12:40:00 NaN NaN NaN NaN NaN
[5440 rows x 5 columns]
The aim is to fill all these NaN rows. As an example, the first NaN row, which corresponds to the datetime 2004-01-01 00:30:00, should be filled with the non-NaN values of another row collected at the same time xxxx-01-01 00:30:00 of another year, like 2005-01-01 00:30:00 or 2006-01-01 00:30:00 and so on, or even 2003-01-01 00:30:00 or 2002-01-01 00:30:00 if they exist. It is also possible to apply an average over all these other years.
Seeing the values of the row with the datetime index 2005-01-01 00:30:00:
print(df.loc["2005-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2005-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
After filling the row corresponding to the index datetime 2004-01-01 00:30:00 using the values of the row having the index datetime 2005-01-01 00:30:00, the df dataframe will have the following row:
print(df.loc["2004-01-01 00:30:00", :])
p (mbar) T (degC) Tpot (K) Tdew (degC) rh (%)
datetime
2004-01-01 00:30:00 996.36 12.67 286.13 7.11 68.82
The two functions that I created are the following. The first is to identify the NaN rows. The second is to fill them.
def findnanrows(df):
    is_NaN = df.isnull()
    row_has_NaN = is_NaN.any(axis=1)
    rows_with_NaN = df[row_has_NaN]
    return rows_with_NaN

def filldata(weatherdata):
    fillweatherdata = weatherdata.copy()
    allyears = fillweatherdata.index.year.unique().tolist()
    dfnan = findnanrows(fillweatherdata.groupby(pd.Grouper(freq='10T')).mean())
    for i in range(dfnan.shape[0]):
        dnan = dfnan.index[i]
        if dnan.year == min(allyears):
            y = 0
            dnew = dnan.replace(year=dnan.year+y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year+y)
                y += 1
        else:
            y = 0
            dnew = dnan.replace(year=dnan.year-y)
            while dnew in dfnan.index:
                dnew = dnew.replace(year=dnew.year-y)
                y += 1
        new_row = pd.DataFrame(np.array([fillweatherdata.loc[dnew, :]]).tolist(), columns=fillweatherdata.columns.tolist(), index=[dnan])
        fillweatherdata = pd.concat([fillweatherdata, pd.DataFrame(new_row)], ignore_index=False)
        #fillweatherdata = fillweatherdata.drop_duplicates()
    fillweatherdata = fillweatherdata.sort_index()
    return fillweatherdata
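A vectorized alternative to the row-by-row loop, sketched on synthetic data rather than the real CSV: reinstate the missing timestamps with asfreq, then fill every NaN with the mean of the same (month, day, time-of-day) slot across all years in a single groupby/transform pass.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 10-minute samples over 2004-2005, with the
# 2004-01-01 00:30:00 row missing entirely.
idx = pd.date_range("2004-01-01", "2005-12-31 23:50", freq="10min")
df = pd.DataFrame({"T (degC)": np.arange(len(idx), dtype=float)}, index=idx)
df = df.drop(pd.Timestamp("2004-01-01 00:30:00"))

# asfreq reinstates the missing timestamps as NaN rows, replacing
# the groupby(pd.Grouper(...)).mean() detection step.
full = df.asfreq("10min")

# Group by (month, day, time-of-day), i.e. the same clock slot in
# every year, and fill each NaN with the group's NaN-skipping mean.
slot = [full.index.month, full.index.day, full.index.time]
filled = full.fillna(full.groupby(slot).transform("mean"))
print(filled.loc["2004-01-01 00:30:00"])
```

Here the filled value equals the 2005-01-01 00:30:00 value, since that is the only other member of the slot; with more years it becomes their average, matching the averaging idea mentioned in the question.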

Python Pandas resample, odd behaviour

I have 2 datasets (cex2.txt and cex3.txt) which I would like to resample in pandas. With one dataset I get the expected output, with the other I do not.
The datasets are tick data and are formatted exactly the same; the 2 datasets are just from two different days.
import pandas as pd
import datetime as dt
pd.set_option ('display.mpl_style', 'default')
time_converter = lambda x: dt.datetime.fromtimestamp(float(x))
data_frame = pd.read_csv('cex2.txt', sep=';', converters={'time': time_converter})
data_frame.drop('Unnamed: 7', axis=1, inplace=True)
data_frame.drop('low', axis=1, inplace=True)
data_frame.drop('high', axis=1, inplace=True)
data_frame.drop('last', axis=1, inplace=True)
data_frame = data_frame.reindex_axis(['time', 'ask', 'bid', 'vol'], axis=1)
data_frame.set_index(pd.DatetimeIndex(data_frame['time']), inplace=True)
ask = data_frame['ask'].resample('15Min', how='ohlc')
bid = data_frame['bid'].resample('15Min', how='ohlc')
vol = data_frame['vol'].resample('15Min', how='sum')
print ask
from the cex2.txt dataset I get this wrong output:
open high low close
1970-01-01 01:00:00 NaN NaN NaN NaN
1970-01-01 01:15:00 NaN NaN NaN NaN
1970-01-01 01:30:00 NaN NaN NaN NaN
1970-01-01 01:45:00 NaN NaN NaN NaN
1970-01-01 02:00:00 NaN NaN NaN NaN
1970-01-01 02:15:00 NaN NaN NaN NaN
1970-01-01 02:30:00 NaN NaN NaN NaN
1970-01-01 02:45:00 NaN NaN NaN NaN
1970-01-01 03:00:00 NaN NaN NaN NaN
1970-01-01 03:15:00 NaN NaN NaN NaN
from the cex3.txt dataset I get correct values:
open high low close
2014-08-10 13:30:00 0.003483 0.003500 0.003483 0.003485
2014-08-10 13:45:00 0.003485 0.003570 0.003467 0.003471
2014-08-10 14:00:00 0.003471 0.003500 0.003470 0.003494
2014-08-10 14:15:00 0.003494 0.003500 0.003493 0.003498
2014-08-10 14:30:00 0.003498 0.003549 0.003498 0.003500
2014-08-10 14:45:00 0.003500 0.003533 0.003487 0.003533
2014-08-10 15:00:00 0.003533 0.003600 0.003520 0.003587
I'm really at my wits' end. Does anyone have an idea why this happens?
Edit:
Here are the data sources:
https://dl.dropboxusercontent.com/u/14055520/cex2.txt
https://dl.dropboxusercontent.com/u/14055520/cex3.txt
Thanks!
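The download links are dead now, but the 1970-01-01 bins in the wrong output are a strong hint (an assumption, since cex2.txt can no longer be inspected): datetime.fromtimestamp maps values near zero to the Unix epoch, so if the time column of cex2.txt was not parsed as expected (a stray header row, or the wrong field reaching the converter), every row lands in early 1970 and the 15-minute bins covering the real trading day come out empty. A quick demonstration:

```python
import datetime as dt

import pandas as pd

# Near-zero "timestamps", as produced when the converter receives
# something other than real epoch seconds
bad = pd.Series([0.5, 120.0, 3600.0])
parsed = bad.map(dt.datetime.fromtimestamp)
print(parsed)  # dates near the Unix epoch, not 2014
```

Printing data_frame['time'].head() right after read_csv would confirm whether the converter actually received epoch seconds.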
