I'm using the yfinance library with 2 tickers (^BVSP and BRL=X), but when I display the dataframe it shows 2 rows per day, where each row has the information of only one ticker. The information for the other ticker is NaN. I want to put all the information in one row.
How can I solve this?
I tried this:
import datetime
import yfinance as yf

dados_bolsa = ["^BVSP", "BRL=X"]
today = datetime.datetime.now()
one_year = today - datetime.timedelta(days=365)
print(one_year)
dados_mercado = yf.download(dados_bolsa, one_year, today)
display(dados_mercado)
I get:
2022-02-06 13:27:29.158181
[*********************100%***********************] 2 of 2 completed
Adj Close Close High Low Open Volume
BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP BRL=X ^BVSP
Date
2022-02-07 00:00:00+00:00 5.3269 NaN 5.3269 NaN 5.3430 NaN 5.276800 NaN 5.326200 NaN 0.0 NaN
2022-02-07 03:00:00+00:00 NaN 111996.00000 NaN 111996.00000 NaN 112517.000000 NaN 111490.00000 NaN 112247.000000 NaN 10672800.0
2022-02-08 00:00:00+00:00 5.2626 NaN 5.2626 NaN 5.2849 NaN 5.251000 NaN 5.262800 NaN 0.0 NaN
2022-02-08 03:00:00+00:00 NaN 112234.00000 NaN 112234.00000 NaN 112251.000000 NaN 110943.00000 NaN 111995.000000 NaN 10157500.0
2022-02-09 00:00:00+00:00 5.2584 NaN 5.2584 NaN 5.2880 NaN 5.232774 NaN 5.256489 NaN 0.0 NaN
Note that we have 2 rows for the same day, each half NaN. I want just one row with all the information.
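One workaround (a minimal sketch based on my reading of the output above, not from the original post): the two rows differ only in their intraday timestamps, because each ticker reports in its own timezone, so collapsing on the calendar date merges them:

# Hedged sketch: collapse both rows for a calendar day into one.
# groupby(...).first() keeps the first non-NaN value in each column,
# so the NaN halves of the two rows fill each other in.
dados_mercado = dados_mercado.groupby(dados_mercado.index.date).first()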
I'm trying to get a list of the major world indices in Yahoo Finance at this URL: https://finance.yahoo.com/world-indices.
I tried first to get the indices in a table by just running
major_indices=pd.read_html("https://finance.yahoo.com/world-indices")[0]
In this case the error was:
ValueError: No tables found
So I read a solution using Selenium at "pandas read_html - no tables found".
The solution they came up with is (with some adjustments):
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from webdrivermanager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().download_and_install())
driver.get("https://finance.yahoo.com/world-indices")
html = driver.page_source
tables = pd.read_html(html)
data = tables[1]
Again this code gave me another error:
ValueError: No tables found
I don't know whether to keep using Selenium or whether pd.read_html is just fine. Either way, I'm trying to get this data and don't know how to proceed. Can anyone help me?
You don't need Selenium here, you just have to set the euConsentId cookie:
import pandas as pd
import requests
import uuid
url = 'https://finance.yahoo.com/world-indices'
# Any fresh UUID is enough to satisfy the consent check
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
df = pd.read_html(html)[0]
Output:
>>> df
Symbol Name Last Price Change % Change Volume Intraday High/Low 52 Week Range Day Chart
0 ^GSPC S&P 500 4023.89 93.81 +2.39% 2.545B NaN NaN NaN
1 ^DJI Dow 30 32196.66 466.36 +1.47% 388.524M NaN NaN NaN
2 ^IXIC Nasdaq 11805.00 434.04 +3.82% 5.15B NaN NaN NaN
3 ^NYA NYSE COMPOSITE (DJ) 15257.36 326.26 +2.19% 0 NaN NaN NaN
4 ^XAX NYSE AMEX COMPOSITE INDEX 4025.81 122.66 +3.14% 0 NaN NaN NaN
5 ^BUK100P Cboe UK 100 739.68 17.83 +2.47% 0 NaN NaN NaN
6 ^RUT Russell 2000 1792.67 53.28 +3.06% 0 NaN NaN NaN
7 ^VIX CBOE Volatility Index 28.87 -2.90 -9.13% 0 NaN NaN NaN
8 ^FTSE FTSE 100 7418.15 184.81 +2.55% 0 NaN NaN NaN
9 ^GDAXI DAX PERFORMANCE-INDEX 14027.93 288.29 +2.10% 0 NaN NaN NaN
10 ^FCHI CAC 40 6362.68 156.42 +2.52% 0 NaN NaN NaN
11 ^STOXX50E ESTX 50 PR.EUR 3703.42 89.99 +2.49% 0 NaN NaN NaN
12 ^N100 Euronext 100 Index 1211.74 28.89 +2.44% 0 NaN NaN NaN
13 ^BFX BEL 20 3944.56 14.35 +0.37% 0 NaN NaN NaN
14 IMOEX.ME MOEX Russia Index 2307.50 9.61 +0.42% 0 NaN NaN NaN
15 ^N225 Nikkei 225 26427.65 678.93 +2.64% 0 NaN NaN NaN
16 ^HSI HANG SENG INDEX 19898.77 518.43 +2.68% 0 NaN NaN NaN
17 000001.SS SSE Composite Index 3084.28 29.29 +0.96% 3.109B NaN NaN NaN
18 399001.SZ Shenzhen Component 11159.79 64.92 +0.59% 3.16B NaN NaN NaN
19 ^STI STI Index 3191.16 25.98 +0.82% 0 NaN NaN NaN
20 ^AXJO S&P/ASX 200 7075.10 134.10 +1.93% 0 NaN NaN NaN
21 ^AORD ALL ORDINARIES 7307.70 141.10 +1.97% 0 NaN NaN NaN
22 ^BSESN S&P BSE SENSEX 52793.62 -136.69 -0.26% 0 NaN NaN NaN
23 ^JKSE Jakarta Composite Index 6597.99 -1.85 -0.03% 0 NaN NaN NaN
24 ^KLSE FTSE Bursa Malaysia KLCI 1544.41 5.61 +0.36% 0 NaN NaN NaN
25 ^NZ50 S&P/NZX 50 INDEX GROSS 11168.18 -9.18 -0.08% 0 NaN NaN NaN
26 ^KS11 KOSPI Composite Index 2604.24 54.16 +2.12% 788539 NaN NaN NaN
27 ^TWII TSEC weighted index 15832.54 215.86 +1.38% 0 NaN NaN NaN
28 ^GSPTSE S&P/TSX Composite index 20099.81 400.76 +2.03% 294.637M NaN NaN NaN
29 ^BVSP IBOVESPA 106924.18 1236.54 +1.17% 0 NaN NaN NaN
30 ^MXX IPC MEXICO 49579.90 270.58 +0.55% 212.868M NaN NaN NaN
31 ^IPSA S&P/CLX IPSA 5058.88 0.00 0.00% 0 NaN NaN NaN
32 ^MERV MERVAL 38390.84 233.89 +0.61% 0 NaN NaN NaN
33 ^TA125.TA TA-125 1964.95 23.38 +1.20% 0 NaN NaN NaN
34 ^CASE30 EGX 30 Price Return Index 10642.40 -213.50 -1.97% 36.837M NaN NaN NaN
35 ^JN0U.JO Top 40 USD Net TRI Index 4118.19 65.63 +1.62% 0 NaN NaN NaN
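For context (my reading, not stated explicitly in the answer): without the euConsentId cookie, Yahoo redirects visitors in some regions to a consent page that contains no tables, which is why pd.read_html raised ValueError: No tables found.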
I have a dataframe df. The last date in df is 2022-04-29. I want to generate the next 10 weekday dates (excluding Saturday and Sunday) in this dataframe; the other columns can have NaN values for the generated dates.
df has a DatetimeIndex, set with df = df.set_index('Date').
df
Open High Low Close Volume Currency
Date
2021-04-26 14449.45 14557.50 14421.30 14485.00 448533331968 INR
2021-04-27 14493.80 14667.55 14484.85 14653.05 442211696640 INR
2021-04-28 14710.50 14890.25 14694.95 14864.55 453990809600 INR
2021-04-29 14979.00 15044.35 14814.45 14894.90 511466668032 INR
2021-04-30 14747.35 14855.45 14601.70 14631.10 594744508416 INR
... ... ... ... ... ... ...
2022-04-25 17006.10 17052.10 16889.75 16953.95 275571 INR
2022-04-26 17121.30 17223.85 17064.45 17200.80 261066000 INR
2022-04-27 17073.35 17110.70 16958.45 17038.40 265140000 INR
2022-04-28 17153.40 17314.45 17071.20 17245.05 312794 INR
2022-04-29 17329.25 17377.65 17053.25 17102.55 336244000 INR
Expected output:
df
Open High Low Close Volume Currency
Date
2021-04-26 14449.45 14557.50 14421.30 14485.00 448533331968 INR
2021-04-27 14493.80 14667.55 14484.85 14653.05 442211696640 INR
2021-04-28 14710.50 14890.25 14694.95 14864.55 453990809600 INR
2021-04-29 14979.00 15044.35 14814.45 14894.90 511466668032 INR
2021-04-30 14747.35 14855.45 14601.70 14631.10 594744508416 INR
... ... ... ... ... ... ...
2022-05-02 NaN NaN NaN NaN NaN NaN
2022-05-03 NaN NaN NaN NaN NaN NaN
.....
Try with pd.date_range and reindex:
df = df.reindex(df.index.union(pd.date_range(df.index.max(),periods=10,freq="B")))
>>> df
Open High Low Close Volume Currency
2022-04-25 17006.10 17052.10 16889.75 16953.95 275571.0 INR
2022-04-26 17121.30 17223.85 17064.45 17200.80 261066000.0 INR
2022-04-27 17073.35 17110.70 16958.45 17038.40 265140000.0 INR
2022-04-28 17153.40 17314.45 17071.20 17245.05 312794.0 INR
2022-04-29 17329.25 17377.65 17053.25 17102.55 336244000.0 INR
2022-05-02 NaN NaN NaN NaN NaN NaN
2022-05-03 NaN NaN NaN NaN NaN NaN
2022-05-04 NaN NaN NaN NaN NaN NaN
2022-05-05 NaN NaN NaN NaN NaN NaN
2022-05-06 NaN NaN NaN NaN NaN NaN
2022-05-09 NaN NaN NaN NaN NaN NaN
2022-05-10 NaN NaN NaN NaN NaN NaN
2022-05-11 NaN NaN NaN NaN NaN NaN
2022-05-12 NaN NaN NaN NaN NaN NaN
My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, the time has been bucketed into 5-minute intervals, and within each bucket the values of the different variables should become their own columns. Each bucket takes the very first timestamp that was bucketed into it.
In order to solve this I have tried a couple of different solutions, but can't seem to find anything that works without constant errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1 because we want the second index column (0 would be the first).
Then drop the extra level of the column multi-index you just created, to make it a little cleaner, with .droplevel().
Finally, use pd.Grouper. Since the date/time is now on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
# Pivot: one column per variable
df = df.set_index(['timestamp','variables']).unstack(1)
# Flatten the ('value', variable) column MultiIndex to just the variable names
df.columns = df.columns.droplevel()
# Average within each 5-minute bucket
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
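Note the practical difference between the two approaches: grouping on the index keeps every 5-minute bucket in the full time range (7890 rows here, mostly NaN), while grouping by both the Grouper and variables keeps only the buckets that actually contain observations (4 rows here).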
TL;DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe, based on the column(s) containing a specific piece of text?
I'm relatively new to Python and data analysis (this is my first time posting a question on Stack Overflow, but I've been hunting for an answer for a long time and used to code regularly) and I'm not having any success.
I have a dataframe imported from an Excel file that doesn't have named/indexed columns. I am trying to extract data from nearly 2000 of these files, which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).
The original dataframe (from a poorly structured XLS file) looks a bit like this:
0 NaN RIGHT NaN
1 Date UCVA Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
3 4 5 6 7 8 9 \
0 NaN NaN NaN NaN NaN NaN NaN
1 Cyl Axis BSCVA Pentacam remarks K1 K2 K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
... 17 18 19 20 21 22 \
0 ... NaN NaN NaN NaN NaN NaN
1 ... BSCVA Pentacam remarks K1 K2 K2 back K max
2 ... 6/5 NaN NaN NaN NaN NaN
3 ... NaN NaN NaN NaN NaN NaN
4 ... NaN Pentacam 44.3 43.7 -6.2 45.5
5 ... 6/4-4 NaN NaN NaN NaN NaN
6 ... 6/5 NaN NaN NaN NaN NaN
I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:
1 Date R-UCVA R-Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
1 R-Cyl R-Axis R-BSCVA R-Penta R-K1 R-K2 R-K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
etc. etc. So I'm trying to write some code that will pull out a series of columns that I define by looking for the words "Date" or "UCVA" etc. Then I plan to stitch them back together into a single dataframe, with a patient identifier as an extra column, and cycle through all the XLS files, appending the whole lot to a single CSV file that I can then do useful stuff with (like put into an Access database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and statistical analysis).
Any suggestions? I hope that's enough information.
Thanks very much in advance.
Kind regards
Vicky
Here is something that will hopefully get you started.
I have prepared a text.xlsx file, and I can read it as follows:
import pandas as pd

path = 'text.xlsx'
df = pd.read_excel(path, header=[0,1])
# Deal with two levels of headers, here I just join them together crudely
df.columns = df.columns.map(lambda h: ' '.join(h))
# Slight hack because I messed with the column names
# I create two dataframes, one with the first column, one with the second column
df1 = df[[df.columns[0],df.columns[1]]]
df2 = df[[df.columns[0], df.columns[2]]]
# Stacking them on top of each other
result = pd.concat([df1, df2])
print(result)
#Merging them on the Date column
result = pd.merge(left=df1, right=df2, on=df1.columns[0])
print(result)
This gives the output
RIGHT Sph RIGHT UCVA Unnamed: 0_level_0 Date
0 NaN 6/38 2007-01-13 00:00:00
1 NaN 6/37 2009-11-05 00:00:00
2 NaN 9/56 2009-11-18 00:00:00
0 [-2.00] NaN 2007-01-13 00:00:00
1 NaN NaN 2009-11-05 00:00:00
2 NaN NaN 2009-11-18 00:00:00
and
Unnamed: 0_level_0 Date RIGHT UCVA RIGHT Sph
0 2007-01-13 00:00:00 6/38 [-2.00]
1 2009-11-05 00:00:00 6/37 NaN
2 2009-11-18 00:00:00 9/56 NaN
Some pointers:
How to merge two header rows? See this question and answer.
How to select pandas columns conditionally? See e.g. this or this, and the sketch below.
How to merge dataframes? There is a very good guide in the pandas docs.
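As a concrete illustration of that conditional selection, here is a minimal sketch; the file name and the assumption that the labels sit in the second raw row are hypothetical:

import pandas as pd

raw = pd.read_excel('eyes.xlsx', header=None)  # hypothetical file, read without headers
labels = raw.iloc[1].astype(str)               # the row holding 'Date', 'UCVA', 'Sph', ...
wanted = ['Date', 'UCVA', 'Sph']               # the text you are searching for
cols = [c for c in raw.columns if labels[c].strip() in wanted]
subset = raw.loc[2:, cols]                     # the data starts below the label row
subset.columns = ['Date' if labels[c].strip() == 'Date' else 'R-' + labels[c].strip() for c in cols]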
I have created 2 pandas DataFrames. The first, called 'dfmas', has a 'Date' index, the price data and 3 moving-average columns;
OPEN HIGH LOW LAST ma5 ma8 ma21
Date
11/23/2009 88.84 89.19 88.58 88.97 NaN NaN NaN
11/24/2009 88.97 89.07 88.36 88.50 NaN NaN NaN
11/25/2009 88.50 88.63 87.22 87.35 NaN NaN NaN
11/26/2009 87.35 87.48 86.30 86.59 NaN NaN NaN
11/27/2009 86.59 87.02 84.83 86.53 87.588 NaN NaN
11/30/2009 87.17 87.17 85.87 86.41 87.076 NaN NaN
12/1/2009 86.41 87.53 86.17 86.68 86.712 NaN NaN
12/2/2009 86.68 87.49 86.59 87.39 86.720 87.302 NaN
12/3/2009 87.39 88.48 87.32 88.26 87.054 87.214 NaN
12/4/2009 88.26 90.77 88.00 90.56 87.860 87.471 NaN
The second dataframe is made from the above data by looking at when the moving averages cross over;
# ma5 and ma8 hold the moving-average columns from dfmas
ma = [0,]
ma5Last = ma5[0]
ma8Last = ma8[0]
for ma5Curr, ma8Curr in zip(ma5[1:], ma8[1:]):
if ma5Curr > ma5Last and ma8Curr > ma8Last:
ma.append(1)
elif ma5Curr < ma5Last and ma8Curr < ma8Last:
ma.append(-1)
else:
ma.append(0)
ma5Last = ma5Curr
ma8Last = ma8Curr
maX = pd.DataFrame(ma).astype('float')
maX.columns = ['maX']
and is called 'maX' below;
maX
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 1.0
However I'm unable to merge/concat the 2 data frames.
How do I add the 'Date" index to the second 'maX'dataframe and then merge/concat/combine the two dataframes together? Many thanks in advance.
Is this what you are after?
df['maX'] = maX.maX.values
df
Out[1263]:
OPEN HIGH LOW LAST ma5 ma8 ma21 maX
Date
11/23/2009 88.84 89.19 88.58 88.97 NaN NaN NaN 0.0
11/24/2009 88.97 89.07 88.36 88.50 NaN NaN NaN 0.0
11/25/2009 88.50 88.63 87.22 87.35 NaN NaN NaN 0.0
11/26/2009 87.35 87.48 86.30 86.59 NaN NaN NaN 0.0
11/27/2009 86.59 87.02 84.83 86.53 87.588 NaN NaN 0.0
11/30/2009 87.17 87.17 85.87 86.41 87.076 NaN NaN 0.0
12/1/2009 86.41 87.53 86.17 86.68 86.712 NaN NaN 0.0
12/2/2009 86.68 87.49 86.59 87.39 86.720 87.302 NaN 0.0
12/3/2009 87.39 88.48 87.32 88.26 87.054 87.214 NaN 0.0
12/4/2009 88.26 90.77 88.00 90.56 87.860 87.471 NaN 1.0
If the DataFrames have the same length, simply build the second with the index of the original DataFrame so the indexes align:
maX = pd.DataFrame(ma, index=df.index).astype('float')
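As an aside (a hedged alternative, not part of the original answer), the loop that builds ma can also be written vectorised with diff(), producing the same 0/1/-1 signal and keeping the Date index automatically:

import numpy as np

# 1 where both moving averages rose, -1 where both fell, else 0
up = (dfmas['ma5'].diff() > 0) & (dfmas['ma8'].diff() > 0)
down = (dfmas['ma5'].diff() < 0) & (dfmas['ma8'].diff() < 0)
dfmas['maX'] = np.select([up, down], [1.0, -1.0], default=0.0)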