Scraping a public Tableau dashboard with Python

I am collecting Indian renewable energy data and am stuck on gathering it.
On the public Tableau dashboard I can select the year and the tab, but when I try to find the parameter or filter values that would let me reach the yearly values, the results come back empty.
Here is my Python code. I select only the "Capacity by EnergySource" worksheet (there are three worksheets in this workbook; I use one here for reference), but ultimately I want to scrape the yearly data for all worksheets in each tab.
from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/RenewableCapacity/CapacitybySource"
ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()

# Pull one worksheet and dump its data
ws = ts.getWorksheet("Capacity by EnergySource")
print(ws.data)
ws.data.to_csv('table.csv', index=False)

# Both of these return nothing for this dashboard
parameters = workbook.getParameters()
print(parameters)
filters = ws.getFilters()
print(filters)
I am following the steps in this repository and learning new ways to download data from Tableau Public:
https://github.com/bertrandmartel/tableau-scraping
Could anybody help with this issue?
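For reference, the kind of loop I am ultimately aiming for looks roughly like this, assuming the year turns out to be exposed as a workbook parameter ("Year" and its value list below are only placeholders, since getParameters() currently returns nothing for me):
from tableauscraper import TableauScraper as TS

url = "https://public.tableau.com/views/RenewableCapacity/CapacitybySource"
ts = TS()
ts.loads(url)
workbook = ts.getWorkbook()

# "Year" and the value list are placeholders; they should come from workbook.getParameters()
for year in ["2018", "2019", "2020"]:
    wb_year = workbook.setParameter("Year", year)  # select the year
    for ws in wb_year.worksheets:                  # every worksheet in the current tab
        ws.data.to_csv(f"{ws.name}_{year}.csv", index=False)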

Related

How do I limit the rate of a scraper?

I am trying to build a table by scraping hundreds of similar pages and saving them all into the same Excel file, with something like this:
# let urls be a list of hundreds of different URLs
def save_table(urls):
    <define columns and parameters of the dataframe to be saved, df>
    writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
    for i in range(0, len(urls)):
        # here, return_html_soup is the function returning the html soup of any individual URL
        soup = return_html_soup(urls[i])
        temp_table = some_function(soup)
        df = df.append(temp_table, ignore_index=True)
    # I chose to_excel instead of to_csv here because there are certain letters on the
    # original website that don't show up in a CSV
    df.to_excel(writer, sheet_name=<some name>)
    writer.save()
    writer.close()
I now hit HTTP Error 429: Too Many Requests, without any Retry-After header.
Is there a way to get around this? I know this error happens because I have asked to scrape too many pages in too short an interval. Is there a way to limit the rate at which my code opens links?
The official Python documentation is the best place to go:
https://docs.python.org/3/library/time.html#time.sleep
Here is an example using a 5-second delay, but you can adjust it to your needs and the restrictions you face.
import time

# let urls be a list of hundreds of different URLs
def save_table(urls):
    <define columns and parameters of the dataframe to be saved, df>
    writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
    for i in range(0, len(urls)):
        # here, return_html_soup is the function returning the html soup of any individual URL
        soup = return_html_soup(urls[i])
        temp_table = some_function(soup)
        df = df.append(temp_table, ignore_index=True)
        # New code: wait for some time before the next request
        time.sleep(5)
    # I chose to_excel instead of to_csv here because there are certain letters on the
    # original website that don't show up in a CSV
    df.to_excel(writer, sheet_name=<some name>)
    writer.save()
    writer.close()
The way I did it: scrape the whole element once and then parse through it by header, tag, or name. I used bs4 with robinstocks for market data; it runs every 10 minutes or so and works fine, specifically with the get_element_by_name functionality. Or just add a time delay with the time library.
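Another option, since the 429 responses here come back without a Retry-After header, is to retry each page with exponential backoff instead of a fixed sleep. A rough sketch, assuming requests and BeautifulSoup are what return_html_soup uses internally (get_soup_with_backoff is a made-up helper name):
import time
import requests
from bs4 import BeautifulSoup

def get_soup_with_backoff(url, max_retries=5, base_delay=5):
    """Fetch a URL, waiting longer after each 429 before retrying."""
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return BeautifulSoup(response.text, "html.parser")
        # No Retry-After header, so back off exponentially: 5s, 10s, 20s, ...
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")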

Retrieve a lot of data from Yahoo finance

I have a CSV file which contains the ticker symbols for all the stocks listed on Nasdaq. Here is a link to that CSV file; one can download it from there. There are more than 8000 stocks listed. Following is the code:
import pandas as pd
import yfinance as yf  # pip install yfinance

tick_pd = pd.read_csv("/path/to/the/csv/file/nasdaq_screener_1654004691484.csv",
                      usecols=[0])
I have made a function which retrieves the historical stock prices for a ticker symbol. That function is as follows:
## function to be applied on each stock symbol
def appfunc(ticker):
    A = yf.Ticker(ticker).history(period="max")
    A["symbol"] = ticker
    return A
And I apply this function to each ticker symbol in tick_pd, the following way:
hist_prices = tick_pd["Symbol"].apply(appfunc)
But this takes far too much time. I was hoping someone could suggest a way to retrieve this data more quickly, or a way to parallelize it. I am quite new to Python, so I don't know many ways to do this.
Thanks in advance.
You can use yf.download to download all tickers asynchronously:
tick_pd = pd.read_csv('nasdaq_screener_1654024849057.csv', usecols=[0])
df = yf.download(tick_pd['Symbol'].tolist(), period='max')
You can also use the threads parameter of yf.download:
# Enable mass downloading (default is True)
df = yf.download(tick_pd['Symbol'].tolist(), period='max', threads=True)
# OR
# You can control the number of threads
df = yf.download(tick_pd['Symbol'].tolist(), period='max', threads=8)
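If you would rather keep the per-ticker appfunc from the question, you could also parallelize it yourself with concurrent.futures; a sketch (not benchmarked, and the thread count is an arbitrary choice):
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

tickers = tick_pd["Symbol"].tolist()

# Run appfunc concurrently; the yfinance calls are I/O bound, so threads help
with ThreadPoolExecutor(max_workers=16) as executor:
    frames = list(executor.map(appfunc, tickers))

hist_prices = pd.concat(frames)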

Append new data into an existing frame and upload to sheets Python

I'm connected to my API client, I sent the credentials, made the request, asked the API for data, and put it into a DataFrame.
Then I have to upload this data to a sheet; that sheet will then be connected to Power BI as a data source in order to build a dashboard and monitor some KPIs, and so on.
A simple and common ETL process. But, to be honest, I'm a rookie and I'm doing my best.
Above this point is just the code to connect to the API; here is where the "extraction" begins:
if response_page.status_code == 200:
    if page == 1:
        df = pd.DataFrame(json.loads(response_page.content)["list"])
    else:
        df2 = pd.DataFrame(json.loads(response_page.content)["list"])
        df = df.append(df2)
Then I just pick the columns I need:
columnas = ['orderId','totalValue','paymentNames']
df2 = df[columnas]
df2
This is what the DataFrame looks like (screenshot omitted): this is the data to which I need to append the new rows.
Then I connect to Sheets, send the credentials, and open the spreadsheet ("carrefourMetodosDePago") and the worksheet ("transacciones"):
sa = gspread.service_account(filename="service_account.json")
sh = sa.open("carrefourMetodosDePago")
wks = sh.worksheet("transacciones")
The magic begins:
wks.update([df2.columns.values.tolist()] + df2.values.tolist())
With this statement I upload what the picture shows to the sheet.
I need the new data generated by the API to be appended/merged/concatenated to the current data, so that the code uploads the current data plus the new data every time I run it, and so forth.
How can I do that? Should I use a for loop that iterates over the new data and appends it to the sheet?
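For example, would something like this work, writing the header only once and then pushing just the new rows below whatever is already in the sheet with gspread's append_rows (I have not tried it yet)?
# Write the header row only if the sheet is still empty
if not wks.get_all_values():
    wks.update([df2.columns.values.tolist()])

# Append only the new rows below the existing data
wks.append_rows(df2.values.tolist(), value_input_option="USER_ENTERED")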
This is the best I could do; I think I have reached my limit here.
If I have explained myself poorly, just let me know.
If you have read this far, thank you for giving me some of your time :)

Is there a way to merge identical excel tabs into one dataframe using python?

I'm extremely new to Python and am having some issues trying to merge identical Excel tabs (same column labels and data) into an aggregated tab.
I've been looking through various resources, and this link appears to be the best one, but it doesn't quite mesh with the way I've been shown.
I am interested in whether any part of the code below is viable for this merge and, by extension, whether there are other resources out there that may offer a solution or mirror the style of coding I've been taught.
Thank You!
import pandas as pd
#import numpy as np
#Define Dataframes
df_ui = pd.read_excel('U:/ABA/Data/Trade/GTIS/GTIS Trade Shell.xlsx', sheet_name='UreaImports')
df_ue = pd.read_excel('U:/ABA/Data/Trade/GTIS/GTIS Trade Shell.xlsx', sheet_name='UreaExports')
df_uani = pd.read_excel('U://ABA/Data/Trade/GTIS/GTIS Trade Shell.xlsx', sheet_name='UANImports')
df_uane = pd.read_excel('U://ABA/Data/Trade/GTIS/GTIS Trade Shell.xlsx', sheet_name='UANExports')
df_nh3i = pd.read_excel('U://ABA/Data/Trade/GTIS/GTIS Trade Shell.xlsx', sheet_name='NH3Imports')
df_nh3e = pd.read_excel('U://ABA/Data/Trade/GTIS/GTIS Trade Shell.xlsx', sheet_name='NH3Exports')
#Merge Dataframes
df_raw = pd.merge(df_ui,df_ue,df_uani,df_uane,df_nh3i,df_nh3e)
# Sends all data tables to an excel file
writer = pd.ExcelWriter('U:/ABA/Data/Trade/GTIS/GTIS Python Build.xlsx',
                        engine='xlsxwriter')
df_raw.to_excel(writer, sheet_name ='raw')
writer.save()
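From the other examples I have seen, stacking identically-structured tabs seems to be done with pd.concat rather than pd.merge; would something along these lines fit with the code above (I have not tested it)?
import pandas as pd

path = 'U:/ABA/Data/Trade/GTIS/GTIS Trade Shell.xlsx'
sheets = ['UreaImports', 'UreaExports', 'UANImports',
          'UANExports', 'NH3Imports', 'NH3Exports']

# Read every tab and stack them on top of each other (same column labels in each tab)
df_raw = pd.concat([pd.read_excel(path, sheet_name=s) for s in sheets],
                   ignore_index=True)

# Send the aggregated table to a single 'raw' sheet
with pd.ExcelWriter('U:/ABA/Data/Trade/GTIS/GTIS Python Build.xlsx',
                    engine='xlsxwriter') as writer:
    df_raw.to_excel(writer, sheet_name='raw')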

Is it possible to use the Yahoo Query Language to download historical financial data?

I've used the Yahoo Finance site to download historical data, using queries like this:
http://ichart.finance.yahoo.com/table.csv?s=AAPL&c=1962
and the accompanying Python code:
import urllib.request

with open("data.csv", "wb") as w:
    url = "http://ichart.finance.yahoo.com/table.csv?s=AAPL&c=1962"
    r = urllib.request.urlopen(url).read()
    w.write(r)
I've also used the Yahoo Query Language to download pseudo-realtime data (i.e. data delayed by a few minutes) with queries like this:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20in%20(%22AAPL%22)&env=store://datatables.org/alltableswithkeys
However, I can't find documentation on how to download historical data (as in the first query) using YQL (as in the second query). I'd like to do this because the tables returned by YQL contain much more data than simply opening/closing prices, volume, etc.
Is there a way to download historical data using YQL so that it contains the same depth of detail as the pseudo-realtime data?
Unfortunately, the YQL table yahoo.finance.historicaldata contains the same fields as the CSV files, specifically the opening price, closing price, high, low, volume, etc. To download this data using YQL, this is an example of the proper query:
select * from yahoo.finance.historicaldata where symbol = "IBM" and startDate = "2012-01-01" and endDate = "2012-01-11"
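For completeness, a sketch of how that query might be issued from Python, following the same query.yahooapis.com URL pattern as the pseudo-realtime example in the question (assuming the public YQL endpoint is still reachable and that the response follows the usual query/results/quote JSON layout):
import json
import urllib.parse
import urllib.request

yql = ('select * from yahoo.finance.historicaldata '
       'where symbol = "IBM" and startDate = "2012-01-01" and endDate = "2012-01-11"')

url = ("http://query.yahooapis.com/v1/public/yql"
       "?q=" + urllib.parse.quote(yql) +
       "&format=json"
       "&env=" + urllib.parse.quote("store://datatables.org/alltableswithkeys"))

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

quotes = data["query"]["results"]["quote"]  # one dict per trading day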
