Need workaround for downloading xlsx from website - Python

I am trying to obtain data provided in an xlsx spreadsheet available for download from a URL link via Python. My first approach was to read it into a dataframe and save it to a file that can be manipulated by another script.
I have realized that xlsx is no longer supported by xlrd due to security concerns. My current thought for a workaround is to download to a separate file, convert to xls and then do my initial processing/manipulations. I am new to Python and wondering if this is the best way to accomplish it. I see a potential problem in this method, as the security concern is still present. This particular document is probably downloaded by many institutions daily, so the incentive for hacking the source document and deploying a bug is high. Am I overthinking?
What method would you use to read an xlsx file into pandas from a static URL? Additionally, my next problem is downloading a document from a dynamic URL, and any tips on where to look would be helpful.
My original source code is below; the goal is to maintain a database of all S&P 500 constituents and their current weightings.
Thank you.
# packages
import pandas as pd
url = 'https://www.ssga.com/us/en/institutional/etfs/library-content/products/fund-data/etfs/us/holdings-daily-us-en-spy.xlsx'
# Load the first sheet of the Excel file into a data frame
df = pd.read_excel(url, sheet_name=0, header=1)
# View the first ten rows
df.head(10)
# Is it worth it to download the file to a repository, convert to xls, then read it in?

You can always make the request with requests and then read the xlsx into a pandas dataframe like so:
import pandas as pd
import requests
from io import BytesIO
url = ("https://www.ssga.com/us/en/institutional/etfs/library-content/"
"products/fund-data/etfs/us/holdings-daily-us-en-spy.xlsx")
r = requests.get(url)
bts = BytesIO(r.content)
df = pd.read_excel(bts)
I'm not sure about the security concerns, but this would be equivalent to making the same request in a browser. As for the dynamic URL, if you can figure out which parts of the URL are changing, you can just modify it as follows:
stock = 'spy'
url = ("https://www.ssga.com/us/en/institutional/etfs/library-content/"
f"products/fund-data/etfs/us/holdings-daily-us-en-{stock}.xlsx")

Related

How to download csv in python when the link is not directly linked to the csv?

I'm trying to download the economic calendar events on myfxbook.com.
They have an option to download the calendar in a csv and an example of the url is as such:
https://www.myfxbook.com/calendar_statement.csv?filter=0-1-2-3_ANG-ARS-AUD-BRL-CAD-CHF-CLP-CNY-COP-CZK-DKK-EEK-EUR-GBP-HKD-HUF-IDR-INR-ISK-JPY-KPW-KRW-MXN-NOK-NZD-PEI-PLN-QAR-ROL-RUB-SEK-SGD-TRY-USD-ZAR&end=2021-08-20%2015:59:59.059&start=2021-08-18%2016:00:00.0&calPeriod=10
I'm aware that this is not an explicit csv link (it doesn't end with ".csv"), so I can't just do a pandas.read_csv on it.
I've tried using requests to download it, but it seems like there's no content (the content-length is 0).
Any help would be appreciated here.
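A hedged sketch of one thing to try: an empty body is often a sign the server filters requests that don't look like they come from a browser, or expects a logged-in session, so sending a browser-like User-Agent and then feeding the response text to pandas may help (the header value and the diagnosis are assumptions, not confirmed behaviour of myfxbook.com):
import requests
import pandas as pd
from io import StringIO

# URL copied from the question
url = "https://www.myfxbook.com/calendar_statement.csv?filter=0-1-2-3_ANG-ARS-AUD-BRL-CAD-CHF-CLP-CNY-COP-CZK-DKK-EEK-EUR-GBP-HKD-HUF-IDR-INR-ISK-JPY-KPW-KRW-MXN-NOK-NZD-PEI-PLN-QAR-ROL-RUB-SEK-SGD-TRY-USD-ZAR&end=2021-08-20%2015:59:59.059&start=2021-08-18%2016:00:00.0&calPeriod=10"

# Assumption: a browser-like User-Agent may be required for a non-empty response
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers)
r.raise_for_status()

if r.text.strip():
    df = pd.read_csv(StringIO(r.text))
    print(df.head())
else:
    # Still empty: the endpoint probably also needs session cookies from a login
    print("Empty response - login cookies may be required.")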

Download csv schedule from https URL using Python

I cannot seem to locate a URL path to the CSV file that I could use with Python's csv.reader, requests, urllib, or pandas.
Here is the webpage: https://www.portconnect.co.nz/#/vessel-schedule/expected-arrivals
Am I on the right track?
Could you please suggest a solution?
Normally it should be quite easy using pandas if you had the direct link to the CSV file. With the provided URL, I could download the CSV from https://www.portconnect.co.nz/568a8a10-3379-4096-a1e8-76e0f1d1d847
import pandas as pd
portconnecturl = "https://www.portconnect.co.nz/568a8a10-3379-4096-a1e8-76e0f1d1d847"
print("Downloading data from portconnect...")
df_csv = pd.read_csv(portconnecturl)
print(df_csv)
.csv files can't be embedded inside a webpage, hence you won't get a combined webpage + .csv file URL. You will have to download it separately and read it in your code. Otherwise you will need to use web scraping and scrape the details that you want using the Beautiful Soup or Selenium Python modules.
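For completeness, here is a generic sketch of that scraping route with Beautiful Soup, assuming the page exposes the CSV link as an ordinary anchor tag; for JavaScript-rendered pages like the PortConnect site, requests alone will not see the link and Selenium would be needed instead:
import requests
from bs4 import BeautifulSoup

# Page URL from the question
page_url = "https://www.portconnect.co.nz/#/vessel-schedule/expected-arrivals"

html = requests.get(page_url).text
soup = BeautifulSoup(html, "html.parser")

# Collect any anchors whose href looks like a CSV export
csv_links = [a["href"] for a in soup.find_all("a", href=True) if ".csv" in a["href"].lower()]
print(csv_links)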

How to read several web pages without closing url link

I use Python 3.7 with Spyder in Anaconda. I don't have a lot of experience with Python so I might use the wrong technical terms in my problem description.
I use the requests library to read process data of a list of part numbers, from a database with a web page interface. I use the following code. I found most of it on StackOverflow.
# Libraries
import requests
import pandas as pd
import lxml.html as LH
# Get link for part results from hyperlink list
for link in hyperlink_list:
    # Add part number to database link
    process_url = database_link + link
    html = requests.get(process_url).content
    # Read data to dataframe
    df_list = pd.read_html(html)
The for loop fetches the link for the next part number from the hyperlink list and then modifies process_url to extract the data for that part number. The code above works well, except that it takes more than twice as long (2.2 seconds) as my VBA code that does the same. It looks like it opens and closes the connection for every part number. Is there any way to open the URL connection and read many different web pages before closing it?
I'm making the assumption that it opens and closes the connection for every part based on the fact that I had the same time delay when I used Excel VBA code that opened and closed Internet Explorer for every data read. When I changed the VBA code to keep Explorer open and read all the web pages, it took less than a second.
I managed to reduce the time by 0.5 seconds by removing requests.get(process_url).content
and using pandas to read the data directly with df_list = pd.read_html(process_url). It now takes around 1.7 seconds to read the 400 rows of data in the table for each part. This adds up to a good time saving when I have to read thousands of tables, but it is still slower than the VBA script. Below is my new code:
import pandas as pd
# Get link for part results from hyperlink list
for link in hyperlink_list:
    # Add part number to database link
    process_url = database_link + link
    df_list = pd.read_html(process_url)
    df = df_list[-1]
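One thing that might close the remaining gap is requests.Session, which reuses the same TCP connection (HTTP keep-alive) for every request instead of opening a new one per part number. A sketch, reusing the variable names from the question:
import requests
import pandas as pd

results = []
with requests.Session() as session:          # one connection reused for all parts
    for link in hyperlink_list:
        # Add part number to database link
        process_url = database_link + link
        html = session.get(process_url).text
        df_list = pd.read_html(html)
        results.append(df_list[-1])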

How can I transform an html button link to a pandas dataframe?

This is Alex, a Python newbie who needs a little bit of help with an automation task.
For a monthly report, I need to be able to extract some values monthly (for example, 119.032 for "Kredite für Wohnbau", Sep. 20) from the following table (https://www.oenb.at/isaweb/report.do?lang=DE&report=1.5.13).
On the upper left side there is an HTML button (Go) to export the data as either Excel or CSV.
My goal would be to access the download file URL and import it directly with pandas read_csv.
I have tried the following without success:
[Screenshot of the Chrome DevTools Network tab]
1). pd.read_html(r'https://www.oenb.at/isaweb/reportExcelHtml.do?report=1.5.13&sort=ASC&dynValue=0&lang=DE&linesPerPage=allen&timeSeries=All&page=1')
Which results in a rather chaotic output.
2). df = pd.read_csv('https://www.oenb.at/isaweb/reportExcelHtml.do?report=1.5.13&sort=ASC&dynValue=0&lang=DE&linesPerPage=allen&timeSeries=All&page=1', error_bad_lines=False)
Which, instead of the values, shows only the following:
[Screenshot of the read_csv output]
I would be extremely grateful for any advice on how to get this working.
Thank you very much in advance!
You can achieve this by sending a POST request and then creating a DataFrame from the content of the response using pandas. Make sure you properly decode the string containing the data.
import requests
import pandas as pd
from io import StringIO

data = requests.post('https://www.oenb.at/isaweb/reportExcelHtml.do;jsessionid=F96C45B1B1CA4A34D676CEDF8A21FE32?report=1.5.13&sort=ASC&dynValue=0&lang=DE&linesPerPage=allen&timeSeries=All&page=1&export=CSV&buttonExport=Go')
df = pd.read_csv(StringIO(data.content.decode('iso-8859-1')), sep=';')

Accessing the contents of a public Google Sheet as CSV using Requests (or other library)

I wrote a small Python program that works with data from a CSV file. I am tracking some numbers in a Google Sheet, and I created the CSV file by downloading the sheet. I am trying to find a way to have Python read the CSV file directly from Google Sheets, so that I do not have to download a new CSV when I update the spreadsheet.
I see that the requests library may be able to handle this, but I'm having a hard time figuring it out. I've chosen not to try the Google APIs because this way seems simpler, as long as I don't mind making the sheet public to those with the link, which is fine.
I've tried working with the requests documentation but I'm a novice programmer and I can't get it to read in as a CSV.
This is how the data is currently taken into python:
import csv

file = open('data1.csv', newline='')
reader = csv.reader(file)
I would like the file = open() to ideally be replaced by the requests library and pull directly from the spreadsheet.
You need to find the correct URL request that downloads the file.
Sample URL:
csv_url='https://docs.google.com/spreadsheets/d/169AMdEzYzH7NDY20RCcyf-JpxPSUaO0nC5JRUb8wwvc/export?format=csv&id=169AMdEzYzH7NDY20RCcyf-JpxPSUaO0nC5JRUb8wwvc&gid=0'
The way to find it is to manually download your file while inspecting the request URL in the Network tab of your browser's Developer Tools.
Then the following is enough:
import requests as rs
csv_url=YOUR_CSV_DOWNLOAD_URL
res=rs.get(url=csv_url)
open('google.csv', 'wb').write(res.content)
It will save a CSV file named 'google.csv' in the folder of your Python script.
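If you want to keep the csv.reader approach from the question and skip the intermediate file, the response text can be wrapped in a StringIO; a small sketch, reusing the res object from the snippet above:
import csv
from io import StringIO

# Parse the downloaded CSV text directly, without writing it to disk first
reader = csv.reader(StringIO(res.text))
for row in reader:
    print(row)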
import pandas as pd
import requests
YOUR_SHEET_ID=''
r = requests.get(f'https://docs.google.com/spreadsheet/ccc?key={YOUR_SHEET_ID}&output=csv')
open('dataset.csv', 'wb').write(r.content)
df = pd.read_csv('dataset.csv')
df.head()
I tried @adirmola's solution but I had to tweak it a little.
When he wrote "You need to find the correct URL request that downloads the file" he had a point. An easy solution is what I'm showing here: adding "&output=csv" after your Google Sheet ID.
Hope it helps!
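If you'd rather skip the intermediate file entirely, pandas can read the download URL directly; a sketch using the export-style link from the first answer (the sheet ID is a placeholder):
import pandas as pd

sheet_id = "YOUR_SHEET_ID"  # placeholder for your own sheet ID
csv_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid=0"
df = pd.read_csv(csv_url)
df.head()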
I'm not exactly sure about your usage scenario, and Adirmola already provided a very exact answer to your question, but my immediate question is why you want to download the CSV in the first place.
Google Sheets has a Python library, so you can just get the data from the Google Sheet directly.
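A minimal sketch of that route, assuming the library meant is gspread and that a Google service-account credentials file has already been set up (the filename and sheet ID below are placeholders); it is more setup than the plain CSV-export URL approach:
import gspread

gc = gspread.service_account(filename="credentials.json")  # hypothetical credentials file
sh = gc.open_by_key("YOUR_SHEET_ID")                        # placeholder sheet ID
rows = sh.sheet1.get_all_records()                          # list of dicts, one per row
print(rows[:5])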
You may also be interested in this answer, since you want to watch for changes in your Google Sheet.
I would just like to say that using OAuth keys and the Google Python API is not always an option. I found the above to be quite useful for my current application.
