How to read several web pages without closing the URL connection - Python

I use Python 3.7 with Spyder in Anaconda. I don't have a lot of experience with Python so I might use the wrong technical terms in my problem description.
I use the requests library to read process data for a list of part numbers from a database with a web-page interface. I use the following code, most of which I found on Stack Overflow.
# Libraries
import requests
import pandas as pd
import lxml.html as LH

# Get link for part results from hyperlink list
for link in hyperlink_list:
    # Add part number to database link
    process_url = database_link + link
    html = requests.get(process_url).content
    # Read data to dataframe
    df_list = pd.read_html(html)
The for loop fetches the link for the next part number from the hyperlink list and then modifies process_url to extract the data for that part number. The code above works well, except that it takes more than twice as long (2.2 seconds per part) as my VBA code that does the same thing. It looks like it opens and closes the connection for every part number. Is there any way to open the connection once and read many different web pages before closing it?
I'm assuming it opens and closes the connection for every part because I saw the same time delay when my Excel VBA code opened and closed Internet Explorer for every data read. When I changed the VBA code to keep Internet Explorer open and read all the web pages, it took less than a second.

I managed to reduce the time by 0.5 seconds by removing requests.get(process_url).content and using pandas to read the data directly with df_list = pd.read_html(process_url). It now takes around 1.7 seconds to read the 400 rows of data in the table for each part. This adds up to a good time saving when I have to read thousands of tables, but it is still slower than the VBA script. Below is my new code:
import pandas as pd

# Get link for part results from hyperlink list
for link in hyperlink_list:
    # Add part number to database link
    process_url = database_link + link
    df_list = pd.read_html(process_url)
    df = df_list[-1]
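One thing worth trying is requests.Session, which reuses the same underlying TCP connection (HTTP keep-alive) across requests instead of opening a new connection for every part number. A minimal sketch, assuming database_link and hyperlink_list are defined as in the question:
import pandas as pd
import requests

# One Session object keeps the TCP connection alive between requests
session = requests.Session()

for link in hyperlink_list:
    # Add part number to database link
    process_url = database_link + link
    # The connection stays open between calls to session.get
    html = session.get(process_url).content
    df_list = pd.read_html(html)
    df = df_list[-1]
Whether this closes the gap to the VBA script depends on how much of the 1.7 seconds is connection overhead versus pd.read_html parsing the 400-row table.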

Related

Need workaround for downloading xlsx from website

I am trying to obtain data provided in an xlsx spreadsheet, available for download from a URL, via Python. My first approach was to read it into a dataframe and save it down to a file that can be manipulated by another script.
I have realized that xlsx is no longer supported by xlrd due to security concerns. My current thought for a workaround is to download to a separate file, convert to xls, and then do my initial processing/manipulations. I am new to Python and wondering if this is the best way to accomplish it. I see a potential problem with this method, as the security concern is still present. This particular document is probably downloaded by many institutions daily, so the incentive for hacking the source document and deploying a bug is high. Am I overthinking?
What method would you use to call xlsx into pandas from a static URL? Additionally, my next problem is downloading a document from a dynamic URL, and any tips on where to look would be helpful.
My original source code is below. The problem I am trying to solve is maintaining a database of all S&P 500 constituents and their current weightings.
Thank you.
# packages
import pandas as pd

url = 'https://www.ssga.com/us/en/institutional/etfs/library-content/products/fund-data/etfs/us/holdings-daily-us-en-spy.xlsx'

# Load the first sheet of the Excel file into a data frame
df = pd.read_excel(url, sheet_name=0, header=1)

# View the first ten rows
df.head(10)

# Is it worth it to download the file to a repository, convert to xls, then read it in?
You can always make the request with requests and then read the xlsx into a pandas dataframe like so:
import pandas as pd
import requests
from io import BytesIO
url = ("https://www.ssga.com/us/en/institutional/etfs/library-content/"
"products/fund-data/etfs/us/holdings-daily-us-en-spy.xlsx")
r = requests.get(url)
bts = BytesIO(r.content)
df = pd.read_excel(bts)
I'm not sure about the security concerns, but this would be equivalent to making the same request in a browser. As for the dynamic URL, if you can figure out which parts of the URL are changing, you can just modify it as follows:
stock = 'spy'
url = ("https://www.ssga.com/us/en/institutional/etfs/library-content/"
f"products/fund-data/etfs/us/holdings-daily-us-en-{stock}.xlsx")

Web Scraping the Registration Reset Website

I am trying to get some perspective on web scraping this website. Essentially, what I am going to do is use the header keys as a way to scrape the data from the website and create a list of tuples, which I will convert into a data frame.
The issue is navigating to display different results and using a for loop to do so (for example, navigating from the first 50 results to the next 50 results).
What attribute, class, etc. would I need to access so that I can iterate from tab to tab until the maximum number of rows is reached?
https://www6.sos.state.oh.us/ords/f?p=119:REGRESET:0:
What happens is that the classes shown in Inspect Element and the real classes in the page source are sometimes different.
Try writing the page out to a local file, like:
import requests

html = requests.request("GET", "https://www6.sos.state.oh.us/ords/f?p=119:REGRESET:0")
with open("file.html", "w+", encoding="utf-8") as f:
    f.write(html.text)
Open the file in a browser and then inspect it; you will see the correct classes to scrape.

Incomplete data after scraping a website for Data

I am working on some web scraping using Python and experienced some issues with extracting the table values. For example, I am interested in scraping the ETF values from http://www.etf.com/etfanalytics/etf-finder. Below is a snapshot of the tables I am trying to scrape values from.
Here is the code I am trying to use for the scraping.
# Import packages
import pandas as pd
import requests

# Get website url and get request
etf_list = "http://www.etf.com/etfanalytics/etf-finder"
etf_df = pd.read_html(requests.get(etf_list, headers={'User-agent': 'Mozilla/5.0'}).text)

# Printing the scraped data to screen
print(etf_df)

# Output the read data into dataframes
frame = {}  # added here so the loop runs; the original snippet never defined frame
for i in range(0, len(etf_df)):
    frame[i] = pd.DataFrame(etf_df[i])
    print(frame[i])
I have several issues.
The tables only contain 20 entries, while the total per table on the website should be 2166 entries. How do I amend the code to pull all the values?
Some of the dataframes could not be properly assigned after scraping the site. For example, frame[0] is not in a dataframe format, and nothing is shown for frame[0] when trying to view it as a DataFrame in the Python console, although it looks fine when printed to the screen. Would it be better to parse the HTML using BeautifulSoup instead?
As noted by Alex, the website requests the data from http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1, which checks the Referer header to see if you're allowed to see it.
However, Alex is wrong in saying that you're unable to change the header.
It is in fact very easy to send custom headers using requests:
>>> r = requests.get('http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1', headers={'Referer': 'http://www.etf.com/etfanalytics/etf-finder'})
>>> data = r.json()
>>> len(data)
2166
At this point, data is a dict containing all the data you need; pandas probably has a simple way of loading it into a dataframe.
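For example, a minimal sketch of that last step (assuming the decoded JSON is either a flat list of per-fund records or a dict of columns, both of which the DataFrame constructor accepts; the column names will be whatever keys the API returns):
import pandas as pd

# Build a table straight from the decoded JSON
df = pd.DataFrame(data)
print(df.shape)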
You get only 20 rows of the table because only 20 rows are present on the HTML page by default. View the source code of the page you are trying to parse. A possible solution would be to iterate through the pagination until the end, but the pagination there is implemented with JS and is not reflected in the URL, so I don't see how you can access the next pages of the table directly.
Looks like there is a request to
http://www.etf.com/etf-finder-funds-api//-aum/100/100/1
on that page when I try to load the 2nd group of 100 rows. But getting access to that URL might be very tricky, if possible at all. Maybe for this particular site you should use something like WebBrowser in C# (I don't know what the equivalent is in Python, but I'm sure Python can do it). You would be able to imitate a browser and execute JavaScript.
Edit: I tried running the following JS code in the console on the page you provided.
jQuery.ajax({
    url: "http://www.etf.com/etf-finder-funds-api//-aum/0/3000/1",
    success: function(data) {
        console.log(JSON.parse(data));
    }
});
It logged an array of all 2166 objects representing the table rows you are looking for. Try it yourself to see the result. It looks like in the request URL, "0" is a start index and "3000" is a limit.
But if you try this from some other domain you will get 403 Forbidden. This is because they check the Referer header.
Edit again: as mentioned by @stranac, it is easy to set that header. Just set it to http://www.etf.com/etfanalytics/etf-finder and enjoy.

Get all opened websites from Chrome in Python

I am on Windows 8.1, Python 3.6.
Is it possible to get all currently open websites in the latest version of Chrome and save them to a text file in D:/?
I tried opening the file:
C:\Users\username\AppData\Local\Google\Chrome\User Data\Default\Current Tabs
But I receive an error saying that the file is opened in another program.
There is another file named History that contains the URLs that were opened, but it also contains characters like NULL.
I tried reading the file in Python but received a UnicodeDecodeError (not sure about the exact name).
Then I tried opening the file with the following code:
with open('C:/Users/username/AppData/Local/Google/Chrome/User Data/Default/History', "r+", encoding='latin') as file:
    data = file.read()
    print(data)
And it worked, but I only got 1 or 2 URLs, while in the text file there were no URLs at all.
Maybe there's another way, something like importing a module.
Something like:
import chrome
url = chrome.get_url()
print(url)
Maybe selenium can also do this. But I don't know how.
Maybe there's another way to read the file with all links in python.
What I want with it is to detect which websites are open; if mywebsite.com has been open for more than 10 minutes, it will automatically be blocked. The system has its own file:
C:\Windows\System32\drivers\etc\hosts
It will add the following at the end:
127.0.0.1 www.mywebsite.com
And the website will no longer be available to use.
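For reference, appending that entry from Python is an ordinary file write. A minimal sketch, assuming the script is run with administrator rights (otherwise Windows denies writing to the hosts file):
# Path from the question; writing here requires administrator rights
hosts_path = r"C:\Windows\System32\drivers\etc\hosts"
with open(hosts_path, "a") as hosts:
    hosts.write("\n127.0.0.1 www.mywebsite.com\n")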
You can use this methodology to store the tab data and manipulate it:
windows = driver.window_handles
You can store the windows using the above method.
current_window = driver.current_window_handle
This method will give you the current window that is being handled. You can go through the list 'windows' and check if it is current_window to navigate between the tabs.
driver.switch_to.window(windows[5])
This method will switch to a desired tab but I assume you already have it.
Now how do you store the time spent after the tabs are opened?
There are two ways to do it:
Internally, by referring to a pandas dataframe or list
Reading and writing to a file.
First you need to import the time library in the script:
current_time = time.time()
time.time() returns the current time as a Unix timestamp (a float of seconds since the epoch).
In either one of these scenarios, you will need a structure such as this:
data = []
for i in range(0, len(windows)):
    data.append([windows[i], time.time()])
This will give a structure as below:
[[windows[0], 1234564879],
 [windows[1], 1234567896], ...]
Here's the part you were missing:
for i in range(0, len(data)):
    # If the new timestamp minus the stored one is bigger than 600 seconds (10 minutes)
    if time.time() - data[i][1] > 600:
        driver.switch_to.window(data[i][0])
        driver.close()
My personal advice is to start with stable API services to get whatever data you want instead of selenium. I would recommend SerpApi since I work there. It has a variety of scrapers, including a Google results scraper, and it has 5000 free calls for new accounts.

Python- Downloading a file from a webpage by clicking on a link

I've looked around the internet for a solution to this but none have really seemed applicable here. I'm writing a Python program to predict the next day's stock price using historical data. I don't need all the historical data since inception that Yahoo Finance provides, only the last 60 days or so. The NASDAQ website provides just the right amount of historical data and I wanted to use that website.
What I want to do is go to a particular stock's profile on NASDAQ, for example www.nasdaq.com/symbol/amd/historical, and click on the "Download this File in Excel Format" link at the very bottom. I inspected the page's HTML to see if there was an actual link I could just use with urllib to get the file, but all I got was:
<a id="lnkDownLoad" href="javascript:getQuotes(true);">
Download this file in Excel Format
</a>
No link. So my question is: how can I write a Python script that goes to a given stock's NASDAQ page, clicks the "Download this file in Excel Format" link, and actually downloads the file? Most solutions online require you to know the URL where the file is stored, but in this case I don't have access to that. So how do I go about doing this?
Using Chrome, go to View > Developer > Developer Tools
In this new developer tools UI, change to the Network tab
Navigate to the place where you would need to click, and click the ⃠ symbol to clear all recent activity.
Click the link, and see if there was any requests made to the server
If there was, click it, and see if you can reverse engineer the API of its endpoint
Please be aware that this may be against the website's Terms of Service!
It appears that BeautifulSoup might be the easiest way to do this. I've made a cursory check that the results of the following script are the same as those that appear on the page. You would just have to write the results to a file, rather than print them. However, the columns are ordered differently.
import requests
from bs4 import BeautifulSoup

URL = 'http://www.nasdaq.com/symbol/amd/historical'
page = requests.get(URL).text
soup = BeautifulSoup(page, 'lxml')
tableDiv = soup.find_all('div', id="historicalContainer")
tableRows = tableDiv[0].findAll('tr')

for tableRow in tableRows[2:]:
    row = tuple(tableRow.getText().split())
    print('"%s",%s,%s,%s,%s,"%s"' % row)
Output:
"03/24/2017",14.16,14.18,13.54,13.7,"50,022,400"
"03/23/2017",13.96,14.115,13.77,13.79,"44,402,540"
"03/22/2017",13.7,14.145,13.55,14.1,"61,120,500"
"03/21/2017",14.4,14.49,13.78,13.82,"72,373,080"
"03/20/2017",13.68,14.5,13.54,14.4,"91,009,110"
"03/17/2017",13.62,13.74,13.36,13.49,"224,761,700"
"03/16/2017",13.79,13.88,13.65,13.65,"44,356,700"
"03/15/2017",14.03,14.06,13.62,13.98,"55,070,770"
"03/14/2017",14,14.15,13.6401,14.1,"52,355,490"
"03/13/2017",14.475,14.68,14.18,14.28,"72,917,550"
"03/10/2017",13.5,13.93,13.45,13.91,"62,426,240"
"03/09/2017",13.45,13.45,13.11,13.33,"45,122,590"
"03/08/2017",13.25,13.55,13.1,13.22,"71,231,410"
"03/07/2017",13.07,13.37,12.79,13.05,"76,518,390"
"03/06/2017",13,13.34,12.38,13.04,"117,044,000"
"03/03/2017",13.55,13.58,12.79,13.03,"163,489,100"
"03/02/2017",14.59,14.78,13.87,13.9,"103,970,100"
"03/01/2017",15.08,15.09,14.52,14.96,"73,311,380"
"02/28/2017",15.45,15.55,14.35,14.46,"141,638,700"
"02/27/2017",14.27,15.35,14.27,15.2,"95,126,330"
"02/24/2017",14,14.32,13.86,14.12,"46,130,900"
"02/23/2017",14.2,14.45,13.82,14.32,"79,900,450"
"02/22/2017",14.3,14.5,14.04,14.28,"71,394,390"
"02/21/2017",13.41,14.1,13.4,14,"66,250,920"
"02/17/2017",12.79,13.14,12.6,13.13,"40,831,730"
"02/16/2017",13.25,13.35,12.84,12.97,"52,403,840"
"02/15/2017",13.2,13.44,13.15,13.3,"33,655,580"
"02/14/2017",13.43,13.49,13.19,13.26,"40,436,710"
"02/13/2017",13.7,13.95,13.38,13.49,"57,231,080"
"02/10/2017",13.86,13.86,13.25,13.58,"54,522,240"
"02/09/2017",13.78,13.89,13.4,13.42,"72,826,820"
"02/08/2017",13.21,13.75,13.08,13.56,"75,894,880"
"02/07/2017",14.05,14.27,13.06,13.29,"158,507,200"
"02/06/2017",12.46,13.7,12.38,13.63,"139,921,700"
"02/03/2017",12.37,12.5,12.04,12.24,"59,981,710"
"02/02/2017",11.98,12.66,11.95,12.28,"116,246,800"
"02/01/2017",10.9,12.14,10.81,12.06,"165,784,500"
"01/31/2017",10.6,10.67,10.22,10.37,"51,993,490"
"01/30/2017",10.62,10.68,10.3,10.61,"37,648,430"
"01/27/2017",10.6,10.73,10.52,10.67,"32,563,480"
"01/26/2017",10.35,10.66,10.3,10.52,"35,779,140"
"01/25/2017",10.74,10.975,10.15,10.35,"61,800,440"
"01/24/2017",9.95,10.49,9.95,10.44,"43,858,900"
"01/23/2017",9.68,10.06,9.68,9.91,"27,848,180"
"01/20/2017",9.88,9.96,9.67,9.75,"27,936,610"
"01/19/2017",9.92,10.25,9.75,9.77,"46,087,250"
"01/18/2017",9.54,10.1,9.42,9.88,"51,705,580"
"01/17/2017",10.17,10.23,9.78,9.82,"70,388,000"
"01/13/2017",10.79,10.87,10.56,10.58,"38,344,340"
"01/12/2017",10.98,11.0376,10.33,10.76,"75,178,900"
"01/11/2017",11.39,11.41,11.15,11.2,"39,337,330"
"01/10/2017",11.55,11.63,11.33,11.44,"29,122,540"
"01/09/2017",11.37,11.64,11.31,11.49,"37,215,840"
"01/06/2017",11.29,11.49,11.11,11.32,"34,437,560"
"01/05/2017",11.43,11.69,11.23,11.24,"38,777,380"
"01/04/2017",11.45,11.5204,11.235,11.43,"40,742,680"
"01/03/2017",11.42,11.65,11.02,11.43,"55,114,820"
"12/30/2016",11.7,11.78,11.25,11.34,"44,033,460"
"12/29/2016",11.24,11.62,11.01,11.59,"50,180,310"
"12/28/2016",12.28,12.42,11.46,11.55,"71,072,640"
"12/27/2016",11.65,12.08,11.6,12.07,"44,168,130"
The script quotes the dates and the thousands-separated volume numbers.
Dig a little bit deeper and find out what the JS function getQuotes() does. You should get a good clue from that.
If it all seems too complicated, you can always use selenium, which simulates a browser. However, it is much slower than using native network calls. You can find the official documentation here.
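If you do go the selenium route, a minimal sketch of clicking that link (assumes a Chrome driver is installed and on PATH; the element id lnkDownLoad comes from the HTML quoted in the question, and the file lands in the browser's default download directory):
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.nasdaq.com/symbol/amd/historical")

# Click the "Download this file in Excel Format" link by its id
driver.find_element(By.ID, "lnkDownLoad").click()

time.sleep(10)  # give the browser a moment to finish the download
driver.quit()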
