Web Scraping the Registration Reset Website - python

I am trying to get some perspective on web scraping this website. Essentially, what I am going to do is use the header keys as a way to scrape the data from the website and create a list of tuples, which I will convert into a data frame.
The issue is navigating the pagination to display different results and using a for loop to do so (for example, moving from the first 50 results to the next 50).
What attribute, class, etc. would I need to access so that I can iterate from tab to tab until the maximum number of rows is reached?
https://www6.sos.state.oh.us/ords/f?p=119:REGRESET:0:

What sometimes happens is that the classes shown in the browser's inspect-element view differ from the real classes in the HTML the server actually sends.
Try writing the page to a binary file, like this:
import requests

response = requests.get("https://www6.sos.state.oh.us/ords/f?p=119:REGRESET:0")
with open("file.html", "wb") as f:
    f.write(response.content)
Open the file in a browser and inspect it; you will see the correct classes to scrape.

Related

Parsing a table on a webpage generated by a script using Python

I am trying to scrape the data contained in a table on https://www.bop.gov/coronavirus/. However, when one first visits the page the table is hidden behind a link (https://www.bop.gov/coronavirus/#) that leads to the same page but expands the hidden table. I cannot find this link in the webpage's source code, nor can I get Selenium to follow it in order to expand the table and scrape its data. How can I access the data in this table using Python?
The endpoint from which the data is loaded on the page is available under the network tab of the developer tools. The data you need is loaded from
https://www.bop.gov/coronavirus/json/final.json
You might also want to take a look at
https://www.bop.gov/coronavirus/data/locations.json
as the first link only contains the short codes for the names.
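A minimal sketch of pulling both endpoints with requests (the structure of the returned JSON is an assumption here; inspect it before building a lookup from short codes to full names):
import requests

totals = requests.get("https://www.bop.gov/coronavirus/json/final.json").json()
locations = requests.get("https://www.bop.gov/coronavirus/data/locations.json").json()
print(type(totals), type(locations))  # inspect the shapes before joining codes to names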
The table data is readily available under the div with id="totals_breakdown".
You can directly call the page_source and parse the data for that element with BeautifulSoup without needing to "show" the element.
If you MUST show the element for some reason, you simply have to remove the class "closed" from the div with id="totals_breakdown".
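For example, a rough sketch of the page_source approach (assuming the div is already present in the rendered HTML; an explicit wait may be needed on slower loads):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.bop.gov/coronavirus/")
soup = BeautifulSoup(driver.page_source, "html.parser")
breakdown = soup.find("div", id="totals_breakdown")
if breakdown is not None:
    print(breakdown.get_text(" ", strip=True))  # the table text, no clicking required
driver.quit()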

Get HTML-source as an HTML object with ability to work in it using DOM operations

I have a page, say, https://jq.profinance.ru/html/htmlquotes/site2.jsp, which is updated every second. My aim is to parse values using Selenium.
from selenium import webdriver

url = "https://jq.profinance.ru/html/htmlquotes/site2.jsp"
driver = webdriver.Chrome()
driver.get(url)
mylist = []
my_tables = driver.find_elements_by_tag_name('table')  # operation 1
for table in my_tables:
    for tr in table.find_elements_by_tag_name('tr'):   # operation 2
        mylist.append(tr)
The problem is that Python binds my_tables to a reference to the object returned by driver.find_elements_by_tag_name('table'), not to a copy of its value. Hence I do not get consistent data, because the page changes in the lag between operations 1 and 2.
How can I copy the webpage HTML structure and then use Selenium commands to walk through the structure of my document?
I tried pickle, get_attribute("innerHTML"), and .page_source, but they do not work properly for this because they only give me the HTML as a string object.
I don't think you can do exactly what you're trying to do with Selenium alone. Selenium "drives" a running web browser, and if the Javascript in that browser is updating the contents of the page every second or so you'll have these timing problems.
What you can do is use Selenium to drive the browser to get a snapshot of the page's HTML as a string (exactly as you describe in your last paragraph).
Then you can use a library like Beautiful Soup to parse the HTML string and extract the data that you need.
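A sketch of that snapshot-then-parse approach, reusing the driver from the question:
from bs4 import BeautifulSoup

snapshot = driver.page_source              # one consistent copy of the page's HTML
soup = BeautifulSoup(snapshot, "html.parser")
mylist = []
for table in soup.find_all("table"):
    for tr in table.find_all("tr"):
        mylist.append(tr)                  # later page updates cannot change this copy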
After some time I found a solution:
1. Dump the page source into a string and save it locally as an HTML file.
2. Open the HTML file locally with the driver.
3. If you want to get back to the website, call driver.back().
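A rough sketch of those steps (the file name is just an example):
import os

with open("snapshot.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)                            # 1. dump the HTML to a local file
driver.get("file://" + os.path.abspath("snapshot.html"))   # 2. open the frozen copy
# ... run your find_element calls against the static snapshot here ...
driver.back()                                              # 3. return to the live website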

Selenium WebDriver Very Slow to Append WebElement Data to List

I'm trying to store webelement content to a python list. While it works, it's taking ~15min to process ~2,000 rows.
# Grab webelements via xpath
rowt = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/th[@class='listing-title']")
rowl = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/td[@class='listing-location']")
rowli = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr/th/a")
title = []
location = []
link = []
# Add webElement strings to lists
print('Compiling list...')
[title.append(i.text) for i in rowt]
[location.append(i.text) for i in rowl]
[link.append(i.get_attribute('href')) for i in rowli]
Is there a faster way to do this?
Your solution is parsing the table three times: once for the titles, once for the locations, and once for the links.
Try parsing the table just once. Have a selector for the row, then loop through the rows and, for each row, extract the 3 elements using a relative path. For the link, it would look like this:
link.append(row.find_element_by_xpath("./th/a").get_attribute('href'))
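Put together, a single-pass loop might look like the sketch below (the relative XPaths mirror the class names from the question):
rows = driver.find_elements_by_xpath("//tbody[@class='table-body']/tr")
title, location, link = [], [], []
for row in rows:
    title.append(row.find_element_by_xpath("./th[@class='listing-title']").text)
    location.append(row.find_element_by_xpath("./td[@class='listing-location']").text)
    link.append(row.find_element_by_xpath("./th/a").get_attribute('href'))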
Suggestions (apologies if they're not helpful):
I think Pandas can load HTML tables directly (sketched below). If your intent is to scrape a table, libraries like bs4 might also come in handy.
You could store the entire HTML and then parse it with a regex, because all the data you are extracting is enclosed in a fixed set of HTML tags.
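For instance, a sketch of the Pandas idea (read_html captures cell text, though not href attributes, and needs lxml or html5lib installed):
import pandas as pd

tables = pd.read_html(driver.page_source)   # one DataFrame per <table> in the HTML
df = tables[0]                              # picking the first table is an assumption
print(df.head())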
Depending on what you're trying to do, if the server that is presenting the page has an API, it would likely be significantly faster for you to use that to retrieve the data, rather than scraping the content from the page.
You could use the browser tools to see what the different requests are being sent to the server, and perhaps the data is being returned in a JSON form that you can easily retrieve your data from.
This, of course, assumes that you're interested in the data, not in verifying the content of the page directly.
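If such an endpoint exists, the request can be as simple as the sketch below (the URL and parameter are hypothetical placeholders for whatever the network tab shows):
import requests

response = requests.get("https://example.com/api/listings", params={"page": 1})
data = response.json()   # if the server returns JSON, no HTML parsing is needed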
I guess the slowest one is [location.append(i.text) for i in rowl].
When you call i.text, Selenium needs to determine what will be displayed in that element, so it needs more time to process.
You can use a workaround i.get_attribute('innerText') instead.
[location.append(i.get_attribute('innerText')) for i in rowl]
However, I can't guarantee that the result will be the same. (It should be the same as, or similar to, .text.)
I've tested this on my machine with ~2,000 rows: i.text took 80 sec., while i.get_attribute('innerText') took 28 sec.
Using bs4 would definitely help. Even though you have to locate the elements again with bs4, in my experience it is still faster than pulling text through Selenium, so I'd suggest trying it.
I.e., code like this would work:
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
elements = soup.find_all(...)  # fill in the tag/attributes you need
for element in elements:
    # do your job using element['target attribute']

Python - How to scrape Tr/Td table data using 'requests & BeautifulSoup'

I'm new to programming. I'm trying out my first Web Crawler program that will help me with my job. I'm trying to build a program that will scrape tr/td table data from a web page, but am having difficulties succeeding. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
def start(url):
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code)
    for table_data in soup.find_all('td', {'class': 'sorting_1'}):
        print(table_data)

start('http://www.datatables.net/')
My goal is to print out each line and then export it to an excel file.
Thank you,
-Cire
My recommendation is that if you are new to Python, play with things in an IPython notebook (interactive prompt) first, to get things working and to get a feel for them before you try writing a script or a function. On the plus side, all variables stick around and it is much easier to see what is going on.
From the screenshot here, you can see immediately that the find_all function is not finding anything: an empty list [] is being returned. By using IPython you can easily try other variants of a function on a previously defined variable, for example soup.find_all('td').
Looking at the source of http://www.datatables.net, I do not see any instances of the text sorting_1, so I wouldn't expect a search for all table cells of that class to return anything.
Perhaps that class appeared on a different URL associated with the DataTables website, in which case you would need to use that URL in your code. It's also possible that that class only appears after certain JavaScript has been run client-side (i.e. after certain actions with the sample tables, perhaps), and not on the initially loaded page.
I'd recommend starting with tags you know are on the initial page (seen by looking at the page source in your browser).
For example, currently, I can see a div with class="content". So the find_all code could be changed to the following:
for table_data in soup.find_all('div', {'class': 'content'}):
    print(table_data)
And that should find something.
Response to comments from OP:
The precise reason why you're not finding that tag/class pairing in this case is that DataTables renders the table client-side via JavaScript, generally after the DOM has finished loading (although it depends on the page and where the DataTables init code is placed). That means the HTML associated with the base URL does not contain this content. You can see this if you curl the base URL and look at the output.
However when loading it in a browser, once the JavaScript for DataTables fires, the table is rendered and the DOM is dynamically modified to add the table, including cells with the class for which you're looking.
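One way to confirm this without curl is a quick requests check against the base URL; the class should be absent from the server-rendered HTML:
import requests

raw_html = requests.get("http://www.datatables.net/").text
print("sorting_1" in raw_html)   # expected: False, since the table is built client-side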

How can I iterate through the pages of a website using Python?

I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is, I don't know how to iterate through all of the existing pages without knowing the individual urls ahead of time. For example, I want to visit every page whose url starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of urls?
Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data, not extracting it. Instead of copy-pasting the code already in the tutorial, I'll leave it to you to read it.
To grab a specific bit of data from a web site you could use some web scraping tool e.g., scrapy.
If required data is generated by javascript then you might need browser-like tool such as Selenium WebDriver and implement crawling of the links by hand.
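As a starting point, a Scrapy spider along those lines might look roughly like this sketch (the selectors and the /questions/ crawl rule are placeholders, not a definitive implementation):
import scrapy

class QuestionsSpider(scrapy.Spider):
    name = "questions"
    start_urls = ["http://stackoverflow.com/questions/"]

    def parse(self, response):
        # grab your "specific bit of data" here; the h1 selector is just an example
        yield {"title": response.css("h1::text").get()}
        # follow any link that stays under /questions/ and keep crawling
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/questions/"):
                yield response.follow(href, callback=self.parse)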
Alternatively, you can make a simple for loop to generate the URLs, like this:
def webIterate():
    base_link = "http://stackoverflow.com/questions/"
    for i in xrange(24):
        print "%s%d" % (base_link, i)
The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
http://stackoverflow.com/questions/3
...
http://stackoverflow.com/questions/23
It's just an example. You can pass in the number of questions and do whatever you want with the generated URLs.
