I'm new to web scraping and I want to retrieve all the wins and losses for this NHL season. This URL works fine: https://www.nhl.com/scores ... but the problem arises when I try to go back to previous dates, like https://www.nhl.com/scores/2022-09-24 ... which is the URL that shows up when I interact with the date buttons on the first page. You can see for yourself. I know I'm missing something here, but it's not obvious to me. Please enlighten me.
I then tried to see if there was a way to use https://www.nhl.com/scores/ to obtain the information I need, but I am having trouble accessing that data.
I'd recommend not using the URLs to access a specific date's data. Instead, open your browser's dev tools -> Network tab, filter for Fetch/XHR requests, and watch what API calls are made whenever you click on a date. Then you can call that API endpoint directly from your Python script and parse the JSON response.
You can use the requests library for this: https://requests.readthedocs.io/en/latest/
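For example, once you spot the endpoint in the Network tab, the script is just a GET plus JSON parsing. Note that the endpoint URL and the response keys below are placeholders I made up; substitute whatever you actually see in dev tools:

import requests

# Placeholder endpoint: replace with the real URL you see in the
# Fetch/XHR tab after clicking a date on nhl.com/scores.
url = "https://example-nhl-endpoint/scores?date=2022-09-24"

response = requests.get(url, timeout=10)
response.raise_for_status()  # raise on 4xx/5xx instead of failing silently
data = response.json()       # parse the JSON body into Python dicts/lists

# Walk the structure you observed in dev tools, e.g. (made-up keys):
# for game in data["games"]:
#     print(game["awayTeam"], game["awayScore"], "@",
#           game["homeTeam"], game["homeScore"])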
I am trying to automate a test case where I first log in to a website and then modify a drop-down. I have automated everything up to the point where I reach the correct page. The issue is that this drop-down option is part of a table, and the table has many similar elements. Also, a few separate table tags are used, since the first column is locked and the others are horizontally scrollable. I am finding it very difficult to locate the correct drop-down option to click.
When I check in the developer tools, the drop-down modification is actually a POST request.
I know we can do API testing with pytest, but is it possible to integrate this within an existing Selenium framework?
Can I create a framework in which test_navigate navigates me to the necessary page (pure Selenium), then test_modify_dropdown uses an API call to send the POST request and modify the option, and then I continue with a further test_three?
All this in pytest, by the way.
You should simply be able to use the Python requests module. A POST request could look like this:
import requests

url = "url/for/your/api"
payload = {'somekey': 'somevalue'}  # JSON request payload

response = requests.post(url, json=payload)
print(response.text)  # and/or pull whatever data you want from the response
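To the pytest question: yes, the two can live in one framework. A rough sketch, assuming a hypothetical endpoint and payload; the cookie-copying step matters if the POST needs your logged-in session:

import pytest
import requests
from selenium import webdriver

API_URL = "https://example.com/api/modify-dropdown"  # hypothetical endpoint

@pytest.fixture(scope="module")
def browser():
    driver = webdriver.Chrome()
    yield driver
    driver.quit()

def test_navigate(browser):
    # Pure Selenium: log in and navigate to the page under test.
    browser.get("https://example.com/login")
    # ... fill credentials, click through to the target page ...

def test_modify_dropdown(browser):
    # Reuse the logged-in Selenium session for the API call by
    # copying its cookies into a requests.Session.
    session = requests.Session()
    for cookie in browser.get_cookies():
        session.cookies.set(cookie["name"], cookie["value"])
    response = session.post(API_URL, json={"optionId": 42})  # made-up payload
    assert response.status_code == 200

pytest runs tests in file order by default, so test_navigate executes before test_modify_dropdown; if the steps depend on each other like this, consider merging them into one test to avoid inter-test dependencies.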
So I am building a Django web app, and I want to allow users to search for their desired cryptocurrency and then return its price. I plan on getting the price from Coinbase or some other site that already presents this information. How would I go about this? I figure I would have to write the script that gets the price under views.py. What would be the best approach? Can I add a web scraping script that already does this to Django? Or would I have to connect, say, Coinbase's API to my Django project? If so, how do I do this?
If you're looking at using an API from a service to get these prices, then the requests library is something you can look at.
If you're looking at scraping the data from a page, then you'll probably want to look at BeautifulSoup or Scrapy, or one step further, Selenium.
As for where you call it, that's up to you. If it's data that you're always going to need, you could look at running your script as a background task or worker, so you always have an up-to-date price. Otherwise you could trigger the script on demand and wait for the response to come back. There are drawbacks to both, and I'm guessing that if the site doesn't provide a managed API endpoint for the info you need, it will probably block your requests if you make too many of them.
But that's a starter for ten.
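If you go the API route, a minimal Django view might look something like this. It assumes Coinbase's public spot-price endpoint (https://api.coinbase.com/v2/prices/<pair>/spot) and its response shape; verify both against their current docs before relying on them:

# views.py
import requests
from django.http import JsonResponse

def crypto_price(request):
    # Currency pair comes from the user's search, e.g. ?pair=BTC-USD
    pair = request.GET.get("pair", "BTC-USD")
    # Assumed public Coinbase endpoint; check their documentation.
    url = f"https://api.coinbase.com/v2/prices/{pair}/spot"
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return JsonResponse({"error": "price lookup failed"}, status=502)
    data = response.json()
    return JsonResponse({"pair": pair, "price": data["data"]["amount"]})

You'd then wire this view into urls.py and have the front end call it with the user's search term.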
What I am doing:
Scraping Amazon products such as https://www.amazon.com.br/dp/B000F5NNKE.
Problem:
The scraper starts out fine and the pages load fully for a while. However, Amazon eventually figures out that I am scraping it and, instead of blocking me, starts sending back source code that no longer contains the two sections in the image below, namely "Descrição do produto" ("Product description", in black) and "Informações sobre o produto" ("Product information", below it, in orange).
What I need:
Changing user agents and proxies only works for a while, so ideally I want a way to send a request to a specific URL that returns these sections directly, instead of trying to make Amazon include them in the page source.
What I have tried:
I have looked at the XHR tab under the Network tab of the developer tools, but the requests were indecipherable to me.
Question:
Is there another way I can figure out exactly which request loads these sections, so that I don't need to resort to something more drastic like JavaScript rendering?
I want to build an API that accepts a string and returns HTML code.
Here is my scraping code that I want to expose as a web service.
Code
from selenium import webdriver
import bs4
import time

url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)

# Two-line PNR segment, joined with an explicit newline
string = "3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E\nPS/JPIX8U"

textarea = browser.find_element_by_xpath("//textarea[@class='dataInputChild']")
textarea.send_keys(string)  # paste the PNR string into the form
textarea.submit()
time.sleep(5)  # wait for the page to render the result

soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
html = soup.find('div', class_="main-content")  # the rendered HTML result
print(html)
Can anyone tell me the best possible way to wrap my code up as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.
Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site: have your site return a 302 Redirect with their URL in the Location response header.
If what you're trying to do is get the response for this one sample check you have hardcoded and make that result available, I would suggest you put it in a static file behind nginx.
If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API, once that becomes available. Read the documentation, use the requests library to hit the API endpoint that you want, and get the JSON result back, and format it to your desires.
If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.
For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily, and scale up if you need to. You'll probably want WebObj or Flask or something similar sitting at the website where you intend to host this application. You can use those to process what I presume will be a simple request into the string you wish to hit their API with.
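To make that last paragraph concrete: one way to wrap the scraper from the question in a web service is a small Flask app like the sketch below. The route name and request format are my own choices, not a fixed convention, and it mirrors the Selenium 3-style locators from the question (newer Selenium uses find_element(By.XPATH, ...)):

# Minimal Flask wrapper around the scraper from the question (a sketch).
from flask import Flask, request
from selenium import webdriver
import bs4
import time

app = Flask(__name__)

@app.route("/convert", methods=["POST"])
def convert():
    pnr_string = request.get_data(as_text=True)  # raw string from the client
    browser = webdriver.Firefox()
    try:
        browser.get("https://www.pnrconverter.com/")
        textarea = browser.find_element_by_xpath(
            "//textarea[@class='dataInputChild']")
        textarea.send_keys(pnr_string)
        textarea.submit()
        time.sleep(5)  # crude wait; an explicit WebDriverWait would be better
        soup = bs4.BeautifulSoup(browser.page_source, "html.parser")
        result = soup.find("div", class_="main-content")
        return str(result), 200, {"Content-Type": "text/html"}
    finally:
        browser.quit()

if __name__ == "__main__":
    app.run()

Launching a fresh browser per request keeps the example simple but won't scale; a worker pool or a queue would be the next step.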
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately, scraping PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and it should be ready in the not-too-distant future. If you contact us through the site, we would be happy to work with you should you wish to use PNR Converter legitimately. PNR Converter gets at least one complete update per year, and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests deemed improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.
Like I said, should you wish to work with us at PNR Converter legitimately rather than scrape our content, we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API.
We are releasing a backend upgrade this weekend, which will have a different HTML structure, and dynamically named elements which will cause a serious issue for web scrapers!
I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name, but when the song changes, the HTML updates itself, and I've already opened the file via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and re-opening the URL in a while(1) loop to get the updated HTML, or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly because that will be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to send it an HTTP request, and it will send back an HTTP response containing the page. If your connection is keep-alive (the default as of HTTP/1.1), you can send the same request again over the same connection and get the page's current state.
What I'd recommend: depending on your needs, fetch the page every n seconds and extract the data you need. If the site provides an API, you can capitalize on that instead. Also, if it's your own site, you might be able to implement Comet-style Ajax over HTTP and get a true stream.
Also note that if it's someone else's page, the site may use Ajax via JavaScript to keep itself up to date; that means other requests are causing the updates, and you may need to dissect the website to figure out which requests you need to make to get the data.
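A minimal polling loop along those lines, using requests and BeautifulSoup (the URL and CSS selector are placeholders for whatever the radio site actually uses):

import time
import requests
import bs4

URL = "https://example-radio-site/now-playing"  # placeholder

last_artist = None
while True:
    response = requests.get(URL, timeout=10)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    tag = soup.select_one(".artist-name")  # placeholder selector
    artist = tag.get_text(strip=True) if tag else None
    if artist and artist != last_artist:
        print("Now playing:", artist)
        last_artist = artist
    time.sleep(30)  # be gentle on the server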
If you use urllib2 you can set conditional headers (such as If-Modified-Since) when you make the request. If the server sends back a "304 Not Modified" response, the content hasn't changed since your last fetch, so you can skip re-parsing it.
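The same conditional-request idea, sketched with the requests library for brevity (the URL is a placeholder, and it only helps if the server actually sends Last-Modified and honors If-Modified-Since):

import requests

URL = "https://example.com/page"  # placeholder

def fetch_if_changed(last_modified=None):
    # Returns (text, last_modified); text is None if the page is unchanged.
    headers = {"If-Modified-Since": last_modified} if last_modified else {}
    response = requests.get(URL, headers=headers, timeout=10)
    if response.status_code == 304:
        return None, last_modified  # unchanged since last fetch
    return response.text, response.headers.get("Last-Modified")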
Yes, this is the correct approach. To see changes on the web, you have to send a new query each time; live Ajax sites do exactly the same thing internally.
Some sites provide an additional API, including long polling. Look for documentation on the site, or ask their developers whether one exists.
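If a long-polling endpoint does exist, the client side is usually just a loop that re-issues a request the server holds open until it has new data. A generic sketch with a hypothetical endpoint:

import requests

URL = "https://example.com/api/updates"  # hypothetical long-poll endpoint

while True:
    try:
        # The server holds this request open until new data arrives
        # (or its own timeout expires), then responds; we reconnect at once.
        response = requests.get(URL, timeout=60)
        if response.status_code == 200:
            print("Update:", response.json())
    except requests.exceptions.Timeout:
        pass  # no update within the window; poll again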