I am a 16-year-old student in Singapore. Due to Covid-19, I am unable to take the ACT on the planned date of September 11, and since test centers have closed down, the next slot is all the way in December, which is really bad for me. However, I know that a slot often opens up and stays free for a few hours until someone claims it. So I thought I could use Python to send me an email when one opens up, letting me get ahead of the curve.
I was really happy when I found the 'database' for the center where I want to take the test, link
In my default browser, where I am signed in, it looks like this
I was really excited, since from that I could write a script that checks every hour whether a September date (one beginning with 09) is present, and emails me.
However, when I use Beautiful Soup in Python, the output is 'authentication required'. The site authenticates with an email and password. Could anyone help me with how to authenticate when scraping with BeautifulSoup? Thanks!
import requests
from bs4 import BeautifulSoup

url = 'https://my.act.org/api/test-scheduling/ACTInternational/test-dates/YTLQVQEZ?roomType=REGULAR&testCenterId=YTLQVQEZ'

# Open the URL with a GET request.
resp = requests.get(url)

# We need a parser; Python's built-in HTML parser is enough.
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)
The authentication resides in the request. The API should specify which headers it needs to authenticate and authorize you. If you find the right headers, you can simply provide them in your request! Or, better yet, use the Basic authentication support from requests:
import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

url = 'https://my.act.org/api/test-scheduling/ACTInternational/test-dates/YTLQVQEZ?roomType=REGULAR&testCenterId=YTLQVQEZ'

# Open the URL with a GET request, now providing credentials
# (in your case, email and password).
resp = requests.get(url, auth=HTTPBasicAuth('user', 'pass'))

# We need a parser; Python's built-in HTML parser is enough.
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)
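If the site does not accept Basic auth and instead expects a session cookie or token, a hedged alternative is to copy the relevant headers from your browser's developer tools after logging in. The header names and values below are placeholders, not the actual ones the ACT API uses:

import requests
from bs4 import BeautifulSoup

url = 'https://my.act.org/api/test-scheduling/ACTInternational/test-dates/YTLQVQEZ?roomType=REGULAR&testCenterId=YTLQVQEZ'

# Hypothetical headers copied from the browser's Network tab while logged in;
# the real names and values depend on how the site actually authenticates.
headers = {
    'Authorization': 'Bearer <token-from-devtools>',
    'Cookie': '<session-cookie-from-devtools>',
}

resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)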
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("https://mail.google.com/mail/u/0/#label/Notes"), features="lxml")
print(soup.prettify())
The above doesn't work for me. I am not sure how to do authentication, or what web address to use inside the open call.
Google varies its login procedure from time to time to ensure security and prevent bot logins, which means there is code out there for logging in with BeautifulSoup, but it no longer works.
The reason code from 2016 and earlier will not work is that Gmail has since added JS-based encryption to its authentication, along with captchas and other measures, which would have to be reverse engineered in order to use BeautifulSoup. Basically, they removed the ability to just "scrape" the data.
If you want to download the emails for processing, it is possible using the Gmail API or the Google Cloud API, but this is not always an option and is largely beyond the scope of this question.
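For completeness, one route that avoids scraping entirely is IMAP with an app password; this is a different mechanism from the Gmail API mentioned above. A minimal sketch, assuming IMAP is enabled on the account and an app password has been generated:

import email
import imaplib

# Hypothetical credentials; Gmail requires an app password for IMAP logins.
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('you@gmail.com', 'your-app-password')
mail.select('INBOX')

# Print the subjects of the five most recent messages.
status, data = mail.search(None, 'ALL')
for num in data[0].split()[-5:]:
    status, msg_data = mail.fetch(num, '(RFC822)')
    msg = email.message_from_bytes(msg_data[0][1])
    print(msg['Subject'])

mail.logout()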
I want to build an API that accepts a string and returns HTML.
Here is the scraping code that I want to turn into a web service.
Code
from selenium import webdriver
import bs4
import time

url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)

# The PNR string to convert (it spans two lines).
string = ("3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E\n"
          "PS/JPIX8U")

# Paste the string into the input box and submit the form.
textarea = browser.find_element_by_xpath("//textarea[@class='dataInputChild']")
textarea.send_keys(string)  # accept string
textarea.submit()
time.sleep(5)

soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
html = soup.find('div', class_="main-content")  # returns html
print(html)
Can anyone tell me the best possible way to wrap this code up as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.
Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site. Have your site return a 302 Redirect with their URL in the Location field in your header.
If what you're trying to do is get the response for the one sample check you have hardcoded, and make that result available, I would suggest you put it in a static file behind nginx.
If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API once it becomes available. Read the documentation, use the requests library to hit the API endpoint you want, get the JSON result back, and format it to your needs.
If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.
For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily and scale up if you need to. You'll probably want WebObj or Flask or something similar sitting at the website where you intend to host this application; you can use those to turn what I presume will be a simple request into the string you wish to hit their API with. A rough Flask sketch follows.
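For illustration, a minimal sketch of that Flask route, wrapping the Selenium code from the question. The endpoint path and form field name are made up, and launching a browser per request is slow; treat this as a shape, not a production design:

import time

import bs4
from flask import Flask, request
from selenium import webdriver

app = Flask(__name__)

@app.route('/convert', methods=['POST'])  # hypothetical endpoint name
def convert():
    pnr = request.form['pnr']  # hypothetical form field name
    browser = webdriver.Firefox()
    try:
        browser.get('https://www.pnrconverter.com/')
        textarea = browser.find_element_by_xpath(
            "//textarea[@class='dataInputChild']")
        textarea.send_keys(pnr)
        textarea.submit()
        time.sleep(5)  # crude wait; WebDriverWait would be more robust
        soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
        return str(soup.find('div', class_='main-content'))
    finally:
        browser.quit()

if __name__ == '__main__':
    app.run()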
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately scraping from PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and should be ready in the not too distant future. If you contact us through the site we would be happy to work with you should you wish to use PNR Converter legitimately. PNR Converter gets at least one complete update per year and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests which are deemed as improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.
Like I said, should you wish to work with us at PNR Converter legitimately rather than scrape our content, we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API.
We are releasing a backend upgrade this weekend, which will have a different HTML structure, and dynamically named elements which will cause a serious issue for web scrapers!
I am trying to parse the website https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price and extract its most recent messages from its board. It is bot-protected with Cloudflare. I am using Python and the relevant libraries, and this is what I have so far:
from bs4 import BeautifulSoup as soup  # parses/cuts the html
import cfscrape
import requests

url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'

r = requests.get(url)
html = soup(r.text, "html.parser")

# Grab the container that holds the message-board posts.
containers = html.find("div", {"id": "bbPosts"})
print(containers.text.strip())
I am not able to use the HTML parser because the site then detects and blocks my script.
My questions are:
How can I parse the web pages to pull the table data?
Might I mention that this is for a security class I am taking. I am not using this for malicious reasons.
There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you.
One common way of blocking requests is to look at the User-Agent header, which the client (in your case, the requests library) uses to inform the server of its identity.
Generally speaking, a browser will say "I am a browser" and a library will say "I am a library", and the server can then say it allows browsers but not libraries to access its content.
However, for this particular case, you can simply lie to the server by sending your own User-Agent header.
You can see an example here; try using your own browser's user agent.
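A hedged sketch of that idea, with a plausible but made-up desktop browser User-Agent string; substitute the exact string your own browser sends (visible in its developer tools):

import requests

url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'

# Pretend to be a desktop browser; the string below is illustrative only.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/91.0.4472.124 Safari/537.36'),
}

r = requests.get(url, headers=headers)
print(r.status_code)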
Other blocking techniques include IP ranges. One way to bypass this is via a VPN. This is one of the easiest VPNs to set up: just spin up a machine on Amazon and get this container running.
Something else that could be happening: you might be requesting a single-page application that is not rendered server side. In that case, what you receive from the GET request is a very small HTML file that essentially references a JavaScript file. If so, what you need is an actual browser that you control programmatically. I would suggest you look at Google Chrome Headless, though there are others; you can also use Selenium, as in the sketch below.
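A rough sketch of that browser-automation route, driving headless Chrome through Selenium (assumes chromedriver is installed and on your PATH):

from selenium import webdriver

# Run Chrome without a visible window so it can be scripted on a server.
options = webdriver.ChromeOptions()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price')
    # page_source now contains the JavaScript-rendered HTML.
    print(driver.page_source[:500])
finally:
    driver.quit()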
Web crawling is a beautiful but very deep subject; I think these pointers should point you in the right direction.
Also, as a quick mention, my advice is to avoid from bs4 import BeautifulSoup as soup, since it shadows the conventional soup variable name; I would recommend html2text instead.
I am signing into my account at www.goodreads.com to scrape the list of books from my profile.
However, when I go to the goodreads page, even if I am logged in, my scraper gets only the home page. It cannot log in to my account. How do I redirect it to my account?
Edit:
from bs4 import BeautifulSoup
import urllib2

response = urllib2.urlopen('http://www.goodreads.com')
soup = BeautifulSoup(response.read())

# Strip out the script tags, then print the visible text.
[x.extract() for x in soup.find_all('script')]
print(soup.get_text())
If I run this code, I only get the home page; I cannot reach my profile, even though I am already logged in via the browser.
What do I do to log in from a scraper?
When you go to the site in a browser, there is something called a session that contains information about your account (not exactly, but something like that), and your browser sends it along, so every time you go to the main page you are logged in. Your code doesn't use sessions, so it has to do everything from the start:
1) go to the main page, 2) log in, 3) gather your data, as in the sketch below.
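A minimal sketch of that flow with a requests session, which stores the login cookies for you. The sign-in URL and form field names are hypothetical; inspect Goodreads' actual sign-in form to find the real ones:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# 1) Go to the main page; the session now holds any initial cookies.
session.get('https://www.goodreads.com')

# 2) Log in. The URL and field names below are guesses; check the real
#    sign-in form in your browser for the actual ones.
session.post('https://www.goodreads.com/user/sign_in', data={
    'user[email]': 'you@example.com',
    'user[password]': 'your-password',
})

# 3) Gather your data; the session cookies are sent automatically.
resp = session.get('https://www.goodreads.com/review/list/your-user-id')
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.title)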
This question also shows how to log in to your account.
I hope it helps.
Goodreads has an API that you might want to use instead of trying to log in and scrape the site's HTML. It's formatted in XML, so you can still use BeautifulSoup - just make sure you have lxml installed and use it as the parser. You'll need to register for a developer key, and also register your application, but then you're good to go.
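A hedged sketch of that approach. The endpoint and parameters are recalled from the Goodreads API docs rather than verified here, so check them against the documentation and substitute your own key and user id:

import requests
from bs4 import BeautifulSoup

# Hypothetical values: your registered developer key and your numeric user id.
params = {'v': '2', 'key': 'YOUR_DEVELOPER_KEY', 'id': 'YOUR_USER_ID'}
resp = requests.get('https://www.goodreads.com/review/list.xml', params=params)

# Parse the XML response; the 'xml' parser requires lxml, as noted above.
soup = BeautifulSoup(resp.text, 'xml')
for title in soup.find_all('title'):
    print(title.get_text())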
You can use the urllib2 or requests library to log in and then scrape the response. In my experience, requests is a lot easier to use.
Here's a good explanation on logging in using both urllib2 and requests:
How to use Python to login to a webpage and retrieve cookies for later usage?
I have to retrieve some text from a website called morningstar.com. To access that data, I have to log in. But when I log in and then request the URL of the page, I get the HTML a normal (not logged-in) user sees. As a result, I am not able to access that information. Any solutions?
BeautifulSoup is for parsing HTML once you've already fetched it. You can fetch the HTML using any standard URL-fetching library; I prefer curl (which you tagged your post with), and Python's built-in urllib2 also works well.
If you're saying that the response HTML after logging in is the same as for users who are not logged in, I'd guess that your login is failing for some reason. If you are using urllib2, are you making sure to store the cookie properly after your first login and then pass that cookie to urllib2 when you send the request for the data?
It would help if you posted the code you are using to make the two requests (the initial login, and the attempt to fetch the data).
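In the meantime, a minimal sketch of that cookie handling, written with Python 3's urllib equivalents (urllib2 became urllib.request in Python 3). The login URL and form field names are placeholders:

import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Build an opener that stores cookies and replays them on later requests.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Log in; the URL and field names are hypothetical, taken from the site's form.
login_data = urllib.parse.urlencode({
    'username': 'you@example.com',
    'password': 'your-password',
}).encode()
opener.open('https://www.morningstar.com/login', login_data)

# Subsequent requests through the same opener carry the session cookie.
resp = opener.open('https://www.morningstar.com/some-protected-page')
print(resp.read()[:500])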