from bs4 import BeautifulSoup
soup = BeautifulSoup(open("https://mail.google.com/mail/u/0/#label/Notes"), features="lxml")
print(soup.prettify())
The above doesn't work for me. I am not sure how to do authentication or what web address to use inside the open() call.
Google varies its login procedure from time to time to improve security and prevent bot logins, which means there is code out there that once logged in using BeautifulSoup but no longer works.
The reason that code from 2016 and earlier will not work is that Gmail has added JavaScript-based encryption to its authentication flow, along with captchas and other measures, all of which would have to be reverse engineered in order to use BeautifulSoup. In effect, this removes the ability to simply "scrape" the data.
If you want to download the emails for processing, it is possible using the Gmail API or the Google Cloud APIs, but this is not always an option and is beyond the scope of this question.
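For reference, the sanctioned route looks roughly like this: a minimal sketch using the official google-api-python-client and google-auth-oauthlib packages, assuming you have enabled the Gmail API in the Google Cloud Console and downloaded an OAuth client file (the 'credentials.json' name here is a placeholder).
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

# 'credentials.json' is a placeholder for the OAuth client file you
# download from the Google Cloud Console after enabling the Gmail API.
flow = InstalledAppFlow.from_client_secrets_file(
    'credentials.json', ['https://www.googleapis.com/auth/gmail.readonly'])
creds = flow.run_local_server(port=0)

service = build('gmail', 'v1', credentials=creds)

# List message IDs under a label, then fetch each message.
results = service.users().messages().list(userId='me', labelIds=['INBOX']).execute()
for ref in results.get('messages', []):
    msg = service.users().messages().get(userId='me', id=ref['id']).execute()
    print(msg['snippet'])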
I am a 16 year old student in Singapore. Due to Covid-19, I am unable to take the ACT on the planned date of September 11. However, since test centers have closed down, the next slot is all the way in December, which is really bad for me. I know that occasionally a slot opens up for a few hours until someone claims it, so I thought I could use Python to send me an email when one opens up, putting me ahead of the curve.
I was really happy when I found the 'database' for the center where I want to take the test (link).
In my default browser, where I am signed in, it shows the available test dates.
I was really excited, since from that I could write a script to check every hour whether a September (09...) date was present and email me.
However, when I use Beautiful Soup in Python, the output is "authentication required". The authentication for the website is email and password. Could anyone help me as to how to authenticate using BeautifulSoup? Thanks!
import requests
from bs4 import BeautifulSoup

url = 'https://my.act.org/api/test-scheduling/ACTInternational/test-dates/YTLQVQEZ?roomType=REGULAR&testCenterId=YTLQVQEZ'

# open with GET method
resp = requests.get(url)

# we need a parser; Python's built-in HTML parser is enough
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)
The authentication resides in the request. The API should specify which headers it needs to be authenticated and authorized. If you find the right headers, you can simply provide them in your request! Or, better yet, try the Basic authentication support in requests:
import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

url = 'https://my.act.org/api/test-scheduling/ACTInternational/test-dates/YTLQVQEZ?roomType=REGULAR&testCenterId=YTLQVQEZ'

# open with GET method; we can now provide credentials, in your case email and password
resp = requests.get(url, auth=HTTPBasicAuth('user', 'pass'))

# we need a parser; Python's built-in HTML parser is enough
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)
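If the site uses token- or cookie-based authentication rather than HTTP Basic, you can instead copy the relevant headers from your browser's developer tools (Network tab, while logged in) and attach them yourself. A sketch, where the header names and token value are placeholders rather than what my.act.org actually expects:
import requests

url = 'https://my.act.org/api/test-scheduling/ACTInternational/test-dates/YTLQVQEZ?roomType=REGULAR&testCenterId=YTLQVQEZ'

# Placeholder headers: copy the real names and values from the Network
# tab of your browser's developer tools while you are logged in.
headers = {
    'Authorization': 'Bearer YOUR_TOKEN_HERE',
    'Accept': 'application/json',
}
resp = requests.get(url, headers=headers)
print(resp.status_code)
print(resp.text[:500])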
I'm doing a personal project where I am trying to scrape HTML tables from a financial data website using Python. I can successfully use the requests package to access public websites and extract information (using BeautifulSoup4 afterwards for processing); the code I am using is shown below:
import requests

# access the website; example_header stands in for your real request headers
url = 'https://financial-data-url.ezproxy1.library.uniname.edu.com/path/to/financial/data'
headers = example_header
page = requests.get(url, headers=headers)
However, accessing the website normally requires logging in through my university library's EZproxy server (shown in the example URL). When I request the URL of the financial data page after getting access through the library database, it returns what seems to be the university library's EZproxy page, where I need to click "login" before being directed to the financial data page.
Is there some credential I may be missing in the request, or potentially a different way of passing the proxy server in the URL, so that the request does not end up on the proxy server login page?
I found that the fastest and most effective workaround for this problem is to use the Selenium browser-automation package (https://selenium-python.readthedocs.io/).
Selenium makes it very easy to replicate a login, as well as navigation within the browser, just as a person would. In my opinion, its simplicity can far outweigh the benefits of calling the web page directly, depending on the use case: it is not efficient when speed is the primary goal, but if that is not a major constraint it works quite well.
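As a rough sketch of what that login replication can look like (the login URL and element IDs below are placeholders; inspect your library's actual login page to find the real ones):
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()

# Placeholder URL: the EZproxy login page of your university library.
browser.get('https://ezproxy1.library.uniname.edu.com/login')

# Placeholder element IDs: inspect the real login form to find them.
browser.find_element_by_id('username').send_keys('my_library_id')
browser.find_element_by_id('password').send_keys('my_password')
browser.find_element_by_id('login').click()

# After the redirect, the financial data page can be parsed as usual.
browser.get('https://financial-data-url.ezproxy1.library.uniname.edu.com/path/to/financial/data')
soup = BeautifulSoup(browser.page_source, 'html.parser')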
I want to build an API that accepts a string and returns HTML code.
Here is the scraping code that I want to expose as a web service.
Code
from selenium import webdriver
import bs4
import time

url = "https://www.pnrconverter.com/"
browser = webdriver.Firefox()
browser.get(url)

string = "3 PS 232 M 03FEB 7 JFKKBP HK2 1230A 420P 03FEB E\nPS/JPIX8U"

button = browser.find_element_by_xpath("//textarea[@class='dataInputChild']")
button.send_keys(string)  # accept string
button.submit()
time.sleep(5)

soup = bs4.BeautifulSoup(browser.page_source, 'html.parser')
html = soup.find('div', class_="main-content")  # returns html
print(html)
Can anyone tell me the best possible way to wrap up my code as an API/web service?
There's no best possible solution in general, because a solution has to fit the problem and the available resources.
Right now it seems like you're trying to wrap someone else's website. If that's the problem you're actually trying to solve, and you want to give credit, you should probably just forward people to their site: have your site return a 302 Redirect with their URL in the Location header.
If what you're trying to do is get the response from this one sample check you have hardcoded, and make that result available, I would suggest you put it in a static file behind nginx.
If what you're trying to do is use their backend to turn itineraries you have into responses you can return, you can do that by using their backend API, once that becomes available. Read the documentation, use the requests library to hit the API endpoint that you want, and get the JSON result back, and format it to your desires.
If you're trying to duplicate their site by making yourself a man-in-the-middle, that may be illegal and you should reconsider what you're doing.
For hosting purposes, you need to figure out how often your API will be hit. You can probably start on Heroku or something similar fairly easily, and scale up if you need to. You'll probably want WebObj or Flask or something similar sitting at the website where you intend to host this application. You can use those to process what I presume will be a simple request into the string you wish to hit their API with.
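To make that last point concrete, a minimal Flask sketch that exposes the scraper as a web service could look like the following; scrape_pnr is a hypothetical function wrapping the Selenium code from the question, not something that exists already:
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/convert', methods=['POST'])
def convert():
    # The client POSTs the raw itinerary string as form field 'pnr'.
    pnr = request.form.get('pnr', '')
    html = scrape_pnr(pnr)  # hypothetical wrapper around the Selenium code above
    return jsonify({'html': html})

if __name__ == '__main__':
    app.run(port=5000)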
I am the owner of PNR Converter, so I can shed some light on your attempt to scrape content from our site. Unfortunately scraping from PNR Converter is not recommended. We are developing an API which looks like it would suit your needs, and should be ready in the not too distant future. If you contact us through the site we would be happy to work with you should you wish to use PNR Converter legitimately. PNR Converter gets at least one complete update per year and as such we change all the code on a regular basis. We also monitor all requests to our site, and we will block any requests which are deemed as improper usage. Our filter has already picked up your IP address (ends in 250.144) as potential misuse.
Like I said, should you wish to work with us at PNR Converter legitimately rather than scrape our content, we would be happy to do so! Please keep checking https://www.pnrconverter.com/api-introduction for information relating to our API.
We are releasing a backend upgrade this weekend, which will have a different HTML structure, and dynamically named elements which will cause a serious issue for web scrapers!
I am trying to parse the website "https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price" and extract the most recent messages from its board. It is bot-protected with Cloudflare. I am using Python and the relevant libraries, and this is what I have so far:
from bs4 import BeautifulSoup as soup  # parses/cuts the html
import cfscrape
import requests

url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'

r = requests.get(url)
html = soup(r.text, "html.parser")

containers = html.find("div", {"id": "bbPosts"})
print(containers.text.strip())
I am not able to use the HTML parser directly because the site then detects and blocks my script.
My questions are:
How can I parse the web pages to pull the table data?
Might I mention that this is for a security class I am taking. I am not using this for malicious reasons.
There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you.
One common way of blocking requests is to look at the User-Agent header. The client (in your case the requests library) informs the server about its identity.
Generally speaking, a browser will say "I am a browser" and a library will say "I am a library", and the server can then decide to allow browsers but not libraries to access its content.
However, for this particular case, you can simply lie to the server by sending your own User-Agent header.
You can see an example here; try using your browser's user agent.
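For instance (the User-Agent string below is only an illustration; use the one your own browser actually sends):
import requests

url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'

# An illustrative desktop-browser User-Agent string; replace it with
# the one your own browser sends.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0.0.0 Safari/537.36')
}
r = requests.get(url, headers=headers)
print(r.status_code)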
Other blocking techniques include IP ranges. One way to bypass this is via a VPN; this is one of the easiest VPNs to set up: just spin up a machine on Amazon and get this container running.
Another possibility is that you are trying to access a single-page application that is not rendered server-side. In that case, what you receive from that GET request is a very small HTML file that essentially references a JavaScript file. If so, what you need is an actual browser that you control programmatically. I would suggest you look at headless Google Chrome, though there are others; you can also use Selenium, as sketched below.
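A minimal sketch of driving headless Chrome with Selenium, assuming chromedriver is installed and on your PATH (the URL is a placeholder):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
driver.get('https://example.com/spa-page')  # placeholder single-page app URL

# page_source now contains the JavaScript-rendered HTML.
print(driver.page_source[:500])
driver.quit()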
Web crawling is a beautiful but very deep subject. I think these pointers should set you in the right direction.
Also, as a quick mention: my advice is to avoid "from bs4 import BeautifulSoup as soup". I would recommend html2text.
I am trying to crawl a website for the first time, using Python's urllib2.
I am currently trying to log into the Foursquare social networking site using Python's urllib2 and BeautifulSoup. To view a particular page, I need to provide a username and password.
So, I followed the Basic Authentication described on the documentation page.
I guess everything worked well, but the site throws up a security check asking me to type some text (a captcha) before sending me the required page. It obviously looks like the site is detecting that the page is being requested not by a human but by a crawler.
So, what is the way to avoid being detected? How can I make urllib2 get the desired page without having to stop at the security check? Please help.
You probably want to use the Foursquare API instead.
You have to use the Foursquare API. I guess there is no other way; APIs are designed for such purposes.
Crawlers that depend solely on the HTML format of the page will fail in the future when the HTML changes.
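For illustration, a hedged sketch of calling the Foursquare v2 venue-search endpoint with requests rather than scraping; the credentials are placeholders you obtain by registering an app at developer.foursquare.com:
import requests

# Placeholder credentials from your registered Foursquare app.
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20200101',      # API version date required by Foursquare
    'll': '40.7,-74.0',   # latitude,longitude to search around
    'query': 'coffee',
}
r = requests.get('https://api.foursquare.com/v2/venues/search', params=params)
print(r.json())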