I've been tasked with creating a cookie audit tool that crawls our entire website, gathers data on all cookies on each page, and categorizes them according to whether or not they track user data. I'm new to Python, but I think this will be a great project for me. Would BeautifulSoup be a suitable tool for the job? We have tons of sites and are currently migrating to Drupal, so it would have to be able to scan both Polopoly CMS and Drupal.
urllib2 is for submitting HTTP requests; BeautifulSoup is for parsing HTML. You'll definitely need an HTTP request library, and you may need BeautifulSoup as well, depending on what exactly you want to do.
BeautifulSoup is extremely easy to use and parses broken HTML well, so it would be good for grabbing the links to any JavaScript on a page (even where the HTML is malformed). You'll then need something else to parse the JavaScript and figure out whether it interacts with cookies.
To see what the cookie values are on the client side, just look at the HTTP headers, or use cookielib (although I've personally not used that library).
For HTTP requests I recommend the requests library; looking at the response headers is as simple as:
import requests
response = requests.get(url)
header = response.headers
requests also exposes the cookies the server sets directly, via response.cookies, so you may not need to dig through the raw Set-Cookie headers at all.
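If you do end up with a raw Set-Cookie header string, the standard library can parse it for you. A minimal sketch using http.cookies (the cookie value here is made up):

```python
from http.cookies import SimpleCookie

# Parse a raw Set-Cookie header value, as it would appear in response.headers
raw = "sessionid=abc123; Path=/; HttpOnly"
cookie = SimpleCookie()
cookie.load(raw)

for name, morsel in cookie.items():
    print(name, morsel.value, morsel["path"])
# sessionid abc123 /
```

Each morsel also carries the attributes you'd want for a tracking audit (domain, expiry, secure, httponly).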
I don't think you need BeautifulSoup for this. You could do it with urllib2 for the connection and cookielib for the operations on cookies.
You don't need bs4 for this purpose, because you only need information from the cookies. (Use bs4 only if you ultimately need to extract something from the HTML itself.)
For the cookie handling I would use python-requests and its support for HTTP sessions: http://docs.python-requests.org/en/latest/user/advanced/
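A minimal sketch of the session interface (the cookie name, value, and domain below are made up; normally a response from the server would populate the jar for you):

```python
import requests

# A Session keeps cookies from every response and resends them automatically
# on subsequent requests to the same site.
session = requests.Session()

# Seed a cookie by hand just to show the jar interface;
# normally session.get(some_url) would fill this in from Set-Cookie headers.
session.cookies.set("tracking_id", "xyz", domain="example.com")

print(dict(session.cookies))
```

The session's cookie jar is what you would iterate over to audit which cookies each site sets.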
Related
I am trying to parse the website https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price and extract its most recent messages from its board. It is bot-protected with Cloudflare. I am using Python and the relevant libraries, and this is what I have so far:
from bs4 import BeautifulSoup as soup #parses/cuts the html
import cfscrape
import requests
url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'
r=requests.get(url)
html = soup(r.text, "html.parser")
containers = html.find("div",{"id":"bbPosts"})
print(containers.text.strip())
I am not able to use the HTML parser because the site detects my script and blocks it.
My questions are:
How can I parse the web pages to pull the table data?
Might I mention that this is for a security class I am taking. I am not using this for malicious reasons.
There are multiple ways of bypassing the site's protection. You have to see exactly how they are blocking you.
One common way of blocking requests is to look at the User-Agent header. The client (in your case the requests library) will inform the server of its identity.
Generally speaking, a browser will say "I am a browser" and a library will say "I am a library". The server can then decide to allow browsers but not libraries to access its content.
However, in this particular case, you can simply lie to the server by sending your own User-Agent header.
You can see an example here. Try using your browser's user agent.
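For instance, with the requests library you can attach the header yourself. The UA string below is only an example; copy whatever your own browser actually sends:

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
}

# Preparing the request lets us inspect the headers without sending anything
prepared = requests.Request("GET", "https://example.com", headers=headers).prepare()
print(prepared.headers["User-Agent"])

# To actually send it: response = requests.get(url, headers=headers)
```

Whether this is enough depends on the site; Cloudflare in particular often checks more than the User-Agent.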
Other blocking techniques include IP ranges. One way around those is a VPN. This is one of the easiest VPNs to set up: just spin up a machine on Amazon and get this container running.
Something else that could be happening: you might be accessing a single-page application that is not rendered server-side. In that case, what you receive from the GET request is a very small HTML file that essentially just references a JavaScript file. If so, what you need is an actual browser that you control programmatically. I would suggest you look at Google Chrome Headless, although there are others. You can also use Selenium.
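A rough heuristic for spotting such a shell page (the HTML below is a made-up example of what a client-rendered app typically returns):

```python
from bs4 import BeautifulSoup

# Typical shell returned by a client-rendered single-page app:
# almost no content, just a mount point and a script reference.
shell_html = """
<html><head><script src="/static/app.js"></script></head>
<body><div id="root"></div></body></html>
"""

soup = BeautifulSoup(shell_html, "html.parser")
visible_text = soup.get_text(strip=True)
scripts = soup.find_all("script")

# Barely any text but at least one script tag: likely rendered client-side
if not visible_text and scripts:
    print("Looks like a single-page app; a real (headless) browser may be needed")
```

If the page you fetched fails this check, plain requests plus BeautifulSoup will never see the board messages, because they are injected by JavaScript after load.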
Web crawling is a beautiful but very deep subject. I think these pointers should set you in the right direction.
Also, as a quick aside, my advice is to avoid from bs4 import BeautifulSoup as soup; aliasing the class just makes the code harder to read. I would also recommend looking at html2text.
I want to programmatically parse some Web pages on sites that do not have a publicly available API.
For example check if my grades are ready at the university.
Has anyone done anything like this and arrived at a usable solution? I'm probably looking for a library written in Python or something similar, right?
Also note that some of these sites require a login and/or SSL. How would you suggest handling that?
I would recommend using urllib or urllib2; they allow you to send/receive requests and give you file-like response objects whose HTML you can easily parse.
import urllib
proxies = {'http': 'http://proxy.example.com:8080/'}
opener = urllib.FancyURLopener(proxies)
f = opener.open("http://www.python.org")
f.read()
More information on how to use it : https://docs.python.org/2/library/urllib.html
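Note that FancyURLopener is Python 2 only; on Python 3 the equivalent functionality lives in urllib.request. A sketch (the proxy address is a placeholder, as in the example above):

```python
import urllib.request

# Route HTTP traffic through a proxy, as FancyURLopener(proxies) did on Python 2
proxies = {"http": "http://proxy.example.com:8080/"}
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))

# f = opener.open("http://www.python.org")
# html = f.read()
```

Without the ProxyHandler argument, build_opener() gives you a plain opener that behaves like urlopen.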
I am currently trying to create a bot for the Betfair trading site. It involves using the Betfair API, which uses SOAP; the new API-NG will use JSON, so I can understand how to access the information that I need.
My question is: using Python, what would be the best way to get information from a website that serves plain HTML? Can I convert it somehow, maybe to XML, or what is the best/easiest way?
JSON, XML and basically all of this is new to me, so any help will be appreciated.
This is one of the websites I am trying to access to get horse names and prices:
http://www.oddschecker.com/horse-racing-betting/chepstow/14:35/winner
I know there are some similar questions, but looking at the answers and the source of the above page, I am no nearer to figuring out how to get the info I need.
For getting HTML from a website there are two well-used options.
urllib2: this is built in.
requests: this is third-party but really easy to use.
If you then need to parse your HTML, I would suggest using Beautiful Soup.
Example:
import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com'
page_request = requests.get(url)
page_source = page_request.text
soup = BeautifulSoup(page_source, "html.parser")
The page_source is just the raw HTML of the page and not much use on its own; the soup object, on the other hand, can be used to access different parts of the page programmatically.
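For example, with a made-up fragment of HTML standing in for the page source, the soup object lets you pull out elements by tag or attribute:

```python
from bs4 import BeautifulSoup

# A made-up fragment standing in for the real page source
html = '<div class="odds"><a href="/horse/1">Dancer</a><a href="/horse/2">Comet</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Collect the link text and targets from every anchor tag
names = [a.text for a in soup.find_all("a")]
links = [a["href"] for a in soup.find_all("a")]
print(names)  # ['Dancer', 'Comet']
print(links)  # ['/horse/1', '/horse/2']
```

On the real oddschecker page you would first inspect the markup in your browser to find which tags and classes hold the horse names and prices, then adjust the find_all calls to match.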
I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python. I thought it was a user agent issue, but changing that did not help. Then I thought it may have something to do with cookies, but apparently loading the page through links with cookies turned off works fine. What may be blocking requests through urllib?
http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:
import urllib2
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author#example.com>')
opener = urllib2.build_opener()
content = opener.open(r).read()
Refusing requests without an Accept header is incorrect; RFC 2616 clearly states:
If no Accept header field is present, then it is assumed that the
client accepts all media types.
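On Python 3, the same request is built with urllib.request, and the Accept-header trick carries over unchanged:

```python
import urllib.request

req = urllib.request.Request("http://www.nseindia.com/")
req.add_header("Accept", "*/*")
req.add_header("User-Agent", "My scraping program <author#example.com>")

# The headers are attached to the request before it is ever sent:
print(req.get_header("Accept"))  # */*

# content = urllib.request.urlopen(req).read()
```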
I have to retrieve some text from a website called morningstar.com. To access that data I have to log in. But once I log in and provide the URL of the web page, I get the HTML text a normal (not logged-in) user would see. As a result I am not able to access that information. Any solutions?
BeautifulSoup is for parsing HTML once you've already fetched it. You can fetch the HTML using any standard URL-fetching library. I prefer curl; since you tagged your post python, the built-in urllib2 also works well.
If you're saying that after logging in the response HTML is the same as for those who are not logged in, I'd guess that your login is failing for some reason. If you are using urllib2, are you making sure to store the cookie properly after your first login and then passing that cookie to urllib2 when you send the request for the data?
It would help if you posted the code you are using to make the two requests (the initial login, and the attempt to fetch the data).
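As a sketch of the cookie handling (the URLs and form fields below are placeholders, not morningstar.com's real endpoints):

```python
import http.cookiejar
import urllib.parse
import urllib.request

# One jar shared by both requests: cookies set during login
# are automatically resent on the follow-up request.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Hypothetical flow: POST credentials, then fetch the protected page
# form = urllib.parse.urlencode({"user": "...", "pass": "..."}).encode()
# opener.open("https://example.com/login", form)
# page = opener.open("https://example.com/protected-data").read()
```

The key point is to reuse the same opener for both requests; two independent urlopen calls each start with an empty cookie jar, which produces exactly the logged-out behavior described in the question.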