Parsing remote web with Python BeautifulSoup - python

https://stackoverflow.com/a/64983/468251 - Hello, I have question about this code, how made that working with remote website url, and how got value = fooId['value'] from all inputs, no only from first?

When you parse url on the internet, you need to find a way to download the page content html first. There are great libraries, like requests, which is said to be best for python. Say you want to parse https://stackoverflow.com/
import requests
response = requests.get("https://stackoverflow.com/")
page_html = response.text
The page_html is the page html in python string, then you can treat it like a local html file, and preform any kind of parsing on them.
As for getting all the occurrence of a pattern, you can do soup.findAll('input',name='fooId',type='hidden'), instead of just soup.find(). The soup.findAll will return a list of all occurrence.

The example use a local file. If you want to use a remote site, you need to download the file from the server and parse the html.
You can look at request or urllib2 for this.
I hope it helps

Related

Downloading the HTML page of a site and crawl it to get the desired data cause they dont have a public api

So I need to get some data from a site, the problem is they dont have a public api for it so I thought of downloading the html file then search for the data I want. I just not sure if its even possible to do it I think it should be right?
the flow would be
1. first download the html file
2. ....crawl
(https://www.forexfactory.com/calendar.php) the link that has the data I want
not sure how will I crawl the page as string,cause the page has like a table, of data they actualy have a public api for the xml file, but it excludes the data I want which is the "actual" column, that's what I want
how will I crawl the table and get that actual column from the html file, I already have the other details from their xml file, like title/event name. Need help thanks.
A good idea is to work with Pythons request and BeautifulSoup4 libs.
First you make a http request with (you guessed it) requests, then you can parse the html site with bs4 (BeautifulSoup4)
import requests
from bs4 import BeautifulSoup
r = requests.get("Your Website").text
soup = BeautifulSoup(r,'lxml')
Now you can look at your "soup" and scrape the data you want

Redirected to main page when trying to parse html with python

from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
url = "http://www.csgolounge.com/api/mathes"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
print (data)
I am trying to use this code to get the text from this page, but every time I try to scrape or get the text from the page, I am redirected to home page, and my code outputs the html from the homepage. The page I am trying to scrape is a .php file, and not an html or textfile. I would like to get the text from the page and then extract the data and do what I want with it.
I have tried changing the headers of my code, that the website would think that I am not a bot, but a chrome browser, but I still get redirected to the homepage. I have tried using diffrent html python parsers like BeautifulSoup, and the python built in class, as well as many other popular parsers, but they all give the same result.
Is there a way to stop this, and to get the text from this link? Is it a mistake in my code or what?
First of all, try it without the "www" part.
Rewrite http://www.csgolounge.com/api/mathes as https://csgolounge.com/api/mathes
If it doesn't work, try Selenium.
It may be getting stuck since it can't process the javascript part.
Selenium can handle it better.

Scraping Biography.com using urllib2

So I've scraped websites before, but this time I am stumped. I am attempting to search for a person on Biography.com and retrieve his/her biography. But whenever I search the site using urllib2 and query the URL: http://www.biography.com/search/ I get a blank page with no data in it.
When I look into the source generated in the browser by clicking View Source, I still do not see any data. When I use Chrome's developer tools, I find some data but still no links leading to the biography.
I have tried changing the User Agent, adding referrers, using cookies in Python but to no avail. If someone could help me out with this task it would be really helpful.
I am planning to use this text for my NLP project and worst case, I'll have to manually copy-paste the text. But I hope it doesn't come to that.
Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.
https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0
The search term I used is in the q= part of the query string: q=Barack%20Obama.
This returns JSON inside of which there is a key link with the value of the article of interest's URL.
"link": "http://www.biography.com/people/barack-obama-12782369"
Visiting that page shows me that this is generated by a request to:
http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/#published/#by-custom-type/ContentPerson/#by-slug/barack-obama-12782369
which returns JSON containing HTML.
So, replacing the last part of the link barack-obama-12782369 with the relevant info for the person of interest in the saymedia-content link may well pull out what you want.
To implement:
You'll need to use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace the Barack%20Obama with a URL escaped search string, e.g. Bill%20Clinton.
Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, extract the part of this link of interest (as barack-obama-12782369 above).
Use urllib2 or requests to do a saymedia-content API request replacing barack-obama-12782369 after #by-slug/ with whatever you extract from 2; i.e. do another urllib2.urlopen on this URL.
Parse the JSON from the response of this second request to extract the content you want.
(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)
Alternatively, you can use Selenium to visit the website, do the search and then extract the content.
You will most likely need to manually copy and paste, as biography.com is a completely javascript-based site, so it can't be scraped with traditional methods.
You can discover an api url with httpfox (firefox addon). f.e. http://www.biography.com/.api/item/search?config=published&query=marx
brings you a json you can process searching for /people/ to retrive biography links.
Or you can use an screen crawler like selenium

getting information from a webpage for an application using python

I am currently trying to create a bot for the betfair trading site, it involves using the betfair api which uses soap and the new API-NG will use json so I can understand how to access the information that I need.
My question is, using python, what would the best way to get information from a website that uses just html, can I convert it some way to maybe xml or what is the best/easiest way.
Json, xml and basically all this is new to me so any help will be appreciated.
This is one of the websites I am trying to access to get horse names and prices,
http://www.oddschecker.com/horse-racing-betting/chepstow/14:35/winner
I know there are some similar questions but looking at the answers and the source of the above page I am no nearer to figuring out how to get the info I need.
For getting html from a website there are two well used options.
urllib2 This is built in.
requests This is third party but really easy to use.
If you then need to parse your html then I would suggest using Beautiful soup.
Example:
import requests
from bs4 import BeautifulSoup
url = 'http://www.example.com'
page_request = requests.get(url)
page_source = page_request.text
soup = BeautifulSoup(page_source)
The page_source is just the basic html of the page, not much use, the soup object on the other hand can be used to access different parts of the page automatically.

Python URL Redirect Problem

I've got a link that I know redirects to another end url, and I'm trying to get the address for that end url using python. But the original link is a little weird, and doesn't work like a normal redirect, and I can't figure out why. When I post the link (the link's below for you try, if you'd like) into a browser, it redirects perfectly. But when I run the following code, it doesn't.
import urllib2
request = urllib2.Request('http://www.facebook.com/ajax/emu/end.php?eid=AQJSWpZ3e4cCTHoNdahpJzPYzmzHOENzbTWBVlW4SgIxX0rL9bo6NXmS3q06cjeh5jO9wbsmr3IyGrpbXPSj0GPLbRJl4VUH-EBnmSy_R4j7iYzpMe1ooZ6IEqSEIlBl0-5SEldIhxI82m75YPa5nOhuBdokiwTw79hoiRB-Zn1auxN-6WLVe3e5WNSt3HLAEjZL-2e4ox_7yAyLcBo1nkamEvShTyZ-GfIf0A9oFXylwRnV8oNaqNmUnqrFYqDbUhzh7d6LSm3jbv1ue2coS3w8N7OxTKVwODHa-Hd3qRbYskB9weio8eKdDFtkvDKuzSSq5hjr711UjlDsgpxLuAmdD95xVwpomxeEsBsMCYJoUEQYa-cM7q3W1aiIYBHlyn2__t74qHWVvzK5zaLKFMKjRFQqphDlUMgMni6AP1VHSn1wli_3lgeVD8TzcJMSlJIF7DC_O44WdjBIMY8OufER3ZB_mm2NqwUe6cvV9oV9SNyYHE4UUURYjW_Z6sUxz3SpHG8c6QxJ-ltSeShvU3mIwAhFE3M0jGTg7AQ7nIoOUfC8PDainFZ1NV8g31aqaqDsF7UxdlOmBT6w-Y8TPmHOXfSlWB-M3MQYUBmcWS3UzlbSsavQG8LXPqYbyKfvkAfncSnZS3_tkoqbTksFirQWlSxJ3mgXrO5PqopH63Esd9ynCbFQM1q_3_wgkYvTeGS9XK6G63_Ag3N9dCHsO_bCJToJT4jeHQCSQ83cb1U5Qpe_7EWbw1ilzgyL-LBVrpH424dwK-4AoaL00W-gWzShSdOynjcoGeB7KE0pHbg-XhuaVribSodriSGybNdADBosnddVvZldY22-_97MqEuA&amp&c=4&amp&f=4&amp&ui=6003071106023-id_4e0b51323f9d01393198225&amp&en=1&amp&a=0&amp&sig=78154')
opener = urllib2.build_opener()
f = opener.open(request)
f.geturl()
I simply get my original url back. I encounter the same problem when I save cookies and use mechanize. Any help would be much appreciated! Thanks!
It looks like this is using Javascript to perform the redirect. You'll either have to figure out exactly how the Javascript is performing the redirects and pull out the appropriate urls, or you'll have to actually run the Javascript. As far as I know, running Javascript from python is not an easy task.
(original answer deleted)
If you look at the contents of f.read() you'll see what's going on here. Instead of returning a 301 or 302 that redirects to the new URL, Facebook actually returns a real HTML document - which contains a piece of Javascript that uses document.location.replace to change the URL in the browser.
There's no easy way of replicating that with Python - the best thing to do is to parse the document with something like BeautifulSoup to find the Javascript, and somehow extract the new URL. It won't be pretty.

Categories