Matching contents of an html file with keyword python - python

I am making a download manager, and I want it to check the MD5 hash of a file after downloading it from a URL. The hash is published on the page. The manager needs to compute the MD5 of the downloaded file (this is done), then fetch the HTML page and search its WHOLE contents for a matching hash.
My question is: how do I make Python return the whole contents of the HTML page and find a match for my "md5 string"?

The requests library is what you want to use; it will save you a lot of trouble.
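A minimal sketch of that approach; the file name and page URL below are placeholders, not from the question:

    import hashlib
    import requests

    # Compute the MD5 of the downloaded file (the question says this part is done).
    with open("downloaded_file.bin", "rb") as f:  # placeholder file name
        file_md5 = hashlib.md5(f.read()).hexdigest()

    # Fetch the whole HTML page and look for the hash anywhere in it.
    page_html = requests.get("https://example.com/downloads.html").text  # placeholder URL
    if file_md5 in page_html:
        print("MD5 found on the page - download looks good")
    else:
        print("MD5 not found on the page")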

Import urllib and use urllib.urlopen to get the contents of the HTML page. Import re to search for the hash code with a regex; you could also use the string's find method instead of a regex.
If you encounter problems, you can ask more specific questions. As it stands, your question is too general.
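A quick sketch of that urllib-and-re approach (Python 3 names shown; on Python 2 the call is urllib.urlopen as mentioned above). The URL and hash value are placeholders:

    import re
    import urllib.request

    file_md5 = "d41d8cd98f00b204e9800998ecf8427e"  # placeholder hash computed from the file
    html = urllib.request.urlopen("https://example.com/downloads.html").read().decode("utf-8", "ignore")

    # Regex search for the literal hash string...
    if re.search(re.escape(file_md5), html):
        print("hash found on the page")

    # ...or the plain-string alternative:
    if html.find(file_md5) != -1:
        print("hash found on the page")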

Related

BeautifulSoup 4 issue reading URL-encoded links and files downloaded using wget

I am having an issue with BS4/Python 2.7.12 reading links and files that were already URL-encoded when I downloaded them using wget to archive my Drupal website.
For example, a link that exists on the live website would be:
https://mywebsite.org/content/prime's-and-"doubleprimes"-in-it (I know this is incorrect grammar because the 's example is possessive not plural)
The downloaded file would be:
/content/prime%E2%80%99s-and-%E2%80%9Cdoubleprimes%E2%80%9D-in-it
(This is helpful in identifying different typography: http://www.w3schools.com/TAGS/ref_urlencode.asp)
My script loops through each file and flattens the site by adding ".html" to all links. However, in doing this BS4 is actually changing the link path, because it seems to re-interpret the already URL-encoded links. As a result it changes the above link to:
/content/prime%2580%2599s-and-%2580%259Cdoubleprimes%2580%259D-in-it
And thus the link no longer works. You can see the %25 it is using to re-encode the % signs that begin %E2, for example.
There have been many questions regarding encoding with BS4, but most of them deal specifically with UTF-8. I understand that BS4 automatically reads the "soup" into UTF-8, but I'm unsure why it is trying to re-URL-encode links that are already encoded. I have tried soup = BeautifulSoup(text.read().decode('utf-8','ignore')) as suggested here, which fixed an issue where BS4 was trying to interpret %E2 as a Unicode character, but I haven't seen anything about re-encoding of already URL-encoded characters. I have also tried adding formatter="html" to my soup.prettify command, but this did not work either, as the files had already been read and interpreted at that point.
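For reference, a minimal sketch (Python 3) of the kind of link-flattening loop described above; the paths and parser choice are assumptions, not the asker's actual script, and it does not address the re-encoding problem:

    import glob
    from bs4 import BeautifulSoup

    # Loop over the archived pages (placeholder path) and append ".html" to links.
    for path in glob.glob("archive/*.html"):
        with open(path, "rb") as f:
            soup = BeautifulSoup(f.read().decode("utf-8", "ignore"), "html.parser")
        for a in soup.find_all("a", href=True):
            if not a["href"].endswith(".html"):
                a["href"] += ".html"
        with open(path, "w", encoding="utf-8") as f:
            f.write(str(soup))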

Scraping Biography.com using urllib2

So I've scraped websites before, but this time I am stumped. I am attempting to search for a person on Biography.com and retrieve his/her biography. But whenever I search the site using urllib2 and query the URL http://www.biography.com/search/, I get a blank page with no data in it.
When I look into the source generated in the browser by clicking View Source, I still do not see any data. When I use Chrome's developer tools, I find some data but still no links leading to the biography.
I have tried changing the User Agent, adding referrers, using cookies in Python but to no avail. If someone could help me out with this task it would be really helpful.
I am planning to use this text for my NLP project and worst case, I'll have to manually copy-paste the text. But I hope it doesn't come to that.
Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.
https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0
The search term I used is in the q= part of the query string: q=Barack%20Obama.
This returns JSON inside of which there is a key link with the value of the article of interest's URL.
"link": "http://www.biography.com/people/barack-obama-12782369"
Visiting that page shows me that this is generated by a request to:
http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/#published/#by-custom-type/ContentPerson/#by-slug/barack-obama-12782369
which returns JSON containing HTML.
So, replacing the last part of the link barack-obama-12782369 with the relevant info for the person of interest in the saymedia-content link may well pull out what you want.
To implement:
1. Use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace the Barack%20Obama with a URL-escaped search string, e.g. Bill%20Clinton.
2. Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, extract the part of the link of interest (barack-obama-12782369 above).
3. Use urllib2 or requests to do a saymedia-content API request, replacing barack-obama-12782369 after #by-slug/ with whatever you extracted in step 2; i.e. do another urllib2.urlopen on this URL.
4. Parse the JSON from the response of this second request to extract the content you want.
(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)
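A rough sketch of that two-request flow. It assumes the key, cx and endpoints copied from above still work (they may well have expired), that the Custom Search response has the usual items list, and that the '#' segments of the saymedia URL need to be sent percent-encoded as %23, since HTTP clients normally treat '#' as a fragment delimiter:

    import urllib.parse
    import requests

    # Step 1: run the search through the Google Custom Search call seen above
    # (the callback= parameter is dropped so the response is plain JSON).
    search_url = (
        "https://www.googleapis.com/customsearch/v1?q="
        + urllib.parse.quote("Bill Clinton")
        + "&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU"
        + "&cx=011223861749738482324%3Aijiqp2ioyxw&num=8"
    )
    results = requests.get(search_url).json()

    # Step 2: find the biography.com/people link in the JSON and keep its slug.
    link = next(item["link"] for item in results["items"] if "/people/" in item["link"])
    slug = link.rstrip("/").split("/")[-1]  # e.g. "barack-obama-12782369"

    # Step 3: request the saymedia-content URL with the slug substituted after by-slug
    # (the '#' characters are written as %23 here - an assumption, see above).
    content_url = (
        "http://api.saymedia-content.com/:apiproxy-anon/content-sites/"
        "cs01a33b78d5c5860e/content-customs/%23published/%23by-custom-type/"
        "ContentPerson/%23by-slug/" + slug
    )
    content = requests.get(content_url).json()  # JSON containing HTML, per the answer
    print(content)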
Alternatively, you can use Selenium to visit the website, do the search and then extract the content.
You will most likely need to manually copy and paste, as biography.com is a completely javascript-based site, so it can't be scraped with traditional methods.
You can discover an API URL with HttpFox (a Firefox addon), e.g. http://www.biography.com/.api/item/search?config=published&query=marx
This brings back JSON that you can process, searching for /people/ to retrieve biography links.
Or you can use a screen crawler like Selenium.
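A small sketch of that search endpoint in use; the JSON structure isn't documented here, so this just scans the raw response text for /people/ links rather than assuming a particular schema:

    import re
    import requests

    resp = requests.get(
        "http://www.biography.com/.api/item/search",
        params={"config": "published", "query": "marx"},
    )
    # Collect any /people/ biography links mentioned anywhere in the response.
    people_links = sorted(set(re.findall(r"/people/[\w-]+", resp.text)))
    print(people_links)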

Extract site that HTML document came from

I have a folder full of HTML documents that are saved copies of webpages, but I need to know what site they came from. What function can I use to extract the website name from the documents? I did not find anything in the BeautifulSoup module. Is there something specific that I should be looking for in the document? I do not need the full URL; I just need the name of the website.
You can only do that if the url is mentioned somewhere in the source...
First find out where the URL is, if it is mentioned at all. If it's there, it'll probably be in the base tag. Sometimes websites have nice headers with a link to their landing page, which could be used if all you want is the domain. Or it could be in a comment somewhere, depending on how you saved the pages.
If the URL is mentioned in a similar way in all the pages then your job is easy: use re, BeautifulSoup, or lxml with XPath to grab the info you need. There are other tools available, but any of those will do.
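A minimal sketch of checking the base tag (with a canonical link as a fallback); the file name is a placeholder, and which tag actually carries the URL will vary from page to page:

    from urllib.parse import urlparse
    from bs4 import BeautifulSoup

    with open("saved_page.html", encoding="utf-8", errors="ignore") as f:  # placeholder file
        soup = BeautifulSoup(f.read(), "html.parser")

    # Prefer <base href=...>, fall back to <link rel="canonical" href=...>.
    tag = soup.find("base", href=True) or soup.find("link", rel="canonical", href=True)
    if tag:
        print(urlparse(tag["href"]).netloc)  # just the site name, not the full URL
    else:
        print("no obvious URL in this document")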

How would I look for all URLs on a web page and then save them to a individual variables with urllib2 In Python?

How would I look for all URLs on a web page and then save them to individual variables with urllib2 In Python?
Parse the HTML with an HTML parser and find all the <a> tags (e.g. using Beautiful Soup's findAll() method), then check their href attributes.
If, however, you want to find all URLs in the page even if they aren't hyperlinks, then you can use a regular expression which could be anything from simple to ridiculously insane.
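A short sketch of both approaches (Python 3 names are used; the question's urllib2.urlopen is urllib.request.urlopen there, and the URL is a placeholder):

    import re
    import urllib.request
    from bs4 import BeautifulSoup

    html = urllib.request.urlopen("http://example.com").read().decode("utf-8", "ignore")

    # Hyperlinks only: every <a> tag's href attribute.
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]

    # Any URL in the text, hyperlink or not (a deliberately simple pattern).
    bare_urls = re.findall(r"https?://[^\s<>]+", html)

    print(hrefs)
    print(bare_urls)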
You don't do it with urllib2 alone. What you are looking for is parsing the URLs in a web page.
You get your first page using urllib2, read its contents, and then pass it through a parser like BeautifulSoup; or, as the other poster explained, you can use a regex to search the contents of the page.
You could simply download the raw HTML with urllib2 and then search through it. There might be easier ways, but you could do this:
1: Download the source code.
2: Split it into a list of strings (e.g. with split()).
3: Check the first 7 characters of each item.
4: If the first 7 characters are http://, write that item to a variable.
Why do you need separate variables though? Wouldn't it be easier to save them all to a list, using list.append(URL_YOU_JUST_FOUND), every time you find another url?
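A sketch of that naive split-and-check approach, collecting every match into a list as suggested (the URL is a placeholder, and Python 3's urllib.request stands in for urllib2):

    import urllib.request

    html = urllib.request.urlopen("http://example.com").read().decode("utf-8", "ignore")

    found_urls = []
    for word in html.split():
        if word[:7] == "http://" or word[:8] == "https://":
            found_urls.append(word)  # append to a list rather than one variable per URL

    print(found_urls)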

how do i parse a wiki page without taking a dump of it in python?

Is it possible to parse a wiki without taking its dump, since the dump itself is way too much data to handle? Let's say I have the URL of a certain wiki; once I fetch it through urllib, how do I parse it and get a certain type of data using Python?
Here, "type" means data corresponding to a semantic match to the search that would have been done.
You need an HTML parser to get the useful data from the HTML.
You can use BeautifulSoup to help parse the HTML. I recommend that you read the documentation and have a look at the examples there.
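A minimal sketch of fetching and parsing a single wiki page with BeautifulSoup; the URL is just an example, not from the question:

    import urllib.request
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Python_(programming_language)"  # example page
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    print(soup.title.string)
    # Grab the text of every paragraph in the article body.
    for p in soup.find_all("p"):
        print(p.get_text())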
I'd suggest an option such as HarvestMan instead, since a semantic search is likely to span multiple pages, compared to a simpler solution such as BeautifulSoup.
