So I'm working on a web-scraping project that essentially pulls a bunch of product information (like price, location, name, etc.) from a list of 20+ websites... So far I have created a generic MasterSpider (similar to what is discussed here: Creating a generic scrapy spider), from which I inherit and override depending on the site's specific architecture.
However, after essentially repeating much code and wanting to make this project scalable, I have started working towards generalizing my MasterSpider so that it can be extended to other websites, and ideally instantiated with minimal arguments, like just the start_url. In other words, instead of locating elements by XPath, which is not consistent across domains, I am now looking for HTML tag attribute values/text values.
This works fine for generic/consistent targets like identifying the category links from the start page (which typically contain the category in the link), but for things like finding the product name, price, etc. it falls short. Having to build out a list of XPath conditions (like @class='a' or @class='b' or contains(., 'a') or contains(., 'b')) kind of defeats the purpose.
I realize I could also pass a few XPath conditions when instantiating the spider, which I may just have to do, but I would prefer to make this as easy to use and extensible as possible...
My idea is, before parsing the individual product pages, to issue a dummy request that looks for the information I would like and works backward to identify the actual XPath of that information, which is then used in the subsequent requests.
So I was wondering if anyone had any good ideas on how to extract the XPath of an element given, let's say, a list of tag values it could contain, or matching text within... I realize a series of try/except blocks could work, but again that would be more of a band-aid than a solution, and not very scalable. If I have to use something like Selenium or a parser to do this, that is also an option...
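To make that concrete, here is a rough sketch of the kind of reverse lookup I have in mind, using lxml (getpath() returns an absolute, positional XPath, so the result would only be a first pass that still needs generalizing):

# Rough sketch: given some known values, work backward to the XPaths of the
# elements that contain them.
from lxml import html

def discover_xpaths(page_source, known_values):
    tree = html.fromstring(page_source)
    root = tree.getroottree()
    found = {}
    for value in known_values:
        # Only checks the element's direct text, as an illustration.
        matches = tree.xpath('//*[contains(text(), $val)]', val=value)
        found[value] = [root.getpath(el) for el in matches]
    return found

# e.g. discover_xpaths(response.text, ["$19.99", "Acme Widget"])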
Really open to any ideas or fresh perspectives.
Thanks!
At work, I have to scrape thousands of news websites, and as you might expect, there is no one-size-fits-all solution. So our strategy was to have a "generic" method that, through heuristics, would try to extract the needed information, and for troublesome websites we would have a list of specific XPaths for that website.
So our general structure is something like this:
parsers = {
    "domain1": {
        "item1": "//div...",
        "item2": "//div...",
    },
    "domain2": {
        "item1": "//div...",
        "item2": "//div...",
    },
}

def parse(self, response):
    domain = urlparse(response.url).netloc  # urlparse comes from urllib.parse
    try:
        parser = self.parsers[domain]
        return self.parse_with_parser(response, parser)
    except Exception:
        return self.parse_generic(response)
The parsers dict I actually keep in a separate file. You could also keep it in a database or a file and load the information when the spider starts, so you do not have to edit the crawler every time you need to change something.
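For example, here is a minimal sketch of loading the mapping from a JSON file when the spider starts (the parsers.json filename and spider name are just assumptions, and parse_with_parser/parse_generic are the methods described above):

import json
from urllib.parse import urlparse

import scrapy

class MultiDomainSpider(scrapy.Spider):
    name = "multi_domain"  # hypothetical spider name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Load the per-domain XPath mapping from an external file so the
        # crawler itself never needs editing when a site changes.
        with open("parsers.json") as f:  # hypothetical path
            self.parsers = json.load(f)

    def parse(self, response):
        domain = urlparse(response.url).netloc
        parser = self.parsers.get(domain)
        if parser:
            return self.parse_with_parser(response, parser)
        return self.parse_generic(response)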
Edit:
Answering the second part of your question: depending on what you need done, you could write XPaths that take several conditions into account. For instance:
"//a[contains(#class, 'foo') or contains(#class, 'bar')]"
Maybe even
"//a[contains(#class, 'foo') or contains(#class, 'bar')] | //div[#class='something'] | //td/span"
The pipe operator "|" will allow you to "chain" different expressions that might contain what you want extracted; it returns the union of the node sets matched by each expression, effectively an OR across them.
Related
I would like to scrape multiple URLs, but they are of a different nature, such as different company websites with different HTML backends. Is there a way to do it without writing custom code for each URL?
I understand that I can put multiple URLs into a list and loop through them.
I fear not, but I am not an expert :-)
I could imagine that it depends on the complexity of the structures. If you want to find the text "Test" on every website, I could imagine that soup.body.findAll(text='Test') would return all occurrences of "Test" on the website.
I assume you're aware of how to loop through a list here, so that you'd loop through the list of URLs and for each one check whether the searched string occurs (maybe you are looking for something else, e.g. an "apply" button or a "login" link?).
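Something along these lines might work, as a rough sketch with requests and BeautifulSoup (the URL list and search string are placeholders):

import requests
from bs4 import BeautifulSoup

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

for url in urls:
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, "html.parser")
    # All text nodes in the body whose entire text is exactly "Test".
    hits = soup.body.find_all(string="Test") if soup.body else []
    print(url, "->", len(hits), "occurrence(s)")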
all the best,
I'm trying to collect a few pieces of information about a bunch of different web sites. I want to produce one Item per site that summarizes the information I found across that site, regardless of which page(s) I found it on.
I feel like this should be an item pipeline, like the duplicates filter example, except I need the final contents of the Item, not the results from the first page the crawler examined.
So I tried using request.meta to pass a single partially-filled Item through the various Requests for a given site. To make that work, I had to have my parse callback return exactly one new Request per call until it had no more pages to visit, then finally return the finished Item. Which is a pain if I find multiple links I want to follow, and breaks entirely if the scheduler throws away one of the requests due to a link cycle.
The only other approach I can see is to dump the spider output to json-lines and post-process it with an external tool. But I'd prefer to fold it into the spider, preferably in a middleware or item pipeline. How can I do that?
How about this ugly solution?
Define a dictionary (defaultdict(list)) on a pipeline for storing per-site data. In process_item you can just append a dict(item) to the list of per-site items and raise a DropItem exception. Then, in the close_spider method, you can dump the data to wherever you want.
Should work in theory, but I'm not sure that this solution is the best one.
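A rough sketch of such a pipeline (untested; it assumes each item carries a url field to group by, and dumps to a JSON file as one possible destination):

import json
from collections import defaultdict
from urllib.parse import urlparse

from scrapy.exceptions import DropItem

class SiteSummaryPipeline:
    def open_spider(self, spider):
        self.per_site = defaultdict(list)

    def process_item(self, item, spider):
        # Group items by the site they came from, then drop them so
        # nothing is exported per-page.
        site = urlparse(item.get("url", "")).netloc  # assumes a 'url' field
        self.per_site[site].append(dict(item))
        raise DropItem("collected for per-site summary")

    def close_spider(self, spider):
        # Dump the aggregated data wherever you want; a JSON file here.
        with open("site_summaries.json", "w") as f:
            json.dump(self.per_site, f, indent=2)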
If you want a summary, Stats Collection would be another approach:
http://doc.scrapy.org/en/0.16/topics/stats.html
For example, to get the total number of pages crawled for each website, use the following code:
stats.inc_value('pages_crawled:%s'%socket.gethostname())
I had the same issue when I was writing my crawler.
I solved it by passing a list of URLs through the request's meta and chaining the requests together.
Refer to the detailed tutorial I wrote here.
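The idea, roughly (the start list and field names are illustrative only):

import scrapy

class ChainSpider(scrapy.Spider):
    name = "chain"

    def start_requests(self):
        # Illustrative: all pages of one site that should feed a single item.
        urls = ["https://example.com/a", "https://example.com/b"]
        yield scrapy.Request(urls[0],
                             meta={"urls": urls[1:], "item": {}},
                             callback=self.parse_page)

    def parse_page(self, response):
        item = response.meta["item"]
        remaining = response.meta["urls"]
        # ... fill in whatever this page contributes to the item ...
        if remaining:
            yield scrapy.Request(remaining[0],
                                 meta={"urls": remaining[1:], "item": item},
                                 callback=self.parse_page)
        else:
            yield item  # all pages visited, emit the finished item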
The xpath for text I wish to extract is reliably located deep in the tree at
...table/tbody/tr[4]/td[2]
Specifically, td[2] is structured like so
<td class="val">xyz</td>
I am trying to extract the text "xyz", but a broad search returns multiple results. For example the following path returns 10 elements.
xpath('//td[@class="val"]')
... while a specific search doesn't return any elements. I am unsure why the following returns nothing.
xpath('//tbody/tr/td[@class="val"]')
One solution involves..
table = root.xpath('//table[@class="123"]')
# going down the tree
xyz = table[0][3][1]
print xyz.text
However, I am pretty sure this is extremely brittle. I would appreciate it if someone could tell me how to construct an XPath search that would be both robust and relatively cheap on resources.
You haven't mentioned it explicitly, but if your target table and td tag classes are reliable then you could do something like:
//table[@class="123"]/descendant::td[@class="val"]
And you partly dodge the issue of tbody being there or not.
However, there's no substitute for actually seeing the material you are trying to parse when recommending XPath queries...
...table/tbody/tr[4]/td[2]
I guess you found this XPath via a tool like Firebug. One thing to note about tools like Firebug (or other inspect tools within browsers) is that they use the DOM tree generated by the browser itself and most (if not all) HTML parsers in browsers would try hard to make the passed HTML valid. This often requires adding various tags the standard dictates.
<tbody> is one of these tags. <tr> tags are only allowed as a child of <thead>, <tbody> or <tfoot> tags. Unfortunately, in my experience, you will rarely see one of these tags inside a <table> in the actual source, but a browser will add these necessary tags while parsing to make the HTML valid, since the standard requires it.
To cut this story short, there is probably no <tbody> tag in your actual source. That is why your XPath returns nothing.
As for generating XPath queries, this highly depends on the particular page/XML. In general, positional queries such as td[4] should be a last resort, since they tend to break easily when something is added before them. You should inspect the markup carefully and try to come up with queries that use attributes like id or class, since they add specificity more reliably than positional ones. But in the end, it all boils down to the specifics of the page in question.
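To illustrate the <tbody> point, lxml's HTML parser does not insert <tbody> the way a browser does, so a query that hard-codes it finds nothing when the tag is missing from the source:

from lxml import html

# Markup without <tbody>, as it is commonly served.
doc = html.fromstring(
    '<html><body><table><tr><td class="val">xyz</td></tr></table></body></html>')

print(doc.xpath('//tbody/tr/td[@class="val"]'))        # [] -- lxml adds no <tbody>
print(doc.xpath('//table//td[@class="val"]')[0].text)  # 'xyz'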
This seems to be working
from lxml import etree
doc = etree.HTML('<html><body><table><tbody><tr><td>bad</td><td class="val">xyz</td></tr></tbody></table></body></html>')
print doc.xpath('//tbody/tr/td[@class="val"]')[0].text
output:
xyz
So what is your problem?
So I am relatively new to Python, and in order to learn, I have started writing a program that goes online to Wikipedia, finds the first link in the overview section of a random article, follows that link and keeps going until it either enters a loop or finds the philosophy page (as detailed here), and then repeats this process for a new random article a specified number of times. I then want to collect the results in some form of useful data structure, so that I can pass the data to R using the Rpy library and draw some sort of network diagram (R is pretty good at drawing things like that), with each node in the diagram representing the pages visited, and the arrows the paths taken from the starting article to the philosophy page.
So I have no problem getting Python to return the fairly structured HTML from Wikipedia, but there are some problems that I can't quite figure out. Up till now I have selected the first link using a CSS selector from the lxml library. It selects the first link (in an a tag) that is a direct descendant of a p tag, which is a direct descendant of a div tag with class="mw-content-ltr", like this:
# Requires: import urllib, urllib2; from lxml.html import parse
user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)'
values = {'name': 'David Kavanagh',
          'location': 'Belfast',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
encodes = urllib.urlencode(values)
req = urllib2.Request(url, encodes, headers)
page = urllib2.urlopen(req)
root = parse(page).getroot()
return root.cssselect("div.mw-content-ltr>p>a")[0].get('href')
This code resides in a function which I use to find the first link on the page. It works for the most part, but the problem is that if the first link is inside some other tag, rather than being a direct descendant of a p tag (say, a b tag or something), then I miss it. As you can see from the wiki article above, links in italics or inside parentheses aren't eligible for the game, which means that I never get a link in italics (good) but frequently do get links that are inside parentheses (bad), and sometimes miss the first link on a page, like the first link on the Chair article, which is stool, but is in bold, so I don't get it. I have tried removing the direct-descendant stipulation, but then I frequently get links that are "above" the overview section, usually in the side box, in a p tag, in a table, in the same div as the overview section.
So the first part of my question is:
How could I use CSS selectors or some other function or library to select the first link in the overview section that is not inside parentheses or in italics? I thought about using regular expressions to look through the raw HTML, but that seems like a very clunky solution, and I thought there might be something a bit nicer out there that I haven't thought of.
So currently I am storing the results in a list of lists: I have a list called paths, which contains lists of strings, each string being the title of a wiki article.
The second part of the question is:
How can I traverse this list of lists to represent the multiple convergent paths? Is storing the results like this a good idea? Since the end diagram should look something like an upside-down tree, I thought about making some kind of tree class, but that seems like a lot of work for something that is conceptually fairly simple.
Any ideas or suggestions would be greatly appreciated.
Cheers,
Davy
I'll just answer the second question:
For a start, just keep a dict mapping one Wikipedia article title to the next. This will make it easy and fast to check if you've hit an article you've already found before. Essentially this is just storing a directed graph's edges, indexed by their origins.
If you get to the point where a Python dict is not efficient enough (it does have a significant memory overhead, once you have millions of items memory can be an issue) you can find a more efficient graph data structure to suit your needs.
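A small sketch of that bookkeeping (get_first_link is a hypothetical stand-in for whatever function finds the first link of an article):

# next_article maps each visited title to the title its first link leads to.
next_article = {}

def follow_until_done(start_title, get_first_link):
    """Walk from start_title, stopping on Philosophy or on a cycle.
    get_first_link is assumed to return the next article title."""
    title = start_title
    path = [title]
    while title != "Philosophy":
        if title in next_article:          # already explored from here
            nxt = next_article[title]
        else:
            nxt = get_first_link(title)
            next_article[title] = nxt
        if nxt in path:                    # cycle detected
            break
        title = nxt
        path.append(title)
    return path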
EDIT
Ok, I'll answer the first question as well...
For the first part, I highly recommend using the MediaWiki API instead of getting the HTML version and parsing it. The API enables querying for certain types of links, for instance just inter-wiki links or just inter-language links. Additionally, there are Python client libraries for this API, which should make using it from Python code simple.
Don't parse a website's HTML if it provides a comprehensive and well-documented API!
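For instance, a rough sketch of asking the action API for a page's links with requests (parameters from the query/links module; it ignores result continuation for brevity):

import requests

API = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "titles": "Chair",       # example article
    "prop": "links",
    "plnamespace": 0,        # main/article namespace only
    "pllimit": "max",
    "format": "json",
}
data = requests.get(API, params=params).json()
for page in data["query"]["pages"].values():
    for link in page.get("links", []):
        print(link["title"])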
For the first part, it's not possible to find brackets with CSS selectors, because as far as the HTML is concerned, brackets are just text.
If I were you, I'd use selectors to find all the relevant paragraph elements that are valid for the game. Then I'd look at the text of each paragraph element and remove anything that is not valid - for instance, anything between brackets, and anything between italic tags. I'd then search this processed text for the link elements I need. This is slightly nicer than manually processing the whole HTML document.
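One rough way to approximate that with lxml (first_eligible_link is a hypothetical helper, and the parenthesis check is only a heuristic):

import copy

def first_eligible_link(paragraph):
    """Hypothetical helper: first link in a <p> element (an lxml HtmlElement)
    that is neither italicised nor inside parentheses. Rough heuristic."""
    paragraph = copy.deepcopy(paragraph)          # don't mutate the parsed page
    for italic_link in paragraph.cssselect("i a, em a"):
        italic_link.drop_tree()                   # italicised links never count
    text_so_far = paragraph.text or ""
    for child in paragraph:                       # direct children, in order
        for anchor in child.iter("a"):            # catches <b><a>...</a></b> too
            # Eligible only if every "(" seen so far has been closed.
            if text_so_far.count("(") == text_so_far.count(")"):
                return anchor.get("href")
        text_so_far += child.text_content() + (child.tail or "")
    return None

# e.g. for p in root.cssselect("div.mw-content-ltr > p"):
#          link = first_eligible_link(p)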
I'm not sure I follow exactly what you are doing for the second part, but as for representing the results of this search as a tree: This is a bad idea as you are looking for cycles, which trees can't represent.
For a data structure, I would have lists of 'nodes', where a node represents a page and has a URL and a count of occurrences. I'd then use a brute-force algorithm to compare lists of nodes - if two lists have a node that is the same, you could merge them, increasing the 'occurrences' count for each mirrored node.
I would not use the standard Python 'list', as this can't loop back on itself. Maybe create your own linked-list implementation to contain the nodes.
Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to convey a page that looks identical, let alone all the combinations you can have to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and compare the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well formed html!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules which add up as you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
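A toy version of that scoring pass might look like this (it uses the rule weights above and treats the element's id/class as the "section name", which is just one interpretation):

from lxml import html

def score_sections(doc):
    """Toy scoring pass: higher score = more likely to be the main content."""
    scores = {}
    for el in doc.xpath("//div | //p"):
        words = len(el.text_content().split())
        score = words // 100                           # +1 per 100 words
        for child in el:
            if len(child.text_content().split()) > 100:
                score += 1                             # +1 per wordy child
        name = ((el.get("id") or "") + " " + (el.get("class") or "")).lower()
        if "nav" in name:
            score -= 1
        if "advert" in name:
            score -= 2
        scores[el] = score
    return scores

# e.g.
# doc = html.fromstring(page_source)   # page_source: the raw HTML string
# scores = score_sections(doc)
# main = max(scores, key=scores.get)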
[EDIT2] Looking at readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to understand tables better?
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google App Engine.)
Cheers,
Christian
What is meaningful and what is not depends on the semantics of the page. If the semantics are crappy, your code won't "guess" what is meaningful. I use readability, which you linked in the comment, and I see that on many pages I try it does not provide any result, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpBB forum and you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags