Scraping information from multiple URLs that are different in structure - Python

I would like to scrape multiple URLs, but they are of a different nature, such as different company websites with different HTML back ends. Is there a way to do this without writing customised code for each URL?
I understand that I can put multiple URLs into a list and loop through them.

I fear not, but I am not an expert :-)
I could imagine that it depends on the complexity of the structures. If you want to find the text "Test" on every website, I could imagine that soup.body.findAll(text='Test') would return all occurrences of "Test" on the website.
I assume you're aware of how to loop through a list here, so you'd loop through the list of URLs and, for each one, check whether the searched string occurs (maybe you are looking for something else, e.g. an "apply" button or a "login" link?).
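For illustration, a minimal sketch of that loop might look like the following (the URLs and the search string "Test" are placeholders, and it assumes the pages are plain HTML reachable with requests):

import requests
from bs4 import BeautifulSoup

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder list of URLs

for url in urls:
    page = requests.get(url, timeout=10).text
    soup = BeautifulSoup(page, "html.parser")
    # find_all(string=...) is the current spelling of findAll(text=...)
    matches = soup.find_all(string="Test")
    print(url, "->", len(matches), "occurrence(s) of 'Test'")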
all the best,

Related

How can I exclude the list of web-pages from google-search results?

"minus" sign doesn't fit because the list consists of ~2000 entries.
I'm just beginner in python, so, please explain as to 5-year old, if possible
Thank you much in advance!
Presumably you are fetching the Google search results from a Python program. So you can exclude the web pages in your list in your Python program as you read the results, instead of trying to make Google do it for you. You can use a functional programming technique like calling filter for this.
Ideally you would do this by comparing the URLs of the links, but if you were willing to sacrifice accuracy you could do it by comparing the titles of the links instead, if you only had titles in your list and not URLs. But URLs are definitely better for this purpose.
So you could parse the Google search results using a library like Beautiful Soup, extract the URLs of the links, and filter out (using filter) the ones that are equal to any of the URLs on your list (you could define a function, using def, that checks whether a given URL is on your list). You'll have to be careful, though, because sometimes Google search result links go via a Google website which redirects to the real URL, for ranking purposes.
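A minimal sketch of that filtering step could look like this (the exclusion list and the results_html sample are placeholders; real Google result markup is more involved and changes often):

from bs4 import BeautifulSoup

# Placeholder inputs: your ~2000 excluded URLs and the fetched results page
excluded_urls = {"https://example.com/page1", "https://example.com/page2"}
results_html = "<a href='https://example.com/page1'>skip me</a> <a href='https://example.org/keep'>keep me</a>"

def is_allowed(url):
    # True for links that are NOT on the exclusion list
    return url not in excluded_urls

soup = BeautifulSoup(results_html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
allowed = list(filter(is_allowed, links))
print(allowed)  # ['https://example.org/keep']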

Scrapy, developing scalable spiders - extract Xpath by Element properties

So I'm working on a web-scraping project that essentially pulls a bunch of product information (like price, location, name, etc.) from a list of 20+ websites... So far I have created a generic MasterSpider (similar to what is discussed here: Creating a generic scrapy spider), from which I inherit and override depending on the site's specific architecture.
However, after essentially repeating much code and wanting to make this project scalable, I have started working towards generalizing my MasterSpider so that it could be extended to other websites, and ideally instantiated with minimal arguments, like just the start_url. In other words, instead of locating elements by XPath, which is not consistent across domains, I am now looking for HTML tag attribute values/text values.
This works fine for generic/consistent targets like identifying the category links from the start page (which typically contain the category in the link), but for things like finding the product name, price, etc. it is lacking. Having to build out a list of XPath conditions (like @class = a or b or c, or contains(.,'a') or contains(.,'b')) kind of defeats the purpose.
I realize I could also pass a few XPath conditions to instantiate the spider, which I may just have to do, but I would prefer to make this as easy to use and extensible as possible...
My idea is, before parsing the individual product pages, to issue a dummy request that looks for the information I would like and works backward to identify the XPath of that information, which is then used in the subsequent requests.
So I was wondering if anyone had any good ideas on how to extract the XPath of an element given, let's say, a list of tag values it could contain, or a match on the text within... I realize a series of try/except blocks could work, but again, that would be more of a band-aid than a solution, and not very scalable. If I have to use something like Selenium or a parser to do this, that is also an option...
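(As a rough sketch of that work-backward idea, not from the original thread: lxml can report an absolute XPath for an element it has already located, so a hypothetical helper, with needle standing in for a known product name or price, might look like this.)

from lxml import html

def xpath_for_value(page_source, needle):
    # Hypothetical helper: find the first element whose text contains `needle`
    # and return an absolute XPath for it, which later requests could reuse.
    tree = html.fromstring(page_source)
    for element in tree.iter():
        if element.text and needle in element.text:
            return tree.getroottree().getpath(element)
    return None

# e.g. xpath_for_value(response.text, "Acme Widget") might return
# something like '/html/body/div[2]/div[1]/h1'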
Really open to any ideas or fresh perspectives.
Thanks!
At work, I have to scrape thousands of news websites, and as you might expect there is no one-size-fits-all solution. So our strategy was to have a "generic" method that, through heuristics, would try to extract the needed information, and for troublesome websites we would have a list of specific xpaths for that website.
So our general structure is something like this:
parsers = {
    "domain1": {
        "item1": "//div...",
        "item2": "//div...",
    },
    "domain2": {
        "item1": "//div...",
        "item2": "//div...",
    },
}

def parse(self, response):
    domain = urlparse(response.url).netloc  # urlparse comes from urllib.parse
    try:
        parser = self.parsers[domain]
        return self.parse_with_parser(response, parser)
    except Exception as e:
        return self.parse_generic(response)
I actually keep the parsers dict in a separate file. You could also keep it in a database or a file and load it when the spider starts, so you do not have to edit the crawler every time you need to change something.
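For example, a minimal sketch of loading it from a JSON file at spider start-up might look like this (the parsers.json file name and layout are assumptions):

import json
import scrapy

class MasterSpider(scrapy.Spider):
    name = "master"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Load the per-domain xpath table once, when the spider starts
        with open("parsers.json") as f:
            self.parsers = json.load(f)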
Edit:
Answering the second part of your question, depending on what you need done, you could write xpaths that take into account several conditions. For instance:
"//a[contains(#class, 'foo') or contains(#class, 'bar')]"
Maybe even
"//a[contains(#class, 'foo') or contains(#class, 'bar')] | //div[#class='something'] | //td/span"
The pipe operator "|" will allow you to "chain" different expressions that might contain what you want extracted; it acts as an OR (union) across the different expressions.
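In a spider callback, such a combined expression could be used roughly like this (the spider name, start URL, and field name are placeholder assumptions):

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder

    def parse(self, response):
        # The union expression yields whichever alternative exists on this page
        name = response.xpath(
            "//a[contains(@class, 'foo') or contains(@class, 'bar')]/text()"
            " | //div[@class='something']/text()"
            " | //td/span/text()"
        ).get()
        yield {"product_name": name}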

How to check if URL query strings, fragments, etc actually change webpage content for end users?

I'm writing an app which allows people to discuss a webpage together if they are on the same webpage. The actual app works fine and is interesting to use, but sometimes the app mistakenly believes the two individuals are on different URLs while, for all content/practical purposes, they are on the same page. If I store the entire URL and simply compare it to the other URL that the second user is on, sometimes the URL is clearly different while the webpage content is identical for the end user. Usually this is because sites make use of the query, fragment, and parameter strings in the URL in different ways. For example, https://www.facebook.com/zuck?fref=ts and https://www.facebook.com/zuck should be treated as identical webpages for the purposes of my app, since the end-user content is indiscernibly identical. Facebook uses query strings to understand how you arrived at that particular profile. However, other sites such as YouTube clearly use the query string for the actual content identification, such as https://www.youtube.com/watch?v=dQw4w9WgXcQ, so I can't just write a program that is agnostic to URL query or fragment strings, etc.
What is the best way to approach this webpage-comparison dilemma in Python? I have tried different approaches, such as comparing the source of the two pages using the requests library, but the sources are, as expected, different. Things I've tried are comparisons such as:
if requests.get('https://www.facebook.com/zuck?fref=ts').content == requests.get('https://www.facebook.com/zuck').content:
I assume something in the served ads in the sidebars or headers of the page, etc., is not the same, so a simple '==' comparison never yields True.
Any ideas on how to approach this issue? I really appreciate it.
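One way to tighten the raw '==' comparison above, offered only as a sketch, is to compare the visible text of the two pages rather than the raw bytes (the tag stripping and whitespace normalisation are assumptions, dynamic content can still differ between fetches, and these particular Facebook URLs may simply redirect to a login page when fetched anonymously):

import requests
from bs4 import BeautifulSoup

def visible_text(url):
    # Keep only the human-readable text: drop scripts/styles, collapse whitespace
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return " ".join(soup.get_text().split())

same_page = visible_text("https://www.facebook.com/zuck?fref=ts") == visible_text("https://www.facebook.com/zuck")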

Python re.search with variable in page source

I was wondering how I can search the data that is loaded from a page with a .read(), and search for a variable within that result, inside a while statement. Here is what I wrote:
match2 = re.search(r"=profile\.php\?id=" + str(link) + ">(.+?)</a>, <a href=profile\.php\?id=(.+?)>", home)
The page basically just lists all the user profiles, and I'm trying to make it read each one and view their profile; pretty simple, except that I can't get the user IDs, set as a variable via link = match2.group(2), to work.
This isn't a direct answer, but I strongly suggest you look at Beautiful Soup. It's an HTML parser that lets you search for values in a much more structured manner. In this case, you could loop across all the items in the list of user profiles, and extract the information you want out of each one in turn.
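As a minimal sketch of that suggestion (the home sample string mimics the markup implied by the question's regex; the real page will differ):

from bs4 import BeautifulSoup

# Placeholder for the page source that the question reads with .read()
home = "<a href=profile.php?id=101>Alice</a>, <a href=profile.php?id=102>Bob</a>"

soup = BeautifulSoup(home, "html.parser")
for a in soup.find_all("a", href=True):
    if "profile.php?id=" in a["href"]:
        user_id = a["href"].split("profile.php?id=")[-1]
        print(user_id, a.get_text())  # e.g. "101 Alice", then "102 Bob"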

How can I iterate through the pages of a website using Python?

I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is, I don't know how to iterate through all of the existing pages without knowing the individual urls ahead of time. For example, I want to visit every page whose url starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of urls?
Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data, not extracting it. Instead of copy-pasting the code already in the tutorial, I'll leave it to you to read it.
To grab a specific bit of data from a website you could use a web-scraping tool, e.g., Scrapy.
If the required data is generated by JavaScript, then you might need a browser-like tool such as Selenium WebDriver and implement the crawling of the links by hand.
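For the Scrapy route, a minimal sketch of a spider that follows only links under /questions/ might look like this (the spider name, the callback, and the extracted title field are placeholder assumptions):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class QuestionsSpider(CrawlSpider):
    name = "questions"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions/"]

    # Follow only links whose URL contains /questions/<number> and parse each one
    rules = [
        Rule(LinkExtractor(allow=r"/questions/\d+"), callback="parse_question", follow=True),
    ]

    def parse_question(self, response):
        # Grab one specific bit of data from every page, e.g. the page title
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}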
For example, you can make a simple for loop, like this:
def webIterate():
    base_link = "http://stackoverflow.com/questions/"
    for i in range(24):
        print("%s%d" % (base_link, i))
The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23
It's just an example. You can pass in the number of questions and do whatever you want with the generated URLs.
