I am using Scrapy to scrape book reviews from a site. So far I have made a crawler and scraped the comments for a single book by supplying its URL as the start URL myself, and I even had to find the tags for that book's comments in the page's source code myself. And it worked. But the problem is that I want the work I have been doing manually to happen automatically, i.e. I want the crawler to find a book's page on the website on its own and scrape its comments. I am extracting comments from Goodreads, and it does not provide a uniform pattern for URLs; even the tags differ from book to book. Also, I don't want to use the API; I want to do all the work myself. Any help would be appreciated.
It seems that CrawlSpider can fit your needs.
You can start as follows:
Specify the list of starting URL(s) for the crawler: start_urls = ['https://www.goodreads.com'].
To identify the URLs of book pages you can create the following Rule:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'book/show/.+', )), callback='parse_comments'),
)
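If it helps, here is a minimal sketch of what a spider built around that rule could look like. It uses the current LinkExtractor (SgmlLinkExtractor has since been deprecated in Scrapy), and the spider name and the selector inside parse_comments are placeholders you would adjust after inspecting the page source:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class GoodreadsReviewsSpider(CrawlSpider):
    # Placeholder spider name; adjust to your project.
    name = 'goodreads_reviews'
    allowed_domains = ['goodreads.com']
    start_urls = ['https://www.goodreads.com']

    rules = (
        # Follow anything that looks like a book page and scrape its comments.
        Rule(LinkExtractor(allow=(r'book/show/.+', )), callback='parse_comments', follow=True),
    )

    def parse_comments(self, response):
        # Placeholder selector: replace 'div.review' with the real review markup.
        for review in response.css('div.review'):
            yield {
                'book_url': response.url,
                'text': review.css('::text').get(),
            }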
HtmlAgilityPack helped me with parsing and reading the XPath for the reviews. It worked :)
I am trying to scrape an ecommerce website (Lazada.sg) and I found a GitHub project based on Scrapy: https://github.com/talk2div/lazada-scraper. I'm tinkering with how he developed his code, but I cannot replicate displaying the same search URL in ajax format (correct me if I'm wrong). Here is a sample of the URL requested in Scrapy: https://www.lazada.sg/mother-baby/?ajax=true&page=1&spm=a2o42.searchlistcategory.cate_5b6ee3f0Npltyg.
The searches he made are for baby items. I am trying to replicate that for Lego items. I would be glad to have some help displaying the URL in the same format he used for Scrapy so I can re-use his code for my own use case. Thanks
That is because the links he is querying are part of an enumerated category listed on the page. You just want the search results for a specific keyword, so the query will look a little different, like this:
...
page = 1

def start_requests(self):
    yield scrapy.Request(url=f'https://www.lazada.sg/catalog/?_keyori=ss&ajax=true&from=input&isFirstRequest=true&page={self.page}&q=lego&spm=a2o42.searchlistcategory.search.go.d1c332ab2wBQx9')
This is the link for the first page:
https://www.lazada.sg/catalog/?_keyori=ss&ajax=true&from=input&isFirstRequest=true&page=1&q=lego&spm=a2o42.searchlistcategory.search.go.d1c332ab2wBQx9
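For what it's worth, a rough sketch of a full spider built around that request might look like the following. The spider name is made up, and the JSON key names ('mods', 'listItems') are assumptions taken from the linked repo, so check them against the real response before relying on them:

import scrapy


class LazadaLegoSpider(scrapy.Spider):
    name = 'lazada_lego'  # placeholder name
    page = 1
    url_template = (
        'https://www.lazada.sg/catalog/?_keyori=ss&ajax=true&from=input'
        '&isFirstRequest=true&page={page}&q=lego'
        '&spm=a2o42.searchlistcategory.search.go.d1c332ab2wBQx9'
    )

    def start_requests(self):
        yield scrapy.Request(self.url_template.format(page=self.page))

    def parse(self, response):
        # The ajax endpoint returns JSON rather than HTML.
        data = response.json()
        # Assumed structure: the product list sits under mods -> listItems.
        items = data.get('mods', {}).get('listItems', [])
        for item in items:
            yield {'name': item.get('name'), 'price': item.get('price')}

        # Keep paging until a page comes back empty.
        if items:
            self.page += 1
            yield scrapy.Request(self.url_template.format(page=self.page))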
Sorry if this is not a valid question; I personally feel it kind of borders on the edge.
Assuming the website involved has given full permission
How could I download the ENTIRE contents (HTML) of that website using a Python data scraper? By entire contents I mean not only the current page you are on, but any other directory that branches off of that main website. E.g.
Using the link:
https://www.dogs.com
could I pull info from:
https://www.dogs.com/about-us
and any other directory attached to the "https://www.dogs.com/"
(I have no idea if dogs.com is a real website or not, it's just an example.)
I have already made a scraper that will pull info from a single given link (nothing further than that), but I want to improve it so I don't have to supply heaps of links. I understand I could use an API, but if this is possible I would rather do it this way. Cheers!
While there is Scrapy to do it professionally, you can use requests to fetch the URL's data and bs4 (Beautiful Soup) to parse the HTML and look into it. It's also easier for a beginner, I guess.
Whichever way you go, you need a starting point; then you just follow the links in that page, and then the links within those pages.
You might need to check whether a URL links to another website or is still within the targeted website. Find the pages one by one and scrape them.
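As a rough illustration of that idea (a sketch, not a polished crawler), something like this keeps a queue of pages, visits each one once, and only follows links that stay on the starting domain; the start URL is just a placeholder:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start_url = 'https://www.example.com'   # placeholder
domain = urlparse(start_url).netloc

to_visit = [start_url]
seen = set()

while to_visit:
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # ...scrape whatever you need from `soup` here...

    for a in soup.find_all('a', href=True):
        link = urljoin(url, a['href'])
        # Only follow links that are still on the targeted website.
        if urlparse(link).netloc == domain and link not in seen:
            to_visit.append(link)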
I am starting with the URL below:
http://www.imdb.com/chart/top
The structure of the HTML file seems confusing; the relevant block in the page source is a Metascore: label followed by the value I want.
I am trying to use a format like this:
movie['metascore'] = self.get_text(soup.find('h4', attrs={' ':'Metascore'}))
I'll take a stab at this since it sounds like you're new to scraping. What it sounds like you're actually trying to do is to get the budget, gross, and metascore from each of the individual 250 movie pages on IMDB. You're on the right track by mentioning Scrapy because you do have to crawl to those pages from the initial URL you provided. Scrapy has some excellent documentation, so if you want to use it, I highly recommend you start there first.
However, if all you need is to scrape those 250 pages, you're better off just using Beautiful Soup to do the whole job. Simply do a soup.findAll("td", {"class":"titleColumn"}), extract the links, then do a loop where you have Beautiful Soup open each of those pages one at a time. If you're not sure how to do that, again, BS has excellent documentation.
From there, it's just a matter of scraping the relevant data you want during each iteration. For instance, the metascore of each film is inside a <div> with the class star-box-details. Do a .find for that, and then you'll have to use some regular expressions to extract the exact piece you want (regular-expressions.info has a great tutorial on regex, and if you really get into regex, you'll probably end up sinking hours into RexEgg).
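To make that concrete, here is a rough sketch of that loop. The class names ('titleColumn', 'star-box-details') are the ones mentioned in this answer, and IMDB's markup may have changed since, so treat them as assumptions:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = 'http://www.imdb.com'
top = BeautifulSoup(requests.get(base + '/chart/top').text, 'html.parser')

for cell in top.findAll('td', {'class': 'titleColumn'}):
    link = urljoin(base, cell.a['href'])
    movie_page = BeautifulSoup(requests.get(link).text, 'html.parser')

    # The metascore is said to live inside a div with class 'star-box-details';
    # from there you would use a regular expression to pull out the number.
    details = movie_page.find('div', {'class': 'star-box-details'})
    if details:
        print(link, details.get_text(' ', strip=True))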
I'm not going to code the whole thing since you'll learn a lot through the trial and error that comes with attempting to solve things, but hopefully that puts you on the right track. However, do note that IMDB forbids scraping, but for small projects I'm sure no one will care. But if you want to get serious, the "Does IMDB provide an API?" post has some excellent resources for how to do it via various third-party APIs (and some even directly from IMDB). In your case, the best might be to simply download the data as text files directly from IMDB. Click on any of the FTP links. The files you'll probably want are business.list.gz and ratings.list.gz. As for the metascore on each movie page, that rating actually comes from Metacritic, so you'll want to go there to pull that data.
Good luck!
I am trying to build a CrawlSpider with Scrapy with the following features.
Basically, my start URL contains various lists of URLs which are divided up into sections. I want to scrape just the URLs from a specific section and then crawl them.
To do this, I defined my link extractor using restrict_xpaths, in order to isolate the links I want to crawl from the rest.
However, because of the restrict_xpaths, when the spider tries to crawl a link which is not the start url, it stops, since it does not find any links.
So I tried to add another rule, which is supposed to ensure that the links outside the start URL get crawled, through the use of deny_domains applied to the start_url. However, this solution is not working.
Can anyone suggest a possible strategy?
Right now my rules are:
rules = {
    Rule(LinkExtractor(restrict_xpaths=(".//*[@id='mw-content-text']/ul[19]", )), callback='parse_items', follow=True),
    Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),
}
You're defining a Set by using {} around the pair of rules. Try making it a tuple with ():
rules = (
    Rule(LinkExtractor(restrict_xpaths=(".//*[@id='mw-content-text']/ul[19]", )), callback='parse_items', follow=True),
    Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),
)
Beyond that, you might want to pass unique=True to the link extractors to make sure that any links back to the "start url" are not followed. See BaseSgmlLinkExtractor.
Also, using 'parse_items' as the callback for both LinkExtractors is a bit of a smell. Based on your explanation, I can't see that the first extractor would need a callback at all; it's just extracting links that should be added to the queue for the scraper to go fetch, right?
The real scraping for data that you want to use/persist generally happens in the 'parse_items' callback (at least that's the convention used in the docs).
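Putting those two points together, a sketch of that conventional layout could look like this; the spider name and start URL are placeholders, and the deny_domains value is left as the stand-in from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ItemsSpider(CrawlSpider):
    name = 'items'  # placeholder
    start_urls = ['http://example.com/start-page']  # placeholder

    rules = (
        # First rule: just pull the links out of the relevant section of the
        # start page and queue them -- no callback needed.
        Rule(LinkExtractor(restrict_xpaths=(".//*[@id='mw-content-text']/ul[19]", )), follow=True),
        # Second rule: on the pages reached through those links, do the real
        # scraping in parse_items.
        Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        # Placeholder extraction logic.
        yield {'url': response.url}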
I am writing a program in Python to extract all the URLs from a given website: all the URLs from the site, not just from a single page.
As I suppose I am not the first one who wants to do this, I was wondering if there is a ready-made solution or if I have to write the code myself.
It's not gonna be easy, but a decent starting point would be to look into these two libraries:
urllib
BeautifulSoup
I didn't see any ready-made scripts that do this on a quick Google search.
Using the Scrapy framework makes this almost trivial.
The time-consuming part would be learning how to use Scrapy. Their tutorials are great, though, and shouldn't take you that long.
http://doc.scrapy.org/en/latest/intro/tutorial.html
Creating a solution that others can use is one of the joys of being part of a programming community. If a scraper doesn't exist, you can create one that everyone can use to get all the links from a site!
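To give a sense of how little code that takes, here is a minimal sketch of a Scrapy spider that yields every link it finds while staying on one site; the domain is a placeholder:

import scrapy


class AllLinksSpider(scrapy.Spider):
    name = 'all_links'
    allowed_domains = ['example.com']        # placeholder
    start_urls = ['https://www.example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            yield {'url': url}
            # Follow the link so that links on deeper pages are collected too;
            # allowed_domains keeps the crawl on the target site.
            yield scrapy.Request(url, callback=self.parse)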
The given answers are what I would have suggested (+1).
But if you really want to do something quick and simple, and you're on a *NIX platform, try this:
lynx -dump YOUR_URL | grep http
where YOUR_URL is the URL that you want to check. This should get you all the links you want (except for links that are not fully written out, i.e. relative links).
You first have to download the page's HTML content using a package like urllib or requests.
After that, you can use Beautiful Soup to extract the URLs. In fact, their tutorial shows how to extract all links enclosed in <a> elements as a specific example:
for link in soup.find_all('a'):
print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
If you also want to find links not enclosed in <a> elements, you may have to write something more complex on your own.
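As a small end-to-end sketch of those two steps (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.example.com').text
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a', href=True):
    print(link['href'])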
EDIT: I also just came across two Scrapy link extractor classes that were created specifically for this task:
http://doc.scrapy.org/en/latest/topics/link-extractors.html