I am scraping a webpage from Wikipedia (particularly this one) using a Python library called Scrapy. Here was the original code:
import scrapy
from wikipedia.items import WikipediaItem

class MySpider(scrapy.Spider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org/"]
    start_urls = [
        'https://en.wikipedia.org/wiki/Category:2013_films',
    ]

    def parse(self, response):
        titles = response.xpath('//div[@id="mw-pages"]//li')
        items = []
        for title in titles:
            item = WikipediaItem()
            item["title"] = title.xpath("a/text()").extract()
            item["url"] = title.xpath("a/@href").extract()
            items.append(item)
        return items
Then in the terminal, I ran scrapy crawl wiki -o wiki.json -t json to output the data to a JSON file. While the code worked, the links assigned to the "url" keys were all relative links. (i.e.: {"url": ["/wiki/9_Full_Moons"], "title": ["9 Full Moons"]}).
Instead of /wiki/9_Full_Moons, I needed http://en.wikipedia.org/wiki/9_Full_Moons. So I modified the above code to import urljoin from the urlparse module, and I changed my for loop to look like this instead:
for title in titles:
    item = WikipediaItem()
    url = title.xpath("a/@href").extract()
    item["title"] = title.xpath("a/text()").extract()
    item["url"] = urljoin("http://en.wikipedia.org", url[0])
    items.append(item)
return(items)
I believed this was the correct approach, since the data assigned to the url key is enclosed in brackets (which would mean it's a list, right?), so to get the string inside it I typed url[0]. However, this time I got an IndexError that looked like this:
IndexError: list index out of range
Can someone help explain where I went wrong?
So after mirroring the code to the example given in the documentation here, I was able to get the code to work:
def parse(self, response):
    for text in response.xpath('//div[@id="mw-pages"]//li/a/text()').extract():
        yield WikipediaItem(title=text)

    for href in response.xpath('//div[@id="mw-pages"]//li/a/@href').extract():
        link = urljoin("http://en.wikipedia.org", href)
        yield WikipediaItem(url=link)
If anyone needs further clarification on how the Items class works, the documentation is here.
Furthermore, although the code works, it won't pair the title with its respective link. So it will give you
TITLE, TITLE, TITLE, LINK, LINK, LINK
instead of
TITLE, LINK, TITLE, LINK, TITLE, LINK
(the latter being probably the more desired result) — but that's for another question. If anyone has a proposed solution that works better than mine, I'll be more than happy to listen to your answers! Thanks.
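For what it's worth, one way to keep each title paired with its link is to iterate over the li elements (as in the original loop) and yield a single item carrying both fields. A minimal sketch, assuming a Scrapy version that provides extract_first() and the same WikipediaItem with title and url fields:

from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

def parse(self, response):
    # Each li holds one film entry, so pulling both values out of the same
    # selector keeps the title and its link together in a single item.
    for li in response.xpath('//div[@id="mw-pages"]//li'):
        item = WikipediaItem()
        item["title"] = li.xpath("a/text()").extract_first()
        href = li.xpath("a/@href").extract_first()
        if href:
            item["url"] = urljoin("http://en.wikipedia.org", href)
        yield item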
I think you can just concatenate the two strings instead of using urljoin. Try this:
for title in titles:
    item = WikipediaItem()
    item["title"] = title.xpath("a/text()").extract()
    item["url"] = "http://en.wikipedia.org" + title.xpath("a/@href").extract()[0]
    items.append(item)
return(items)
On your first iteration of the code with relative links, you used the xpath method: item["url"] = title.xpath("a/@href").extract()
The object returned is (I assume) a list of strings, so indexing it would be valid.
In the new iteration, you used the select method: url = title.select("a/@href").extract(). Then you treated the returned object as something indexable, with url[0]. Check what the select method returns; maybe it's a list, like in the previous example.
P.S.: IPython is your friend.
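Following up on that check: a minimal, hedged sketch that avoids the IndexError by only indexing the extracted list when it is non-empty (some li entries on the category page may not contain an a element):

for title in titles:
    item = WikipediaItem()
    item["title"] = title.xpath("a/text()").extract()
    hrefs = title.xpath("a/@href").extract()  # always a list, possibly empty
    if hrefs:  # guard against list items without a link
        item["url"] = urljoin("http://en.wikipedia.org", hrefs[0])
    items.append(item)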
Related
I am trying to scrape the data about the circulars from my college's website using Scrapy for a project, but my spider is not scraping the data properly. There are a lot of blank elements, and I am also unable to scrape the 'href' attributes of the circulars for some reason. I am assuming that my CSS selectors are wrong, but I am unable to figure out what exactly I am doing wrong. I copied my CSS selectors using the 'Selector Gadget' Chrome extension. I am still learning Scrapy, so it would be great if you could explain what I was doing wrong.
The Website I am scraping data from is : https://www.imsnsit.org/imsnsit/notifications.php
My code is :
import scrapy
from ..items import CircularItem

class CircularSpider(scrapy.Spider):
    name = "circular"
    start_urls = [
        "https://www.imsnsit.org/imsnsit/notifications.php"
    ]

    def parse(self, response):
        items = CircularItem()
        all = response.css('tr~ tr+ tr font')
        for x in all:
            cirName = x.css('a font::text').extract()
            cirLink = x.css('.list-data-focus a').attrib['href'].extract()
            date = x.css('tr~ tr+ tr td::text').extract()
            items["Name"] = cirName
            items["href"] = cirLink
            items["Date"] = date
            yield items
I modified your parse callback function and changed the CSS selectors into XPath. Also, try to learn XPath selectors; they are very powerful and easy to use.
Generally, it is a bad idea to copy CSS or XPath selectors generated by automatic tools, because in some cases they give you incorrect results or just one element without a general path.
First of all, I select all tr elements. If you look carefully, some of the tr elements are just blank, used as separators. You can filter them out by trying to select the date: if it is None, you can just skip the row. Finally, you can select cirName and cirLink.
Also, the markup of the given website is not good and it is really hard to write proper selectors; the elements don't have many attributes, like class or id. This is the solution I came up with; I know it is not perfect.
def parse(self, response):
    items = CircularItem()
    all = response.xpath('//tr')  # select all table rows
    for x in all:
        date = x.xpath('.//td/font[@size="3"]/text()').get()  # filter them by date
        if not date:
            continue
        cirName = x.xpath('.//a/font/text()').get()
        cirLink = x.xpath('.//a[@title="NOTICES / CIRCULARS"]/@href').get()
        items["Name"] = cirName
        items["href"] = cirLink
        items["Date"] = date
        yield items
I have recently started to use python and scrapy.
I have been trying to use scrapy to start at either a movie or actor wiki page, save the name and cast or filmography and traverse through the links in the cast or filmography sections to other actor/movie wiki pages.
However, I have no idea how rules work (edit: ok, this was a bit of hyperbole) and the wiki links are extremely nested. I saw that you can limit by xpath and give id or class but most of the links I want don't seem to have a class or id. I also wasn't sure if xpath also includes the other siblings and children.
Therefore I would like to know what rules to use to limit the non-relevant links and only go to cast and filmography links.
Edit: Clearly, I should have explained my question better. It's not that I don't understand XPaths and rules at all (that was a bit of hyperbole because I was getting frustrated), but I'm clearly not completely clear on how they work. First, let me show what I have so far, and then clarify where I am having trouble.
import logging
from bs4 import BeautifulSoup
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor, re
from scrapy.exceptions import CloseSpider
from Assignment2_0.items import Assignment20Item

logging.basicConfig(filename='spider.log', level=logging.DEBUG)

class WikisoupSpiderSpider(CrawlSpider):
    name = 'wikisoup_spider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Keira_Knightley']

    rules = (
        Rule(LinkExtractor(restrict_css='table.wikitable')),
        Rule(LinkExtractor(allow=('(/wiki/)',), ),
             callback='parse_crawl', follow=True))

    actor_counter = 0
    actor_max = 250
    movie_counter = 0
    movie_max = 125

    def parse_crawl(self, response):
        items = []
        soup = BeautifulSoup(response.text, 'lxml')
        item = Assignment20Item()
        occupations = ['Actress', 'Actor']
        logging.debug(soup.title)
        tempoccu = soup.find('td', class_='role')
        logging.warning('tempoccu only works for pages of people')
        tempdir = soup.find('th', text='Directed by')
        logging.warning('tempdir only works for pages of movies')

        if (tempdir is not None) and self.movie_counter < self.movie_max:
            logging.info('Found movie and do not have enough yet')
            item['moviename'] = soup.h1.text
            logging.debug('name is ' + item['moviename'])
            finder = soup.find('th', text='Box office')
            gross = finder.next_sibling.next_sibling.text
            gross_float = re.findall(r"[-+]?\d*\.\d+|\d+", gross)
            item['netgross'] = float(gross_float[0])
            logging.debug('Net gross is ' + gross_float[0])
            finder = soup.find('div', text='Release date')
            date = finder.parent.next_sibling.next_sibling.contents[1].contents[1].contents[1].get_text(" ")
            date = date.replace(u'\xa0', u' ')
            item['releasedate'] = date
            logging.debug('released on ' + item['releasedate'])
            item['type'] = 'movie'
            items.append(item)

        elif (tempoccu is not None) and (any(occu in tempoccu for occu in occupations)) and self.actor_counter < self.actor_max:
            logging.info('Found actor and do not have enough yet')
            item['name'] = soup.h1.text
            logging.debug('name is ' + item['name'])
            temp = soup.find('span', class_='noprint ForceAgeToShow').text
            age = re.findall('\d+', temp)
            item['age'] = int(age[0])
            logging.debug('age is ' + age[0])
            filmo = []
            finder = soup.find('span', id='Filmography')
            for x in finder.parent.next_sibling.next_sibling.find_all('i'):
                filmo.append(x.text)
            item['filmography'] = filmo
            logging.debug('has done ' + filmo[0])
            item['type'] = 'actor'
            items.append(item)

        elif (self.movie_counter == self.movie_max and self.actor_counter == self.actor_max):
            logging.info('Found enough data')
            raise CloseSpider(reason='finished')

        else:
            logging.info('irrelevant data')
            pass

        return items
Now, my understanding of the rules in my code is that they should allow all wiki links, but take links only from table tags and their children. This is clearly not what was happening, since the spider very quickly crawled away from movies.
I'm clear on what to do when each element has an identifier like an id or class, but when inspecting the page, the links are buried in multiple nests of id-less tags which don't all seem to follow a single pattern (I would use a plain XPath, but different pages have different paths to the filmography, and it didn't seem like finding the path to the table under h2=Filmography would include all the links in the tables below it). Therefore I wanted to know more about how I could get Scrapy to use only the Filmography links (on actor pages, anyway).
I apologize if this was an obvious thing, I have started using both python and scrapy/xpath/css only 48 hours ago.
First, you need to know where to look, that is, which tags you have to filter, so you have to inspect the HTML code of the page. Regarding libraries, I would use:
import requests  # to make the connections
from bs4 import BeautifulSoup as bs  # to parse the HTML
Example:
soup = bs(html_code, "html.parser")  # instantiate the parser object (html_code holds the page's HTML)
select_tags = soup('select')  # look for the tags you want to filter
Then you should loop over that list and add a condition, like this:
for i in select_tags:
    print i.get('class'), type(i.get('class'))
    if type(i.get('class')) is list and '... name you look for ...' in i.get('class'):
        # handle the matching tag here
In this case you can filter the select tags you want by their 'class' attribute.
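To make that concrete, here is a small self-contained sketch of the same idea; the URL is taken from the question's start_urls, and the 'wikitable' class is only an illustrative assumption:

import requests
from bs4 import BeautifulSoup

# Fetch a page and keep only the tags whose class list contains a given name.
# 'wikitable' is just an example; substitute whatever class you found while
# inspecting the HTML of your target page.
response = requests.get("https://en.wikipedia.org/wiki/Keira_Knightley")
soup = BeautifulSoup(response.text, "html.parser")

for tag in soup.find_all(True):     # iterate over every tag in the document
    classes = tag.get('class')      # None, or a list of class names
    if isinstance(classes, list) and 'wikitable' in classes:
        print(tag.name, classes)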
If I understand correctly what you want, you will probably need to combine your two rules into one, using both allow and restrict_xpath/restrict_css.
So, something like:
rules = [
    Rule(LinkExtractor(allow=['/wiki/'], restrict_xpaths=['xpath']),
         callback='parse_crawl',
         follow=True)
]
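For reference, a minimal sketch of how that combined rule might sit inside the spider; the restrict_xpaths expression and the newer (non-contrib) import paths are assumptions to adapt to the tables you actually want to follow:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WikisoupSpiderSpider(CrawlSpider):
    name = 'wikisoup_spider'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Keira_Knightley']

    rules = [
        # A single rule that both whitelists article links and limits
        # extraction to links found inside wikitable-style tables.
        Rule(LinkExtractor(allow=['/wiki/'],
                           restrict_xpaths=['//table[contains(@class, "wikitable")]']),
             callback='parse_crawl',
             follow=True),
    ]

    def parse_crawl(self, response):
        # ... same parsing logic as before ...
        pass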
Scraping wikipedia is usually pretty complicated, especially if trying to access very specific data.
There are a few problems I see for this particular example:
The data lacks structure - it's just a bunch of text in sequence, meaning your xpaths are going to be pretty complicated. For example, to select the 3 tables you want, you might need to use:
//table[preceding-sibling::h2[1][contains(., "Filmography")]]
You only want to follow links from the Title column (the second one); however, due to the way HTML tables are defined, this might not always be represented by the second td of a row.
This means you'll probably need some additional logic, either in your xpath or in your code.
IMO the biggest problem: the lack of consistency. For example, take a look at https://en.wikipedia.org/wiki/Gerard_Butler#Filmography. No tables there, just a list and a link to another article. Basically, you get no guarantee about naming, positioning, layout, or display of information.
Those notes might get you started, but getting this information is going to be a big task.
My recommendation and personal choice would be to obtain the data you want from a more specialized source, instead of trying to scrape a website as generalized as wikipedia.
It is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution to my problem but I just can't see what is wrong with this code.
My goal is to scrape all of the opera shows from a given website. Data for every show is inside one div with class "row-fluid row-performance ". I am trying to iterate over them to retrieve the data, but it doesn't work. It gives me the content of the first div in each iteration (I am getting the same show 19 times, instead of different items).
import scrapy
from ..items import ShowItem

class OperaSpider(scrapy.Spider):
    name = "opera"
    allowed_domains = ["http://www.opera.krakow.pl"]
    start_urls = [
        "http://www.opera.krakow.pl/pl/repertuar/na-afiszu/listopad"
    ]

    def parse(self, response):
        divs = response.xpath('//div[@class="row-fluid row-performance "]')
        for div in divs:
            item = ShowItem()
            item['title'] = div.xpath('//h2[@class="item-title"]/a/text()').extract()
            item['time'] = div.xpath('//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract()
            item['date'] = div.xpath('//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract()
            yield item
Try changing the XPaths inside the for loop to start with .//. That is, just put a dot in front of the double slash so each expression is relative to the current div. You can also try using extract_first() instead of extract() and see if that gives you better results.
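A hedged sketch of what that change might look like in the loop above (same fields, just relative XPaths and extract_first()):

def parse(self, response):
    divs = response.xpath('//div[@class="row-fluid row-performance "]')
    for div in divs:
        item = ShowItem()
        # The leading dot makes each expression relative to the current div,
        # so every iteration reads its own show instead of the first one.
        item['title'] = div.xpath('.//h2[@class="item-title"]/a/text()').extract_first()
        item['time'] = div.xpath('.//div[@class="item-time vertical-center"]/div[@class="vcentered"]/text()').extract_first()
        item['date'] = div.xpath('.//div[@class="item-date vertical-center"]/div[@class="vcentered"]/text()').extract_first()
        yield item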
I am new to Scrapy and I am trying to scrape the Ikea website, starting with the basic page with the list of locations given here.
My items.py file is given below:
import scrapy

class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
And the spider is given below:
import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
On running the file I am not getting any output. The json file output is something like:
[[{"link": [], "name": []}
The output that I am looking for is the name of the location and the link. I am getting nothing.
Where am I going wrong?
There is a simple mistake inside the xpath expressions for the item fields. The loop is already going over the a tags, so you don't need to specify a again in the inner xpath expressions. In other words, you are currently searching for a tags inside the a tags inside the td inside the tr, which obviously results in nothing.
Replace a/text() with text() and a/@href with @href.
(tested - works for me)
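In code, the suggested replacement looks roughly like this (a quick sketch of the fix described above):

def parse(self, response):
    for sel in response.xpath('//tr/td/a'):
        item = IkeaItem()
        # sel already points at an <a> element, so select relative to it.
        item['name'] = sel.xpath('text()').extract()
        item['link'] = sel.xpath('@href').extract()
        yield item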
use this....
item['name'] = sel.xpath('//a/text()').extract()
item['link'] = sel.xpath('//a/@href').extract()
I've tried using BeautifulSoup and regex to extract URLs from a web page. This is my code:
Ref_pattern = re.compile('<TD width="200"><A href="(.*?)" target=')
Ref_data = Ref_pattern.search(web_page)
if Ref_data:
    Ref_data.group(1)

data = [item for item in csv.reader(output_file)]
new_column1 = ["Reference", Ref_data.group(1)]
new_data = []

for i, item in enumerate(data):
    try:
        item.append(new_column1[i])
    except IndexError, e:
        item.append(Ref_data.group(1)).next()
    new_data.append(item)
Though it has many URLs in it, it just repeats the first URL. I know there's something wrong with
except IndexError, e:
    item.append(Ref_data.group(1)).next()
this part, because if I remove it, it just gives me the first URL (without repetition). Could you please help me extract all the URLs and write them to a CSV file?
Thank you.
Although it's not entirely clear what you're looking for, based on what you've stated, if there are specific elements (classes or id's or text, for instance) associated with the links you're attempting to extract, then you can do something like the following:
from bs4 import BeautifulSoup

string = """\
<a href="http://example.com/one" class="pooper">Linked Text</a>
<a href="http://example.com/two" class="pooper">Linked Text</a>
<a href="http://example.com/three" class="pooper">Image</a>
<a href="http://example.com/four">Phone Number</a>"""

soup = BeautifulSoup(string)
for link in soup.findAll('a', { "class" : "pooper" }, href=True, text='Linked Text'):
    print link['href']
As you can see, I am using bs4's attribute feature to select only those anchor tags that include the "pooper" class (class="pooper"), and then I am further narrowing the return values by passing a text argument (Linked Text rather than Image).
Based on your feedback below, try the following code. Let me know.
for cell in soup.select('td[width="200"]'):
    for link in cell.findAll('a', { "target" : "_blank" }, href=True):
        print link['href']