Extract URLs recursively from website archives in scrapy

Hi, I want to crawl the data from http://economictimes.indiatimes.com/archive.cms. All the URLs are archived by day, month and year. To get the list of URLs first, I am using the code from https://github.com/FraPochetti/StocksProject/blob/master/financeCrawler/financeCrawler/spiders/urlGenerator.py, which I modified for my website as follows:
import scrapy
import urllib
def etUrl():
    totalWeeks = []
    totalPosts = []
    url = 'http://economictimes.indiatimes.com/archive.cms'
    data = urllib.urlopen(url).read()
    hxs = scrapy.Selector(text=data)
    months = hxs.xpath('//ul/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news.cms')
    admittMonths = 12*(2013-2007) + 8
    months = months[:admittMonths]
    for month in months:
        data = urllib.urlopen(month).read()
        hxs = scrapy.Selector(text=data)
        weeks = hxs.xpath('//ul[@class="weeks"]/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news/day\\d+\.cms')
        totalWeeks += weeks
    for week in totalWeeks:
        data = urllib.urlopen(week).read()
        hxs = scrapy.Selector(text=data)
        posts = hxs.xpath('//ul[@class="archive"]/li/h1/a/@href').extract()
        totalPosts += posts
    with open("eturls.txt", "a") as myfile:
        for post in totalPosts:
            post = post + '\n'
            myfile.write(post)
etUrl()
I saved the file as urlGenerator.py and ran it with the command $ python urlGenerator.py
I am getting no result. Could someone help me adapt this code to my website, or suggest another solution?

Try stepping through your code one line at a time using pdb. Run python -m pdb urlGenerator.py and follow the instructions for using pdb in the linked page.
If you step through your code line by line you can immediately see that the line
data = urllib.urlopen(url).read()
is failing to return something useful:
(pdb) print(data)
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://economictimes.indiatimes.com/archive.cms" on this server.<P>
Reference #18.6057c817.1508411706.1c3ffe4
</BODY>
</HTML>
It seems that they are not allowing access by Python's urllib. As pointed out in the comments you really shouldn't be using urllib anyways--Scrapy is already adept at dealing with this.
A lot of the rest of your code is clearly broken as well. For example this line:
months = hxs.xpath('//ul/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news.cms')
returns an empty list even given the real HTML from this site. If you look at the HTML, the links are clearly in a table, not in unordered lists (<ul>). You also have the URL format wrong. Instead, something like this would work:
months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+.cms')
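A pattern like this can be checked without touching the network by running it against a few sample href values. The sketch below uses only the standard re module; the sample paths are made up for illustration, and the dot before "cms" is escaped so that it matches only a literal ".".

```python
import re

# Pattern for month-archive links, with the dot before "cms" escaped
MONTH_LINK = re.compile(r'/archive/year-\d+,month-\d+\.cms')

# Hypothetical href values; the real page may use different paths
sample_hrefs = [
    '/archive/year-2013,month-8.cms',
    '/archive/year-2007,month-12.cms',
    '/archive.cms/2013-8/news.cms',   # the format the question's regex assumed
    '/markets/stocks/news',
]

# Keep only the hrefs that look like month-archive links
month_links = [href for href in sample_hrefs if MONTH_LINK.search(href)]
print(month_links)
```

Only the first two sample paths are kept, which is the behaviour the XPath-plus-regex line relies on.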
If you want to build a web scraper, rather than starting from some code you found (that isn't even correct) and trying to blindly modify it, try following the official tutorial for Scrapy and start with some very simple examples, then build up from there. For example:
import scrapy
from scrapy.crawler import CrawlerProcess
class EtSpider(scrapy.Spider):
    name = 'et'
    start_urls = ["https://economictimes.indiatimes.com/archive.cms"]
    def parse(self, response):
        # Collect the relative links to the per-month archive pages
        months = response.xpath('//table//tr//a/@href').re(r'/archive/year-\d+,month-\d+.cms')
        for month in months:
            self.logger.info(month)
# Run the spider from a plain script instead of the scrapy command line
process = CrawlerProcess()
process.crawl(EtSpider)
process.start()
This runs correctly, and you can see it finding the correct URLs for the individual months, as printed to the log. From there you can use callbacks, as explained in the documentation, to make additional requests.
In the end you'll save yourself a lot of time and hassle by reading the docs and getting some understanding of what you're doing rather than taking some dubious code off the internet and trying to shoehorn it into your problem.

Related

Python Scrapy: saving to csv/json does not encode Latin2 properly

I am new to Scrapy, and I built a simple spider that scrapes my local news site for titles and amount of comments. It scrapes well, but I have a problem with my language encoding.
I have created a Scrapy project that I then run through anaconda prompt to save the output to a file like so (from the project directory):
scrapy crawl MySpider -o test.csv
I then open the csv file with the following code:
with open('test.csv', 'r', encoding = "L2") as f:
    file = f.read()
I also tried saving it to json, opening in excel, changing to different encodings from there ... always unreadable, but the characters differ. I am Czech if that is relevant. I need characters like ěščřžýáíé etc., but it is Latin.
What I get: Varuje pĹ\x99ed
What I want: Varuje před
Here is my spider code. I did not change anything in settings or pipeline, though I tried multiple tips from other threads that do this. I spent 2 hours on this already, browsing stack overflow and documentation and I can't find the solution, it's becoming a headache for me. I'm not a programmer so this may be the reason... anyway:
import scrapy
urls = []
for number in range(1,101):
    urls.append('https://www.idnes.cz/zpravy/domaci/'+str(number))
class MySpider(scrapy.Spider):
    name = "MySpider"
    def start_requests(self):
        urls = ['https://www.idnes.cz/zpravy/domaci/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)
    def parse_main(self, response):
        articleBlocks = response.xpath('//div[contains(@class,"art")]')
        articleLinks = articleBlocks.xpath('.//a[@class="art-link"]/@href')
        linksToFollow = articleLinks.extract()
        for url in linksToFollow:
            yield response.follow(url = url, callback = self.parse_arts)
            print(url)
    def parse_arts(self, response):
        for article in response.css('div#content'):
            yield {
                'title': article.css('h1::text').get(),
                'comments': article.css('li.community-discusion > a > span::text').get(),
            }

Scrapy saves feed exports with utf-8 encoding by default.
Opening the file with the correct encoding displays the characters fine.
If you want to change the encoding used, you can do so by using the FEED_EXPORT_ENCODING setting (or using FEEDS instead).
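The garbled output in the question is what UTF-8 bytes look like when they are decoded as Latin-2. A minimal sketch reproducing this with plain Python (no Scrapy involved; the sample string is taken from the question):

```python
# "Varuje před" written as UTF-8, the default encoding of Scrapy feed exports
raw = 'Varuje před'.encode('utf-8')

# Decoding those bytes as Latin-2 (ISO 8859-2) turns "ř" into two wrong characters
garbled = raw.decode('iso-8859-2')
print(repr(garbled))

# Decoding them as UTF-8 gives the original text back
restored = raw.decode('utf-8')
print(restored)
```

So reading the export with open('test.csv', encoding='utf-8') should be enough. If a different file encoding is really needed, a settings.py line such as FEED_EXPORT_ENCODING = 'iso-8859-2' is one way to request it.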

After one more hour of trial and error, I solved this. The problem was not in Scrapy, it was correctly saving in utf-8, the problem was in the command:
scrapy crawl idnes_spider -o test.csv
that I ran to save it. When I run the command:
scrapy crawl idnes_spider -s FEED_URI=test.csv -s FEED_FORMAT=csv
It works.

Empty list most of the time outputted when trying to find first link when getting links from youtube (Python)

I followed a tutorial in order to learn how to extract the first link from YouTube for a given query. I implemented the code in a function like so:
import urllib.request
import re
def GetBestYoutubeLink(MusicRequest):
    MusicSearchLink = MusicRequest.replace(" ","+")
    MusicSearchLink = "https://www.youtube.com/results?search_query=" + MusicSearchLink
    HTMLContent = urllib.request.urlopen(MusicSearchLink)
    SearchResults = re.findall(r'href=\"\/watch\?v=(.{11})', HTMLContent.read().decode())
    print(SearchResults)
    BestLink = "http://www.youtube.com/embed/" + SearchResults[0]
    return BestLink
A query is passed into the function, which should return the first/best URL. The problem is that, most of the time, the SearchResults list is empty when printed, so I cannot get the first URL. The queries are not uncommon: I tried popular songs and videos, but the list still comes back empty, although occasionally it works and gives the correct best link. To work around this, I added the following statement between the line that prints SearchResults and the line that defines BestLink:
if SearchResults == []:
    print(SearchResults)
    MusicPlayer(MusicRequest)
If the SearchResults list is empty, this runs the function again. However, it is sometimes rerun 20 to 30 times, printing an empty list each time, which is not at all efficient. I would like to understand why my list is empty most of the time but occasionally populated, and how I can fix this.
My current Python version is 3.6 and I am running macOS Catalina.

I think the format of the returned page has changed since this tutorial was written. If you print HTMLContent.read().decode() you can see that the URLs are in the form "url":"/watch?v=0755SXCTCN0"
I changed your code accordingly. You also index SearchResults[0], which doesn't exist when the list is empty.
import urllib.request
import re
def GetBestYoutubeLink(MusicRequest):
    MusicSearchLink = MusicRequest.replace(" ","+")
    MusicSearchLink = "https://www.youtube.com/results?search_query=" + MusicSearchLink
    HTMLContent = urllib.request.urlopen(MusicSearchLink)
    SearchResults = re.findall(r'/watch\?v=(.{11})', HTMLContent.read().decode())
    print(SearchResults)
    BestLink = "http://www.youtube.com/embed/" + SearchResults[0]
    return BestLink
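Since the result list can still be empty, indexing SearchResults[0] directly can raise an IndexError. Below is a minimal sketch of a guarded lookup, tried on a made-up page fragment rather than a live request. The names first_video_id and sample_html are hypothetical and not part of the original code.

```python
import re

# Same pattern as in the answer: capture the 11-character video id
VIDEO_ID = re.compile(r'/watch\?v=(.{11})')

def first_video_id(html):
    """Return the first video id found in html, or None if there is none."""
    match = VIDEO_ID.search(html)
    return match.group(1) if match else None

# Made-up fragment standing in for the downloaded page source
sample_html = '{"url":"/watch?v=0755SXCTCN0"},{"url":"/watch?v=0755SXCTCN0"}'
print(first_video_id(sample_html))
print(first_video_id('<html>no results here</html>'))
```

Returning None lets the caller decide what to do when nothing matches, instead of calling the function again in a loop.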

Download entire history of a Wikipedia page

I'd like to download the entire revision history of a single article on Wikipedia, but am running into a roadblock.
It is very easy to download an entire Wikipedia article, or to grab pieces of its history using the Special:Export URL parameters:
curl -d "" 'https://en.wikipedia.org/w/index.php?title=Special:Export&pages=Stack_Overflow&limit=1000&offset=1' -o "StackOverflow.xml"
And of course I can download the entire site including all versions of every article from here, but that's many terabytes and way more data than I need.
Is there a pre-built method for doing this? (Seems like there must be.)

The example above only gets information about the revisions, not the actual contents themselves. Here's a short python script that downloads the full content and metadata history data of a page into individual json files:
import mwclient
import json
import time
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Wikipedia']
# Walk the metadata and the content of each revision in parallel
for i, (info, content) in enumerate(zip(page.revisions(), page.revisions(prop='content'))):
    info['timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%S", info['timestamp'])
    print(i, info['timestamp'])
    # One json file per revision, named after its timestamp
    with open("%s.json" % info['timestamp'], "w") as f:
        json.dump({'info': info, 'content': content}, f, indent=4)

Wandering around aimlessly looking for clues to another question I have myself — my way of saying I know nothing substantial about this topic! — I just came upon this a moment after reading your question: http://mwclient.readthedocs.io/en/latest/reference/page.html. Have a look for the revisions method.
EDIT: I also see http://mwclient.readthedocs.io/en/latest/user/page-ops.html#listing-page-revisions.
Sample code using the mwclient module:
#!/usr/bin/env python3
import logging, mwclient, pickle, os
from mwclient import Site
from mwclient.page import Page
logging.root.setLevel(logging.DEBUG)
logging.debug('getting page...')
env_page = os.getenv("MEDIAWIKI_PAGE")
# Fall back to a default title when MEDIAWIKI_PAGE is not set
page_name = env_page or 'Stack Overflow'
page_name = Page.normalize_title(page_name)
site = Site('en.wikipedia.org') # https by default. change w/`scheme=`
page = site.pages[page_name]
logging.debug('extracting revisions (may take a really long time, depending on the page)...')
revisions = []
for revision in page.revisions():
    revisions.append(revision)
logging.debug('saving to file...')
with open('{}Revisions.mediawiki.pkl'.format(page_name), 'wb+') as f:
    pickle.dump(revisions, f, protocol=0) # protocol allows backwards compatibility between machines

python yield function with callback args

This is the first time I have asked a question here, so please forgive me if I get something wrong.
I have been learning Python for one month, and I am trying to use Scrapy to learn more about spiders.
Here is my question:
def get_chapterurl(self, response):
    item = DingdianItem()
    item['name'] = str(response.meta['name']).replace('\xa0', '')
    yield item
    yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id})
def get_chapter(self, response):
    urls = re.findall(r'<td class="L">(.*?)</td>', response.text)
As you can see, I yield an item and a Request at the same time, but the get_chapter function never reaches its first line (I set a breakpoint there). Where did I go wrong?
Sorry for disturbing you.
I have googled for a while, but found nothing...

Your request gets filtered out.
Scrapy has a built-in request filter that prevents you from downloading the same page twice (an intended feature).
Let's say you are on http://example.com; this request you yield:
yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id})
tries to download http://example.com again. If you look at the crawling log, it should say something along the lines of "ignoring duplicate url http://example.com".
You can always bypass this feature by setting the dont_filter=True parameter in your Request object, like so:
yield Request(url=response.url, callback=self.get_chapter, meta={'name':name_id},
              dont_filter=True)
However! I'm having trouble understanding the intention of your code, but it seems that you don't really want to download the same URL twice.
You don't have to schedule a new request either; you can just call your callback with the response you already have:
response.meta['name'] = name_id  # update meta in place
# why crawl it again, if we can just call the callback directly!
# for python2
for result in self.get_chapter(response):
    yield result
# or if you are running python3:
yield from self.get_chapter(response)
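The delegation idea can be tried outside Scrapy with two plain generators. This is only a sketch: the function names mirror the question's callbacks, but the page dictionary is a hypothetical stand-in for a downloaded response.

```python
def get_chapter(page):
    """Stand-in for the second callback: yields one result per chapter."""
    for chapter in page['chapters']:
        yield {'chapter': chapter}

def get_chapterurl(page):
    """Stand-in for the first callback: yields its own item, then delegates."""
    yield {'name': page['name']}
    # Hand over to the second generator instead of scheduling a new request
    yield from get_chapter(page)

# Hypothetical data standing in for a downloaded response
page = {'name': 'book', 'chapters': ['one', 'two']}
results = list(get_chapterurl(page))
print(results)
```

The consumer sees one flat stream: first the item from get_chapterurl, then everything get_chapter produces.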

How to get all YouTube comments with Python's gdata module?

Looking to grab all the comments from a given video, rather than go one page at a time.
from gdata import youtube as yt
from gdata.youtube import service as yts
client = yts.YouTubeService()
client.ClientLogin(username, pwd) #the pwd might need to be application specific fyi
comments = client.GetYouTubeVideoComments(video_id='the_id')
a_comment = comments.entry[0]
The above code will let you grab a single comment, likely the most recent one, but I'm looking for a way to grab all the comments at once. Is this possible with Python's gdata module?
The Youtube API docs for comments, the comment feed docs and the Python API docs

The following achieves what you asked for using the Python YouTube API:
from gdata.youtube import service
USERNAME = 'username@gmail.com'
PASSWORD = 'a_very_long_password'
VIDEO_ID = 'wf_IIbT8HGk'
def comments_generator(client, video_id):
    comment_feed = client.GetYouTubeVideoCommentFeed(video_id=video_id)
    while comment_feed is not None:
        for comment in comment_feed.entry:
            yield comment
        next_link = comment_feed.GetNextLink()
        if next_link is None:
            comment_feed = None
        else:
            comment_feed = client.GetYouTubeVideoCommentFeed(next_link.href)
client = service.YouTubeService()
client.ClientLogin(USERNAME, PASSWORD)
for comment in comments_generator(client, VIDEO_ID):
    author_name = comment.author[0].name.text
    text = comment.content.text
    print("{}: {}".format(author_name, text))
Unfortunately the API limits the number of entries that can be retrieved to 1000. This was the error I got when I tried a tweaked version with a hand crafted GetYouTubeVideoCommentFeed URL parameter:
gdata.service.RequestError: {'status': 400, 'body': 'You cannot request beyond item 1000.', 'reason': 'Bad Request'}
Note that the same principle should apply to retrieve entries in other feeds of the API.
If you want to hand craft the GetYouTubeVideoCommentFeed URL parameter, its format is:
'https://gdata.youtube.com/feeds/api/videos/{video_id}/comments?start-index={start_index}&max-results={max_results}'
The following restrictions apply: start-index <= 1000 and max-results <= 50.
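As a rough sketch of how those two limits shape a hand-crafted request sequence, the helper below builds one feed URL per page. The name comment_feed_urls is hypothetical, and start-index is assumed to be 1-based. It only formats strings and makes no requests.

```python
# URL template quoted above; the two limits come from the stated restrictions
FEED_URL = ('https://gdata.youtube.com/feeds/api/videos/{video_id}/comments'
            '?start-index={start_index}&max-results={max_results}')
MAX_ITEM = 1000      # the API refuses to go beyond item 1000
MAX_RESULTS = 50     # largest page size allowed

def comment_feed_urls(video_id, max_results=MAX_RESULTS):
    """Yield one feed URL per page, stopping at the 1000-item limit."""
    # Assumption: start-index counts from 1
    for start_index in range(1, MAX_ITEM + 1, max_results):
        yield FEED_URL.format(video_id=video_id,
                              start_index=start_index,
                              max_results=max_results)

urls = list(comment_feed_urls('wf_IIbT8HGk'))
print(len(urls))
print(urls[0])
print(urls[-1])
```

With the default page size this gives 20 URLs, the last one starting at item 951.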

This is the only solution I've got for now. It does not use the API, and it gets slow when there are several thousand comments.
import bs4, re, urllib2
# grab the page source for the video
data = urllib2.urlopen(r'http://www.youtube.com/all_comments?v=video_id') # example: XhFtHW4YB7M
# pull out comments
soup = bs4.BeautifulSoup(data)
cmnts = soup.findAll(attrs={'class': 'comment yt-tile-default'})
# do something with them, i.e. count them
print len(cmnts)
Note that because 'class' is a reserved word in Python, you have to pass it in a dict rather than as a regular parameter, so you can't do the usual 'startswith' searches via regex or lambdas as seen here. It also gets pretty slow because of BeautifulSoup, but that is needed because etree and minidom don't find matching tags for some reason, even after running prettify() with bs4.
