Does web scraping have patterns? - python

I have not done much web scraping in my experience. So far I am using Python with BeautifulSoup4 to scrape the Hacker News page.
I was just wondering if there are patterns I should keep in mind before doing scraping. Right now the code looks very ugly and I feel like a hack.
Code:
import requests
from bs4 import BeautifulSoup
from django.core.management.base import BaseCommand  # this is a Django management command


class Command(BaseCommand):
    page = {}
    td_count = 2
    data_count = 0

    def handle(self, *args, **options):
        # scrape the first three pages of Hacker News
        for i in range(1, 4):
            self.page_no = i
            self.parse()
        print self.page[1]

    def get_result(self):
        return requests.get('https://news.ycombinator.com/news?p=%s' % self.page_no)

    def parse(self):
        soup = BeautifulSoup(self.get_result().text, 'html.parser')
        for x in soup.find_all('table')[2].find_all('tr'):
            self.data_count += 1
            self.page[self.data_count] = {'other_data': None, 'url': ''}
            if self.td_count % 3 == 0:
                try:
                    subtext = x.find_all('td', 'subtext')[0]
                    self.page[self.data_count - 1]['other_data'] = subtext
                except IndexError:
                    pass
            title = x.find_all('td', 'title')
            if title:
                try:
                    self.page[self.data_count]['url'] = title[1].a
                    print title[1].a
                except IndexError:
                    print 'Done page %s' % self.page_no
            self.td_count += 1

Actually, I treat scrapable data as part of my domain (business) data, which allows me to use Domain-Driven Design to structure the problem:
Entities and Value Objects
I use entities and value objects to store the extracted information in my programming language's data structures, so I can work with them cleanly.
Repository Pattern
I use the repository pattern to delegate the job of gathering data to a separate class. The repository class is given a site, fetches the data, and pre-builds the entities if needed.
Transformer/Presenter Pattern
After fetching the data from the repository, I pass the HTML data to a presenter class. The presenter class has the duty of creating my business entities/value objects from the given HTML string.
Service Layer
If there is more processing than described above, I make a service class that wraps the whole problem: it calls the repository, hands the fetched data to the presenter, the presenter builds the entities, and the result can then be used by another service, for example to store it in a SQL database.
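To make this concrete, here is a rough Python sketch (not from the original answer) of how those pieces could be wired together for the Hacker News example above; the class names HackerNewsRepository, StoryPresenter and ScrapeStoriesService, and the CSS selector, are purely illustrative.
from collections import namedtuple

import requests
from bs4 import BeautifulSoup

# Entity / value object: just holds the extracted data.
Story = namedtuple('Story', ['title', 'url'])


class HackerNewsRepository(object):
    """Repository: knows how to fetch the raw HTML for a given page number."""

    def fetch(self, page_no):
        response = requests.get('https://news.ycombinator.com/news?p=%s' % page_no)
        response.raise_for_status()
        return response.text


class StoryPresenter(object):
    """Presenter: turns a raw HTML string into Story entities."""

    def present(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        stories = []
        for link in soup.select('td.title a'):  # selector is an assumption about the markup
            if link.get('href'):
                stories.append(Story(title=link.get_text(), url=link['href']))
        return stories


class ScrapeStoriesService(object):
    """Service layer: coordinates the repository and the presenter."""

    def __init__(self, repository, presenter):
        self.repository = repository
        self.presenter = presenter

    def run(self, pages):
        stories = []
        for page_no in pages:
            html = self.repository.fetch(page_no)
            stories.extend(self.presenter.present(html))
        return stories


# Usage:
# service = ScrapeStoriesService(HackerNewsRepository(), StoryPresenter())
# for story in service.run(range(1, 4)):
#     print(story.title, story.url)
The point of the split is that the fetching, the parsing, and the coordination can each be tested and swapped out independently.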
If you are familiar with PHP, I have programmed a small app in Laravel which fetches the Alexa rank of a given website every 15 minutes and notifies the subscribers of that website by email.
Github repository: Alexa Watcher
Folder of repository classes
Command line application layer class which calls the service
The service class which is also a presenter that builds the needed entities
The service class which pushes detected changes to subscriber emails

Related

Fetch data from API inside Scrapy

I am working on a project that is divided into two parts:
Retrieve a specific page
Once the ID of this page is extracted, send requests to an API to obtain additional information about this page
For the second point, and to follow Scrapy's asynchronous philosophy, where should such code be placed? (I am hesitating between the spider and a pipeline.)
Do we have to use different libraries like asyncio & aiohttp to be able to achieve this goal asynchronously? (I love aiohttp, so using it is not a problem.)
Thank you
Since you're doing this to fetch additional information about an item, I'd just yield a request from the parsing method, passing the already scraped information in the meta attribute.
You can see an example of this at https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
This can also be done in a pipeline (either using scrapy's engine API, or a different library, e.g. treq).
I do however think that doing it "the normal way" from the spider makes more sense in this instance.
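A minimal sketch of that meta-passing approach; the selectors, item fields and the api.example.com endpoint are placeholders, not from the question:
import json

import scrapy


class ThingSpider(scrapy.Spider):
    name = 'things'
    start_urls = ['https://example.com/things']  # placeholder start URL

    def parse(self, response):
        for row in response.css('div.thing'):  # hypothetical selector
            item = {
                'id': row.css('::attr(data-id)').get(),
                'title': row.css('a::text').get(),
            }
            # Hand the partially-scraped item to the next callback via meta.
            yield scrapy.Request(
                'https://api.example.com/things/%s' % item['id'],
                callback=self.parse_api,
                meta={'item': item},
            )

    def parse_api(self, response):
        item = response.meta['item']
        item['extra'] = json.loads(response.text)  # enrich the item with API data
        yield item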
I recently had the same problem (again) and found an elegant solution using Twisted's twisted.internet.defer.inlineCallbacks decorator.
# -*- coding: utf-8 -*-
import scrapy
import re

from twisted.internet.defer import inlineCallbacks

from sherlock import utils, items, regex


class PagesSpider(scrapy.spiders.SitemapSpider):
    name = 'pages'
    allowed_domains = ['thing.com']
    sitemap_follow = [r'sitemap_page']

    def __init__(self, site=None, *args, **kwargs):
        super(PagesSpider, self).__init__(*args, **kwargs)

    @inlineCallbacks
    def parse(self, response):
        # things
        request = scrapy.Request("https://google.com")
        response = yield self.crawler.engine.download(request, self)
        # Twisted executes the request and resumes the generator here with the response
        print(response.text)

How to print available tags while using Robot Framework

If I have a large test suite in Robot Framework with a lot of tags, is it possible to get a list of the tag names available within the suite?
Something like pybot --listtags?
It would be useful for the person who is actually going to run the tests.
For example, in a scenario related to publishing news articles, the test cases may be tagged as "publish", "published", or "publishing".
The tester is not going to have RIDE at his/her disposal, and hence may not know the exact tag names.
Under these circumstances I thought it would be useful to extract the available tags and display them, without running any tests. The tester can then choose to run the tests with the desired tag.
I searched the robot framework user guide and didn't see any command line options that do this.
There is nothing provided by robot to give you this information. However, it's pretty easy to write a python script that uses the robot parser to get all of the tag information. Here's a quick hack that I think is correct (though I only tested it very briefly):
from robot.parsing import TestData
import sys


def main(path):
    suite = TestData(parent=None, source=path)
    tags = get_tags(suite)
    print(", ".join(sorted(set(tags))))


def get_tags(suite):
    tags = []
    if suite.setting_table.force_tags:
        tags.extend(suite.setting_table.force_tags.value)
    if suite.setting_table.default_tags:
        tags.extend(suite.setting_table.default_tags.value)
    for testcase in suite.testcase_table.tests:
        if testcase.tags:
            tags.extend(testcase.tags.value)
    for child_suite in suite.children:
        tags.extend(get_tags(child_suite))
    return tags


if __name__ == "__main__":
    main(sys.argv[1])
Note that this will not get any tags created by the Set Tags keyword, nor does it take into account tags removed by Remove Tags.
Save the code to a file, e.g. get_tags.py, and run it like this:
$ python /tmp/get_tags.py /tmp/tests/
a tag, another force tag, another tag, default tag, force tag, tag-1, tag-2
I have used a Robot Framework output file listener to list all tags of the current suite.
"""Listener that prints the tags of the executed suite."""
from lxml import etree as XML

ROBOT_LISTENER_API_VERSION = 3

tags_xpath = ".//tags/tag"


def output_file(path):
    root = XML.parse(path).getroot()
    tag_elements = root.xpath(tags_xpath)
    tags = set()
    for element in tag_elements:
        tags.add(element.text)
    print("\nExisting tags: " + str(tags) + "\n")
You can use such a listener along with dry run mode to quickly get the tag data of a suite.
robot --listener get_tags.py --dryrun ./tests
The tags will be listed in the output file section of the console log.
==============================================================================
Existing tags: {'Tag1', 'a', 'Tag3.5', 'Feature1', 'b', 'Tag3', 'Feature2'}
Output: D:\robot_framework\output.xml
Log: D:\robot_framework\log.html
Report: D:\robot_framework\report.html
In Robot Framework 3.2 the parsing API is different; the following is what I did:
from robot.api import get_model
import ast


def _check_tags(self, file):

    class TestTagPrint(ast.NodeVisitor):

        def visit_File(self, node):
            print(f"{node.source}")
            # to get suite-level force tags
            for section in node.sections:
                for sect in section.body:
                    try:
                        if sect.type == 'FORCE_TAGS':
                            print(sect.values)
                    except AttributeError:
                        pass
            self.generic_visit(node)

        def visit_TestCase(self, node):
            # to get tags at test case level
            for statement in node.body:
                if statement.type == "TAGS":
                    print(statement.values)

    model = get_model(file)
    printer = TestTagPrint()
    printer.visit(model)
Both print statements print a tuple of the tag values.
There might be a simpler or better way in this API.
And the API is different again in Robot Framework 4.
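For Robot Framework 4+, here is a hedged sketch of the same idea using the robot.api.parsing module and its ModelVisitor (assuming RF >= 4.0; the visitor method names mirror the ForceTags, DefaultTags and Tags statement classes, and the tests/example.robot path is a placeholder):
from robot.api.parsing import get_model, ModelVisitor


class TagCollector(ModelVisitor):
    """Collects statically declared tags from a parsed .robot file."""

    def __init__(self):
        self.tags = set()

    def visit_ForceTags(self, node):
        # Force Tags in the *** Settings *** section
        self.tags.update(node.values)

    def visit_DefaultTags(self, node):
        # Default Tags in the *** Settings *** section
        self.tags.update(node.values)

    def visit_Tags(self, node):
        # [Tags] settings inside individual test cases
        self.tags.update(node.values)


collector = TagCollector()
collector.visit(get_model("tests/example.robot"))
print(sorted(collector.tags))
As with the other approaches, this only sees tags written statically in the source files, not tags added or removed at run time by Set Tags or Remove Tags.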

Should I do a URL fetch or call the class method? Which would be the best option?

Which would be a better way to get contents from two different request handlers?
This is how my app structure looks:
# /twitter/<query>
class TwitterSearch(webapp2.RequestHandler):
    def get(self, query):
        # get data from Twitter
        json_data = data_from_twiiter()
        return json_data


# /google/<query>
class GoogleSearch(webapp2.RequestHandler):
    def get(self, query):
        # get data from Google
        json_data = data_from_google()
        return json_data
Now I can access Twitter search data and Google search data separately by calling their respective URLs.
I also need to combine both of these search results and offer them to the user. What would be my best approach to do this?
Should I call the get method of the respective classes like this?
# /search/<query>
# Combined search results from Google and Twitter
class Search(webapp2.RequestHandler):
    def get(self, query):
        t = TwitterSearch()
        twitterSearch = t.get(self, query)
        g = GoogleSearch()
        googlesearch = g.get(self, query)
Or fetch the data from the URL using urllib or something like this?
# /search/<query>
# Combined search results from Google and Twitter
class Search(webapp2.RequestHandler):
    def get(self, query):
        t = get_data_from_URL('/twitter/' + query)
        g = get_data_from_URL('/google/' + query)
Or is there some other way to handle this situation?
You shouldn't make HTTP calls to your own application; that introduces a completely unnecessary level of overhead.
I would do this by extracting the query code into a separate function and calling it from both request handlers.
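A minimal sketch of that refactoring; the search_twitter/search_google helpers stand in for the existing data_from_* functions, and JSON output is assumed:
import json

import webapp2


def search_twitter(query):
    # existing Twitter-fetching code goes here
    return {'source': 'twitter', 'query': query}


def search_google(query):
    # existing Google-fetching code goes here
    return {'source': 'google', 'query': query}


class TwitterSearch(webapp2.RequestHandler):
    def get(self, query):
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(json.dumps(search_twitter(query)))


class GoogleSearch(webapp2.RequestHandler):
    def get(self, query):
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(json.dumps(search_google(query)))


class Search(webapp2.RequestHandler):
    def get(self, query):
        # Call the shared functions directly; no HTTP round trip needed.
        combined = {
            'twitter': search_twitter(query),
            'google': search_google(query),
        }
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(json.dumps(combined))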

Getting Objects with urllib2

I have two GAE apps working in conjunction. One holds an object in a database, the other gets that object from the first app. Below I have the bit of code where the first app is asked for and returns the Critter object. I am trying to access the first app's object via urllib2; is this really possible? I know it can be used for JSON, but can it be used for objects?
Just for some context, I am developing this as a project for a class. The students will learn how to host a GAE app by creating their critters. Then they will give me the URLs for their critters, and my app will use the URLs to collect all of their critters and put them into my app's world.
I've only recently heard about pickle and have not looked into it yet; might that be a better alternative?
critter.py:
class Access(webapp2.RequestHandler):
    def get(self):
        creature = CritStore.all().order('-date').get()
        if creature:
            stats = loads(creature.stats)
            return SampleCritter(stats)
        else:
            return SampleCritter()
map.py:
class Out(webapp2.RequestHandler):
    def post(self):
        url = self.request.POST['url']  # from a simple html textbox
        critter = urllib2.urlopen(url)
        # ...work with critter as if it were the critter object...
Yes, you can use pickle.
Here is some sample code to transfer an entity, including the key:
import base64
import pickle

entity_dict = entity.to_dict()  # first create a dict of the NDB entity
entity_dict['entity_ndb_key_safe'] = entity.key.urlsafe()  # add the key of the entity to the dict
pickled_data = pickle.dumps(entity_dict, 1)  # serialize the object
encoded_data = base64.b64encode(pickled_data)  # encode it for safe transfer
As an alternative to urllib2 you can use the GAE urlfetch.fetch().
In the requesting app you can do:
entity_dict = pickle.loads(base64.b64decode(encoded_data))
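For a fuller picture, here is a hedged sketch of both sides of that exchange; the handler bodies and the choice of fields to ship are illustrative, and CritStore/SampleCritter are the names from the question's code:
import base64
import pickle

import webapp2
from google.appengine.api import urlfetch


# Serving app (critter.py): expose the pickled, base64-encoded data.
class Access(webapp2.RequestHandler):
    def get(self):
        creature = CritStore.all().order('-date').get()  # model from the question
        payload = {'stats': creature.stats if creature else None}  # pick the fields you need
        self.response.write(base64.b64encode(pickle.dumps(payload, 1)))


# Requesting app (map.py): fetch the payload and decode it back into a dict.
class Out(webapp2.RequestHandler):
    def post(self):
        url = self.request.POST['url']  # from the html textbox
        result = urlfetch.fetch(url)
        critter_dict = pickle.loads(base64.b64decode(result.content))
        # e.g. build SampleCritter(critter_dict['stats']) from the dict here
Keep in mind that unpickling data fetched from arbitrary student URLs executes whatever the pickle contains, so for untrusted sources a JSON payload is the safer transport.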

Download a Google Sites page Content Feed using gdata-python-client

My final goal is to import some data from Google Sites pages.
I'm trying to use gdata-python-client (v2.0.17) to download a specific Content Feed:
self.client = gdata.sites.client.SitesClient(source=SOURCE_APP_NAME)
self.client.client_login(USERNAME, PASSWORD, source=SOURCE_APP_NAME, service=self.client.auth_service)
self.client.site = SITE
self.client.domain = DOMAIN
uri = '%s?path=%s' % (self.client.MakeContentFeedUri(), '[PAGE PATH]')
feed = self.client.GetContentFeed(uri=uri)
entry = feed.entry[0]
...
The resulting entry.content has the page content in XHTML format, but the tree doesn't contain any plain-text data from the page, only the HTML page structure and links.
For example, my test page has
<div>Some text</div>
but the ContentFeed entry has only the div node with text=None.
I have debugged the gdata-python-client request/response and checked the raw data returned by the server: there is no plain-text data in the content. Hence it looks like a Google API bug.
Maybe there is some workaround? Maybe I can use some common request parameter? What's going wrong here?
This code works for me against a Google Apps domain and gdata 2.0.17:
import atom.data
import gdata.sites.client
import gdata.sites.data
client = gdata.sites.client.SitesClient(source='yourCo-yourAppName-v1', site='examplesite', domain='example.com')
client.ClientLogin('admin@example.com', 'examplepassword', client.source)
uri = '%s?path=%s' % (client.MakeContentFeedUri(), '/home')
feed = client.GetContentFeed(uri=uri)
entry = feed.entry[0]
print entry
Granted, it's pretty much identical to yours, but it might help you prove or disprove something. Good luck!
