Making a pythonic decision on developing a web scraper module - python

This is a fairly high level question.
I have developed many different web scrapers that work in different websites.
I have many different versions of the functions named getName() and getAddress().
Is it pythonic (or at least not terrible coding practice) to do this inside a function of a module? If it is bad practice, could someone give me a high-level tip on how to manage this kind of library of scrapers?
def universalNameAdressGrab(url):
    page = pullPage(url)
    if 'Tucson.com' in url:
        import tucsonScraper
        name = getName(page)  # this is the getName for Tucson
        address = getAddress(page)
    elif 'NewYork.com' in url:
        import newyorkScraper
        name = getName(page)  # this is the getName for NewYork
        address = getAddress(page)
    return {'name': name, 'address': address}

It is probably more pythonic to import everything at the top of the file. After that you can reference the functions by module and remove a lot of duplicated code. You may run into issues with URL capitalization, so I would standardize that as well. You could use urlparse for that. I would consider something like the following more pythonic:
import tucsonScraper
import newyorkScraper

def universalNameAdressGrab(url):
    page = pullPage(url)
    scraper = None
    if 'Tucson.com' in url:
        scraper = tucsonScraper
    elif 'NewYork.com' in url:
        scraper = newyorkScraper
    else:
        raise Exception("No scraper found for url")
    return {'name': scraper.getName(page), 'address': scraper.getAddress(page)}
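If the list of sites keeps growing, the urlparse suggestion above can be combined with a dictionary that maps hostnames to scraper modules, so adding a site is one entry instead of another elif. A rough sketch (pullPage and the scraper modules come from the question; the SCRAPERS dict is mine):

from urllib.parse import urlparse  # Python 3; on Python 2 this lives in the urlparse module

import tucsonScraper
import newyorkScraper

# hostname -> scraper module; hostnames are compared in lowercase
SCRAPERS = {
    'tucson.com': tucsonScraper,
    'newyork.com': newyorkScraper,
}

def universalNameAdressGrab(url):
    page = pullPage(url)  # pullPage is the helper from the question
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    try:
        scraper = SCRAPERS[host]
    except KeyError:
        raise ValueError('No scraper found for url: %s' % url)
    return {'name': scraper.getName(page), 'address': scraper.getAddress(page)}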

Related

Transform a multiple line url request into a function in Python

I am trying to download a series of text files from different websites, using urllib.request with Python. I want to expand the list of URLs without making the code long.
The working sequence is
import urllib.request
url01 = 'https://web.site.com/this.txt'
url02 = 'https://web.site.com/kind.txt'
url03 = 'https://web.site.com/of.txt'
url04 = 'https://web.site.com/link.txt'
[...]
urllib.request.urlretrieve(url01, "Liste n°01.txt")
urllib.request.urlretrieve(url02, "Liste n°02.txt")
urllib.request.urlretrieve(url03, "Liste n°03.txt")
[...]
The number of files to download is increasing, and I want to keep the second part of the code short.
I tried
i = 0
while i<51
    i = i +1
    urllib.request.urlretrieve( i , "Liste n°0+"i"+.txt")
It doesn't work, and I am thinking that a while loop can be used for strings but not for requests.
So I was thinking of making it a function.
def newfunction(i)
    return urllib.request.urlretrieve(url"i", "Liste n°0"+1+".txt")
But it seems that I am missing a big chunk of it.
The single request works, but it seems I cannot adapt it to a long list of URLs.
As a general suggestion, I'd recommend the requests module for Python, rather than urllib.
Based on that, some naive code for a possible function:
import requests

def get_file(site, filename):
    target = site + "/" + filename
    try:
        r = requests.get(target, allow_redirects=True)
        open(filename, 'wb').write(r.content)
        return r.status_code
    except requests.exceptions.RequestException as e:
        print("File not downloaded, error: {}".format(e))
You can then call the function, passing in parameters of site and file name:
get_file('https://web.site.com', 'this.txt')
If it cannot download a file, the function will report the exception but not stop execution. You could expand the exception handling to deal with files not being writable, but this should be a start.
It seems as if you're not casting the variable i to a string before concatenating it to the URL string. That may be the reason why your code isn't working. The while-loop/for-loop approach shouldn't affect whether or not the requests get sent out. I recommend using the requests module for making requests as well. Mike's post covers roughly what the function should look like. I also recommend creating a Session object if you're going to be making a lot of requests in a piece of code. The Session object will keep the underlying TCP connection open while you make your requests, which should reduce latency, CPU usage, and network congestion (https://en.wikipedia.org/wiki/HTTP_persistent_connection#Advantages). The code would look something like this:
import requests

with requests.Session() as s:
    for i in range(10):
        s.get(str(i)+'.com')  # make request
        # write to file here
To cast to a string you would want something like this:
i = 0
while i < 51:
    i = i + 1
    # note: the first argument still needs to be an actual URL string, not i itself
    urllib.request.urlretrieve( i , "Liste n°0" + str(i) + ".txt")
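Putting the pieces together, a minimal sketch of the original goal (downloading a numbered series of files) might look like the following; the file list mirrors the URLs from the question, and the Session reuse follows the advice above:

import requests

# list of file names to fetch; in practice, build it from the real links
base = 'https://web.site.com'
files = ['this.txt', 'kind.txt', 'of.txt', 'link.txt']

with requests.Session() as s:  # reuse one TCP connection for all downloads
    for i, name in enumerate(files, start=1):
        r = s.get('{}/{}'.format(base, name), allow_redirects=True)
        if r.status_code == 200:
            with open('Liste n°{:02d}.txt'.format(i), 'wb') as f:
                f.write(r.content)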

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Fairly new to Python, I learn by doing, so I thought I'd give this project a shot. I am trying to create a script which finds the Google Analytics request for a certain website, parses the request payload, and does something with it.
Here are the requirements:
Ask the user for 2 URLs (for comparing the payloads from 2 different HARs)
Use selenium to open the two URLs, and use browsermobproxy/phantomJS to get the full HAR
Store the HAR as a list
From the list of all HAR entries, find the Google Analytics request, including the payload
If a Google Analytics tag is found, then do things, like parse the payload, compare the payloads, etc.
Issue: Sometimes, for a website that I know has Google Analytics (e.g. nytimes.com), the HAR that I get is incomplete, i.e. my program will say "GA Not found", but that's only because the complete HAR was not captured, so when the regex ran to find the matching entry it wasn't there. This issue is intermittent and does not happen all the time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on and the complete HAR didn't get captured. I tried "wait for traffic to stop", but maybe I didn't do it right.
Also, as a bonus, I would appreciate any help you can provide on how to make this script run faster; it's fairly slow. As I mentioned, I'm new to Python, so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime

def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any']  # so that I can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)

urlLists = []
collectTags = []
gaCollect = 0
varList = []

for x in range(0, 2):  # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)
    if not url:
        print "You need to type something in...here"
        sys.exit()

# gets the two user urls and stores them in a list
for urlList in urlLists:
    print urlList, 'start 2nd loop'  # printing for debug purposes, no need for this
    if not urlList:
        print 'Your Url list is empty'
        sys.exit()
    proxy.new_har()
    driver.get(urlList)
    # proxy.wait_for_traffic_to_stop(15, 30)  # <-- tried this but it did not do anything
    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])
        print gaCollect
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
        else:
            print 'No GA Found - Ending Prog.'
            cleanup()
            sys.exit()

cleanup()
This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- this causes your programme to stop after the first iteration through the ent list, so if what you want is not the first thing, it won't be found
2 - call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.
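Applied to the inner loop of the question's script, those two changes might look roughly like this (a sketch reusing the question's variable names; the found_ga flag is mine):

# capture request/response bodies so the GA payload actually appears in the HAR
proxy.new_har(options={'captureHeaders': True, 'captureContent': True})
driver.get(urlList)

found_ga = False
for ent in proxy.har['log']['entries']:
    request_url = ent['request']['url']
    if re.search(r'google-analytics\.com/r\b', request_url):
        collectTags.append(request_url)
        found_ga = True
        break

# decide only after scanning every entry, instead of exiting inside the loop
if not found_ga:
    print('No GA found for %s' % urlList)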

Does web scraping have patterns?

I have not done much web scraping in my experience. So far I am using Python with BeautifulSoup4 to scrape the Hacker News page.
I was just wondering if there are patterns I should keep in mind before doing scraping. Right now the code looks very ugly and I feel like a hack.
Code:
import requests
from bs4 import BeautifulSoup
# BaseCommand suggests this is a Django management command (import not shown in the original)
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    page = {}
    td_count = 2
    data_count = 0

    def handle(self, *args, **options):
        for i in range(1, 4):
            self.page_no = i
            self.parse()
        print self.page[1]

    def get_result(self):
        return requests.get('https://news.ycombinator.com/news?p=%s' % self.page_no)

    def parse(self):
        soup = BeautifulSoup(self.get_result().text, 'html.parser')
        for x in soup.find_all('table')[2].find_all('tr'):
            self.data_count += 1
            self.page[self.data_count] = {'other_data': None, 'url': ''}
            if self.td_count % 3 == 0:
                try:
                    subtext = x.find_all('td', 'subtext')[0]
                    self.page[self.data_count - 1]['other_data'] = subtext
                except IndexError:
                    pass
            title = x.find_all('td', 'title')
            if title:
                try:
                    self.page[self.data_count]['url'] = title[1].a
                    print title[1].a
                except IndexError:
                    print 'Done page %s' % self.page_no
            self.td_count += 1
Actually, I treat scrapable data as part of my domain (business) data, which allows me to use Domain-Driven Design to structure the problem:
Entities and Value Objects
I use entities and value objects to store the extracted information in my programming language's data structures, so I can work with it cleanly.
Repository Pattern
I use the repository pattern to delegate the job of gathering data to a separate class. The repository class is given a site, fetches the data, and pre-builds the entities if needed.
Transformer/Presenter pattern
After fetching the data from the repository, I pass the HTML data to a presenter class. The presenter class has the duty of creating my business entities/value objects from the given HTML string.
Service Layer
If there is more to the process than what is described above, I make a service class which wraps the whole problem: it calls the repository, gives the fetched data to the presenter, the presenter builds the entities, and the result may then be used by another service, for example to be stored in a SQL database. A small sketch of this layering follows below.
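Translated to the Hacker News example in the question, a minimal Python sketch of that layering might look like this (the class and field names are illustrative, not from the answer's Laravel app):

import requests
from bs4 import BeautifulSoup
from collections import namedtuple

# value object: one scraped story (illustrative fields)
Story = namedtuple('Story', ['title', 'url'])

class HackerNewsRepository(object):
    """Knows how to fetch the raw HTML for a given page number."""
    def fetch_page(self, page_no):
        response = requests.get('https://news.ycombinator.com/news?p=%s' % page_no)
        response.raise_for_status()
        return response.text

class StoryPresenter(object):
    """Builds Story value objects from a raw HTML string."""
    def build(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        stories = []
        for cell in soup.find_all('td', 'title'):
            link = cell.find('a')
            if link and link.get('href'):
                stories.append(Story(title=link.get_text(), url=link['href']))
        return stories

class ScrapeService(object):
    """Wires repository and presenter together; callers only see Story objects."""
    def __init__(self, repository, presenter):
        self.repository = repository
        self.presenter = presenter

    def stories_for_page(self, page_no):
        html = self.repository.fetch_page(page_no)
        return self.presenter.build(html)

# usage
service = ScrapeService(HackerNewsRepository(), StoryPresenter())
for story in service.stories_for_page(1):
    print(story.title, story.url)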
If you are familiar with PHP, I have programmed a small app in Laravel which fetches the Alexa rank of a given website every 15 minutes and notifies the subscribers of that website by email.
Github repository : Alexa Watcher
Folder of Repository classes
Command line application layer class which calls the service
The Service class which is also a presenter that builds needed entities.
The Service class which pushes detected changes to subscriber emails.

How do I handle partially initialized classes

My question concerns a class that I am writing that may or may not be fully initialized. The basic goal is to take a match_id and open the corresponding match_url (example: http://dota2lounge.com/match?m=1899) and then grab some properties out of the webpage. The problem is some match_ids will result in 404 pages (http://dota2lounge.com/404).
When this happens, there won't be a way to determine the winner of the match, so the rest of the Match can't be initialized. I have seen this causing problems with methods of the Match, so I added the lines that initialize everything to None if self._valid_match_url is False. This works in principle, but then I'm adding a line each time a new attribute is added, and it seems prone to errors down the pipeline (in methods, etc.). It also doesn't alert the user that this class wasn't properly initialized; they would need to call .is_valid_match() to determine that.
tl;dr: What is the best way to handle classes that may be only partially initialized? Since this is a hobby project and I'm looking to learn, I'm open to pretty much any solution (trying new things), including other classes or whatever. Thanks.
This is an abbreviated version of the code containing the relevant portions (Python 3.3):
from urllib.request import urlopen
from bs4 import BeautifulSoup

class Match(object):
    def __init__(self, match_id):
        self.match_id = match_id
        self.match_url = self.__determine_match_url__()
        self._soup = self.__get_match_soup__()
        self._valid_match_url = self.__determine_match_404__()
        if self._valid_match_url:
            self.teams, self.winner = self.__get_teams_and_winner__()
        # These lines were added, but I'm not sure if this is correct.
        else:
            self.teams, self.winner = None, None

    def __determine_match_url__(self):
        return 'http://dota2lounge.com/match?m=' + str(self.match_id)

    def __get_match_soup__(self):
        return BeautifulSoup(urlopen(self.match_url))

    def __get_match_details__(self):
        return self._soup.find('section', {'class': 'box'})

    def __determine_match_404__(self):
        try:
            if self._soup.find('h1').text == '404':
                return False
        except AttributeError:
            return True

    def __get_teams_and_winner__(self):
        teams = [team.getText() for team in
                 self._soup.find('section', {'class': 'box'}).findAll('b')]
        winner = False
        for number, team in enumerate(teams):
            if ' (win)' in team:
                teams[number] = teams[number].replace(' (win)', '')
                winner = teams[number]
        return teams, winner

    def is_valid_match(self):
        return all([self._valid_match_url, self.winner])
I would raise an exception, handle that in your creation code (wherever you call some_match = Match(match_id)), and probably don't add it to whatever list you may or may not be using...
For a better answer, you might want to include in your question the code that instantiates all your Match objects.
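A minimal sketch of that approach, built on the Match class from the question (the exception name InvalidMatchError and the sample ids are mine):

class InvalidMatchError(Exception):
    """Raised when a match_id leads to the 404 page."""

class Match(object):
    def __init__(self, match_id):
        self.match_id = match_id
        self.match_url = self.__determine_match_url__()
        self._soup = self.__get_match_soup__()
        if not self.__determine_match_404__():
            # refuse to build a half-initialized object
            raise InvalidMatchError('no match found for id {}'.format(match_id))
        self.teams, self.winner = self.__get_teams_and_winner__()

    # ... the remaining methods are unchanged from the question ...

# creation code: handle the exception wherever Match objects are built
matches = []
for match_id in (1899, 1900, 1901):  # illustrative ids
    try:
        matches.append(Match(match_id))
    except InvalidMatchError:
        pass  # e.g. log the bad id and skip it

# every Match in `matches` is now fully initialized; no is_valid_match() check is needed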

Python script for "Google search by image"

I have checked the Google Search APIs, and it seems they have not released any API for searching by image. So, I was wondering if there exists a Python script/library through which I can automate the "search by image" feature.
This was annoying enough to figure out that I thought I'd throw a comment on the first python-related stackoverflow result for "script google image search". The most annoying part of all this is setting up your proper application and custom search engine (CSE) in Google's web UI, but once you have your api key and CSE, define them in your environment and do something like:
#!/usr/bin/env python
# save top 10 google image search results to current directory
# https://developers.google.com/custom-search/json-api/v1/using_rest
import requests
import os
import sys
import re
import shutil

url = 'https://www.googleapis.com/customsearch/v1?key={}&cx={}&searchType=image&q={}'
apiKey = os.environ['GOOGLE_IMAGE_APIKEY']
cx = os.environ['GOOGLE_CSE_ID']
q = sys.argv[1]

i = 1
for result in requests.get(url.format(apiKey, cx, q)).json()['items']:
    link = result['link']
    image = requests.get(link, stream=True)
    if image.status_code == 200:
        m = re.search(r'[^\.]+$', link)
        filename = './{}-{}.{}'.format(q, i, m.group())
        with open(filename, 'wb') as f:
            image.raw.decode_content = True
            shutil.copyfileobj(image.raw, f)
        i += 1
There is no API available, but you can parse the page and imitate the browser. I don't know how much data you need to parse, because Google may limit or block access.
You can imitate the browser by simply using urllib and setting the correct headers, but if you think parsing complex web pages may be difficult from Python, you can use a headless browser like PhantomJS directly; inside a browser it is trivial to get the correct elements using JavaScript/DOM.
Note: before trying all this, check Google's TOS.
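For the urllib route, the main trick is sending a browser-like User-Agent header; a minimal sketch (the endpoint and the User-Agent string are illustrative, and Google may still block or rate-limit such requests):

import urllib.request

# pretend to be a regular browser so the page is served normally (illustrative UA string)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

req = urllib.request.Request(
    'https://www.google.com/searchbyimage?image_url=https://example.com/cat.jpg',
    headers=headers)
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode('utf-8', errors='replace')
# `html` can then be parsed, or the same request driven through a headless browser instead
print(len(html))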
You can try this:
https://developers.google.com/image-search/v1/jsondevguide#json_snippets_python
It's deprecated, but seems to work.
