I'm trying to solve an exercise: I have to parse a JSON page and search for an object. If the object is not found, I have to search the next page for it. If the person I'm looking for is on the first page I pass the test, but I fail if they're on a later page.
I checked and each page is parsed correctly, but the return value is always None if the person isn't on the first page.
This is my code:
import urllib.request
import json

class Solution:
    def __new__(self, character):
        url = 'https://challenges.hackajob.co/swapi/api/people/'
        numberOfFilms = 0
        #
        # Some work here; return type and arguments should be according to the problem's requirements
        #
        numberOfFilms = self.search(self, character, url)
        return numberOfFilms

    def search(self, character, url):
        numberOfFilms = 0
        found = False
        with urllib.request.urlopen(url) as response:
            data = response.read()
            jsonData = json.loads(data.decode('utf-8'))
            for r in jsonData['results']:
                if r['name'] == character:
                    return len(r['films'])
            if (jsonData['next']):
                nextPage = jsonData['next']
                self.search(self, character, nextPage)
Change the last line to return self.search(self, character, nextPage). Without the return, the result of the recursive call is discarded and search falls off the end, so the caller gets None whenever the match is on a later page.
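For reference, a minimal sketch of the corrected method, keeping the names from the question (the only changes are returning the recursive call and explicitly returning 0 when the character is never found):

def search(self, character, url):
    with urllib.request.urlopen(url) as response:
        jsonData = json.loads(response.read().decode('utf-8'))
    # Look for the character on the current page
    for r in jsonData['results']:
        if r['name'] == character:
            return len(r['films'])
    # Not on this page: recurse into the next page and return its result
    if jsonData['next']:
        return self.search(self, character, jsonData['next'])
    # Character was not found on any page
    return 0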
I am pretty new to programming and I am learning Python for my social engineering project, so I'm sorry in advance if this makes you hit your own forehead.
I was following a tutorial to scrape certain information from an Instagram page. Let's say, for example, I wanted to extract info from www.instagram.com/nbamemes.
I am getting an error on line 12, "IndentationError: expected an indented block". I have googled that, but I just don't understand the code. Where are the placeholders where I need to fill in my own info?
import requests
import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
import ssl
import json
class insta_Scraper_v1:
def getinfo(self, url):
html = urllib.request.urlopen('www.instagram.com/nbamemes', context=self.ctx).read()
soup = BeautifulSoup(html, 'html.parser')
data = soup.find_all('meta', attr={'property': 'og:description'})
text = data[0]
user = '%s %s %s' % (text[-3], text[-2], text[-1])
followers = text[0]
following = text[2]
posts = text[4]
print('User:', user)
print('Followers:', followers)
print('Following:', following)
print('Posts:', posts)
print('-----------------------')
def mail(self):
self.ctx = ssl.create_default_context()
self.ctx.check_hostname = False
self.ctx.verify_mode = ssl.CERT_NONE
with open('123.txt') as f:
self.content = f.readlines()
self.content = [x.strip() for x in self.content]
for url in self.content:
self.getinfo(url)
if __name__ == '__main__'
obj = insta_Scraper_v1()
obj.mail()
I used a tutorial to write this, but I can't get the whole thing right. It's not completely beginner friendly and I seem to need help. Again, sorry for this super beginner question.
Best regards,
lev
In the future, it would be useful to share the error message produced by your code. It includes the line at which the error has occurred.
Based on the code you provided, I can see that you did not indent the code inside your functions. After a function declaration (def), you need to indent all the code that belongs to it.
So from:
def getinfo (self, url):
html = urllib.request.urlopen('www.instagram.com/nbamemes', context=self.ctx).read()
soup = BeautifulSoup(html, 'html.parser')
data = soup.find_all ('meta', attr={'property': 'og:description'})
To:
def getinfo (self, url):
    html = urllib.request.urlopen('www.instagram.com/nbamemes', context=self.ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    data = soup.find_all ('meta', attr={'property': 'og:description'})
Indentation is the block separator in Python. Whenever you use a conditional, a loop, def, or class, you create a block, and everything inside that block has to be indented consistently. A tab or four spaces per level is commonly used; what matters is consistency. Below is the indented code.
import requests
import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
import ssl
import json

class insta_Scraper_v1:
    def getinfo(self, url):
        html = urllib.request.urlopen('www.instagram.com/nbamemes', context=self.ctx).read()
        soup = BeautifulSoup(html, 'html.parser')
        data = soup.find_all('meta', attr={'property': 'og:description'})
        text = data[0]
        user = '%s %s %s' % (text[-3], text[-2], text[-1])
        followers = text[0]
        following = text[2]
        posts = text[4]
        print('User:', user)
        print('Followers:', followers)
        print('Following:', following)
        print('Posts:', posts)
        print('-----------------------')

    def mail(self):
        self.ctx = ssl.create_default_context()
        self.ctx.check_hostname = False
        self.ctx.verify_mode = ssl.CERT_NONE
        with open('123.txt') as f:
            self.content = f.readlines()
        self.content = [x.strip() for x in self.content]
        for url in self.content:
            self.getinfo(url)

if __name__ == '__main__':
    obj = insta_Scraper_v1()
    obj.mail()
Ref: GeeksforGeeks: Indentation
Thanks
I have a text file that has one number in it. I'm trying to have the script go to the website with that number appended to the URL, grab the info, then move on to the next URL in the sequence and pull its info, and so on. If a number brings up a blank page, it should end the sequence and email out the information it gathered. I'm not getting any errors; the script completes its run, but I'm not getting anything back and I don't see any change to the number in the text file. I'm curious whether what I've got for this part of the program is correct, or if I'm missing something.
Here's what I've got
import requests
from bs4 import BeautifulSoup as bs

# loads LIC# url
def get_page(license_number):
    url = URL_FORMAT.format(license_number)
    r = requests.get(url)
    return bs(r.text, 'lxml')

# looks for non-existent info for no-license
def license_exists(soup):
    if soup.find('td', class_='style3'):
        return True
    else:
        return False

# pulls lic# from text license_number.txt
def get_current_license_number():
    with open(LICENSE_NUMBER_FILE, 'r') as f:
        return int(f.read())

# adds lic# to urls
def get_new_license_pages(curr_license_num):
    new_pages = []
    more = True
    curr_license_num += 1
    return new_pages
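As written, get_new_license_pages increments the number once and returns an empty list, so nothing is ever fetched or written back. A rough sketch of the loop described above, reusing get_page and license_exists and assuming URL_FORMAT and LICENSE_NUMBER_FILE are defined elsewhere in the program, might look like this (the emailing step is left out):

# Sketch only: keep requesting pages for successive license numbers until
# license_exists() reports a blank page, then return what was gathered.
def get_new_license_pages(curr_license_num):
    new_pages = []
    more = True
    while more:
        curr_license_num += 1
        soup = get_page(curr_license_num)
        if license_exists(soup):
            new_pages.append(soup)
        else:
            more = False
    # Nothing in the original ever rewrites the text file; the last good
    # number has to be written back somewhere, for example:
    with open(LICENSE_NUMBER_FILE, 'w') as f:
        f.write(str(curr_license_num - 1))
    return new_pages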
How do I get the text value of the title element?
Is this even possible with a DOM Element?
Will I have to parse out the text by hand?
#-*-coding:utf8;-*-
#qpy:3
#qpy:console
import re
import urllib.request
from xml.dom import minidom

def download(url):
    with urllib.request.urlopen(url) as res:
        return res.read().decode('latin-1')

class RSSFeed(object):
    def __init__(self, url):
        self.url = url
        self.raw_xml = download(url)
        self.dom = minidom.parseString(self.raw_xml)
        self.links = self.dom.getElementsByTagName('link')

    def entries(self):
        ret = {}
        for element in self.dom.getElementsByTagName('entry'):
            title = element.getElementsByTagName('title')[0]
            print(title.toprettyxml())

    def __str__(self):
        return self.dom.toprettyxml()

feed_url = 'https://rickys-python-notes.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500'
feed = RSSFeed(feed_url)
dom = feed.dom
print(feed.entries())
The canonical way to determine the node value (i.e. text content) of any XML element is to
get the node value of all the text nodes it contains, including the nested ones
trim them
join them with a space
Minidom inexplicably does not implement this procedure, so if you must use minidom, you need to do it yourself.
So we need a few helper functions.
One to get all the descendant nodes that fulfill a certain condition, like being a text node.
One to get their values and join them.
One that gets the first element of a certain name from a node, for convenience.
Let's collect them in a module.
# minidom_helpers.py

def get_descendant_nodes(context_node, predicate):
    # Recursively yield every descendant node for which predicate() is true.
    if not context_node:
        return
    for child in context_node.childNodes:
        if predicate(child):
            yield child
        yield from get_descendant_nodes(child, predicate)

def get_text_value(context_node, default=None):
    # Join the trimmed values of all descendant text nodes with a space.
    texts_nodes = get_descendant_nodes(context_node, lambda n: n.nodeType == n.TEXT_NODE)
    text_value = ' '.join([str.strip(t.nodeValue) for t in texts_nodes])
    return text_value if text_value else default

def get_first_child(context_node, element_name):
    # Return the first matching descendant element, or None if there is none.
    elems = context_node.getElementsByTagName(element_name)
    return elems[0] if elems else None
Now we can do
import re
import urllib.request
from xml.dom import minidom
from minidom_helpers import *
class RSSFeed(object):
    def __init__(self, url):
        self.url = url
        self.dom = minidom.parse(urllib.request.urlopen(url))
        self.links = self.dom.getElementsByTagName('link')

    def entries(self):
        for entry in self.dom.getElementsByTagName('entry'):
            yield {
                "title": get_text_value(get_first_child(entry, 'title'))
            }

    def __str__(self):
        return self.dom.toprettyxml()

feed_url = 'https://rickys-python-notes.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500'
feed = RSSFeed(feed_url)

for entry in feed.entries():
    print(entry)
A general note on parsing XML. Try to get into the habit of thinking of XML as binary data, instead of text.
XML parsers implement a complex mechanism of figuring out the file encoding automatically. It's not necessary and not smart to circumvent that mechanism by trying to decode the file or HTTP response into a string yourself ahead of time:
# BAD CODE, DO NOT USE
def download(url):
    with urllib.request.urlopen(url) as res:
        return res.read().decode('latin-1')

raw_xml = download(url)
dom = minidom.parseString(raw_xml)
The above makes hard-coded (and in your case: wrong) assumptions about the file encoding and will break when the server decides to start sending the file in UTF-16 for some reason.
If you think of XML as binary data instead of text, it gets both a lot easier and a lot more robust.
dom = minidom.parse(urllib.request.urlopen(url))
The XML parser will sniff the bytes and decide what encoding they are in.
This is also true for reading XML from files. Instead of
# BAD CODE, DO NOT USE
with open(path, 'r', encoding='latin-1') as fp:
    dom = minidom.parseString(fp.read())
Use
with open(path, 'rb') as fp:
    dom = minidom.parse(fp)
or simply
dom = minidom.parse(path)
def entries(self):
    for element in self.dom.getElementsByTagName('entry'):
        title = element.getElementsByTagName('title')[0].firstChild.nodeValue
        link = element.getElementsByTagName('link')[0].getAttribute('href')
        author = element.getElementsByTagName('name')[0].firstChild.nodeValue
        article = element.getElementsByTagName('content')[0].firstChild
        yield type('Entry', (object,), dict(title=title, link=link, author=author, article=article))
#-*-coding:utf8;-*-
#qpy:3
#qpy:console
import urllib.request
from xml.dom import minidom
def parse_feed(url):
    with urllib.request.urlopen(url) as res:
        dom = minidom.parseString(res.read().decode('latin-1'))
    for element in dom.getElementsByTagName('entry'):
        title = element.getElementsByTagName('title')[0].firstChild.nodeValue
        link = element.getElementsByTagName('link')[0].getAttribute('href')
        author = element.getElementsByTagName('name')[0].firstChild.nodeValue
        article = element.getElementsByTagName('content')[0].firstChild.nodeValue
        yield type('Entry', (object,), dict(title=title, link=link, author=author, article=article))

feed_url = 'https://rickys-python-notes.blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500'

for entry in parse_feed(feed_url):
    print(entry.title, entry.link)
I am using this book and tried to download links using the crawler, but I don't know why nothing happens. I followed the code on pages 55 and 57, but no links come out the way the author shows.
Here is the code:
File name: linkextractCrawler.py
import urllib2
from BeautifulSoup import *
from urlparse import urljoin

# Create a list of words to ignore
class crawler:
    # Initialize the crawler with the name of database
    def __init__(self, dbname):
        pass

    def __del__(self):
        pass

    def dbcommit(self):
        pass

    # Auxilliary function for getting an entry id and adding
    # it if it's not present
    def getentryid(self, table, field, value, createnew=True):
        return None

    # Index an individual page
    def addtoindex(self, url, soup):
        print 'Indexing %s' % url

    # Extract the text from an HTML page (no tags)
    def gettextonly(self, soup):
        return None

    # Separate the words by any non-whitespace character
    def separatewords(self, text):
        return None

    # Return true if this url is already indexed
    def isindexed(self, url):
        return False

    # Add a link between two pages
    def addlinkref(self, urlFrom, urlTo, linkText):
        pass

    # Starting with a list of pages, do a breadth
    # first search to the given depth, indexing pages
    # as we go
    def crawl(self, pages, depth=2):
        pass

    # Create the database tables
    def createindextables(self):
        pass

ignorewords = set(['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'])
print("kk");

def crawl(self, pages, depth=2):
    for i in range(depth):
        newpages = set()
        for page in pages:
            try:
                c = urllib2.urlopen(page)
            except:
                print "Could not open %s" % page
                continue
            soup = BeautifulSoup(c.read())
            self.addtoindex(page, soup)
            links = soup('a')
            for link in links:
                if ('href' in dict(link.attrs)):
                    url = urljoin(page, link['href'])
                    if url.find("'") != -1: continue
                    url = url.split('#')[0]  # remove location portion
                    if url[0:4] == 'http' and not self.isindexed(url):
                        newpages.add(url)
                        linkText = self.gettextonly(link)
                        self.addlinkref(page, url, linkText)
            self.dbcommit()
        pages = newpages
    print("kk");
    print(pages);
On console:
>>> import linkextractCrawler
>>> p = ['https://en.wikipedia.org/wiki/Perl.html']
>>> crawler=linkextractCrawler.crawler('')
>>> crawler.crawl(p)
>>>
Here's the problem:
Users register for a site and can pick one of 8 job categories, or choose to skip this step. I want to classify the users who've skipped that step into job categories, based on the domain name in their email address.
Current setup:
Using a combination of Beautiful Soup and nltk, I scrape the homepage and look for links to pages on the site that contain the word "about". I scrape that page, too. I've copied the bit of code that does the scraping at the end of this post.
The issue:
I'm not getting enough data to get a good learning routine in place. I'd like to know if my scraping algorithm is set up for success--in other words, are there any gaping holes in my logic, or any better way to ensure that I have a good chunk of text that describes what kind of work a company does?
The (relevant) code:
import bs4 as bs
import httplib2 as http
import nltk

# Only these characters are valid in a url
ALLOWED_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]#!$&'()*+,;="

class WebPage(object):
    def __init__(self, domain):
        """
        Constructor
        :param domain: URL to look at
        :type domain: str
        """
        self.url = 'http://www.' + domain
        try:
            self._get_homepage()
        except:  # Catch specific here?
            self.homepage = None
        try:
            self._get_about_us()
        except:
            self.about_us = None

    def _get_homepage(self):
        """
        Open the home page, looking for redirects
        """
        import re
        web = http.Http()
        response, pg = web.request(self.url)
        # Check for redirects:
        if int(response.get('content-length', 251)) < 250:
            new_url = re.findall(r'(https?://\S+)', pg)[0]
            if len(new_url):  # otherwise there's not much I can do...
                self.url = ''.join(x for x in new_url if x in ALLOWED_CHARS)
                response, pg = web.request(self.url)
        self.homepage = self._parse_html(nltk.clean_html(pg))
        self._raw_homepage = pg

    def _get_about_us(self):
        """
        Soup-ify the home page, find the "About us" page, and store its contents in a
        string
        """
        soup = bs.BeautifulSoup(self._raw_homepage)
        links = [x for x in soup.findAll('a') if x.get('href', None) is not None]
        about = [x.get('href') for x in links if 'about' in x.get('href', '').lower()]
        # need to find about or about-us
        about_us_page = None
        for a in about:
            bits = a.strip('/').split('/')
            if len(bits) == 1:
                about_us_page = bits[0]
            elif 'about' in bits[-1].lower():
                about_us_page = bits[-1]
        # otherwise assume shortest string is top-level about pg.
        if about_us_page is None and len(about):
            about_us_page = min(about, key=len)
        self.about_us = None
        if about_us_page is not None:
            self.about_us_url = self.url + '/' + about_us_page
            web = http.Http()
            response, pg = web.request(self.about_us_url)
            if int(response.get('content-length', 251)) > 250:
                self.about_us = self._parse_html(nltk.clean_html(pg))

    def _parse_html(self, raw_text):
        """
        Clean html coming from a web page. Gets rid of
        - all '\n' and '\r' characters
        - all zero length words
        - all unicode characters that aren't ascii (i.e., &...)
        """
        lines = [x.strip() for x in raw_text.splitlines()]
        all_text = ' '.join([x for x in lines if len(x)])  # zero length strings
        return [x for x in all_text.split(' ') if len(x) and x[0] != '&']
It is outside the scope of what you are asking, but I would look at calling an external data source that has already collected this information. A good place to find such a service is the Programmable Web (for instance Mergent Company Fundamentals). Not all the data on Programmable Web is up to date, but it seems like there are a lot of API providers out there.
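As a rough illustration only (the endpoint and response fields below are hypothetical, not any specific Programmable Web provider), calling such a service usually boils down to mapping the email domain to a company profile and reading off an industry field:

import requests

# Hypothetical company-data API; substitute the real provider's endpoint,
# authentication scheme, and response schema.
COMPANY_API = 'https://api.example-company-data.com/v1/companies'

def industry_for_domain(domain, api_key):
    # Ask the (hypothetical) service for the company behind an email domain.
    resp = requests.get(COMPANY_API, params={'domain': domain, 'api_key': api_key})
    resp.raise_for_status()
    data = resp.json()
    # The 'industry' field name is an assumption; real providers vary.
    return data.get('industry')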