How to get around "MissingSchema" error in Python?

After running my script I noticed that my "parse_doc" function throws an error whenever the url it receives is empty. It turns out that my "process_doc" function was supposed to produce 25 links but produces only 19, because a few pages don't have any link leading to another page. When my second function receives one of those empty links, it raises the "MissingSchema" error. How can I get around this so that when it finds an empty link it moves on to the next one? Here is the relevant portion of my script, which should give you an idea of what I mean:
import re
import requests
from lxml import html

def process_doc(medium_link):
    page = requests.get(medium_link).text
    tree = html.fromstring(page)
    try:
        name = tree.xpath('//span[@id="titletextonly"]/text()')[0]
    except IndexError:
        name = ""
    try:
        link = base + tree.xpath('//section[@id="postingbody"]//a[@class="showcontact"]/@href')[0]
    except IndexError:
        link = ""
    parse_doc(name, link)  # all links get to this function, but some of them are empty ("")

def parse_doc(title, target_link):
    page = requests.get(target_link).text  # error thrown here when the link is empty
    tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
    print(title, tel)
The error I'm getting:
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?
Btw, in my first function there is a variable named "base" which gets concatenated with the extracted href to make a full-fledged link.
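For illustration only (the actual value of "base" isn't shown in the question, so this one is hypothetical):

base = 'https://newyork.craigslist.org'  # hypothetical value
href = '/fake/contact/path'              # what the xpath above might return
link = base + href                       # the full-fledged link passed on to parse_doc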

If you want to avoid the cases when your target_link is None (or empty), then try

def parse_doc(title, target_link):
    if target_link:
        page = requests.get(target_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(tel)
        print(title)

This way you handle only non-empty links and do nothing otherwise.

First of all, make sure that your schema, meaning the URL, is correct. Sometimes you are just missing a character, or have one too many, in https://.
If you do have to deal with the exception, though, you can catch it like this:

import requests
from requests.exceptions import MissingSchema
...
try:
    res = requests.get(linkUrl)
    print(res)
except MissingSchema:
    print('URL is not complete')
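For reference, a short sketch combining both suggestions, skipping empty links and catching malformed ones. It reuses the names from the question and is only one way to wire it up:

import re
import requests
from requests.exceptions import MissingSchema

def parse_doc(title, target_link):
    if not target_link:  # covers the empty "" links produced by process_doc
        return
    try:
        page = requests.get(target_link).text
    except MissingSchema:  # malformed URL, e.g. missing http://
        print('URL is not complete:', target_link)
        return
    tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
    print(title, tel)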

Scraping text from a header and Class tag

The code below errors out when trying to execute this line:
RptTime = TimeTable[0].xpath('//text()')
I'm not sure why. I can see that TimeTable has a value in my variable window, and content.cssselect returns a value at the time of assignment, but the HtmlElement TimeTable[0] has no value. Why then would I get a "list index out of range" error? That tells me the element is empty. I am trying to get the Year/Month value from that field.
import pandas as pd
from datetime import datetime
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type":"text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type":"text"})
    content.raise_for_status()  # raise HTTPError for bad requests (4xx or 5xx)
    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns a html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed

cmslinks = [
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']

df = pd.DataFrame()
df2 = pd.DataFrame()
for cmslink in cmslinks:
    print(cmslink)
    content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
    linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
    TimeTable = content.cssselect('td[headers="view-dlf-2-report-period-table-column"]')[0]
    headers = linkTable[0].xpath("//a[contains(text(),'Contract Summary') or contains(text(),'Monthly Enrollment by CPSC')]/@href")
    RptTime = TimeTable.xpath('//text()')
    dfl = pd.DataFrame(headers, columns=['links'])
    dft = pd.DataFrame(RptTime, columns=['ReportTime'])
    df = df.append(dfl)
    df2 = df.append(dft)
Error
src\lxml\etree.pyx in lxml.etree._Element.__getitem__()
IndexError: list index out of range
Look carefully at your last line: df2 = df.append(dft). Your explanation and code are quite unclear due to the lost indenting and the truncated traceback, but this is almost certainly not what you intended.
DataFrame.append does not modify the frame in place; it returns a new DataFrame, which is why you assign the result back, as in df = df.append(dfl). Your last line, however, appends dft to df and stores the result in df2, so df2 ends up containing the rows of df as well. This is not the cause of your IndexError, but I thought it my duty to tell you anyway.
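A quick illustration of the assign-back pattern, with tiny made-up frames:

import pandas as pd

df = pd.DataFrame()
dfl = pd.DataFrame({'links': ['a', 'b']})
df = df.append(dfl)  # append returns a new frame, so assign it back
# (in pandas 2.x, DataFrame.append was removed; use pd.concat([df, dfl]) instead)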
Could you please tell us exactly what the cssselect calls return? Although you said they returned something, you only store index [0] of the return value, so if the return value were, hypothetically, ["", "something"], the value stored would be the empty entry.
It is quite likely that your problem is that you indexed [0] twice (once on the cssselect result and again on linkTable), when you probably only meant to do it once.
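For what it's worth, a defensive sketch of that selection step: guard the list before indexing instead of assuming it is non-empty (the selector string is copied from the question):

cells = content.cssselect('td[headers="view-dlf-2-report-period-table-column"]')
if cells:  # only index when the selector actually matched
    RptTime = cells[0].xpath('.//text()')  # './/' keeps the search inside this cell rather than the whole document
else:
    RptTime = []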

How to correctly write a program which extracts all links from a web page?

This is part of the Udacity course WEB SEARCH ENGINE. The goal of this quiz is to write a program which extracts all links from a web page. The output must contain only the links, but in my case the program returns all the links plus "None" twice. I know that the error is in the second part of the program, after the "while" and after the "else", but I don't know what I should write there.
def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    else:
        start_quote = page.find('"', start_link)
        endquo = page.find('"', start_quote + 1)
        url = page[(start_quote + 1):endquo]
        return url, endquo

page = 'i know what you doing summer <a href="Udasity".i know what you doing summer <a href="Georgia" i know what you doing summer '

def ALLlink(page):
    url = 1
    while url != None:
        url, endquo = get_next_target(page)
        if url:
            print url
            page = page[endquo:]
        else:
            pass

print ALLlink(page)
First, you can remove your else statement in your ALLlink() function since it's not doing anything.
Also, when comparing to None, you should use is not instead of !=:
while url != None:      # bad
while url is not None:  # good
That said, I think your error is in your last line:
print ALLlink(page)
You basically have two print statements. The first is inside your function and the second is on the last line of your script. Really, you don't need the last print statement there because you're already printing in your ALLlink() function. So if you change the line to just ALLlink(page), I think it'll work.
If you do want to print there, you could modify your function to store the URLs in an array, and then print that array. Something like this:
def ALLlink(page):
    urls = []
    url = 1
    while url is not None:
        url, endquo = get_next_target(page)
        if url:
            urls.append(url)
            page = page[endquo:]
    return urls

print ALLlink(page)
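With the sample page string from the question, this version should print ['Udasity', 'Georgia'] -- the two href values and nothing else.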

Getting wrong result from JSON - Python 3

I'm working on a small project retrieving information about books from the Google Books API using Python 3. For this I make a call to the API, read out the variables and store them in a list. For a search like "linkedin" this works perfectly. However, when I enter "Google", it reads the second title from the JSON input instead of the first. How can this happen?
Please find my code below (Google_Results is the class I use to initialize the variables):
import requests

def Book_Search(search_term):
    parms = {"q": search_term, "maxResults": 3}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    print(r.url)
    results = r.json()
    i = 0
    for result in results["items"]:
        try:
            isbn13 = str(result["volumeInfo"]["industryIdentifiers"][0]["identifier"])
            isbn10 = str(result["volumeInfo"]["industryIdentifiers"][1]["identifier"])
            title = str(result["volumeInfo"]["title"])
            author = str(result["volumeInfo"]["authors"])[2:-2]
            publisher = str(result["volumeInfo"]["publisher"])
            published_date = str(result["volumeInfo"]["publishedDate"])
            description = str(result["volumeInfo"]["description"])
            pages = str(result["volumeInfo"]["pageCount"])
            genre = str(result["volumeInfo"]["categories"])[2:-2]
            language = str(result["volumeInfo"]["language"])
            image_link = str(result["volumeInfo"]["imageLinks"]["thumbnail"])
            dict = Google_Results(isbn13, isbn10, title, author, publisher, published_date, description, pages, genre,
                                  language, image_link)
            gr.append(dict)
            print(gr[i].title)
            i += 1
        except:
            pass
    return

gr = []
Book_Search("Linkedin")
I am a beginner to Python, so any help would be appreciated!
It does so because there is no publisher entry in the volumeInfo of the first result, so it raises a KeyError which your bare except captures, silently skipping the whole entry. If you're going to work with fuzzy data you have to account for the fact that it will not always have the expected structure. For simple cases you can rely on dict.get() and its default argument to return a 'valid' default value when a key is missing.
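For example, a tiny illustration of dict.get() with a default (made-up data):

volume_info = {"title": "Some Book"}             # no "publisher" key here
publisher = volume_info.get("publisher", "N/A")  # returns "N/A" instead of raising KeyError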
Also, there are a few conceptual problems with your function: it relies on a global gr, which is bad design; it shadows the built-in dict type; and it captures all exceptions, guaranteeing that you cannot exit your code even with a SIGINT. I'd suggest converting it to something a bit more sane:
import requests

def book_search(search_term, max_results=3):
    results = []  # a list to store the results
    parms = {"q": search_term, "maxResults": max_results}
    r = requests.get(url="https://www.googleapis.com/books/v1/volumes", params=parms)
    try:  # just in case the server doesn't return valid JSON
        for result in r.json().get("items", []):
            if "volumeInfo" not in result:  # invalid entry - missing volumeInfo
                continue
            result_dict = {}  # a dictionary to store our discovered fields
            result = result["volumeInfo"]  # all the data we're interested in is in volumeInfo
            isbns = result.get("industryIdentifiers", None)  # capture ISBNs
            if isinstance(isbns, list) and isbns:
                for i, t in enumerate(("isbn10", "isbn13")):
                    if len(isbns) > i and isinstance(isbns[i], dict):
                        result_dict[t] = isbns[i].get("identifier", None)
            result_dict["title"] = result.get("title", None)
            authors = result.get("authors", None)  # capture authors
            if isinstance(authors, list) and authors:
                result_dict["author"] = str(authors)[2:-2]  # mirrors the question's [2:-2] bracket-stripping
            result_dict["publisher"] = result.get("publisher", None)
            result_dict["published_date"] = result.get("publishedDate", None)
            result_dict["description"] = result.get("description", None)
            result_dict["pages"] = result.get("pageCount", None)
            genres = result.get("categories", None)  # capture genres
            if isinstance(genres, list) and genres:
                result_dict["genre"] = str(genres)[2:-2]  # same bracket-stripping as above
            result_dict["language"] = result.get("language", None)
            result_dict["image_link"] = result.get("imageLinks", {}).get("thumbnail", None)
            # make sure Google_Results accepts keyword arguments like title, author...
            # and make them optional as they might not be in the returned result
            gr = Google_Results(**result_dict)
            results.append(gr)  # add it to the results list
    except ValueError:
        return None  # invalid response returned, you may raise an error instead
    return results  # return the results
Then you can easily retrieve as much info as possible for a term:
gr = book_search("Google")
And it will be far more tolerant of data omissions, provided that your Google_Results type makes most of the entries optional.
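For instance, a minimal sketch of such a type -- the real Google_Results class isn't shown in the question, so this is just one way to make every field optional:

class Google_Results:
    def __init__(self, title=None, author=None, publisher=None, **fields):
        self.title = title
        self.author = author
        self.publisher = publisher
        self.__dict__.update(fields)  # any remaining optional fields (isbn10, isbn13, genre, ...)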
Following @Coldspeed's recommendation it became clear that missing information in the JSON caused the exception to be raised, and since I only had a "pass" statement there, the entire result was skipped. I will therefore adapt the try/except statements so errors get handled properly.
Thanks for the help guys!

ValueError on writing lxml text

I have the following block to write an xml tag. Sometimes the name is already in the correct form (that is, it won't error), and sometimes it is not:
if 'Name' in title_data:
    name = etree.SubElement(info, 'Name')
    try:
        name.text = title_data['Name']
    except ValueError:
        name.text = title_data['Name'].decode('utf-8')
Is there a way to simplify this? For example, something along the lines of:
name.text = title_data['Name'] if (**something**) else title_data['Name'].decode('utf-8')
I assume that you want to avoid having to write similar code for every element you want to set. This has the smell of trying to treat the symptom rather than the cause, but if nothing else, you can simply break that out into a helper function:
def assign_text(field, text):
    try:
        field.text = text
    except ValueError:
        field.text = text.decode("utf-8")

# ...

if "Name" in title_data:
    name = etree.SubElement(info, "Name")
    assign_text(name, title_data["Name"] or None)
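As an aside, the ValueError here usually means the value is a byte string that lxml cannot interpret, so another option is to address the cause and normalize everything to text once, before any assignment. A small sketch of that idea (the helper name is my own):

def to_text(value, encoding="utf-8"):
    # decode byte strings up front so .text assignment never has to guess
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

name.text = to_text(title_data["Name"])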

KeyError and TypeError in my python web scraper

So sorry about the vague and confusing title, but there is really no better way for me to summarize my problem in one sentence.
I was trying to get the student and grade information from a French website. The link is this (http://www.bankexam.fr/resultat/2014/BACCALAUREAT/AMIENS?filiere=BACS).
My code is as follows:
import time
import urllib2
from bs4 import BeautifulSoup

regions = {'R\xc3\xa9sultats Bac Amiens 2014':'/resultat/2014/BACCALAUREAT/AMIENS'}
base_url = 'http://www.bankexam.fr'
tests = {'es':'?filiere=BACES','s':'?filiere=BACS','l':'?filiere=BACL'}

for i in regions:
    for x in tests:
        # create the output file
        output_file = open('/Users/student project/' + i + '_' + x + '.txt', 'a')
        time.sleep(2)  # compassionate scraping
        section_url = base_url + regions[i] + tests[x]  # now goes to the x test page of region i
        request = urllib2.Request(section_url)
        response = urllib2.urlopen(request)
        soup = BeautifulSoup(response, 'html.parser')
        content = soup.find('div', id='zone_res')
        for row in content.find_all('tr'):
            if row.td:
                student = row.find_all('td')
                name = student[0].strong.string.encode('utf8').strip()
                try:
                    school = student[1].strong.string.encode('utf8')
                except AttributeError:
                    school = 'NA'
                result = student[2].span.string.encode('utf8')
                output_file.write('%s|%s|%s\n' % (name, school, result))
        # Find the maximum pages to go through
        if soup.find('div', 'pagination'):
            import re
            page_info = soup.find('div', 'pagination')
            pages = []
            for i in page_info.find_all('a', re.compile('elt')):
                try:
                    pages.append(int(i.string.encode('utf8')))
                except ValueError:
                    continue
            max_page = max(pages)
            # Now goes through page 2 to max page
            for i in range(1, max_page):
                page_url = '&p=' + str(i) + '#anchor'
                section2_url = section_url + page_url
                request = urllib2.Request(section2_url)
                response = urllib2.urlopen(request)
                soup = BeautifulSoup(response, 'html.parser')
                content = soup.find('div', id='zone_res')
                for row in content.find_all('tr'):
                    if row.td:
                        student = row.find_all('td')
                        name = student[0].strong.string.encode('utf8').strip()
                        try:
                            school = student[1].strong.string.encode('utf8')
                        except AttributeError:
                            school = 'NA'
                        result = student[2].span.string.encode('utf8')
                        output_file.write('%s|%s|%s\n' % (name, school, result))
A little more description of the code:
I created the 'regions' and 'tests' dictionaries because there are 30 other regions I need to collect, and I include just one here for showcase. I'm only interested in the results of three tests (ES, S, L), hence the 'tests' dictionary.
Two errors keep showing up. One is
KeyError: 2
and the error is linked to line 12,
section_url = base_url + regions[i] + tests[x]
The other is
TypeError: cannot concatenate 'str' and 'int' objects
and this is linked to line 10.
I know there is a lot of information here and I'm probably not listing the most important parts for you to help me, but let me know what I can do to fix this!
Thanks
The issue is that you're using the variable i in more than one place.
Near the top of the file, you do:
for i in regions:
So, in some places i is expected to be a key into the regions dictionary.
The trouble comes when you use it again later. You do so in two places:
for i in page_info.find_all('a',re.compile('elt')):
And:
for i in range(1,max_page):
The second of these is what is causing your exceptions, as the integer values that get assigned to i don't appear in the regions dict (nor can an integer be added to a string).
I suggest renaming some or all of those variables. Give them meaningful names, if possible (i is perhaps acceptable for an "index" variable, but I'd avoid using it for anything else unless you're code golfing).
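For instance, a minimal runnable sketch of that renaming, with variable names of my own choosing and a stand-in value for max_page:

regions = {'Amiens': '/resultat/2014/BACCALAUREAT/AMIENS'}
tests = {'s': '?filiere=BACS'}
base_url = 'http://www.bankexam.fr'

for region_name in regions:                     # was: for i in regions
    for test_name in tests:                     # was: for x in tests
        section_url = base_url + regions[region_name] + tests[test_name]
        max_page = 3                            # stand-in for the value scraped from the pagination div
        for page_number in range(1, max_page):  # was: for i in range(1, max_page) -- no more clash
            print section_url + '&p=' + str(page_number) + '#anchor'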
