Trouble executing my class crawler - python

I'm a complete newbie to Python when it comes to scraping web data using a class, so apologies in advance for any serious mistakes. I've written a script that parses the text of the a tags on the Wikipedia main page. I tried my best to write the code correctly, but when I execute it, it throws an error. The code and the error are given below.
The script:
import requests
from lxml.html import fromstring
class TextParser(object):
    def __init__(self):
        self.link = 'https://en.wikipedia.org/wiki/Main_Page'
        self.storage = None
    def fetch_url(self):
        self.storage = requests.get(self.link).text
    def get_text(self):
        root = fromstring(self.storage)
        for post in root.cssselect('a'):
            print(post.text)
item = TextParser()
item.get_text()
The error:
Traceback (most recent call last):
  File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\testmatch.py", line 38, in <module>
    item.get_text()
  File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\testmatch.py", line 33, in get_text
    root = fromstring(self.storage)
  File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\lib\site-packages\lxml\html\__init__.py", line 875, in fromstring
    is_full_html = _looks_like_full_html_unicode(html)
TypeError: expected string or bytes-like object

You're executing the following two lines
item = TextParser()
item.get_text()
When you initialize TextParser, self.storage is None. When you call get_text(), it is still None because fetch_url() was never called, so fromstring() receives None instead of a string. That's why you get the error.
However, if you change the calls to the following, self.storage is populated with a string before it is parsed:
item = TextParser()
item.fetch_url()
item.get_text()
If you want to call get_text() without calling fetch_url() first, you can have get_text() fetch the page itself:
def get_text(self):
    self.fetch_url()
    root = fromstring(self.storage)
    for post in root.cssselect('a'):
        print(post.text)
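One caveat with calling fetch_url() inside get_text() is that every call downloads the page again. Below is a minimal sketch of a variant that fetches only while self.storage is still None. The fetcher argument, the get_html() method and the fake_fetcher stub are assumptions made for illustration (none of them are in the original script); the stub lets the example run offline, and in the real class the fetch would be requests.get(self.link).text.

```python
class TextParser(object):
    def __init__(self, link, fetcher):
        self.link = link
        # Hypothetical hook: any callable that takes a URL and returns HTML text.
        self.fetcher = fetcher
        self.storage = None

    def fetch_url(self):
        self.storage = self.fetcher(self.link)

    def get_html(self):
        # Fetch lazily: only when nothing has been stored yet.
        if self.storage is None:
            self.fetch_url()
        return self.storage


calls = []

def fake_fetcher(url):
    # Stand-in for requests.get(url).text so the sketch runs offline.
    calls.append(url)
    return '<html><body><a href="/wiki/Python">Python</a></body></html>'


item = TextParser('https://en.wikipedia.org/wiki/Main_Page', fake_fetcher)
first = item.get_html()
second = item.get_html()
print(len(calls))  # the page was fetched once, not twice
```

With this guard in place, the order of calls no longer matters: whichever method needs the HTML first triggers the download.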

Related

Python - Web Scraping exercise - Attribute Error

I am learning how to scrape web information. Below is a snippet of the actual code solution and output from DataCamp.
On DataCamp this works perfectly fine, but when I try to run it in Spyder (on my own MacBook), it doesn't work.
This is because on DataCamp the URL has already been pre-loaded into a variable named response, whereas in Spyder it needs to be defined again.
So I first defined the response variable as response = requests.get('https://www.datacamp.com/courses/all') so that the code points to DataCamp's website.
My code looks like:
from scrapy.selector import Selector
import requests
response = requests.get('https://www.datacamp.com/courses/all')
this_url = response.url
this_title = response.xpath('/html/head/title/text()').extract_first()
print_url_title( this_url, this_title )
When I run this on Spyder, I got an error message
Traceback (most recent call last):
  File "<ipython-input-30-6a8340fd3a71>", line 11, in <module>
    this_title = response.xpath('/html/head/title/text()').extract_first()
AttributeError: 'Response' object has no attribute 'xpath'
Could someone please guide me? I would really like to know how to get this code working on Spyder.. thank you very much.
The value returned by requests.get('https://www.datacamp.com/courses/all') is a requests Response object, and this object has no xpath attribute, hence the error: AttributeError: 'Response' object has no attribute 'xpath'
I assume the response variable in your tutorial was bound to a different kind of object (most likely the object returned by etree.HTML), not the value returned by requests.get(url).
You can however do this:
import requests
from lxml import etree  # import etree
response = requests.get('https://www.datacamp.com/courses/all')  # get the Response object
tree = etree.HTML(response.text)  # parse the page's source taken from the Response object
result = tree.xpath('/html/head/title/text()')  # extract the value (a list of matching text nodes)
print(response.url)  # url
print(result)  # findings

__new__() missing 1 required positional argument, depending on the URL scraped

I have a weird error and I will try to simplify my problem.
I have a simple function that scrapes a URL with Beautiful Soup and returns a list.
Then I pickle the list to a file, so I call sys.setrecursionlimit(10000) to avoid a RecursionError. Up to that point, everything works.
But when I try to unpickle my list, I get this error:
Traceback (most recent call last):
  File ".\scrap_index.py", line 86, in <module>
    data_file = pickle.load(data)
TypeError: __new__() missing 1 required positional argument: 'name'
Here is my function:
import urllib.request
from bs4 import BeautifulSoup
def scrap_function(url):
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "html5lib")
    return [soup]
For testing, I've tried different URLs.
With that url, everything is good:
url_ok = 'https://www.boursorama.com/bourse/'
But with that one, I have the TypeError:
url_not_ok = 'https://www.boursorama.com/bourse/actions'
And the test code:
import pickle
import sys
sys.setrecursionlimit(10000)
scrap_list = scrap_function(url_not_ok)
with open('test_saving.pkl', 'wb') as data:
    pickle.dump(scrap_list, data, protocol=2)
with open('test_saving.pkl', 'rb') as data:
    data_file = pickle.load(data)
print(data_file)
The linked explanation states that if a class takes extra arguments in its __new__ constructor, pickle cannot round-trip its instances.
This could be the cause of the problem here, since Beautiful Soup defines classes like this:
class NavigableString(unicode, PageElement):
    def __new__(cls, value):
Another linked answer states the same.
As a solution, do not store the whole soup object; store only the source code of the page instead.
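The mechanism can be reproduced without Beautiful Soup. The sketch below uses a hypothetical str subclass, Labelled, whose __new__ requires an extra name argument; it is an assumption made for illustration and is not a class from the bs4 source. Dumping succeeds, but on loading pickle rebuilds the object by calling __new__ with the string value only, which raises the same kind of TypeError as in the question. Storing a plain string instead (for a soup object that would be str(soup), re-parsed after loading) avoids the problem.

```python
import pickle


class Labelled(str):
    # Hypothetical str subclass: __new__ requires an extra argument, "name".
    def __new__(cls, value, name):
        obj = str.__new__(cls, value)
        obj.name = name
        return obj


def roundtrip(obj):
    """Pickle and unpickle obj, returning the result or the TypeError raised."""
    try:
        return pickle.loads(pickle.dumps(obj, protocol=2))
    except TypeError as exc:
        return exc


text = Labelled('hello', name='greeting')

# Dumping works, but loading calls Labelled.__new__ with the string value only.
failed = roundtrip(text)
print(type(failed).__name__, failed)

# Workaround in the spirit of the answer: store a plain string instead.
restored = roundtrip(str(text))
print(restored)
```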

Regarding KeyError while using Python json module

It would help if someone could point out the mistake.
Here I am trying to write code that displays the name of the city, state and country, taking a pincode as input. Thanks in advance.
import urllib, json
from urllib.request import urlopen
from tkinter import *
global pincode
root=Tk()
frame=Frame(root,width=250,height=250)
frame.grid()
class cal:
    def __init__(self):
        self.string=StringVar()
        entry=Entry(frame,textvariable=self.string)
        entry.grid(row=1,column=2,columnspan=6)
        but=Button(root,text="submit",command=self.pin)
        but.grid()
    def pin(self):
        pincode=self.string.get()
        url = "https://www.whizapi.com/api/v2/util/ui/in/indian-city-by-postal-code?pin="+pincode+"&project-app-key=fnb1agfepp41y49jz6a39upx"
        response = urllib.request.urlopen(url)
        data = json.loads(response.read().decode('utf8'))
        fi=open("neme.txt","w")
        fi.write(str(data))
        state=data['State']
        city=data['City']
        area=data['area']
        name=Label(frame,text="State:"+state+"City:"+city+"area:"+area)
        name.grid(row=3,column=0)
cal()
mainloop()
The error is:
Traceback (most recent call last):
  File "/usr/lib/python3.4/tkinter/__init__.py", line 1541, in __call__
    return self.func(*args)
  File "/home/yuvi/Documents/LiClipse Workspace/GUI/src/Pn_code.py", line 24, in pin
    state=data['State']
KeyError: 'State'
OK. The error tells you that the dict stored in the data variable has no top-level key named "State". So the incoming JSON probably has no such top-level key either.
If the response you get is:
{"ResponseCode":0,"ResponseMessage":"OK","ResponseDateTime":"9/3/2016 2:41:25 PM GMT","Data":[{"Pincode":"560103","Address":"nagar","City":"Banalore","State":"nataka","Country":"India"}]}
then you cannot get "State" by using:
data["State"]
you have to do it using:
data["Data"][0]["State"]
and the rest:
data["Data"][0]["City"]
data["Data"][0]["Country"]
Why this way? Because the keys are nested. The first key is "Data": data["Data"] gives you a list, and because it is a one-element list, you take its first item with data["Data"][0]. That item is a dict in which you can find State, City and Country.
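The nested lookup described above can be sketched with the json module alone. The raw string below is a shortened copy of the sample response quoted in this answer, and first_record() is a hypothetical helper added for illustration; it returns None when the "Data" list is missing or empty, which is an assumption about how a failed lookup might look and is not documented behaviour of the API.

```python
import json

# Shortened copy of the sample response shown above.
raw = ('{"ResponseCode":0,"ResponseMessage":"OK",'
       '"Data":[{"Pincode":"560103","City":"Banalore",'
       '"State":"nataka","Country":"India"}]}')


def first_record(payload):
    """Return the first dict in payload["Data"], or None if there is none."""
    records = payload.get("Data") or []
    return records[0] if records else None


data = json.loads(raw)
record = first_record(data)
if record is not None:
    print("State:", record["State"])
    print("City:", record["City"])
    print("Country:", record["Country"])
```

Checking for None before indexing avoids a second KeyError or IndexError when a pincode returns no results.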

Not clear on why my function is returning none

I have very limited coding background except for some Ruby, so if there's a better way of doing this, please let me know!
Essentially I have a .txt file full of words. I want to import the .txt file and turn it into a list. Then, I want to take the first item in the list, assign it to a variable, and use that variable in an external request that sends off to get the definition of the word. The definition is returned, and tucked into a different .txt file. Once that's done, I want the code to grab the next item in the list and do it all again until the list is exhausted.
Below is my code in progress to give an idea of where I'm at. I'm still trying to figure out how to iterate through the list correctly, and I'm having a hard time interpreting the documentation.
Sorry in advance if this was already asked! I searched, but couldn't find anything that specifically answered my issue.
from __future__ import print_function
import requests
import urllib
from bs4 import BeautifulSoup
def get_definition(x):
    url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={0}'.format(x)
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    return soup.find('pre', text=True)[0]
lines = []
with open('vocab.txt') as f:
    lines = f.readlines()
lines = [line.strip() for line in lines]
definitions = []
for line in lines:
    definitions.append(get_definition(line))
out_str = '\n'.join(definitions)
with open('definitions.txt', 'w') as f:
    f.write(out_str)
the problem I'm having is
Traceback (most recent call last):
  File "WIP.py", line 20, in <module>
    definitions.append(get_definition(line))
  File "WIP.py", line 11, in get_definition
    return soup.find('pre', text=True)[0]
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 958, in __getitem__
    return self.attrs[key]
KeyError: 0
I understand that soup.find('pre', text=True) is returning None, but not why or how to fix it.
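Your problem is that find() returns a single result, not a list. The result is a Tag, and indexing a Tag with [] looks up one of its HTML attributes, so it tries to find the attribute key 0, which does not exist.
Just remove the [0] and that error goes away. If you need the element's text as a string (for example, to join the definitions later), call .get_text() on the result.
Also, soup.find(...) is not returning None here; it is returning a match. If it were returning None, you would get this error instead:
TypeError: 'NoneType' object has no attribute '__getitem__'
See the Beautiful Soup documentation for find().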

Accessing variable outside class using inheritance

I am trying to inherit a variable from base class but the interpreter throws an error.
Here is my code:
import urllib.request
class LibAccess(object):
    def __init__(self,url):
        self.url = url
    def url_lib(self):
        self.urllib_data = urllib.request.urlopen(self.url).read()
        return self.urllib_data
class Spidering(LibAccess):
    def category1(self):
        print (self.urllib_data)
scrap = Spidering("http://jabong.com")
scrap.category1()
This is the output:
Traceback (most recent call last):
  File "variable_concat.py", line 16, in <module>
    scrap.category1()
  File "variable_concat.py", line 12, in category1
    print (self.urllib_data)
AttributeError: 'Spidering' object has no attribute 'urllib_data'
What is the problem with the code?
You need to define self.urllib_data before accessing it. In your code it is only assigned inside url_lib(), which is never called. The simplest way is to create it during initialization, e.g.
class LibAccess(object):
    def __init__(self,url):
        self.url = url
        self.urllib_data = None
That way you can be sure it exists every time you try to access it. From your code I take it that you do not want to fetch the actual data during initialization. Alternatively, you could call self.url_lib() from __init__(..) to read the data for the first time. Updating it later would work the same way as before.
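The underlying point is that inheritance gives Spidering the methods of LibAccess, but an instance attribute such as urllib_data only exists once some method has assigned it. Here is a small sketch of that behaviour, keeping the original class layout. The placeholder bytes are an assumption standing in for the real download, so the example runs without network access.

```python
class LibAccess(object):
    def __init__(self, url):
        self.url = url

    def url_lib(self):
        # Placeholder for urllib.request.urlopen(self.url).read(),
        # so the sketch runs without network access.
        self.urllib_data = b'<html>placeholder</html>'
        return self.urllib_data


class Spidering(LibAccess):
    def category1(self):
        return self.urllib_data


scrap = Spidering('http://jabong.com')
before = hasattr(scrap, 'urllib_data')  # attribute has not been assigned yet
scrap.url_lib()                         # this call creates the attribute
after = hasattr(scrap, 'urllib_data')
print(before, after)
```

Calling scrap.category1() before scrap.url_lib() would raise the AttributeError from the question; after it, the inherited attribute is available as expected.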
