I have a weird error, and I will try to simplify my problem.
I have a simple function that scrapes a URL with Beautiful Soup and returns a list.
Then I pickle the list to a file, and I call sys.setrecursionlimit(10000) to avoid a RecursionError. Up to that point, everything works.
But when I try to unpickle my list, I get this error:
Traceback (most recent call last):
File ".\scrap_index.py", line 86, in <module>
data_file = pickle.load(data)
TypeError: __new__() missing 1 required positional argument: 'name'
Here is my function:
import urllib.request
from bs4 import BeautifulSoup

def scrap_function(url):
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "html5lib")
    return [soup]
For testing, I've tried different URLs.
With this URL, everything is fine:
url_ok = 'https://www.boursorama.com/bourse/'
But with this one, I get the TypeError:
url_not_ok = 'https://www.boursorama.com/bourse/actions'
And the test code:
import pickle
import sys

sys.setrecursionlimit(10000)

scrap_list = scrap_function(url_not_ok)

with open('test_saving.pkl', 'wb') as data:
    pickle.dump(scrap_list, data, protocol=2)

with open('test_saving.pkl', 'rb') as data:
    data_file = pickle.load(data)

print(data_file)
This states that if a class takes extra arguments in its __new__ constructor, pickle fails to serialize its instances by default.
This could cause the problem here in BeautifulSoup:
class NavigableString(unicode, PageElement):
    def __new__(cls, value):
This answer states the same. As a solution, do not store the whole object; store only the source code of the page instead, as mentioned here.
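For example, a minimal sketch of that workaround, pickling the raw HTML and rebuilding the soup after unpickling (no recursion limit tweak needed):

import pickle
import urllib.request
from bs4 import BeautifulSoup

url_not_ok = 'https://www.boursorama.com/bourse/actions'
page_source = urllib.request.urlopen(url_not_ok).read()

# Pickle the page source (bytes) instead of the BeautifulSoup object.
with open('test_saving.pkl', 'wb') as data:
    pickle.dump(page_source, data, protocol=2)

with open('test_saving.pkl', 'rb') as data:
    page_source = pickle.load(data)

# Re-parse after loading.
soup = BeautifulSoup(page_source, "html5lib")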
I am learning how to scrape web information. Below is a snippet of the actual code solution plus output from DataCamp.
On DataCamp this works perfectly fine, but when I try to run it in Spyder (on my own MacBook), it doesn't work.
This is because on DataCamp the URL has already been pre-loaded into a variable named 'response', whereas in Spyder the URL needs to be defined again.
So I first defined the response variable as response = requests.get('https://www.datacamp.com/courses/all') so that the code points to DataCamp's website.
My code looks like:
from scrapy.selector import Selector
import requests
response = requests.get('https://www.datacamp.com/courses/all')
this_url = response.url
this_title = response.xpath('/html/head/title/text()').extract_first()
print_url_title( this_url, this_title )
When I run this in Spyder, I get this error message:
Traceback (most recent call last):
File "<ipython-input-30-6a8340fd3a71>", line 11, in <module>
this_title = response.xpath('/html/head/title/text()').extract_first()
AttributeError: 'Response' object has no attribute 'xpath'
Could someone please guide me? I would really like to know how to get this code working in Spyder. Thank you very much.
The value returned by requests.get('https://www.datacamp.com/courses/all') is a Response object, and this object has no attribute xpath, hence the error: AttributeError: 'Response' object has no attribute 'xpath'.
I assume response in your tutorial source was probably assigned to another object (most likely the object returned by etree.HTML) and not to the value returned by requests.get(url).
You can, however, do this:

import requests
from lxml import etree  # import etree

response = requests.get('https://www.datacamp.com/courses/all')  # get the Response object
tree = etree.HTML(response.text)  # parse the page's source from the Response object
result = tree.xpath('/html/head/title/text()')  # extract the value

print(response.url)  # url
print(result)  # findings
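Alternatively, since the question already imports scrapy's Selector, here is a short sketch of the same extraction that wraps the response body in a Selector, which does have an .xpath() method:

import requests
from scrapy.selector import Selector

response = requests.get('https://www.datacamp.com/courses/all')

# Wrap the page source in a Selector to get the tutorial's xpath API back.
sel = Selector(text=response.text)
this_url = response.url
this_title = sel.xpath('/html/head/title/text()').extract_first()

print(this_url)
print(this_title)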
So I'm getting a weird error/traceback while trying to use BeautifulSoup. If you recall from my previous questions, I was having trouble with BioPython. While those troubles are more or less on the verge of being solved, I have a new problem: the references that are scraped from PMC are not always pertinent to the plant-disease pair. For example, the plant or the disease may occur in the references rather than in the body of the full text, rendering that result a false positive.
To get around this, one of the other interns working with us suggested that I use BeautifulSoup to parse the HTML from the PMC pages and check whether either the plant or the disease occurs after the text 'References'. While trying to do this, I got a 403 Forbidden error, and inferred from other answers on Stack Overflow and GitHub that NCBI was somehow blocking urllib. The suggested workaround was to use FancyURLopener with a Mozilla user agent as an intermediary. However, I keep getting this weird traceback, and I can't, for the life of me, figure out what's wrong with the code. Here's the traceback:
scraperscript_python.py:54: DeprecationWarning: AppURLopener style of invoking requests is deprecated. Use newer urlopen functions/methods
opener = AppURLopener()
Traceback (most recent call last):
File "scraperscript_python.py", line 58, in <module>
pmc_refsoup = soup(page_html, "html.parser")
File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 275, in __init__
elif len(markup) <= 256 and (
TypeError: object of type 'function' has no len()
Here are the lines leading up to and including line 58:
# First print statement
for plant, disease in plant_disease_list:
    search_query = generate_search_query(plant, disease)
    handle1 = Entrez.esearch(db="pmc", term=search_query, retmax="10")
    record1 = Entrez.read(handle1)
    pubmed_ids = record1.get("IdList")
    if len(pubmed_ids) == 0:
        print("{}, {}, None".format(plant, disease))
    # Else statement: initialize BeautifulSoup for parsing fulltext to avoid false positives
    else:
        for pubmed_id in pubmed_ids:
            handle2 = Entrez.esummary(db="pmc", id=pubmed_id)
            records = Entrez.read(handle2)
            pmc_main = pubmed_id
            pmcid_string = str("http://ncbi.nlm.nih.gov/pmc/articles/PMC")
            append_pmcid = ("").join((pmcid_string + pmc_main))
            my_url = append_pmcid

            class AppURLopener(urllib.request.FancyURLopener):
                version = "Mozilla/5.0"

            opener = AppURLopener()
            uClient_response = opener.open(my_url)
            page_html = uClient_response.read
            uClient_response.close()
            pmc_refsoup = soup(page_html, "html.parser")
I feel like there's something obvious I'm missing, but I can't figure it out, and it's driving me bananas.
In the line
page_html = uClient_response.read
you are missing the parentheses needed to call the read method, which means you assign the method object itself to page_html and later pass it to soup as a parameter, leading to the TypeError.
The correct way to call read is:
page_html = uClient_response.read()
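Separately, since the traceback also warns that the FancyURLopener style is deprecated, the same fetch can be sketched with urllib.request.Request, which keeps the custom User-Agent. This assumes soup is the bs4 BeautifulSoup alias used in the question, and my_url stands for the PMC URL built in the loop:

import urllib.request
from bs4 import BeautifulSoup as soup

# my_url is built from the PMC ID in the original loop; a placeholder is shown here.
my_url = "http://ncbi.nlm.nih.gov/pmc/articles/PMC1234567"

# Send the custom User-Agent via Request instead of the deprecated FancyURLopener.
req = urllib.request.Request(my_url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as uClient_response:
    page_html = uClient_response.read()  # note the parentheses

pmc_refsoup = soup(page_html, "html.parser")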
I'm a complete newbie to Python when it comes to scraping web data using a class, so apologies in advance for any serious mistake. I've written a script to parse text from a tag on the Wikipedia website. I tried to write the code as accurately as I could, but for some reason, when I execute it, it throws an error. The code and the error I'm having are given below for your kind consideration.
The script:
import requests
from lxml.html import fromstring

class TextParser(object):
    def __init__(self):
        self.link = 'https://en.wikipedia.org/wiki/Main_Page'
        self.storage = None

    def fetch_url(self):
        self.storage = requests.get(self.link).text

    def get_text(self):
        root = fromstring(self.storage)
        for post in root.cssselect('a'):
            print(post.text)

item = TextParser()
item.get_text()
The error:
Traceback (most recent call last):
File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\testmatch.py", line 38, in <module>
item.get_text()
File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\testmatch.py", line 33, in get_text
root = fromstring(self.storage)
File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\lib\site-packages\lxml\html\__init__.py", line 875, in fromstring
is_full_html = _looks_like_full_html_unicode(html)
TypeError: expected string or bytes-like object
You're executing the following two lines
item = TextParser()
item.get_text()
When you initialize TextParser, self.storage is equal to None. When you execute get_text(), it is still None, and that is why you get the error.
However, if you change it to the following, self.storage gets populated with a string rather than remaining None:
item = TextParser()
item.fetch_url()
item.get_text()
If you want to call get_text without calling fetch_url first, you can do it this way:
def get_text(self):
    self.fetch_url()
    root = fromstring(self.storage)
    for post in root.cssselect('a'):
        print(post.text)
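With that change, the original two-line usage works as written:

item = TextParser()
item.get_text()  # fetch_url() now runs internally before parsing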
Motivation
Motivated by this problem: the OP was using urlopen() and accidentally passed the sys.argv list instead of a string as the URL. This error message was thrown:
AttributeError: 'list' object has no attribute 'timeout'
Because of the way urlopen is written, the error message and the traceback are not very informative and may be difficult to understand, especially for a Python newcomer:
Traceback (most recent call last):
File "test.py", line 15, in <module>
get_category_links(sys.argv)
File "test.py", line 10, in get_category_links
response = urlopen(url)
File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 420, in open
req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
Problem
Here is the shortened code I'm working with:
try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

import sys

def get_category_links(url):
    response = urlopen(url)
    # do smth with response
    print(response)

get_category_links(sys.argv)
I'm trying to work out whether this kind of error can be caught statically, either with smart IDEs like PyCharm, with static code analysis tools like flake8 or pylint, or with language features like type annotations.
But I'm failing to detect the problem:
- it is probably too specific for flake8 and pylint to catch; they don't warn about it
- PyCharm does not warn about sys.argv being passed into urlopen, even though, if you "jump to source" of sys.argv, it is defined as:
  argv = []  # real value of type <class 'list'> skipped
- if I annotate the function parameter as a string and pass sys.argv, there are no warnings either:
def get_category_links(url: str) -> None:
    response = urlopen(url)
    # do smth with response

get_category_links(sys.argv)
Question
Is it possible to catch this problem statically (without actually executing the code)?
Instead of keeping it editor-specific, you can use mypy to analyze your code. This way it will run in all dev environments, instead of just for those who use PyCharm.
from urllib.request import urlopen
import sys

def get_category_links(url: str) -> None:
    response = urlopen(url)
    # do smth with response

get_category_links(sys.argv)
response = urlopen(sys.argv)
The issues pointed out by mypy for the above code:
error: Argument 1 to "get_category_links" has incompatible type List[str]; expected "str"
error: Argument 1 to "urlopen" has incompatible type List[str]; expected "Union[str, Request]"
Mypy can guess the type of sys.argv here because of its definition in its stub file. Right now, some standard library modules are still missing from typeshed, though, so you will have to either contribute them or ignore the related errors until they get added :-).
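For reference, a single missing-stub error can be silenced in-line with a # type: ignore comment until typeshed catches up (the module name below is hypothetical):

import some_unstubbed_module  # type: ignore  # no typeshed stubs for this module yet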
When to run mypy?
To catch such errors, run mypy on the annotated files together with your tests in your CI tool. Running it on all files in the project may take some time; for a small project, that is your choice.
Add a pre-commit hook that runs mypy on staged files and points out issues right away (this could be a little annoying for the dev if it takes a while).
Firstly, you need to check whether the url is a string or not, and if it is a string, catch the ValueError exception to handle an invalid URL:
import sys
from urllib2 import urlopen

def get_category_links(url):
    if type(url) != type(""):  # Check if url is a string or not
        print "Please give string url"
        return
    try:
        response = urlopen(url)
        # do smth with response
        print(response)
    except ValueError:  # If url is a string but invalid
        print "Bad URL"

get_category_links(sys.argv)
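For comparison, here is a minimal Python 3 sketch of the same guard, using isinstance rather than a type() comparison:

import sys
from urllib.request import urlopen

def get_category_links(url):
    if not isinstance(url, str):  # rejects sys.argv, which is a list
        print("Please give a string url")
        return
    try:
        response = urlopen(url)
        # do smth with response
        print(response)
    except ValueError:  # string, but not a valid URL
        print("Bad URL")

get_category_links(sys.argv)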
It looks like there is some error in it, but I have failed to find it.
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
from bs4 import BeautifulSoup
import re

print('https://v.qq.com/x/page/h03425k44l2.html\n'
      'https://v.qq.com/x/cover/dn7fdvf2q62wfka/m0345brcwdk.html\n'
      'http://v.qq.com/cover/2/2iqrhqekbtgwp1s.html?vid=c01350046ds')
web = input('Please enter the URL: ')
if re.search(r'vid=', web):
    patten = re.compile(r'vid=(.*)')
    vid = patten.findall(web)
    vid = vid[0]
else:
    newurl = web.split("/")[-1]
    vid = newurl.replace('.html', ' ')
# Extract the vid from the video page URL

getinfo = 'http://vv.video.qq.com/getinfo?vids{vid}&otype=xlm&defaultfmt=fhd'.format(vid=vid.strip())

def getpage(url):
    req = Request(url)
    user_agent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit'
    req.add_header('User-Agent', user_agent)
    try:
        response = urlopen(url)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code:', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason:', e.reason)
    html = response.read().decode('utf-8')
    return html
# Function that opens a web page

a = getpage(getinfo)
soup = BeautifulSoup(a, "html.parser")
for e1 in soup.find_all('url'):
    ippattent = re.compile(r"((?:(2[0-4]\d)|(25[0-5])|([01]\d\d?))\.){3}(?:(2[0-4]\d)|(255[0-5])|([01]?\d\d?))")
    if re.search(ippattent, e1.get_text()):
        ip = e1.get_text()
for e2 in soup.find_all('id'):
    idpattent = re.compile(r"\d{5}")
    if re.search(idpattent, e2.get_text()):
        id = e2.get_text()
filename = vid.strip() + '.p' + id[2:] + '.1.mp4'
# Find the ID and build the filename

getkey = ('http://vv.video.qq.com/getkey?format={id}&otype=xml&vt=150&vid{vid}'
          '&ran=0%2E9477521511726081&charge=0&filename={filename}&platform=11'
          ).format(id=id, vid=vid.strip(), filename=filename)
# Use the info from getinfo to build the getkey URL

b = getpage(getkey)
key = re.findall(r'<key>(.*)<\/key>', b)
videourl = ip + filename + '?' + 'vkey=' + key[0]
print('Video playback URL: ' + videourl)
# Done
I run it and get this:
Traceback (most recent call last):
File "C:\Users\DYZ_TOGA\Desktop\qq.py", line 46, in <module>
filename=vid.strip()+'.p'+id[2:]+'.1.mp4'
TypeError: 'builtin_function_or_method' object is not subscriptable
What should I do? I don't know how to change my code to correct it.
The root of your problem is here:
if re.search(idpattent, e2.get_text()):
    id = e2.get_text()
If this is false, you never set id. And that means id is the built-in function of that name, which gets the unique ID of any object. Since it's a function, not the string you expect, you can't do this:
id[2:]
Hence the error you are getting.
My suggestions are:
- Use a different variable name; you would have gotten an error about it not being defined in this case, which would have made the problem easier to solve.
- When you don't find the ID, don't continue the script; it won't work anyway. If you expected to find it and are not sure why that's not happening, that's a different question you should ask separately.
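A minimal sketch combining both suggestions, assuming the surrounding variables (soup, vid) from the question's script; the name article_id is illustrative:

article_id = None
for e2 in soup.find_all('id'):
    idpattent = re.compile(r"\d{5}")
    if re.search(idpattent, e2.get_text()):
        article_id = e2.get_text()

if article_id is None:
    # Stop early: the filename cannot be built without an ID.
    raise SystemExit('No ID found in the getinfo response')

filename = vid.strip() + '.p' + article_id[2:] + '.1.mp4'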
id is a built-in function in Python, and it seems you are using the same name to store a variable. It is a bad habit to use a built-in name as a variable name; use a different name instead.
if re.search(idpattent, e2.get_text()):
    id = e2.get_text()
filename = vid.strip() + '.p' + id[2:] + '.1.mp4'
If the above if is not true, id is never set to a string value. By default, id is a built-in function in Python, so you cannot do id[2:]: Python expects a call like id(...).