Why can't I get the string while scraping with Python?

Here is my code. I want to scrape a list of words from a website, but when I call .string on the items I get an error.
import requests
from bs4 import BeautifulSoup

url = "https://www.merriam-webster.com/browse/thesaurus/a"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
entry_view = soup.find_all('div', {'class': 'entries'})
view = entry_view[0]
list = view.ul
for m in list:
    for x in m:
        title = x.string
        print(title)
What I want is a printed list of the words from the website, but what I get is this error:
Traceback (most recent call last):
File "/home/vidu/PycharmProjects/untitled/hello.py", line 14, in <module>
title = x.string
AttributeError: 'str' object has no attribute 'string'
Error in sys.excepthook:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
from apport.fileutils import likely_packaged, get_recent_crashes
File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
from apport.report import Report
File "/usr/lib/python3/dist-packages/apport/report.py", line 30, in <module>
import apport.fileutils
File "/usr/lib/python3/dist-packages/apport/fileutils.py", line 23, in <module>
from apport.packaging_impl import impl as packaging
File "/usr/lib/python3/dist-packages/apport/packaging_impl.py", line 23, in <module>
import apt
File "/usr/lib/python3/dist-packages/apt/__init__.py", line 23, in <module>
import apt_pkg
ModuleNotFoundError: No module named 'apt_pkg'
Original exception was:
Traceback (most recent call last):
File "/home/vidu/PycharmProjects/untitled/hello.py", line 14, in <module>
title = x.string
AttributeError: 'str' object has no attribute 'string'

You can achieve what you want by using the following piece of code.
Code:
import requests
from bs4 import BeautifulSoup

url = "https://www.merriam-webster.com/browse/thesaurus/a"
html_source = requests.get(url).text
soup = BeautifulSoup(html_source, "html.parser")
entry_view = soup.find_all('div', {'class': 'entries'})
entries = []
for elem in entry_view:
    for e in elem.find_all('a'):
        entries.append(e.text)

# show the first and last five entries and the total count
print(entries[:5])
print(entries[-5:])
print(len(entries))
Output:
['A1', 'aback', 'abaft', 'abandon', 'abandoned']
['absorbing', 'absorption', 'abstainer', 'abstain from', 'abstemious']
100
In your code:
print(type(list))
<class 'bs4.element.Tag'>
print(type(m))
<class 'bs4.element.NavigableString'>
print(type(x))
<class 'str'>
So, as you can see, the variable x is already a plain string, so it makes no sense to use the bs4 .string attribute on it.
P.S.: you shouldn't use a variable name like list. It isn't a reserved keyword, but it shadows the built-in list type.

AttributeError: 'str' object has no attribute 'string'
This is telling you that the object is already a string. Try removing the .string call and it should work.
It also tells you that the string data type is named str, not string.
Another thing to take home from this is that you could convert with title = str(x), but since x is already a string in this case that would be redundant.
To quote Google:
Python has a built-in string class named "str" with many handy features (there is an older module named "string" which you should not use)
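For completeness, here is a minimal sketch of that fix applied to the original page, using the Tag.stripped_strings generator so there is no need to touch .string at all (this assumes the browse page still has a ul inside the entries div):

import requests
from bs4 import BeautifulSoup

url = "https://www.merriam-webster.com/browse/thesaurus/a"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

entries_div = soup.find('div', {'class': 'entries'})
words = entries_div.ul  # the first <ul> inside the entries div

# stripped_strings yields every text fragment with surrounding whitespace
# removed, so each item is already a plain str.
for title in words.stripped_strings:
    print(title)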

Related

I get an error message when using requests.get() on a Mac

I'm a Mac user and I'm trying to use the requests module get() here:
import requests

url = 'http://www.omdbapi.com/?t=star+wars&r=json'
response = requests.get(url)
dic = response.json()
for key in dic:
    print key, ':', dic[key]
But I get this:
import requests
File "/Library/Python/2.7/site-packages/requests/__init__.py", line 63, in <module>
from . import utils
File "/Library/Python/2.7/site-packages/requests/utils.py", line 24, in <module>
from ._internal_utils import to_native_string
File "/Library/Python/2.7/site-packages/requests/_internal_utils.py", line 11, in <module>
from .compat import is_py2, builtin_str, str
File "/Library/Python/2.7/site-packages/requests/compat.py", line 33, in <module>
import json
File "/Users/miguel/Desktop/json.py", line 4, in <module>
response = requests.get(url)
AttributeError: 'module' object has no attribute 'get'
I've reinstalled the requests library with pip, but that doesn't fix it.
Any idea? Thanks.
You've named your own script json.py which we can see in the last line of the traceback:
File "/Users/miguel/Desktop/json.py", line 4, in <module>
Don't do that, you're now masking the actual json module and that's going to do some damage to requests's ability to function. Equally, don't call a script requests.py. Just rename the file and it should fix the issue.
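If you are ever unsure which file a module was actually loaded from, printing its __file__ attribute makes the shadowing obvious. A quick diagnostic sketch (run it from a directory that does not contain the offending file, or after renaming it):

import json
import requests

# A path inside your own project (e.g. your Desktop) instead of the
# standard library or site-packages means a local file is shadowing
# the real module and should be renamed.
print(json.__file__)
print(requests.__file__)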

Get the number of Google search results in Python 3

Is there any way to get the number of Google search results in Python 3? I tried several approaches from SO but none of them work any more:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> def get_results(name):
...     re = requests.get('https://www.google.com/search', params={'q': name})
...     soup = BeautifulSoup(re.text, 'lxml')
...     response = soup.find('div', {'id': 'resultStats'})
...     return int(response.text.replace(',', '').split()[1])
...
>>> get_results('Leonardo DiCaprio')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in get_results
AttributeError: 'NoneType' object has no attribute 'text'
response in your get_results() function is None because the request to Google returned an error page, so the div you are looking for does not exist. You should check for a successful response status before trying to parse the results.
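A sketch of that check, assuming Google still serves a div with id resultStats (the markup changes often); the browser-like User-Agent header is an assumption added here, not something from the original question:

import requests
from bs4 import BeautifulSoup

def get_results(name):
    r = requests.get('https://www.google.com/search',
                     params={'q': name},
                     headers={'User-Agent': 'Mozilla/5.0'})  # assumed header; the default one is often blocked
    r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
    soup = BeautifulSoup(r.text, 'lxml')
    stats = soup.find('div', {'id': 'resultStats'})
    if stats is None:  # Google served a layout without that div
        return None
    return int(stats.text.replace(',', '').split()[1])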

AttributeError: 'module' object has no attribute 'get'

I have the following code to load JSON:
import json
import requests

r = requests.get('http://api.reddit.com/controversial?limit=5')
if r.status_code == 200:
    reddit_data = json.loads(r.content)
    print reddit_data['data']['children'][1]['data']
else:
    print "Error."
And I got this message:
arsh#arsh:~$ python q.py
Traceback (most recent call last):
File "q.py", line 1, in <module>
import json
File "/home/arsh/json.py", line 5, in <module>
reddit_data = json.loads(r.content)
AttributeError: 'module' object has no attribute 'loads'
You have a different file called json.py in your home directory:
File "/home/arsh/json.py", line 5, in <module>
This file is in the way; you did not import the standard-library version. Rename it to something else or delete it. You'll also have to remove the stale json.pyc file.
Note that requests response objects can already handle JSON responses for you:
import requests
r = requests.get('http://api.reddit.com/controversial?limit=5')
r.raise_for_status()
reddit_data = r.json()
print reddit_data['data']['children'][1]['data']
The Response.json() method handles decoding JSON for you, including detecting the correct characterset to use when decoding.
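If the endpoint ever returns a non-JSON body (for example an HTML error page), Response.json() raises a ValueError, so a small guard can help. A sketch, not part of the original answer:

import requests

r = requests.get('http://api.reddit.com/controversial?limit=5')
r.raise_for_status()
try:
    reddit_data = r.json()
except ValueError:  # the body was not valid JSON
    reddit_data = None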

How can I read headlines from the Times of India and place them into a text file?

I am using Python 2.7 and urllib2 for this, but I am getting an error saying that urllib2 has no attribute named urlopen. Please help me, thanks. Here is my code:
import urllib2
import re

pat = re.compile('target="_parent">(.*?)</a>')
url = 'http://timesofindia.indiatimes.com/home/headlines'
sock = urllib2.urlopen(url)
li = pat.findall(sock.read())
sock.close()
print li

f = open("headlines.txt", 'a+')
for i in range(len(li)):
    f.write(li[i] + "\n")
f.close()
Traceback:
Traceback (most recent call last):
File "C:/Users/Training/PycharmProjects/758702_Python_Program/ReadTOI/ReadTOI.py", line 1, in <module>
import urllib2
File "C:\Python27\lib\urllib2.py", line 111, in <module>
from urllib import (unwrap, unquote, splittype, splithost, quote,
File "C:\Users\Training\PycharmProjects\758702_Python_Program\urllib.py", line 4, in <module>
f=urllib.urlopen("http://www.python.org/")
AttributeError: 'module' object has no attribute 'urlopen'
File "C:\Users\Training\PycharmProjects\758702_Python_Program\urllib.py", line 4, in <module>
It looks like you have another file by the name urllib.py in your working directory. When you do import urllib, Python imports your local file, which does not contain urlopen(). Renaming your local file to something else or changing your working directory should solve this.
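Once the local urllib.py is renamed, the original script should import urllib2 normally. As a side note, the file-writing part is a little safer with a with block; a sketch under the same assumptions as the question (Python 2, the same regex and URL):

import urllib2
import re

pat = re.compile('target="_parent">(.*?)</a>')
sock = urllib2.urlopen('http://timesofindia.indiatimes.com/home/headlines')
headlines = pat.findall(sock.read())
sock.close()

# the with block closes the file even if a write fails
with open("headlines.txt", 'a+') as f:
    for line in headlines:
        f.write(line + "\n")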

Finding specific text using BeautifulSoup

I'm trying to grab all the winner categories from this page:
http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013
I've written this in Sublime:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013"
page = urllib2.urlopen(url)
soup_package = BeautifulSoup(page)
page.close()
# find everything in the div with class="bestOfItem". This works.
all_categories = soup_package.findAll("div",class_="bestOfItem")
# print(all_categories)
#this part breaks it:
soup = BeautifulSoup(all_categories)
winner = soup.a.string
print(winner)
When I run this in terminal, I get the following error:
Traceback (most recent call last):
File "winners.py", line 12, in <module>
soup = BeautifulSoup(all_categories)
File "build/bdist.macosx-10.9-intel/egg/bs4/__init__.py", line 193, in __init__
File "build/bdist.macosx-10.9-intel/egg/bs4/builder/_lxml.py", line 99, in prepare_markup
File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 249, in encodings
File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 304, in find_declared_encoding
TypeError: expected string or buffer
Anyone know what's happening there?
You are trying to create a new BeautifulSoup object from a list of elements.
soup = BeautifulSoup(all_categories)
There is absolutely no need to do this here; just loop over each match instead:
for match in all_categories:
    winner = match.a.string
    print(winner)
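One caveat worth adding (not part of the original answer): Tag.string returns None when an anchor contains nested markup, so .get_text() is the safer accessor in that case:

for match in all_categories:
    link = match.a
    if link is not None:  # some items may have no <a> tag
        print(link.get_text(strip=True))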
