I have this code below, and I am trying to get 'Oswestry, England' as the result.
label = soup.findAll('span',{'class':"ProfileHeaderCard-locationText"})
print(label)
But, it doesn't give me a value.
Here is what the HTML looks like:
<span class="ProfileHeaderCard-locationText u-dir" dir="ltr">
<a data-place-id="5b756a1991aa8648" href="/search?q=place%3A5b756a1991aa8648">Oswestry, England</a>
</span>
When I print label the result is the HTML code I posted above.
Here is my full code:
import requests as req
from bs4 import BeautifulSoup

usernames = #list of usernames
location_list = []

for x in usernames:
    url = "https://twitter.com/" + x
    try:
        html = req.get(url)
    except Exception as e:
        print("Failed to")
        continue
    soup = BeautifulSoup(html.text, 'html.parser')
    try:
        label = soup.find('span', {'class': "ProfileHeaderCard-locationText"})
        label_formatted = label.string.lstrip()
        label_formatted = label_formatted.rstrip()
        if label_formatted != "":
            location_list.append(label_formatted)
            print(x + ' : ' + label_formatted)
        else:
            print('Not found')
    except:
        print('Not found')
You should call find, not find_all to get a single element. Then use the .text attribute to get the text content.
label = soup.find('span',{'class':"ProfileHeaderCard-locationText"})
print(label.text)
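As a quick check against the snippet from the question (a minimal sketch: the HTML fragment is pasted inline rather than fetched from Twitter):

```python
from bs4 import BeautifulSoup

# the HTML fragment from the question, inlined so the example runs standalone
html = '''<span class="ProfileHeaderCard-locationText u-dir" dir="ltr">
<a data-place-id="5b756a1991aa8648" href="/search?q=place%3A5b756a1991aa8648">Oswestry, England</a>
</span>'''

soup = BeautifulSoup(html, 'html.parser')
label = soup.find('span', {'class': "ProfileHeaderCard-locationText"})
# .text gathers the text of all descendants; strip() removes the surrounding newlines
print(label.text.strip())  # Oswestry, England
```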
For anyone having the same problem, I was able to get the inner text from the HTML by just doing this:
label2 = soup.findAll('span',{"class":"ProfileHeaderCard-locationText"})[0].get_text()
It seems that you were searching for a span tag whose class attribute exactly matches your query class. As the span has two classes, your test failed and no results were returned.
Using CSS selectors, you could write your solution as:
from bs4 import BeautifulSoup as BS
soup = BS('''<span class="ProfileHeaderCard-locationText u-dir">.....</span>''', 'html.parser')
soup.select('span.ProfileHeaderCard-locationText')
returns span tags that contain your prescribed class.
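For instance, a minimal sketch using the markup from the question:

```python
from bs4 import BeautifulSoup

# the span carries two classes; select() matches on any one of them
soup = BeautifulSoup(
    '<span class="ProfileHeaderCard-locationText u-dir">Oswestry, England</span>',
    'html.parser')

matches = soup.select('span.ProfileHeaderCard-locationText')
print(matches[0].get_text())  # Oswestry, England
```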
I am somewhat new to Python and can't for the life of me figure out why the following code isn’t pulling the element I am trying to get.
Here is my code:
for player in all_players:
    player_first, player_last = player.split()
    player_first = player_first.lower()
    player_last = player_last.lower()
    first_name_letters = player_first[:2]
    last_name_letters = player_last[:5]

    player_url_code = '/{}/{}{}01'.format(last_name_letters[0], last_name_letters, first_name_letters)
    player_url = 'https://www.basketball-reference.com/players' + player_url_code + '.html'
    print(player_url)  # test

    req = urlopen(player_url)
    soup = bs.BeautifulSoup(req, 'lxml')
    wrapper = soup.find('div', id='all_advanced_pbp')
    table = wrapper.find('div', class_='table_outer_container')
    for td in table.find_all('td'):
        player_pbp_data.append(td.get_text())
Currently returning:

--> for td in table.find_all('td'):
        player_pbp_data.append(td.get_text())  # if this works, would like to
AttributeError: 'NoneType' object has no attribute 'find_all'
Note: iterating through the children of the wrapper object returns <div class="table_outer_container"> as part of the tree.
Thanks!
Make sure that table contains the data you expect.
For example https://www.basketball-reference.com/players/a/abdulka01.html doesn't seem to contain a div with id='all_advanced_pbp'
Try to explicitly pass the html instead:
bs.BeautifulSoup(the_html, 'html.parser')
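A defensive sketch of the same lookup, guarding against pages where the div is missing (the HTML here is a hypothetical stand-in, inlined so the example runs standalone):

```python
from bs4 import BeautifulSoup

# a page that lacks the div we want, reproducing the failure mode
the_html = '<div id="something_else">no stats here</div>'
soup = BeautifulSoup(the_html, 'html.parser')

wrapper = soup.find('div', id='all_advanced_pbp')
if wrapper is None:
    # this is exactly what raises AttributeError in the original code:
    # calling .find(...) on the None that soup.find returned
    print('no advanced play-by-play table on this page')
else:
    table = wrapper.find('div', class_='table_outer_container')
    cells = [td.get_text() for td in table.find_all('td')]
```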
I tried to extract data from the URL you gave, but it did not return the full DOM. I then tried to access the page in a browser with JavaScript enabled and disabled; the site needs JavaScript to load some data, but pages like the players list do not. The simple way to get dynamic data is to use Selenium.
This is my test code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

player_pbp_data = []

def get_list(t="a"):
    with requests.Session() as se:
        url = "https://www.basketball-reference.com/players/{}/".format(t)
        req = se.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        with open("a.html", "wb") as f:
            f.write(req.text.encode())
        table = soup.find("div", class_="table_wrapper setup_long long")
        players = {player.a.text: "https://www.basketball-reference.com" + player.a["href"]
                   for player in table.find_all("th", class_="left ")}

def get_each_player(player_url="https://www.basketball-reference.com/players/a/abdulta01.html"):
    with webdriver.Chrome() as ph:
        ph.get(player_url)
        text = ph.page_source
    '''
    with requests.Session() as se:
        text = se.get(player_url).text
    '''
    soup = BeautifulSoup(text, 'lxml')
    try:
        wrapper = soup.find('div', id='all_advanced_pbp')
        table = wrapper.find('div', class_='table_outer_container')
        for td in table.find_all('td'):
            player_pbp_data.append(td.get_text())
    except Exception as e:
        print("This page does not contain pbp")

get_each_player()
I am trying to parse some information that's in a var meta script block, and I am just a little confused about how to grab just the value for the "id".
My code is below
url = input("\n\nEnter URL: ")
print(Fore.MAGENTA + "\nSetting link . . .")

def printID():
    print("")
    session = requests.session()
    response = session.get(url)
    soup = bs(response.text, 'html.parser')
    form = soup.find('script', {'id': 'ProductJson-product-template'})
    scripts = soup.findAll('id')
    # get the id
    '''
    for scripts in form:
        data = soup.find_all()
        print data
    '''
    print(form)

printID()
And the output of this prints
<script id="ProductJson-product-template" type="application/json">
{"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}
</script>
Again, I just want to print just the value of the ID ("463448473639").
You can retrieve all the attributes using the following syntax:
form.attrs
and if you are looking for a specific one, it works like a dictionary:
form['id']
The full code is as below:
from bs4 import BeautifulSoup
import ast

html_doc = """<script id="ProductJson-product-template" type="application/json">
{"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}
</script>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find("script").attrs)
print(soup.find("script")['id'])
However, if you want the value of "id" from the inner text {"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}, you can parse it as below:
innerText = soup.find("script").getText()
print(innerText)
print(ast.literal_eval(innerText.strip()).get("id"))
It looks like you are going to want to do something like:
import json
id = json.loads(form.get_text())['id']
I haven't tested that, but if you want to get what is in between the script tags, I think that is the way to do it. See the get_text documentation.
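Putting that together with the script tag from the question (inlined here so the sketch runs standalone):

```python
import json
from bs4 import BeautifulSoup

# the script tag from the question, pasted inline
html = '''<script id="ProductJson-product-template" type="application/json">
{"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}
</script>'''

soup = BeautifulSoup(html, 'html.parser')
form = soup.find('script', {'id': 'ProductJson-product-template'})

# the tag body is declared as application/json, so json.loads is the natural parser
data = json.loads(form.get_text())
print(data['id'])  # 463448473639
```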
I'm trying to scrape information off the Oxford dictionary. The problem is that the same class name, "form-groups", appears in several entries.
I only want to scrape the "form-groups" block above entry 1. For the word "acclimatize", my code works.
But for the word "peculiar", it scraped the "form-groups" block under entry 2, which is not what I want. I only want the "form-groups" block above entry 1.
So basically:
If "form-groups" above entry 1 doesn't exist, print "none", but don't scrape the "form-groups" blocks in other entries.
Here's my code:
from bs4 import BeautifulSoup
import urllib.request
import requests
import time

words = ["peculiar"]
source = "https://en.oxforddictionaries.com/definition/"

for word in words:
    try:
        with urllib.request.urlopen(source + word) as url:
            s = url.read()
        soup = BeautifulSoup(s, "lxml")
        try:
            form_groups = soup.find('span', {'class': 'form-groups'}).text
            y = form_groups
        except:
            y = "no form_groups"
        print(word + "#" + y)
        time.sleep(2)
    except:
        print("No result for " + word)
        time.sleep(2)
Any input is much appreciated! Thank you very much!
The answer is embedded in your question. You are scanning the whole page for spans of class form-groups, but you are actually interested in the hierarchy of the dictionary article: you only want spans of that class when they are direct children of a section of class gramb, and not lower down in the tree.
section_grambs = soup.find_all('section', {'class': 'gramb'})
for section_gramb in section_grambs:
    # default, so a section with no matching child reports "no form groups"
    y = "no form groups"
    for child in section_gramb.children:
        if child.name == "span" and "form-groups" in child.attrs.get("class", []):
            y = child.text
            break
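Another way to express the same hierarchy constraint is recursive=False, which limits find() to direct children of a tag (a sketch with hypothetical markup standing in for the dictionary page):

```python
from bs4 import BeautifulSoup

# two hypothetical entries: only the first has a form-groups span as a direct child
html = '''<section class="gramb"><span class="form-groups">Word forms</span></section>
<section class="gramb"><div><span class="form-groups">nested, ignored</span></div></section>'''

soup = BeautifulSoup(html, 'html.parser')
results = []
for section in soup.find_all('section', class_='gramb'):
    # recursive=False searches only the section's direct children
    span = section.find('span', class_='form-groups', recursive=False)
    results.append(span.text if span else 'no form groups')

print(results)
```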
I have written code to extract the url and title of a book using BeautifulSoup from a page.
But it is not extracting the name of the book Astounding Stories of Super-Science April 1930 between > and </a> tags.
How can I extract the name of the book?
I have tried the find_next method recommended in another question, but I get an AttributeError with it.
HTML:
<li>
<a class="extiw" href="//www.gutenberg.org/ebooks/29390" title="ebook:29390">Astounding Stories of Super-Science April 1930</a>
<a class="image" href="/wiki/File:BookIcon.png"><img alt="BookIcon.png" height="16" src="//www.gutenberg.org/w/images/9/92/BookIcon.png" width="16"/></a>
(English)
</li>
Code below:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL, verify=False)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a['title']
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls(filename)
You should use the text attribute of the element. The following works for me:
def make_soup(BASE_URL):
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def extract_text_urls(html):
    soup = make_soup(BASE_URL)
    for li in soup.findAll('li'):
        try:
            try:
                print li.a['href'], li.a.text
                print "\n"
            except KeyError:
                pass
        except TypeError:
            pass

extract_text_urls('http://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)')
I get the following output for the element in question
//www.gutenberg.org/ebooks/29390 Astounding Stories of Super-Science April 1930
According to the BeautifulSoup documentation, the .string property should accomplish what you are trying to do, by editing your original listing this way:

# ...
try:
    print li.a['href'], li.a['title']
    print "\n"
    print li.a.string
except KeyError:
    pass
# ...
You probably want to surround it with something like
if "extiw" in li.a.get('class', []):
    print li.a.string
since, in your example, only the anchors of class extiw contain a book title. (Note that class is a multi-valued attribute, so BeautifulSoup returns it as a list rather than a string.)
Thanks @wilbur for pointing out the optimal solution.
I did not see how you can extract the text within the tag. I would do something like this:
from bs4 import BeautifulSoup as bs
from urllib2 import urlopen as uo

soup = bs(uo(html))
for li in soup.findAll('li'):
    a = li.find('a')
    book_title = a.contents[0]
    print book_title
To get just the text that is not inside any tags, use the get_text() method. It is in the documentation here.
I can't test it because I don't know the URL of the page you are trying to scrape, but you can probably just do it with the li tag, since there doesn't seem to be any other text.
Try replacing this:

for li in soup.findAll('li'):
    try:
        try:
            print li.a['href'], li.a['title']
            print "\n"
        except KeyError:
            pass
    except TypeError:
        pass

with this:

for li in soup.findAll('li'):
    try:
        print(li.get_text())
        print("\n")
    except TypeError:
        pass
I am looking to identify the urls that request external resources in html files.
I currently use the src attribute in the img and script tags, and the href attribute in the link tag (to identify CSS).
Are there other tags that I should be examining to identify other resources?
For reference, my code in Python is currently:
html = read_in_file(file)
soup = BeautifulSoup(html)

image_src = [x['src'] for x in soup.findAll('img')]
css_link = [x['href'] for x in soup.findAll('link')]

script_src = []  # script tags often lack a 'src' attribute, hence the try/except
for x in soup.findAll('script'):
    try:
        script_src.append(x['src'])
    except KeyError:
        pass
I updated my code to capture what seem to be the most common resources in HTML code. Obviously this doesn't look at resources requested from CSS or JavaScript. If I am missing tags, please comment.
from bs4 import BeautifulSoup

def find_list_resources(tag, attribute, soup):
    # collect the given attribute from every matching tag, skipping tags without it
    resources = []
    for x in soup.findAll(tag):
        try:
            resources.append(x[attribute])
        except KeyError:
            pass
    return resources

html = read_in_file(file)
soup = BeautifulSoup(html)

image_src = find_list_resources('img', "src", soup)
script_src = find_list_resources('script', "src", soup)
css_link = find_list_resources("link", "href", soup)
video_src = find_list_resources("video", "src", soup)
audio_src = find_list_resources("audio", "src", soup)
iframe_src = find_list_resources("iframe", "src", soup)
embed_src = find_list_resources("embed", "src", soup)
object_data = find_list_resources("object", "data", soup)
source_src = find_list_resources("source", "src", soup)
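As a quick sanity check of the helper on a small hypothetical fragment (the helper is redefined here so the sketch runs standalone):

```python
from bs4 import BeautifulSoup

def find_list_resources(tag, attribute, soup):
    # collect the given attribute from every matching tag, skipping tags without it
    found = []
    for x in soup.findAll(tag):
        try:
            found.append(x[attribute])
        except KeyError:
            pass
    return found

# hypothetical page with one external image, one external script, and one inline script
html = '<img src="logo.png"><script src="app.js"></script><script>var x = 1;</script>'
soup = BeautifulSoup(html, 'html.parser')

print(find_list_resources('img', 'src', soup))     # ['logo.png']
print(find_list_resources('script', 'src', soup))  # ['app.js'] - inline script skipped
```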