Python 3.4: href with XPATH - python

Using lxml and requests, I am passing an XPath expression to retrieve the href attributes of a tags. Every time I run the simple code below I get the AttributeError shown after it.
import requests
from lxml import html
import csv
url = 'https://biz.yahoo.com/p/sum_conameu.html'
resp = requests.get(url)
tree = html.fromstring(resp.text)
update_tick = [td.text_content()
               for td in tree.xpath('''//tr[starts-with(normalize-space(.), "Industry")]
                                       /following-sibling::tr[position()>0]
                                       /td/a/@href''')]
print(update_tick)
AttributeError: 'str' object has no attribute 'text_content'

Passing an XPath attribute selector (.../@href) to the xpath() method makes it return the matched attribute values as strings. There is no need to call text_content() in this case:
update_tick = [td
               for td in tree.xpath('''//tr[starts-with(normalize-space(.), "Industry")]
                                       /following-sibling::tr[position()>0]
                                       /td/a/@href''')]
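To see the difference without hitting the network, here is a minimal sketch on made-up markup. The table contents and file names are assumptions for illustration, not the real Yahoo page; only the shape of the XPath comes from the question.

```python
from lxml import html

# Hypothetical markup standing in for the real page.
doc = html.fromstring("""
<table>
  <tr><td>Industry</td></tr>
  <tr><td><a href="1conameu.html">Basic Materials</a></td></tr>
  <tr><td><a href="2conameu.html">Conglomerates</a></td></tr>
</table>
""")

rows = '//tr[starts-with(normalize-space(.), "Industry")]/following-sibling::tr'

# Selecting the attribute yields plain strings (no text_content() available).
hrefs = doc.xpath(rows + '/td/a/@href')

# Selecting the element yields element objects, which do have text_content().
labels = [a.text_content() for a in doc.xpath(rows + '/td/a')]

print(hrefs)
print(labels)
```

So the choice is between ending the path at @href and using the strings directly, or ending it at a and calling element methods on each result.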

Related

Beautifulsoup html parser attribute crawling question

from bs4 import BeautifulSoup as bs4
import json
import requests
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
html = '''
<div class="side_sm">선물</div>
'''
json_data=soup.find('div',class_='side_sm').find('a').attrs('data-log-body')
print(json_data)
TypeError: 'dict' object is not callable
I want to get the value of the data-log-body attribute (a dictionary) of the a tag under the div tag with class side_sm.
I keep getting this error, so please give me some advice on how to deal with it.
attrs is a dict, not a method, so calling it with parentheses fails. To access a dict by key, use square brackets:
.attrs['data-log-body']
The Beautiful Soup documentation notes: "Use tag.get('attr') if you're not sure attr is defined, just as you would with a Python dictionary."
So I would recommend using:
.get('data-log-body')
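A small sketch of both forms follows. The markup and the JSON inside data-log-body are invented for illustration, since the question's HTML did not survive; the point is only how the two lookups behave.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup; the attribute value is made-up JSON.
html_doc = '''
<div class="side_sm">
  <a href="#" data-log-body='{"btn_name": "gift"}'>gift</a>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
link = soup.find('div', class_='side_sm').find('a')

raw = link.attrs['data-log-body']    # raises KeyError if the attribute is absent
same = link.get('data-log-body')     # returns None if the attribute is absent
missing = link.get('data-missing')   # no such attribute here, so this is None

# The attribute value is a string; parse it to get an actual dictionary.
data = json.loads(raw)
print(data['btn_name'])
```

If the attribute holds JSON, as it seems to in the question, json.loads is what turns the string into a dict.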

Extracting 'a' tags containing specific substring with Python's BeautifulSoup

Using BeautifulSoup, I would like to return only "a" tags containing "Company" and not "Sector" in their href string. Is there a way to use regex inside of re.compile() to return only Companies and not Sectors?
Code:
soup = soup.findAll('tr')[5].findAll('a')
print(soup)
Output
[<a class="example" href="../ref/index.htm">Example</a>,
Facebook,
Exxon,
Technology,
Oil & Gas]
Using this method:
import re
soup.findAll('a', re.compile("Company"))
Returns:
AttributeError: 'ResultSet' object has no attribute 'findAll'
But I would like it to return (without the Sectors):
[Facebook,
Exxon]
Using:
Urllib.request version: 3.5
BeautifulSoup version: 4.4.1
Pandas version: 0.17.1
Python 3
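Assigning soup = soup.findAll('tr')[5].findAll('a') overwrites the original soup variable, so the later call soup.findAll('a', re.compile("Company")) runs on a ResultSet instead of the parsed document. findAll returns a ResultSet, which is essentially a list of Tag objects and has no findAll method of its own. Note also that a bare second positional argument is matched against the CSS class, not the href, so the pattern has to be passed as href=. Try the following on the original parsed document to get all of the "Company" links instead.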
links = soup.findAll('tr')[5].findAll('a', href=re.compile("Company"))
To get the text contained in these tags:
companies = [link.text for link in links]
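Here is a self-contained sketch of that filter. The href values are guesses at what the page might contain (the question's output lost them), and find_all is used as the current spelling of findAll.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical row; the href values are assumptions for illustration.
html_doc = '''
<table><tr>
  <td><a class="example" href="../ref/index.htm">Example</a></td>
  <td><a href="?Company=FB">Facebook</a></td>
  <td><a href="?Company=XOM">Exxon</a></td>
  <td><a href="?Sector=Technology">Technology</a></td>
  <td><a href="?Sector=Energy">Oil &amp; Gas</a></td>
</tr></table>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Keep the parsed document in `soup`; store results under a different name.
row = soup.find_all('tr')[0]
links = row.find_all('a', href=re.compile('Company'))
companies = [link.text for link in links]
print(companies)
```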
Another approach is XPath, which supports and/not operations when querying by attributes in an XML or HTML document. Unfortunately, BeautifulSoup doesn't handle XPath itself, but lxml can:
from lxml.html import fromstring
import requests
r = requests.get("YourUrl")
tree = fromstring(r.text)
# get a elements with ?Company in the href, excluding ones with Sector
a_tags = tree.xpath("//a[contains(@href,'?Company') and not(contains(@href, 'Sector'))]")
You can use a CSS selector to get all the a tags whose href starts with ?Company. Quote the attribute value, since it contains a ? and current versions of Beautiful Soup reject it unquoted:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
a = soup.select('a[href^="?Company"]')
If you want them just from the sixth tr you can use nth-of-type:
soup.select('tr:nth-of-type(6) a[href^="?Company"]')
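As a rough sketch of the selector approach on invented markup (two rows instead of six, and assumed href values), both selectors can be tried like this:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table; href values are assumptions.
html_doc = '''
<table>
  <tr><td><a href="?Company=AAPL">Apple</a></td><td><a href="?Sector=Tech">Technology</a></td></tr>
  <tr><td><a href="?Company=FB">Facebook</a></td><td><a href="../ref/index.htm">Example</a></td></tr>
</table>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Every a tag whose href starts with ?Company, anywhere in the document.
all_companies = [a.get_text() for a in soup.select('a[href^="?Company"]')]

# The same filter restricted to the second tr among its siblings.
second_row = [a.get_text() for a in soup.select('tr:nth-of-type(2) a[href^="?Company"]')]

print(all_companies)
print(second_row)
```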
Thanks for the above answers @Padriac Cunningham and @Wyatt I! This is a less elegant solution I came up with:
import re
for i in range(1, len(soup)):
    if re.search("Company", str(soup[i])):
        print(soup[i])

Beautiful Soup 'ResultSet' object has no attribute 'text'

from bs4 import BeautifulSoup
import urllib.request
import win_unicode_console
win_unicode_console.enable()
link = ('https://pietroalbini.io/')
req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
url = urllib.request.urlopen(req).read()
soup = BeautifulSoup(url, "html.parser")
body = soup.find_all('div', {"class":"wrapper"})
print(body.text)
Hi, I have a problem with Beautiful Soup. If I run this code without ".text" at the end it shows me a list of divs, but if I add ".text" at the end I get this error:
Traceback (most recent call last):
  File "script.py", line 15, in <module>
    print(body.text)
AttributeError: 'ResultSet' object has no attribute 'text'
find_all returns a ResultSet object, which you can iterate over using a for loop. What you can do is:
for wrapper in soup.find_all('div', {"class": "wrapper"}):
    print(wrapper.text)
If you type:
print(type(body))
you'll see body is <class 'bs4.element.ResultSet'>. It holds all the elements that match the class. You can either iterate over them:
for div in body:
    print(div.text)
Or, if you know there is only one such div, you can use find instead:
div = soup.find('div', {"class":"wrapper"})
div.text
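A quick sketch on inline markup (invented here, not the page from the question) shows both options side by side:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two matching divs.
html_doc = '<div class="wrapper">first</div><div class="wrapper">second</div>'
soup = BeautifulSoup(html_doc, 'html.parser')

# find_all returns a list-like ResultSet; read .text from each item.
matches = soup.find_all('div', {'class': 'wrapper'})
texts = [div.text for div in matches]

# find returns the first matching Tag, or None when nothing matches.
first = soup.find('div', {'class': 'wrapper'})
absent = soup.find('div', {'class': 'no-such-class'})

print(texts)
print(first.text)
```

Since find can return None, it is worth checking the result before reading .text from it.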
I probably should have posted this as an answer, so here it is, almost verbatim from the comments.
Your code should be the following:
for div in body:
    print(div.text)  # Python 3; in Python 2 use: print div.text
Use whatever variable names you prefer.
The find_all method returns a list-like collection (a ResultSet) of the items that Beautiful Soup found matching your criteria after parsing the page's HTML, searching either recursively or non-recursively depending on how you call it.
As the error says, the resulting set of objects has no attribute text, since it isn't an element but rather a collection of them.
However, the items inside the resulting set (should any be found) do.
See the Beautiful Soup documentation on find_all for details.

media:thumbnail w/ BeautifulSoup

What is the correct way to parse the url attribute of media:thumbnail tags using BeautifulSoup? I have tried the following:
doc = BeautifulSoup(urlopen('http://rss.cnn.com/rss/edition.rss'), 'xml')
items = doc.findAll('item')
for item in items:
    title = item.title.text
    link = item.link.text
    image = item.find('media:thumbnail')[0]['url']
However, I get the 'NoneType' object is not subscriptable error.
Don't include the namespace prefix:
>>> doc.find('thumbnail')
<media:thumbnail height="51" url="http://i2.cdn.turner.com/cnn/dam/assets/150116173806-amateur-video-amedy-coulibaly-top-tease.jpg" width="90"/>
The element.find() method returns a single element (or None when nothing matches, which is what triggered your error), so there is no need for subscripting here; you can access the url attribute on the element directly:
>>> doc.find('thumbnail')['url']
u'http://i2.cdn.turner.com/cnn/dam/assets/150116173806-amateur-video-amedy-coulibaly-top-tease.jpg'
There currently isn't any support for searching by a specific namespace; the namespace URL is stored (in the .namespace attribute) but not used by .find() or .find_all().

BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'name'

Do you know why the first example in the BeautifulSoup tutorial http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart gives AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, the space characters in the HTML cause the problem. I tried with the sources of a few pages; one worked and the others gave the same error (even after I removed spaces). Can you explain what "name" refers to and why this error happens? Thanks.
Just ignore NavigableString objects while iterating through the tree:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag
response = requests.get(url)  # url is the page you want to parse
soup = BeautifulSoup(response.text, 'html.parser')
for body_child in soup.body.children:
    if isinstance(body_child, NavigableString):
        continue  # skip text nodes, including whitespace between tags
    if isinstance(body_child, Tag):
        print(body_child.name)
name refers to the name of the tag if the object is a Tag (e.g. for <html>, name is "html").
If you have whitespace in your markup between nodes, BeautifulSoup turns it into NavigableString objects. So if you index into contents to grab nodes, you might get a NavigableString instead of the next Tag.
To avoid this, query for the node you are looking for (see "Searching the Parse Tree" in the documentation),
or, if you know the name of the next tag you want, use that name as a property; it returns the first Tag with that name, or None if no child with that name exists (see "Using Tag Names as Members").
If you want to use contents, you have to check the type of each object you are working with. The error you are getting just means the code accessed the name property on a NavigableString because it assumed every child is a Tag.
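To make the whitespace point concrete, here is a minimal sketch on a tiny invented document. It lists what children actually contains and then filters down to tags:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical document; the newlines between tags become text nodes.
html_doc = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html_doc, 'html.parser')

children = list(soup.body.children)

# Class name of each child: text nodes alternate with the p tags.
kinds = [type(child).__name__ for child in children]

# Only Tag objects are asked for their name.
tag_names = [child.name for child in children if isinstance(child, Tag)]

# Using the tag name as a member skips the text nodes entirely.
first_paragraph = soup.body.p.text

print(kinds)
print(tag_names)
```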
You can use try/except to skip the cases where a NavigableString turns up in the loop. The exception to catch is AttributeError, since NavigableString is a class and not an exception:
for j in soup.find_all(...):
    try:
        print(j.find(...))
    except AttributeError:
        pass
This is the latest working code to obtain the names of the tags in soup:
import requests
from bs4 import BeautifulSoup, Tag
res = requests.get(url).content
soup = BeautifulSoup(res, 'lxml')
for child in soup.body.children:
    if isinstance(child, Tag):
        print(child.name)
