I'm new to lxml and I'm trying to figure out how to rewrite links using iterlinks().
import lxml.html

html = lxml.html.document_fromstring(doc)
for element, attribute, link, pos in html.iterlinks():
    if attribute == "src":
        link = link.replace('foo', 'bar')
print(lxml.html.tostring(html))
However, this doesn't actually replace the links. I know I can use .rewrite_links, but iterlinks provides more information about each link, so I would prefer to use this.
Thanks in advance.
Instead of just assigning a new (string) value to the variable name link, you have to alter the element itself, in this case by setting its src attribute:
new_src = link.replace('foo', 'bar') # or element.get('src').replace('foo', 'bar')
element.set('src', new_src)
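Putting that fix into the loop from the question:

for element, attribute, link, pos in html.iterlinks():
    if attribute == "src":
        element.set('src', link.replace('foo', 'bar'))
print(lxml.html.tostring(html))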
Note that if you know which "links" you are interested in (for example, only img elements), you can also get the elements by using .findall() (or XPath or CSS selectors) instead of .iterlinks().
lxml provides a rewrite_links method (and a rewrite_links function that you pass the text to be parsed into a document) for changing all the links in a document:
.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None):
This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL.
For each link link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal), and things like "mailto:bob@example.com" (or javascript:...).
link is just a string copy of the attribute value, not a reference into the tree, so reassigning it changes nothing. Try replacing the attribute on the element itself in your loop.
Here is working code with rewrite_links:
from lxml.html import fromstring, tostring

e = fromstring("<html><body><a href='http://localhost'>hello</a></body></html>")

def my_rewriter(link):
    return "http://newlink.com"

e.rewrite_links(my_rewriter)
print(tostring(e))
Output:
b'<html><body><a href="http://newlink.com">hello</a></body></html>'
I'm admittedly beginner to intermediate with Python and a novice at BeautifulSoup/web scraping. However, I have successfully built a couple of scrapers. Normal tags are no problem (e.g., div, a, li, etc.).
However, I can't find how to reference this tag with .select or .find or attrs="" or anything:
..........
<react type="sad" msgid="25314120" num="2"
..........
I ultimately want what looks like the "num" attribute from whatever this ghastly thing is ... a "react" tag (though I don't think that's a thing?)?
.find() works the same way as you'd find other tags such as div, p and a tags. Therefore, we search for the 'react' tag.
react_tag = soup.find('react')
Then, access the num attribute like so.
num_value = react_tag['num']
Should print out:
2
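A self-contained version, as a minimal sketch (a closing tag is assumed, since the question's snippet is truncated):

from bs4 import BeautifulSoup

html = '<react type="sad" msgid="25314120" num="2"></react>'
soup = BeautifulSoup(html, 'html.parser')

react_tag = soup.find('react')
num_value = react_tag['num']
print(num_value)  # prints: 2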
As per the bs4 documentation, .find('tag') returns a single tag and .find_all('tag') returns a list of tags in the HTML.
In your case, if there are multiple react tags, use this:
for reactTag in soup.find_all('react'):
    print(reactTag.get('num'))
To get only the first tag, use this:
print(soup.find('react').get('num'))
The user "s n" was spot on! These are dynamically created javascript which I didn't know anything about, but was pretty easy to figure out. Using the SeleniumLibrary in Python and a "headless" WebChromeDriver together, you can use Selenium selectors like Xpath and many others to find these tags.
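For illustration, a minimal sketch of that approach (the URL is a hypothetical placeholder; assumes Selenium 4 and Chrome are installed):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/reactions")  # hypothetical URL
# the <react> tags only exist after the page's JavaScript has run
for tag in driver.find_elements(By.XPATH, "//react"):
    print(tag.get_attribute("num"))
driver.quit()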
Using XPath with Python, do I really need to use get() or getall(), or does the XPath string suffice?
For example, is this ok?
product_links = response.xpath('//a[contains(@class,"box_product")]/@href')
or do I really need to use
product_links = response.xpath('//a[contains(@class,"box_product")]/@href').getall()
Or is it that @attribute works when selecting an attribute, but to retrieve the data (text) within the HTML tags themselves we use get() and getall()?
Question: when do I need to use variant 1 (/@href) and when variant 2 (/@href with .getall())?
The goal is to obtain a workable array of links
Calling response.xpath('//a[contains(@class,"box_product")]/@href') gives you only a SelectorList (i.e. a recipe for getting the results you want) instead of the actual results.
To get the actual results, you need to call either get(), which will give you only the first match, or getall(), which will return all matches.
So for your use case, go with getall().
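A short sketch of the difference (assuming response is the HtmlResponse inside a spider callback or the Scrapy shell):

selector_list = response.xpath('//a[contains(@class,"box_product")]/@href')
first_link = selector_list.get()    # first matching href as a str, or None
all_links = selector_list.getall()  # list of all matching hrefs (possibly empty)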
=====================
Example and read more at https://www.pythongasm.com/introduction-to-scrapy/
I'm generating a model to find out where a piece of text is located in an HTML file.
So, I have a database with plenty of data from different newspaper's articles with data like title, publish date, authors and news text. What I'm trying to do is by analyzing this data, generate a model that can find by itself the XPath to the HTML tags with this content.
The problem is when I use a regex within the xpath method as shown here:
from lxml import html

with open('somecode.html', 'r') as f:
    root = html.fromstring(f.read())

list_of_xpaths = root.xpath(
    '//*/@*[re:match(., "2019-04-15")]',
    namespaces={'re': 'http://exslt.org/regular-expressions'},
)
This is an example of searching for the publish date in the code. It returns lxml.etree._ElementUnicodeResult objects instead of lxml.etree._Element objects.
Unfortunately, this type of object doesn't let me get the XPath to where it is located the way an lxml.etree._Element would when applying root.getroottree().getpath(list_of_xpaths[0]).
Is there a way to get the XPath for this type of element? How?
Is there a way to make lxml return lxml.etree._Element objects instead when using regex?
The problem is that you get an attribute value, represented as an instance of the _ElementUnicodeResult class.
If we introspect what the _ElementUnicodeResult class provides, we can see that it lets you get to the element that has this attribute via the .getparent() method:
attribute = list_of_xpaths[0]
element = attribute.getparent()
print(root.getroottree().getpath(element))
This would get us a path to the element, but as we need an attribute name as well, we could do:
print(attribute.attrname)
Then, to get the complete xpath pointing at the element attribute, we may use:
path_to_element = root.getroottree().getpath(element)
attribute_name = attribute.attrname
complete_path = path_to_element + "/@" + attribute_name
print(complete_path)
FYI, _ElementUnicodeResult also indicates whether this is actually an attribute via the .is_attribute property (this class represents text nodes and tails as well).
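Putting it all together, a minimal runnable sketch (the one-line document stands in for somecode.html):

from lxml import html

root = html.fromstring('<div><span data-date="2019-04-15">news text</span></div>')
matches = root.xpath(
    '//*/@*[re:match(., "2019-04-15")]',
    namespaces={'re': 'http://exslt.org/regular-expressions'},
)

attribute = matches[0]           # an _ElementUnicodeResult
element = attribute.getparent()  # the element carrying the attribute
print(root.getroottree().getpath(element) + "/@" + attribute.attrname)
# prints: /div/span/@data-date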
I am new to Python and BeautifulSoup. So please forgive me if I'm using the wrong terminology.
I am trying to get a specific 'text' from a div tag/element that has multiple attributes in the same <div>.
<div class="property-item" data-id="183" data-name="Brittany Apartments" data-street_number="240" data-street_name="Brittany Drive" data-city="Ottawa" data-province="Ontario" data-postal="K1K 0R7" data-country="Canada" data-phone="613-688-2222" data-path="/apartments-for-rent/brittany-apartments-240-brittany-drive-ottawa/" data-type="High-rise-apartment" data-latitude="45.4461070" data-longitude="-75.6465360" >
Below is my code to loop through and find 'property-item'
for btnMoreDetails in citySoup.findAll(attrs= {"class":"property-item"}):
My question is, if I specifically want the 'data-name' and 'data-path' for example, how do I go about getting it?
I've searched google and even this website. Some were saying using the .contents[2]. But I still wasn't able to get any of it.
Once you have extracted an element (findAll returns a list; looping gives you one element at a time), you can access its attributes as though they were dictionary keys. So for example the following code:
data = """<div class="property-item" data-id="183" data-name="Brittany Apartments" data-street_number="240" data-street_name="Brittany Drive" data-city="Ottawa" data-province="Ontario" data-postal="K1K 0R7" data-country="Canada" data-phone="613-688-2222" data-path="/apartments-for-rent/brittany-apartments-240-brittany-drive-ottawa/" data-type="High-rise-apartment" data-latitude="45.4461070" data-longitude="-75.6465360" >"""
import bs4
soup = bs4.BeautifulSoup(data, "html.parser")
for btnMoreDetails in soup.findAll(attrs={"class": "property-item"}):
    print(btnMoreDetails["data-name"])
prints out
Brittany Apartments
If you want to get the data-name and data-path attributes, you can simply use the dictionary-like access to Tag's attributes:
for btnMoreDetails in citySoup.findAll(attrs={"class": "property-item"}):
    print(btnMoreDetails["data-name"])
    print(btnMoreDetails["data-path"])
Note that you can also use a CSS selector to match the property items:
for property_item in citySoup.select(".property-item"):
    print(property_item["data-name"])
    print(property_item["data-path"])
FYI, if you want to see all the attributes use .attrs property:
for property_item in citySoup.select(".property-item"):
    print(property_item.attrs)
According to the BeautifulSoup documentation, it is possible to get the value of a tag's attribute by using code which looks like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag['class']
Theoretically (that is, according to the doc), the output would be :
u'boldest'
However, when I execute the above code, it outputs :
['boldest']
So, is there something I'm missing ? How can I obtain a tag's attribute content as a plain unicode string ?
tag['class'][0]
There can be more than one class on a tag, that's why it returns a list of values. If you are sure there is only one class there, just get the first element from the list.
Check this section in the documentation:
Multi-valued attributes
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:
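For example, a minimal sketch of the behaviour the docs describe:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout">text</p>', 'html.parser')
print(soup.p['class'])     # ['body', 'strikeout']
print(soup.p['class'][0])  # 'body'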
tag['class'][0] will give you the string.