Obtain text from lxml Comment - python

I am trying to get the content of the _Comment. I've researched quite a bit on how to do it, but I don't know how to access it from the td element in order to grab the text. I'm using XPath with the Python Scrapy module, if that helps.
td = None [_Element]
<built-in function Comment> = None [_Comment]
a = None [_Element]
The HTML for the td element is:
<table class="crIFrameReviewList">
<tr>
<td>
<!-- BOUNDARY -->
<a name="R2L4AFEICL8GG6"></a><br />
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
304 of 309 people found the following review helpful
</div>
<div style="margin-bottom:0.5em;">
<span style='margin-left: -5px;'><img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_.gif" width="64" alt="5.0 out of 5 stars" title="5.0 out of 5 stars" height="12" border="0" /> </span>
<b>Great Travel Zoom</b>, <nobr>April 9, 2014</nobr>
</div>
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<span class="crVerifiedStripe"><b class="h3color tiny" style="margin-right: 0.5em;">Verified Purchase</b><span class="tiny verifyWhatsThis">(What's this?)</span></span>
</div>
<div class="tiny" style="margin-bottom:0.5em;">
<b><span class="h3color tiny">This review is from: </span>Canon PowerShot SX700 HS Digital Camera (Black) (Electronics)</b>
</div>
For the recent few years Canon has made great efforts to improve their travel-zoom compact cameras, and the new SX700 is their next remarkable achievement on that way. It's a little bit bigger than its predecessor (SX280) but it is very well built and has an attractive look and feel (I like the black one). It also got a new front grip which makes one-hand shooting more convenient, even when shooting video, since the Video button was moved from the back to the top and you can now use your thumb solely for holding the camera.<br /><br />Here is a brief list of the new camera pros & cons:<br /><br />PROS:<br />* A very good design and build quality with the attractive finish.<br />* A new powerful 30x optical zoom lens in just a pocket-size body.<br />* Incredible range from 25mm wide to 750mm telephoto for stills and video.<br />* Zoom Framing Assist - very useful new feature to compose your pictures at long telephoto.<br />* Very effective optical Intelligent Image Stabilization for...
Read more
<div style="padding-top: 10px; clear: both; width: 100%;">

Find the div with class="reviewText" using the .//div[@class="reviewText"] XPath expression and dump the element to a string using tostring() with method="text":
import lxml.html
data = """
your html here
"""
td = lxml.html.fromstring(data)
review = td.find('.//div[@class="reviewText"]')
print(lxml.html.tostring(review, method="text", encoding="unicode"))
Prints:
54,000 RPM - It has a spinning disk drive that is way beyond our time...I bought 10 of these just for the hard drive, they blow SSD's out of the water.
Seriously though... how does a well known computer company mistype an important spec?
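For the comment itself (the `<!-- BOUNDARY -->` node in the question), lxml exposes the content of a comment node through its `.text` attribute, and such nodes can be selected with the XPath node test `comment()`. As a dependency-free illustration of the same idea, here is a minimal sketch that uses the standard library's `html.parser` instead of lxml; the class name `CommentCollector` and the helper `comment_texts` are made up for this example.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment fed to the parser."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is everything between "<!--" and "-->"
        self.comments.append(data.strip())


def comment_texts(html):
    """Return the stripped text of all comments found in html."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector.comments


# Trimmed version of the td element from the question
sample = '<td><!-- BOUNDARY --><a name="R2L4AFEICL8GG6"></a><br /></td>'
print(comment_texts(sample))  # ['BOUNDARY']
```

This only shows where the comment text lives; in a Scrapy/lxml pipeline you would more likely stay with XPath rather than switch parsers.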

Related

Getting the text of a paragraph element using Selenium

`<div id="businessCategory12">`
`<p style="margin-top: 0px;line-height:80%;margin-left:5px;font-weight: bold;color:#00004C">Business Types</p>`
`<p style="margin-top: 0px;line-height:80%;margin-left:15px;font-weight: bold;"> Minority Owned Business</p>`
`<p style="margin-top: 0px;line-height:80%;margin-left:15px;"> Black American Owned</p>`
`</div>`
I am working on a webscraping tool for a client. I need to get the text from the third paragraph above using selenium (python) but I am having a lot of trouble. The text should be "Black American Owned". I have tried the following but it keeps giving me a null value. What am I doing wrong here?
Any help or other way to get the text would be greatly appreciated!
`minority = driver.find_element_by_xpath("//*[@id='businessCategory12']/p[3]")`
`minority_owned = minority.text`
Possibly the node is hidden. Try textContent instead of text:
minority = driver.find_element_by_xpath("//*[@id='businessCategory12']/p[3]")
minority_owned = minority.get_attribute("textContent")
<div id="businessCategory12">
<p style="margin-top: 0px;line-height:80%;margin-left:5px;font-weight: bold;color:#00004C">Business Types</p>
<p style="margin-top: 0px;line-height:80%;margin-left:15px;font-weight: bold;">Minority Owned Business</p>
<p style="margin-top: 0px;line-height:80%;margin-left:15px;">Black American Owned</p>
</div>
Just try:
//p[3]/text()
Here is a good site to play around with XPath:
https://scrapinghub.github.io/xpath-playground/
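The XPath from the question can also be checked offline, which helps separate "the expression is wrong" from "the node is hidden in the browser". The sketch below runs the same expression against the posted snippet using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath; the `<body>` wrapper is an assumption added so that the `div` is a descendant of the root.

```python
import xml.etree.ElementTree as ET

snippet = """
<div id="businessCategory12">
<p style="margin-top: 0px;line-height:80%;margin-left:5px;font-weight: bold;color:#00004C">Business Types</p>
<p style="margin-top: 0px;line-height:80%;margin-left:15px;font-weight: bold;"> Minority Owned Business</p>
<p style="margin-top: 0px;line-height:80%;margin-left:15px;"> Black American Owned</p>
</div>
"""

# Wrap the snippet so the div is a descendant of the parsed root element
body = ET.fromstring("<body>" + snippet + "</body>")

# Same expression as in the question, written relative to the root (.//)
third = body.find(".//*[@id='businessCategory12']/p[3]")
minority_owned = third.text.strip()
print(minority_owned)  # Black American Owned
```

Since the expression selects the right paragraph here, an empty result from Selenium's `.text` points at visibility or timing in the live page rather than at the XPath.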

Using BeautifulSoup to extract specific text within duplicate tags

I am working on a Digital Humanities project trying to isolate just the descriptions of images from a series of digitized engravings. (I am also fairly new to coding and programming in general as I am just a humble philosopher stepping into the waters of DH) So far I have been able to isolate the source code using Python and a urllib script which looks like this:
import urllib.request
import urllib.parse
url = "http://pitts.emory.edu/dia/image_details.cfm?ID=17250"
f = urllib.request.urlopen(url)
print(f.read().decode('utf-8'))
However, my issue arises in the source code itself. The description is situated with other pieces of information which are all broken up by P and b tags:
</div>
<div class="col-sm-6">
<P>
<b>Book Title:</b>
The Holy Bible containing the Old and New Testaments, according to the authorised version. With illustrations by Gustave Doré
</p>
<P>
<b>Author:</b> Doré, Gustave, 1832-1883
</p>
<P>
<b>Image Title:</b> Baptism of Jesus
</p>
<P>
<b>Scripture Reference:</b><ul><li>John 1 (<a href='search.cfm?biblicalbook=John&biblicalbookchapter=1'>further images</a> / <a rel='shadowbox;height=500;width=600' href='http://www.commonenglishbible.com/explore/passage-lookup/?query=John+1'>scripture text</a>)</li></ul>
</p>
<P>
<b>Description:</b> John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
</P>
<P>
<A HREF="book_list.cfm?ID=2449">Click here
</a> for additional images available from this book.
</P>
<p>For information on licensing this image, please send an email, including a link to the image, to
dia@emory.edu
</p>
</div>
How can I use BeautifulSoup to isolate just the text of the description from these tags? Everything I have found on Stack Overflow so far suggests it may be doable; however, I have yet to find anything that does this specifically.
Again, out of the source code, I want to extract just the description "John the Baptist baptizes Jesus...". How could I go about doing this?
Thanks! And sorry again for my lack of robust knowledge just yet.
In this example, we can use CSS selectors. Assuming you are using BeautifulSoup 4.7+, CSS selector support is provided by the soupsieve library. We first use the :has() CSS level 4 selector to find <p> tags that have a direct child <b> tag, and then use soupsieve's non-standard :contains selector to ensure the <b> tag contains Description:. (Newer soupsieve releases spell this selector :-soup-contains and deprecate the bare :contains form.) Then we simply print the content of every element that matches these criteria, stripping leading and trailing whitespace and removing the Description: label. Keep in mind there are multiple ways to do this; this is just the method I've chosen to illustrate:
import bs4
markup = """
</div>
<div class="col-sm-6">
<P>
<b>Book Title:</b>
The Holy Bible containing the Old and New Testaments, according to the authorised version. With illustrations by Gustave Doré
</p>
<P>
<b>Author:</b> Doré, Gustave, 1832-1883
</p>
<P>
<b>Image Title:</b> Baptism of Jesus
</p>
<P>
<b>Scripture Reference:</b><ul><li>John 1 (<a href='search.cfm?biblicalbook=John&biblicalbookchapter=1'>further images</a> / <a rel='shadowbox;height=500;width=600' href='http://www.commonenglishbible.com/explore/passage-lookup/?query=John+1'>scripture text</a>)</li></ul>
</p>
<P>
<b>Description:</b> John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
</P>
<P>
<A HREF="book_list.cfm?ID=2449">Click here
</a> for additional images available from this book.
</P>
<p>For information on licensing this image, please send an email, including a link to the image, to
dia@emory.edu
</p>
</div>
"""
soup = bs4.BeautifulSoup(markup, "html.parser")
for el in soup.select('p:has(> b:contains("Description:"))'):
    # strip() with no argument removes surrounding whitespace; strip('') removes nothing
    print(el.get_text().strip().replace('Description: ', ''))
Output:
John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
I could achieve something almost like you wanted using the below code:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
url = "http://pitts.emory.edu/dia/image_details.cfm?ID=17250"
f = urllib.request.urlopen(url)
soup = BeautifulSoup(f, 'html.parser')
parent = soup.find("b", text="Description:").parent
parent.find("b", text="Description:").decompose()
print(parent.text)
I've added BeautifulSoup and removed the Description.
I indexed the <p> tags and then chose index [4]. I am just a newbie, but it worked.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pitts.emory.edu/dia/image_details.cfm?ID=17250")
soup = BeautifulSoup(html, 'html.parser')
page = soup.find_all('p')[4].getText()
print(page)
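Indexing by position breaks as soon as the page gains or loses a paragraph, so matching on the Description: label is usually sturdier. For comparison with the BeautifulSoup answers, here is a rough standard-library sketch of the same label-based idea using `html.parser`. It is not BeautifulSoup, and the names `LabelledTextParser` and `extract_after_label` are hypothetical.

```python
from html.parser import HTMLParser


class LabelledTextParser(HTMLParser):
    """Collect the text that follows <b>LABEL</b> inside the same <p>."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_bold = False
        self.capturing = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <P> and <p> are treated alike
        if tag == "b":
            self.in_bold = True
        elif tag == "p":
            self.capturing = False  # a new paragraph ends any capture

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False
        elif tag == "p":
            self.capturing = False

    def handle_data(self, data):
        if self.in_bold and data.strip() == self.label:
            self.capturing = True  # start collecting after the label
        elif self.capturing:
            self.parts.append(data)


def extract_after_label(markup, label):
    """Return the whitespace-normalised text following the bold label."""
    parser = LabelledTextParser(label)
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = """
<div class="col-sm-6">
<P>
<b>Image Title:</b> Baptism of Jesus
</p>
<P>
<b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.
</P>
</div>
"""
print(extract_after_label(sample, "Description:"))
```

The sample is a shortened version of the page source from the question; the real page would be fed in the same way after decoding the response.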

Trying to use Python + XPath to click the highlighted text?

Fairly new to coding and Python, I'm trying to use find_element_by_xpath to click the highlighted text "Snoring Chin Strap by TheFamilyMarket".
time.sleep(2)
#btn = br.find_element_by_name("#Anti Snoring Chin Strap Kit")
# btn = br.find_element_by_link_text('Snoring Chin Strap')
The HTML code:
<div class="tableD">
<div class="productDiv" id="productDiv69507">
<h2 class="productTitle" id="productTitle69507" onclick="goToProduct(7)">Snoring Chin Strap by TheFamilyMarket</h2>
<img class="productImage" src="https://images-na.ssl-images-amazon.com/images/I/516fC3JruqL.jpg" onclick="goToProduct(7)">
<hr>
<h4 class="normalPrice" id="normalPrice7" onclick="goToProduct(7)">Normally: <span class="currency">$ </span>19.99</h4>
<h4 class="promoPrice" style="margin:2.5px auto;" id="promoPrice69507" onclick="goToProduct(7)">Your Amazon Price: <span class="currency">$ </span>1.99</h4>
<h3>Your Total: <span class="currency">$ </span>1.99</h3>
<p class="clickToViewP" id="cToVP69507" onclick="goToProduct(7)">Click to view and purchase!</p>
</div>
</div>
br.find_element_by_xpath("//h2[text()='Snoring Chin Strap by TheFamilyMarket']");
XPath is often quick to obtain because you can copy it from the browser, which is why so many people use it, but in my opinion learning JavaScript and CSS selectors will help you more in the long term.
The above can be done also by selecting all the h2 elements and looking for text using plain JavaScript and passing the result to python:
link_you_search = br.execute_script('''
links= document.querySelectorAll("h2");
for (link of links) if (link.textContent.includes("Chin Strap")) return link;
''')
link_you_search.click()
or alternatively you can select by class:
link_you_search = br.execute_script('''
links= document.querySelectorAll(".productDiv");
for (link of links) if (link.textContent.includes("Chin Strap")) return link;
''')
link_you_search.click()
Given that your element has an id attribute, selecting by id is usually best practice: it is the fastest lookup, there should be only one element with that id, and ids tend not to change (for example, when the page is translated). In your case it would be:
link_you_search = br.find_element_by_id('productTitle69507')
link_you_search.click()

Trouble finding element in Selenium with Python

I have been trying to collect a list of live channels/viewers on YouTube Gaming. I am using Selenium with Python to force the website to scroll down the page so it loads more than 11 channels. For reference, this is the webpage I am working on.
I have found the location of the data I want, but I am struggling with getting selenium to go there. The part I am having trouble with looks like this:
<div class="style-scope ytg-gaming-video-renderer" id="video-metadata"><span class="title ellipsis-2 style-scope ytg-gaming-video-renderer"><ytg-nav-endpoint class="style-scope ytg-gaming-video-renderer x-scope ytg-nav-endpoint-2"><a href="/watch?v=FFKSD1HHrdA" tabindex="0" class="style-scope ytg-nav-endpoint" target="_blank">
Live met Bo3
</a></ytg-nav-endpoint></span>
<div class="channel-info small layout horizontal center style-scope ytg-gaming-video-renderer">
<ytg-owner-badges class="style-scope ytg-gaming-video-renderer x-scope ytg-owner-badges-0">
<template class="style-scope ytg-owner-badges" is="dom-repeat"></template>
</ytg-owner-badges>
<ytg-formatted-string class="style-scope ytg-gaming-video-renderer">
<ytg-nav-endpoint class="style-scope ytg-formatted-string x-scope ytg-nav-endpoint-2">Rico Eeman
</ytg-nav-endpoint>
</ytg-formatted-string>
</div><span class="ellipsis-1 small style-scope ytg-gaming-video-renderer" id="video-viewership-info" hidden=""></span>
<div id="metadata-badges" class="small style-scope ytg-gaming-video-renderer">
<ytg-live-badge-renderer class="style-scope ytg-gaming-video-renderer x-scope ytg-live-badge-renderer-1">
<template class="style-scope ytg-live-badge-renderer" is="dom-if"></template>
<span aria-label="" class="text layout horizontal center style-scope ytg-live-badge-renderer">4 watching</span>
<template class="style-scope ytg-live-badge-renderer" is="dom-if"></template>
</ytg-live-badge-renderer>
</div>
</div>
Currently, I am trying:
#This part works fine. I can use the unique ID
meta_data = driver.find_element_by_id('video-metadata')
#This part is also fine. Once again, it has an ID.
viewers = meta_data.find_element_by_id('metadata-badges')
print(viewers.text)
However, I have been having trouble getting to the channel name (in this example 'Rico Eeman', which is under the first nested div tag). Because it's a compound class name, I cannot find the element by class name, and the following XPaths don't work:
name = meta_data.find_element_by_xpath('/div[@class="channel-info small layout horizontal center style-scope ytg-gaming-video-renderer"]/ytg-formatted-string')
name = meta_data.find_element_by_xpath('/div[1]')
They both raise the element not found error. I am not really sure what to do here. Does anyone have a working solution?
The name is not in the <ytg-formatted-string> tag itself; it's in one of its descendants. Try
meta_data.find_element_by_css_selector('.style-scope.ytg-formatted-string.x-scope.ytg-nav-endpoint-2 > a')
Or with xpath
meta_data.find_element_by_xpath('//ytg-nav-endpoint[@class="style-scope ytg-formatted-string x-scope ytg-nav-endpoint-2"]/a')
This will get all the names. Even if your XPath worked, searching from video-metadata would not get all the names: the id is repeated in a div for each user, so you need find_elements and must iterate over the returned elements:
names = dr.find_elements_by_css_selector("a.style-scope.ytg-nav-endpoint[href^='/channel/']")
print([name.get_attribute("text") for name in names])
Which gives you:
['NinjaNation Gaming', 'DURX DANIEL', 'DEMON', 'Perfection', 'The one and only jd', 'Violator Games', 'KingLuii718', 'NinjaNation Gaming', 'DURX DANIEL', 'DEMON', 'Perfection']
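One likely reason both XPath attempts in the question fail: as far as I know, Selenium evaluates an XPath that starts with / from the document root even when it is called on an element, so a search relative to meta_data needs a leading ./ or .// instead. The sketch below illustrates the relative forms with the standard library's `xml.etree.ElementTree` on a trimmed, hand-written version of the posted markup. It is not Selenium, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the video-metadata block from the question
snippet = """
<div id="video-metadata">
  <span class="title"><a href="/watch?v=FFKSD1HHrdA">Live met Bo3</a></span>
  <div class="channel-info small">
    <ytg-formatted-string>
      <ytg-nav-endpoint>Rico Eeman
      </ytg-nav-endpoint>
    </ytg-formatted-string>
  </div>
  <div id="metadata-badges"><span>4 watching</span></div>
</div>
"""

meta_data = ET.fromstring(snippet)

# "./div[1]" means: first div child of the current element
first_div = meta_data.find("./div[1]")
print(first_div.get("class"))  # channel-info small

# ".//" means: search all descendants of the current element
endpoint = meta_data.find(".//ytg-formatted-string/ytg-nav-endpoint")
name = endpoint.text.strip()
print(name)  # Rico Eeman
```

If the relative form still finds nothing in the live page, the element is probably rendered later by JavaScript, and the find_elements approach over the loaded page is the safer route.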

lxml python script, how can I remove count of duplicate id

Ok so I'm getting stuck on how to work around this issue here.
this is just a private counter of online people for a game.
After some research, I managed to get down to this code which I added a bit on the search, to get the count of all the images with on.png ...and it does actually work!
from lxml import etree
import requests
def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    # XPath count() returns a float, hence the int() conversion
    return int(root.xpath('count(//img[@src="pics/on.png"])'))
Now my frustration is that "on.png" appears twice when the Guild Master is online.
Can anyone think of a way to get around it? This is part of the HTML:
<tr><td class='tabellatitolo a_dx' style=' padding:10px;' >Master
<td class='tabelladati' style=' padding:10px;' ><img align=absmiddle src='pics/on.png'>
<a href='?f=pg&id=55110'>Modernist</a>
<tr><td class='tabellatitolo a_dx' style=' padding:10px;' >Membri<p>(5)
<td class='tabelladati' style=' padding:10px;' >**<img align=absmiddle src='pics/on.png'>
<a href='?f=pg&id=55110'>**Modernist**</a>** - <br><img align=absmiddle src='pics/off.png'>
<a href='?f=pg&id=232720'>Human Slayer</a> - <i>Ti stimo!</i><br>
<img align=absmiddle src='pics/off.png'> <a href='?f=pg&id=68194'>Juggernaut</a><br>
<img align=absmiddle src='pics/off.png'> <a href='?f=pg&id=67121'>XeDiOr ThE KoOl</a><br>
<img align=absmiddle src='pics/on.png'> <a href='?f=pg&id=142638'>Lisbet Irmgard</a><br>
I was maybe thinking to use context position or maybe leverage on that "Membri" (members)?
Thanks, any hint will be appreciated :)
I'm going to give a more brutal but possibly simpler answer:
import re
import requests
def get_img_cnt(url):
    response = requests.get(url)
    # just take the bit after the 'Membri' section
    # (response.text is a str; response.content is bytes under Python 3)
    member_content = response.text.split('>Membri<')[1]
    # count the number of times you see the image
    return len(re.findall('pics/on.png', member_content))
How well it will work will depend on the rest of the html (that you haven't provided). I'd go for string searching (like this) before I start doing html parsing generally. If it works, it's a simpler and faster solution.
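To see the string-splitting idea on concrete data, here is a small self-contained sketch that runs on the HTML fragment from the question instead of a live URL. The function name `count_online_members` and its parameters are made up for this example, and it assumes the '>Membri<' marker appears exactly as in the posted fragment.

```python
def count_online_members(html, marker=">Membri<", icon="pics/on.png"):
    """Count icon occurrences that appear after the first marker."""
    _before, found, after = html.partition(marker)
    if not found:
        raise ValueError("marker %r not found in page" % marker)
    return after.count(icon)


# Fragment adapted from the question: the master appears in both sections
sample = """
<tr><td class='tabellatitolo a_dx'>Master
<td class='tabelladati'><img align=absmiddle src='pics/on.png'>
<a href='?f=pg&id=55110'>Modernist</a>
<tr><td class='tabellatitolo a_dx'>Membri<p>(5)
<td class='tabelladati'><img align=absmiddle src='pics/on.png'>
<a href='?f=pg&id=55110'>Modernist</a> - <br><img align=absmiddle src='pics/off.png'>
<a href='?f=pg&id=232720'>Human Slayer</a> - <i>Ti stimo!</i><br>
<img align=absmiddle src='pics/off.png'> <a href='?f=pg&id=68194'>Juggernaut</a><br>
<img align=absmiddle src='pics/off.png'> <a href='?f=pg&id=67121'>XeDiOr ThE KoOl</a><br>
<img align=absmiddle src='pics/on.png'> <a href='?f=pg&id=142638'>Lisbet Irmgard</a><br>
"""

print(sample.count("pics/on.png"))   # 3, the master is counted twice
print(count_online_members(sample))  # 2, members section only
```

Raising on a missing marker is a design choice for the sketch: a silent fallback to counting the whole page would quietly bring the double count back.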
