Scrapy, python, Xpath how to match respective items in html

I am new to XPath and am trying to scrape a website with the format below:
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_date </div>
<div class="middle"> listed_value </div>
</div>
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_date </div>
</div>
<div class="top">
<a> tittle_name </a>
<div class="middle"> listed_value </div>
</div>
The presence of listed_value and listed_date is optional.
I need to group each tittle_name with its respective listed_date and listed_value (if available), then insert each record into MySQL.
I am using the Scrapy shell, working from basic examples like:
listings = hxs.select('//div[@class=\'top\']')
for listing in listings:
    tittle_name = listing.select('/a//text()').extract()
    date_values = listing.select('//div[@class=\'middle\']')
The code above gives me a list of tittle_name values and a list of the available listed_date and listed_value entries, but how do I match them up? (We cannot go by index because the format is not symmetric.)
Thanks.

First, note that those XPath expressions are absolute:
/a//text()
//div[@class='middle']
You need relative XPath expressions like these:
a
div[@class='middle']
Second, it is not a good idea to select text nodes in a mixed content model like (X)HTML. You should extract the string value with the proper DOM method or with the string() function. (In the latter case, you would need to evaluate the expression for each node, because string() applied to a node set uses only the first node in document order.)
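As a minimal sketch of the relative-path idea, the snippet below groups each title with the optional values found inside the same div.top. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption, so the example runs without Scrapy installed), and it wraps the question's markup in a hypothetical <root> element so it parses as XML. The name group_listings is illustrative only.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <root> element.
PAGE = """
<root>
  <div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
    <div class="middle"> listed_value </div>
  </div>
  <div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_date </div>
  </div>
  <div class="top">
    <a> tittle_name </a>
    <div class="middle"> listed_value </div>
  </div>
</root>
"""


def group_listings(markup):
    """Return one (title, [middle texts]) pair per div.top block."""
    root = ET.fromstring(markup)
    records = []
    for listing in root.findall(".//div[@class='top']"):
        # Relative paths: both are evaluated against this listing only,
        # so values from other listings can never leak in.
        title = "".join(listing.find("a").itertext()).strip()
        middles = [
            "".join(div.itertext()).strip()
            for div in listing.findall("div[@class='middle']")
        ]
        records.append((title, middles))
    return records


records = group_listings(PAGE)
for title, middles in records:
    print(title, middles)
```

Because every lookup starts from the current listing, a missing listed_date or listed_value simply yields a shorter list for that record, and no index matching across listings is needed.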

Since the website doesn't mark whether the content of a div[@class='middle'] is a date or a value, you'll have to write your own logic to decide.
I guess the dates have a specific format that you could detect, perhaps with a regular expression.
Can you be more specific about the possible values for listed_date and listed_value?
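A rough sketch of that idea follows. The date format is not given in the question, so the YYYY-MM-DD pattern here is purely an assumption, as are the sample strings and the helper name classify; anything that does not look like a date is treated as the value.

```python
import re

# Assumption: dates look like YYYY-MM-DD. Adjust the pattern to the real site.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def classify(texts):
    """Sort the texts of one listing into a date slot and a value slot."""
    record = {"listed_date": None, "listed_value": None}
    for text in texts:
        text = text.strip()
        if DATE_RE.match(text):
            record["listed_date"] = text
        else:
            record["listed_value"] = text
    return record


# Hypothetical sample data for three listings with optional fields.
both = classify([" 2012-03-01 ", " 1500 "])
date_only = classify(["2012-03-01"])
value_only = classify(["1500"])
print(both, date_only, value_only)
```

A missing field stays None, which maps naturally to a NULL column when the record is inserted into the database.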

Related

How to use selenium python to find and return all attributes that contain "cookie"?

I'm trying to return all attributes that contain "cookie". The "cookie" can be anywhere within the first div and doesn't seem to follow any pattern. For example, it could look something like this or be in a completely different structure.
<div class="product">
<div class="random">
<a data="7cookiek">
<a class="messy">
<div id="9cookiep">
</a>
</div>
Is there a simple way to return the full content of every attribute that contains "cookie"? For instance, the code above would return
7cookiek
9cookiep

using nth-type or nth-child to select n element

Q: What XPath or CSS selector can I use to select the 2nd <div class="checkbox">?
I have tried:
XPath - //div[@class="checkbox"][2]
CSS - div.checkbox:nth-child(2)
However, neither of them worked in Chrome Developer Tools.
I can use $x('//div[@class="checkbox"]') to see all three checkboxes
I can use $x('//div[@class="checkbox"]')[0] to specify the 1st div.checkbox
I can use $x('//div[@class="checkbox"]')[1] to specify the 2nd div.checkbox
Here's an example of my HTML structure:
<div class="fs">
<div class="f">
<div class="checkbox">
<input type="radio" value="A">
<label for="A">A</label>
</div>
</div>
<div class="f">
<div class="checkbox">
<input type="radio" value="B">
<label for="B">B</label>
</div>
</div>
<div class="f">
<div class="checkbox">
<input type="radio" value="C">
<label for="C">C</label>
</div>
</div>
</div>

Rather than trying to find the second element by index, another possibility would be to get it by the value on the INPUT or the text in the LABEL that is contained in that DIV. A couple of XPaths would be
//div[@class='checkbox'][./input[@value='B']]
//div[@class='checkbox'][./label[.='B']]

You need the 2nd element from the results, which can be done with the expression below:
(//div[@class="checkbox"])[2]
I don't think CSS allows selecting the nth element of a result set like this.

Since JeffC and Tarun Lalwani already suggested XPath way of doing it, I'd like to suggest a different approach.
In CSS, one can use the :nth-child selector to choose the 2nd <div class="f"> and grab the nested div from there. (> can be omitted)
div.f:nth-child(2) > div.checkbox
Similarly, the following works in XPath:
//div[@class='f'][2]/div[@class='checkbox']
One can also choose an element by attribute value using CSS attribute selectors, but unfortunately one cannot select the parent from there.
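The answers above can be compared side by side in a small sketch. In XPath, a position predicate such as [2] is evaluated per parent, and each div.checkbox here is the first div inside its own div.f, which is why the original expression matches nothing. The snippet uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no parenthesised expressions), so indexing the result list stands in for (//div[@class="checkbox"])[2]. The input tags are self-closed so the markup parses as XML; that change and the helper name label_of are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Markup from the question, with <input> self-closed to make it valid XML.
DOC = """
<div class="fs">
  <div class="f">
    <div class="checkbox"><input type="radio" value="A"/><label for="A">A</label></div>
  </div>
  <div class="f">
    <div class="checkbox"><input type="radio" value="B"/><label for="B">B</label></div>
  </div>
  <div class="f">
    <div class="checkbox"><input type="radio" value="C"/><label for="C">C</label></div>
  </div>
</div>
"""
root = ET.fromstring(DOC)


def label_of(checkbox):
    """Return the label text of a div.checkbox element."""
    return checkbox.find("label").text


# [2] is applied within each parent, and no parent has two such divs.
per_parent = root.findall(".//div[@class='checkbox'][2]")

# Index the whole result list instead (Python lists are 0-based).
second_by_index = root.findall(".//div[@class='checkbox']")[1]

# Pick the 2nd div.f, then step down to its checkbox.
second_by_parent = root.find("div[@class='f'][2]/div[@class='checkbox']")

# Select by the label text rather than by position.
second_by_label = root.find(".//div[@class='checkbox'][label='B']")

print(len(per_parent), label_of(second_by_index),
      label_of(second_by_parent), label_of(second_by_label))
```

The three working variants return the same element here; selecting by label or value is the least sensitive to changes in the order of the checkboxes.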

Anything similar to "until" in CSS selector?

I would like to get the movie names listed between the "tracked_by" id and the "buzz_off" id. I have already created a selector which can grab the names after the "tracked_by" id. However, my intention is to let the script keep parsing UNTIL it finds the "buzz_off" id. The elements that contain the names are:
html = '''
<div class="list">
<a id="allow" name="allow"></a>
<h4 class="cluster">Allow</h4>
<div class="base min">Sally</div>
<div class="base max">Blood Diamond</div>
<a id="tracked_by" name="tracked_by"></a>
<h4 class="cluster">Tracked by</h4>
<div class="base min">Gladiator</div>
<div class="base max">Troy</div>
<a id="buzz_off" name="buzz_off"></a>
<h4 class="cluster">Buzz-off</h4>
<div class="base min">Heat</div>
<div class="base max">Matrix</div>
</div>
'''
from lxml import html as htm
root = htm.fromstring(html)
for item in root.cssselect("a#tracked_by ~ div.base a"):
    print(item.text)
The selector I've tried (also used in the script above):
a#tracked_by ~ div.base a
Results I'm getting:
Gladiator
Troy
Heat
Matrix
Results I would like to get:
Gladiator
Troy
Btw, I want to use this selector for parsing the names, not for styling.

See a reference for CSS selectors. As you can see, CSS doesn't have any form of control logic, as it is not a programming language. You'd have to use a loop in Python (for example, a while-not loop) and handle each element one at a time, stopping at the "buzz_off" element, or append the matches to a list.
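A minimal sketch of such a loop is shown below. It walks the children of the container in document order, starts collecting after the start anchor, and stops at the stop anchor. The standard library's xml.etree.ElementTree is used here instead of lxml (an assumption, so the example has no third-party dependency), and names_between is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Markup from the question (it is well-formed, so it parses as XML).
HTML = """
<div class="list">
  <a id="allow" name="allow"></a>
  <h4 class="cluster">Allow</h4>
  <div class="base min">Sally</div>
  <div class="base max">Blood Diamond</div>
  <a id="tracked_by" name="tracked_by"></a>
  <h4 class="cluster">Tracked by</h4>
  <div class="base min">Gladiator</div>
  <div class="base max">Troy</div>
  <a id="buzz_off" name="buzz_off"></a>
  <h4 class="cluster">Buzz-off</h4>
  <div class="base min">Heat</div>
  <div class="base max">Matrix</div>
</div>
"""


def names_between(container, start_id, stop_id):
    """Collect div.base texts after start_id, stopping at stop_id."""
    names = []
    collecting = False
    for child in container:  # direct children, in document order
        elem_id = child.get("id")
        if elem_id == stop_id:
            break  # the "until" condition
        if elem_id == start_id:
            collecting = True
        elif (collecting and child.tag == "div"
              and "base" in child.get("class", "").split()):
            names.append(child.text)
    return names


names = names_between(ET.fromstring(HTML), "tracked_by", "buzz_off")
print(names)
```

The same loop shape should carry over to lxml elements, since iterating an element there also yields its children in document order.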

Python list processing to extract substrings

I parsed an HTML page via beautifulsoup, extracting all div elements with specific class names into a list.
I now have to clean out HTML strings from this list, leaving behind string tokens I need.
The list I start with looks like this:
[<div class="info-1">\nName1a <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b <span class="bold">Score3b</span>\n</div>]
The whitespace is deliberate.
I need to reduce that list to:
[('Name1a', 'Score1a'), ('Name1b', 'Score1b'), ('Name2a', 'Score2a'), ('Name2b', 'Score2b'), ('Name3a', 'Score3a'), ('Name3b', 'Score3b')]
What's an efficient way to parse out substrings like this?
I've tried using the split method (e.g. [item.split('<div class="info-1">\n',1) for item in string_list]), but splitting just results in a substring that requires further splitting (hence inefficient). The same goes for using replace.
I feel I ought to go the other way around and extract the tokens I need, but I can't seem to wrap my head around an elegant way to do this. Being new to this hasn't helped either. I appreciate your help.

Do not convert a BS object to a string unless you really need to.
Use a CSS selector to find the classes that start with info.
Use stripped_strings to get all the non-empty strings under a tag.
Use tuple() to convert an iterable to a tuple object.
import bs4
html = '''<div class="info-1">\nName1a <span class="bold">Score1a</span>\n</div>, <div class="info-2">\nName1b <span class="bold">Score1b</span>\n</div>, <div class="info-1">\nName2a <span class="bold">Score2a</span>\n</div>, <div class="info-2">\nName2b <span class="bold">Score2b</span>\n</div>, <div class="info-1">\nName3a <span class="bold">Score3a</span>\n</div>, <div class="info-2">\nName3b <span class="bold">Score3b</span>\n</div>'''
soup = bs4.BeautifulSoup(html, 'lxml')
for div in soup.select('div[class^="info"]'):
    t = tuple(text for text in div.stripped_strings)
    print(t)
out:
('Name1a', 'Score1a')
('Name1b', 'Score1b')
('Name2a', 'Score2a')
('Name2b', 'Score2b')
('Name3a', 'Score3a')
('Name3b', 'Score3b')

handling deeply nested tags in xpath

Please help me!
I don't know how to select a deeply nested tag to get the text inside of it.
Could someone show me how to do it in a single line with an XPath query, and please explain the answer?
Below is some HTML code. Can anybody explain how to display the "Hello world!" text, or whatever else may be in those tags?
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div class="deep">
<span>
<strong class="select">Hello world!</strong>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>

I assume that, since you asked for the text, the node you'd like to match is the strong tag (the only one with content).
If you are guaranteed only one <strong> tag in the document and the level of nesting is irrelevant, the simplest XPath would be:
//strong/text()
To match on the class specifically as well:
//strong[@class="select"]/text()
// selects matching nodes at any depth below the document root, and @ refers to an attribute.
http://www.b624.net/modelare-software-uml-si-xml/laboratoare-an-3-is/xpath-cheat-sheet
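As a quick check that the depth really does not matter, here is a small sketch using the standard library's xml.etree.ElementTree (an assumption; any XPath-capable library would do). ElementTree supports only a subset of XPath and has no text() step, so the element is located with .//strong[@class='select'] and its text is read from the .text attribute. The nested markup is rebuilt programmatically to match the ten wrapper divs in the question.

```python
import xml.etree.ElementTree as ET

# Ten wrapper divs around div.deep, mirroring the question's markup.
INNER = ('<div class="deep"><span>'
         '<strong class="select">Hello world!</strong>'
         '</span></div>')
DOC = "<div>" * 10 + INNER + "</div>" * 10

root = ET.fromstring(DOC)

# ".//" searches all descendants, so no intermediate div has to be spelled out.
strong = root.find(".//strong[@class='select']")
text = strong.text
print(text)
```

With lxml or Scrapy, the full expression //strong[@class="select"]/text() from the answer can be evaluated directly instead.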
