Python Scrapy can't get pseudo class ":not()" - python

It's OK to write
content.css('.text>p::text').extract()
But
content.css('.text:not(.text .text)>p::text').extract()
will not work.
It tells me:
SelectorSyntaxError: Expected ')', got <S ' ' at 15>
Yes, the 15th letter in the '.text:not(.text .text)>p::text' is ' ', but how can I express this meaning without using a ' '?
Update
There are nested <div class='text'>s, I want to extract all the <p>s right beneath the first <div class='text'>.
For example:
<div class='text comment'>
<strong>abc</strong>
<span>def</span>
<p>xxxxxxxxxxxxx</p>
<p>xxxxxxxxxxxxxxxxxxxxxxxxxxx</p>
<div class='text sub_comment'>
<strong>lst</strong>
<span>lll</span>
<p>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</p>
<p>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</p>
</div>
</div>
I want to get the text of the first two <p> tags. I can't use .comment and .sub_comment to distinguish them because those class names vary from case to case: the outer tag is not always comment and the inner tag is not always sub_comment.

How about trying nth-child(1)?
So your css would be:
".text:nth-child(1)>p"
In Scrapy:
In [54]: from scrapy import Selector
In [55]: a
Out[55]: u"<div><div class='text comment'> <strong>abc</strong> <span>def</span> <p>xxxxxxxxxxxxx</p> <p>xxxxxxxxxxxxxxxxxxxxxxxxxxx</p> <div class='text sub_comment'> <strong>lst</strong> <span>lll</span> <p>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</p> <p>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</p> </div></div></div>"
In [56]: sel = Selector(text=a)
In [57]: sel.css(".text:nth-child(1)>p::text").extract()
Out[57]: [u'xxxxxxxxxxxxx', u'xxxxxxxxxxxxxxxxxxxxxxxxxxx']
There is a nice explanation and demo of nth-child in this tutorial (scroll down to paragraph 22).
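Note that :nth-child(1) works on the sample only because the nested div is not the first child of its parent. A position-independent way to express the same idea is to pick the outermost .text div first and then take only its direct <p> children. Below is a minimal sketch of that idea using the standard-library xml.etree.ElementTree rather than Scrapy, assuming the markup is well-formed; the has_class helper and the shortened paragraph texts are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the markup from the question, wrapped in one root element.
markup = """<div>
<div class="text comment">
<strong>abc</strong>
<span>def</span>
<p>first</p>
<p>second</p>
<div class="text sub_comment">
<strong>lst</strong>
<span>lll</span>
<p>nested one</p>
<p>nested two</p>
</div>
</div>
</div>"""

root = ET.fromstring(markup)


def has_class(element, name):
    """Hypothetical helper: True if `name` is one of the element's class tokens."""
    return name in element.get("class", "").split()


# iter() walks in document order, so the first hit is the outermost .text div.
outer = next(div for div in root.iter("div") if has_class(div, "text"))

# findall("p") only looks at direct children, so the nested <p> tags are skipped.
texts = [p.text for p in outer.findall("p")]
print(texts)
```

This does not depend on where the nested div sits among its siblings, which is the case the nth-child(1) selector cannot cover.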

Related

Dynamically matching a string that starts with a substring

I need to dynamically match a string that starts with forsale_. Here, I'm finding it by hardcoding the characters that follow, but I'd like to do this dynamically:
for_sale = response.html.find('span.forsale_QoVFl > a', first=True)
I tried using startswith(), but I'm not sure how to implement it.
Sample response.html:
<section id="release-marketplace" class="section_9nUx6 open_BZ6Zt">
<header class="header_W2hzl">
<div class="header_3eShg">
<h3>Marketplace</h3>
<span class="forsale_QoVFl">2 For Sale from <span class="price_2Wkos">$355.92</span></span>
</div>
</header>
<div class="content_1TFzi">
<div class="buttons_1G_mP">Buy CDSell CD</div>
</div>
</section>
startswith() is straightforward. x = txt.startswith("forsale_") will return a bool, where txt is the string you want to test.
For more involved pattern matching, you want to look at regular expressions. Something like this is the equivalent of the startswith() line above:
import re
txt = "forsale_arbitrarychars"
x = re.search("^forsale_", txt)
If you were to replace ^forsale_ with something like ^forsale_[0-9]*$, it would only accept digits (or nothing at all) after the underscore.
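To make the comparison concrete, here is a small sketch that applies both checks to the tokens of a class attribute. The class_attr value is an assumed example (the highlight token is invented), not something taken from the page.

```python
import re

# Assumed example value of a class attribute with more than one token.
class_attr = "forsale_QoVFl highlight"
tokens = class_attr.split()

# Plain prefix test on each class token.
by_startswith = [t for t in tokens if t.startswith("forsale_")]

# Equivalent regex test; re.match anchors at the start of the string.
by_regex = [t for t in tokens if re.match(r"forsale_", t)]

# A stricter pattern that only allows digits after the underscore.
digits_only = re.compile(r"forsale_[0-9]*")
strict_hits = [t for t in ["forsale_123", "forsale_QoVFl"] if digits_only.fullmatch(t)]

print(by_startswith, by_regex, strict_hits)
```

Splitting the attribute first matters when an element carries several classes, because a prefix test on the whole attribute string only sees the first token.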
I assume your final expected output is the link in the target <span>. If so, I would do it using lxml and xpath:
import lxml.html as lh
sale = """[your html above]"""
doc = lh.fromstring(sale)
print(doc.xpath('//span[@class[starts-with(.,"forsale_")]]/a/@href')[0])
Output:
/sell/release/XXX

Find unknown tag containing given text

My HTML is like :
<body>
<div class="afds">
<span class="dfsdf">mytext</span>
</div>
<div class="sdf dzf">
<h1>some random text</h1>
</div>
</body>
I want to find all tags containing "text" & their corresponding classes. In this case, I want:
span, "dfsdf"
h1, null
Next, I want to be able to navigate through the returned tags. For example, find the div parent tag & respective classes of all the returned tags.
If I execute the following
soupx.find_all(text=re.compile(".*text.*"))
it simply returns the text part of the tags:
['mytext', ' some random text']
Please help.
You are probably looking for something along these lines:
ts = soup.find_all(text=re.compile(".*text.*"))
for t in ts:
    if len(t.parent.attrs)>0:
        for k in t.parent.attrs.keys():
            print(t.parent.name,t.parent.attrs[k][0])
    else:
        print(t.parent.name,"null")
Output:
span dfsdf
h1 null
find_all() does not return plain strings here; it returns bs4.element.NavigableString objects.
That means you can call other beautifulsoup functions on those results.
Have a look at find_parent and find_parents: documentation
childs = soupx.find_all(text=re.compile(".*text.*"))
for c in childs:
    c.find_parent("div")

xpath to get lists of element in Python

I am trying to scrape lists of elements from a page that looks like this:
<div class="container">
<b>1</b>
<b>2</b>
<b>3</b>
</div>
<div class="container">
<b>4</b>
<b>5</b>
<b>6</b>
</div>
I would like to get lists or tuples using xpath: [1,2,3],[4,5,6]...
Using for loop on the page I get either the first element of each list or all numbers as one list.
Could you please help me to solve the exercise?
Thank you in advance for any help!
For scraping static pages, bs4 is a good package to work with. Using bs4, you can achieve your goal as easily as this:
from bs4 import BeautifulSoup
source = """<div class="container">
<b>1</b>
<b>2</b>
<b>3</b>
</div>
<div class="container">
<b>4</b>
<b>5</b>
<b>6</b>
</div>"""
soup = BeautifulSoup(source, 'html.parser') # parse content/ page source
soup.find_all('div', {'class': 'container'}) # find all the div element (second argument is optional mentioned to scrape/find only element with attribute value)
print([[int(x.text) for x in i.find_all('b')] for i in soup.find_all('div', {'class': 'container'})]) # get list of all div's number list as you require
Output:
[[1, 2, 3], [4, 5, 6]]
You could use this XPath expression, which will give you two strings:
.//*[@class='container'] ➡ '1 2 3', '4 5 6'
If you would prefer six strings:
.//*[@class='container']/b ➡ '1','2','3','4','5','6'
To get exactly what you are looking for, though, you would have to use a separate XPath expression for each container:
.//*[@class='container'][1]/b ➡ '1','2','3'
.//*[@class='container'][2]/b ➡ '4','5','6'
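Rather than hard-coding the positions [1], [2], and so on, you can select the containers first and then run a relative query inside each one. Here is a minimal sketch using the standard-library xml.etree.ElementTree, which supports only a small XPath subset; it assumes the page is well-formed and wraps the two divs in a made-up root element so the snippet parses.

```python
import xml.etree.ElementTree as ET

# The markup from the question, wrapped in a single root element.
source = """<root>
<div class="container">
<b>1</b>
<b>2</b>
<b>3</b>
</div>
<div class="container">
<b>4</b>
<b>5</b>
<b>6</b>
</div>
</root>"""

root = ET.fromstring(source)

# Outer query: every container div. Inner query: the <b> children of that div only.
groups = [
    [int(b.text) for b in container.findall("b")]
    for container in root.findall(".//div[@class='container']")
]
print(groups)
```

The inner query is relative to each container, which is what keeps the numbers grouped instead of flattened into one list.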

Can I do a findall regular expression like this?

So I need to grab the numbers after lines looking like this
<div class="gridbarvalue color_blue">79</div>
and
<div class="gridbarvalue color_red">79</div>
Is there a way I can do a findAll('div', text=re.compile('<>')) where I would find tags with gridbarvalue color_<red or blue>?
I'm using beautifulsoup.
Also sorry if I'm not making my question clear, I'm pretty inexperienced with this.
class is a Python keyword, so BeautifulSoup expects you to put an underscore after it when using it as a keyword parameter
>>> soup.find_all('div', class_=re.compile(r'color_(?:red|blue)'))
[<div class="gridbarvalue color_blue">79</div>, <div class="gridbarvalue color_red">79</div>]
To also match the text, use
>>> soup.find_all('div', class_=re.compile(r'color_(?:red|blue)'), text='79')
[<div class="gridbarvalue color_blue">79</div>, <div class="gridbarvalue color_red">79</div>]
import re
elems = soup.findAll(attrs={'class' : re.compile(r"color_(blue|red)")})
for e in elems:
    # grab the digits between the tag's '>' and the following '<'
    m = re.search(r">(\d+)<", str(e))
    print("The number is %s" % m.group(1))
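If BeautifulSoup is not available, the same filtering can be sketched with the standard-library html.parser. This is an illustration under assumptions, not a replacement for the answers above: the GridValueParser class and the color_green line are invented for the example, and the parser does not handle nested divs.

```python
import re
from html.parser import HTMLParser


class GridValueParser(HTMLParser):
    """Collect the numbers inside <div> tags that have a color_red or color_blue class."""

    def __init__(self):
        super().__init__()
        self.pattern = re.compile(r"color_(?:red|blue)")
        self.capturing = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # Test each class token separately so unrelated tokens cannot match.
        classes = (dict(attrs).get("class") or "").split()
        self.capturing = any(self.pattern.fullmatch(c) for c in classes)

    def handle_data(self, data):
        # Only keep non-empty text found inside a matching div.
        if self.capturing and data.strip():
            self.values.append(int(data.strip()))

    def handle_endtag(self, tag):
        if tag == "div":
            self.capturing = False


html = """
<div class="gridbarvalue color_blue">79</div>
<div class="gridbarvalue color_green">12</div>
<div class="gridbarvalue color_red">81</div>
"""

parser = GridValueParser()
parser.feed(html)
values = parser.values
print(values)
```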

Filtering xml file to remove lines with certain text in them?

For example, suppose I have:
<div class="info"><p><b>Orange</b>, <b>One</b>, ...
<div class="info"><p><b>Blue</b>, <b>Two</b>, ...
<div class="info"><p><b>Red</b>, <b>Three</b>, ...
<div class="info"><p><b>Yellow</b>, <b>Four</b>, ...
And I'd like to remove all lines that have words from a list so I'll only use xpath on the lines that fit my criteria. For example, I could use the list as ['Orange', 'Red'] to mark the unwanted lines, so in the above example I'd only want to use lines 2 and 4 for further processing.
How can I do this?
Use:
//div
[not(p/b[contains('|Orange|Red|',
concat('|', ., '|')
)
]
)
]
This selects every div element in the XML document that has no p child with a b child whose string value is one of the strings in the pipe-separated filter list.
This approach allows extensibility by just adding new filter values to the pipe-separated list, without changing anything else in the XPath expression.
Note: When the structure of the XML document is statically known, always avoid using the // XPath pseudo-operator, because it leads to significant inefficiency (slowdown).
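The reason for wrapping both the list and the tested value in pipes is that a plain substring test would also match partial words. The same logic can be sketched in Python; the is_excluded function name is made up for illustration.

```python
def is_excluded(value, excluded):
    """Mirror of contains('|Orange|Red|', concat('|', ., '|')) from the XPath above."""
    haystack = "|" + "|".join(excluded) + "|"
    return "|" + value + "|" in haystack


excluded = ["Orange", "Red"]

# Whole-word matches are found, unrelated values are not.
results = {word: is_excluded(word, excluded) for word in ["Orange", "Blue", "Red", "Re"]}

# Without the delimiters, a fragment such as "Re" would match by accident.
naive_hit = "Re" in "|".join(excluded)

print(results, naive_hit)
```

The trick relies on the delimiter never appearing inside a value, so pick a character that cannot occur in the data.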
import lxml.html as lh
# http://lxml.de/xpathxslt.html
# http://exslt.org/regexp/functions/match/index.html
content='''\
<table>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</table>
'''
NS = 'http://exslt.org/regular-expressions'
tree = lh.fromstring(content)
exclude=['Orange','Red']
for elt in tree.xpath(
        "//div[not(re:test(p/b[1]/text(), '{0}'))]".format('|'.join(exclude)),
        namespaces={'re': NS}):
    print(lh.tostring(elt))
    print('-'*80)
yields
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
--------------------------------------------------------------------------------
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
--------------------------------------------------------------------------------
