I'm trying to extract specific price data with Scrapy, but the price is not wrapped in a tag.
How can I compose the price out of this HTML?
<div class="value">
<span class="currency">€</span>
39
<span class="decimal">.98</span>
<span class="type">New</span>
</div>
How can I refine this to 39,98?
Solution used:
"".join(response.xpath("//section[#class='order order--new order--left']//div[#class='spec price']//div[#class='value']//text()").re("[0-9.]+"))
Get all the text nodes inside the element and join:
$ scrapy shell file://$PWD/index.html
In [1]: "".join(response.xpath("//div[@class = 'value']//text()").re("[0-9.]+"))
Out[1]: u'39.98'
The .re() call filters the extracted text down to digits and dots only.
If there are multiple values in the document:
In [2]: ["".join(value.xpath(".//text()").re(r"[0-9.]+")) for value in response.xpath("//div[@class = 'value']")]
Out[2]: [u'39.98']
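If you actually want the decimal comma from the question (39,98 rather than 39.98), a plain str.replace on the joined result is enough, e.g.:
price = "".join(response.xpath("//div[@class = 'value']//text()").re(r"[0-9.]+"))
price.replace(".", ",")  # u'39,98'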
I am trying to find a way to find the <span> tag with the highest number in the data-resolution attribute value.
This is what I have managed to do so far.
from bs4 import BeautifulSoup
html_doc = """
<span data-type="video-src" data-url="https://1080.com" data-resolution="1080"></span>
<span data-type="video-src" data-url="https://720.com" data-resolution="720"></span>
<span data-type="video-src" data-url="https://480.com" data-resolution="480"></span>
<span data-type="video-src" data-url="https://360.com" data-resolution="360"></span>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
x = soup.select('span[data-resolution="1080"]')
print(x)
The problem here is that 1080 is a fixed value. I guess I could create a conditional if statement with all numbers I assume will be available, but I would rather just pick the highest number.
So how can I make Beautiful Soup look for the <span> tag with the highest number in the data-resolution attribute value?
You can narrow down to just the elements that carry a data-resolution attribute and then use max with a key on that attribute, since each Tag supports dictionary-style attribute access.
def custom_key(css):
    val = css.get("data-resolution", 0)
    return int(val)

max(soup.select('span[data-resolution]'), key=custom_key)
<span data-resolution="1080" data-type="video-src" data-url="https://1080.com"></span>
There may be better and more succinct ways to do this but here's one idea.
Iterate over all the span elements. Build a tuple consisting of the data-resolution (as an integer) and the entire span element. Append that tuple to a list. Sort the list. Extract the data of interest.
from bs4 import BeautifulSoup
html_doc = """
<span data-type="video-src" data-url="https://1080.com" data-resolution="1080"></span>
<span data-type="video-src" data-url="https://720.com" data-resolution="720"></span>
<span data-type="video-src" data-url="https://480.com" data-resolution="480"></span>
<span data-type="video-src" data-url="https://360.com" data-resolution="360"></span>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
spanlist = []
for span in soup.select('span'):
    spanlist.append((int(span['data-resolution']), span))
print(sorted(spanlist)[-1][1])
Output:
<span data-resolution="1080" data-type="video-src" data-url="https://1080.com"></span>
Or, if you want something that's harder to read but functional:
print(max([(int(span['data-resolution']), span) for span in soup.select('span')])[1])
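One caveat worth noting (my addition, not from the answer): if two spans ever share the same data-resolution, the tuple comparison falls through to comparing the Tag objects themselves, which raises a TypeError on Python 3. Keying the sort on the integer alone avoids that:
# Sort on the resolution only, so equal values never trigger a Tag comparison.
spans = sorted(soup.select('span[data-resolution]'),
               key=lambda span: int(span['data-resolution']))
print(spans[-1])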
Because we all love one-liners:
result = max(soup.select('span[data-resolution]'), key=lambda span: int(span.get('data-resolution')))
I've written an XPath expression to get the highest page number from some HTML elements. However, with the XPath below I'm getting the last text, which in this case is Next Page. I want the XPath to act in such a way that I get the highest number, 6, instead.
The elements upon which the xpath should be applied:
content = """
<div class="nav-links"><span aria-current="page" class="page-numbers current"><span class="meta-nav screen-reader-text">Page </span>1</span>
<a class="page-numbers" href="https://page/2/"><span class="meta-nav screen-reader-text">Page </span>2</a>
<span class="page-numbers dots">…</span>
<a class="page-numbers" href="https://page/6/"><span class="meta-nav screen-reader-text">Page </span>6</a>
<a class="next page-numbers" href="https://page/2/"><span class="screen-reader-text">Next Page</span></a></div>
"""
What I've tried so far:
from lxml.html import fromstring
root = fromstring(content)
pagenum = root.xpath("//*[contains(@class,'page-numbers')][last()]/span")[0].text
print(pagenum)
Output I'm getting:
Next Page
Output I wish to have:
6
You can use exact class name to avoid fetching Next link:
//a[@class="page-numbers"][last()]
Note that contains(@class, 'page-numbers') will match both the numbered links and the Next link, while @class="page-numbers" matches the numbered links only.
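A minimal end-to-end sketch applying this to the question's markup (assuming the HTML is bound to content as above); the "Page " label sits inside the nested <span>, so taking the link's direct text node skips it:
from lxml.html import fromstring

root = fromstring(content)
# Last numbered link; its direct text node is the page number.
pagenum = root.xpath('//a[@class="page-numbers"][last()]/text()')[0]
print(pagenum)  # 6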
I am in the middle of scraping data from a website, but I encountered the following code:
code = "<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> "
I need to extract only "₹ 7,372".
I have tried the following.
1. code.text
but it results in
'\n\n₹ 7,372\xa0\r\n \n–\n\n'
2. code.text.strip()
but it results in
'₹ 7,372\xa0\r\n \n–'
Is there any method to get only "₹ 7,372"? Please let me know, so that I can complete my project.
OK, I managed to clean the data you need. This way is a little ugly, but it works =)
from bs4 import BeautifulSoup as BS
html= """<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li> """
soup = BS(html, 'html.parser')
li = soup.find('li').text
for j in range(3):
    for i in ['\n', ' ', '–', '\xa0', '\r', '\x20', '\x0a', '\x09', '\x0c', '\x0d']:
        li = li.strip(i)
print(li)
output:
₹ 7,372
In the loop list I included all (as far as I know) the ASCII whitespace characters plus the symbols that appear in your output.
The loop runs 3 times because the value isn't fully cleaned on the first pass; you can inspect it after each iteration in the variable explorer.
Optionally, you can also try to figure out which exact character produces all the pseudo-spaces between the <span> tags.
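A simpler alternative (my suggestion, not part of the answer above): str.strip accepts a whole set of characters and removes any of them from both ends in a single call, so the nested loop isn't needed:
# strip() with a character set removes every listed character from both ends.
li = soup.find('li').text.strip('\n\r\t \xa0–')
print(li)  # ₹ 7,372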
from bs4 import BeautifulSoup as bs
code = '''<li class="price-current">
<span class="price-current-label">
</span>₹ 7,372
<span class="price-current-range">
<abbr title="to">–</abbr>
</span>
</li>'''
soup = bs(code, 'html.parser')
w = soup.find_all('li')
l = []
for item in w:
    l.append(item)
words = str(l)
t = words.split('\n')
print(t[2][7:])
Output:
₹ 7,372
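Both answers depend on the exact layout of this snippet. A more targeted route (my addition, reusing the soup from the first answer) is to take the text node that immediately follows the label span:
label = soup.find('span', class_='price-current-label')
# The price is the bare text node right after the closing </span>.
print(label.next_sibling.strip())  # ₹ 7,372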
I'm really new to learning Python, so this could be really obvious, but I have extracted a NavigableString from BeautifulSoup and I need to find data in the string. However, it's not as easy as some of the examples I've seen online.
My end goal is to create a dictionary that looks something like this:
dict = {'Fandom':'Undertale (Video Game)', 'Works':15341}
Here are two examples of the strings:
<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
I've already succeeded in extracting the fandom from the string, but now I need the works count in parentheses. How would I use Beautiful Soup and/or regular expressions to do this?
I also need to do error handling because, while a fandom will always be displayed, it may not have a works count next to it.
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>
Here's the relevant pieces of code:
for each_f in cate:
    #print(each_f)
    result = each_f.find('a')
    if result != -1:
        #here is where I grab the Fandom vals
        fandom_name = result.contents
        #print(result.contents)
NOTE: I know I'm missing the code to append to the dictionary; I haven't made it that far yet. I'm just trying to get the values to print to the screen.
Use dict.fromkeys(('Fandom', 'Works')) to get a dict whose values default to None:
In [17]: dict.fromkeys(('Fandom', 'Works'))
Out[17]: {'Fandom': None, 'Works': None}
Use zip to combine the keys with the strings in the li tag; zip only pairs up to the shortest sequence:
zip(('Fandom', 'Works'),li.stripped_strings)
[('Fandom', 'Undertale (Video Game)'), ('Works', '(15341)')]
[('Fandom', 'Sherlock Holmes & Related Fandoms'), ('Works', '(101015)')]
[('Fandom', 'Composer - Fandom')]
Then we update the dict with that data:
In [20]: for li in soup.find_all('li'):
    ...:     d = dict.fromkeys(('Fandom', 'Works'))
    ...:     out = zip(('Fandom', 'Works'), li.stripped_strings)
    ...:     d.update(out)
    ...:     print(d)
out:
{'Works': '(15341)', 'Fandom': 'Undertale (Video Game)'}
{'Works': '(101015)', 'Fandom': 'Sherlock Holmes & Related Fandoms'}
{'Works': None, 'Fandom': 'Composer - Fandom'}
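The question's target dict has Works as a plain integer, so a small post-processing step (my addition) can strip the parentheses and convert:
for li in soup.find_all('li'):
    d = dict.fromkeys(('Fandom', 'Works'))
    d.update(zip(('Fandom', 'Works'), li.stripped_strings))
    if d['Works']:
        # '(15341)' -> 15341; left as None when no count is present.
        d['Works'] = int(d['Works'].strip('()'))
    print(d)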
You can use stripped_strings and unpack the values to get your blocks of text. You can store the results in a dict so that you can use them later.
Example:
from bs4 import BeautifulSoup
example = """<li>
<a class="tag" href="/tags/Undertale%20(Video%20Game)/works">Undertale (Video Game)</a>
(15341)
</li>
<li><a class="tag" href="/tags/Sherlock%20Holmes%20*a*%20Related%20Fandoms/works">Sherlock Holmes & Related Fandoms</a>
(101015)
</li>
<li>
<a class="tag" href="/tags/Composer%20-%20Fandom/works">Composer - Fandom</a>
</li>"""
soup = BeautifulSoup(example, "html.parser")
Fandom = {"Fandom" : []}
for li in soup.find_all("li"):
try:
fandom, count = li.stripped_strings
Fandom["Fandom"].append({fandom.strip() : count[1:-1]})
except:
fandom = li.text.strip()
Fandom["Fandom"].append({fandom.strip() : 0})
print (Fandom)
This outputs:
{'Fandom': [{'Undertale (Video Game)': '15341'}, {'Sherlock Holmes & Related Fandoms': '101015'}, {'Composer - Fandom': 0}]}
The try/except catches any unpacking that doesn't contain exactly two values: the fandom title and the works count.
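Since the question also asks about regular expressions, here is an alternative sketch (my addition) that pulls the parenthesized count with re, reusing the same example markup as above:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(example, "html.parser")
for li in soup.find_all("li"):
    name = li.a.get_text(strip=True)
    # The works count, when present, is the parenthesized number in the li text.
    m = re.search(r"\((\d+)\)", li.get_text())
    works = int(m.group(1)) if m else 0
    print({"Fandom": name, "Works": works})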
So I am in the scrapy shell, and one of the selectors I try yields this:
>>> response.css(".result-title a").extract()
[u'<em>Membership warehouse</em>', u'Publix', u'Kroger Pharmacy']
Now to get only the text I do this :
>>> response.css(".result-title a::text").extract()
[u'Publix', u'Kroger Pharmacy']
As is evident, this omits the first element, which has an additional em tag around the text. How do I extract that as well, along with the normal text, so that my final output is:
[u'Membership warehouse', u'Publix', u'Kroger Pharmacy']
I would find all the text nodes at any level inside a and "join" them:
for a in response.css(".result-title a"):
    print("".join(a.xpath(".//text()").extract()))
Demo:
$ cat index.html
<div class="result-title">
<em>Membership warehouse</em>
Publix
Kroger Pharmacy
</div>
$ scrapy shell index.html
In [1]: for a in response.css(".result-title a"):
   ...:     print("".join(a.xpath(".//text()").extract()))
   ...:
Membership warehouse
Publix
Kroger Pharmacy
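A shorter equivalent (assuming your Scrapy version's selectors support it, which they have for a long time) is the XPath string() function, which concatenates all descendant text for each matched element:
response.css(".result-title a").xpath("string(.)").extract()
# [u'Membership warehouse', u'Publix', u'Kroger Pharmacy']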