Scrapy Scrape element within unknown number of <div>


I am trying to scrape a list of websites on Shopee. Some examples include dudesgadget and 2ubest. Each of these Shopee shops has a different design, a different way of constructing its web elements, and a different domain as well. They look like standalone websites, but they actually are not.
So the main problem is that I am trying to scrape the product details. I will summarize some of the different structures:
2ubest
<html>
<body>
  <div id="shopify-section-announcement-bar" id="shopify-section-announcement-bar">
    <main class="wrapper main-content" role="main">
      <div class="grid">
        <div class="grid__item">
          <div id="shopify-section-product-template" class="shopify-section">
            <script id="ProductJson-product-template" type="application/json">
              //Things I am looking for
            </script>
          </div>
        </div>
      </div>
    </main>
  </div>
</body>
</html>
littleplayland
<html>
<body id="adjustable-ergonomic-laptop-stand" class="template-product">
  <script>
    //Things I am looking for
  </script>
</body>
</html>
There are a few others, and I have discovered a pattern among them:
The thing I am looking for is always inside <body>.
The thing I am looking for is within a <script>.
The only thing I am not sure about is the distance from <body> to <script>.
My solution is:
def parse(self, response):
    body = response.xpath("//body")
    for script in body.xpath("//script/text()").extract():
        # Manipulate the script with js2xml here
        pass
I am able to extract the data from littleplayland, dailysteals and many others where there is very little distance between <body> and <script>, but it does not work for 2ubest, which has many other HTML elements between <body> and the thing I am looking for. Is there a solution that ignores all the HTML elements in between and only looks for the <script> tag?
I need a single generic solution that works across all Shopee websites if possible, since all of them share the characteristics I mentioned above.
This means the solution should not filter on <div>, because every website has a different number of <div> elements.

This is how to get the scripts in your HTML using Scrapy:
import scrapy
# text holds the HTML of the page as a string
scriptTagSelector = scrapy.Selector(text=text, type="html")
theScripts = scriptTagSelector.xpath("//script/text()").extract()
for script in theScripts:
    # Manipulate the script with js2xml here
    print("------->A SCRIPT STARTS HERE<--------")
    print(script)
    print("------->A SCRIPT ENDS HERE<--------")
Here is an example with the HTML in your question (I added an extra script :) ):
import scrapy
text="""<html>
<body>
<div id="shopify-section-announcement-bar" id="shopify-section-announcement-bar">
<main class="wrapper main-content" role="main">
<div class="grid">
<div class="grid__item">
<div id="shopify-section-product-template" class="shopify-section">
<script id="ProductJson-product-template" type="application/json">
//Things I am looking for
</script>
</div>
<script id="script 2">I am another script</script>
</div>
</div>
</main>
</div>
</body>
</html>"""
scriptTagSelector = scrapy.Selector(text=text, type="html")
theScripts = scriptTagSelector.xpath("//script/text()").extract()
for script in theScripts:
    # Manipulate the script with js2xml here
    print("------->A SCRIPT STARTS HERE<--------")
    print(script)
    print("------->A SCRIPT ENDS HERE<--------")

Try this:
//body//script/text()
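To illustrate why the number of wrapper <div>s does not matter, here is a minimal sketch of the same idea using only the standard library. It assumes a well-formed snippet so that xml.etree.ElementTree can parse it (real pages need an HTML parser such as Scrapy's Selector), and the JSON payload is invented for the example. The path .//script plays the role of //body//script: it matches at any depth, while plain script matches direct children only.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a product page; the JSON payload is invented.
PAGE = """<html><body>
<div><main><div><div>
<script id="ProductJson-product-template" type="application/json">{"title": "Laptop stand", "price": 1999}</script>
</div></div></main></div>
</body></html>"""

root = ET.fromstring(PAGE)
body = root.find("body")

# "script" matches direct children of <body> only; ".//script" matches any depth.
direct_children = body.findall("script")
any_depth = body.findall(".//script")

# Keep only the scripts that carry JSON, then decode them.
products = [
    json.loads(s.text)
    for s in any_depth
    if s.get("type") == "application/json"
]

print(len(direct_children), len(any_depth), products[0]["title"])
```

With Scrapy itself, the corresponding selector would be the //body//script/text() expression given above, possibly narrowed with a predicate on the type attribute.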

Related

Add missing paragraph tags to HTML

I'm processing some medium fancy HTML pages to convert to simpler XHTML ones. The source pages have several divs (that I'm removing), that contain text not inside <p> tags. I need to add these <p> tags.
Here is a minimal example of the source page
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>
I want to convert it to
<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<p>This is a sample page</p>
<p>Lots of things to learn!</p>
<p>And lots to test</p>
<p>Enough with the sample code</p>
</body>
</html>
I'm developing a Python script using BeautifulSoup4 to do all of this, and I'm stuck at this step. It looks like more of a regex job to locate the text to wrap in <p> tags and pass it to BeautifulSoup4. What do you think is the best approach to the problem?
I've scanned several pages and have seen these stray texts at the start of divs, but I can't rule out that there are more of them in random places around the pages (i.e. a script that only checks the start of divs probably won't be reliable).
Notice the <br/> tags, which have to be used to split the text into <p> paragraphs.

This script unwraps every tag inside <body> except <p> (which also removes the <br/> tags), then wraps each remaining loose text node in a new <p>:
from bs4 import BeautifulSoup
txt = '''<!DOCTYPE html>
<html>
<body>
<p>Hello world!</p>
<div style="font-weight: bold;">
This is a sample page
<br/>
Lots of things to learn!
<p>And lots to test</p>
</div>
<p>Enough with the sample code</p>
</body>
</html>'''
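soup = BeautifulSoup(txt, 'html.parser')
# Unwrap every tag that is not a <p>; this also removes the empty <br/> tags
for tag in soup.body.find_all(lambda tag: tag.name != 'p'):
    tag.unwrap()
# Wrap each remaining loose, non-blank text node in a new <p>
for node in soup.body.find_all(string=True):
    if node.find_parent('p') or node.strip() == '':
        continue
    node.wrap(soup.new_tag("p"))
print(soup.prettify())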
Prints:
<!DOCTYPE html>
<html>
<body>
<p>
Hello world!
</p>
<p>
This is a sample page
</p>
<p>
Lots of things to learn!
</p>
<p>
And lots to test
</p>
<p>
Enough with the sample code
</p>
</body>
</html>

How to remove tags with different heads and tails in BeautifulSoup?

I'm scraping a webpage using BeautifulSoup. While cleaning the HTML I encountered a block whose opening and closing markers differ:
<!-- BEGIN mobile-middle-rectangle -->
<div class="mobile-bottom-rectangle hidden-sm hidden-md hidden-lg">
<div class="textrule">
<span>advertisement</span>
</div>
<!-- /1002721/ScienceDaily_Mobile_Bottom_Rectangle -->
<div id="adslot-mobile-bottom-rectangle">
<script type="text/javascript">
googletag.cmd.push(function() {
deployads.push(function() { deployads.gpt.display("adslot-mobile-bottom-rectangle") });
});
</script>
</div>
<hr class="hrrule">
</div>
<!-- END mobile-middle-rectangle -->
How can I remove it?

Generally, you should use regex.
If you want to remove everything between these comment markers (<!-- BEGIN/END whatever -->), including the markers themselves, use this:
import re
cleaned = re.sub(r"<!-- BEGIN.*>\n(<?.+>?\s+)+<!-- END.*-->", "", html)
If you want to keep the content between the markers, use this:
cleaned = re.sub(r"<!-- BEGIN.*>\n((<?.+>?\s+)+)<!-- END.*-->", r"\1", html)
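The patterns above depend on line breaks and nested quantifiers. A simpler variant, offered as a sketch rather than a drop-in replacement, pairs each BEGIN comment with the END comment of the same name through a backreference and lets re.DOTALL span the lines in between. The sample string is a shortened, hypothetical version of the markup in the question.

```python
import re

# Shortened, hypothetical version of the markup from the question.
html = (
    "before\n"
    "<!-- BEGIN mobile-middle-rectangle -->\n"
    "<div>advertisement</div>\n"
    "<!-- END mobile-middle-rectangle -->\n"
    "after"
)

# (\S+) captures the block name; \1 requires the END comment to carry the same name.
# .*? is non-greedy, so each BEGIN pairs with the nearest matching END.
BLOCK = re.compile(r"<!-- BEGIN (\S+) -->(.*?)<!-- END \1 -->", re.DOTALL)

removed = BLOCK.sub("", html)        # drop the markers and everything between them
unwrapped = BLOCK.sub(r"\2", html)   # drop only the markers, keep the content

print(repr(removed))
print(repr(unwrapped))
```

Since the block name is captured, a page with several differently named BEGIN/END pairs is handled pair by pair instead of being swallowed from the first BEGIN to the last END.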

Getting the next UL element using BeautifulSoup

I'm trying to find the next ul element in a given webpage.
I start by plugging in my response into Beautiful Soup like so:
soup = BeautifulSoup(response.context)
printing out response.context gives the following
print(response.context)
<!DOCTYPE html>
<html>
<head>
<title> | FollowUp</title>
<meta name='viewport' content='width=device-width, initial-scale=1.0'>
<link href='/static/css/bootstrap.min.css' rel='stylesheet' media='screen'>
</head>
<body>
<div class='navbar'>
<div class='navbar-inner'>
<a class='brand' href='/'>TellMe.cat</a>
<ul class='nav'>
<li><a href='list'>My Stories</a></li>
<li><a href='add'>Add Story</a></li>
<li><a href='respond'>Add Update</a></li>
</ul>
<form class='navbar-form pull-right' action='process_logout' method='post'>
<input type='hidden' name='csrfmiddlewaretoken' value='RxquwEsaS5Bn1MsKOIJP8uLtRZ9yDusH' />
Hello add!
<button class='btn btn-small'>Logout</button>
</form>
</div>
</div>
<div class='container'>
<ul id='items'>
<ul>
<li><a href='http://www.example.org'>http://www.example.org</a></li>
<ul>
<p>There have been no follow ups.</p>
</ul>
</ul>
</ul>
</div>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script src='/static/js/bootstrap.min.js'></script>
</body>
</html>
I'm trying to get the ul that's named 'items'. I do so with:
items = soup.find(id='items')
Which gives me the correct ul and all of its children. However calling
items.find_next('ul')
Gives the error of
TypeError: 'NoneType' object is not callable
Even though this seems to be how it's supposed to be called according to the Beautiful Soup docs: https://beautiful-soup-4.readthedocs.org/en/latest/#find-all-next-and-find-next
What am I doing incorrectly?

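Make a virtualenv, run pip install BeautifulSoup requests, and open a Python console:
import BeautifulSoup
import requests
html = requests.get("http://yahoo.com").text
b = BeautifulSoup.BeautifulSoup(html)
m = b.find(id='masthead')
item = m.findNext('ul')
dir(m) lists the attributes and methods available on m; there you can see that the method you want is findNext. This is the BeautifulSoup 3 API: find_next only exists in bs4, and BeautifulSoup 3 treats an unknown attribute such as find_next as a lookup for a child tag of that name, which returns None, hence the TypeError.
You might also find IPython a more forgiving shell for running Python. You can type the name of a variable followed by a dot and hit Tab to see its members.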
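The TypeError can be reproduced without BeautifulSoup. The toy class below is a hypothetical stand-in, not the library's code; it only mimics the BeautifulSoup 3 convention of resolving unknown attributes as child-tag lookups, which is an assumption about what is happening in the question.

```python
class ToyTag:
    """Hypothetical stand-in that mimics attribute-as-child-tag lookup."""

    def __init__(self, children=None):
        self.children = children or {}

    def find(self, name):
        # Return the child registered under this tag name, or None.
        return self.children.get(name)

    def findNext(self, name):
        # Placeholder for the real method; here it simply delegates to find().
        return self.find(name)

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for a name the class
        # does not define, such as find_next.
        return self.find(name)


items = ToyTag({"ul": "<ul>...</ul>"})

try:
    # Resolves to find("find_next"), which is None, and then calls None("ul").
    items.find_next("ul")
except TypeError as exc:
    message = str(exc)

print(message)
print(items.findNext("ul"))
```

The defined method findNext works, while the undefined name find_next silently turns into None before the call is attempted.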

Rendering an iframe from HTML stored in the DB

I have a HTML template that I'm rendering using Django. I'm passing in all the necessary context variables — my_heading is the heading, my_html is the mark-safed html. I'd like to display my_html in the iframe.
<html>
<head>
<title>Iframe Example</title>
</head>
<body>
<p>{{ my_heading }}</p>
<iframe name="iframe1" width="600" height="400" src="http://www.yahoo.com" frameborder="yes" scrolling="yes"></iframe>
</body>
</html>
Would you know how to do this? All the examples I've found show the iframe pointing to a URL. I'm god-horrible at HTML and JS. :|
Thanks

Michael Klocker's answer does not answer your question directly, but the solution he gives is a better one. You should do that instead.
But to answer your question: an iframe normally loads its content from a URL, so if you really want to use an iframe, you need to set up a second URL and view in Django that returns my_html as the response. That means two HTTP requests will happen: one for the page containing the iframe, and one for the contents of the iframe.

If I understand you correctly, you just want to show the HTML content that you prepared dynamically with Python within the Django template, correct? If that is the case, you do not need an iframe. Iframes are only necessary if you would like to embed part of a different URI (web page) as a frame (section) on your page. Here is a sample:
<html>
<head>
<title>Iframe Example</title>
</head>
<body>
<p>{{ my_heading }}</p>
<div id="content">{{ my_html }}</div>
</body>
</html>
An addition to the above code, based on the comment below: it sounds like you want to show an entire page in an iframe, and this second page comes from another view/URL on your box. Then use:
<html>
<head>
<title>Iframe Example</title>
</head>
<body>
<p>{{ my_heading }}</p>
<iframe src="URL_TO_MY_2ND_VIEW_IN_DJANGO" width="100%" height="300"></iframe>
</body>
</html>
Iframe explanation on W3Schools
Iframe description on W3C Site

I used this hack to do it. The div is hidden and contains the HTML. The IFrame doesn't point to a location because I don't want any CORS issues. I then inject my HTML into it.
<html>
<head>
<title>Iframe Example</title>
</head>
<body>
<p>{{ my_heading }}</p>
<iframe name="iframe2" id="iframe2" width="0" height="0" src="" style="border:0px; overflow-x:hidden;"></iframe>
<div id="page" style="display:none" >{{ my_html }}</div>
<script type="text/javascript">
function getWindow(iframe) {
return (iframe.contentWindow) ? iframe.contentWindow : (iframe.contentDocument.document) ? iframe.contentDocument.document : iframe.contentDocument;
}
getWindow(document.getElementById('iframe2')).document.open();
getWindow(document.getElementById('iframe2')).document.write(document.getElementById('page').innerHTML);
getWindow(document.getElementById('iframe2')).document.close();
document.getElementById('iframe2').style.height = (getWindow(document.getElementById('iframe2')).document.body.scrollHeight + 20) +"px";
document.getElementById('iframe2').style.width = (document.getElementById('iframe2').parentNode.offsetWidth) +"px";
</script>
</body>
</html>
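A possible alternative to injecting the markup with document.write, assuming the target browsers support the HTML5 srcdoc attribute: the iframe's document can be supplied inline, as long as it is escaped for use inside an attribute value. The helper name iframe_with_srcdoc and the sample markup are made up for this sketch.

```python
from html import escape


def iframe_with_srcdoc(inner_html, width=600, height=400):
    """Return an <iframe> tag whose document is supplied inline via srcdoc."""
    # escape(..., quote=True) turns &, <, > and quotes into entities so the
    # markup survives inside a double-quoted attribute value.
    return '<iframe width="{}" height="{}" srcdoc="{}"></iframe>'.format(
        width, height, escape(inner_html, quote=True)
    )


my_html = '<p class="note">Stored in the DB</p>'  # hypothetical stored markup
tag = iframe_with_srcdoc(my_html)
print(tag)
```

In a Django template, leaving autoescaping on for {{ my_html }} (that is, not marking the value safe) inside srcdoc="..." should produce the same escaping.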

Editing tree in place while iterating in lxml

I am using lxml to parse html and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the javascript DOM - I know this is not really the intended use, but much of it works well so far.
Currently, I use iterdescendants() to get an iterable of elements and then deal with each in turn.
However, if an element is dropped during the iteration, its children are still considered, since the dropping does not affect the iteration, as you would expect. In order to get the results I want, this hack works:
from lxml.html import fromstring, tostring
import re
html = '''
<html>
<head>
</head>
<body>
<div>
<p class="unwanted">This content should go</p>
<p class="fine">This content should stay</p>
</div>
<div id = "second" class="unwanted">
<p class = "alreadydead">This content should not be looked at</p>
<p class = "alreadydead">Nor should this</p>
<div class="alreadydead">
<p class="alreadydead">Still dead</p>
</div>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
</html>
'''
# Parse the document and get an iterator over all descendants
page = fromstring(html)
allElements = page.iterdescendants()
for element in allElements:
    s = "%s%s" % (element.get('class', ''), element.get('id', ''))
    if re.compile('unwanted').search(s):
        # Skip every descendant of the unwanted element before dropping it
        for i in range(len(element.findall('.//*'))):
            next(allElements)
        element.drop_tree()
print(tostring(page.body, encoding='unicode'))
This outputs:
<body>
<div>
<p class="fine">This content should stay</p>
</div>
<div>
<p class="yeswanted">This content should also stay</p>
</div>
</body>
This feels like a nasty hack - is there a more sensible way to achieve this using the library?

To simplify things you can use lxml's support for regular expressions within an XPath to find and kill the unwanted nodes without needing to iterate over all descendants.
This produces the same result as your script:
import lxml.html
EXSLT_NS = 'http://exslt.org/regular-expressions'
# Match any element whose class or id contains the whole word "unwanted"
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"
tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
    node.drop_tree()
print(lxml.html.tostring(tree.body, encoding='unicode'))
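If the goal is simply never to visit the children of a dropped element, another option is to walk the tree recursively and descend only into the elements that are kept. The sketch below uses the standard library's xml.etree.ElementTree on a well-formed snippet rather than lxml, and prune is a name chosen for this example. One difference to keep in mind: remove() also discards the text that follows the removed element (its tail), whereas lxml's drop_tree() keeps it.

```python
import re
import xml.etree.ElementTree as ET

UNWANTED = re.compile(r"\bunwanted\b")


def prune(parent):
    """Remove unwanted children of parent, recursing only into the kept ones."""
    # Iterate over a copy: removing from the live child list would skip siblings.
    for child in list(parent):
        marker = "{} {}".format(child.get("class", ""), child.get("id", ""))
        if UNWANTED.search(marker):
            # The whole subtree goes, so its descendants are never visited.
            parent.remove(child)
        else:
            prune(child)


doc = ET.fromstring(
    '<body>'
    '<div><p class="unwanted">go</p><p class="fine">stay</p></div>'
    '<div id="second" class="unwanted"><p class="alreadydead">never visited</p></div>'
    '<div><p class="yeswanted">also stay</p></div>'
    '</body>'
)
prune(doc)
result = ET.tostring(doc, encoding="unicode")
print(result)
```

Because the recursion stops at a removed element, no bookkeeping is needed to skip its descendants in the iterator.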
