Quotes Messing Up Python Scraper

I am trying to scrape all the data within a div as follows. However, the quotes are throwing me off.
<div id="address">
<div class="info">14955 Shady Grove Rd.</div>
<div class="info">Rockville, MD 20850</div>
<div class="info">Suite: 300</div>
</div>
I am trying to start it with something along the lines of
addressStart = page.find("<div id="address">")
but the quotes within the div are messing me up. Does anybody know how I can fix this?

To answer your specific question, you need to escape the quotes, or use a different type of quote on the string itself:
addressStart = page.find("<div id=\"address\">")
# or
addressStart = page.find('<div id="address">')
But don't do that. If you are trying to "parse" HTML, let a third-party library do that. Try Beautiful Soup. You get a nice object back which you can use to traverse or search. You can grab attributes, values, etc... without having to worry about the complexities of parsing HTML or XML:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, 'html.parser')
for address in soup.find_all('div', id='address'):  # returns a list; use find() if you just want the first
    for info in address.find_all('div', class_='info'):  # class is a reserved word, so BeautifulSoup uses class_
        print(info.string)

Related

BeautifulSoup find partial string in section

I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching Stack Overflow for a long time now, and while there are similar questions and answers, none of the proposed solutions (re.compile, lambdas, etc.) have worked for me so far. I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (the ? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see the second-to-last line of the HTML example at the top).
I've tried multiple approaches using re.compile and lambdas, but I don't quite understand how either of them really works. I was able to pull up other sections from the HTML using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, so the string argument can't find it? An example of a section on the page that I AM able to find has HTML like the one below, and I'm easily able to find it using the string argument and re.compile with "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking this. Just remove the regular expression part and take the text, and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want. Like so:
s.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression will match any text that has sentinel in it. Be careful that you may have to match some leading characters such as whitespace; that's why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here:
https://regex101.com/
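A slightly more robust sketch (assuming the same soup object from the snippets above): escape the literal dots so they are not treated as wildcards. BeautifulSoup runs a compiled pattern through re.search(), so the pattern only has to appear somewhere in the text node:
import re

# A minimal sketch, assuming `soup` was parsed as in the answer above.
# Escaping the dot keeps it literal instead of matching any character.
section = soup.find('section', attrs={'class': 'onecol habonecol'})
filename = section.find(string=re.compile(r'sentinel-3\.\d+'))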
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class': 'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break

links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.
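For what it's worth, a more compact sketch of the same idea (assuming the same soup): find the matching text node first, then climb to its enclosing section with find_parent():
text_node = soup.find(string=lambda t: t and 'CIcyano' in t)
section = text_node.find_parent('section', class_='onecol habonecol')
dwn_url = section.find('a').get('href')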

Python getting the part after div class

Here is what I have
<div class="investor-item" usrid="75500">
<div class="row">
<div class="col-sm-3">
<div class="number">10,000€</div>
<div class="date">03 December 2018</div>
</div>
I would like to scrape "75500", but have no clue how to do that.
When I use
soup.findAll('div',{"class":"investor-item"})
it does not capture what I want. Do you have any suggestion?
There are a number of ways you could capture this. Your command worked for me. Though since you have a Euro sign in there, you may want to make sure your script is using the right encoding. Also, remember that find_all will return a list, not just the first matching item.
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
html = """
<div class="investor-item" usrid="75500">
<div class="row">
<div class="col-sm-3">
<div class="number">10,000€</div>
<div class="date">03 December 2018</div>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
mytag = soup.find('div', {"class": "investor-item"})
mytag2 = soup.find('div', class_="investor-item")
mytag3 = soup.find_all('div', class_="investor-item")[0]
mytag4 = soup.findAll('div', class_="investor-item")[0]
mytag5 = soup.findAll('div',{"class":"investor-item"})[0]
print(mytag['usrid']) # Returns 75500
print(mytag2['usrid']) # Also returns 75500
print(mytag3['usrid']) # Also returns 75500
print(mytag4['usrid']) # Also returns 75500
print(mytag5['usrid']) # Also returns 75500
EDIT: Here are some more details on the 5 different examples I gave.
The typical naming convention for Python functions is all lowercase with underscores, whereas some other languages use camel case. So although find_all() is the more "official" way to do this in current BeautifulSoup, findAll is the older spelling (from BeautifulSoup 3), and the library still accepts it.
As mentioned, find_all returns a list whereas find returns the first match, so doing a find_all and taking the first element ([0]) gives the same result.
Finally, {"class": "investor-item"} is an example of the general way you can specify attributes beyond just the HTML tag name itself. You just pass in the additional parameters in a dictionary like this. But since class is such a common attribute to look for in a tag, BeautifulSoup gives you the option of not having to use a dictionary and instead typing class_= followed by a string of the class name you're looking for. The reason for that underscore is so that Python doesn't confuse it with class, the Python command to create a Python class in your code.

Error accessing a class with hyphen-separated names in an HTML file using BeautifulSoup

I am trying to scrape the data of popular English movies on Hotstar.
I downloaded the HTML source code, and I am doing this:
from bs4 import BeautifulSoup as soup
page_soup = soup(open('hotstar.html'),'html.parser')
containers = page_soup.findAll("div",{"class":"col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope"})
container = containers[0]
# To get video link
container.div.hs-cards-directive.article.a
I am getting an error at this point:
NameError: name 'cards' is not defined
These are the first few lines of the html file:
<div bindonce="" class="col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope" ng-repeat="slides in gridcardData">
<hs-cards-directive cdata="slides" class="ng-isolate-scope" renderingdone="shownCard()">
<article class="card show-card" ng-class="{'live-sport-card':isLiveSportCard, 'card-active':btnRemoveShow,'tounament-tray-card':record.isTournament}" ng-click="cardeventhandler({cardrecord:record})" ng-init="init()" pdata="record" removecard="removecard" watched="watched">
<a href="http://www.hotstar.com/movies/step-up-revolution/1770016594" ng-href="/movies/step-up-revolution/1770016594" restrict-anchor="">
Please help me out!
I am using Python 3.6.3 on Windows.
As (loosely) explained in the Going down section of the docs, the tag.descendant syntax is just a convenient shortcut for tag.find('descendant').
That shortcut can't be used in cases where you have tags whose names aren't valid Python identifiers.1 (Also in cases where you have tags whose names collide with methods of BS4 itself, like a <find> tag.)
Python identifiers can only have letters, digits, and underscores, not hyphens. So, when you write this:
container.div.hs-cards-directive.article.a
… Python parses it like this mathematical expression:
container.div.hs - cards - directive.article.a
BeautifulSoup's div node has no descendant named hs, but that's fine; it just returns None. But then you try to subtract cards from that None, and you get a NameError.
Anyway, the only solution in this case is to not use the shortcut and call find explicitly:
container.div.find('hs-cards-directive').article.a
Or, if it makes sense for your use case, you can just skip down to article, because the shortcut finds any descendants, not just direct children:
container.div.article.a
But I don't think that's appropriate in your case; you want articles only under specific child nodes, not all possible articles, right?
1. Technically, it is actually possible to use the shortcut, it's just not a shortcut anymore. If you understand what getattr(container.div, 'hs-cards-directive').article.a means, then you can write that and it will work… but obviously find is going to be more readable and easier to understand.
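For completeness, a runnable sketch of both forms against a trimmed version of the question's HTML:
from bs4 import BeautifulSoup

html = '''<div class="col-xs-6 col-sm-4 col-md-3 col-lg-3 ng-scope">
<hs-cards-directive><article><a href="/movies/step-up-revolution/1770016594">x</a></article></hs-cards-directive>
</div>'''
container = BeautifulSoup(html, 'html.parser')

link = container.div.find('hs-cards-directive').article.a      # explicit find()
same = getattr(container.div, 'hs-cards-directive').article.a  # the footnote's form
print(link['href'])  # /movies/step-up-revolution/1770016594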

Python: how to identify HTML block elements that contain text?

I have raw HTML files, and I remove the script tags.
I want to identify in the DOM the block elements (like <h1>, <p>, <div>, etc., not <a>, <em>, <b>, etc.) and enclose them in <div> tags.
Is there an easy way to do this in Python? Is there a library in Python that can identify the block elements?
Thanks
UPDATE
Actually, I want to extract from the HTML document. I have to identify the blocks which contain text. For each text element, I have to find its closest parent element that is displayed as a block. After that, for each block, I will extract features such as the size and position of the block.
You should use something like Beautiful Soup or HTMLParser.
Have a look at their docs: Beautiful Soup or HTMLParser.
You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.
Here is a simple example of how you could go about this. Say 'data' is the raw content of a site; then you could:
soup = BeautifulSoup(data, 'html.parser')  # you may need to add from_encoding="utf-8" or so
Then you might want to walk through the tree, looking for a specific node and doing something with it. You could use a function like this:
def walker(soup):
    if soup.name is not None:
        for child in soup.children:
            # do stuff with the node
            print(':'.join([str(child.name), str(type(child))]))
            walker(child)
Note: the code is from this great tutorial.
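Closer to the original question, here is a rough sketch that flags block-level elements directly containing text; the block-tag list is my own assumption and is far from exhaustive:
from bs4 import BeautifulSoup

# Hypothetical, non-exhaustive list of tags browsers render as blocks.
BLOCK_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'div', 'ul', 'ol', 'li', 'table']

def text_blocks(soup):
    # Yield block-level tags that have a non-empty text node as a direct child.
    for tag in soup.find_all(BLOCK_TAGS):
        direct_text = tag.find(string=True, recursive=False)
        if direct_text and direct_text.strip():
            yield tag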

replacing html tags with BeautifulSoup

I'm currently reformatting some HTML pages with BeautifulSoup, and I ran into a bit of a problem.
My problem is that the original HTML has things like this:
<li><p>stff</p></li>
and
<li><div><p>Stuff</p></div></li>
as well as
<li><div><p><strong>stff</strong></p></div><li>
With BeautifulSoup I hope to eliminate the div and the p tags, if they exist, but keep the strong tag.
I've been looking through the Beautiful Soup documentation and couldn't find anything.
Ideas?
Thanks.
This question probably referred to an older version of BeautifulSoup, because with bs4 you can simply use the unwrap function:
s = BeautifulSoup('<li><div><p><strong>stff</strong></p></div><li>', 'lxml')
s.div.unwrap()
>> <div></div>
s.p.unwrap()
>> <p></p>
s
>> <html><body><li><strong>stff</strong></li><li></li></body></html>
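A slightly fuller sketch of the same idea, unwrapping every div and p in one pass while leaving strong intact:
from bs4 import BeautifulSoup

s = BeautifulSoup('<li><div><p><strong>stff</strong></p></div></li>', 'html.parser')
for tag in s.find_all(['div', 'p']):
    tag.unwrap()  # removes the tag itself but keeps its children
print(s)  # <li><strong>stff</strong></li>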
What you want to do can be done using replaceWith. You have to duplicate the element you want to use as the replacement, and then feed that as the argument to replaceWith. The documentation for replaceWith is pretty clear on how to do this.
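A minimal sketch of that approach (in bs4 the method is spelled replace_with; here I extract the element first rather than duplicating it):
from bs4 import BeautifulSoup

s = BeautifulSoup('<li><div><p><strong>stff</strong></p></div></li>', 'html.parser')
strong = s.find('strong').extract()  # detach the replacement first
s.div.replace_with(strong)           # swap the whole div for the strong tag
print(s)  # <li><strong>stff</strong></li>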
I saw many answers to this simple question. I also came here hoping to find something useful, but unfortunately I didn't get what I was looking for. After a few tries I found a simple solution, and here it is:
soup = BeautifulSoup(htmlData, "html.parser")
h2_headers = soup.find_all("h2")
for header in h2_headers:
    header.name = "h1"  # replaces the h2 tag with h1
All h2 tags are converted to h1. You can convert any tag by just changing its name.
You can write your own function to strip tags:
import re

def strip_tags(string):
    return re.sub(r'<.*?>', '', string)

strip_tags("<li><div><p><strong>stff</strong></p></div><li>")
# 'stff'
A simple solution: get your whole node (meaning the div), then:
Convert it to a string.
Replace <tag> with the required tag/string.
Replace the corresponding closing tag with an empty string.
Convert the modified string back into a parsable object by passing it to BeautifulSoup.
Here is what I did for mine.
Example:
<div class="col-md-12 option" itemprop="text">
<span class="label label-info">A</span>
**-2<sup>31</sup> to 2<sup>31</sup>-1**
sup = opt.sup
if sup:  # opt has a sup tag
    # convert opt to a string and replace the sup tags
    opt = str(opts).replace("<sup>", "^").replace("</sup>", "")
    # convert the string back into a BeautifulSoup object
    s = BeautifulSoup(opt, 'lxml')
    # reassign to the required variable after manipulation
    opts = s.find("div", class_="col-md-12 option")
Output:
-2^31 to 2^31-1
Without this manipulation it would look like this: (-231 to 231-1)
