I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow for a long time now, and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see the second to last line of the html example at the top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those really works. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, so the string argument can't find it? An example of a section on the page that I am able to find has html like the one below, and I'm easily able to find it using the string argument and re.compile with "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking it. Just remove the regular expression part and take the text, and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want, like so:
soup.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression will match any text that has sentinel in it. Be careful that you may have to match some characters like leading whitespace, which is why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here:
https://regex101.com/
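Putting the pieces together, a minimal sketch of this approach (assuming the page structure shown in the question; "CIcyano" is the partial filename being searched for):

import re
import requests
from bs4 import BeautifulSoup

reqs = requests.get(url)  # url is the page being scraped
soup = BeautifulSoup(reqs.text, 'html.parser')

# Find the text node containing the partial filename, then walk up to
# its enclosing <section> and pull the download URL out of the <a> tag.
text_node = soup.find(string=re.compile('CIcyano'))
if text_node:
    section = text_node.find_parent('section', attrs={'class': 'onecol habonecol'})
    dwn_url = section.find('a').get('href')
    print(dwn_url)

Note that the string match applies to individual text nodes rather than to the section's combined text, which is why finding the text first and walking up with find_parent() succeeds where passing string= to the section search returned None.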
I ended up finding another method that doesn't use the string argument in find(). Instead I used something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class':'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break

links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.
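One caveat worth noting: if no section matches, s is left pointing at the last section from the loop, and the code after it would quietly grab the wrong URL. A guarded variant (a sketch reusing the same names as above):

dwn_url = None
for s in sections:
    if 'CIcyano' in s.text:
        dwn_url = s.find('a').get('href')
        break
print(dwn_url)  # None if nothing matched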
Hey, I just started to learn Python and want to code a web scraper. On the website I got, I only care about the name and price, but all the information is written in one div which contains three other sub-divs. The HTML looks like:
<div class="Product">
<div class="product-image-and-name-container"></div>
<div class="prices"></div>
<div class="buy-now-button"></div>
</div>
I tried to use this line to get all the information from "Product":
root_pattern = '<div class="Product">([\s\S]*?)</div>'
But it only gives the information from the first div ("product-image-and-name-container") and then stops; it doesn't get anything from the other divs.
Here is all my code:
from urllib.request import Request, urlopen
import re

class Shopping_Spider():
    url = 'http://www....com/Shop-Online/587'
    root_pattern = '<div class="Product">([\s\S]*?)</div>'
    name_pattern = '<div class="product-name">([\s\S]*?)</div>'
    price_pattern = '<span class="Price">([\s\S]*?)</span>'

    def __fetch_content(self):
        # page = urllib.urlopen(Shopping_Spider.url)
        r = Request(Shopping_Spider.url, headers={'User-Agent': 'Mozilla/5.0'})
        html_s = urlopen(r).read()
        html_s = str(html_s, encoding='utf-8')
        return html_s

    def __analysis(self, html_s):
        root_html = re.findall(Shopping_Spider.root_pattern, html_s)
        anchors = []
        for html in root_html:
            name = re.findall(Shopping_Spider.name_pattern, html)
            price = re.findall(Shopping_Spider.price_pattern, html)
            anchor = {'name': name, 'price': price}
            anchors.append(anchor)
        return anchors

    def go(self):
        html_s = self.__fetch_content()
        self.__analysis(html_s)

shopping_spider = Shopping_Spider()
shopping_spider.go()
Thanks in advance. I think my regular expression is wrong, but I don't know how to rewrite it. I know it may be easier to use BeautifulSoup for this, but I just want to know whether it's possible to get what I want using only regular expressions! Big thanks.
You can extract the inner content of the outer divs with a regex like
root_pattern = r'(?:<div class="Product">)(.*)(?:</div>)'
The pattern above defines three groups, but the two marked with ?: at the start are non-capturing, so only the middle one is returned.
You also have to set the DOTALL flag so that the dot (.) matches all characters, including the \n char, so later in your code use
root_html = re.findall(Shopping_Spider.root_pattern, html_s, re.DOTALL)
Then you can adapt the remaining patterns with the same principles.
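For example, inside __analysis (a sketch of the same idea; see the edit below for why this regex approach stays fragile):

name = re.findall(Shopping_Spider.name_pattern, html, re.DOTALL)
price = re.findall(Shopping_Spider.price_pattern, html, re.DOTALL)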
Edit (important):
Jimmy, unless an expert in regexes shows up with a solution, go with the BeautifulSoup.
Unless your target html pages are that simple (they never are), this will not work in practice (despite working with the sample). Alex's comment goes right to the point. Also, feel free to un-accept my answer and give him the credit, because I tend to believe the better advice will always be to go with BS (despite you asking for a regex alternative). You may always upvote this if you think it was somehow useful anyway.
The problem is that div tags can be arbitrarily nested in a document. The regex I presented captures everything from the start of the product div until the last closing div in the document (which will not work in practice), because * is "greedy". You can avoid that with a ? following the *, but that won't solve anything either, as it will then capture only up to the first closing div. I also see no way of matching a closing div with its opening tag, because closing divs are all equal, and because of both arbitrary nesting and structure changes such as having more nested divs inside the product div.
There's no way around it without starting to write code that somehow parses the html, which is exactly what BS is for.
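For comparison, a minimal BeautifulSoup sketch (assuming the class names shown in the question; the site's real markup may differ):

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'http://www....com/Shop-Online/587'  # elided URL from the question
r = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html_s = urlopen(r).read()

soup = BeautifulSoup(html_s, 'html.parser')
anchors = []
for product in soup.find_all('div', class_='Product'):
    name = product.find('div', class_='product-name')
    price = product.find('span', class_='Price')
    anchors.append({
        'name': name.get_text(strip=True) if name else None,
        'price': price.get_text(strip=True) if price else None,
    })
print(anchors)

Nesting is handled by the parser, so the inner divs no longer matter.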
This is what I'm trying to scrape:
<p>Some.Title.html<br />
https://www.somelink.com/yep.html<br />
Some.Title.txt<br />
https://www.somelink.com/yeppers.txt<br />
I have tried several variations of the following:
match = re.compile('^(.+?)<br \/><a href="https://www.somelink.com(.+?)">',re.DOTALL).findall(html)
I am looking to match lines with the "p" tag and without; the "p" tag only occurs on the first instance. I'm terrible at Python so I am pretty rusty, and I have searched through here and Google and nothing seemed to be quite the same. Thanks for any help. I really do appreciate the help I get here when I am stuck.
Desired output is an index:
http://www.SomeLink.com/yep.html
http://www.SomeLink.com/yeppers.txt
Using the Beautiful Soup and requests modules would be perfect for something like this instead of regex, as the commenters noted above.
import requests
import bs4

html_site = 'https://www.google.com'  # or whatever site you need scraped
site_data = requests.get(html_site)  # downloads site into a requests object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # converts site text into a bs4 object
a_tags = site_parsed.select('a')  # selects all 'a' tags and returns a list of them
This is just simple code that will select all the 'a' tags from the html site and store them in a list, in the format that you illustrated up above. I'd advise working through a bs4 tutorial and the official docs.
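From there, a short follow-up (a sketch; filtering on 'somelink.com' matches the example URLs in the question):

hrefs = [tag.get('href') for tag in a_tags if tag.get('href')]
index = [h for h in hrefs if 'somelink.com' in h]
for link in index:
    print(link)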
I have the following url 'http://www.alriyadh.com/file/278?&page=1'
I would like to write a regex to access urls from page=2 till page=12
For example, this url is needed 'http://www.alriyadh.com/file/278?&page=4', but not page = 14
I reckon what will work is a function that iterates over the specified pages to access all the urls within them. I have tried this regex, but it does not work:
'.*?=[2-9]'
My aim is to get the content from those urls using newspaper package. I simply want this data for my research
Thanks in advance
This does not require a regex; a simple preset loop will do.
import requests
from bs4 import BeautifulSoup as bs
url = 'http://www.alriyadh.com/file/278?&page='
for page in range(2, 13):
    html = requests.get(url + str(page)).text
    soup = bs(html, 'html.parser')
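Since the stated goal is to feed these pages to the newspaper package, a possible continuation (a sketch assuming newspaper3k's Article API; untested against this site):

from newspaper import Article

for page in range(2, 13):
    article = Article(url + str(page))
    article.download()
    article.parse()
    print(article.text)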
Here's a regex to access the proper range (i.e. 2-12):
([2-9]|1[012])
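To sanity-check that alternation (a sketch; anchoring it with page= and $ is my addition, not part of the regex above):

import re

page_re = re.compile(r'page=([2-9]|1[012])$')
for n in range(1, 15):
    candidate = 'http://www.alriyadh.com/file/278?&page=%d' % n
    print(n, bool(page_re.search(candidate)))  # True only for 2 through 12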
Judging by what you have now, I am unsure that your regex will work as you intend it to. Perhaps I am misinterpreting your regex altogether, but is the '?=' intended to be a lookahead?
Or are you actually searching for a '?' immediately followed by a '=' immediately followed by any number 2-9?
How familiar are you with regexs in general? This particular one seems dangerously vague to find a meaningful match.
I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python Replace. What I am dealing with is HTML code. It can be quite complex sometimes.
Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link: <a href="http://example.com">example.com</a>
By a simple Python replace function which replaces example with <a href="http://example.com">example</a>, it would output:
Here is an <a href="http://example.com">example</a> link: <a href="http://<a href="http://example.com">example</a>.com">example.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link: <a href="http://example.com">example.com</a>
Is there any Python plugin that capable of this? Thanks a lot!
This is roughly what you could do using Beautifulsoup:
from BeautifulSoup import BeautifulSoup

html_body = """
Here is an example link: <a href='http://example.com'>example.com</a>
"""
soup = BeautifulSoup(html_body)

# Temporarily mark the text of existing links so it won't be replaced.
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')

# Replace the bare keyword with a link, skipping the marked words.
for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))

# Remove the markers again.
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]

print soup
Basically I'm stripping out all the text from the html_body, replacing the example word with the given link, without touching the link texts, which are protected by the '|' characters during the parsing.
This is not 100% perfect, for example it does not work if the word you are trying to replace ends with a period; with some patience you could fix all the edge cases.
This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub, which lets you pass in a function; but unless you are operating on plain text, you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
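To illustrate the re.sub approach on plain text (a sketch; the keyword/link pair is hypothetical):

import re

keywords = {'example': 'http://example.com'}

def linkify(match):
    # Wrap the matched keyword in a link, preserving its original casing.
    word = match.group(0)
    return '<a href="%s">%s</a>' % (keywords[word.lower()], word)

text = 'Here is an example link.'
pattern = re.compile(r'\b(%s)\b' % '|'.join(keywords), re.IGNORECASE)
print(pattern.sub(linkify, text))
# -> Here is an <a href="http://example.com">example</a> link.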
I'm building an app in python, and I need to get the URL of all links in one webpage. I already have a function that uses urllib to download the html file from the web, and transform it to a list of strings with readlines().
Currently I have this code that uses regex (I'm not very good at it) to search for links in every line:
for line in lines:
    result = re.match('/href="(.*)"/iU', line)
    print result
This is not working, as it only prints "None" for every line in the file, but I'm sure that at least there are 3 links on the file I'm opening.
Can someone give me a hint on this?
Thanks in advance
Beautiful Soup can do this almost trivially:
from BeautifulSoup import BeautifulSoup as soup
html = soup('<body><a href="123">qwe</a><a href="fred">asd</a></body>')
print [tag.attrMap['href'] for tag in html.findAll('a', {'href': True})]
Another alternative to BeautifulSoup is lxml (http://lxml.de/);
import lxml.html
links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print link
There's an HTML parser that comes standard in Python. Check out htmllib.
As previously mentioned: regex does not have the power to parse HTML. Do not use regex for parsing HTML. Do not pass Go. Do not collect £200.
Use an HTML parser.
But for completeness, the primary problem is:
re.match ('/href="(.*)"/iU', line)
You don't use the “/.../flags” syntax for decorating regexes in Python. Instead put the flags in a separate argument:
re.match('href="(.*)"', line, re.I|re.U)
Another problem is the greedy ‘.*’ pattern. If you have two hrefs in a line, it'll happily suck up all the content between the opening " of the first match and the closing " of the second match. You can use the non-greedy ‘.*?’ or, more simply, ‘[^"]*’ to only match up to the first closing quote.
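A quick demonstration of the difference (a sketch with made-up hrefs):

import re

line = '<a href="first.html">one</a> <a href="second.html">two</a>'
print(re.findall(r'href="(.*)"', line))     # greedy: one over-long match
print(re.findall(r'href="([^"]*)"', line))  # ['first.html', 'second.html']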
But don't use regexes for parsing HTML. Really.
What others haven't told you is that using regular expressions for this is not a reliable solution.
Using regular expressions will give you wrong results in many situations: if there are <a> tags that are commented out, or if there is text in the page which includes the string "href=", or if there are <textarea> elements with html code in them, and many others. Plus, the href attribute may exist on tags other than the anchor tag.
What you need for this is XPath, which is a query language for DOM trees, i.e. it lets you retrieve any set of nodes satisfying the conditions you specify (HTML attributes are nodes in the DOM).
XPath is a well-standardized language nowadays (W3C), and is well supported by all major languages. I strongly suggest you use XPath and not regexps for this.
adw's answer shows one example of using XPath for your particular case.
Don't divide the html content into lines, as there may be multiple matches in a single line. Also, don't assume there are always quotes around the url.
Do something like this:
import re

# content holds the full html, not individual lines
links = re.finditer(' href="?([^\s^"]+)', content)
for link in links:
    print link.group(1)
Well, just for completeness I will add here what I found to be the best answer, and I found it on the book Dive Into Python, from Mark Pilgrim.
Here follows the code to list all URL's from a webpage:
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Collect every href attribute from each <a> tag encountered.
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

Save the class above as urllister.py, then use it like this:

import urllib, urllister

usock = urllib.urlopen("http://diveintopython.net/")
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls:
    print url
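For what it's worth, sgmllib was removed in Python 3; a rough equivalent using the standard-library html.parser (a sketch, not from the book) would be:

from html.parser import HTMLParser

class URLLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the start tag.
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href')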
Thanks for all the replies.