Python Regex Match Line If Ends With? - python

This is what I'm trying to scrape:
<p>Some.Title.html<br />
https://www.somelink.com/yep.html<br />
Some.Title.txt<br />
https://www.somelink.com/yeppers.txt<br />
I have tried several variations of the following:
match = re.compile('^(.+?)<br \/><a href="https://www.somelink.com(.+?)">',re.DOTALL).findall(html)
I am looking to match lines both with and without the "p" tag; the "p" tag only occurs on the first line. I'm pretty rusty at Python and have searched here and on Google, but nothing seemed to be quite the same. Thanks for any help; I really do appreciate the help I get here when I'm stuck.
Desired output is an index:
http://www.SomeLink.com/yep.html
http://www.SomeLink.com/yeppers.txt

Using the BeautifulSoup and requests modules would be perfect for something like this instead of regex, as the commenters noted above.
import requests
import bs4

html_site = 'https://www.google.com'  # or whatever site you need scraped
site_data = requests.get(html_site)  # downloads the page into a Response object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # parses the page text into a bs4 object
a_tags = site_parsed.select('a')  # selects all 'a' tags and returns them as a list
This is just a simple snippet that selects all the 'a' tags from the page and stores them in a list, in the format you illustrated above. I'd advise checking here for a nice tutorial on bs4 and here for the actual docs.
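To get exactly the index of links from the question, you could then filter the hrefs by extension. A minimal sketch, where the HTML string below is a stand-in for the real page text (which would come from requests.get(...).text):

```python
import bs4

# Stand-in for the scraped page, modeled on the snippet in the question
html = (
    '<p>Some.Title.html<br />'
    '<a href="https://www.somelink.com/yep.html">link</a><br />'
    'Some.Title.txt<br />'
    '<a href="https://www.somelink.com/yeppers.txt">link</a><br /></p>'
)

soup = bs4.BeautifulSoup(html, 'html.parser')
# Keep only hrefs ending in .html or .txt, building the "index" from the question
index = [a.get('href') for a in soup.select('a')
         if a.get('href', '').endswith(('.html', '.txt'))]
print(index)
```

This sidesteps the regex entirely: the parser handles the tags, and a plain string check handles the "ends with" part.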

Related

BeautifulSoup find partial string in section

I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those functions really work. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, so the string argument can't find it? An example of a section on the page that I AM able to find has HTML like the one below, and I'm easily able to find it using the string argument and re.compile with "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking it. Just remove the regular-expression part, take the text, and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want. Like so:
soup.find('section', attrs={'class': 'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression will match any text that contains "sentinel". Be careful: you may also have to match characters like leading whitespace, which is why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here:
https://regex101.com/
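A minimal sketch of that approach, using a partial match on "CIcyano" as in the question (the HTML string below is a cut-down stand-in for the real page):

```python
import re
import bs4

# Stand-in for the real page; the real one would come from requests.get(url).text
html = '''
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download"></a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')
# find() with string=... returns the matching text node, not the section,
# so climb back up to the enclosing <section> with find_parent()
text_node = soup.find(string=re.compile('CIcyano'))
section = text_node.find_parent('section')
url = section.find('a').get('href')
print(url)
```

This also illustrates why find('section', ..., string=...) returns None in the original attempt: the section has several children, so its .string is None, while searching the text nodes directly and walking back up works.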
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class': 'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break

links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.

Python - Find the correct link within a post with a wildcard

I'm trying to extract a link from a forum with the following Python code. The post contains a lot of HTML links, and I'm trying to find one specific link:
Daily news <img src="https://site.html/pic.png" class="bbCodeImage LbImage" alt="[​IMG]" data-url="https://site.html/pic.png">
Here is my code:
from bs4 import BeautifulSoup
import defs
import re

def find_link(soup, date, section, URL):
    # Find the right post
    section = soup.find('li', {"data-author": "Ghostwriter"})
    # Search for the link inside the post
    link = section.find(string=" Daily news ")
    # Mark the whole html section
    section_new = str(link.find_parents('a'))
    # Get the link
    link_new = re.search(r"(?P<url>https?://[^\s]+)", section_new).group("url")
The problem now is that sometimes there is no space before or after "Daily news", and then my code fails:
AttributeError: 'NoneType' object has no attribute 'find_parents'
How can I make my code more flexible, for example with some wildcards? For example:
link = section.find(string="*Daily news*")
Thanks a lot!
I believe you can use re.compile as an argument to string. This should allow you to create a regex that matches the strings you are looking for. More information about Python regex can be found here: https://docs.python.org/3/library/re.html
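A minimal sketch of that idea (the HTML string below is a hypothetical stand-in for the real forum post, where the spacing around "Daily news" varies):

```python
import re
from bs4 import BeautifulSoup

# Stand-in for the real post; spacing around "Daily news" varies on the real page
html = '<li data-author="Ghostwriter"><a href="https://site.html/news">Daily news</a></li>'

soup = BeautifulSoup(html, 'html.parser')
post = soup.find('li', {"data-author": "Ghostwriter"})
# re.compile matches the text with or without surrounding whitespace,
# unlike string=" Daily news ", which requires an exact match
link_text = post.find(string=re.compile(r'\s*Daily news\s*'))
url = link_text.find_parent('a').get('href')
print(url)
```

Because the regex is searched rather than matched exactly, " Daily news ", "Daily news", and "Daily news " all work.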
Try using tag.get to return a string; then you should be able to use str.startswith to do exactly what you want.

Find a string in a string which starts and ends with different string in Python

I have the complete HTML of a page, and from it I need to find its GA (Google Analytics) ID. For example:
<script>ga('create', 'UA-4444444444-1', 'auto');</script>
From above string I need to get UA-4444444444-1, which starts from "UA-" and ends with "-1". I have tried this:
re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)
but didn't get any success. Please let me know what mistake I am making.
Thanks
It seems that you are overthinking it; you could just search for the UA token directly:
re.findall(r"UA-\d+-\d+", raw_html)
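For example, against the snippet from the question (raw_html here stands in for the full page source):

```python
import re

# Stand-in for the full page source from the question
raw_html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"

# Find every Google Analytics ID of the form UA-<digits>-<digits>
ids = re.findall(r"UA-\d+-\d+", raw_html)
print(ids)
```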
Never use regex to parse through HTML. BeautifulSoup is fine for extracting text from tags. Here we extract the script tags from the HTML, then apply a regex to the text located in them.
import re
from bs4 import BeautifulSoup as bs4

html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"
soup = bs4(html, 'lxml')
pattern = re.compile("UA-[0-9]+-[0-9]+")
ids = []
for i in soup.findAll("script"):
    ids.append(pattern.findall(i.text)[0])
print(ids)

Going through HTML DOM in Python

I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and can walk through the DOM to try to find a specific element.
I currently have this:
#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)
When I print content it does print out the entire HTML page, which is something close to what I want... although I would ideally like to be able to navigate through the DOM rather than treating it as a giant string.
I'm still fairly new to Python but have experience with multiple other languages (mainly Java, C#, C++, C, PHP, JS). I've done something similar with Java before but wanted to try it out in Python.
There are many different modules you could use. For example, lxml or BeautifulSoup.
Here's an lxml example:
import urllib.request
import lxml.html

mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)
description = lxml_mysite.xpath("//meta[@name='description']")[0]  # meta tag description
text = description.get('content')  # content attribute of the tag
>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
And a BeautifulSoup example:
from bs4 import BeautifulSoup
import urllib.request

mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite, 'html.parser')
description = soup_mysite.find("meta", {"name": "description"})  # meta tag description
text = description['content']  # text of content attribute
>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
Notice how BeautifulSoup returns a unicode string (the u prefix), while lxml does not; on Python 3 both examples give you a plain str. This can be useful/hurtful depending on what is needed.
Check out the BeautifulSoup module.
from bs4 import BeautifulSoup
import urllib.request

soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))

Python Regex scraping data from a webpage

My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to do a findall on the Groupon page to find data like this (from this page: http://www.groupon.de/alle-deals/muenchen/restaurant-296):
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get the 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I tried the whole night but I'm unable to find a correct regex. I tried:
import urllib2
import re

Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$', Page_Web):
    print m
But it doesn't print anything.
In order to extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html)
scriptResults = soup('script',{'type' : 'text/javascript'})
js_block = scriptResults[12]
Starting from this you can parse with a regex if you want, or try to interpret the JS (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use groupon api...
P.S.
The block that you are parsing can easily be read as a dictionary; it is already a list of dictionaries if you look closely...
How about changing RESATAURANT1 to RESTAURANT1, for starters?
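For what it's worth, a minimal sketch of a regex that pulls those permalinks out directly (the sample string stands in for the downloaded page source, and the "dealPermaLink" key is taken from the snippet in the question):

```python
import re

# Stand-in for the downloaded page source, modeled on the question's snippet
page = '"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330",'

# Capture whatever sits between the quotes after "dealPermaLink":
links = re.findall(r'"dealPermaLink":"([^"]+)"', page)
print(links)
```

Matching "anything but a closing quote" is far more robust here than trying to spell out the path segment by segment with wildcards.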
