How to do regular expression to extract string from HTML file - python

I still cannot figure out how to extract links like this:
http://example.com/AA-HDCM-300B.pdf
I want to extract the product part number "AA-HDCM-300B", which begins with "AA-".
Does anyone know what the extraction code would be?

import re
url = 'dview.com/IDVIEW/Products/Cameras/Covert/assets/IV-PC229XP.pdf'
result = re.findall(r'(IV.*)\.', url)
print(result[0])
Output:
IV-PC229XP
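For the "AA-" case in the question, a tighter pattern than `.*` may be safer, since a greedy `.*` runs up to the last dot in the string. A minimal sketch, assuming part numbers consist of letters, digits, and hyphens and are directly followed by `.pdf`; the name `extract_part_number` is made up for illustration:

```python
import re

# Assumption: a part number is "AA-" followed by letters, digits or hyphens,
# and it is immediately followed by the ".pdf" extension.
PART_NUMBER = re.compile(r'(AA-[A-Za-z0-9-]+)\.pdf')

def extract_part_number(url):
    """Return the first AA- part number in url, or None if there is none."""
    match = PART_NUMBER.search(url)
    return match.group(1) if match else None

print(extract_part_number('http://example.com/AA-HDCM-300B.pdf'))
```

Returning None instead of indexing into an empty list avoids an IndexError on URLs that carry no part number.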

Related

BeautifulSoup find partial string in section

I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those really works. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, and so the string argument can't find it? An example of a section on the page that I AM able to find has html like the one below, and I'm easily able to find it using the string argument and re.compile with "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking. Just remove the regular expression part, take the text and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can also search inside a section for the string you want, like so:
soup.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression matches any text node that has "sentinel" in it. Be careful: the leading . is there to consume a character such as a space before the word, and by default . does not match a newline. You might want a more robust regex, which you can test here:
https://regex101.com/
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class':'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break
links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.
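One caveat with the loop above: if no section matches, `s` is left pointing at the last section, so the lookup can quietly return the wrong link. A minimal sketch of the same approach wrapped in a function that returns None in that case; `find_download_url` and `SAMPLE_HTML` are illustrative names, and the markup is modelled on the snippets in the question rather than on the real page:

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the sections shown in the question.
SAMPLE_HTML = """
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
"""

def find_download_url(html, needle):
    """Return the href of the first section whose text contains needle, else None."""
    soup = BeautifulSoup(html, 'html.parser')
    for section in soup.find_all('section', attrs={'class': 'onecol habonecol'}):
        # get_text() joins all text inside the section, including the bare filename
        if needle in section.get_text():
            link = section.find('a')
            if link is not None:
                return link.get('href')
    return None

print(find_download_url(SAMPLE_HTML, 'CIcyano'))
```

As far as I can tell, the string argument failed in the original attempt because it is compared against the tag's .string, which is None when a tag has more than one child, as this section does (a link plus loose text). The "Name" section has a single h3 child, which would explain why that one could be found.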

Printing out a specific part of text from requests

I've been trying to scrape data from a profile so that I can tell whether something has changed. Here's a snippet of what the overall code would probably look like:
import requests
response = requests.get('https://twitter.com/elonmusk')
print(response.text[30907:30957])
#need to print out "sensitive_media_settings_enabled":{"value":false}
I need to have "sensitive_media_settings_enabled":{"value":false} printed out in the shell, how can I do this?
As Ali said in a comment, a better approach is to use a regular expression to find and extract the string you're looking for, rather than relying on fixed indices. When I tried this, the start and stop indices were 43539 and 43589 respectively.
Here's how you could do it with regex:
import re
import requests
response = requests.get('https://twitter.com/elonmusk')
reg_expression = r'"sensitive_media_settings_enabled":{"value":(true|false)}'
result = re.search(reg_expression, response.text)
print(result[0])
prints "sensitive_media_settings_enabled":{"value":false}
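If the setting is missing from the response, re.search returns None and indexing the result raises a TypeError. A small sketch that guards against that and returns the value as a boolean; `sample_text` is a stand-in for response.text and `read_setting` is a hypothetical helper name:

```python
import re

# Stand-in for response.text; the real page content will differ.
sample_text = 'x,"sensitive_media_settings_enabled":{"value":false},y'

# Braces are escaped so they are read as literal characters.
SETTING = re.compile(r'"sensitive_media_settings_enabled":\{"value":(true|false)\}')

def read_setting(text):
    """Return True or False for the setting, or None if it is not in text."""
    match = SETTING.search(text)
    if match is None:
        return None
    return match.group(1) == 'true'

print(read_setting(sample_text))
```

Comparing the returned value between two requests would then show whether the setting changed.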

Find a string in a string which starts and ends with different string in Python

I have the complete HTML of a page, and from it I need to find its Google Analytics (GA) ID. For example:
<script>ga('create', 'UA-4444444444-1', 'auto');</script>
From the above string I need to get UA-4444444444-1, which starts with "UA-" and ends with "-1". I have tried this:
re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)
but didn't have any success. Please let me know what mistake I am making.
Thanks
It seems that you are overthinking it; you could just search for the UA token directly:
re.findall(r"UA-\d+-\d+", raw_html)
Avoid using regex to parse the HTML itself. BeautifulSoup is fine for extracting text from tags. Here we extract the script tags from the HTML, then apply the regex to the text inside them.
import re
from bs4 import BeautifulSoup
html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"
soup = BeautifulSoup(html, 'lxml')
pattern = re.compile(r"UA-[0-9]+-[0-9]+")
ids = []
for script in soup.find_all("script"):
    # extend() copes with scripts that contain no ID, where indexing [0] would raise IndexError
    ids.extend(pattern.findall(script.text))
print(ids)
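To see why the pattern in the question found nothing, it may help to run both patterns against the same snippet. A short sketch using only the standard library; the variable names `json_style` and `direct` are just for illustration:

```python
import re

raw_html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"

# The pattern from the question expects JSON-style markup such as
# "trackingId": "UA-...", which this snippet does not contain.
json_style = re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)

# Matching the ID itself works regardless of the surrounding syntax.
direct = re.findall(r"UA-\d+-\d+", raw_html)

print(json_style)
print(direct)
```

The first pattern is not wrong in itself; it just targets a different way of embedding the ID than the ga('create', ...) call shown here.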

Exclude certain keyword from URL

I can successfully get the URL using my technique, but I need to change it slightly, to "http://www.example.com/static/p/no-name-0330-227404-1.jpg", whereas the img tag gives me this link: "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
HTML CODE:
<div class="swiper-wrapper"><img data-error-placeholder="PlaceholderPDP.jpg" class="swiper-lazy swiper-lazy-loaded" src="http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"></div>
Python Code:
imagesList = []
imagesList.append([re.findall(re.compile(r'http.*?\.jpg'), etree.tostring(imagesList).decode("utf-8")) for imagesList in productTree.xpath('//*[@class="swiper-wrapper"]/img')])
print (imagesList)
output:
[['http://www.example.com/static/p/no-name-8143-225244-1-product.jpg']]
NOTE: I need to remove "-product" from the URL, and I have no idea why this URL is inside two sets of square brackets.
If you only intend to remove the "-product" keyword, you can simply use the str.replace() method. Otherwise you can construct a regular expression to manipulate the string. Below is an example using replace().
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myURL = myURL.replace("-product", "")  # gives "http://www.example.com/static/p/no-name-0330-227404-1.jpg"
print(myURL)
Regular expression version (probably not as clean a solution, in that it is harder to understand). However, it is more flexible than the first approach because it discards the last hyphenated word (e.g. -product), whatever that word is.
What I have done is capture three parts of the URL, omit the middle part because that is the -product bit, and combine parts 1 and 3 to form your URL.
import re
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myPattern = r"(.*)(-.*)(\.jpg)$"
pattern = re.compile(myPattern)
match = re.search(pattern, myURL)
print (match.group(1) + match.group(3))
Same output as above:
http://www.example.com/static/p/no-name-0330-227404-1.jpg
If all the images have the word "product" could you just do a simple string replace and remove just that word? Whatever you are trying to do (including renaming files) I see that as the simplest solution.
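On the double square brackets: re.findall returns a list for every element it is applied to, and the list comprehension collects those lists into another list, which is where the extra nesting comes from. A minimal sketch that flattens the result and strips the suffix in one pass; `img_tags` is a stand-in for the serialized img elements, not the original lxml tree:

```python
import re

# Stand-in for the strings produced by etree.tostring(...).decode("utf-8").
img_tags = [
    '<img src="http://www.example.com/static/p/no-name-0330-227404-1-product.jpg">',
]

# One findall() result (a list) per element gives a list of lists.
nested = [re.findall(r'http.*?\.jpg', tag) for tag in img_tags]

# Flatten with a nested comprehension, dropping "-product" right before ".jpg".
urls = [re.sub(r'-product(?=\.jpg$)', '', url) for found in nested for url in found]

print(nested)
print(urls)
```

The lookahead keeps the replacement tied to the end of the file name, so a "-product" elsewhere in the path would be left alone.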

Find all HTML and non-HTML encoded URLs in string

I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string.
For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml.
On the other hand, if my string contained only a plain URL without HTML tags, this answer recommends using a regular expression.
I wasn't able to find a good solution given my string contains both HTML encoded URL as well as a plain URL. Here is some example code:
import lxml.html
example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""
dom = lxml.html.fromstring(example_data)
for link in dom.xpath('//a/@href'):
    print("Found Link:", link)
As expected, this results in:
Found Link: http://www.some-random-domain.com/abc123/def.html
I also tried the twitter-text-python library that @Yannisp mentioned, but it doesn't seem to extract both URLs:
>>> from ttp.ttp import Parser
>>> p = Parser()
>>> r = p.parse(example_data)
>>> r.urls
['http://www.another-random-domain.com/xyz.html']
What is the best approach for extracting both kinds of URLs from a string containing a mix of HTML and non HTML encoded data? Is there a good module that already does this? Or am I forced to combine regex with BeautifulSoup/lxml?
I upvoted because it triggered my curiosity. There seems to be a library called twitter-text-python, that parses Twitter posts to detect both urls and hrefs. Otherwise, I would go with the combination regex + lxml
You could use RE to find all URLs:
import re
urls = re.findall(r"https?://[\w/$\-.+!*'()]+", example_data)
The character class covers alphanumerics and the underscore (\w), '/', and the other characters allowed in a URL.
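Because a double quote is not in that character class, the same pattern also picks up a URL inside an href attribute and stops at the closing quote. A quick sketch of this, assuming the sample string contains an anchor tag whose href is the first URL reported in the question (the exact markup is reconstructed, not copied):

```python
import re

# Assumed sample: an anchor tag followed by a plain-text URL.
example_data = (
    '<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>\n'
    'http://www.another-random-domain.com/xyz.html'
)

URL_PATTERN = re.compile(r"https?://[\w/$\-.+!*'()]+")

urls = URL_PATTERN.findall(example_data)
print(urls)
```

This will not find relative links, and a URL at the end of a sentence would keep its trailing full stop, so it is a rough filter rather than a full URL parser.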
Based on the answer by @YannisP, I was able to come up with this solution:
import lxml.html
from ttp.ttp import Parser
def extract_urls(data):
    urls = set()
    # First extract HTML-encoded URLs
    dom = lxml.html.fromstring(data)
    for link in dom.xpath('//a/@href'):
        urls.add(link)
    # Next, extract URLs from plain text
    parser = Parser()
    results = parser.parse(data)
    for url in results.urls:
        urls.add(url)
    return list(urls)
This results in:
>>> example_data
'<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>\nhttp://www.another-random-domain.com/xyz.html'
>>> urls = extract_urls(example_data)
>>> print(urls)
['http://www.another-random-domain.com/xyz.html', 'http://www.some-random-domain.com/abc123/def.html']
I'm not sure how well this will work on other URLs, but it seems to work for what I need it to do.
