Retrieve specific URL from string of multiple URLs - Python

I have a list of URLs in a string and I need to retrieve a specific URL from that string (using Python).
Here is the output string:
https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=100&q=60 100w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=200&q=60 200w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=300&q=60 300w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=400&q=60 400w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=500&q=60 500w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=600&q=60 600w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=700&q=60 700w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=800&q=60 800w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=900&q=60 900w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=60 1000w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1100&q=60 1100w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1200&q=60 1200w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1296&q=60 1296w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1400&q=60 1400w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1600&q=60 1600w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1800&q=60 1800w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2000&q=60 2000w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2200&q=60 2200w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2400&q=60 2400w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2588&q=60 2588w
I want to extract the URL that is before 1800w and after 1600w. Any help on how to go about this will be helpful!

If it's always the part between 1600w and 1800w, you can use this regex:
import re
regex = r"1600w,(.+)1800w"
test_str = "https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=100&q=60 100w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=200&q=60 200w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=300&q=60 300w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=400&q=60 400w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=500&q=60 500w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=600&q=60 600w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=700&q=60 700w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=800&q=60 800w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=900&q=60 900w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=60 1000w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1100&q=60 1100w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1200&q=60 1200w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1296&q=60 1296w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1400&q=60 1400w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1600&q=60 1600w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1800&q=60 1800w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2000&q=60 2000w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2200&q=60 2200w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2400&q=60 2400w, https://images.unsplash.com/photo-1535083783855-76ae62b2914e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2588&q=60 2588w"
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches[0].strip())  # strip the surrounding spaces to get just the URL
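If the width you need might change, a more general sketch (assuming the string always follows the srcset-style "URL width," pattern shown above, and reusing test_str from the snippet; the helper name parse_srcset is just for illustration) is to split on commas and index the candidate URLs by their width descriptor:
def parse_srcset(srcset):
    """Map each width descriptor (e.g. '1600w') to its URL."""
    candidates = {}
    for entry in srcset.split(","):
        url, descriptor = entry.strip().rsplit(" ", 1)
        candidates[descriptor] = url
    return candidates

urls = parse_srcset(test_str)
print(urls["1600w"])   # the URL whose descriptor is 1600w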

Related

How to remove all tags from text in Python

I want to extract data from a tag and simply retrieve the text. Unfortunately I can't extract just the text; I always get the links along with it.
Is it possible to remove all of the <img> and <a href> tags from my text?
<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>
I just want to recover this: its a good day, and ignore the content of the <a href> tag inside my <div> tag.
Currently I perform the extraction via beautifulsoup.find('div').
Try this:
import requests
from bs4 import BeautifulSoup
#response = requests.get('your url')
html = BeautifulSoup('''<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a>
</div>''', 'html.parser')
soup = html.find_all(class_='xxx')
# the div's own text ends at the newline that precedes the <a> tag
print(soup[0].text.split('\n')[0])
Let's import re and use re.sub:
import re
s1 = '<div class="xxx" data-handler="xxx">its a good day'
s2 = '<a class="link" href="https://" title="text">https:// link</a></div>'
s1 = re.sub(r'\<[^()]*\>', '', s1)  # greedy: strips everything from the first '<' to the last '>'
s2 = re.sub(r'\<[^()]*\>', '', s2)
Output
>>> s1
'its a good day'
>>> s2
''
EDIT
Based on your comment that all of the text before the <a> should be captured, and not only the first text node in the element, select all previous_siblings and keep only the NavigableString nodes:
# note: previous_siblings yields nodes in reverse document order, so reverse them back
' '.join(
    [s for s in reversed(list(soup.select_one('.xxx a').previous_siblings)) if isinstance(s, NavigableString)]
)
Example
from bs4 import Tag, NavigableString, BeautifulSoup
html='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
' '.join(
    [s for s in reversed(list(soup.select_one('.xxx a').previous_siblings)) if isinstance(s, NavigableString)]
)
To focus just on the text and not the child tags of an element, you could use:
.find(text=True)
In case the pattern is always the same and the text is the first part of the element's content:
.contents[0]
Example
from bs4 import BeautifulSoup
html='''
<div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
soup.div.find(text=True).strip()
Output
its a good day
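If the element contains several text nodes outside its child tags, a related sketch (reusing the "New wallpaper" markup from the previous_siblings example above, purely for illustration) is to collect every direct text child rather than only the first one:
from bs4 import BeautifulSoup

html = '''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# string=True with recursive=False returns only the text nodes that are direct
# children of the div, so the <a> content is skipped
direct_text = soup.div.find_all(string=True, recursive=False)
print(' '.join(t.strip() for t in direct_text if t.strip()))
# New wallpaper Find over 100+ of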
So basically you don't want any text inside the <a> tags, but you do want the text from everything else.
from bs4 import BeautifulSoup
html1='''
<div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a></div>
'''
html2 = ''' <div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div> '''
html3 = ''' <div class="xxx" data-handler="xxx"><br>New wallpaper <br>Find over 100+ of <a class="link" href="https://" title="text">https:// link </a><div class="xxx" data-handler="xxx">its a good day
<a class="link" href="https://" title="text">https:// link</a></div></div> '''
soup = BeautifulSoup(html3,'html.parser')
for t in soup.find_all('a', href=True):
    t.decompose()  # remove each <a href=...> element from the tree
test = soup.find('div',class_='xxx').getText().strip()
print(test)
output:
#for html1: New wallpaper Find over 100+ of
#for html2: its a good day
#for html3: New wallpaper Find over 100+ of its a good day

Skip the leading and trailing spaces when finding a tag by text with Beautiful Soup

Previously I used req = soup.find("td", string="tags text") (just an example) to find elements by their text, but in this case the tag's string has spaces before and after the text.
The tag looks like this:
<dt> I am text </dt>
How should I ignore the leading and trailing spaces?
If I want to use the previous method, I have to write: req = soup.find("td", string = " I am text "), but I think there should be a better way.
You can pass a function to the text= parameter (or string=) of .find():
from bs4 import BeautifulSoup
html_doc = '''<dt>Other text</dt>
<dt> I am text </dt>'''
soup = BeautifulSoup(html_doc, 'html.parser')
dt = soup.find('dt', text=lambda t: 'I am text' == t.strip())
print(dt)
Prints:
<dt> I am text </dt>
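An alternative sketch, if you prefer not to write a lambda, is to pass a compiled regular expression that tolerates the surrounding whitespace (same html_doc as above):
import re
from bs4 import BeautifulSoup

html_doc = '''<dt>Other text</dt>
<dt> I am text </dt>'''
soup = BeautifulSoup(html_doc, 'html.parser')

# the pattern allows any leading/trailing whitespace around the target text
dt = soup.find('dt', string=re.compile(r'^\s*I am text\s*$'))
print(dt)
# <dt> I am text </dt>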

How to extract br text from span element?

Using Beautiful Soup 4, I have a span as follows:
<span style="color: grey;">32.44 MB<br/>10454 Downloads<br/>35:25 Mins<br/>128kbps Stereo</span>
I'd like to extract the text for the br elements individually. How can I do it?
Try this:
from bs4 import BeautifulSoup
txt = '''<span style="color: grey;">32.44 MB<br/>10454 Downloads<br/>35:25 Mins<br/>128kbps Stereo</span>'''
soup = BeautifulSoup(txt, 'html.parser')
for tag in soup.select('span br'):
    print(tag.next)  # the text node that immediately follows each <br/>
Output:
10454 Downloads
35:25 Mins
128kbps Stereo
Although this may not be the proper way to do it, if you treat your span as a string you can extract the pieces like this:
user_input = '<span style="color: grey;">32.44 MB<br/>10454 Downloads<br/>35:25 Mins<br/>128kbps Stereo</span>'.split("<br/>")
WordList = []
for word in user_input:
    if ">" in word:
        word = word[word.index(">") + 1:]  # drop the opening tag
    if "<" in word:
        word = word[:word.index("<")]      # drop the closing tag
    if word:
        WordList.append([word])
print(WordList)
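For completeness, a shorter sketch using Beautiful Soup's stripped_strings generator returns all four pieces, including the 32.44 MB that precedes the first <br/>:
from bs4 import BeautifulSoup

txt = '<span style="color: grey;">32.44 MB<br/>10454 Downloads<br/>35:25 Mins<br/>128kbps Stereo</span>'
soup = BeautifulSoup(txt, 'html.parser')

# stripped_strings yields every text node in the span with whitespace trimmed
print(list(soup.span.stripped_strings))
# ['32.44 MB', '10454 Downloads', '35:25 Mins', '128kbps Stereo']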

re.sub fails to execute - even if the regex pattern is found?

Consider this example, which I've run on Python 2.7:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
tstr = r''' <div class="thebibliography">
<p class="bibitem" ><span class="biblabel">
[1]<span class="bibsp">   </span></span><a
id="Xtester"></a><span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
<span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H. </span> testöng ... . <span
class="cmti-10">Draftin:</span>
<a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
</div>
'''
# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, re.DOTALL)
# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
print( re.findall(regstr, tout2, re.DOTALL)) # finds
print("------") #
print( re.sub(regstr, "AAAAAAA", tout2, re.DOTALL )) # does nothing?
When I run this, the first regex is replaced/sub'd as expected (the <a id="Xtester"></a> tag is gone); then in the output I get:
[('<a\nhref="http://www.example.com/test.html" ', 'class="url"', ' >')]
... which means the second regex is written correctly (all three parts are found). But when I try to replace that snippet with "AAAAAAA", nothing happens in that part of the output:
------
<div class="thebibliography">
<p class="bibitem" ><span class="biblabel">
[1]<span class="bibsp">   </span></span> <span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
<span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H. </span> testöng ... . <span
class="cmti-10">Draftin:</span>
<a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
</div>
Clearly, there is no "AAAAAAA" here, as I'd expect.
What is the problem, and what should I do, to get sub to replace the matches that apparently have been found?
Why not use an HTML parser for parsing and modifying the HTML?
Example, using BeautifulSoup and replace_with():
from bs4 import BeautifulSoup
data = """Your html here"""
soup = BeautifulSoup(data)
for link in soup('a', id=True):
    link.replace_with('AAAAAA')  # swap each <a> that has an id attribute for plain text
print(soup.prettify())
This replaces all of the links that have an id attribute with the AAAAAA text:
<div class="thebibliography">
<p class="bibitem">
<span class="biblabel">
[1]
<span class="bibsp">
</span>
</span>
AAAAAA
<span class="cmcsc-10">
...
Also see:
RegEx match open tags except XHTML self-contained tags
Your replacement doesn't work due to a misuse of the re.sub method. If you look at the documentation:
re.sub(pattern, repl, string, count=0, flags=0)
In your code, you put the flag in the count position. That is why the re.DOTALL flag is ignored: it is in the wrong place.
Since you don't need to use the count param, you can remove the re.DOTALL flag and use an inline modifier instead:
regstr = r'''(?s)(<a.*?)(class=['"].*?['"])([\s]*>)'''
However, using something like bs4 is probably more convenient (as you can see in alecxe's answer).
It's quite simple: the Python Standard Library Reference says the syntax of re.sub is re.sub(pattern, repl, string, count=0, flags=0). So your last sub is in fact (since re.DOTALL == 16):
re.sub(regstr, "AAAAAAA", tout2, count = 16, flags = 0 )
when you need :
re.sub(regstr, "AAAAAAA", tout2, flags = re.DOTALL )
and that last sub works perfectly ...
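A minimal sketch of the pitfall (the toy string here is made up for illustration): passing re.DOTALL positionally makes it the count argument, so the flag never takes effect:
import re

s = "a\nb"
# re.DOTALL == 16, so passed positionally it is silently treated as count=16:
print(repr(re.sub(r"a.b", "X", s, re.DOTALL)))        # 'a\nb'  (unchanged: '.' still does not match '\n')
print(repr(re.sub(r"a.b", "X", s, flags=re.DOTALL)))  # 'X'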
The problem is that your arguments were wrong.
Python 2.7 source:
def sub(pattern, repl, string, count=0, flags=0):
    # ...
Here, your re.DOTALL argument is being treated as the count argument.
FIX: Use re.sub(regstr, "AAAAAAA", tout2, flags=re.DOTALL) instead.
Note: If you compile your regex with the flag first, sub works just fine.
Well, in this case apparently I should have used a compiled regex object (instead of going directly through the re module-level call), and all seems to work (I can even use backreferences), but I still don't understand why the problem occurred at all. It would be good to learn why eventually... Anyway, this is the corrected code snippet:
# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, re.DOTALL)
# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
pat = re.compile(regstr, re.DOTALL)
#~ print( re.findall(regstr, tout2, re.DOTALL)) # finds
print( pat.findall(tout2)) # finds
print("------") #
# re.purge() # no need
# note: the third positional argument of pat.sub() is count, so re.DOTALL (== 16)
# is passed as count=16 here; it is harmless only because there are fewer than 16 matches
print( pat.sub(r'\1AAAAAAA\3', tout2, re.DOTALL )) # now substitutes as expected
... and this is the output:
[('<a\nhref="http://www.example.com/test.html" ', 'class="url"', ' >')]
------
<div class="thebibliography">
<p class="bibitem" ><span class="biblabel">
[1]<span class="bibsp">   </span></span> <span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
<span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H. </span> testöng ... . <span
class="cmti-10">Draftin:</span>
<a
href="http://www.example.com/test.html" AAAAAAA ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
</div>

Using python/BeautifulSoup to replace HTML tag pair with a different one

I need to replace a matching pair of HTML tags with another tag. BeautifulSoup (4) would probably be suitable for the task, but I've never used it before and haven't found a suitable example anywhere. Can someone give me a hint?
For example, this HTML code:
<font color="red">this text is red</font>
Should be changed to this:
<span style="color: red;">this text is red</span>
The beginning and ending HTML tags may not be in the same line.
Use replace_with() to replace elements. Adapting the documentation example to your example gives:
>>> from bs4 import BeautifulSoup
>>> markup = '<font color="red">this text is red</font>'
>>> soup = BeautifulSoup(markup)
>>> soup.font
<font color="red">this text is red</font>
>>> new_tag = soup.new_tag('span')
>>> new_tag['style'] = 'color: ' + soup.font['color']
>>> new_tag.string = soup.font.string
>>> soup.font.replace_with(new_tag)
<font color="red">this text is red</font>
>>> soup
<span style="color: red">this text is red</span>
