Ignore a link in parentheses while extracting other links - python

I'm trying to extract links from a p block, but I'd like to ignore anything within parentheses. For example:
<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>
I would like to select only the links that come after the parenthesized text, so in the above case just the link_text2 link. I currently grab the links like this:
ps = content.find_all('p', recursive=False)
for p in ps:
    # "as" is a reserved word in Python, so the result is stored under another name
    links = p.find_all('a', recursive=False)
I think I need a regex, but I'm not sure how to incorporate it so that links inside parentheses are ignored. This regex isolates anything in parentheses: \(.*?\)
Is anyone able to help?

You can walk the p tag's .contents to find all a elements, then keep only those whose neighbouring text does not open a ( just before the link or close a ) just after it:
from bs4 import BeautifulSoup as soup
def is_valid(ind: int, content: list, flag=False) -> bool:
    # A neighbour that is a tag always passes; a string neighbour passes only if it
    # lacks '(' (flag=False, text before the link) or ')' (flag=True, text after it)
    return not isinstance(content[ind], str) or (['(', ')'][flag] not in content[ind])
s = """
<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>
"""
d = soup(s, 'html.parser').p.contents
# Pair every <a> child with its index in d
l = [[i, a] for i, a in enumerate(d) if getattr(a, 'name', None) == 'a']
# A link at the very start or end of d is kept; otherwise both neighbours must pass is_valid
new_l = [a for i, a in l if (not i or i == len(d)-1) or (is_valid(i-1, d) and is_valid(i+1, d, True))]
print(new_l)
Output:
[<a href="link_text2">link_text2</a>]
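A different way to get the same result, sketched here as an assumption rather than as part of the answer above, is to track parenthesis depth while walking the direct children of the p tag and keep only the links seen at depth 0. The helper name links_outside_parens is made up for this example, and it only looks at direct children, like the recursive=False call in the question.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def links_outside_parens(p):
    """Return the direct <a> children of p that are not inside ( ... )."""
    depth = 0  # number of currently open parentheses
    kept = []
    for node in p.contents:
        if isinstance(node, NavigableString):
            # Update the depth from every parenthesis in this text node
            for ch in node:
                if ch == "(":
                    depth += 1
                elif ch == ")" and depth > 0:
                    depth -= 1
        elif isinstance(node, Tag) and node.name == "a" and depth == 0:
            kept.append(node)
    return kept


html = """<p>
Some text (even more text <a href='link_text'>link_text</a>) another link <a href='link_text2'>link_text2</a>
</p>"""
p = BeautifulSoup(html, "html.parser").p
hrefs = [a["href"] for a in links_outside_parens(p)]
print(hrefs)
```

Unlike the neighbour check, this also handles several links inside one pair of parentheses, at the cost of a few more lines.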

Related

How to extract value from specific P-tag with BeautifulSoup?

Is there a way to extract only the values for Acid (5.9 g/L) and Alcohol (14.5%)?
I thought of using find_all('p'), but it gives me all the p tags while I only need two of them.
Select the <h3> by its content and from there its direct sibling:
soup.select_one('h3:-soup-contains("Acid") + p').text
You could adapt this for other elements if their labels are known; otherwise you have to select all of them and check the content against a list:
l = ['Acid','...']
for e in soup.select('.wine-specs p'):
    if e.text in l:
        print(e.text)
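To show how the label-then-sibling idea might look end to end, here is a small sketch. The HTML is invented for the example (the real page's markup was not posted), so the .wine-specs wrapper, the h3/p layout and the Sugar row are all assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each <h3> label is followed by a <p> holding its value
html = """
<div class="wine-specs">
  <h3>Acid</h3><p>5.9 g/L</p>
  <h3>Alcohol</h3><p>14.5%</p>
  <h3>Sugar</h3><p>2.1 g/L</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

wanted = ["Acid", "Alcohol"]
specs = {}
for h3 in soup.select(".wine-specs h3"):
    label = h3.get_text(strip=True)
    if label in wanted:
        # The value sits in the next <p> sibling of the matching heading
        value = h3.find_next_sibling("p")
        if value is not None:
            specs[label] = value.get_text(strip=True)

print(specs)
```

Matching on the heading text and then stepping to its sibling avoids depending on the position of the p tags in the page.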

Trying to isolate URL suffixes from a list of href tags

I'm currently working on a simple web crawling program that will crawl the SCP wiki to find links to other articles in each article. So far I have been able to get a list of href tags that go to other articles, but can't navigate to them since the URL I need is embedded in the tag:
[<a href="/scp-1512">SCP-1512</a>,
<a href="/scp-2756">SCP-2756</a>,
<a href="/scp-002">SCP-002</a>,
<a href="/scp-004">SCP-004</a>]
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
The code used to get the list looks like this:
import requests
import lxml
from bs4 import BeautifulSoup
import re
def searchSCP(x):
    url = str(SCoutP(x))
    c = requests.get(url)
    crawl = BeautifulSoup(c.content, 'lxml')
    #Searches HTML for text containing "SCP-" and href tags containing "scp-"
    ref = crawl.find_all(text=re.compile("SCP-"), href=re.compile("scp-"))
    param = "SCP-" + str(SkateP(x)) #SkateP takes int and inserts an appropriate number of 0's.
    #Sorts out references to the article being searched; iterating over a copy so that removing items does not skip elements
    for i in list(ref):
        if str(param) in i:
            ref.remove(i)
    if ref != []:
        print(ref)
The main idea I've tried to use is finding every item that contains items in quotations, but obviously that just returned the same list. What I want to be able to do is select a specific item in the list and take out ONLY the "scp-xxxx" part or, alternatively, change the initial code to only extract the href content in quotations to the list.
If I understand correctly, you want to extract the href attribute. For that, you can use i.get('href') (or probably even just i['href']).
With .select and a list comprehension, you won't even need regex to filter the results:
[a.get('href') for a in crawl.select('*[href*="scp-"]') if 'SCP-' in a.get_text()]
would return
['/scp-1512', '/scp-2756', '/scp-002', '/scp-004']
If you want the parent url attached:
root_url = 'https://PARENT-URL.com' ## replace with the actual parent url
scpLinks = [root_url + l for l, t in list(set([
(a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t]
scpLinks should return
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-002', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
If you want to filter out param, add str(param) not in t to the filter:
scpLinks = [root_url + l for l, t in list(set([
(a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t and str(param) not in t]
If str(param) were 'SCP-002', then scpLinks would be
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
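As a variation on the snippets above, the standard library's urllib.parse.urljoin can attach the hrefs to the parent URL, and a dict can drop duplicates while keeping the order in which the links appear (set does not preserve order). The HTML below is made up for the example, and root_url is the same placeholder used above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page fragment with a duplicate link and an unrelated link
html = """
<div>
  <a href="/scp-1512">SCP-1512</a>
  <a href="/scp-002">SCP-002</a>
  <a href="/scp-1512">SCP-1512</a>
  <a href="/about">About</a>
</div>
"""
root_url = "https://PARENT-URL.com/"  # placeholder, replace with the actual parent url
crawl = BeautifulSoup(html, "html.parser")

seen = {}  # dict keys keep insertion order, so duplicates are dropped without reordering
for a in crawl.select('a[href*="scp-"]'):
    if "SCP-" in a.get_text():
        seen.setdefault(urljoin(root_url, a["href"]), None)

scp_links = list(seen)
print(scp_links)
```

urljoin also copes with hrefs that are already absolute, which plain string concatenation would break.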

Extracting text inside tags from html document

I have an html document like this: https://dropmefiles.com/wezmb
I need to extract the text between <span id="1"> and </span>, but I don't know how.
I tried this code:
from bs4 import BeautifulSoup
with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp,features="html.parser")
for a in soup.find_all('span'):
    print (a.string)
But it extracts the text from all span tags. So how can I extract only the text between <span id="1"> and </span> in Python?
What you need is the .contents attribute (see the Beautiful Soup documentation).
Find the span <span id="1"> ... </span> and iterate over its contents:
for x in soup.find(id="1").contents:
    print(x)
OR
x = soup.find(id="1").contents[0] # the span holds a single text node, so take the first item
print(x)
This will give you:
10
that is, an empty line followed by 10 followed by another empty line. The string in the HTML really is laid out that way: 10 sits on its own line inside the span.
The string itself is '\n10\n'.
If you want just x = '10' from x = '\n10\n', you can do x = x.strip(), or x = x[1:-1] since '\n' is a single character. Hope this helped.
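Here is a self-contained sketch of the same lookup. The original file is not available, so the HTML string is an assumed stand-in with the value on its own line.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the real document: the value sits on its own line
html = '<div><span id="1">\n10\n</span><span id="2">\n20\n</span></div>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", id="1")   # only the span whose id is "1"
raw = span.contents[0]             # the text node, newlines included
clean = span.get_text(strip=True)  # same text with surrounding whitespace removed

print(repr(raw))
print(clean)
```

get_text(strip=True) avoids slicing by hand, so it keeps working if the amount of surrounding whitespace changes.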

scraping B in <span><span> flow text </span> B </span> using BeautifulSoup

I have the following bs4 element tag:
<span><span>some content</span> B</span>
The length of string B is unknown (I named it B for simplicity).
How can I use BeautifulSoup to extract "B"? Or is my only option to extract the text and then apply some regex techniques?
Thanks
Edit: complete code
import requests
from bs4 import BeautifulSoup
def get_doc_yakarouler(license_plate, url='https://www.yakarouler.com/car_search/immat?immat='):
    response = requests.get(url + license_plate)
    content = response.content
    doc = BeautifulSoup(content, 'html.parser')
    result = doc.span.text
    if 'identifié' in result:
        return doc
    else:
        return f"La plaque {license_plate} n'est pas recensé sur yakarouler"
doc = get_doc_yakarouler('AA300AA')
span = doc.find_all('span')
motorisation_tag = span[1]
I want to extract "1.6 TDI".
I found a solution using motorisation_tag.text.replace(u'\xa0', ' ').split(' ')[1], but I would like to know whether this is possible directly with bs4.
Assuming you have a variable span which represents the outer <span> tag, you can extract 'B' with span.contents[1]. This works because .contents returns a list of the tag's children, in this case [<span>some content</span>, ' B'], so the 'B' text is the second element of the list. Be aware that if there is a space before B, as in your HTML sample, the space will be included in the string; call .strip() on it to remove it.
from bs4 import BeautifulSoup as bs, NavigableString
html = '<span><span>some content</span> B</span>'
soup = bs(html, 'html.parser')
span = soup.find("span")
# First approach: take the first text node that is a direct child of the outer span
outer_text_1 = span.find(string=True, recursive=False)
# Second approach: loop through the tag's contents and keep only the text nodes, skipping nested tags
outer_text_2 = ' '.join([t for t in span.contents if isinstance(t, NavigableString)])
print(outer_text_1) # prints ' B' (leading space included)
print(outer_text_2) # prints ' B' (leading space included)
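For the "1.6 TDI" case, the same idea can be wrapped in a small helper that also normalises the non-breaking space mentioned in the question. This is only a sketch: the helper name own_text is invented, and the sample markup (including the "Motorisation :" label) is a guess at what the page returns, not the real response.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Return the text that belongs directly to tag, ignoring nested tags."""
    parts = [str(node) for node in tag.contents if isinstance(node, NavigableString)]
    # Turn non-breaking spaces into plain spaces, then trim both ends
    return "".join(parts).replace("\xa0", " ").strip()


# Hypothetical markup: an inner label span followed by the value as bare text
html = "<span><span>Motorisation :</span>\xa01.6 TDI</span>"
outer = BeautifulSoup(html, "html.parser").span

motorisation = own_text(outer)
print(motorisation)
```

Because it filters on direct children only, the label inside the nested span never leaks into the result.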

Beautiful Soup: Extracting all the <br/> from the <strong>

I have a really silly and annoying problem: I'm trying to convert HTML into Markdown, but my HTML is badly formatted. I keep getting things like this:
<strong>Ihre Aufgaben:<br/></strong>
or
<strong> <br/>Über die XXXX GmbH:<br/></strong>
which is totally valid HTML.
However my library to convert to Markdown (HTML2Text) converts it to:
**Ihre Aufgaben:\n**
and
** \nÜber die XXXX GmbH:\n**
which is an already reported issue, because the Markdown is then invalid and cannot be rendered properly.
My approach to this problem was the following:
1. Use BeautifulSoup to find all the strong tags that lead to this problem.
2. Classify the <br/> tags into 2 groups: the ones coming before the text and the ones coming after the text.
3. Unwrap the ones after the text in order to push them out of the <strong>.
My code (not nicely formatted yet):
soup = BeautifulSoup(html)
emphased = soup.find_all('strong')
for single in emphased:
    children = single.children
    before = 0
    foundText = None
    after = 0
    for child in children:
        if not isinstance(child, NavigableString):
            if foundText:
                after += 1
                child.unwrap()
            else:
                before += 1
                # DOES NOT WORK
                child.unwrap()
        else:
            foundText = single.get_text().strip()
What is my current problem?
I want to unwrap the <br/> before the content and put them before the <strong> element and I cannot achieve that (and didn't find how to proceed in the doc).
What do I want to achieve more generally?
I want to transform that:
<strong> <br/>Über die XXXX GmbH: </strong>
into
# Note the space
(whitespace)<br/><strong>Über die XXXX GmbH:</strong>(whitespace)
It doesn't have to use Beautiful Soup, I'm just not aware of other solutions.
Thanks in advance!
Per your example, you can extract all the br tags from each strong tag and prepend a single <br/>, replacing the original strong tag with the rebuilt markup.
Here is a snippet:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<strong>Ihre Aufgaben:<br/></strong>", "html.parser")
for strong in soup.find_all("strong"):
    # Pull every <br/> out of the <strong>
    for br in strong.find_all('br'):
        br.extract()
    strong.string = strong.get_text(strip=True)
    # Re-parse "<br/>" followed by the cleaned <strong> and swap it in for the old tag
    strong.replace_with(BeautifulSoup(" %s%s " % ("<br/>", strong), "html.parser"))
print(soup)
Which outputs:
<br/><strong>Ihre Aufgaben:</strong>
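If the position of each <br/> matters, one possible variation is to move the nodes instead of rebuilding the markup: everything before the first real text goes in front of the <strong>, everything after the last real text goes behind it. This is a sketch under that assumption, with an invented helper name hoist_breaks; it does not trim whitespace that is part of the text node itself.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def hoist_breaks(soup):
    """Move leading and trailing non-text children out of every <strong>."""
    for strong in soup.find_all("strong"):
        children = list(strong.contents)
        # Indexes of the children that carry visible text
        text_idx = [i for i, c in enumerate(children)
                    if isinstance(c, NavigableString) and c.strip()]
        if not text_idx:
            continue
        first, last = text_idx[0], text_idx[-1]
        # Leading nodes keep their order in front of the <strong>
        for child in children[:first]:
            strong.insert_before(child.extract())
        # Trailing nodes are inserted in reverse so they keep their order
        for child in reversed(children[last + 1:]):
            strong.insert_after(child.extract())


soup = BeautifulSoup("<strong> <br/>Über die XXXX GmbH:<br/></strong>", "html.parser")
hoist_breaks(soup)
result = str(soup)
print(result)
```

The leading whitespace and <br/> end up before the <strong> and the trailing <br/> after it, which is the shape html2text needs to emit valid emphasis markers.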
