I want to change the output of my soup.find_all call. In the original source we have this:
My soup.find_all:
href = [b.get('href') for b in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
gives me this:
/book/nfo/?id=4756888
but I want this:
http://127.0.0.1/book/download/?id=4756888
You can use the properties of a Python string to add and replace parts to/from it:
a='/book/nfo/?id=4756888'
b = 'http://127.0.0.1' + a.replace('nfo', 'download')
print(b)
which gives:
'http://127.0.0.1/book/download/?id=4756888'
There's no need to use regex here.
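A slightly more defensive variant of the same idea, as a sketch only: it uses urljoin from the standard library to attach the host and replaces just the '/nfo/' path segment. BASE_URL and to_download_url are names made up for this illustration; the host is the one from the desired output in the question.

```python
from urllib.parse import urljoin

BASE_URL = "http://127.0.0.1"  # assumed host, taken from the question's desired output


def to_download_url(href: str) -> str:
    """Turn a relative '/book/nfo/...' link into an absolute download link."""
    # Replace only the first '/nfo/' path segment, so 'nfo' elsewhere is untouched.
    return urljoin(BASE_URL, href.replace("/nfo/", "/download/", 1))


print(to_download_url("/book/nfo/?id=4756888"))
```

Matching on '/nfo/' with the slashes avoids rewriting a link that merely contains the letters 'nfo' somewhere else.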
You could compile a regular expression and apply it in a list comprehension as follows:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('', 'html.parser')
re_s = re.compile(r'(.*?\/)nfo(\/.*?)').sub
hrefs = [re_s('http://127.0.0.1' + r'\1download\2', a.get('href')) for a in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
print(hrefs)
Giving you:
['http://127.0.0.1/book/download/?id=4756888']
You can prepend http://127.0.0.1 and replace 'nfo' with 'download' using Python's re.sub() function.
re.sub(r'pattern_to_match',r'replacement_string', string)
You can implement it as follows:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("""""", 'html.parser')
c = ['http://127.0.0.1'+b.get('href') for b in soup.find_all('a', href=re.compile(r'.*\?id\=\d{4,8}'))]
print([re.sub(r'nfo',r'download',q) for q in c ])
Output:
['http://127.0.0.1/book/download/?id=4756888']
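The snippets above build the soup from an empty string, so they cannot be run as posted. Here is a self-contained sketch with made-up markup standing in for the real page (the two links are assumptions, not the asker's HTML); it filters on the id pattern and rewrites the link in one list comprehension.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page from the question.
html = """
<a href="/book/nfo/?id=4756888">Book</a>
<a href="/about/">About</a>
"""

soup = BeautifulSoup(html, "html.parser")
id_link = re.compile(r"\?id=\d{4,8}")  # same filter idea as in the question

hrefs = [
    "http://127.0.0.1" + a["href"].replace("/nfo/", "/download/")
    for a in soup.find_all("a", href=id_link)  # only links whose href matches
]
print(hrefs)
```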
Related
I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 for example. To get the problem text, I just need the regular string text in the first <p> element and the latex in the <img> in the first <p> element.
My code so far:
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')
This works for getting the LaTeX equation, but there is more of the question text before it. Is there a way to get that other part of the question, which is "What is the value of"? I'm thinking of using a regex, but I want to see if Beautiful Soup has a feature that can get it for me.
Try using zip():
import requests
from bs4 import BeautifulSoup
URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break
Output:
What is the value of $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$
Edit:
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
Well BS4 seems to be a bit buggy. Took me a while to get this. Don't think that it is viable with these weird spacings and everything. A RegEx would be your best option. Let me know if this is good. Checked on the first 2 questions and they worked fine. The AMC does have some image problems with geometry, however, so I don't think it will work for those.
import bs4
import requests
import re
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    if elements[n][0] == "$":
        elements[n] = " "+elements[n]+" "
print(elements)
print("".join(elements))
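For what it's worth, Beautiful Soup can also do this without prettify() or a regex by walking the children of the paragraph in order. This is only a sketch: the fragment below is invented to have the same shape as the wiki paragraph (text, an img with LaTeX in its alt, more text), so no network access is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical fragment shaped like the wiki's problem paragraph.
html = '<p>What is the value of <img alt="$2+2$" src="x.png"/> when simplified?</p>'

p = BeautifulSoup(html, "html.parser").find("p")

parts = []
for child in p.children:
    if isinstance(child, NavigableString):
        parts.append(str(child))            # plain text between tags
    elif isinstance(child, Tag) and child.name == "img":
        parts.append(child.get("alt", ""))  # LaTeX source stored in alt

# Collapse runs of whitespace so the pieces read as one sentence.
question = " ".join("".join(parts).split())
print(question)
```

Because the children are visited in document order, text and formulas stay interleaved correctly even when a problem contains several images.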
I have the following bs4 element tag :
<span><span>some content</span> B</span>
The length of string B is unknown (I named it B for simplification).
How can I use BeautifulSoup to extract "B"? Or is my only option to extract the text and then use some regexp techniques?
Thanks
Edit : Complete code
def get_doc_yakarouler(license_plate,url = 'https://www.yakarouler.com/car_search/immat?immat='):
    response = requests.get(url+license_plate)
    content = response.content
    doc = BeautifulSoup(content,'html.parser')
    result = doc.span.text
    if 'identifié' in result :
        return doc
    else :
        return f"La plaque {license_plate} n'est pas recensé sur yakarouler"
doc = get_doc_yakarouler('AA300AA')
span = doc.find_all('span')
motorisation_tag = span[1]
I want to extract "1.6 TDI"
I found a solution using motorisation_tag.text.replace(u'\xa0', ' ').split(' ')[1], but I would like to know whether this is possible directly with bs4.
Assuming you have a variable span which represents the outer <span> tag, you can do the following to extract 'B': span.contents[1]. This works because .contents returns a list of the tag's children, in this case [<span>some content</span>, ' B'], so the 'B' text is the second element of that list. Be aware that if there is a space before B, as in your HTML sample, the space will be included in the string.
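A quick sketch of that on something closer to the real case. The markup here is a guess: the label text and the non-breaking space are assumptions based on the \xa0 that the question's workaround replaces, not the actual page source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: label text and non-breaking space are assumptions.
html = "<span><span>Motorisation :</span>\xa01.6 TDI</span>"

outer = BeautifulSoup(html, "html.parser").find("span")
trailing = outer.contents[1]      # the text node that follows the inner <span>
motorisation = trailing.strip()   # str.strip() also removes the non-breaking space
print(motorisation)
```

If the inner span is not always the first child, indexing with [1] becomes fragile, and filtering the contents for text nodes is the safer route.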
from bs4 import BeautifulSoup as bs , NavigableString
html = '<span><span>some content</span> B</span>'
soup = bs(html, 'html.parser')
span = soup.find("span")
# First approach: take the first text node that is a direct child of the tag
outer_text_1 = span.find(string=True, recursive=False)
# Second approach: loop through the tag's contents and keep only the text nodes, not the tags
outer_text_2 = ' '.join([t for t in span.contents if isinstance(t, NavigableString)])
print(outer_text_1) # output B
print(outer_text_2) # output B
I wanted to extract the data from tags which is coming in two forms :
<td><div><font> Something else</font></div></td>
and
<td><div><font> Something <br/>else</font></div></td>
I am using the .string property. In the first case it gives me the required string (Something else), but in the second case it gives me None.
Is there any better way or alternative way to do it?
Try using .text property instead of .string
from bs4 import BeautifulSoup
html1 = '<td><div><font> Something else</font></div></td>'
html2 = '<td><div><font> Something <br/>else</font></div></td>'
if __name__ == '__main__':
    soup1 = BeautifulSoup(html1, 'html.parser')
    div1 = soup1.select_one('div')
    print(div1.text.strip())
    soup2 = BeautifulSoup(html2, 'html.parser')
    div2 = soup2.select_one('div')
    print(div2.text.strip())
which outputs:
Something else
Something else
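The reason .string returns None in the second case is that the font tag then has more than one child (text, the br tag, more text), so there is no single string to return. A short sketch of that, plus get_text() with a separator, which also normalises the spacing around the br:

```python
from bs4 import BeautifulSoup

html2 = "<td><div><font> Something <br/>else</font></div></td>"
font = BeautifulSoup(html2, "html.parser").find("font")

single = font.string                               # None: several children
joined = font.get_text(separator=" ", strip=True)  # strip each piece, join with a space
pieces = list(font.stripped_strings)               # the individual text pieces

print(single)
print(joined)
print(pieces)
```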
You can always use a regular expression for such things!
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
print(result[1])
This will work for your case. To avoid capturing the tag, you need to manipulate the string.
Check with print("<br/>" in result[1]); if the string contains the tag it returns True, and in that case you need to drop the tag.
result = str(result[1]).split("<br/>") gives you the list [' Something ', 'else']; join the pieces to get your answer: result = (" ").join(result)
Here is the complete snippet:
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
if "<br/>" in result[1]:
    result = str(result[1]).split("<br/>")
    result = (" ").join(result)
    print(result)
else:
    print(result[1])
I understand this is a pretty poor solution, but it'll work for you!
I use this code:
from bs4 import BeautifulSoup
parser = BeautifulSoup(remote_data)
parse_data = parser.find_all('a')
for atag_data in parse_data:
    URL_list = atag_data.get('href')
When I try to split URL_list into an array:
array = str.split(URL_list)
I get these 3 arrays:
['index1.html']
['example.exe']
['document.doc']
But i need only one array like:
['index1.html','example.exe','document.doc']
Any suggestions please?
You don't get an array - it's a list!
Also, you should avoid naming variables like builtins.
Regarding your question:
from bs4 import BeautifulSoup
parser = BeautifulSoup(remote_data)
link_list = [a['href'] for a in parser.find_all('a')]
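One caveat with a['href']: it raises KeyError for an a tag that has no href attribute. A sketch that skips such tags by passing href=True to find_all; the remote_data string below is invented sample content, not the asker's page.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for remote_data.
remote_data = """
<a href="index1.html">Index</a>
<a href="example.exe">Program</a>
<a name="anchor-without-href">Anchor</a>
<a href="document.doc">Document</a>
"""

soup = BeautifulSoup(remote_data, "html.parser")
# href=True keeps only the <a> tags that actually carry an href attribute.
link_list = [a["href"] for a in soup.find_all("a", href=True)]
print(link_list)
```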
I am trying to sort through HTML tags and I can't seem to get it right.
What I have done so far
import urllib
import re
s = raw_input('Enter URL: ')
f = urllib.urlopen(s)
s = f.read()
f.close()
r = re.compile('<TAG\b[^>]*>(.*?)</TAG>',)
result = re.findall(r, s)
print(result)
Where I replace "TAG" with tag I want to see.
Thanks in advance.
You should really try using libraries which can perform HTML parsing out of the box. Beautiful Soup is one of my favorites.
An example from BS is this
from bs4 import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
soup.find_all('b')
[<b>one</b>, <b>two</b>]
As for a regular expression, you can use
aa = doc[0]
aa
'<html><head><title>Page title</title></head>'
pt = re.compile('(?<=<title>).*?(?=</title>)')
re.findall(pt,aa)
['Page title']
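One thing worth checking in the question's pattern, independent of the library choice: in a normal (non-raw) string literal, '\b' is a backspace character, not a regex word boundary, so the pattern never matches ordinary HTML. A small sketch with a made-up input string:

```python
import re

# Hypothetical input: one real <b> tag and one <body> tag that must not match.
html = '<b class="x">one</b> <body>two</body>'

plain = re.compile('<b\b[^>]*>(.*?)</b>')   # '\b' here is a backspace character
raw = re.compile(r'<b\b[^>]*>(.*?)</b>')    # r'\b' is a regex word boundary

plain_hits = plain.findall(html)
raw_hits = raw.findall(html)
print(plain_hits)
print(raw_hits)
```

With the raw string, the word boundary also keeps the <b pattern from matching the start of <body>.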
I'm not entirely clear on what you are trying to achieve with the regex. Capturing the contents between two div tags for instance works with
re.compile("<div.*?>.*?</div>")
Although you will run into some problems with nested divs with the above one.
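To make the nesting problem concrete, here is a small sketch on an invented fragment. The non-greedy regex stops at the first closing tag, so the outer div is cut short, while a parser pairs the tags properly.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with one div nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# The regex ends the match at the first </div>, which belongs to the inner div.
regex_hits = re.findall(r"<div.*?>.*?</div>", html)
print(regex_hits)

# The parser matches each opening tag with its own closing tag.
soup = BeautifulSoup(html, "html.parser")
div_texts = [div.get_text() for div in soup.find_all("div")]
print(div_texts)
```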