Using RegEx to identify emails from Beautiful Soup - python

I am a beginner working on a program that can scrape emails from a given website. The code is as follows:
import requests, bs4, re
print('Fetching Website...')
res = requests.get('https://examplewebsite.com')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
type(soup)
my_list = []
for link in soup.find_all('a'):
    my_list.append(link.get('href'))
emailregex = re.compile(r'''(
    [a-zA-Z0-9._%+-:]+
    @
    [a-zA-Z0-9.-]+
    \.[a-zA-Z]{2,4}
    )''', re.VERBOSE)
newlist = list(filter(emailregex.search, my_list))
print(newlist)
print('---Done---')
When I run the code, however, I get an error: "TypeError: expected string or bytes-like object". I found that if I do:
newlist = list(filter(emailregex.search, str(my_list)))
print(newlist)
The error goes away, but my "newlist" doesn't contain any results. I have verified that "my_list" does return a list of expected results. If I print "my_list", paste its contents into a new file as a list, and run the same code, it works just fine, so I don't believe it's an issue with the regex. I think it might be something with the data type in "my_list"? I don't really have any good ideas, so any help at all would be appreciated.
Thank you
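
The "TypeError: expected string or bytes-like object" is raised because my_list does not contain only strings: link.get('href') returns None for any <a> tag that has no href attribute. Calling str(my_list) does not fix this. It converts the whole list into one big string, so filter() then iterates over its individual characters, and no single character can match the regex.
print(str(my_list)) # the whole list as one string
print(type(str(my_list))) # output: <class 'str'>
You need to convert every item of my_list to a string and then try again: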
my_list = list(map(str, my_list))
newlist = list(filter(emailregex.search, my_list))
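A minimal sketch of the same idea that skips the missing values instead of turning them into the string 'None'. The hrefs list is made-up sample data standing in for what link.get('href') returns, and the pattern is a simplified assumption for illustration, not a complete email validator.

```python
import re

# Hypothetical sample of what link.get('href') can return; None appears
# when an <a> tag has no href attribute.
hrefs = ["mailto:info@example.com", None, "/about", "mailto:sales@example.org"]

# Simplified email pattern (an assumption for illustration only).
email_re = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

emails = []
for href in hrefs:
    if not href:  # skip None and empty strings
        continue
    match = email_re.search(href)
    if match:
        # Keep only the matched address, not the whole href.
        emails.append(match.group(0))

print(emails)
```

Collecting match.group(0) rather than filtering the hrefs also drops the "mailto:" prefix, which may or may not be what you want.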

import requests
from bs4 import BeautifulSoup
import re
def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Join the hrefs with newlines so adjacent links cannot merge into one match.
    target = "\n".join([item.get("href")
                        for item in soup.find_all("a", href=True)])
    # re.findall takes (pattern, string, flags), in that order.
    matches = re.findall(
        r'''[a-zA-Z0-9._%+-:]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}''', target, re.VERBOSE)
    for match in matches:
        print(match)
main("https://www.example.com")

Related

Recording web scraping data

Hi everyone. Although I got the data I was looking for as text, it simply doesn't work when I try to store it in a list or convert it into a dataframe. What I got was a huge list containing only one value, which is the last text line of the data, i.e. the number '9.054.333,18'. Can anyone help me, please? I need to organize all this data in a list or dataframe.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
html = urlopen('http://www.b3.com.br/pt_br/market-data-e-indices/servicos-de-dados/market-data/consultas/mercado-a-vista/termo/posicoes-em-aberto/posicoes-em-aberto-8AA8D0CC77D179750177DF167F150965.htm?data=16/04/2021&f=0#conteudo-principal')
soup = BeautifulSoup(html.read(), 'html.parser')
texto = soup.find_all('td')
for t in texto:
    print(t.text)
lista=[]
for i in soup.find_all('td'):
    lista.append(t.text)
print(lista)
Your iterators are wrong -- you're using i in the last loop while appending t.text.
You can just use a list comprehension:
# ...
soup = BeautifulSoup(html.read(), 'html.parser')
lista = [t.text for t in soup.find_all('td')]
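Once the cell texts are in a flat list, they can be grouped into rows. The sketch below assumes the table has a fixed number of columns (3 here, a hypothetical value) and uses made-up cell values; only the number format follows the question.

```python
# Hypothetical flat list of <td> texts, as produced by
# [t.text for t in soup.find_all('td')]; the values are made up.
cells = ["ABCD3", "1.200", "9.054.333,18",
         "EFGH4", "350", "12.500,00"]

N_COLS = 3  # assumed number of columns per table row

# Slice the flat list into consecutive chunks of N_COLS cells.
rows = [cells[i:i + N_COLS] for i in range(0, len(cells), N_COLS)]


def parse_br_number(text):
    """Convert a Brazilian-formatted number such as '9.054.333,18' to float."""
    return float(text.replace(".", "").replace(",", "."))


for ticker, quantity, value in rows:
    print(ticker, int(parse_br_number(quantity)), parse_br_number(value))
```

If pandas is available, a list of rows like this can be passed to pandas.DataFrame(rows, columns=[...]) with column names of your choosing.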

How to get the content of a tag with a Beautiful Soup?

I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 for example. To get the problem text, I just need the regular string text in the first <p> element and the latex in the <img> in the first <p> element.
My code so far:
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')
It works for the LaTeX equation, but there are more parts of the question before it, such as the text "What is the value of". Is there a way to get that other part of the question? I'm thinking of using a regex, but I want to see if Beautiful Soup has a feature that can get it for me.
Try using zip():
import requests
from bs4 import BeautifulSoup
URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break
Output:
What is the value of $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$
Edit:
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
Well BS4 seems to be a bit buggy. Took me a while to get this. Don't think that it is viable with these weird spacings and everything. A RegEx would be your best option. Let me know if this is good. Checked on the first 2 questions and they worked fine. The AMC does have some image problems with geometry, however, so I don't think it will work for those.
import bs4
import requests
import re
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    if elements[n][0] == "$":
        elements[n] = " "+elements[n]+" "
print(elements)
print("".join(elements))
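Another way to combine the plain text with the LaTeX in document order is to walk the children of the paragraph directly. This is only a sketch: the HTML string is a made-up stand-in for the wiki page (the real markup may differ), and it assumes the LaTeX always lives in the alt attribute of an <img>.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup that mimics the structure described in the question.
html = (
    '<div class="mw-parser-output"><p>What is the value of '
    '<img alt="$\\frac{1}{2}$" src="x.png"/> when '
    '<img alt="$a=2$" src="y.png"/>?</p></div>'
)

soup = BeautifulSoup(html, "html.parser")
p = soup.select_one(".mw-parser-output p")

parts = []
for child in p.children:
    if isinstance(child, NavigableString):
        # Plain text between the images.
        parts.append(str(child))
    elif child.name == "img":
        # The LaTeX source is stored in the alt attribute.
        parts.append(child.get("alt", ""))

question = "".join(parts)
print(question)
```

Because the children are visited in order, this also copes with paragraphs that contain several equations, which zip() over two separate selections does not guarantee.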

Beautiful Soup : How to extract data from HTML Tags from inconsistent data

I wanted to extract the data from tags which is coming in two forms :
<td><div><font> Something else</font></div></td>
and
<td><div><font> Something <br/>else</font></div></td>
I am using the .string attribute: in the first case it gives me the required string (Something else), but in the second case it gives me None.
Is there any better way or alternative way to do it?
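Try using the .text property instead of .string. .string returns None when a tag has more than one child, which is the case once the <br/> splits the text in two.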
from bs4 import BeautifulSoup
html1 = '<td><div><font> Something else</font></div></td>'
html2 = '<td><div><font> Something <br/>else</font></div></td>'
if __name__ == '__main__':
    soup1 = BeautifulSoup(html1, 'html.parser')
    div1 = soup1.select_one('div')
    print(div1.text.strip())
    soup2 = BeautifulSoup(html2, 'html.parser')
    div2 = soup2.select_one('div')
    print(div2.text.strip())
which outputs:
Something else
Something else
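For comparison, here is a small sketch of how .string, get_text() and stripped_strings behave on the second snippet. It uses only the HTML from the question; the variable names are arbitrary.

```python
from bs4 import BeautifulSoup

html = '<td><div><font> Something <br/>else</font></div></td>'
font = BeautifulSoup(html, 'html.parser').find('font')

# The <font> tag has three children (text, <br/>, text), so there is
# no single string to return.
print(font.string)

# get_text() concatenates all text fragments; separator and strip
# control how the fragments are joined and trimmed.
text = font.get_text(separator=' ', strip=True)
print(text)

# stripped_strings yields each non-empty fragment separately.
fragments = list(font.stripped_strings)
print(fragments)
```

Using separator=' ' keeps the two fragments apart even when the source has no space around the <br/>.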
You can always use a regular expression for such things!
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
print(result[1])
This will work in your case. To avoid capturing the tag, you need to manipulate the string.
Check with print("<br/>" in result[1]): if the string contains the tag, it returns True, and in that case you need to drop the tag.
result = str(result[1]).split("<br/>") gives you the list [' Something ', 'else']; join the pieces to get your answer: result = (" ").join(result)
Here is the complete snippet:
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
if "<br/>" in result[1]:
    result = str(result[1]).split("<br/>")
    result = (" ").join(result)
    print(result)
else:
    print(result[1])
I understand this is a pretty poor solution, but it'll work for you!

How to search for matched string then extract the string after it and a colon

I'm new to Python and web scraping, so I apologize if the question is too basic!
I want to extract the "score" and "rate" (rating) from the following example BeautifulSoup object
import bs4
import re
text = '<html><body>{"count":1,"results":[{"score":"2-1","MatchId":{"number":"889349"},"name":"Match","rating":{"rate":9.0}}],"performance":{"comment":{}}}</body></html>'
page = bs4.BeautifulSoup(text, "lxml")
print type(page)
I have tried these but nothing showed up (just blank [])
tmp = page.find_all(text=re.compile("score:(.*)"));
print(tmp)
tmp = page.findAll("score");
print(tmp)
I found this similar question but it gave me error
tmp = page.findAll(text = lambda(x): x.lower.index('score') != -1)
print(tmp)
AttributeError: 'builtin_function_or_method' object has no attribute 'index'
What did I do wrong? Thanks in advance!
This is two thirds of the way to a turducken of protocols. You can use BeautifulSoup to find the body text and decode that with json. Then you have plain Python dicts and lists to work through.
>>> import json
>>> import bs4
>>> import re
>>> text = '<html><body>{"count":1,"results":[{"score":"2-1","MatchId":{"number":"889349"},"name":"Match","rating":{"rate":9.0}}],"performance":{"comment":{}}}</body></html>'
>>> page = bs4.BeautifulSoup(text, "lxml")
>>>
>>> data = json.loads(page.find('body').text)
>>> for result in data["results"]:
... print(result["score"], result["rating"]["rate"])
...
2-1 9.0
>>>
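The attempts in the question return [] because find_all("score") looks for <score> tags, while "score" here is only a key inside the JSON text. A slightly more defensive sketch of the decoding step follows; body_text is assumed to be what page.find('body').text returns, and the .get() fallbacks are an illustrative choice for records that lack a key.

```python
import json

# Body text as it would come from page.find('body').text.
body_text = (
    '{"count":1,"results":[{"score":"2-1","MatchId":{"number":"889349"},'
    '"name":"Match","rating":{"rate":9.0}}],"performance":{"comment":{}}}'
)

data = json.loads(body_text)

pairs = []
for result in data.get("results", []):
    score = result.get("score")
    # Falls back to None when "rating" or "rate" is missing.
    rate = result.get("rating", {}).get("rate")
    pairs.append((score, rate))

print(pairs)
```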

Why my regex doesn't work with BeautifulSoup?

I am parsing an HTML file and would like to match everything between two sequences of characters: Sent: and the <br> tag.
I have seen several very similar questions and tried all of their methods and none have worked for me, probably because I'm a novice and am doing something very simple incorrectly.
Here's my relevant code:
for filename in os.listdir(path): #capture email year, month, day
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            html = f.read()
        soup = BeautifulSoup(html, 'html.parser')
        a = re.findall(r'Sent:/.+?(?=<br>)/', soup.text)[0]
        #a = re.findall(r'Sent:(.*)', soup.text)[0]
        print(a)
        d = parser.parse(a)
        print("year:", d.year)
        print("month:", d.month)
        print("day:", d.day)
and I've also tried these for my RegEx: a = re.findall(r'Sent:/^(.*?)<br>/', soup.text)[0] and a = re.findall(r'Sent:/^[^<br>]*/', soup.text)[0]
But I keep getting the error list index out of range.... but even when I remove the [0] I get the error AttributeError: 'list' object has no attribute 'read' on the line d = parser.parse(a).... with only [] printed as a result of print(a)
Here's the relevant block of HTML:
<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br><b>To:</b> David Leveille<br><b>Subject:</b>
The problem is not really your regex, but the fact that BeautifulSoup parses the HTML (its job, after all) and changes its content. For example, your <br> is written back out as <br/>. Another point: soup.text strips all the tags, so a regex that looks for <br> can no longer match.
Running this script makes it clearer:
from bs4 import BeautifulSoup
import re
from dateutil import parser
pattern = re.compile(r'Sent:(.+?)(?=<br/>)')
with open("myfile.html", 'r') as f:
    html = f.read()
print("html: ", html)
soup = BeautifulSoup(html, 'lxml')
print("soup.text: ", soup.text)
print("str(soup): ", str(soup))
a = pattern.findall(str(soup))[0]
print("pattern extraction: ", a)
For the second part: since the extracted date string is not clean (it starts with the leftover </b>), you should pass the parameter fuzzy=True, as explained in the dateutil documentation.
d = parser.parse(a, fuzzy=True)
print("year:", d.year)
print("month:", d.month)
print("day:", d.day)
Another solution would be to use a more precise regex. For example :
pattern = re.compile(r'Sent:</b>(.+?)(?=<br/>)')
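As a sketch of the "more precise regex" idea, the pattern below captures only the date text and accepts <br>, <br/> and <br />, so it works on the raw file as well as on str(soup). It swaps dateutil for the standard library's datetime.strptime, which assumes the date always has exactly the format shown in the question and that English day and month names are in effect (the default C locale).

```python
import re
from datetime import datetime

# The HTML fragment from the question.
html = ('<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br>'
        '<b>To:</b> David Leveille<br><b>Subject:</b>')

# Skip the closing </b> and surrounding whitespace; stop at any <br> variant.
pattern = re.compile(r'Sent:</b>\s*(.+?)\s*<br\s*/?>')

match = pattern.search(html)
if match is None:
    # Avoids the "list index out of range" style of failure.
    raise ValueError("no 'Sent:' line found")

sent = datetime.strptime(match.group(1), "%A, %B %d, %Y %I:%M %p")
print(sent.year, sent.month, sent.day)
```

Checking the match before using it gives a clear error message when a file does not contain a Sent: line.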
Try this. It also handles a <br> tag that contains a slash, as in <br/>.
/Sent:(.*?)<br\s*\/?>/
You don't need the surrounding slashes or the slash escapes in Python. Note that soup.text no longer contains any tags, so search the raw HTML instead:
a = re.findall(r"Sent:(.*?)<br>", html)[0]
That being said, you should probably check the output (or at least use try/except) before trying to get a value from it.
Can you please replace your regex with the one below that looks for the key terms and then anything between them and tell me what error if any you are now receiving?
a=re.findall(r"Sent:(.*?)<br>", soup.text)[0]
