Why doesn't my regex work with BeautifulSoup? - python

I am parsing an HTML file and would like to match everything between two sequences of characters: Sent: and the <br> tag.
I have seen several very similar questions and tried all of their methods and none have worked for me, probably because I'm a novice and am doing something very simple incorrectly.
Here's my relevant code:
for filename in os.listdir(path): #capture email year, month, day
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            html = f.read()
        soup = BeautifulSoup(html, 'html.parser')
        a = re.findall(r'Sent:/.+?(?=<br>)/', soup.text)[0]
        #a = re.findall(r'Sent:(.*)', soup.text)[0]
        print(a)
        d = parser.parse(a)
        print("year:", d.year)
        print("month:", d.month)
        print("day:", d.day)
and I've also tried these for my RegEx: a = re.findall(r'Sent:/^(.*?)<br>/', soup.text)[0] and a = re.findall(r'Sent:/^[^<br>]*/', soup.text)[0]
But I keep getting the error list index out of range.... but even when I remove the [0] I get the error AttributeError: 'list' object has no attribute 'read' on the line d = parser.parse(a).... with only [] printed as a result of print(a)
Here's the relevant block of HTML:
<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br><b>To:</b> David Leveille<br><b>Subject:</b>

The problem is not really your regex, but the fact that BeautifulSoup parses the HTML (its job, after all) and changes its content when it writes it back out. For example, your <br> will be serialized as <br/>. Another point: soup.text strips all the tags, so a regex that looks for <br> in it can never match.
This script makes the difference clearer:
from bs4 import BeautifulSoup
import re
from dateutil import parser
pattern = re.compile(r'Sent:(.+?)(?=<br/>)')
with open("myfile.html", 'r') as f:
    html = f.read()
print("html: ", html)
soup = BeautifulSoup(html, 'lxml')
print("soup.text: ", soup.text)
print("str(soup): ", str(soup))
a = pattern.findall(str(soup))[0]
print("pattern extraction: ", a)
For the second part: since the extracted date string is not clean (it still starts with the leftover </b>), you should add the parameter fuzzy=True, as explained in the dateutil documentation.
d = parser.parse(a, fuzzy=True)
print("year:", d.year)
print("month:", d.month)
print("day:", d.day)
Another solution would be to use a more precise regex that skips the closing </b>. For example:
pattern = re.compile(r'Sent:</b>(.+?)(?=<br/>)')
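As an alternative to running a regex over the serialized soup, the date can probably be read straight from the parse tree. The following is only a sketch: it assumes the markup looks like the block in the question, uses the built-in html.parser, and swaps dateutil for datetime.strptime with a format string that is my guess at the layout of the Sent: line.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = ("<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br>"
        "<b>To:</b> David Leveille<br><b>Subject:</b>")

soup = BeautifulSoup(html, "html.parser")

# Find the <b> element whose text is exactly "Sent:" ...
label = soup.find("b", string="Sent:")
# ... and take the text node right after it, which ends at the <br>.
sent_text = str(label.next_sibling).strip()
print(sent_text)

# Assumed format: weekday, month name, day, year, 12-hour time.
d = datetime.strptime(sent_text, "%A, %B %d, %Y %I:%M %p")
print("year:", d.year)
print("month:", d.month)
print("day:", d.day)
```

Because no tags end up in sent_text, neither the <br> versus <br/> difference nor the fuzzy option matters here.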

Try this. It also allows for a <br> tag that contains a slash (<br/> or <br />).
r'Sent:(.*?)<br\s*/?>'

In Python you don't need the usual surrounding slashes in the pattern:
a = re.findall(r"Sent:(.*?)<br>", soup.text)[0]
That being said, you should probably check for the output (or at least use try/except) before trying to get a value from it.
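A minimal sketch of such a check, using only the re module. The helper name extract_sent is hypothetical, and the pattern is one possible choice: it is applied to the raw HTML rather than soup.text and accepts <br>, <br/> and <br />.

```python
import re

# Sample markup copied from the question.
html = ("<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br>"
        "<b>To:</b> David Leveille<br><b>Subject:</b>")


def extract_sent(markup):
    """Return the text between 'Sent:</b>' and the next <br>, or None."""
    match = re.search(r"Sent:</b>\s*(.*?)\s*<br\s*/?>", markup)
    # re.search returns None when nothing matches, so test before using it.
    return match.group(1) if match else None


sent = extract_sent(html)
if sent is None:
    print("no Sent: line found")
else:
    print(sent)
```

Testing the match object first avoids the "list index out of range" error that indexing an empty findall() result with [0] produces.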

Can you please replace your regex with the one below that looks for the key terms and then anything between them and tell me what error if any you are now receiving?
a=re.findall(r"Sent:(.*?)<br>", soup.text)[0]


How to get the content of a tag with a Beautiful Soup?

I'm trying to extract questions from various AMC tests. Consider https://artofproblemsolving.com/wiki/index.php/2002_AMC_10B_Problems/Problem_1 for example. To get the problem text, I just need the regular string text in the first <p> element and the latex in the <img> in the first <p> element.
My code so far:
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
latex_equation = soup.select('p img')[0].get('alt')
It works for the LaTeX equation, but there is more of the question before it, as plain text. Is there a way to get the other part of the question, which is "What is the value of"? I'm thinking of using a regex, but I want to see if Beautiful Soup has a feature that can get it for me.
Try using zip():
import requests
from bs4 import BeautifulSoup
URL = "https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select_one(".mw-parser-output p"), soup.select("p img")):
    print(text, tag.get("alt"))
    break
Output:
What is the value of $\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$
Edit:
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for text, tag in zip(soup.select(".mw-parser-output p"), soup.select("p img")):
    print(text.text.strip(), tag.get("alt"))
Well BS4 seems to be a bit buggy. Took me a while to get this. Don't think that it is viable with these weird spacings and everything. A RegEx would be your best option. Let me know if this is good. Checked on the first 2 questions and they worked fine. The AMC does have some image problems with geometry, however, so I don't think it will work for those.
import bs4
import requests
import re
res = requests.get('https://artofproblemsolving.com/wiki/index.php/2016_AMC_10B_Problems/Problem_1')
soup = bs4.BeautifulSoup(res.content, 'html.parser').find('p')
elements = [i for i in soup.prettify().split("\n") if i][1:-2]
latex_reg = re.compile(r'alt="(.*?)"')
for n, i in enumerate(elements):
    mo = latex_reg.search(i)
    if mo:
        elements[n] = mo.group(1)
    elements[n] = re.sub(' +', ' ', elements[n]).lstrip()
    if elements[n][0] == "$":
        elements[n] = " "+elements[n]+" "
print(elements)
print("".join(elements))
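Another option that may avoid both zip() and prettify() is to walk the children of the first <p> in document order, keeping text nodes as they are and substituting each <img> with its alt attribute. This is a sketch only: the HTML below is a made-up stand-in for the wiki page, not the real markup.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified stand-in for the problem page.
html = r'''<div class="mw-parser-output"><p>What is the value of <img alt="$\frac{2a^{-1}+\frac{a^{-1}}{2}}{a}$" src="x.png"/> when <img alt="$a=\tfrac{1}{2}$" src="y.png"/>?</p></div>'''

soup = BeautifulSoup(html, "html.parser")
first_p = soup.select_one(".mw-parser-output p")

parts = []
for child in first_p.children:
    if isinstance(child, Tag):
        if child.name == "img":
            # Images carry the LaTeX source in their alt attribute.
            parts.append(child.get("alt", ""))
        else:
            # Any other inline tag: keep just its text.
            parts.append(child.get_text())
    else:
        # Plain text node between the tags.
        parts.append(str(child))

question = "".join(parts).strip()
print(question)
```

Since the children come back in document order, the text and the LaTeX stay interleaved the way they appear on the page.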

Using RegEx to identify emails from Beautiful Soup

I am a beginner working on a program that can scrape emails from a given website. The code is as follows:
import requests, bs4, re
print('Fetching Website...')
res = requests.get('https://examplewebsite.com')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
type(soup)
my_list = []
for link in soup.find_all('a'):
    my_list.append(link.get('href'))
emailregex = re.compile(r'''(
    [a-zA-Z0-9._%+-:]+
    @
    [a-zA-Z0-9.-]+
    \.[a-zA-Z]{2,4}
    )''', re.VERBOSE)
newlist = list(filter(emailregex.search, my_list))
print(newlist)
print('---Done---')
When I run the code, however, I get an error: "TypeError: expected string or bytes-like object". I found that if I do:
newlist = list(filter(emailregex.search, str(my_list)))
print(newlist)
The error will go away, but my "newlist" doesn't contain any results. I have verified that "my_list" does return a list of expected results. I found that if I print "my_list" and paste its contents into a new file where I add it to a list run the same code, it works just fine, so I don't believe its an issue with the Regex. I think it might be something with the data-type in "my_list"? I don't really have any good ideas, so any help at all would be appreciated.
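Thank you
"TypeError: expected string or bytes-like object" is raised because my_list does not contain only strings: link.get('href') returns None for any <a> tag without an href attribute. str(my_list), on the other hand, converts the whole list into one big string, so filter() then iterates over its individual characters, and no single character matches the pattern.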
print(str(my_list)) # this is a string
print(type(str(my_list))) # output: str
You need to convert every item of my_list to a string and then try again:
my_list = list(map(str, my_list))
newlist = list(filter(emailregex.search, my_list))
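Instead of converting None to the string "None", the non-string entries could simply be skipped. The sketch below uses a made-up my_list in place of the scraped hrefs, and a slightly different character class from the question (without the ':'), which is my own choice so that the "mailto:" prefix is not captured.

```python
import re

# Hypothetical href values as link.get('href') might return them;
# None stands for an <a> tag that has no href attribute.
my_list = ["mailto:info@example.com", None, "/about",
           "mailto:sales@example.org"]

email_regex = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}")

emails = []
for href in my_list:
    if not isinstance(href, str):
        continue  # skip None instead of passing it to the regex
    match = email_regex.search(href)
    if match:
        emails.append(match.group())

print(emails)
```

Collecting match.group() rather than filtering the list also returns just the address instead of the whole href.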
import requests
from bs4 import BeautifulSoup
import re
def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # join with a space so that neighbouring hrefs cannot run together
    target = " ".join([item.get("href")
                       for item in soup.find_all("a", href=True)])
    # re.findall(pattern, string): the string is the second argument
    matches = re.findall(
        r'[a-zA-Z0-9._%+-:]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}', target)
    for match in matches:
        print(match)
main("https://www.example.com")

How to match and extract using regex - Python

I'm taking a look how to use regex and trying to figure out how to extract the Latitude and Longitude, no matter if the number is positive or negative, right after the "?ll=" as shown below:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
I have used the following code in python to get only the first digits marked above:
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    print(lnk)
    m = re.match('-?\d+(?!.*ll=)(?!&q=loc)*', lnk)
    print(m)
    #lat, *long = m.split(',')
    #print(lat)
    #print(long)
The result I got isn't what I was expecting:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
None
I'm getting "None" rather than the value "-6.148222,106.8462". I also tried to split those numbers into two variables called lat and long, but since I always got "None" python stops processing with "exit code 1" until I comment lines.
Cheers,
You should use re.search() instead of re.match(), because re.match() only matches at the beginning of the string, and your URL starts with "https", not with a number.
This can solve the problem
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    m = re.search(r"(-?\d*\.\d*),(-?\d*\.\d*)", lnk)
    print(m.group())
    print("lat = "+m.group(1))
    print("lng = "+m.group(2))
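To make the match/search difference concrete, and to tie the pattern to the ll= parameter rather than to the first pair of numbers in the URL, something like the following sketch might work. The variable names are mine and the URL is the one from the question.

```python
import re

url = ("https://maps.google.com/maps?ll=-6.148222,106.8462"
       "&q=loc:-6.148222,106.8462&")

# An optionally negative number with an optional fractional part.
number = r"-?\d+(?:\.\d+)?"
# Only accept the pair that directly follows "?ll=" or "&ll=".
pattern = re.compile(rf"[?&]ll=({number}),({number})")

# re.match only tries position 0, where the URL has "h", so it fails.
print(re.match(number, url))

# re.search scans the whole string for the first match.
m = pattern.search(url)
if m:
    lat, lng = float(m.group(1)), float(m.group(2))
    print("lat =", lat)
    print("lng =", lng)
```

Checking m before calling group() also covers links that have no ll= parameter at all.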
I'd use a proper URL parser; using a regex here is asking for trouble if the URL embedded in the page you are crawling changes in a way that breaks the regex.
from urllib.parse import urlparse, parse_qs
url = 'https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&'
scheme, netloc, path, params, query, fragment = urlparse(url)
# or just
# query = urlparse(url).query
parsed_query_string = parse_qs(query)
print(parsed_query_string)
lat, long = parsed_query_string['ll'][0].split(',')
print(lat)
print(long)
outputs
{'ll': ['-6.148222,106.8462'], 'q': ['loc:-6.148222,106.8462']}
-6.148222
106.8462
Use separate regexes for latitude and longitude (note the escaped dot, so that it only matches a literal "."):
import re
str1="https://maps.google.com/maps?ll=6.148222,-106.8462&q=loc:-6.148222,106.8462&"
lat=re.search(r"-?\d+\.\d+",str1).group()
lon=re.search(r",-?\d+\.\d+",str1).group()
print(lat)
print(lon[1:])
output
6.148222
-106.8462

Beautiful Soup : How to extract data from HTML Tags from inconsistent data

I wanted to extract the data from tags which is coming in two forms :
<td><div><font> Something else</font></div></td>
and
<td><div><font> Something <br/>else</font></div></td>
I am using .string() method where in the first case it gives me the required string (Something else) but in the second case, it gives me None.
Is there any better way or alternative way to do it?
Try using .text property instead of .string
from bs4 import BeautifulSoup
html1 = '<td><div><font> Something else</font></div></td>'
html2 = '<td><div><font> Something <br/>else</font></div></td>'
if __name__ == '__main__':
    soup1 = BeautifulSoup(html1, 'html.parser')
    div1 = soup1.select_one('div')
    print(div1.text.strip())
    soup2 = BeautifulSoup(html2, 'html.parser')
    div2 = soup2.select_one('div')
    print(div2.text.strip())
which outputs:
Something else
Something else
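For comparison, here is a small sketch of why .string gives None in the second case, plus a variant using get_text() with a separator. It reuses the two snippets from the question; collecting the output in a results list is just for illustration.

```python
from bs4 import BeautifulSoup

html1 = '<td><div><font> Something else</font></div></td>'
html2 = '<td><div><font> Something <br/>else</font></div></td>'

results = []
for html in (html1, html2):
    font = BeautifulSoup(html, 'html.parser').find('font')
    # .string is only set when the tag has exactly one child;
    # with text + <br/> + text there are three, so it is None.
    print(repr(font.string))
    # get_text() joins every text fragment; the separator is put
    # between fragments and strip=True trims each of them.
    results.append(font.get_text(" ", strip=True))

print(results)
```

The separator matters when the markup has no space around the <br/>: for "Something<br/>else", .text would give "Somethingelse", while get_text(" ", strip=True) keeps the two words apart.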
You can always use a regular expression for such things!
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
print(result[1])
This will work in your case. To avoid capturing the tag, you need to manipulate the string.
Check via print("<br/>" in result[1]): if the string contains the tag it prints True, and in that case you need to drop the tag.
result = str(result[1]).split("<br/>") gives you a list [' Something ', 'else']; join the parts to get your answer: result = (" ").join(result)
Here is the complete snippet:
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
if "<br/>" in result[1]:
    result = str(result[1]).split("<br/>")
    result = (" ").join(result)
    print(result)
else:
    print(result[1])
I understand this is a pretty poor solution, but it'll work for you!

Why doesn't this function return the same output in both situations(webscraping project)?

import requests
import re
from bs4 import BeautifulSoup
#The website I like to get, converts the contents of the web page to lxml format
base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content, "lxml")
#Modifies the given string to look visually good. Like this:
#['21 / JulZaterdag2018'] becomes 21 Jul 2018
def remove_char(string):
    #All blacklisted characters and words
    blacklist = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
                 "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]
    #Replace every blacklisted character with white space
    for char in blacklist:
        string = string.replace(char,' ')
    #Replace more than 2 consecutive white spaces
    string = re.sub("\s\s+", " ", string)
#Gets the date of the festival I'm interested in
def get_date_info():
    #Makes a list for the data
    raw_info = []
    #Adds every "div" with a certain name to list, and converts it to text
    for link in soup.find_all("div", {"class": "event-single-data"}):
        raw_info.append(link.text)
    #Converts list into string, because remove_char() only accepts strings
    raw_info = str(raw_info)
    #Modifies the string as explained above
    final_date = remove_char(raw_info)
    #Prints the date in this format: 21 Jul 2018(example)
    print(final_date)
get_date_info()
Hi there! So I'm currently working on a little webscraping project. I thought I had a good idea and I wanted to get more experienced with Python. What it basically does is it gets festival information like date, time and price and puts it in a little text file. I'm using BeautifulSoup to navigate and edit the web page. Link is down there!
But now I'm kinda running into a problem. I can't figure out what's wrong. Maybe I'm totally looking over it. So when I run this program it should give me this: 21 Jul 2018. But instead it returns 'None'. For some reason every character in the string gets removed.
I tried running remove_char() on its own, with the same list(converted it to string first) as input. This worked perfectly. It returned "21 Jul 2018" like it was supposed to do. So I'm quite sure the error is not in this function.
So somehow I'm missing something. Maybe it has to do with BeautifulSoup and how it handles things?
Hope someone can help me out!
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Web page:
https://festivalfans.nl/event/dominator-festival
You forgot to return the value in the remove_char() function.
That's it!
Neither of your functions has a return statement, and so return None by default. remove_char() should end with return string for example.
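A small sketch of the difference, without any network access. remove_char follows the logic in the question with the missing return added (plus a strip() of my own); remove_char_no_return is a hypothetical name for the same kind of function without a return.

```python
import re

BLACKLIST = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
             "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]


def remove_char_no_return(string):
    # The local name is rebound, but nothing is handed back,
    # so the caller receives None.
    for item in BLACKLIST:
        string = string.replace(item, " ")


def remove_char(string):
    for item in BLACKLIST:
        string = string.replace(item, " ")
    # Collapse runs of whitespace and trim both ends.
    string = re.sub(r"\s+", " ", string).strip()
    return string  # hand the cleaned string back to the caller


raw = "['21 / JulZaterdag2018']"
print(remove_char_no_return(raw))
print(remove_char(raw))
```

Strings are immutable, so the reassignments inside the function never change the caller's variable; the return statement is the only way the result gets out.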
import requests
from bs4 import BeautifulSoup
base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content , "html.parser")
def get_date_info():
    for link in soup.find_all("div", {"class": "event-single-data"}):
        day = link.find('div', {"class":"event-single-day"}).text.replace(" ", '')
        month = link.find('div', {"class": "event-single-month"}).text.replace('/', "").replace(' ', '')
        year = link.find('div', {"class": "event-single-year"}).text.replace(" ", '')
        print(day, month, year)
get_date_info()
Here is simpler code that does not need re.
