How to match and extract using regex - Python

I'm learning how to use regex and trying to figure out how to extract the latitude and longitude, whether the number is positive or negative, right after the "?ll=" as shown below:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
I have used the following code in Python to get only the first digits marked above:
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    print(lnk)
    m = re.match('-?\d+(?!.*ll=)(?!&q=loc)*', lnk)
    print(m)
    #lat, *long = m.split(',')
    #print(lat)
    #print(long)
The result I got isn't what I was expecting:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
None
I'm getting "None" rather than the value "-6.148222,106.8462". I also tried to split those numbers into two variables called lat and long, but since the match is always "None", Python stops with exit code 1 until I comment those lines out.
Cheers,

You should use re.search() instead of re.match(), because re.match() only matches at the beginning of the string, and your pattern occurs in the middle of the URL.
This should solve the problem:
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    m = re.search(r"(-?\d*\.\d*),(-?\d*\.\d*)", lnk)
    print(m.group())
    print("lat = " + m.group(1))
    print("lng = " + m.group(2))
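To see why the original code prints None, compare the two functions directly on the URL from the question (a minimal sketch):

```python
import re

url = "https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&"
pattern = r"(-?\d+\.\d+),(-?\d+\.\d+)"

# re.match() anchors at position 0, so it finds nothing: the URL starts with "https"
print(re.match(pattern, url))      # None
# re.search() scans the whole string and finds the first coordinate pair
m = re.search(pattern, url)
print(m.group(1), m.group(2))      # -6.148222 106.8462
```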

I'd use a proper URL parser instead. A regex here is asking for problems: if the URL embedded in the page you are crawling changes shape, the regex will silently break.
from urllib.parse import urlparse, parse_qs
url = 'https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&'
scheme, netloc, path, params, query, fragment = urlparse(url)
# or just
# query = urlparse(url).query
parsed_query_string = parse_qs(query)
print(parsed_query_string)
lat, long = parsed_query_string['ll'][0].split(',')
print(lat)
print(long)
outputs
{'ll': ['-6.148222,106.8462'], 'q': ['loc:-6.148222,106.8462']}
-6.148222
106.8462
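If you need the coordinates as numbers, the parsed strings convert cleanly with float(). A small sketch (the helper name extract_latlng is my own):

```python
from urllib.parse import urlparse, parse_qs

def extract_latlng(url):
    """Return (lat, lng) from the ll= query parameter as floats."""
    ll = parse_qs(urlparse(url).query)['ll'][0]
    lat_s, lng_s = ll.split(',')
    return float(lat_s), float(lng_s)

lat, lng = extract_latlng(
    'https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&')
print(lat, lng)    # -6.148222 106.8462
```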

Use different regexes for latitude and longitude:
import re
str1 = "https://maps.google.com/maps?ll=6.148222,-106.8462&q=loc:-6.148222,106.8462&"
lat = re.search(r"-?\d+\.\d+", str1).group()
lon = re.search(r",-?\d+\.\d+", str1).group()
print(lat)
print(lon[1:])  # drop the leading comma
output
6.148222
-106.8462

Related

I need to scrape the Instagram link that is highlighted in the image

I am trying regex in Python and am facing a problem: how do I clip out this portion?
"www.instagram.com%2FMohakMeet"
I need to know which characters to use in the regex.
# python3
for d in g:
    stripped = d.rstrip()
    url = stripped + "/about"
    print("Retrieving " + url)
    response = requests.get(url)
    data = response.text
    link = re.findall('''(www.instagram.com.+?)\s?\"?''', data)
    if link == []:
        print('No Link')
    else:
        x = str(link[0])
        print("Insta Link", x)
        y = x.replace("%2F", '/', 3)
        print(y)
        # with open('l.txt', 'a') as v:
        #     v.write(y)
        #     v.write("\n")
This is my code, but the main problem is that Python is scraping the description of the YouTube page shown in the second picture instead.
Please help.
This is the pattern which is not working:
(www.instagram.com.+?)\s?\"?'''
This link will let you debug your regex: https://regex101.com/
A common pitfall when creating a regex is using a standard string ('my string') instead of a raw string (r'my string').
See also https://docs.python.org/3/library/re.html
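A quick demonstration of that pitfall:

```python
import re

# In a normal string, '\b' is a single backspace character;
# in a raw string it stays as backslash + 'b', a regex word boundary.
print(len('\b'), len(r'\b'))                    # 1 2

# The raw-string pattern behaves as intended:
print(re.search(r'\bMeet\b', 'Mohak Meet'))     # matches 'Meet'
# The non-raw pattern searches for literal backspace characters:
print(re.search('\bMeet\b', 'Mohak Meet'))      # None
```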

Get number sequence after a specific string in URL text

I'm coding a python script to check a bunch of URL's and get their ID text, the URL's follow this sequence:
http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
Up to
http://XXXXXXX.XXX/index.php?id=YYYYYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
What I'm trying to do is get only the numbers after the id= and before the &
I've tried the regex (\D+)(\d+), but I'm also getting the auth numbers.
Any suggestion on how to get only the id sequence?
Another way is to use split:
string = 'http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX'
string.split('id=')[1].split('&auth=')[0]
Output:
YY
These are URL addresses, so I would just use a URL parser here.
Look at urllib.parse:
use urlparse to get the query string, then parse_qs to get a query dict.
import urllib.parse as p
url = "http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"
query = p.urlparse(url).query
params = p.parse_qs(query)
print(params['id'])  # parse_qs returns a list for each key: ['YY']
You can include the start and stop tokens in the regex:
pattern = r'id=(\d+)(?:&|$)'
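With that pattern, the digits are captured whether or not an &auth parameter follows (a quick sketch; example.com is a placeholder):

```python
import re

pattern = r'id=(\d+)(?:&|$)'
# The (?:&|$) stop token accepts either a following '&' or the end of the string
for url in ["http://example.com/index.php?id=42&auth=abc",
            "http://example.com/index.php?id=42"]:
    print(re.search(pattern, url).group(1))    # 42 both times
```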
You can try this regex
import re
urls = [
    "http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX",
    "http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX",
    "http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX",
]
for url in urls:
    id_value = re.search(r"id=(.*)(?=&)", url).group(1)
    print(id_value)
That will get you the id value from each URL:
YY
YYY
YYYY
variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX""".splitlines()
for v in variables:
    p1 = v.split("id=")[1]
    p2 = p1.split("&")[0]
    print(p2)
output:
YY
YYY
YYYY
If you prefer regex
import re
variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""
pattern = "id=(.*)\\&"
x = re.findall(pattern, variables)
print(x)
output:
['YY', 'YYY', 'YYYY']
I don't know whether "only numbers after id= and before &" means there could also be letters mixed in with the numbers, so I thought of this:
import re
variables = """http://XXXXXXX.XXX/index.php?id=5Y44Y&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=Y2242YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=5YY453YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""
pattern = "id=(.*)\\&"
x = re.findall(pattern, variables)
print(x)
x2 = []
for p in x:
    x2.append(re.sub("\\D", "", p))
print(x2)
Output:
['5Y44Y', 'Y2242YY', '5YY453YY']
['544', '2242', '5453']
Use the regex id=[0-9]+:
pattern = "id=[0-9]+"
id = re.findall(pattern, url)[0].split("id=")[1]
If you do it this way, &auth doesn't need to follow the id, which makes it very versatile; a trailing &auth won't break it either. It handles the edge cases as well as the simple ones.
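A quick check of that claim, with and without the trailing parameter (example.com is a placeholder):

```python
import re

pattern = "id=[0-9]+"
# findall returns e.g. ['id=77']; splitting on "id=" leaves just the digits
for url in ["http://example.com/index.php?id=77&auth=abc",
            "http://example.com/index.php?id=77"]:
    print(re.findall(pattern, url)[0].split("id=")[1])    # 77 both times
```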

Beautiful Soup: How to extract data from HTML tags with inconsistent data

I want to extract the data from tags that come in two forms:
<td><div><font> Something else</font></div></td>
and
<td><div><font> Something <br/>else</font></div></td>
I am using the .string property: in the first case it gives me the required string (Something else), but in the second case it gives me None.
Is there any better way or alternative way to do it?
Try using the .text property instead of .string:
from bs4 import BeautifulSoup
html1 = '<td><div><font> Something else</font></div></td>'
html2 = '<td><div><font> Something <br/>else</font></div></td>'
if __name__ == '__main__':
    soup1 = BeautifulSoup(html1, 'html.parser')
    div1 = soup1.select_one('div')
    print(div1.text.strip())
    soup2 = BeautifulSoup(html2, 'html.parser')
    div2 = soup2.select_one('div')
    print(div2.text.strip())
which outputs:
Something else
Something else
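The reason .string fails in the second case: it returns None as soon as a tag has more than one child. A quick demonstration:

```python
from bs4 import BeautifulSoup

font = BeautifulSoup('<font> Something <br/>else</font>', 'html.parser').find('font')
# .string is None because <font> now has three children: text, <br/>, text
print(font.string)    # None
# .text (shorthand for get_text()) concatenates all text descendants
print(font.text)
```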
You can always use regular expressions for things like this!
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
print(result[1])
This will work for your case. To avoid capturing the <br/> tag, you need to manipulate the string.
Check via print("<br/>" in result[1]): if the string contains the tag, this returns True, and you need to drop the tag.
result = str(result[1]).split("<br/>") gives you the list [' Something ', 'else']; join the pieces to get your answer: result = " ".join(result)
Here is the complete snippet:
import re
result = re.search('font>(.*?)</font', str(scrapped_html))
if "<br/>" in result[1]:
    result = str(result[1]).split("<br/>")
    result = " ".join(result)
    print(result)
else:
    print(result[1])
I understand this is a pretty poor solution, but it'll work for you!

Python Reg Pattern URL select/filter

links = [
    'http://www.npr.org/sections/thesalt/2017/03/10/519650091/falling-stars-negative-yelp-reviews-target-trump-restaurants-hotels',
    'https://ondemand.npr.org/anon.npr-mp3/npr/wesat/2017/03/20170311_wesat_south_korea_wrap.mp3?orgId=1&topicId=1125&d=195&p=7&story=519807707&t=progseg&e=519805215&seg=12&siteplayer=true&dl=1',
    'https://www.facebook.com/NPR',
    'https://www.twitter.com/NPR',
]
Objective: get the links containing a /yyyy/mm/dd/ddddddddd/ segment, e.g. /2017/03/10/519650091/.
For some reason I just cannot get it right; the result always includes the facebook, twitter, and 2017/03/20170311-format links.
sel_links = []
def selectedLinks(links):
    r = re.compile("^(/[0-9]{4}/[0-9]{2}/[0-9]{2}/[0-9]{9})$")
    for link in links:
        if r.search(link) != "None":
            sel_links.append(link)
    return set(sel_links)
selectedLinks(links)
You have several problems here:
The pattern ^(/[0-9]{4}/[0-9]{2}/[0-9]{2}/[0-9]{9})$ requires the string to start with /[0-9]{4}/, but all your strings start with http.
The condition r.search(link) != "None" is always true, because re.search() returns None or a match object, and neither ever equals the string "None", so every link is appended.
It seems you're looking for this:
def selectedLinks(links):
    r = re.compile(r"/[0-9]{4}/[0-9]{2}/[0-9]{2}/[0-9]{9}")
    for link in links:
        if r.search(link):
            sel_links.append(link)
    return set(sel_links)
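Applied to the list from the question, only the article link survives (a quick check):

```python
import re

links = [
    'http://www.npr.org/sections/thesalt/2017/03/10/519650091/falling-stars-negative-yelp-reviews-target-trump-restaurants-hotels',
    'https://ondemand.npr.org/anon.npr-mp3/npr/wesat/2017/03/20170311_wesat_south_korea_wrap.mp3?orgId=1&topicId=1125&d=195&p=7&story=519807707&t=progseg&e=519805215&seg=12&siteplayer=true&dl=1',
    'https://www.facebook.com/NPR',
    'https://www.twitter.com/NPR',
]
r = re.compile(r"/[0-9]{4}/[0-9]{2}/[0-9]{2}/[0-9]{9}")
# The mp3 link fails because 20170311 is not a two-digit day followed by '/'
matched = [link for link in links if r.search(link)]
print(matched)
```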

Why doesn't my regex work with BeautifulSoup?

I am parsing an HTML file and would like to match everything between two sequences of characters: Sent: and the <br> tag.
I have seen several very similar questions and tried all of their methods and none have worked for me, probably because I'm a novice and am doing something very simple incorrectly.
Here's my relevant code:
for filename in os.listdir(path):  # capture email year, month, day
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            html = f.read()
            soup = BeautifulSoup(html, 'html.parser')
            a = re.findall(r'Sent:/.+?(?=<br>)/', soup.text)[0]
            #a = re.findall(r'Sent:(.*)', soup.text)[0]
            print(a)
            d = parser.parse(a)
            print("year:", d.year)
            print("month:", d.month)
            print("day:", d.day)
and I've also tried these for my RegEx: a = re.findall(r'Sent:/^(.*?)<br>/', soup.text)[0] and a = re.findall(r'Sent:/^[^<br>]*/', soup.text)[0]
But I keep getting the error list index out of range... and even when I remove the [0], I get the error AttributeError: 'list' object has no attribute 'read' on the line d = parser.parse(a), with only [] printed by print(a).
Here's the relevant block of HTML:
<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br><b>To:</b> David Leveille<br><b>Subject:</b>
The problem is not really your regex, but the fact that BeautifulSoup parses the HTML (its job, after all) and changes its content. For example, your <br> will be transformed into <br/>. Another point: soup.text strips all the tags, so your regex won't match anymore.
It becomes clearer with this script:
from bs4 import *
import re
from dateutil import parser
pattern = re.compile(r'Sent:(.+?)(?=<br/>)')
with open("myfile.html", 'r') as f:
    html = f.read()
print("html: ", html)
soup = BeautifulSoup(html, 'lxml')
print("soup.text: ", soup.text)
print("str(soup): ", str(soup))
a = pattern.findall(str(soup))[0]
print("pattern extraction: ", a)
For the second part: since your date string is not formally correct (because of the initial <br/>), you should add the parameter fuzzy=True, as explained in the dateutil documentation.
d = parser.parse(a, fuzzy=True)
print("year:", d.year)
print("month:", d.month)
print("day:", d.day)
Another solution would be to use a more precise regex. For example :
pattern = re.compile(r'Sent:</b>(.+?)(?=<br/>)')
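For example, a similar pattern applied to the snippet from the question (here against the raw HTML, with an optional slash so it also matches BeautifulSoup's normalized <br/>):

```python
import re

# The snippet from the question, before BeautifulSoup rewrites <br>
html = '<b>Sent:</b> Friday, June 14, 2013 12:07 PM<br><b>To:</b> David Leveille<br>'
# <br/?> accepts both <br> (raw HTML) and <br/> (BeautifulSoup output)
m = re.search(r'Sent:</b>(.+?)(?=<br/?>)', html)
print(m.group(1).strip())    # Friday, June 14, 2013 12:07 PM
```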
Try this; it also works when the <br> tag contains a slash (written as a Python raw string rather than /.../ delimiters):
r"Sent:(.*?)<br/?>"
You don't need the usual slash escapes:
a = re.findall(r"Sent:(.*?)<br>", soup.text)[0]
That being said, you should probably check the output (or at least use try/except) before trying to index into it.
Can you please replace your regex with the one below, which looks for the key terms and then anything between them, and tell me what error (if any) you now get?
a = re.findall(r"Sent:(.*?)<br>", soup.text)[0]
