Python regex pattern to select/filter URLs

links = [
'http://www.npr.org/sections/thesalt/2017/03/10/519650091/falling-stars-negative-yelp-reviews-target-trump-restaurants-hotels',
'https://ondemand.npr.org/anon.npr-mp3/npr/wesat/2017/03/20170311_wesat_south_korea_wrap.mp3?orgId=1&topicId=1125&d=195&p=7&story=519807707&t=progseg&e=519805215&seg=12&siteplayer=true&dl=1',
'https://www.facebook.com/NPR',
'https://www.twitter.com/NPR']
Objective: get the links that contain a /yyyy/mm/dd/ddddddddd/ segment, e.g. /2017/03/10/519650091/.
For some reason I just cannot get it right; the result always includes the facebook, twitter, and /2017/03/20170311-style links.
import re

sel_links = []
def selectedLinks(links):
    r = re.compile("^(/[0-9]{4}/[0-9]{2}/[0-9]{2}/[0-9]{9})$")
    for link in links:
        if r.search(link) != "None":
            sel_links.append(link)
    return set(sel_links)
selectedLinks(links)

You have several problems here:
The pattern ^(/[0-9]{4}/[0-9]{2}/[0-9]{2}/[0-9]{9})$ requires the string to start with /[0-9]{4}/, but all your strings start with http.
The condition r.search(link) != "None" is always true, because re.search returns None or a match object, and neither of those ever equals the string "None" — so every link gets appended.
It seems you're looking for this:
def selectedLinks(links):
    sel_links = []  # make the list local so repeated calls don't accumulate results
    r = re.compile(r"/[0-9]{4}/[0-9]{2}/[0-9]{2}/[0-9]{9}")
    for link in links:
        if r.search(link):
            sel_links.append(link)
    return set(sel_links)
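For the sample list above, only the first NPR link contains a matching /yyyy/mm/dd/ddddddddd/ segment (the mp3 link's /2017/03/20170311 part lacks the third slash-separated group), so the call should return just that URL:
print(selectedLinks(links))
# {'http://www.npr.org/sections/thesalt/2017/03/10/519650091/falling-stars-negative-yelp-reviews-target-trump-restaurants-hotels'}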


Trying to isolate URL suffixes from a list of href tags

I'm currently working on a simple web crawling program that will crawl the SCP wiki to find links to other articles in each article. So far I have been able to get a list of href tags that go to other articles, but can't navigate to them since the URL I need is embedded in the tag:
[<a href="/scp-1512">SCP-1512</a>,
<a href="/scp-2756">SCP-2756</a>,
<a href="/scp-002">SCP-002</a>,
<a href="/scp-004">SCP-004</a>]
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
The code used to get the list looks like this:
import requests
import lxml
from bs4 import BeautifulSoup
import re

def searchSCP(x):
    url = str(SCoutP(x))
    c = requests.get(url)
    crawl = BeautifulSoup(c.content, 'lxml')
    # Searches HTML for text containing "SCP-" and href tags containing "scp-"
    ref = crawl.find_all(text=re.compile("SCP-"), href=re.compile("scp-"))
    param = "SCP-" + str(SkateP(x))  # SkateP takes an int and inserts an appropriate number of 0's.
    for i in ref:  # Below loop is for sorting out references to the article being searched
        if str(param) in i:
            ref.remove(i)
    if ref != []:
        print(ref)
The main idea I've tried is finding every item that contains text in quotation marks, but obviously that just returned the same list. What I want is to select a specific item in the list and take out ONLY the "scp-xxxx" part or, alternatively, to change the initial code so it extracts only the quoted href content into the list.
Is there any way I would be able to isolate the "/scp-xxxx" from each item in the list so I can append it to the parent URL?
If I understand correctly, you want to extract the href attribute - for that, you can use i.get('href') (or probably even just i['href']).
With .select and list comprehension, you won't even need regex to filter the results:
[a.get('href') for a in crawl.select('*[href*="scp-"]') if 'SCP-' in a.get_text()]
would return
['/scp-1512', '/scp-2756', '/scp-002', '/scp-004']
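As a minimal, self-contained sketch (the sample markup here is hypothetical, shaped like the anchors in the question), the selector and filter work like this:
from bs4 import BeautifulSoup

html = '<a href="/scp-1512">SCP-1512</a><a href="/scp-002">SCP-002</a>'  # hypothetical sample
crawl = BeautifulSoup(html, 'html.parser')
print([a.get('href') for a in crawl.select('*[href*="scp-"]') if 'SCP-' in a.get_text()])
# ['/scp-1512', '/scp-002']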
If you want the parent url attached:
root_url = 'https://PARENT-URL.com' ## replace with the actual parent url
scpLinks = [root_url + l for l, t in list(set([
(a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t]
scpLinks should return
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-002', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']
If you want to filter out param, add str(param) not in t to the filter:
scpLinks = [root_url + l for l, t in list(set([
(a.get('href'), a.get_text()) for a in crawl.select('*[href*="scp-"]')
])) if 'SCP-' in t and str(param) not in t]
if str(param) was 'SCP-002', then scpLinks would be
['https://PARENT-URL.com/scp-004', 'https://PARENT-URL.com/scp-1512', 'https://PARENT-URL.com/scp-2756']

I need to scrape the instagram link that is highlighted in the image

I am trying regex in Python. I am facing a problem: how do I clip out the portion that is
"www.instagram.com%2FMohakMeet"?
I need to know which characters to use in the regex.
# python3
import re
import requests

for d in g:  # g: list of channel URLs, defined earlier in the script
    stripped = d.rstrip()
    url = stripped + "/about"
    print("Retrieving " + url)
    response = requests.get(url)
    data = response.text
    link = re.findall('''(www.instagram.com.+?)\s?\"?''', data)
    if link == []:
        print('No Link')
    else:
        x = str(link[0])
        print("Insta Link", x)
        y = x.replace("%2F", '/', 3)
        print(y)
        # with open('l.txt', 'a') as v:
        #     v.write(y)
        #     v.write("\n")
This is my code, but the main problem is that while scraping, Python is scraping the description of the YouTube page shown in the 2nd picture.
Please help.
This is the pattern which is not working:
(www.instagram.com.+?)\s?\"?
This link will let you debug your regex: https://regex101.com/
A common pitfall when creating regexes is using a standard string ('my string') instead of a raw string (r'my string').
See also https://docs.python.org/3/library/re.html
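As a sketch, assuming the profile URL appears percent-encoded in the page source (as in the question), a raw-string pattern that stops at the first character that can't be part of the encoded path might look like this; the data string below is a hypothetical stand-in for response.text:
import re

data = 'redirect?q=https%3A%2F%2Fwww.instagram.com%2FMohakMeet&v=x'  # hypothetical sample
print(re.findall(r'www\.instagram\.com%2F[\w.%-]+', data))
# ['www.instagram.com%2FMohakMeet']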

Scraping lists of items from Wikipedia

I would need to get all the entries from this page:
https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana
from the symbol " through the letter Z. That is:
"
"900", Cahiers d'Italie et d'Europe
A
Abitare
Aerei
Aeronautica & Difesa
Airone (periodico)
Alp (periodico)
Alto Adige (quotidiano)
Altreconomia
....
In order to do this, I have tried using the following code:
import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana")
soup = bs(res.text, "html.parser")
url_list = []
links = soup.find_all('a')
for link in links:
    url = link.get("href", "")
    url_list.append(url)

lists_A = []
for url in url_list:
    lists_A.append(url)  # was lists_A(url), which raises TypeError
print(lists_A)
However, this code collects more information than I need. In particular, the last item I should collect is La Zanzara. Ideally, the items should not have any word in brackets, i.e. they should not contain (rivista), (periodico), (settimanale), and so on, but just the title (e.g. Jack (periodico) should be just Jack).
Could you give me any advice on how to get this information? Thanks
This will help you filter out some of the unwanted URLs (not all, though). Basically everything before "Corriere della Sera", which I'm assuming should be the first expected URL.
links = [a.get('href') for a in soup.find_all('a', {'title': True, 'href': re.compile('/wiki/(.*)'), 'accesskey': False})]
You can safely assume that all the magazine URLs are ordered at this point and since you know that "La Zanzara" should be the last expected URL you can get the position of that particular string in your new list and slice up to that index + 1
links.index('/wiki/La_zanzara_(periodico)')
Out[20]: 144
links = links[:145]
As for removing (periodico) and other data cleaning, you need to inspect your data and figure out what it is you want to remove.
You could write a simple function like this, maybe:
def clean(string):
    to_remove = ['_(periodico)', '_(quotidiano)']
    for s in to_remove:
        if s in string:
            string = string.replace(s, '')
    return string
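For example, per the requirement above that Jack (periodico) become just Jack:
print(clean('/wiki/Jack_(periodico)'))
# /wiki/Jack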

How to match and extract using regex - Python

I'm taking a look at how to use regex and trying to figure out how to extract the latitude and longitude, whether the number is positive or negative, right after the "?ll=" as shown below:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
I have used the following code in python to get only the first digits marked above:
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    print(lnk)
    m = re.match('-?\d+(?!.*ll=)(?!&q=loc)*', lnk)
    print(m)
    #lat, *long = m.split(',')
    #print(lat)
    #print(long)
The result I got isn't what I was expecting:
https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&
None
I'm getting "None" rather than the value "-6.148222,106.8462". I also tried to split those numbers into two variables called lat and long, but since I always get "None", Python stops with exit code 1 until I comment those lines out.
Cheers,
You should use re.search() instead of re.match(), because re.match() only ever matches at the beginning of the string.
This should solve the problem:
for link in soup.find_all('a', {'class': 'popup-gmaps'}):
    lnk = str(link.get('href'))
    m = re.search(r"(-?\d*\.\d*),(-?\d*\.\d*)", lnk)
    print(m.group())
    print("lat = " + m.group(1))
    print("lng = " + m.group(2))
I'd use a proper URL parser; using regex here is asking for problems, in case the URL embedded in the page you are crawling changes in a way that breaks your regex.
from urllib.parse import urlparse, parse_qs
url = 'https://maps.google.com/maps?ll=-6.148222,106.8462&q=loc:-6.148222,106.8462&'
scheme, netloc, path, params, query, fragment = urlparse(url)
# or just
# query = urlparse(url).query
parsed_query_string = parse_qs(query)
print(parsed_query_string)
lat, long = parsed_query_string['ll'][0].split(',')
print(lat)
print(long)
outputs
{'ll': ['-6.148222,106.8462'], 'q': ['loc:-6.148222,106.8462']}
-6.148222
106.8462
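Note that parse_qs returns a list for each key, because a query parameter can legally appear more than once; that is why the 'll' value is indexed with [0] before splitting.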
Use a different regex for latitude and longitude:
import re

str1 = "https://maps.google.com/maps?ll=6.148222,-106.8462&q=loc:-6.148222,106.8462&"
lat = re.search(r"-?\d+\.\d+", str1).group()
lon = re.search(r",-?\d+\.\d+", str1).group()
print(lat)
print(lon[1:])
output
6.148222
-106.8462

Web Scraping a wikipedia page

In some Wikipedia pages, after the title of the article (appearing in bold), there is some text inside parentheses used to explain the pronunciation and phonetics of the words in the title. For example, on this page, after the bold title diglossia in the first <p>, there is an open parenthesis. To find the corresponding close parenthesis, you have to iterate through the text nodes one by one, which is simple. What I'm trying to do is find the very next href link after that close parenthesis and store it.
The issue here is that (AFAIK) there isn't a way to uniquely identify the text node with the close parenthesis and then get the following href. Is there any straightforward (not convoluted) way to get the first link outside of the initial parentheses?
EDIT
In the case of the page linked here, the href to be stored should be https://en.wikipedia.org/wiki/Dialects, since that is the first link outside of the parentheses.
Is this what you want?
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
print parsed_html.body.findAll('p')[0].findAll('a')[0]
This gives:
<a href="/wiki/Linguistics" title="Linguistics">linguistics</a>
If you want to extract the href, you can use this:
parsed_html.body.findAll('p')[0].findAll('a')[0].attrs[0][1]
UPDATE
It seems you want the href after the parentheses, not the one before.
I have written a script for it. Try this:
import requests
from BeautifulSoup import BeautifulSoup
rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
temp = parsed_html.body.findAll('p')[0]
start_count = 0
started = False
found = False
while temp.next and found is False:
    temp = temp.next
    if '(' in temp:
        start_count += 1
        if started is False:
            started = True
    if ')' in temp and started and start_count > 1:
        start_count -= 1
    elif ')' in temp and started and start_count == 1:
        found = True
print temp.findNext('a').attrs[0][1]
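Note: this answer uses the old BeautifulSoup 3 API and Python 2 print statements; with bs4 (BeautifulSoup 4), attrs is a dict, so the last line would be print(temp.findNext('a')['href']).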
