Extract single href from a web page - python

I am working on some code in which I have to extract a single href link. The problem I am facing is that it extracts two links that are identical except for the ID at the end. I already have one ID; I just want to extract the other one from the link. This is my code:
import requests, re
from bs4 import BeautifulSoup

url = "http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
r = requests.get(url)
soup = BeautifulSoup(r.content)
g_1 = soup.find_all("div", {"class": "color-scroll"})
for item in g_1:
    a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for elem in a_1:
        print elem['href']
The output I am getting is:
/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758921
/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910
I have the first ID, i.e. 500758921; I want to extract the other one.
Please Help. Thanks in advance!

If you need every link except the first one, just slice the result of find_all():
links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
for link in links[1:]:
    print link['href']
The reason slicing works is that find_all() returns a ResultSet instance, which is internally based on a regular Python list:
class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source, result=()):
        super(ResultSet, self).__init__(result)
        self.source = source
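Because ResultSet subclasses list, slicing it simply returns a plain Python list; a quick sketch (assuming the soup object from the question):
links = soup.find_all('a')
print type(links)      # <class 'bs4.element.ResultSet'>
print type(links[1:])  # <type 'list'> - slicing returns a plain list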
To extract the pid from the links you've got, you can use a regular expression search saving the pid value in a capturing group:
import re

pattern = re.compile(r"pid=(\w+)")
for item in g_1:
    links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for link in links[1:]:
        match = pattern.search(link["href"])
        if match:
            print match.group(1)

Run this regex against every link:
^/on/demandware\.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)
Get the result from the capturing group.
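As a sketch, applying that pattern to the two hrefs from the question and skipping the pid the asker already has (assumed to be 500758921):
import re

pattern = re.compile(r'^/on/demandware\.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)')
known_pid = '500758921'  # the ID the asker already has
hrefs = [
    '/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758921',
    '/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910',
]
for href in hrefs:
    match = pattern.match(href)
    if match and match.group(1) != known_pid:
        print match.group(1)  # -> 500758910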

This might do:
import requests, re
from bs4 import BeautifulSoup

def getPID(url):
    # note: rstrip('.html') strips any of the characters '.', 'h', 't', 'm', 'l'
    # from the right end, which happens to work for this URL
    return re.findall(r'(\d+)', url.rstrip('.html'))

url = "http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
having_pid = getPID(url)
print(having_pid)

r = requests.get(url)
soup = BeautifulSoup(r.content)
g_1 = soup.find_all("div", {"class": "color-scroll"})
for item in g_1:
    a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for elem in a_1:
        if getPID(elem['href'])[0] not in having_pid:
            print elem['href']

Related

Python Regex Webcrawling, Get Double results, need just one

I am working on a basic Python webcrawling program to go into a website, read the email addresses, and show them as output. I am getting the right answer, but it is getting duplicated. Can you please help me fix it?
Here is the program:
from re import findall
import urllib.request

url = "https://www.uta.edu/academics/schools-colleges/business/admissions-and-advising/cob-advising"
print("Email addresses for advisors:")
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = html.decode()
pdata = findall(r"[A-Za-z0-9._%+-]+"
                r"#[A-Za-z0-9.-]+"
                r"\.[A-Za-z]{2,4}", htmlStr)
for item in pdata:
    print(item)

for item in list(dict.fromkeys(pdata)):
    print(item)
dict.fromkeys(pdata) turns the list's items into dictionary keys (each value is None). While the keys are inserted, repeated values map to the same key and are ignored. Finally, list(dict.fromkeys(pdata)) turns the de-duplicated keys back into a list.
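A quick illustration of that behaviour on a toy list (the addresses are made up, following the page's # obfuscation):
emails = ['a#uta.edu', 'b#uta.edu', 'a#uta.edu']
print(dict.fromkeys(emails))        # {'a#uta.edu': None, 'b#uta.edu': None}
print(list(dict.fromkeys(emails)))  # ['a#uta.edu', 'b#uta.edu'] - order preserved (Python 3.7+)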
You get each e-mail address twice because the website contains each e-mail address two times. You can convert your list to a set to get only the unique items (note that a set does not preserve the original order). You can then convert it back to a list, if you need the results in a list:
pdata = list(set(pdata))
There are two copies of every email in the HTML file (one in the text and another in the href attribute). Here is an example of this case:
<a href="mailto:micah.washington#uta.edu" class="uta-btn uta-btn-ghost">
    <span>micah.washington#uta.edu</span>
</a>
The standard way would be to use a parser to get only the text of the HTML, not the attributes/tags. But here, the easiest way would be to print every other element:
for item in pdata[::2]:
    print(item)
And here is a more standard way of doing it, using the BeautifulSoup HTML parser, where div.text extracts the text of the HTML and drops tags and attributes:
from re import findall
import urllib.request
from bs4 import BeautifulSoup as bs

url = "https://www.uta.edu/academics/schools-colleges/business/admissions-and-advising/cob-advising"
print("Email addresses for advisors:")
response = urllib.request.urlopen(url)
div = bs(response, 'html5lib')
pdata = findall(r"[A-Za-z0-9._%+-]+"
                r"#[A-Za-z0-9.-]+"
                r"\.[A-Za-z]{2,4}", div.text)
for item in pdata:
    print(item)

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query, giving the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to navigate to the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809. But I am currently stuck a few steps before this because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
import urllib2
import re

url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
y = str(soup)
x = re.findall("gid=[0-9]+", y)
print x
z = re.sub("gid=", "", x(1))  # At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to extract elements from it. By the way, you don't need BeautifulSoup here; use urllib2.urlopen(url).read() to get the page content as a string. The re.sub is also not needed: one regex pattern, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
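With the gid in hand, building the game-page URL the asker mentions is plain string concatenation (a sketch; the PGN file name itself cannot be derived from the gid alone):
gid = result[0]
game_url = 'http://www.chessgames.com/perl/chessgame?gid=' + gid
print(game_url)  # http://www.chessgames.com/perl/chessgame?gid=1012809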
You don't need a regex here at all. A CSS selector along with some string manipulation will lead you in the right direction. Try the below script:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809

How to solve, finding two of each link (Beautifulsoup, python)

I'm using beautifulsoup4 to parse a webpage and collect all the href values using this code:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

allProductInfo = soup.find_all("a", class_="name-link")
print allProductInfo

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
linksList1 prints two of each link. I believe this happens because it takes the link from the title as well as from the item colour. I have tried a few things but cannot get BS to parse only the title link, so that I have a list with one of each link instead of two. I imagine it's something really simple, but I'm missing it. Thanks in advance.
This code will give you the result without duplicates (also, using set() may be a good idea, as #Tarum Gupta suggested), but I changed the way you crawl:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

# Get all divs with class inner-article, then search for an <a> with
# class name-link that is inside an h1 tag
allProductInfo = soup.select("div.inner-article h1 a.name-link")
# print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)

alldiv = soup.findAll("div", {"class": "inner-article"})
for div in alldiv:
    linksList1.append(div.h1.a['href'])

set(linksList1)        # use set() to remove duplicate links
list(set(linksList1))  # use list() to convert the set back to a list if needed
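Note that set() does not keep the original page order. If order matters, here is a small order-preserving dedup sketch over the linksList1 built above:
seen = set()
uniqueLinks = []
for href in linksList1:
    if href not in seen:  # keep only the first occurrence
        seen.add(href)
        uniqueLinks.append(href)
print(uniqueLinks)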

Python BS4 crawler indexerror

I am trying to create a simple crawler that pulls metadata from websites and saves the information into a CSV. I have followed some guides, but am now stuck with the error:
IndexError: list index out of range
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2, 16)

# Print out the results to screen
for i in listIterator:
    print findPatTitle[i]  # The title
    print findPatLink[i]   # The link to the original article

    articlePage = urlopen(findPatLink[i]).read()  # Grab all of the content from the original article

    divBegin = articlePage.find('<div>')  # Locate the div provided
    article = articlePage[divBegin:(divBegin + 1000)]  # Copy the first 1000 characters after the div

    # Pass the article to the Beautiful Soup Module
    soup = BeautifulSoup(article)

    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')

    # Print all of the paragraphs to screen
    for i in paragList:
        print i
    print '\n'

# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)
print soup2.findAll('title')
print soup2.findAll('link')

titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')

for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'
Any help would be greatly appreciated.
The error I get is:
File "C:\Users......", line 24, in <module>
    print findPatTitle[i]  # The title
IndexError: list index out of range
Thank you.
It seems that you are not using all the power that bs4 can give you.
You are getting this error because the length of findPatTitle is just one, since an HTML document usually has only one title element.
A simple way to grab the title of an HTML document is to use bs4 itself:
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)
# get the content of title
title = soup.title.text
You will probably get the same error if you try to iterate over your findPatLink in the current way, since it has length 6. For me, it is not clear whether you want to get all the link elements or all the anchor elements, but sticking with the first idea, you can improve your code using bs4 again:
link_href_list = [link['href'] for link in soup.find_all("link")]
And finally, since you don't want some of the urls, you can slice link_href_list as you like. An improved version of the last expression, which excludes the first and second results, could be:
link_href_list = [link['href'] for link in soup.find_all("link")[2:]]

Getting AttributeError:'NoneType' object has no attribute getText

I have half-written code to pull the titles and links from an RSS feed, but it results in the above error. The error occurs in both functions when getting the text. I want to strip the title and link tags from the entered string.
from bs4 import BeautifulSoup
import urllib.request
import re

def getlink(a):
    a = str(a)
    bsoup = BeautifulSoup(a)
    a = bsoup.find('link').getText()
    return a

def gettitle(b):
    b = str(b)
    bsoup = BeautifulSoup(b)
    b = bsoup.find('title').getText()
    return b

webpage = urllib.request.urlopen("http://feeds.feedburner.com/JohnnyWebber?format=xml").read()
soup = BeautifulSoup(webpage)
titlesoup = soup.findAll('title')
linksoup = soup.findAll('link')

for i, j in zip(titlesoup, linksoup):
    i = getlink(i)
    j = gettitle(j)
    print(i)
    print(j)
    print("\n")
EDIT: falsetru's method worked perfectly.
I have one more question. Can text be extracted out of any tag by just calling getText()?
I expect the problem is in:
def getlink(a):
    ...
    a = bsoup.find('link').getText()
    ...
Remember that find matches tag names, and there is no link tag here, but an a tag. BeautifulSoup returns None from find if there is no matching tag, and calling getText() on None raises the NoneType error. Check the docs for details.
Edit:
If you really are looking for the text 'link' you can use bsoup.find(text=re.compile('link'))
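A minimal guard against the None case, reusing the names from the question (a sketch, not a full fix):
def getlink(a):
    bsoup = BeautifulSoup(str(a))
    tag = bsoup.find('link')
    if tag is None:  # no matching tag in this fragment
        return None
    return tag.getText()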
i and j are the title and link already. Why do you find them again?
for i, j in zip(titlesoup, linksoup):
    print(i.getText())
    print(j.getText())
    print("\n")
Besides that, pass features='xml' to BeautifulSoup if you parse an XML file:
soup = BeautifulSoup(webpage, features='xml')
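Putting both suggestions together, a minimal sketch (assuming the feed follows the usual RSS item/title/link layout and that an XML parser such as lxml is installed):
from bs4 import BeautifulSoup
import urllib.request

webpage = urllib.request.urlopen("http://feeds.feedburner.com/JohnnyWebber?format=xml").read()
soup = BeautifulSoup(webpage, features='xml')  # the XML parser keeps <link> elements intact
for item in soup.find_all('item'):
    print(item.title.getText())
    print(item.link.getText())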
b = bsoup.find('title') returns None; try checking your input.
