BeautifulSoup to extract URLs (same URL repeating) - python

I've tried using BeautifulSoup and regex to extract URLs from a web page. This is my code:
Ref_pattern = re.compile('<TD width="200"><A href="(.*?)" target=')
Ref_data = Ref_pattern.search(web_page)
if Ref_data:
    Ref_data.group(1)
data = [item for item in csv.reader(output_file)]
new_column1 = ["Reference", Ref_data.group(1)]
new_data = []
for i, item in enumerate(data):
    try:
        item.append(new_column1[i])
    except IndexError, e:
        item.append(Ref_data.group(1)).next()
    new_data.append(item)
Though the page has many URLs in it, my output just repeats the first URL. I know there's something wrong with
    except IndexError, e:
        item.append(Ref_data.group(1)).next()
this part, because if I remove it, it just gives me the first URL (without repetition). Could you please help me extract all the URLs and write them into a CSV file?
Thank you.

Although it's not entirely clear what you're looking for, based on what you've stated, if there are specific elements (classes or IDs or text, for instance) associated with the links you're attempting to extract, then you can do something like the following:
from bs4 import BeautifulSoup
string = """\
<td width="200"><a class="pooper" href="http://www.example.com/one" target="_blank">Linked Text</a></td>
<td width="200"><a class="pooper" href="http://www.example.com/two" target="_blank">Linked Text</a></td>
<td width="200"><a class="pooper" href="http://www.example.com/pic.jpg" target="_blank">Image</a></td>
<td width="200"><a class="other" href="http://www.example.com/contact" target="_blank">Phone Number</a></td>"""
soup = BeautifulSoup(string)
for link in soup.findAll('a', { "class" : "pooper" }, href=True, text='Linked Text'):
    print link['href']
As you can see, I am using bs4's attribute filtering to select only those anchor tags that include the "pooper" class (class="pooper"), and then I am further narrowing the return values by passing a text argument (Linked Text rather than Image).
Based on your feedback below, try the following code. Let me know.
for items in soup.select('td[width="200"]'):
    for link in items.findAll('a', { "target" : "_blank" }, href=True):
        print link['href']
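The original question also asks for the URLs to be written into a CSV file, which the answer above doesn't cover. A minimal sketch of that last step (Python 3; it assumes web_page holds the page's HTML as in the question, keeps the same td[width="200"] / target="_blank" structure, and uses a hypothetical output file name):
import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(web_page, 'html.parser')
urls = [link['href']
        for cell in soup.select('td[width="200"]')
        for link in cell.find_all('a', target='_blank', href=True)]

with open('references.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Reference'])   # header row
    for url in urls:
        writer.writerow([url])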

Related

How to avoid attribute error while extracting links from articles?

I am trying my hand at web scraping using BeautifulSoup.
I had posted this before here, but I was not very clear as to what I wanted, so it only partially answers my issue.
How do I extract only the content from this webpage
I want to extract the content from the webpage and then extract all the links from the output. Can someone please help me understand where I am going wrong?
This is what I have after updating my previous code with the answer provided in the link above.
import urllib3
from bs4 import BeautifulSoup

# Define the content to retrieve (webpage's URL)
quote_page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'

# Retrieve the page
http = urllib3.PoolManager()
r = http.request('GET', quote_page)
if r.status == 200:
    page = r.data
    print(f'Type of Variable "page": {page.__class__.__name__}')
    print(f'Page Retrieved. Request Status: {r.status}, Page Size: {len(page)}')
else:
    print(f'Some problem occurred. Request status: {r.status}')

# Convert the stream of bytes into a BeautifulSoup representation
soup = BeautifulSoup(page, 'html.parser')
print(f'Type of variable "soup": {soup.__class__.__name__}')

# Check the content
print(f'{soup.prettify()[:1000]}')

# Check the HTML's Title
print(f'Title tag: {soup.title}')
print(f'Title text: {soup.title.string}')

# Find the main content
article_tag = 'p'
articles = soup.find_all(article_tag)
print(f'Type of the variable "articles": {articles.__class__.__name__}')
for p in articles:
    print(p.text)
I then used the code below to get all the links, but I get an error:
# Find the links in the text
# identify the type of tag to retrieve
tag = 'a'
# create a list with the links from the `<a>` tag
tag_list = [t.get('href') for t in articles.find_all(tag)]
tag_list
That is because articles is a ResultSet returned by soup.find_all(article_tag), which you can check with type(articles).
To achieve your goal you have to iterate over articles first, so simply add an additional for-loop to your list comprehension:
[t.get('href') for article in articles for t in article.find_all(tag)]
In addition, you may want to use a set to avoid duplicates and also to join relative paths with the base URL:
list(set(
    t.get('href') if t.get('href').startswith('http')
    else 'https://bigbangtheory.fandom.com' + t.get('href')
    for article in articles
    for t in article.find_all(tag)
))
Output:
['https://bigbangtheory.fandom.com/wiki/The_Killer_Robot_Instability',
'https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali',
'https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz',
'https://bigbangtheory.fandom.com/wiki/The_Valentino_Submergence',
'https://bigbangtheory.fandom.com/wiki/The_Beta_Test_Initiation',
'https://bigbangtheory.fandom.com/wiki/Season_2',
'https://bigbangtheory.fandom.com/wiki/Dr._Pemberton',...]
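If you would rather not hard-code the base-URL prefix yourself, urllib.parse.urljoin handles both absolute and relative hrefs; a small sketch under the same articles and tag names as above (not part of the original answer):
from urllib.parse import urljoin

base = 'https://bigbangtheory.fandom.com'
links = list({
    urljoin(base, t.get('href'))          # absolute URLs pass through unchanged
    for article in articles
    for t in article.find_all(tag)
    if t.get('href') is not None          # skip anchors without an href
})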

Python Get Links Script - Needs Wildcard search

I have the code below that, when you give it a URL with a bunch of links, will return the list of links to you. This works well, except that I only want links that start with ... and this returns EVERY link, including ones like home/back/etc. Is there a way to use a wildcard or a "starts with" function?
from bs4 import BeautifulSoup
import requests
url = ""
# Getting the webpage, creating a Response object.
response = requests.get(url)
# Extracting the source code of the page.
data = response.text
# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')
# Extracting all the <a> tags into a list.
tags = soup.find_all('a')
# Extracting URLs from the attribute href in the <a> tags.
for tags in tags:
    print(tags.get('href'))
Also, is there a way to export to Excel? I am not great with Python and I am not sure how I got this far, to be honest.
Thanks,
Here is an updated version of your code that will get all https hrefs from the page:
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com"
# Getting the webpage, creating a Response object.
response = requests.get(url)
# Extracting the source code of the page.
data = response.text
# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data)
# Extracting all the <a> tags into a list.
tags = soup.find_all('a')
# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    if str.startswith(tag.get('href'), 'https'):
        print(tag.get('href'))
If you want to get hrefs that start with something other than https, change the 2nd to last line :)
References:
https://www.tutorialspoint.com/python/string_startswith.htm
You could use startswith():
for tag in tags:
    if tag.get('href').startswith('pre'):
        print(tag.get('href'))
For your second question (is there a way to export to Excel?): I've been using the Python module XlsxWriter.
import xlsxwriter
# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('Expenses01.xlsx')
worksheet = workbook.add_worksheet()
# Some data we want to write to the worksheet.
expenses = (
['Rent', 1000],
['Gas', 100],
['Food', 300],
['Gym', 50],
)
# Start from the first cell. Rows and columns are zero indexed.
row = 0
col = 0
# Iterate over the data and write it out row by row.
for item, cost in expenses:
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, cost)
    row += 1
# Write a total using a formula.
worksheet.write(row, 0, 'Total')
worksheet.write(row, 1, '=SUM(B1:B4)')
workbook.close()
XlsxWriter allows the code to follow basic Excel conventions; being new to Python, I found it easy to get this up, running and working on the first attempt.
If tag.get returns a string, you should be able to filter on whatever starting string you want, like so:
URLs = [URL for URL in [tag.get('href') for tag in tags]
        if URL.startswith('/some/path/')]
Edit:
It turns out that in your case, tag.get doesn't always return a string. For tags that don't contain links, the return type is NoneType, and we can't use string methods on NoneType. It's easy to check whether the return value of tag.get is None before using the string method startswith on it.
URLs = [URL for URL in [tag.get('href') for tag in tags]
        if URL is not None and URL.startswith('/some/path/')]
Notice the addition of URL is not None and. That has to come before URL.startswith, otherwise Python will try to use a string method on None and complain. You can read this just like an English sentence, which highlights one of the great things about Python: the code is easier to read than just about any other programming language, which makes it really good for communicating ideas to other people.
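To tie the two parts of the question together, here is a short sketch (reusing the XlsxWriter module shown above; the workbook name and the 'https' prefix are just placeholders) that writes the filtered URLs into a spreadsheet:
import xlsxwriter

# Filter the hrefs as above, skipping tags that have none.
URLs = [URL for URL in [tag.get('href') for tag in tags]
        if URL is not None and URL.startswith('https')]

workbook = xlsxwriter.Workbook('links.xlsx')
worksheet = workbook.add_worksheet()

# Write one URL per row in column A.
for row, URL in enumerate(URLs):
    worksheet.write(row, 0, URL)

workbook.close()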

Beautiful Soup Nested Tag Search

I am trying to write a python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page but I have difficulties accessing nested HTML tags (for example: <p class="hello"> inside <div>).
Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, although they are there. Is there any simple method or another way to do it?
I'm guessing that what you are trying to do is first look in a specific div tag, then search for all p tags inside it and count them, or do whatever else you want. For example:
import bs4

# 'content' holds the page's HTML
soup = bs4.BeautifulSoup(content, 'html.parser')
# This will get the div
div_container = soup.find('div', class_='some_class')
# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
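If you prefer CSS selectors, an equivalent search (a small sketch, not from the original answer) would be:
# Select <p class="hello"> elements anywhere inside the div with class "some_class".
for ptag in soup.select('div.some_class p.hello'):
    print(ptag.text)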
Hope that helps
Try this one:
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
UPDATE: I noticed that .text does not always return the expected result. At the same time, I realized there is a built-in way to get the text; sure enough, reading the docs,
we find that there is a method called get_text(). Use it as:
from bs4 import BeautifulSoup
fd = open('index.html', 'r')
website= fd.read()
fd.close()
soup = BeautifulSoup(website)
contents= soup.get_text(separator=" ")
print "number of words %d" %len(contents.split(" "))
INCORRECT, please read above. Supposing that you have your HTML file locally in index.html, you can:
from bs4 import BeautifulSoup
import re

BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored

fd = open('index.html', 'r')
website = fd.read()
soup = BeautifulSoup(website)
tags = soup.find_all(True)  # find everything
print "there are %d" % len(tags)

count = 0
matcher = re.compile(r"(\s|\n|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # Split using tokens such as \s and \n
    temp = filter(None, temp)       # remove empty elements in the list
    count += len(temp)
print "number of words in the document %d" % count
fd.close()
Please note that it may not be accurate, perhaps because of errors in formatting, false positives (it detects any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.
You can find all <p> tags using regular expressions (re module).
Note that r.content is a string which contains the whole html of the site.
For example:
r = requests.get(url, headers=headers)
p_tags = re.findall(r'<p>.*?</p>', r.content)
This should get you all the <p> tags irrespective of whether they are nested or not. And if you want the <a> tags specifically inside those <p> tags, you can pass the whole <p> tag as a string as the second argument instead of r.content.
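For instance, a minimal sketch of that second step (keep in mind that regexes on HTML are fragile, so treat this as illustrative only):
import re

# 'p_tags' is the list of '<p>...</p>' strings found above.
a_pattern = re.compile(r'<a.*?</a>', re.DOTALL)
for p in p_tags:
    for a_tag in a_pattern.findall(p):
        print(a_tag)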
Alternatively, if you just want the text, you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the HTML from the site; you can then proceed with the parsing.

How to solve, finding two of each link (Beautifulsoup, python)

I'm using beautifulsoup4 to parse a webpage and collect all the href values using this code:
#Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")
allProductInfo = soup.find_all("a", class_="name-link")
print allProductInfo
linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
linksList1 prints two of each link. I believe this is happening as it's taking the link from the title as well as the item colour. I have tried a few things but cannot get BS to parse only the title link and give a list with one of each link instead of two. I imagine it's something really simple but I'm missing it. Thanks in advance.
This code will give you the result without duplicate links
(also, using set() may be a good idea, as #Tarum Gupta suggested),
but I changed the way you crawl:
import requests
from bs4 import BeautifulSoup
#Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")
# Gets all divs with class of inner-article, then searches for <a> tags with
# the name-link class that are inside an h1 tag
allProductInfo = soup.select("div.inner-article h1 a.name-link")
# print (allProductInfo)
linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
alldiv = soup.findAll("div", {"class": "inner-article"})
for div in alldiv:
    linksList1.append(div.h1.a['href'])

set(linksList1)        # use set() to remove duplicate links
list(set(linksList1))  # use list() to convert the set back to a list if you need one
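Note that set() does not preserve the order in which the links were found; if order matters, an order-preserving way to deduplicate (a small sketch, relying on dicts keeping insertion order in Python 3.7+) is:
# Keep the first occurrence of each link, in the order it was found.
unique_links = list(dict.fromkeys(linksList1))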

Extract single href from a web page

I am working on some code in which I have to extract a single href link. The problem I am facing is that it extracts two links which have everything the same except the last ID part; I have one ID, and I just want to extract the other one from the link. This is my code:
import requests,re
from bs4 import BeautifulSoup
url="http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
r=requests.get(url)
soup=BeautifulSoup(r.content)
g_1=soup.find_all("div",{"class":"color-scroll"})
for item in g_1:
    a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for elem in a_1:
        print elem['href']
The output which I am getting is:
/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758921
/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910
I have the first ID, i.e. 500758921; I want to extract the other one.
Please help. Thanks in advance!
If you need every link except the first one, just slice the result of find_all():
links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
for link in links[1:]:
    print link['href']
The reason that slicing works is that find_all() returns a ResultSet instance, which is internally based on a regular Python list:
class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source, result=()):
        super(ResultSet, self).__init__(result)
        self.source = source
To extract the pid from the links you've got, you can use a regular expression search saving the pid value in a capturing group:
import re
pattern = re.compile(r"pid=(\w+)")
for item in g_1:
    links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for link in links[1:]:
        match = pattern.search(link["href"])
        if match:
            print match.group(1)
Run this regex against every link:
^/on/demandware.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)
Get the result from the capturing group.
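A minimal sketch of that idea in Python (using the two example hrefs from the question's output; the print call is Python 3 style):
import re

pattern = re.compile(r'^/on/demandware\.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)')

hrefs = [
    '/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758921',
    '/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910',
]
for href in hrefs:
    match = pattern.match(href)
    if match:
        print(match.group(1))   # the pid captured by the group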
This might do:
import requests, re
from bs4 import BeautifulSoup

def getPID(url):
    return re.findall(r'(\d+)', url.rstrip('.html'))

url = "http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
having_pid = getPID(url)
print(having_pid)

r = requests.get(url)
soup = BeautifulSoup(r.content)
g_1 = soup.find_all("div", {"class": "color-scroll"})
for item in g_1:
    a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for elem in a_1:
        if getPID(elem['href'])[0] not in having_pid:
            print elem['href']
