Python Get Links Script - Needs Wildcard Search

I have the below code that, when you put in a URL with a bunch of links, will return the list of them to you. This works well, except that I only want links that start with ..., and this returns EVERY link, including ones like home/back/etc. Is there a way to use a wildcard or a "starts with" function?
from bs4 import BeautifulSoup
import requests
url = ""
# Getting the webpage, creating a Response object.
response = requests.get(url)
# Extracting the source code of the page.
data = response.text
# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')
# Extracting all the <a> tags into a list.
tags = soup.find_all('a')
# Extracting URLs from the attribute href in the <a> tags.
for tags in tags:
    print(tags.get('href'))
Also, is there a way to export to Excel? I am not great with Python and I am not sure how I got this far, to be honest.
Thanks,

Here is an updated version of your code that will get all https hrefs from the page:
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com"
# Getting the webpage, creating a Response object.
response = requests.get(url)
# Extracting the source code of the page.
data = response.text
# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')
# Extracting all the <a> tags into a list.
tags = soup.find_all('a')
# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    if str.startswith(tag.get('href'), 'https'):
        print(tag.get('href'))
If you want to get hrefs that start with something other than https, change the 2nd to last line :)
References:
https://www.tutorialspoint.com/python/string_startswith.htm

You could use startswith() :
for tag in tags:
    if tag.get('href').startswith('pre'):
        print(tag.get('href'))
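A detail worth knowing: startswith() also accepts a tuple of prefixes, so several prefixes can be checked in one test. A small sketch of that (the guard on href handles <a> tags without an href attribute; 'http://' and 'https://' are just example prefixes):
for tag in tags:
    href = tag.get('href')
    if href and href.startswith(('http://', 'https://')):
        print(href)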

For your second question (is there a way to export to Excel?): I've been using a Python module called XlsxWriter.
import xlsxwriter
# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('Expenses01.xlsx')
worksheet = workbook.add_worksheet()
# Some data we want to write to the worksheet.
expenses = (
    ['Rent', 1000],
    ['Gas', 100],
    ['Food', 300],
    ['Gym', 50],
)
# Start from the first cell. Rows and columns are zero indexed.
row = 0
col = 0
# Iterate over the data and write it out row by row.
for item, cost in expenses:
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, cost)
    row += 1
# Write a total using a formula.
worksheet.write(row, 0, 'Total')
worksheet.write(row, 1, '=SUM(B1:B4)')
workbook.close()
XlsxWriter lets the code follow basic Excel conventions. Being new to Python myself, I found it easy to get this up, running, and working on the first attempt.
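Applied back to the original question, here is a minimal sketch that writes the matching hrefs into the first column of a workbook (it assumes tags is the list of <a> tags from the earlier snippet; 'links.xlsx' is just an example filename):
import xlsxwriter
workbook = xlsxwriter.Workbook('links.xlsx')
worksheet = workbook.add_worksheet()
row = 0
for tag in tags:  # 'tags' is assumed to come from soup.find_all('a') above
    href = tag.get('href')
    if href is not None and href.startswith('https'):
        worksheet.write(row, 0, href)
        row += 1
workbook.close()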

If tag.get('href') returns a string, you should be able to filter on whatever starting string you want, like so:
URLs = [URL for URL in [tag.get('href') for tag in tags]
        if URL.startswith('/some/path/')]
Edit:
It turns out that in your case, tag.get('href') doesn't always return a string. For tags that don't contain links, the return value is None, and we can't use string methods on None. It's easy to check whether the return value of tag.get('href') is None before using the string method startswith on it.
URLs = [URL for URL in [tag.get('href') for tag in tags]
        if URL is not None and URL.startswith('/some/path/')]
Notice the addition of URL is not None and. That has to come before URL.startswith, otherwise Python will try to use a string method on None and complain. You can read this just like an English sentence, which highlights one of the great things about Python: the code is easier to read than just about any other programming language, which makes it really good for communicating ideas to other people.
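Another option, if the prefix filter can be pushed into the query itself: bs4's select() understands the CSS "attribute starts with" selector and only returns tags that actually have an href, which sidesteps the None problem entirely. A sketch using the same placeholder prefix, assuming soup is the BeautifulSoup object from the question:
URLs = [tag['href'] for tag in soup.select('a[href^="/some/path/"]')]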

Related

Treating a list of items like a single item error: how to find links within each 'link' within string already scraped

I am writing a python code to scrape the pdfs of meetings off this website: https://www.gmcameetings.co.uk
The pdf links are within links, which are also within links. I have the first set of links off the page above, then I need to scrape links within the new urls.
When I do this I get the following error:
AttributeError: ResultSet object has no attribute 'find_all'. You're
probably treating a list of items like a single item. Did you call
find_all() when you meant to call find()?
This is my code so far which is all fine and checked in jupyter notebook:
# importing libaries and defining
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# creating folder to store pdfs - if it doesn't exist, create a separate folder
folder_location = r'E:\Internship\WORK'
# getting all meeting href off url
meeting_links = soup.find_all('a',href='TRUE')
for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/')>1:
        print("Meeting!")
This is the line that then receives the error:
second_links = meeting_links.find_all('a', href='TRUE')
I have tried find() as the error message suggests, but that doesn't work either. But I understand that it can't treat meeting_links as a single item.
So basically, how do you search for links within each item of the new variable (meeting_links)?
I already have code to get the pdfs once I have the second set of urls which seems to work fine but need to obviously get these first.
Hopefully this makes sense and I've explained ok - I only properly started using python on Monday so I'm a complete beginner.
To get all meeting links, try:
from bs4 import BeautifulSoup as bs
import requests
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# Scrape to find all links
all_links = soup.find_all('a', href=True)
# Loop through links to find those containing '/meetings/'
meeting_links = []
for link in all_links:
    href = link['href']
    if '/meetings/' in href:
        meeting_links.append(href)
print(meeting_links)
The .find() function that you use in your original code is specific to Beautiful Soup objects. To find a substring within a plain string, just use native Python: 'a' in 'abcd'.
Hope that helps!
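To then get the second set of links, request each meeting page individually and run find_all on that page's soup rather than on the ResultSet. A rough sketch, assuming the meeting hrefs are absolute URLs (if they are relative you would need to join them onto the base url first):
second_links = []
for meeting_url in meeting_links:
    meeting_page = requests.get(meeting_url).text
    meeting_soup = bs(meeting_page, 'lxml')
    for link in meeting_soup.find_all('a', href=True):
        second_links.append(link['href'])
print(second_links)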

Beautiful Soup. Text extraction into a dataframe

I'm trying to extract the information from a single web-page that contains multiple similarly structured recordings. Information is contained within div tags with different classes (I'm interested in username, main text and date). Here is the code I use:
import bs4 as bs
import urllib
import pandas as pd
href = 'https://example.ru/'
sause = urllib.urlopen(href).read()
soup = bs.BeautifulSoup(sause, 'lxml')
user = pd.Series(soup.find_all('div', class_='Username'))
main_text = pd.Series(soup.find_all('div', class_='MainText'))
date = pd.Series(soup.find_all('div', class_='Date'))
result = pd.DataFrame()
result = pd.concat([user, main_text, date], axis=1)
The problem is that I receive the information with all the tags, while I want only the text. Surprisingly, the .text attribute doesn't work with the find_all method, so now I'm completely out of ideas.
Thank you for any help!
A list comprehension is the way to go. To get all the text within MainText, for example, try:
[elem.text for elem in soup.find_all('div', class_='MainText')]
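Applied to all three fields from the question, a sketch of how the DataFrame could then be assembled (this assumes the soup and pd objects from the question's snippet; the column names at the end are just labels I picked):
user = pd.Series([elem.text for elem in soup.find_all('div', class_='Username')])
main_text = pd.Series([elem.text for elem in soup.find_all('div', class_='MainText')])
date = pd.Series([elem.text for elem in soup.find_all('div', class_='Date')])
result = pd.concat([user, main_text, date], axis=1)
result.columns = ['user', 'main_text', 'date']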

Beautiful Soup Nested Tag Search

I am trying to write a python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page but I have difficulties accessing nested HTML tags (for example: <p class="hello"> inside <div>).
Every time I try finding such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, although they are there. Is there any simple method or another way to do it?
I'm guessing that what you are trying to do is to first look in a specific div tag, then search all p tags in it and count them, or do whatever you want. For example:
soup = bs4.BeautifulSoup(content, 'html.parser')
# This will get the div
div_container = soup.find('div', class_='some_class')
# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
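A CSS selector is an equivalent, more compact way to express the same nesting ('some_class' is the same placeholder class as above):
for ptag in soup.select('div.some_class p.hello'):
    print(ptag.text)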
Try this one :
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
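If you prefer a one-liner, a nested list comprehension does the same thing ('xyz' and 'abc' are the same placeholder tag names as above):
data = [abc for xyz in soup.find_all('xyz') for abc in xyz.find_all('abc')]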
UPDATE: I noticed that text does not always return the expected result. At the same time, I realized there is a built-in way to get the text; sure enough, reading the docs we find that there is a method called get_text(). Use it as:
from bs4 import BeautifulSoup
fd = open('index.html', 'r')
website= fd.read()
fd.close()
soup = BeautifulSoup(website)
contents= soup.get_text(separator=" ")
print "number of words %d" %len(contents.split(" "))
INCORRECT, please read above. Supposing that you have your html file locally in index.html, you can:
from bs4 import BeautifulSoup
import re
BLACKLIST = ["html", "head", "title", "script"] # tags to be ignored
fd = open('index.html', 'r')
website= fd.read()
soup = BeautifulSoup(website)
tags=soup.find_all(True) # find everything
print "there are %d" %len(tags)
count= 0
matcher= re.compile("(\s|\n|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text) # Split using tokens such as \s and \n
    temp = filter(None, temp) # remove empty elements in the list
    count += len(temp)
print "number of words in the document %d" %count
fd.close()
Please note that it may not be accurate, maybe because of errors in formatting, false positives (it detects any word, even if it is code), text that is shown dynamically using javascript or css, or other reasons.
You can find all <p> tags using regular expressions (re module).
Note that r.content holds the whole raw html of the site (in Python 3 it is bytes; use r.text if you need a string). For example:
r = requests.get(url,headers=headers)
p_tags = re.findall(r'<p>.*?</p>',r.content)
This should get you all the <p> tags irrespective of whether they are nested or not. And if you want the <p> tags specifically inside some particular tag, you can pass that whole tag as a string in the second argument instead of r.content.
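Two caveats with that pattern: by default . does not match newlines, so paragraphs spanning several lines are skipped, and <p> tags that carry attributes are not matched at all. A slightly more forgiving sketch, using r.text so both the pattern and the subject are strings:
import re
p_tags = re.findall(r'<p[^>]*>.*?</p>', r.text, flags=re.DOTALL)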
Alternatively, if you just want the text, you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
this will get you a more bare-bones form of the html from the site, and you can now proceed with the parsing.
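From there you could hand the simplified markup back to BeautifulSoup and parse as usual, for example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(simplified_html, 'lxml')
p_tags = soup.find_all('p')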

how to extract specific csv from web page html containing multiple csv file links

I need to extract a csv file from an html page (see below), and once I get that I can do stuff with it. Below is code to extract that particular line of html code, from a previous assignment. The url is 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'.
That is test code, so it breaks temporarily when it finds that line.
The part of the line with my csv is the href csv/datasets/co2.csv (unicode, I think, as the type).
How do I open the co2.csv?
sorry about any formatting issues with the question. The code has been sliced and diced by the editor.
import urllib
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
from BeautifulSoup import *

def scrapper(url,k):
    c = 0
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    #. Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        y = (tag.get('href', None))
        #print ((y))
        if y == 'csv/datasets/co2.csv':
            print y
            break
        c = c + 1
        if c is k:
            return y
    print(type(y))

for w in range(29):
    print(scrapper(url,w))
You're re-downloading and reparsing the full html page for all of the 30 iterations of your loop, just to get the next csv file and see if that is the one you want. That is very inefficient, and not very polite to the server. Just read the html page once, and use the loop over the tags you already had to check if the tag is the one you want! If so, do something with it, and stop looping to avoid needless further processing because you said you only needed one particular file.
The other issue related to your question is that in the html file the csv hrefs are relative urls. So you have to join them on the base url of the document they're in. urlparse.urljoin() does just that.
Not related to the question directly, but you should also try to clean up your code:
- fix your indentation (the comment on line 9)
- choose better variable names; y/c/k/w are meaningless.
Resulting in something like:
import urllib
import urlparse
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
from BeautifulSoup import *
def scraper(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        href = tag.get('href', None)
        if href.endswith("/co2.csv"):
            csv_url = urlparse.urljoin(url, href)
            # ... do something with the csv file....
            contents = urllib.urlopen(csv_url).read()
            print "csv file size=", len(contents)
            break  # we only needed this one file, so we end the loop.

scraper(url)
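For reference, a sketch of the same idea in Python 3, where urllib and urlparse have moved into urllib.request and urllib.parse, and BeautifulSoup is imported from bs4:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
for tag in soup('a'):
    href = tag.get('href', None)
    if href and href.endswith('/co2.csv'):
        csv_url = urllib.parse.urljoin(url, href)
        contents = urllib.request.urlopen(csv_url).read()
        print("csv file size =", len(contents))
        break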

BeautifulSoup to extract URLs (same URL repeating)

I've tried using BeautifulSoup and regex to extract URLs from a web page. This is my code:
Ref_pattern = re.compile('<TD width="200"><A href="(.*?)" target=')
Ref_data = Ref_pattern.search(web_page)
if Ref_data:
    Ref_data.group(1)
data = [item for item in csv.reader(output_file)]
new_column1 = ["Reference", Ref_data.group(1)]
new_data = []
for i, item in enumerate(data):
    try:
        item.append(new_column1[i])
    except IndexError, e:
        item.append(Ref_data.group(1)).next()
    new_data.append(item)
Though it has many URLs in it, it just repeats the first URL. I know there's something wrong with
except IndexError, e:
    item.append(Ref_data.group(1)).next()
this part, because if I remove it, it just gives me the first URL (without repetition). Could you please help me extract all the URLs and write them into a CSV file?
Thank you.
Although it's not entirely clear what you're looking for, based on what you've stated, if there are specific elements (classes or id's or text, for instance) associated with the links you're attempting to extract, then you can do something like the following:
from bs4 import BeautifulSoup
string = """\
Linked Text
Linked Text
Image
Phone Number"""
soup = BeautifulSoup(string)
for link in soup.findAll('a', { "class" : "pooper" }, href=True, text='Linked Text'):
    print link['href']
As you can see, I am using bs4's attribute feature to select only those anchor tags that include the "pooper" class (class="pooper"), and then I am further narrowing the return values by passing a text argument (Linked Text rather than Image).
Based on your feedback below, try the following code. Let me know.
for items in soup.select("td[width=200]"):
    for link in items.findAll('a', { "target" : "_blank" }, href=True):
        print link['href']
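If the end goal is to write all of the extracted URLs into a CSV file rather than just print them, a rough sketch along the same lines (Python 3 syntax; soup here is assumed to be the parsed page, and 'references.csv' is just an example output name):
import csv
hrefs = []
for cell in soup.select('td[width="200"]'):
    for link in cell.findAll('a', {'target': '_blank'}, href=True):
        hrefs.append(link['href'])
# write one URL per row
with open('references.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for href in hrefs:
        writer.writerow([href])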
