Python Regex Webcrawling, Get Double results, need just one - python

I am working on a basic Python web-crawling program that goes to a website, reads the email addresses, and prints them as output. I am getting the right answers, but each one is duplicated. Can you please help me fix it?
Here is the program:
from re import findall
import urllib.request
url = "https://www.uta.edu/academics/schools-colleges/business/admissions-and-advising/cob-advising"
print("Email addresses for advisors:")
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = html.decode()
pdata = findall(r"[A-Za-z0-9._%+-]+"
                r"#[A-Za-z0-9.-]+"
                r"\.[A-Za-z]{2,4}", htmlStr)
for item in pdata:
    print(item)

for item in list(dict.fromkeys(pdata)):
    print(item)
dict.fromkeys(pdata) uses the list's items as dictionary keys (the values are all None). A key that already exists is simply ignored when it is added again, so list(dict.fromkeys(pdata)) gives you the items with the duplicates removed, while keeping their original order.
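For example, a quick demonstration with a small made-up list (these addresses are placeholders, not taken from the page):
emails = ["advisor.one#uta.edu", "advisor.two#uta.edu", "advisor.one#uta.edu"]
print(list(dict.fromkeys(emails)))
# ['advisor.one#uta.edu', 'advisor.two#uta.edu']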

You get each e-mail address twice because the website contains each e-mail address two times. You can convert your list to a set to keep only the unique items, and then convert it back to a list if you need the results in a list. Note that a set does not preserve the original order:
pdata = list(set(pdata))

There are two copies of every email in the html file (one in the text and another in the href attribute). Here is an example of this case:
<a href="mailto:micah.washington#uta.edu" class="uta-btn uta-btn-ghost">
<span>micah.washington#uta.edu</span>
</a>
The standard way would be to use a parser to get only the text of the html and not the attributes/tags. But here, the easiest way would be to print every other element:
for item in pdata[::2]:
    print(item)
And here is a more standard way of doing it, using the BeautifulSoup html parser, where div.text extracts the text of the html and drops the tags and attributes:
from re import findall
import urllib.request
from bs4 import BeautifulSoup as bs
url = "https://www.uta.edu/academics/schools-colleges/business/admissions-and-advising/cob-advising"
print("Email addresses for advisors:")
response = urllib.request.urlopen(url)
div = bs(response, 'html5lib')
pdata = findall(r"[A-Za-z0-9._%+-]+"
                r"#[A-Za-z0-9.-]+"
                r"\.[A-Za-z]{2,4}", div.text)
for item in pdata:
    print(item)

Related

Extracting a specific substring from a specific hyper-reference using Python

I'm new to Python, and for my second attempt at a project, I wanted to extract a substring – specifically, an identifying number – from a hyper-reference on a url.
For example, this url is the result of my search query, giving the hyper-reference http://www.chessgames.com/perl/chessgame?gid=1012809. From this I want to extract the identifying number "1012809" and append it to navigate to the url http://www.chessgames.com/perl/chessgame?gid=1012809, after which I plan to download the file at the url http://www.chessgames.com/pgn/alekhine_naegeli_1932.pgn?gid=1012809. But I am currently stuck a few steps behind this because I can't figure out a way to extract the identifier.
Here is my MWE:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
import re
y = str(soup)
x = re.findall("gid=[0-9]+",y)
print x
z = re.sub("gid=", "", x(1)) #At this point, things have completely broken down...
As Albin Paul commented, re.findall returns a list, so you need to extract elements from it. By the way, you don't need BeautifulSoup here; use urllib2.urlopen(url).read() to get the content as a string. The re.sub is also not needed: one regex pattern, (?:gid=)([0-9]+), is enough.
import re
import urllib2
url = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
page = urllib2.urlopen(url).read()
result = re.findall(r"(?:gid=)([0-9]+)",page)
print(result[0])
#'1012809'
You don't need regex here at all. A CSS selector along with a bit of string manipulation will lead you in the right direction. Try the below script:
import requests
from bs4 import BeautifulSoup
page_link = 'http://www.chessgames.com/perl/chess.pl?yearcomp=exactly&year=1932&playercomp=white&pid=&player=Alekhine&pid2=&player2=Naegeli&movescomp=exactly&moves=&opening=&eco=&result=1%2F2-1%2F2'
soup = BeautifulSoup(requests.get(page_link).text, 'lxml')
item_num = soup.select_one("[href*='gid=']")['href'].split("gid=")[1]
print(item_num)
Output:
1012809

Beautiful Soup Nested Tag Search

I am trying to write a python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page, but I have difficulties accessing nested HTML tags (for example: <p class="hello"> inside <div>).
Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, although they are there. Is there any simple method or another way to do it?
I'm guessing that what you are trying to do is to first look in a specific div tag, then search for all the p tags in it and count them, or do whatever you want with them. For example:
import bs4

soup = bs4.BeautifulSoup(content, 'html.parser')
# This will get the div
div_container = soup.find('div', class_='some_class')
# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
Try this one:
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cooler, but this works. Thanks.
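For what it's worth, the same loop can be written as a single list comprehension (keeping the placeholder tag names 'xyz' and 'abc' from above):
data = [tag for nested_soup in soup.find_all('xyz') for tag in nested_soup.find_all('abc')]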
UPDATE: I noticed that .text does not always return the expected result. At the same time, I realized there is a built-in way to get the text; sure enough, reading the docs, we see that there is a method called get_text(). Use it as:
from bs4 import BeautifulSoup
fd = open('index.html', 'r')
website= fd.read()
fd.close()
soup = BeautifulSoup(website)
contents= soup.get_text(separator=" ")
print "number of words %d" %len(contents.split(" "))
INCORRECT, please read above. Supposing that you have your html file locally in index.html, you can:
from bs4 import BeautifulSoup
import re
BLACKLIST = ["html", "head", "title", "script"] # tags to be ignored
fd = open('index.html', 'r')
website= fd.read()
soup = BeautifulSoup(website)
tags=soup.find_all(True) # find everything
print "there are %d" %len(tags)
count = 0
matcher = re.compile(r"(\s|\n|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # Split using tokens such as \s and \n
    temp = filter(None, temp)  # remove empty elements in the list
    count += len(temp)
print "number of words in the document %d" % count
fd.close()
Please note that it may not be accurate, perhaps because of errors in formatting, false positives (it detects any word, even if it is code), text that is shown dynamically using javascript or css, or other reasons.
You can find all <p> tags using regular expressions (re module).
Note that r.content is a string which contains the whole html of the site.
For example:
r = requests.get(url,headers=headers)
p_tags = re.findall(r'<p>.*?</p>',r.content)
This should get you all the <p> tags, irrespective of whether they are nested or not. And if you only want the tags inside a specific tag, you can pass that whole tag as a string in the second argument instead of r.content.
Alternatively, if you just want the text, you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the html from the site, and you can then proceed with the parsing.
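A possible continuation, just to sketch the idea (assuming you still want a word count, as in the original question, and feeding the simplified html back into BeautifulSoup):
from bs4 import BeautifulSoup
soup = BeautifulSoup(simplified_html, 'html.parser')
text = soup.get_text(separator=' ')
print("number of words %d" % len(text.split()))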

Loading more links in a page after sending json requests in Python

I am parsing this URL to get links from one of the boxes with infinite scroll. Here is my code for sending the requests to the website to get the next 10 links:
import requests
from bs4 import BeautifulSoup
import urllib2
import urllib
import extraction
import json
from json2html import *
baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
parameters2 = {
    'ticker': 'XOM',
    'countryCode': 'US',
    'docType': '2007',
    'sequence': '6e09aca3-7207-446e-bb8a-db1a4ea6545c',
    'messageNumber': '1830',
    'count': '10',
    'channelName': '',
    'topic': ' ',
    '_': '1479539628362'}
html2 = requests.get(baseUrl, params = parameters2)
html3 = json.loads(html2.text) # array of size 10
In the corresponding HTML, there is an element like:
<li class="loading">Loading more headlines...</li>
which says there are more items to be loaded by scrolling down, but I don't know how to use the json file to write a loop that gets more links.
My first try was to use Beautiful Soup and write the following code to get the links and ids:
url = 'http://www.marketwatch.com/investing/stock/xom'
r = urllib.urlopen(url).read()
soup = BeautifulSoup(r, 'lxml')
pressReleaseBox = soup.find('div', attrs={'id':'prheadlines'})
and then check if there are more links to scrape and get the next json file:
loadingMore = pressReleaseBox.find('li',attrs={'class':'loading'})
while loadingMore != None:
    # get the links from json file and load more links
I don't know how to implement the comment part. Do you have any idea about it?
I am not obliged to use BeautifulSoup, and any other working library will be fine.
Here is how you can load more json files:
1. Get the last json file and extract the value of the key UniqueId from its last item.
2. If the value looks like e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8499, then
   extract e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2 as sequence,
   extract 8499 as messageNumber,
   and let docId be empty.
3. If the value looks like 1222712881, then
   let sequence be empty,
   let messageNumber be empty,
   and extract 1222712881 as docId.
4. Put the parameters sequence, messageNumber and docId into your parameters2.
5. Use requests.get(baseUrl, params=parameters2) to get your next json file (a sketch of this loop follows below).
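Putting those steps together, here is a rough sketch of the loop (it assumes each json response is a list of dicts with a 'UniqueId' key, as described above; the exact field names may differ on the live endpoint, so adjust them to whatever the response actually contains):
import requests

baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
parameters2 = {
    'ticker': 'XOM',
    'countryCode': 'US',
    'docType': '2007',
    'sequence': '6e09aca3-7207-446e-bb8a-db1a4ea6545c',
    'messageNumber': '1830',
    'count': '10',
    'channelName': '',
    'docId': '',
    'topic': ' ',
    '_': '1479539628362'}

for _ in range(5):  # fetch five more pages, for example
    items = requests.get(baseUrl, params=parameters2).json()
    if not items:
        break
    # ... use the 10 headline dicts in `items` here ...
    unique_id = items[-1]['UniqueId']  # assumed key name, per the steps above
    if ':' in unique_id:  # e.g. 'e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8499'
        sequence, messageNumber = unique_id.split(':')
        parameters2.update(sequence=sequence, messageNumber=messageNumber, docId='')
    else:  # e.g. '1222712881'
        parameters2.update(sequence='', messageNumber='', docId=unique_id)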

Extract single href from a web page

I am working on a piece of code in which I have to extract a single href link. The problem I am facing is that it extracts two links which are identical except for the last ID part. I have one ID, and I just want to extract the other one from the link. This is my code:
import requests,re
from bs4 import BeautifulSoup
url="http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
r=requests.get(url)
soup=BeautifulSoup(r.content)
g_1=soup.find_all("div",{"class":"color-scroll"})
for item in g_1:
    a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for elem in a_1:
        print elem['href']
The output which I am getting is:
/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758921
/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910
I have the first ID, i.e. 500758921; I want to extract the other one.
Please help. Thanks in advance!
If you need every link except the first one, just slice the result of find_all():
links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
for link in links[1:]:
    print link['href']
The reason that slicing works is that find_all() returns a ResultSet instance, which is internally based on a regular Python list:
class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""

    def __init__(self, source, result=()):
        super(ResultSet, self).__init__(result)
        self.source = source
To extract the pid from the links you've got, you can use a regular expression search saving the pid value in a capturing group:
import re
pattern = re.compile(r"pid=(\w+)")
for item in g_1:
    links = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for link in links[1:]:
        match = pattern.search(link["href"])
        if match:
            print match.group(1)
Run this regex for every link
^/on/demandware.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)
Get the result from the last regex group.
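For instance, applied by hand to one of the hrefs from the output above (the variable names here are just for illustration):
import re

pattern = r"^/on/demandware.store/Sites-BNY-Site/default/Product-Variation\?pid=([0-9]+)"
href = "/on/demandware.store/Sites-BNY-Site/default/Product-Variation?pid=500758910"
match = re.search(pattern, href)
if match:
    print match.group(1)  # prints 500758910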
This might do:
import requests,re
from bs4 import BeautifulSoup
def getPID(url):
    return re.findall(r'(\d+)', url.rstrip('.html'))
url="http://www.barneys.com/band-of-outsiders-oxford-sport-shirt-500758921.html"
having_pid = getPID(url)
print(having_pid)
r=requests.get(url)
soup=BeautifulSoup(r.content)
g_1=soup.find_all("div",{"class":"color-scroll"})
for item in g_1:
    a_1 = soup.find_all('a', href=re.compile('^/on/demandware.store/Sites-BNY-Site/default/Product-Variation'))
    for elem in a_1:
        if getPID(elem['href'])[0] not in having_pid:
            print elem['href']

Python BS4 crawler indexerror

I am trying to create a simple crawler that pulls meta data from websites and saves the information into a csv. So far I am stuck here; I have followed some guides but am now stuck with the error:
IndexError: list index out of range
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
# Copy all of the content from the provided web page
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')
# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')
# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)
# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2,16)
# Print out the results to screen
for i in listIterator:
    print findPatTitle[i]  # The title
    print findPatLink[i]  # The link to the original article
    articlePage = urlopen(findPatLink[i]).read()  # Grab all of the content from original article
    divBegin = articlePage.find('<div>')  # Locate the div provided
    article = articlePage[divBegin:(divBegin+1000)]  # Copy the first 1000 characters after the div
    # Pass the article to the Beautiful Soup Module
    soup = BeautifulSoup(article)
    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')
    # Print all of the paragraphs to screen
    for i in paragList:
        print i
        print '\n'
# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)
print soup2.findAll('title')
print soup2.findAll('link')
titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')
for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print '\n'
Any help would be greatly appreciated.
The error I get is:
  File "C:\Users......", line 24, in <module>
    print findPatTitle[i]  # the title
IndexError: list index out of range
Thank you.
It seems that you are not using all the power that bs4 can give you.
You are getting this error because the length of findPatTitle is just one, since an html document usually has only one title element.
A simple way to grab the title of an HTML document is by using bs4 itself:
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('http://www.tidyawaytoday.co.uk/').read()
soup = BeautifulSoup(webpage)
# get the content of title
title = soup.title.text
You will probably get the same error if you try to iterate over your findPatLink in the current way, since it has length 6. For me, it is not clear whether you want to get all the link elements or all the anchor elements, but sticking with the first idea, you can improve your code using bs4 again:
link_href_list = [link['href'] for link in soup.find_all("link")]
And finally, since you don't want some urls, you can slice link_href_list in the way that you want. An improved version of the last expression which excludes the first and the second result could be:
link_href_list = [link['href'] for link in soup.find_all("link")[2:]]
