Opening page using urllib2 - diacritics - python

I'm trying to open multiple pages using urllib2. The problem is that some pages can't be opened. It returns urllib2.HTTPError: HTTP Error 400: Bad Request
I'm getting the hrefs of these pages from another web page (the head of that page declares charset="utf-8").
The error is returned only when I try to open a page whose URL contains 'č', 'ž' or 'ř'.
Here is the code:
def getSoup(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    page = response.read()
    soup = BeautifulSoup(page, 'html.parser')
    return soup

hovienko = getSoup("http://www.hovno.cz/hovna-az/a/1/")
lis = hovienko.find("div", class_="span12").find('ul').findAll('li')

for liTag in lis:
    aTag = liTag.find('a')['href']
    href = "http://www.hovno.cz" + aTag  # hrefs I'm trying to open using urllib2
    soup = getSoup(href.encode("iso-8859-2"))  # the error occurs here when 'č', 'ž' or 'ř' is in the url
Does anybody know what I have to do to avoid these errors?
Thank you

This site is UTF-8. Why do you need href.encode("iso-8859-2")? I have taken the following code from http://programming-review.com/beautifulsoasome-interesting-python-functions/
import urllib2
import cgitb
cgitb.enable()
from BeautifulSoup import BeautifulSoup
from urlparse import urlparse

# print all links
def PrintLinks(localurl):
    data = urllib2.urlopen(localurl).read()
    print 'Encoding of fetched HTML : %s' % type(data)
    soup = BeautifulSoup(data)
    parse = urlparse(localurl)
    localurl = parse[0] + "://" + parse[1]
    print "<h3>Page links statistics</h3>"
    l = soup.findAll("a", attrs={"href": True})
    print "<h4>Total links count = " + str(len(l)) + '</h4>'
    externallinks = []  # external links list
    for link in l:
        # if it's an external link
        if link['href'].find("http://") == 0 and link['href'].find(localurl) == -1:
            externallinks = externallinks + [link]
    print "<h4>External links count = " + str(len(externallinks)) + '</h4>'
    if len(externallinks) > 0:
        print "<h3>External links list:</h3>"
        for link in externallinks:
            if link.text != '':
                print '<h5>' + link.text.encode('utf-8')
                print ' => [' + '<a href="' + link['href'] + '" >' + link['href'] + '</a>' + ']' + '</h5>'
            else:
                print '<h5>' + '[image]',
                print ' => [' + '<a href="' + link['href'] + '" >' + link['href'] + '</a>' + ']' + '</h5>'

PrintLinks("http://www.zlatestranky.cz/pro-mobily/")

The solution was very simple. I should have used urllib2.quote().
EDITED CODE:
for liTag in lis:
    aTag = liTag.find('a')['href']
    href = "http://www.hovno.cz" + urllib2.quote(aTag.encode("utf-8"))
    soup = getSoup(href)
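For illustration, urllib2.quote() percent-encodes the non-ASCII bytes while leaving '/' intact, which is why the resulting URL is accepted by the server. A minimal sketch (the path below is just an example):

# -*- coding: utf-8 -*-
import urllib2

path = u"/hovna-az/č/1/"                     # example path with a diacritic
print urllib2.quote(path.encode("utf-8"))    # -> /hovna-az/%C4%8D/1/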

A couple of things here.
First, your URIs can't contain non-ASCII characters. You have to percent-encode them. See this:
How to fetch a non-ascii url with Python urlopen?
Secondly, save yourself a world of pain and use requests for HTTP stuff.
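For reference, a minimal sketch of the getSoup() helper from the question rewritten with requests. This is an illustration, not code from the original posts; requests percent-encodes most unsafe URL characters for you, though quoting explicitly is still the safest option:

import requests
from bs4 import BeautifulSoup

def get_soup(url):
    # requests handles most URL quoting and the connection details for you
    response = requests.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')

soup = get_soup(u"http://www.hovno.cz/hovna-az/a/1/")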

Related

Appending text to a string if it matches a condition

I am learning to scrape websites. I need to get document titles and links to them. I have already managed to do this, but the format of the resulting links is sometimes not what I need. Here is a snippet of the information I get:
['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm
['Численность мужчин и женщин', '/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
/storage/mediabank/yKsfiyjR/demo13.xls
You can see that in the second case I get only part of the link, while in the first I get the whole link. For the second format I need to prepend a base URL that I know in advance, but only on the condition that the link is in that shortened format. That is, at the output I want to receive the following:
['Плотность населения субъектов Российской Федерации', 'http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm']
Плотность населения субъектов Российской Федерации
http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm
['Численность мужчин и женщин', 'https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls']
Численность мужчин и женщин
https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls
How should I do it? Here is the code I have so far:
import requests
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"
responce = requests.get(URL).text
soup = BeautifulSoup(responce, 'lxml')
block = soup.find('div', class_="col-lg-8 order-1 order-lg-1")
list_info_block_row = block.find_all('div', class_='document-list__item document-list__item--row')
list_info_block_col = block.find_all('div', class_='document-list__item document-list__item--col')
sources = []

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find('div', class_='document-list__item-title')
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find('a').get('href')
    new_list.append(preprocessing_title)
    new_list.append(link_element_row)
    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print('\n\n')
You can check whether the string has a scheme, and if not, prepend the scheme and host:
if not link_element_row.startswith("http"):
    parsed_url = urlparse(URL)
    link_element_row = (
        parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
    )
Full working code:
import requests
from urllib.parse import urlparse
from bs4 import BeautifulSoup

URL = "https://rosstat.gov.ru/folder/12781"
responce = requests.get(URL).text
soup = BeautifulSoup(responce, "lxml")
block = soup.find("div", class_="col-lg-8 order-1 order-lg-1")
list_info_block_row = block.find_all(
    "div", class_="document-list__item document-list__item--row"
)
list_info_block_col = block.find_all(
    "div", class_="document-list__item document-list__item--col"
)

for text_block_row in list_info_block_row:
    new_list = []
    title_element_row = text_block_row.find("div", class_="document-list__item-title")
    preprocessing_title = title_element_row.text.strip()
    link_element_row = text_block_row.find("a").get("href")
    new_list.append(preprocessing_title)
    if not link_element_row.startswith("http"):
        parsed_url = urlparse(URL)
        link_element_row = (
            parsed_url.scheme + "://" + parsed_url.netloc + link_element_row
        )
    new_list.append(link_element_row)
    print(new_list)
    print(title_element_row.text.strip())
    print(link_element_row)
    print("\n\n")
Research:
Get protocol + host name from URL
startswith
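As an alternative sketch (not part of the answer above), urllib.parse.urljoin resolves both absolute and relative hrefs in one call, so the startswith check is not needed:

from urllib.parse import urljoin

base = "https://rosstat.gov.ru/folder/12781"
# a root-relative href is resolved against the base URL's host
print(urljoin(base, "/storage/mediabank/yKsfiyjR/demo13.xls"))
# -> https://rosstat.gov.ru/storage/mediabank/yKsfiyjR/demo13.xls

# an already absolute href is returned unchanged
print(urljoin(base, "http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm"))
# -> http://gks.ru/free_doc/new_site/population/demo/dem11_map.htm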

How to save my links from BeautifulSoup in a text file with python?

I'm learning Python and web scraping. It is very cool, but I am not able to get what I want.
I'm trying to save product links in a text file so I can scrape their data afterwards.
Here is my script, which works (almost) correctly in the PyCharm console:
import bs4 as bs4
from bs4 import BeautifulSoup
import requests

suffixeUrl = '_puis_nblignes_est_200.html'

for i in range(15):
    url = 'https://www.topachat.com/pages/produits_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_page_est_' + str(i) + suffixeUrl
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    if response.ok:
        print('Page: ' + str(i))
        for data in soup.find_all('div', class_='price'):
            for a in data.find_all('a'):
                link = (a.get('href'))
                links = ('https://www.topachat.com/' + link)
                print(links)  # for getting link
My goal is to save the result of the links variable, line by line, in a text file.
I tried this, but something is wrong and I can't get each URL:
for link in links:
    with open("urls.txt", "a") as f:
        f.write(links+"\n")
Please, can someone help me?
You can try it this way.
Just open the file once and write all the data to it; opening and closing a file inside a loop is not a good thing to do.
import bs4 as bs4
from bs4 import BeautifulSoup
import requests

suffixeUrl = '_puis_nblignes_est_200.html'

with open('text.txt', 'w') as f:
    for i in range(15):
        url = 'https://www.topachat.com/pages/produits_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_page_est_' + str(i) + suffixeUrl
        response = requests.get(url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        if response.ok:
            print('Page: ' + str(i))
            for data in soup.find_all('div', class_='price'):
                for a in data.find_all('a'):
                    link = 'https://www.topachat.com/' + a.get('href')
                    f.write(link+'\n')
Sample output from text.txt
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in11020650.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in10119254.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20005046.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20002036.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20002591.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20004309.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20002592.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in10089390.html
.
.
.
Your problem is in the for link in links loop:
link = (a.get('href'))
links = ('https://www.topachat.com/' + link)
print(links)

for link in links:
    with open("urls.txt", "a") as f:
        f.write(links+"\n")
The type of links is string, and your for loop iterates over it character by character. That is why you see a single character on each line of your txt file. You can just remove that for loop and the code will work:
import bs4 as bs4
from bs4 import BeautifulSoup
import requests

suffixeUrl = '_puis_nblignes_est_200.html'

for i in range(15):
    url = 'https://www.topachat.com/pages/produits_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_page_est_' + str(i) + suffixeUrl
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    if response.ok:
        print('Page: ' + str(i))
        for data in soup.find_all('div', class_='price'):
            for a in data.find_all('a'):
                link = (a.get('href'))
                links = ('https://www.topachat.com/' + link)
                print(links)  # for getting link
                with open("urls.txt", "a") as f:
                    f.write(links+"\n")
You can do it like this:
import bs4 as bs4
from bs4 import BeautifulSoup
import requests

suffixeUrl = '_puis_nblignes_est_200.html'
url_list = set()

for i in range(15):
    url = 'https://www.topachat.com/pages/produits_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_page_est_' + str(i) + suffixeUrl
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    if response.ok:
        print('Page: ' + str(i))
        for data in soup.find_all('div', class_='price'):
            for a in data.find_all('a'):
                link = (a.get('href'))
                links = ('https://www.topachat.com/' + link)
                print(links)  # for getting link
                url_list.add(links)

with open("urls.txt", "a") as f:
    for link in url_list:
        f.write(link+"\n")

Extracting link text and writing to file

I have a crawler that extracts links from a page only if the link text includes a given text, and I'm writing the output to an HTML file. It's working, but I would like to add the whole link text next to these links, like this: "Junior Java developer - https://www.jobs.cz/junior-developer/". How can I do this?
Thanks
import requests
from bs4 import BeautifulSoup
import re

def jobs_crawler(max_pages):
    page = 1
    file_name = 'links.html'
    while page < max_pages:
        url = 'https://www.jobs.cz/prace/praha/?field%5B%5D=200900011&field%5B%5D=200900012&field%5B%5D=200900013&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        page += 1
        file = open(file_name, 'w')
        for link in soup.find_all('a', {'class': 'search-list__main-info__title__link'}, text=re.compile('IT', re.IGNORECASE)):
            href = link.get('href') + '\n'
            file.write('<a href="' + href + '">' + 'LINK TEXT HERE' + '</a>' + '<br />')
            print(href)
        file.close()
        print('Saved to %s' % file_name)

jobs_crawler(5)
This should help:
file.write('''<a href="{0}">{1}</a><br />'''.format(link.get('href'), link.text))
Try this:
href = link.get('href') + '\n'
txt = link.get_text()  # will give you the link text
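Putting the two answers together, a hedged sketch that writes the "Junior Java developer - https://..." format from the question for a single page (it reuses the URL and CSS class from the question's code, so treat it as an illustration only):

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.jobs.cz/prace/praha/?field%5B%5D=200900011&field%5B%5D=200900012&field%5B%5D=200900013&page=1'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

with open('links.html', 'w') as file:
    for link in soup.find_all('a', {'class': 'search-list__main-info__title__link'},
                              text=re.compile('IT', re.IGNORECASE)):
        href = link.get('href')
        # write "link text - clickable href" on one line
        file.write('{0} - <a href="{1}">{1}</a><br />'.format(link.text.strip(), href))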

Python Web Scraping Randomly Failing

I am trying to make a sitemap for the following website:
http://aogweb.state.ak.us/WebLink/0/fol/12497/Row1.aspx
The code first determines how many pages are on the top directory level, then stores each page number and its corresponding link. It then goes through each page and builds a dictionary that maps each 3-digit file value to the corresponding link. From there the code builds another dictionary of the pages and links for each 3-digit directory (this is the point at which I am stuck). Once this is complete, the goal is to build a dictionary that maps each 6-digit file number to its corresponding link.
However, the code randomly fails at certain points throughout the scraping process and gives the following error message:
Traceback (most recent call last):
File "C:\Scraping_Test.py", line 76, in <module>
totalPages = totalPages.text
AttributeError: 'NoneType' object has no attribute 'text'
Sometimes the code does not even run and automatically skips to the end of the program without any errors.
I am currently running Python 3.6.0 with all libraries up to date on Visual Studio Community 2015. Any help will be appreciated, as I am new to programming.
import bs4 as bs
import requests
import re
import time

def stop():
    print('sleep 5 sec')
    time.sleep(5)

url0 = 'http://aogweb.state.ak.us'
url1 = 'http://aogweb.state.ak.us/WebLink/'

r = requests.get('http://aogweb.state.ak.us/WebLink/0/fol/12497/Row1.aspx')
soup = bs.BeautifulSoup(r.content, 'lxml')
print('Status: ' + str(r.status_code))
stop()

pagesTopDic = {}
pagesTopDic['1'] = '/WebLink/0/fol/12497/Row1.aspx'
dig3Dic = {}

for link in soup.find_all('a'):  # find top pages
    if not link.get('title') is None:
        if 'page' in link.get('title').lower():
            page = link.get('title')
            page = page.split(' ')[1]
            #print(page)
            pagesTopDic[page] = link.get('href')

listKeys = pagesTopDic.keys()
for page in listKeys:  # on each page find urls for beginning 3 digits
    url = url0 + pagesTopDic[page]
    r = requests.get(url)
    soup = bs.BeautifulSoup(r.content, 'lxml')
    print('Status: ' + str(r.status_code))
    stop()
    for link in soup.find_all('a'):
        if not link.get("aria-label") is None:
            folder = link.get("aria-label")
            folder = folder.split(' ')[0]
            dig3Dic[folder] = link.get('href')

listKeys = dig3Dic.keys()
pages3Dic = {}
for entry in listKeys:  # pages for each three digit num
    print(entry)
    url = url1 + dig3Dic[entry]
    r = requests.get(url)
    soup = bs.BeautifulSoup(r.content, 'lxml')
    print('Status: ' + str(r.status_code))
    stop()
    tmpDic = {}
    tmpDic['1'] = '/Weblink/' + dig3Dic[entry]
    totalPages = soup.find('div', {"class": "PageXofY"})
    print(totalPages)
    totalPages = totalPages.text
    print(totalPages)
    totalPages = totalPages.split(' ')[3]
    print(totalPages)
    while len(tmpDic.keys()) < int(totalPages):
        r = requests.get(url)
        soup = bs.BeautifulSoup(r.content, 'lxml')
        print('Status: ' + str(r.status_code))
        stop()
        for link in soup.find_all('a'):  # find top pages
            if not link.get('title') is None:
                #print(link.get('title'))
                if 'Page' in link.get('title'):
                    page = link.get('title')
                    page = page.split(' ')[1]
                    tmpDic[page] = link.get('href')
        num = len(tmpDic.keys())
        url = url0 + tmpDic[str(num)]
    print()
    pages3Dic[entry] = tmpDic
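No answer is recorded for this question here, but the traceback shows soup.find('div', {"class": "PageXofY"}) returning None. A minimal defensive sketch, assuming the element is sometimes simply missing from the fetched HTML, is to retry the request a few times and let the caller skip the entry if it never appears (illustration only, not tested against this site):

import bs4 as bs
import requests

def find_total_pages(url, attempts=3):
    # retry because the "Page X of Y" element is sometimes missing from the response
    for _ in range(attempts):
        r = requests.get(url)
        soup = bs.BeautifulSoup(r.content, 'lxml')
        total_pages = soup.find('div', {"class": "PageXofY"})
        if total_pages is not None:
            # same parsing as in the question: "Page 1 of 7" -> 7
            return int(total_pages.text.split(' ')[3])
    return None  # caller can skip this entry or log it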

How to obtain all the links in a domain using Python?

I want to use Python to obtain all the links in a domain given the 'root' URL (in a list). Suppose I am given a URL http://www.example.com; this should return all the links on this page that are in the same domain as the root URL, then recurse on each of these links, visiting them and extracting all the links of the same domain, and so on. What I mean by the same domain is: given http://www.example.com, the only links I want back are http://www.example.com/something, http://www.example.com/somethingelse, ... Anything external such as http://www.otherwebsite.com should be discarded. How can I do this using Python?
EDIT: I made an attempt using lxml. I don't think this works fully, and I am not sure how to take into account links to pages that were already processed (which would cause an infinite loop).
import urllib
import lxml.html

# given a url returns list of all sublinks within the same domain
def getLinks(url):
    urlList = []
    urlList.append(url)
    sublinks = getSubLinks(url)
    for link in sublinks:
        absolute = url+'/'+link
        urlList.extend(getLinks(absolute))
    return urlList

# determine whether two links are within the same domain
def sameDomain(url, dom):
    return url.startswith(dom)

# get tree of sublinks in same domain, url is root
def getSubLinks(url):
    sublinks = []
    connection = urllib.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    for link in dom.xpath('//a/@href'):
        if not (link.startswith('#') or link.startswith('http') or link.startswith('mailto:')):
            sublinks.append(link)
    return sublinks
import sys
import requests
import hashlib
from bs4 import BeautifulSoup
from datetime import datetime

def get_soup(link):
    """
    Return the BeautifulSoup object for input link
    """
    request_object = requests.get(link, auth=('user', 'pass'))
    soup = BeautifulSoup(request_object.content)
    return soup

def get_status_code(link):
    """
    Return the error code for any url
    param: link
    """
    try:
        error_code = requests.get(link).status_code
    except requests.exceptions.ConnectionError:
        error_code =
    return error_code

def find_internal_urls(lufthansa_url, depth=0, max_depth=2):
    all_urls_info = []
    status_dict = {}
    soup = get_soup(lufthansa_url)
    a_tags = soup.findAll("a", href=True)
    if depth > max_depth:
        return {}
    else:
        for a_tag in a_tags:
            if "http" not in a_tag["href"] and "/" in a_tag["href"]:
                url = "http://www.lufthansa.com" + a_tag['href']
            elif "http" in a_tag["href"]:
                url = a_tag["href"]
            else:
                continue
            status_dict["url"] = url
            status_dict["status_code"] = get_status_code(url)
            status_dict["timestamp"] = datetime.now()
            status_dict["depth"] = depth + 1
            all_urls_info.append(status_dict)
    return all_urls_info

if __name__ == "__main__":
    depth = 2  # suppose
    all_page_urls = find_internal_urls("someurl", 2, 2)
    if depth > 1:
        for status_dict in all_page_urls:
            find_internal_urls(status_dict['url'])
The above snippet contains the modules necessary for scraping URLs from the Lufthansa airlines website. The only additional thing here is that you can specify the depth to which you want to scrape recursively.
Here is what I've done, only following full URLs like http://domain[xxx]. Quick but a bit dirty.
import requests
import re

domain = u"stackoverflow.com"
http_re = re.compile(u"(http:\/\/" + domain + "[\/\w \.-]*\/?)")

visited = set([])

def visit(url):
    visited.add(url)
    extracted_body = requests.get(url).text
    matches = re.findall(http_re, extracted_body)
    for match in matches:
        if match not in visited:
            visit(match)

visit(u"http://" + domain)
print(visited)
There are some bugs in the code from @namita. I modified it and it works well now.
import sys
import requests
import hashlib
from bs4 import BeautifulSoup
from datetime import datetime

def get_soup(link):
    """
    Return the BeautifulSoup object for input link
    """
    request_object = requests.get(link, auth=('user', 'pass'))
    soup = BeautifulSoup(request_object.content, "lxml")
    return soup

def get_status_code(link):
    """
    Return the error code for any url
    param: link
    """
    try:
        error_code = requests.get(link).status_code
    except requests.exceptions.ConnectionError:
        error_code = -1
    return error_code

def find_internal_urls(main_url, depth=0, max_depth=2):
    all_urls_info = []
    soup = get_soup(main_url)
    a_tags = soup.findAll("a", href=True)

    if main_url.endswith("/"):
        domain = main_url
    else:
        domain = "/".join(main_url.split("/")[:-1])
    print(domain)

    if depth > max_depth:
        return {}
    else:
        for a_tag in a_tags:
            if "http://" not in a_tag["href"] and "https://" not in a_tag["href"] and "/" in a_tag["href"]:
                url = domain + a_tag['href']
            elif "http://" in a_tag["href"] or "https://" in a_tag["href"]:
                url = a_tag["href"]
            else:
                continue
            # print(url)
            status_dict = {}
            status_dict["url"] = url
            status_dict["status_code"] = get_status_code(url)
            status_dict["timestamp"] = datetime.now()
            status_dict["depth"] = depth + 1
            all_urls_info.append(status_dict)
    return all_urls_info

if __name__ == "__main__":
    url = # your domain here
    depth = 1
    all_page_urls = find_internal_urls(url, 0, 2)
    # print("\n\n", all_page_urls)
    if depth > 1:
        for status_dict in all_page_urls:
            find_internal_urls(status_dict['url'])
The code works, but I don't know if it's 100% correct. It extracts all the internal URLs in the website.
import requests
from bs4 import BeautifulSoup

def get_soup(link):
    """
    Return the BeautifulSoup object for input link
    """
    request_object = requests.get(link, auth=('user', 'pass'))
    soup = BeautifulSoup(request_object.content, "lxml")
    return soup

visited = set([])

def visit(url, domain):
    visited.add(url)
    soup = get_soup(url)
    a_tags = soup.findAll("a", href=True)
    for a_tag in a_tags:
        if "http://" not in a_tag["href"] and "https://" not in a_tag["href"] and "/" in a_tag["href"]:
            url = domain + a_tag['href']
        elif "http://" in a_tag["href"] or "https://" in a_tag["href"]:
            url = a_tag["href"]
        else:
            continue
        if url not in visited and domain in url:
            # print(url)
            visit(url, domain)

url = input("Url: ")
domain = input("domain: ")
visit(u"" + url, domain)
print(visited)
From the tags of your question, I assume you are using Beautiful Soup.
First, you obviously need to download the webpage, for example with urllib.request. After you have done that and have the contents in a string, you pass it to Beautiful Soup. After that, you can find all links with soup.find_all('a'), assuming soup is your Beautiful Soup object. Then you simply need to check the hrefs:
The simplest version would be to just check whether "http://www.example.com" is in the href, but that won't catch relative links. I guess some wild regular expression would do (find everything containing "www.example.com", or starting with "/", or starting with "?" (php)), or you might look for everything that contains a www but is not www.example.com and discard it, etc. The correct strategy may depend on the website you are scraping and its coding style.
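A minimal sketch of that href check, using urllib.parse to resolve relative links and compare host names (the root URL is only an example):

from urllib.parse import urljoin, urlparse

root = "http://www.example.com"

def same_domain(root, href):
    # resolve relative hrefs against the root, then compare the host names
    absolute = urljoin(root, href)
    return urlparse(absolute).netloc == urlparse(root).netloc

print(same_domain(root, "/something"))                   # True
print(same_domain(root, "http://www.otherwebsite.com"))  # False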
You can use a regular expression to filter out such links,
e.g.
<a\shref\=\"(http\:\/\/example\.com[^\"]*)\"
Take the above regex as a reference and start writing a script based on it.
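For illustration, a minimal sketch applying that pattern with re.findall (the HTML string is made up):

import re

html = '<a href="http://example.com/a">A</a> <a href="http://other.com/b">B</a>'
pattern = re.compile(r'<a\shref\=\"(http\:\/\/example\.com[^\"]*)\"')
# only links on example.com are captured
print(pattern.findall(html))  # ['http://example.com/a']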
