I'm trying to extract data from BBB, but I get no response. I don't get any error messages, just a blinking cursor. Is my regex the issue? Also, if you see anything I can improve in terms of efficiency or coding style, I'm open to your advice!
Here is the code:
import urllib2
import re
print "Enter an industry keyword."
print "Example: florists, construction, tiles"
keyword = raw_input('> ')
print "How many pages to dig through BBB?"
total_pages = raw_input('> ')
print "Working..."
page_number = 1
address_list = []
url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
req = urllib2.Request(url)
req.add_header('User-agent', 'Mozilla/5.0')
resp = urllib2.urlopen(req)
respData = resp.read()
address_pattern = r'<address>(.*?)<\/address>'
while page_number <= total_pages:
    business_address = re.findall(address_pattern,str(respData))
    for each in business_address:
        address_list.append(each)
    page_number += 1
for each in address_list:
    print each
print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')
file = open('export.txt','w')
for each in address_list:
    file.write('%r \n' % each)
file.close()
print 'File saved!'
EDITED, but I still don't get any results:
import urllib2
import re
print "Enter an industry keyword."
print "Example: florists, construction, tiles"
keyword = raw_input('> ')
print "How many pages to dig through BBB?"
total_pages = int(raw_input('> '))
print "Working..."
page_number = 1
address_list = []
for page_number in range(1,total_pages):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()
    address_pattern = r'<address>(.*?)<\/address>'
    business_address = re.findall(address_pattern,respData)
    address_list.extend(business_address)
for each in address_list:
    print each
print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')
file = open('export.txt','w')
for each in address_list:
    file.write('%r \n' % each)
file.close()
print 'File saved!'
Convert total_pages to an int and use range instead of your while loop:
total_pages = int(raw_input('> '))
...............
for page_number in range(2, total_pages+1):
That will fix your issue, but the loop is redundant as written: you use the same respData and address_pattern on every iteration, so you keep adding the same matches repeatedly. If you want to crawl multiple pages, you need to move the urllib code inside the for loop so each request uses the current page_number:
for page_number in range(1, total_pages):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()
    business_address = re.findall(address_pattern, respData)
    # use extend to add the data from findall
    address_list.extend(business_address)
respData is also already a string, so you don't need to call str on it. Using requests can simplify your code further:
import requests

for page_number in range(1, total_pages):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    respData = requests.get(url).content
    business_address = re.findall(address_pattern, respData)
    address_list.extend(business_address)
The main issue I see in your code, and the one causing the infinite loop, is that total_pages is defined as a string here:
total_pages = raw_input('> ')
But page_number is defined as an int.
Hence, the while loop
while page_number <= total_pages:
would not end unless some exception occurs within it, since a str always compares as larger than an int in Python 2.x.
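You can check this behaviour in a Python 2 interpreter; any int compares as smaller than any str, so the loop condition stays True forever:

>>> 1 <= '3'       # Python 2 compares mismatched types by type name,
True               # and 'int' sorts before 'str'
>>> 999999 <= '0'
True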
You would most probably need to convert the raw_input() result with int(), since you only use total_pages in the while loop's condition. Example:
total_pages = int(raw_input('> '))
I have not checked whether the rest of your logic is correct or not, but I believe the above is the reason you are getting the infinite loop.
Just to clarify from the beginning: I'm a total beginner (I wrote something in Python for the first time today). This was more about following a guide and trying to remember what I did 7 years ago when I tried learning Java than anything else.
I wanted to scrape the image tags from a website (to plot them later), but I have to stay logged in to view all images. After I got the scraping down, I noticed that some tags were blocked, so the login issue came up. I now manage to log in, but it doesn't work outside of the session itself, which makes the rest of my code useless. Can I get this to work, or do I have to give up?
This is the working login:
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
login_data = {
    'user' : 'theusername',
    'pass' : 'thepassword',
    'op' : 'Log in'
}

with requests.Session() as s:
    url = "https://thatwebsite.com/index.php?page=account&s=login&code=00"
    r = s.get(url)
    r = s.post(url, data=login_data)
And what I had working before to scrape the website but with the login missing:
filename = "taglist.txt"
f = open(filename, "w", encoding="utf-8")
headers = "tags\n"
f.write(headers)
pid = 0
actual_page = 1
while pid < 150:
    url = "https://thatwebsite.com/index.php?page=post&s=list&tags=absurdres&pid=" + str(pid)
    print(url)
    client = urlopen(url)
    page_html = client.read()
    client.close()
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "thumbnail-preview"})
    print("Current pid: " + str(pid))
    for container in containers:
        tags = container.span.a.img["title"]
        f.write(tags.replace(" ", "\n") + "\n")
    pid = pid + 42
    print("Current page: " + str(actual_page))
    actual_page += 1
print("Done.")
f.close()
Out comes a list of every tag used by high res images.
I hope I don't offend anyone with this.
Edit: The code is working now, had a cookie typo:
import requests
from bs4 import BeautifulSoup as soup
login_data = {
    'user' : 'myusername',
    'pass' : 'mypassword',
    'op' : 'Log in'
}
s = requests.Session()
print("\n\n\n\n\n")
filename = "taglist.txt"
f = open(filename, "w", encoding="utf-8")
headers = "tags\n"
f.write(headers)
pid = 0
actual_page = 1
while pid < 42:
    url2 = "https://thiswebsite.com/index.php?page=post&s=list&tags=rating:questionable&pid=" + str(pid)
    r = s.get(url2, cookies={'duid' : 'somehash', 'user_id' : 'my userid', 'pass_hash' : 'somehash'})
    page_html = str(r.content)
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "thumbnail-preview"})
    for container in containers:
        tags = container.span.a.img["title"]
        f.write(tags.replace(" ", "\n") + "\n")
    print("\nCurrent page: " + str(actual_page) + " Current pid: " + str(pid) + "\nDone.")
    actual_page += 1
    pid = pid + 42
f.close()
You use two different libraries for web requests right now: requests and urllib. I would opt for using only requests.
Also, don't use Session() as a context manager here. Context managers, i.e. the with ... as x syntax you use on the requests.Session() object, exist to do cleanup after leaving the indented block. In the context of requests, this throws away your cookies as you leave the session. (I assume login is managed by cookies on this site.)
Keep the session in a variable instead and use it for the subsequent requests: it stores your cookies from the login, and you need them for those requests.
s = requests.Session()
url = "https://thatwebsite.com/index.php?page=account&s=login&code=00"
r = s.get(url) # do you need this request?
r = s.post(url, data=login_data)
Also, make the subsequent calls inside the loop with requests:
client = s.get(url)
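Putting that together, a minimal sketch of the login plus the scraping loop in one session (the URLs, credentials and tag selectors are just the placeholders from the question):

import requests
from bs4 import BeautifulSoup as soup

login_data = {
    'user': 'theusername',
    'pass': 'thepassword',
    'op': 'Log in'
}

s = requests.Session()
login_url = "https://thatwebsite.com/index.php?page=account&s=login&code=00"
s.post(login_url, data=login_data)  # the session object keeps the login cookies

pid = 0
while pid < 150:
    url = "https://thatwebsite.com/index.php?page=post&s=list&tags=absurdres&pid=" + str(pid)
    r = s.get(url)  # same session, so the cookies from the login are sent along
    page_soup = soup(r.content, "html.parser")
    for container in page_soup.findAll("div", {"class": "thumbnail-preview"}):
        print(container.span.a.img["title"])
    pid += 42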
My code:
number = 0
while True:
    number += 1
    url = url + '?curpage={}'.format(number)
    html = urllib2.urlopen(url).read()
My issue: I have a while loop, and within the while loop I build a URL. On each iteration, I want the URL to change to:
url?curpage=1
url?curpage=2
...
What I am getting:
url?curpage=1
url?curpage=1?curpage=2
...
Any suggestions on how to resolve this issue?
Don't modify url in the loop. For example:
url = "<base url>"
number = 0
while True:
    number += 1
    html = urllib2.urlopen('{}?curpage={}'.format(url, number)).read()
url = url + ...
says to add to the end of url, making it longer with each iteration
From your expected output, you would seem to want:
html = urllib2.urlopen(url+'?curpage={}'.format(number)).read()
Also, your loop will never end.
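If you only want a fixed number of pages, a bounded sketch might look like this (the count of 10 is just a placeholder):

url = "<base url>"
for number in range(1, 11):  # stops after 10 pages instead of looping forever
    html = urllib2.urlopen('{}?curpage={}'.format(url, number)).read()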
number = 0
for i in range(10):
    url = 'http://www.example.com/'
    number = number + 1
    url = url + '?curpage={}'.format(number)
    print(url)
I want to create a piece of code that works as follows:
You feed it a URL, it looks at that webpage to count how many links there are, follows one, looks at that new webpage again, follows one link, and so on.
I have a piece of code that opens a web page, searches for links and creates a list from them:
import urllib
from bs4 import BeautifulSoup
list_links = []
page = raw_input('enter an url')
url = urllib.urlopen(page).read()
html = BeautifulSoup(url, 'html.parser')
for link in html.find_all('a'):
    link = link.get('href')
    list_links.append(link)
Next, I want user to decide which link to follow, so I have this:
link_number = len(list_links)
print 'enter a number between 0 and', (link_number)
number = raw_input('')
for number in number:
    if int(number) < 0 or int(number) > link_number:
        print "The End."
        break
    else:
        continue
url_2 = urllib.urlopen(list_links[int(number)]).read()
Here my code crashes.
Ideally, I would like to have an endless process (unless the user stops it by entering a wrong number) like this: open the page -> count the links -> choose one -> follow this link and open a new page -> count the links...
Can anybody help me?
You can try using this (sorry if it's not exactly pretty, I wrote it in a bit of a hurry):
import requests, random
from bs4 import BeautifulSoup as BS
from time import sleep
def main(url):
    content = scraping_call(url)
    if not content:
        print "Couldn't get html..."
        return
    else:
        links_list = []
        soup = BS(content, 'html5lib')
        for link in soup.findAll('a'):
            try:
                links_list.append(link['href'])
            except KeyError:
                continue
        chosen_link_index = input("Enter a number between 0 and %d: " % len(links_list))
        if not 0 < chosen_link_index <= len(links_list):
            raise ValueError('Number must be between 0 and %d: ' % len(links_list))
            # script will crash here.
            # If you want the user to try again, you can
            # set up a nr of attempts, like in scraping_call()
        else:
            # if user wants to stop the infinite loop
            next_step = raw_input('Continue or exit? (Y/N) ') or 'Y'
            # default value is 'yes' so if u want to continue,
            # just press Enter
            if next_step.lower() == 'y':
                main(links_list[chosen_link_index])
            else:
                return

def scraping_call(url):
    attempt = 1
    while attempt < 6:
        try:
            page = requests.get(url)
            if page.status_code == 200:
                result = page.content
            else:
                result = ''
        except Exception, e:
            result = ''
            print 'Failed attempt (', attempt, '):', e
            attempt += 1
            sleep(random.randint(2, 4))
            continue
        return result

if __name__ == '__main__':
    main('enter the starting URL here')
Some of the links on a given webpage can appear as relative addresses, and we need to take this into account.
This should do the trick. Works for python 3.4.
from urllib.request import urlopen
from urllib.parse import urljoin, urlsplit
from bs4 import BeautifulSoup
addr = input('enter an initial url: ')
while True:
    html = BeautifulSoup(urlopen(addr).read(), 'html.parser')
    list_links = []
    num = 0
    for link in html.find_all('a'):
        url = link.get('href')
        if not urlsplit(url).netloc:
            url = urljoin(addr, url)
        if urlsplit(url).scheme in ['http', 'https']:
            print("%d : %s " % (num, str(url)))
            list_links.append(url)
            num += 1
    idx = int(input("enter an index between 0 and %d: " % (len(list_links) - 1)))
    if not 0 <= idx < len(list_links):
        raise ValueError('Number must be between 0 and %d: ' % len(list_links))
    addr = list_links[idx]
I am new to Python and just wanted to know if this is possible: I have scraped a URL using urllib and want to edit different pages.
Example:
http://test.com/All/0.html
I want the 0.html to become 50.html and then 100.html and so on ...
found_url = 'http://test.com/All/0.html'
base_url = 'http://test.com/All/'
for page_number in range(0,1050,50):
    url_to_fetch = "{0}{1}.html".format(base_url, page_number)
That should give you URLs from 0.html to 1000.html
If you want to use urlparse (as suggested in the comments to your question):
import urlparse
found_url = 'http://test.com/All/0.html'
parsed_url = urlparse.urlparse(found_url)
path_parts = parsed_url.path.split("/")
for page_number in range(0,1050,50):
    new_path = "{0}/{1}.html".format("/".join(path_parts[:-1]), page_number)
    parsed_url = parsed_url._replace(path=new_path)
    print parsed_url.geturl()
Executing this script would give you the following:
http://test.com/All/0.html
http://test.com/All/50.html
http://test.com/All/100.html
http://test.com/All/150.html
http://test.com/All/200.html
http://test.com/All/250.html
http://test.com/All/300.html
http://test.com/All/350.html
http://test.com/All/400.html
http://test.com/All/450.html
http://test.com/All/500.html
http://test.com/All/550.html
http://test.com/All/600.html
http://test.com/All/650.html
http://test.com/All/700.html
http://test.com/All/750.html
http://test.com/All/800.html
http://test.com/All/850.html
http://test.com/All/900.html
http://test.com/All/950.html
http://test.com/All/1000.html
Instead of printing in the for loop, you can use the value of parsed_url.geturl() as you need. As mentioned, if you want to fetch the content of the pages, you can use the Python requests module in the following manner:
import requests
import urlparse

found_url = 'http://test.com/All/0.html'
parsed_url = urlparse.urlparse(found_url)
path_parts = parsed_url.path.split("/")

for page_number in range(0, 1050, 50):
    new_path = "{0}/{1}.html".format("/".join(path_parts[:-1]), page_number)
    parsed_url = parsed_url._replace(path=new_path)
    # print parsed_url.geturl()
    url = parsed_url.geturl()
    try:
        r = requests.get(url)
        if r.status_code == 200:
            with open(str(page_number) + '.html', 'w') as f:
                f.write(r.content)
    except Exception as e:
        print "Error scraping - " + url
        print e
This fetches the content from http://test.com/All/0.html through http://test.com/All/1000.html and saves the content of each URL into its own file. The file name on disk matches the file name in the URL, 0.html to 1000.html.
Depending on the performance of the site you are scraping, you might experience considerable delays running the script. If performance is important, you can consider using grequests.
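For illustration, a minimal sketch of the same download loop with grequests, which sends the requests concurrently (the base URL and page range are the ones from above; treat the exact call pattern as an assumption to check against the grequests docs):

import grequests

base_url = 'http://test.com/All/'
urls = ["{0}{1}.html".format(base_url, n) for n in range(0, 1050, 50)]

# build unsent requests, then send them concurrently; failed ones come back as None
pending = (grequests.get(u) for u in urls)
for url, response in zip(urls, grequests.map(pending, size=10)):
    if response is not None and response.status_code == 200:
        with open(url.rsplit('/', 1)[-1], 'w') as f:
            f.write(response.content)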
I'm trying to open multiple pages using urllib2. The problem is that some pages can't be opened; it returns urllib2.HTTPError: HTTP Error 400: Bad Request.
I'm getting the hrefs of these pages from another web page (whose head declares charset="utf-8").
The error is returned only when I try to open a page containing 'č', 'ž' or 'ř' in the URL.
Here is the code:
def getSoup(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    page = response.read()
    soup = BeautifulSoup(page, 'html.parser')
    return soup

hovienko = getSoup("http://www.hovno.cz/hovna-az/a/1/")
lis = hovienko.find("div", class_="span12").find('ul').findAll('li')

for liTag in lis:
    aTag = liTag.find('a')['href']
    href = "http://www.hovno.cz" + aTag  # hrefs I'm trying to open using urllib2
    soup = getSoup(href.encode("iso-8859-2"))  # here the errors occur when 'č', 'ž' or 'ř' is in the URL
Does anybody know what I have to do to avoid these errors?
Thank you.
This site is UTF-8. Why do you need href.encode("iso-8859-2")? I have taken the following code from http://programming-review.com/beautifulsoasome-interesting-python-functions/
import urllib2
import cgitb
cgitb.enable()
from BeautifulSoup import BeautifulSoup
from urlparse import urlparse
# print all links
def PrintLinks(localurl):
    data = urllib2.urlopen(localurl).read()
    print 'Encoding of fetched HTML : %s', type(data)
    soup = BeautifulSoup(data)
    parse = urlparse(localurl)
    localurl = parse[0] + "://" + parse[1]
    print "<h3>Page links statistics</h3>"
    l = soup.findAll("a", attrs={"href": True})
    print "<h4>Total links count = " + str(len(l)) + '</h4>'
    externallinks = []  # external links list
    for link in l:
        # if it's external link
        if link['href'].find("http://") == 0 and link['href'].find(localurl) == -1:
            externallinks = externallinks + [link]
    print "<h4>External links count = " + str(len(externallinks)) + '</h4>'
    if len(externallinks) > 0:
        print "<h3>External links list:</h3>"
        for link in externallinks:
            if link.text != '':
                print '<h5>' + link.text.encode('utf-8')
                print ' => [' + '<a href="' + link['href'] + '" >' + link['href'] + '</a>' + ']' + '</h5>'
            else:
                print '<h5>' + '[image]',
                print ' => [' + '<a href="' + link['href'] + '" >' + link['href'] + '</a>' + ']' + '</h5>'

PrintLinks("http://www.zlatestranky.cz/pro-mobily/")
The solution was very simple. I should have used urllib2.quote().
EDITED CODE:
for liTag in lis:
    aTag = liTag.find('a')['href']
    href = "http://www.hovno.cz" + urllib2.quote(aTag.encode("utf-8"))
    soup = getSoup(href)
A couple of things here.
First, your URIs can't contain non-ASCII characters. You have to escape them. See this:
How to fetch a non-ascii url with Python urlopen?
Secondly, save yourself a world of pain and use requests for HTTP stuff.
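For example, a minimal sketch of the same scrape using requests, which generally takes care of percent-encoding the non-ASCII path characters for you (the URL and selectors are copied from the question, so treat them as placeholders):

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    response = requests.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')

listing = get_soup("http://www.hovno.cz/hovna-az/a/1/")
lis = listing.find("div", class_="span12").find('ul').findAll('li')

for liTag in lis:
    href = "http://www.hovno.cz" + liTag.find('a')['href']
    soup = get_soup(href)  # works even when the path contains 'č', 'ž' or 'ř'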