I am trying to do web scraping and I am getting this error; can someone help with this issue?
scripts = soup.find_all('script')
for script in scripts:
    if 'preloadedData' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('=', 1)[1].strip()
        jsonStr = jsonStr.rsplit(";", 1)[0]
        jsonObj = json.loads(jsonStr)
        print('%s\nHeadlines\n%s\n' % (url, now))
        count = 1
        for ele, v in jsonObj['initialState'].itmes():
            try:
                if v['headline'] and v['__typename'] == "PromotionalProperties":
                    print('Headline %s: %s' % (count, v['headline']))
                    count += 1
            except:
                continue
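Assuming the error comes from the itmes() call (a plain dict would raise AttributeError: 'dict' object has no attribute 'itmes'), a sketch of the corrected inner loop, reusing the names from the snippet above, would be:

for ele, v in jsonObj['initialState'].items():  # items(), not itmes()
    try:
        if v['headline'] and v['__typename'] == "PromotionalProperties":
            print('Headline %s: %s' % (count, v['headline']))
            count += 1
    except (KeyError, TypeError):  # catching specific exceptions instead of a bare except
        continue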
I'm scraping data from the World Bank for a paper and I'm trying to build a loop that scrapes several different indicators, but I can't get it to work past a certain part of the code. I hope someone can help.
#Single Code for each indicator
indicator = 'SP.POP.TOTL?date=2000:2020'
url = "http://api.worldbank.org/v2/countries/all/indicators/%s&format=json&per_page=5000" % indicator
response = requests.get(url)
print(response)
result = response.content
result = json.loads(result)
pop_total_df = pd.DataFrame.from_dict(result[1])
This is the loop I'm trying to build, but I get an error in the last part of the code below:
#indicator list
indicator = {'FP.CPI.TOTL.ZG?date=2000:2020', 'SP.POP.TOTL?date=2000:2020'}

#list of urls with the indicators
url_list = []
for i in indicator:
    url = "http://api.worldbank.org/v2/countries/all/indicators/%s&format=json&per_page=5000" % i
    url_list.append(url)

result_list = []
for i in url_list:
    response = requests.get(i)
    print(response)
    result_list.append(response.content)

#Erroneous code
result_json = []
for i in range(3):
    result_json.append(json.loads(result_list[i]))
As you are making 2 requests (FP.CPI.TOTL.ZG?date=2000:2020 and SP.POP.TOTL?date=2000:2020), your result_list has length 2, so its valid indices are 0 and 1. Use range(2) or range(len(result_list)) instead:
import requests, json

#indicator list
indicator = {'FP.CPI.TOTL.ZG?date=2000:2020', 'SP.POP.TOTL?date=2000:2020'}

#list of urls with the indicators
url_list = []
for i in indicator:
    url = "http://api.worldbank.org/v2/countries/all/indicators/%s&format=json&per_page=5000" % i
    url_list.append(url)

result_list = []
for i in url_list:
    response = requests.get(i)
    print(response)
    result_list.append(response.content)

#Fixed code
result_json = []
for i in range(len(result_list)):
    result_json.append(json.loads(result_list[i]))
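Since each element of result_list exists only to be fed to json.loads, the Response object's json() method can do the decoding in one step; a slightly shorter sketch under the same assumptions:

result_json = []
for i in url_list:
    response = requests.get(i)
    print(response)
    result_json.append(response.json())  # Response.json() decodes the JSON body directly

From there, pd.DataFrame.from_dict(result_json[0][1]) should behave just like the single-indicator version above.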
Just to clarify from the beginning: I'm a total beginner (I wrote something in Python for the first time today). This was mostly me following a guide and trying to remember what I did seven years ago when I tried learning Java.
I wanted to scrape the image tags from a website (to plot them later), but I have to stay logged in to see all images. After I got the scraping working I noticed that some tags were blocked, which is where the login issue came up. I have now managed to log in, but it doesn't work outside of the session itself, which makes the rest of my code useless. Can I get this to work, or do I have to give up?
This is the working login:
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
login_data = {
    'user': 'theusername',
    'pass': 'thepassword',
    'op': 'Log in'
}

with requests.Session() as s:
    url = "https://thatwebsite.com/index.php?page=account&s=login&code=00"
    r = s.get(url)
    r = s.post(url, data=login_data)
And this is what I had working before to scrape the website, but without the login:
filename = "taglist.txt"
f = open(filename, "w", encoding="utf-8")
headers = "tags\n"
f.write(headers)
pid = 0
actual_page = 1
while pid < 150:
url = "https://thatwebsite.com/index.php?page=post&s=list&tags=absurdres&pid=" + str(pid)
print(url)
client = urlopen(url)
page_html = client.read()
client.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div",{"class":"thumbnail-preview"})
print("Current pid: " + str(pid))
for container in containers:
tags = container.span.a.img["title"]
f.write(tags.replace(" ", "\n") + "\n")
pid = pid + 42
print("Current page: " + str(actual_page))
actual_page += 1
print("Done.")
f.close()
Out comes a list of every tag used by high res images.
I hope I don't offend anyone with this.
Edit: The code is working now, had a cookie typo:
import requests
from bs4 import BeautifulSoup as soup
login_data = {
    'user': 'myusername',
    'pass': 'mypassword',
    'op': 'Log in'
}

s = requests.Session()

print("\n\n\n\n\n")

filename = "taglist.txt"
f = open(filename, "w", encoding="utf-8")

headers = "tags\n"
f.write(headers)

pid = 0
actual_page = 1

while pid < 42:
    url2 = "https://thiswebsite.com/index.php?page=post&s=list&tags=rating:questionable&pid=" + str(pid)
    r = s.get(url2, cookies={'duid': 'somehash', 'user_id': 'my userid', 'pass_hash': 'somehash'})
    page_html = str(r.content)
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "thumbnail-preview"})
    for container in containers:
        tags = container.span.a.img["title"]
        f.write(tags.replace(" ", "\n") + "\n")
    print("\nCurrent page: " + str(actual_page) + " Current pid: " + str(pid) + "\nDone.")
    actual_page += 1
    pid = pid + 42

f.close()
You are currently using two different libraries for making web requests, requests and urllib. I would opt for using only requests.
Also, don't use Session() as a context manager here. Context managers are used to do some cleanup after leaving the indented block; that is the with ... as x syntax you use on the requests.Session() object. In the context of requests this means your cookies are dropped as soon as you leave the session. (I assume login is managed by cookies on this site.)
Keep the session in a plain variable instead and use it for the subsequent requests; it stores the cookies you get at login, and you need them for every later request.
s = requests.Session()
url = "https://thatwebsite.com/index.php?page=account&s=login&code=00"
r = s.get(url) # do you need this request?
r = s.post(url, data=login_data)
Also make the subsequent call in the loop with requests:
client = s.get(url)
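Putting these pieces together, the original scraping loop could reuse the same session so the login cookies are sent with every request. This is only a rough sketch, reusing login_data, the soup alias, and the open file f from the snippets above; whether the single post is enough depends on how the site handles its login:

s = requests.Session()
url = "https://thatwebsite.com/index.php?page=account&s=login&code=00"
r = s.get(url)
r = s.post(url, data=login_data)  # log in once; the session keeps the cookies

pid = 0
while pid < 150:
    url = "https://thatwebsite.com/index.php?page=post&s=list&tags=absurdres&pid=" + str(pid)
    r = s.get(url)  # same session, so the login cookies are sent along
    page_soup = soup(r.content, "html.parser")
    containers = page_soup.findAll("div", {"class": "thumbnail-preview"})
    for container in containers:
        tags = container.span.a.img["title"]
        f.write(tags.replace(" ", "\n") + "\n")
    pid = pid + 42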
I am trying to make a sitemap for the following website:
http://aogweb.state.ak.us/WebLink/0/fol/12497/Row1.aspx
The code first determines how many pages are on the top directory level, then stores each page number and its corresponding link. It then goes through each page and creates a dictionary that contains each 3-digit file value and the corresponding link for that value. From there the code creates another dictionary of the pages and links for each 3-digit directory (this is the point at which I am stuck). Once this is complete, the goal is to create a dictionary that contains each 6-digit file number and its corresponding link.
However, the code randomly fails at certain points throughout the scraping process and gives the following error message:
Traceback (most recent call last):
  File "C:\Scraping_Test.py", line 76, in <module>
    totalPages = totalPages.text
AttributeError: 'NoneType' object has no attribute 'text'
Sometimes the code does not even run and automatically skips to the end of the program without any errors.
I am currently running Python 3.6.0 with up-to-date libraries in Visual Studio Community 2015. Any help will be appreciated, as I am new to programming.
import bs4 as bs
import requests
import re
import time


def stop():
    print('sleep 5 sec')
    time.sleep(5)


url0 = 'http://aogweb.state.ak.us'
url1 = 'http://aogweb.state.ak.us/WebLink/'

r = requests.get('http://aogweb.state.ak.us/WebLink/0/fol/12497/Row1.aspx')
soup = bs.BeautifulSoup(r.content, 'lxml')
print('Status: ' + str(r.status_code))
stop()

pagesTopDic = {}
pagesTopDic['1'] = '/WebLink/0/fol/12497/Row1.aspx'
dig3Dic = {}

for link in soup.find_all('a'):  #find top pages
    if not link.get('title') is None:
        if 'page' in link.get('title').lower():
            page = link.get('title')
            page = page.split(' ')[1]
            #print(page)
            pagesTopDic[page] = link.get('href')

listKeys = pagesTopDic.keys()
for page in listKeys:  #on each page find urls for beginning 3 digits
    url = url0 + pagesTopDic[page]
    r = requests.get(url)
    soup = bs.BeautifulSoup(r.content, 'lxml')
    print('Status: ' + str(r.status_code))
    stop()

    for link in soup.find_all('a'):
        if not link.get("aria-label") is None:
            folder = link.get("aria-label")
            folder = folder.split(' ')[0]
            dig3Dic[folder] = link.get('href')

listKeys = dig3Dic.keys()
pages3Dic = {}
for entry in listKeys:  #pages for each three digit num
    print(entry)
    url = url1 + dig3Dic[entry]
    r = requests.get(url)
    soup = bs.BeautifulSoup(r.content, 'lxml')
    print('Status: ' + str(r.status_code))
    stop()

    tmpDic = {}
    tmpDic['1'] = '/Weblink/' + dig3Dic[entry]

    totalPages = soup.find('div', {"class": "PageXofY"})
    print(totalPages)
    totalPages = totalPages.text
    print(totalPages)
    totalPages = totalPages.split(' ')[3]
    print(totalPages)

    while len(tmpDic.keys()) < int(totalPages):
        r = requests.get(url)
        soup = bs.BeautifulSoup(r.content, 'lxml')
        print('Status: ' + str(r.status_code))
        stop()

        for link in soup.find_all('a'):  #find top pages
            if not link.get('title') is None:
                #print(link.get('title'))
                if 'Page' in link.get('title'):
                    page = link.get('title')
                    page = page.split(' ')[1]
                    tmpDic[page] = link.get('href')

        num = len(tmpDic.keys())
        url = url0 + tmpDic[str(num)]
        print()

    pages3Dic[entry] = tmpDic
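The traceback shows soup.find('div', {"class": "PageXofY"}) returning None, so one defensive option is to retry the lookup a few times and skip the entry if the div never appears. This is only a sketch (the retry count of 3 is arbitrary), meant as a drop-in replacement for the PageXofY lookup inside the for entry in listKeys: loop above:

totalPages = None
for attempt in range(3):  # arbitrary retry count
    page_div = soup.find('div', {"class": "PageXofY"})
    if page_div is not None:
        totalPages = page_div.text.split(' ')[3]
        break
    stop()  # wait, then fetch the page again before retrying
    r = requests.get(url)
    soup = bs.BeautifulSoup(r.content, 'lxml')

if totalPages is None:
    print('No PageXofY div found for %s, skipping' % entry)
    continue  # still inside the for-entry loop, so move on to the next entry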
I'm trying to scrape some URLs with BeautifulSoup. The URLs I'm scraping come from a Google Analytics API call, and some of them aren't working properly, so I need a way to skip them.
I tried to add this:
except urllib2.HTTPError:
    continue
But I got the following syntax error:
except urllib2.HTTPError:
^
SyntaxError: invalid syntax
Here is my full code:
rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

urllist = [mystring + x for x in rawdata]

for row in urllist:
    # query the website and return the html to the variable 'page'
    page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue

    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() is used to remove starting and trailing
    # save the data in tuple
    sharelist.append((row, share))

print(sharelist)
Your except statement is not preceded by a try statement. You should use the following pattern:
try:
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue
Also note the indentation levels. Code executed under the try clause must be indented, as well as the except clause.
Two errors:
1. No try statement
2. No indentation
Use this:
for row in urllist:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue
If you just want to catch a 404, you need to check the returned code or re-raise the error; otherwise you will catch and ignore more than just the 404:
import urllib2
from bs4 import BeautifulSoup
from urlparse import urljoin


def print_results(results):
    base = 'http://www.konbini.com'
    rawdata = []
    sharelist = []
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'
    # use urljoin to join to the base url
    urllist = [urljoin(base, h) for h in rawdata]
    for url in urllist:
        # query the website and return the html to the variable 'page'
        try:  # need to open with try
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404:  # check the return code
                continue
            raise  # if other than 404, raise the error
        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() is used to remove starting and trailing
        # save the data in tuple
        sharelist.append((url, share))
    print(sharelist)
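For reference, urljoin handles both relative paths and already-absolute URLs, which is why it is used above instead of plain string concatenation; a quick Python 2 illustration:

from urlparse import urljoin  # Python 2; in Python 3 this lives in urllib.parse

print urljoin('http://www.konbini.com', '/fr/article')            # http://www.konbini.com/fr/article
print urljoin('http://www.konbini.com', 'http://other.com/page')  # http://other.com/page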
As already mentioned by others,
the try statement is missing
proper indentation is missing.
You should use an IDE or an editor so that you won't face such problems. Some good IDEs and editors are
IDE - Eclipse with the PyDev plugin
Editor - Visual Studio Code
Anyway, here is the code with the try added and the indentation fixed:
rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'

def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'

urllist = [mystring + x for x in rawdata]

for row in urllist:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue

    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
        continue
    share = name_box.text.strip()  # strip() is used to remove starting and trailing
    # save the data in tuple
    sharelist.append((row, share))

print(sharelist)
Your syntax error is due to the fact that you're missing a try to go with your except statement.
try:
    # code that might throw HTTPError
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue
I'm trying to extract data from BBB but I get no response. I don't get any error messages, just a blinking cursor. Is it my regex that is the issue? Also, if you see anything that I can improve on in terms of efficiency or coding style, I
am open to your advice!
Here is the code:
import urllib2
import re

print "Enter an industry keyword."
print "Example: florists, construction, tiles"
keyword = raw_input('> ')

print "How many pages to dig through BBB?"
total_pages = raw_input('> ')

print "Working..."

page_number = 1
address_list = []

url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
req = urllib2.Request(url)
req.add_header('User-agent', 'Mozilla/5.0')
resp = urllib2.urlopen(req)
respData = resp.read()

address_pattern = r'<address>(.*?)<\/address>'

while page_number <= total_pages:
    business_address = re.findall(address_pattern, str(respData))
    for each in business_address:
        address_list.append(each)
    page_number += 1

for each in address_list:
    print each

print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')
file = open('export.txt', 'w')
for each in address_list:
    file.write('%r \n' % each)
file.close()
print 'File saved!'
EDITED, but I still don't get any results:
import urllib2
import re

print "Enter an industry keyword."
print "Example: florists, construction, tiles"
keyword = raw_input('> ')

print "How many pages to dig through BBB?"
total_pages = int(raw_input('> '))

print "Working..."

page_number = 1
address_list = []

for page_number in range(1, total_pages):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()

    address_pattern = r'<address>(.*?)<\/address>'
    business_address = re.findall(address_pattern, respData)
    address_list.extend(business_address)

for each in address_list:
    print each

print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')
file = open('export.txt', 'w')
for each in address_list:
    file.write('%r \n' % each)
file.close()
print 'File saved!'
Convert total_pages using int and use range instead of your while loop:
total_pages = int(raw_input('> '))
...............
for page_number in range(2, total_pages+1):
That will fix your issue, but the loop is redundant: you use the same respData and address_pattern on every iteration, so you keep adding the same thing repeatedly. If you want to crawl multiple pages, you need to move the urllib code inside the for loop so you crawl each page_number:
for page_number in range(1, total_pages + 1):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()
    business_address = re.findall(address_pattern, respData)
    # use extend to add the data from findall
    address_list.extend(business_address)
respData is already a string, so you don't need to call str on it. Using requests can also simplify your code further:
import requests

for page_number in range(1, total_pages + 1):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    respData = requests.get(url).content
    business_address = re.findall(address_pattern, respData)
    address_list.extend(business_address)
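As a side note: the <address> blocks could also be extracted with BeautifulSoup instead of a regex. A sketch only, assuming the BBB results page really wraps each address in an <address> tag, and reusing url and address_list from above:

from bs4 import BeautifulSoup

page_soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for address in page_soup.find_all('address'):
    address_list.append(address.get_text(strip=True))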
The main issue I see in your code, and the one causing the infinite loop, is that total_pages is defined as a string in the line
total_pages = raw_input('> ')
But page_number is defined as an int.
Hence, the while loop
while page_number <= total_pages:
would not end unless some exception occurred inside it, since a str always compares greater than an int in Python 2.x.
You most probably just need to convert the raw_input() result with int(), since total_pages is only used in the while loop's condition. Example:
total_pages = int(raw_input('> '))
I have not checked whether the rest of your logic is correct or not, but I believe the above is the reason you are getting the infinite loop.
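To illustrate the Python 2 comparison behaviour described above (in Python 3 the same comparison raises a TypeError instead):

# Python 2.x
total_pages = '5'  # what raw_input('> ') returns when the user types 5
page_number = 1
print page_number <= total_pages       # True
print 999999999 <= total_pages         # still True: any int compares smaller than any str
print page_number <= int(total_pages)  # True, and becomes False once page_number passes 5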