My code:
number = 0
while True:
    number += 1
    url = url + '?curpage={}'.format(number)
    html = urllib2.urlopen(url).read()
My issue: I have a while loop and, within it, a URL. On each iteration I want the URL to change to:
url?curpage=1
url?curpage=2
...
What I am getting:
url?curpage=1
url?curpage=1?curpage=2
...
Any suggestions on how to resolve this issue?
Don't modify url in the loop. For example:
url = "<base url>"
number = 0
while True:
number +=1
html = urllib2.urlopen('{}?curpage={}'.format(url, number)).read()
url = url + ...
tells Python to append to the end of url, making it longer with each iteration.
From your expected output, you would seem to want:
html = urllib2.urlopen(url+'?curpage={}'.format(number)).read()
Also, your loop will never end.
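A minimal sketch of a bounded version, assuming the loop should stop once a page comes back empty (the stopping condition is illustrative):
import urllib2

number = 0
while True:
    number += 1
    html = urllib2.urlopen(url + '?curpage={}'.format(number)).read()
    if not html:  # illustrative stop condition: an empty page ends the loop
        break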
number = 0
for i in range(10):
    url = 'http://www.example.com/'  # rebuilt from the base string each pass, so it never accumulates
    number = number + 1
    url = url + '?curpage={}'.format(number)
    print(url)
When I use the snippet below, it does not return the values of Page, Total pages, or data.
It also does not return the value of the function "getMovieTitles".
import request
import json

def getMovieTitles(substr):
    titles = []
    url = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}'.format(substr)"
    data = requests.get(url)
    print(data)
    response = json.loads(data.content.decode('utf-8'))
    print(data.content)
    for page in range(0, response['total_pages']):
        page_response = requests.get("https://jsonmock.hackerrank.com/api/movies/search/?Title={}}&page={}".format(substr, page + 1))
        page_content = json.loads(page_response.content.decode('utf-8'))
        print('page_content', page_content, 'type(page_content)', type(page_content))
        for item in range(0, len(page_content['data'])):
            titles.append(str(page_content['data'][item]['Title']))
    titles.sort()
    return titles

print(getMovieTitles('Superman'))
You're not formatting the url string correctly.
url = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}'.format(substr)"
format() is a method of str, and you've put the call inside the url string itself. Instead, do:
url = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}".format(substr)
First, fix the import (the module is named requests, and your code calls requests.get):
import requests
The problem is in your string formatting: you used a ' where the closing " should be, so the .format() call ends up inside the string. It should be:
url = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}".format(substr)
And there is one } too many:
page_response = requests.get("https://jsonmock.hackerrank.com/api/movies/search/?Title={}&page={}".format(substr, page + 1))
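Putting both fixes together, a minimal corrected sketch of the function (assuming the response shape shown in the question, with a total_pages field and a data list of records with a Title key):
import requests

def getMovieTitles(substr):
    titles = []
    base = "https://jsonmock.hackerrank.com/api/movies/search/?Title={}&page={}"
    # The first request tells us how many pages there are
    first = requests.get(base.format(substr, 1)).json()
    for page in range(1, first['total_pages'] + 1):
        page_content = requests.get(base.format(substr, page)).json()
        for item in page_content['data']:
            titles.append(item['Title'])
    titles.sort()
    return titles

print(getMovieTitles('Superman'))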
Use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name you find.
This is the HTML link for the data: http://py4e-data.dr-chuck.net/known_by_Caragh.html
So I have to find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
Can someone explain to me, line by line, how these two loops ("while" and "for") work?
So when I enter position 18, does it extract the 18th href tag, then the next 18th, and so on 7 times? Because even if I enter a different number, I'm still getting the same answer. Thank you so much in advance.
Code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
n = 0
count = 0
url = input("Enter URL:")
numbers = input("Enter count:")
position = input("Enter position:")
while n < 7:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    for tag in tags:
        count = count + 1
        if count == 18:
            url = tag.get('href', None)
            print("Retrieving:", url)
            count = 0
            break
            n = n + 1
Because even if I enter a different number, I'm still getting the same answer.
You're getting the same answer because you've hard coded that in with:
while n < 7
and
if count == 18
I think you meant to have those as your variables/inputs. With that, you'll also need those inputs as int, as currently they get stored as str. Also note: I didn't want to type in the URL each time, so I hard-coded it, but you can uncomment your input there and comment out the url = 'http://py4e-data.dr-chuck.net/known_by_Caragh.html' line.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

n = 0
count = 0
url = 'http://py4e-data.dr-chuck.net/known_by_Caragh.html'
#url = input("Enter URL:")
numbers = int(input("Enter count:"))
position = int(input("Enter position:"))
while n < numbers:  # <----- there's your variable for how many times to try
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    for tag in tags:
        count = count + 1
        if count == position:  # <------- and the variable to get the position
            url = tag.get('href', None)
            print("Retrieving:", url)
            count = 0
            break
    n = n + 1  # <---- I fixed your indentation: the way it was previously, n would never increment, so you could never get out of the while loop
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list")
c = r.content
soup = BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class": "col _2-gKeQ"})
page_nr = soup.find_all("a", {"class": "_33m_Yg"})[-1].text
print(page_nr, "number of pages were found")
#all[0].find("div",{"class":"_1vC4OE _2rQ-NK"}).text

l = []
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list"
for page in range(0, int(page_nr)*10, 10):
    print()
    r = requests.get(base_url + str(page) + ".html")
    c = r.content
    #c=r.json()["list"]
    soup = BeautifulSoup(c, "html.parser")
    for item in all:
        d = {}
        #price
        d["Price"] = item.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
        #Name
        d["Name"] = item.find("div", {"class": "_3wU53n"}).text
        for li in item.find_all("li", {"class": "_1ZRRx1"}):
            if " EMI" in li.text:
                d["EMI"] = li.text
            else:
                d["EMI"] = None
        for li1 in item.find_all("li", {"class": "_1ZRRx1"}):
            if "Special " in li1.text:
                d["Special Price"] = li1.text
            else:
                d["Special Price"] = None
        for val in item.find_all("li", {"class": "tVe95H"}):
            if "Display" in val.text:
                d["Display"] = val.text
            elif "Warranty" in val.text:
                d["Warrenty"] = val.text
            elif "RAM" in val.text:
                d["Ram"] = val.text
        l.append(d)

import pandas
df = pandas.DataFrame(l)
This might work for standard pagination:
import requests

i = 1
items_parsed = set()
loop = True
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page={}&q=laptop&sid=6bo%2Fb5g&viewType=list"
while True:
    page = requests.get(base_url.format(i))
    items = ...  # extract #yourelements# from the page here
    if not items:
        break
    for item in items:
        # Scrape your item; once the scrape succeeds, record the URL of the
        # parsed item in url_parsed (details below the code), for example:
        url_parsed = your_stuff(item)
        if url_parsed in items_parsed:
            loop = False
        items_parsed.add(url_parsed)
    if not loop:
        break
    i += 1
I formatted your URL so that ?page=X is filled in with base_url.format(i); the loop can then iterate until no items are found on a page, OR, on some sites, until you land back on page 1 after reaching max_page + 1.
If, above the maximum page, you get items you already parsed on the first page, you can declare a set(), put the URL of every item you parse into it, and then check whether you have already parsed each item.
Note that this is just an idea.
Since the page number in the URL is almost in the middle, I'd apply a similar change to your code (note that page_nr is a string, so convert it, and there is no ".html" to append to a query string):
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page="
end_url = "&q=laptop&sid=6bo%2Fb5g&viewType=list"

for page in range(1, int(page_nr) + 1):
    r = requests.get(base_url + str(page) + end_url)
You only have access to the first 10 pages from the initial URL.
You can make a loop from "&page=1" to "&page=26".
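A minimal sketch of that loop, assuming the same query-string layout (the page range and the parsing step are illustrative):
import requests
from bs4 import BeautifulSoup

base_url = ("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on"
            "&otracker=start&page={}&q=laptop&sid=6bo%2Fb5g&viewType=list")

for page in range(1, 27):  # "&page=1" through "&page=26"
    r = requests.get(base_url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    # parse this page's items here, e.g. soup.find_all("div", {"class": "col _2-gKeQ"})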
I need help turning a for loop into a while loop that only prints/logs the differences/changes in an XML file.
This is the current code I have thus far.
import requests
from bs4 import BeautifulSoup

url = "https://www.ruvilla.com/media/sitemaps/sitemap.xml"
r = requests.get(url)
soup = BeautifulSoup(r.content)
for url in soup.find_all("url"):
    titlenode = url.find("loc")
    if titlenode:
        title = titlenode.text
        loc = url.find("loc").text
        lastmod = url.find("lastmod").text
        print title + "\n" + lastmod
For your current use case, a for loop works best. However, if you really want to make it into a while loop, you can do that like so:
urls = soup.find_all("url")
counter = 0
while counter < len(urls):
    url = urls[counter]
    #Your code here
    counter += 1  # increment at the end so urls[0] isn't skipped
If I understood your question properly, you are trying to log only the URLs that have a lastmod attribute. For this case a for loop works better than a while loop, because it automatically ends the iteration when the end of the list is reached; with a while loop you have to handle that explicitly with a check like i < len(urls). You can consider the below:
import time

while True:  # loop infinitely, re-fetching the sitemap each pass
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    for entry in soup.find_all('url'):  # renamed from url so the request URL isn't clobbered
        try:
            lastmod = entry.find("lastmod").text
        except AttributeError:
            continue  # no <lastmod> tag: skip this entry
        loc = entry.find("loc").text
        titlenode = entry.find("loc")
        if titlenode:
            title = titlenode.text
    time.sleep(1)
The try-except block ensures that if lastmod exists its details are used; otherwise the entry is simply skipped and the loop moves on to the next URL. Hope this helps. Cheers.
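Neither version actually logs only the changes yet; a minimal sketch of that idea (the seen dict and the polling interval are illustrative) would remember the last lastmod per loc and print an entry only when it differs:
import time
import requests
from bs4 import BeautifulSoup

sitemap = "https://www.ruvilla.com/media/sitemaps/sitemap.xml"
seen = {}  # loc -> last lastmod we printed

while True:
    soup = BeautifulSoup(requests.get(sitemap).content, "html.parser")
    for entry in soup.find_all("url"):
        loc_node = entry.find("loc")
        lastmod_node = entry.find("lastmod")
        if not (loc_node and lastmod_node):
            continue
        loc, lastmod = loc_node.text, lastmod_node.text
        if seen.get(loc) != lastmod:  # new or changed entry
            print(loc + "\n" + lastmod)
            seen[loc] = lastmod
    time.sleep(60)  # illustrative polling interval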
I'm trying to extract data from the BBB but I get no response. I don't get any error messages, just a blinking cursor. Is my regex the issue? Also, if you see anything I can improve in terms of efficiency or coding style, I am open to your advice!
Here is the code:
import urllib2
import re

print "Enter an industry keyword."
print "Example: florists, construction, tiles"
keyword = raw_input('> ')

print "How many pages to dig through BBB?"
total_pages = raw_input('> ')

print "Working..."

page_number = 1
address_list = []

url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
req = urllib2.Request(url)
req.add_header('User-agent', 'Mozilla/5.0')
resp = urllib2.urlopen(req)
respData = resp.read()

address_pattern = r'<address>(.*?)<\/address>'

while page_number <= total_pages:
    business_address = re.findall(address_pattern, str(respData))
    for each in business_address:
        address_list.append(each)
    page_number += 1

for each in address_list:
    print each

print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')
file = open('export.txt','w')
for each in address_list:
    file.write('%r \n' % each)
file.close()
print 'File saved!'
EDITED, but still don't get any results:
import urllib2
import re

print "Enter an industry keyword."
print "Example: florists, construction, tiles"
keyword = raw_input('> ')

print "How many pages to dig through BBB?"
total_pages = int(raw_input('> '))

print "Working..."

page_number = 1
address_list = []

for page_number in range(1, total_pages):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()
    address_pattern = r'<address>(.*?)<\/address>'
    business_address = re.findall(address_pattern, respData)
    address_list.extend(business_address)

for each in address_list:
    print each

print "\n Save to text file? Hit ENTER if so.\n"
raw_input('>')
file = open('export.txt','w')
for each in address_list:
    file.write('%r \n' % each)
file.close()
print 'File saved!'
Convert total_pages using int and use range instead of your while loop:
total_pages = int(raw_input('> '))
...............
for page_number in range(2, total_pages+1):
That will fix your issue, but the loop is redundant: you use the same respData and address_pattern in the loop, so you keep adding the same thing repeatedly. If you want to crawl multiple pages, you need to move the urllib code inside the for loop so you crawl using each page_number:
for page_number in range(1, total_pages):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla/5.0')
    resp = urllib2.urlopen(req)
    respData = resp.read()
    business_address = re.findall(address_pattern, respData)
    # use extend to add the data from findall
    address_list.extend(business_address)
respData is already a string, so you don't need to call str on it. Using requests can simplify your code further:
import requests

for page_number in range(1, total_pages):
    url = 'https://www.bbb.org/search/?type=category&input=' + keyword + '&filter=business&page=' + str(page_number)
    respData = requests.get(url).content
    business_address = re.findall(address_pattern, respData)
    address_list.extend(business_address)
The main issue I see in your code, and what is causing the infinite loop, is that total_pages is defined as a string:
total_pages = raw_input('> ')
but page_number is defined as an int.
Hence, the while loop:
while page_number <= total_pages:
would never end unless some exception occurred inside it, since a str always compares greater than an int in Python 2.x.
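A quick demonstration in a Python 2 interpreter (mismatched types compare by type name, and 'int' < 'str'):
>>> page_number = 11
>>> total_pages = '10'          # raw_input() returns a str
>>> page_number <= total_pages  # always True in Python 2
True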
You would most probably want to convert the raw_input() with int(), since you only use total_pages in the while loop's condition. Example:
total_pages = int(raw_input('> '))
I have not checked whether the rest of your logic is correct or not, but I believe the above is the reason you are getting the infinite loop.