I need help turning a for loop into a while loop that only prints/logs differences/changes in an XML sitemap.
This is the current code I have so far.
import requests
from bs4 import BeautifulSoup

url = "https://www.ruvilla.com/media/sitemaps/sitemap.xml"
r = requests.get(url)
soup = BeautifulSoup(r.content)
for url in soup.find_all("url"):
    titlenode = url.find("loc")
    if titlenode:
        title = titlenode.text
        loc = url.find("loc").text
        lastmod = url.find("lastmod").text
        print title + "\n" + lastmod
For your current use case, a for loop works best. However, if you really want to make it into a while loop, you can do that like so:
urls = soup.find_all("url")
counter = 0
while counter < len(urls):
    url = urls[counter]
    # Your code here
    counter += 1  # increment after using the index so urls[0] is not skipped
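If you'd rather not manage an index at all, the while loop can also be driven by an iterator. A small sketch, reusing the soup from above:
urls_iter = iter(soup.find_all("url"))
while True:
    try:
        url = next(urls_iter)  # pull the next <url> tag
    except StopIteration:
        break  # iterator exhausted, leave the loop
    # Your code here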
If I understood your question properly, you are trying to log only the urls which have a lastmod element. For this case a for loop works better than a while loop, because it ends the iteration automatically when the end of the list is reached, whereas with a while loop you have to handle that explicitly with a check like i < len(urls). You can consider the below:
import time

while True:  # loop forever, re-fetching the sitemap each pass
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    for entry in soup.find_all('url'):
        lastmodnode = entry.find("lastmod")
        if not lastmodnode:
            continue  # no lastmod, skip this entry
        lastmod = lastmodnode.text
        loc = entry.find("loc").text
        titlenode = entry.find("loc")
        if titlenode:
            title = titlenode.text
            print title + "\n" + lastmod
    time.sleep(1)
The lastmod check ensures that the details are printed only when lastmod exists; otherwise the entry is just ignored and we go on to the next URL. The loop variable is named entry rather than url so it does not overwrite the sitemap url between passes. Hope this helps. Cheers.
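PS: since the stated goal was to log only the differences/changes, here is a minimal sketch of one way to do that: remember the last lastmod seen for each loc in a dict, and print only when an entry is new or has changed. The 60-second poll interval and the variable names are illustrative, not prescriptive:
import time
import requests
from bs4 import BeautifulSoup

sitemap = "https://www.ruvilla.com/media/sitemaps/sitemap.xml"
seen = {}  # maps loc -> the lastmod value we last printed
while True:
    soup = BeautifulSoup(requests.get(sitemap).content, "html.parser")
    for entry in soup.find_all("url"):
        locnode = entry.find("loc")
        modnode = entry.find("lastmod")
        if not (locnode and modnode):
            continue  # skip entries missing either tag
        loc, lastmod = locnode.text, modnode.text
        if seen.get(loc) != lastmod:  # new URL, or its lastmod changed
            print(loc + " " + lastmod)
            seen[loc] = lastmod
    time.sleep(60)  # poll interval, tune to taste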
I am working on a web scraper at the moment. Right now I have it grabbing a list of urls. I need it to feed each url in that list, one at a time, into the soup function, to get my desired html output from each individual page.
Example:
my_list = ['www.google1213.com', 'www.yahoo123.com', 'www.apples123.com']

def main():
    url = input('URL: ')  # List goes here
    currentDT = datetime.datetime.now()
    scraper = cfscrape.create_scraper()
    response = scraper.get(url).content
    soup = BeautifulSoup(response, "lxml")
    #etc...#

while True:
    main()
If anyone can help me get my list to send its contents so I scrape each url one at a time, I would be very grateful!
import cfscrape
from datetime import datetime
from bs4 import BeautifulSoup

def main():
    for url in my_list:
        currentDT = datetime.now()
        scraper = cfscrape.create_scraper()
        response = scraper.get(url).content
        soup = BeautifulSoup(response, "lxml")
You can use a simple for loop:
for url in my_list:
    print(url)
    # do your scraping stuff...
PS: maybe you should also limit your requests per second, otherwise some websites will block you after a few tries.
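A crude way to do that is to sleep between requests; the 1-second delay below is arbitrary, and my_list is reused from the question:
import time

for url in my_list:
    print(url)
    # do your scraping stuff...
    time.sleep(1)  # at most ~1 request per second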
My problem is related to this answer.
I have the following code:
import urllib.request
from bs4 import BeautifulSoup

time = 0
html = urllib.request.urlopen("https://www.kramerav.com/de/Product/VM-2N").read()
html2 = urllib.request.urlopen("https://www.kramerav.com/de/Product/SDIA-IN2-F16").read()

try:
    div = str(BeautifulSoup(html).select("div.large-image")[0])
    if(str(BeautifulSoup(html).select("div.large-image")[1]) != ""):
        div += str(BeautifulSoup(html).select("div.large-image")[1])
        time = time + 1
except IndexError:
    div = ""
    time = time + 1
finally:
    print(str(time) + div)
The site in the variable html has 2 divs with the class "large-image"; the site in html2 has only 1. With html the program works as intended, but if I switch to html2 the variable div is going to be completely empty. I would like to save the one div that exists rather than saving nothing. How could I achieve this?
the variable div is going to be completely empty.
That's because your error handler assigned it the empty string.
Please don't use subscripts, conditionals, and handlers in that way. It would be more natural to iterate over the results of select() with for, building up a result list (or string).
Also, you should create soup = BeautifulSoup(html) just once, as that can be a fairly expensive operation, since it carefully parses a potentially long web page. With that, you could build up a list of HTML fragments with:
images = [str(image)
          for image in soup.select('div.large-image')]
Or if for some reason you're not fond of list comprehensions, you could equivalently write:
images = []
for image in soup.select('div.large-image'):
    images.append(str(image))
and then get the required html with div = '\n'.join(images).
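Put together, the whole try/except/finally block above collapses to something like this (a sketch reusing the html variable and names from the question):
soup = BeautifulSoup(html)  # parse once, reuse below
images = [str(image) for image in soup.select('div.large-image')]
time = len(images)       # how many matching divs were found
div = '\n'.join(images)  # empty string if there were none
print(str(time) + div)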
You can concatenate all items inside a for loop:
all_divs = soup.select("div.large-image")
for item in all_divs:
    div += str(item)
    time += 1
or using join()
time = len(all_divs)
div = ''.join(str(item) for item in all_divs)
You can also write to the file directly inside the for loop, one row per item:
for item in all_divs:
    csv_writer.writerow([str(item).strip()])
    time += 1
Working example
import urllib.request
from bs4 import BeautifulSoup
import csv

div = ""
time = 0

f = open('output.csv', 'w')
csv_writer = csv.writer(f)

all_urls = [
    "https://www.kramerav.com/de/Product/VM-2N",
    "https://www.kramerav.com/de/Product/SDIA-IN2-F16",
]

for url in all_urls:
    print('url:', url)
    html = urllib.request.urlopen(url).read()

    try:
        soup = BeautifulSoup(html)
        all_divs = soup.select("div.large-image")

        for item in all_divs:
            div += str(item)
            time += 1

        # or
        time = len(all_divs)
        div = ''.join(str(item) for item in all_divs)

        # or
        for item in all_divs:
            #div += str(item)
            #time += 1
            csv_writer.writerow([time, str(item).strip()])

    except IndexError as ex:
        print('Error:', ex)
        time += 1

    finally:
        print(time, div)

f.close()
My code:
number = 0
while True:
    number += 1
    url = url + '?curpage={}'.format(number)
    html = urllib2.urlopen(url).read()
My issue: I have a while loop, and within it a URL. On each iteration I want the URL to change to:
url?curpage=1
url?curpage=2
...
What I am getting:
url?curpage=1
url?curpage=1?curpage=2
...
Any suggestions on how to resolve this issue?
Don't modify url in the loop. For example:
url = "<base url>"
number = 0
while True:
    number += 1
    html = urllib2.urlopen('{}?curpage={}'.format(url, number)).read()
url = url + ... says to append to the end of url, making it longer with each iteration.
From your expected output, you would seem to want:
html = urllib2.urlopen(url+'?curpage={}'.format(number)).read()
Also, your loop will never end.
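For example, one way to give it an exit is to stop when a page comes back empty. The stop condition here is an assumption; the right test depends on what the site actually returns past the last page:
import urllib2

url = "<base url>"
number = 0
while True:
    number += 1
    html = urllib2.urlopen('{}?curpage={}'.format(url, number)).read()
    if not html:  # assumed: the server returns an empty body past the last page
        break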
number = 0
for i in range(10):
    url = 'http://www.example.com/'
    number = number + 1
    url = url + '?curpage={}'.format(number)
    print(url)
I am trying to create a webcrawler that parses all the html on a page, grabs the link at a specified position (via raw_input), follows that link, and then repeats the process a specified number of times (again via raw_input). I am able to grab the first link and print it successfully. However, I am having problems "looping" the whole process, and usually grab the wrong link. This is the first link:
https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
(Full disclosure, this question pertains to an assignment for a Coursera course)
Here's my code
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
rpt = raw_input('Enter Position')
rpt = int(rpt)
cnt = raw_input('Enter Count')
cnt = int(cnt)
count = 0
counts = 0
tags = list()
soup = None

while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup.findAll('a')
    for tag in tags:
        url = tag.get('href')
        count = count + 1
        if count == rpt:
            break
    counts = counts + 1
    if counts == cnt:
        x == 1
    else:
        continue
print url
Based on DJanssens' response, I found the solution:
url = tags[position-1].get('href')
did the trick for me!
Thanks for the assistance!
I also worked on that course, and with help from a friend, I got this worked out:
import urllib
from bs4 import BeautifulSoup

url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt = 7
position = 18
count = 0
counts = 0
tags = list()
soup = None
x = 0

while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.findAll('a')
    url = tags[position-1].get('href')
    count = count + 1
    if count == rpt:
        break

print url
I believe this is what you are looking for:
import urllib
from bs4 import BeautifulSoup

url = raw_input('Enter - ')
position = int(raw_input('Enter Position'))
count = int(raw_input('Enter Count'))

# perform the loop "count" times
for _ in xrange(0, count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup.findAll('a')
    # if the link does not exist at that position, show an error and stop
    if position > len(tags):
        print "A link does not exist at that position."
        break
    # if the link at that position exists, overwrite url so the next search will use it
    url = tags[position-1].get('href')
    print url
The code will now loop the number of times specified in the input; each time it takes the href at the given position and replaces url with it, so each pass looks one step further down the tree. I advise you to use full names for variables, which is a lot easier to understand. In addition you can read and cast them in a single line, which makes the beginning easier to follow.
Here are my two cents:
import urllib
#import ssl
from bs4 import BeautifulSoup

#'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))

print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read())

for x in range(1, count + 1):
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))
    print('Retrieving: ' + link[position - 1])
    soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read())
My code:
from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "https://realpython.com/practice/profiles.html"
html_page = urlopen(url)
html_text = html_page.read()
soup = BeautifulSoup(html_text)
links = soup.find_all('a', href=True)
files = []
base = "https://realpython.com/practice/"

def page_names():
    for a in links:
        files.append(base + a['href'])

page_names()

for i in files:
    all_page = urlopen(i)
all_text = all_page.read()
all_soup = BeautifulSoup(all_text)
print all_soup
The first half of the parsing collects three links, the second half is supposed to print out all of their html.
Sadly, it only prints the last link's html.
Possibly because of
for i in files:
    all_page = urlopen(i)
It was working previously with 8 lines of code serving the for i in files: purpose, but I wanted to clean it up and got it down to those two. Well, clearly that didn't work.
No error though!
You only store the last value in your loop, you need to move all the assignments and the print inside the loop:
for i in files:
    all_page = urlopen(i)
    all_text = all_page.read()
    all_soup = BeautifulSoup(all_text)
    print all_soup
If you are going to use functions, I would pass parameters and create the list inside the function; otherwise you might get unexpected output:
def page_names(b, lnks):
    files = []
    for a in lnks:
        files.append(b + a['href'])
    return files

for i in page_names(base, links):
    all_page = urlopen(i)
    all_text = all_page.read()
    all_soup = BeautifulSoup(all_text)
    print all_soup
Your function can then return a list comprehension:
def page_names(b, lnks):
    return [b + a['href'] for a in lnks]
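As a side note, if the hrefs might be absolute or the base might lack a trailing slash, urljoin from the standard library is a more robust way to combine them than plain concatenation. A sketch with the same signature:
from urlparse import urljoin  # urllib.parse in Python 3

def page_names(b, lnks):
    # urljoin copes with absolute hrefs and missing slashes
    return [urljoin(b, a['href']) for a in lnks]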
In your for loop you are assigning to all_page, which is overwritten on each pass through the loop, so it will only ever hold the value of the last iteration.
If you want it to print all_soup for each page, you could just indent those 3 lines to be inside the for loop as well; then they would be executed each time through the loop.
It seems to be just a formatting issue; you probably meant to print it in the loop, right?
for i in files:
    all_page = urlopen(i)
    all_text = all_page.read()
    all_soup = BeautifulSoup(all_text)
    print all_soup