Use BeautifulSoup to loop through and retrieve specific URLs - python

I want to use BeautifulSoup to retrieve a specific URL at a specific position, repeatedly. Imagine there are 4 different URL lists, each containing 100 different links.
I always need to get and print the 3rd URL on every list, while the previous URL (e.g. the 3rd URL on the first list) leads to the 2nd list (and so on, until the 4th retrieval).
Yet my loop only achieves the first result (the 3rd URL on list 1), and I don't know how to feed the new URL back into the while loop and continue the process.
Here is my code:
import urllib.request
import json
import ssl
from bs4 import BeautifulSoup

num = int(input('enter count times: '))
position = int(input('enter position: '))
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print(url)

count = 0
order = 0
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm)
    for i in soup.find_all('a'):
        order += 1
        if order == position:
            x = i.get('href')
            print(x)
            count += 1
            url = x
print('done')

This is a good problem for recursion. Try a recursive function like this:
import requests
from bs4 import BeautifulSoup

def retrieve_urls_recur(url, position, index, deepness):
    if index >= deepness:
        return True
    else:
        plain_text = requests.get(url).text
        soup = BeautifulSoup(plain_text, 'html.parser')
        links = soup.find_all('a')
        desired_link = links[position].get('href')
        print(desired_link)
        return retrieve_urls_recur(desired_link, position, index + 1, deepness)
and then call it with the desired parameters; in your case:
retrieve_urls_recur(url, 2, 0, 4)
Here 2 is the url's index in the list of urls, 0 is the starting counter, and 4 is how deep you want to go recursively.
P.S.: I am using requests instead of urllib, and I didn't test this, although I recently used a very similar function with success.

Just get the link from find_all() by index:
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm)
    url = soup.find_all('a')[position].get('href')
    count += 1
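For completeness, here is a runnable sketch of this approach with the question's inputs (assuming the position entered is 1-based, as in "the 3rd URL"):

import ssl
import urllib.request
from bs4 import BeautifulSoup

num = int(input('enter count times: '))
position = int(input('enter position: '))
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'

count = 0
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm, 'html.parser')
    # position - 1 because the question counts links from 1, lists index from 0
    url = soup.find_all('a')[position - 1].get('href')
    print(url)
    count += 1
print('done')

Indexing also sidesteps the bug in the original loop: order was never reset to 0 between pages, so order == position could only ever match once.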

Related

While True try/except loop giving different output each time in web scraping - repeats or omits elements while iterating

I am trying to scrape some pages and count occurrences of a word on each page. I have to go through different sets of links to reach the final set of pages, and I used for loops to collect and iterate through the links.
As the website is slow, I put the final iteration inside a while True loop. But each time I run the code, it traverses the final set of links differently. For example, it goes through 20 links and then repeats those 20 links while skipping another 20. The number varies every run, repeating and omitting a random number of links within each iteration.
The website is really slow, so unless I use a while True loop the program stops in the middle. Could someone please look through the code and point out what I am doing wrong?
from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen
import re
import pandas as pd
import io
import requests
import time
import csv

d = open('Wyd 20-21.csv', 'w')
writer = csv.writer(d, lineterminator='\n')

URL = "http://mnregaweb4.nic.in/netnrega/writereaddata/citizen_out/MAT_DTL_1603_MATD_eng2021.html"
soup = bs(requests.get(URL).content, "html.parser")
base_url = "http://mnregaweb4.nic.in/netnrega/"

linksblocks = []
for tag in soup.select("td:nth-of-type(2) a"):
    linksblocks.append(tag["href"])
print(linksblocks)
Allblocks = [base_url + e[6:] for e in linksblocks]
print(Allblocks)  # This is the first set of links. I have to iterate through each one of them to get to the second set of links

links = []
for each in Allblocks:
    soup = bs(requests.get(each).content, "html.parser")
    for tag in soup.select("td:nth-of-type(2) a"):
        links.append(tag["href"])
AllGPs = [base_url + e[6:] for e in links]
print(AllGPs)  # This is the second set of links. I have to iterate through each one of them to get to the final set of links

gp = 0
for each in AllGPs:
    res = requests.get(each)
    soup = bs(res.text, 'html.parser')
    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))
    inte = urls[1:-1]
    each_bill = [base_url + e[6:] for e in inte]  # This is the final set of links. I have to iterate through each one of them to get to the final pages and look for the occurrence of the word on each page.
    q = len(each_bill)
    print("no of bills is: ", q)
    gp += 1
    x = 0
    while True:
        try:
            for each in each_bill:
                r = requests.get(each)
                y = r.text.count('Display Board')
                print(y)
                soup = bs(r.text, 'html.parser')
                table_soup = soup.findAll('table')
                trow = []
                for tr in table_soup[3]:
                    trow.append(tr)
                text = trow[1].text
                b = text[13:]
                print(b)
                writer.writerow((y, b))
                x += 1
                print("Now Wyd ", x, "th bill in", gp, " th GP")
                if x == q:
                    break
            if x == q:
                break
        except requests.exceptions.RequestException as e:
            print("exception error: ", e)
            time.sleep(5)
            continue
d.close()
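The repeats and omissions come from the loop structure: when a request fails partway through, the except block's continue restarts the while True, which restarts for each in each_bill from the first bill while x keeps its old value. Earlier bills are then written again, and the x == q check can fire before later bills have been visited. A minimal sketch of the usual fix is to retry each request individually so a failure never restarts the whole list (the helper name, retry count, and timeout here are illustrative):

import time
import requests

def get_with_retries(url, retries=5, delay=5, timeout=30):
    """Fetch a URL, retrying on connection errors so a slow site
    never forces the surrounding loop to start over."""
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=timeout)
        except requests.exceptions.RequestException as e:
            print("exception error: ", e)
            time.sleep(delay)
    raise RuntimeError("giving up on " + url)

With this helper, the final iteration needs no while True wrapper at all: write for each in each_bill: r = get_with_retries(each) followed by the existing parsing code, and each bill is fetched exactly once.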

How to find sum of a for loop in BeautifulSoup

Please check my code here:
Sample url: http://py4e-data.dr-chuck.net/comments_42.html
The sum of the numbers found at the url above should be 2553.
I have tried to sum them up using several techniques but can't find the correct one; use the url provided at the top of the code. I need to sum up the numbers that appear as strings.
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# To read the file from the url
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# To search for specific area of the file
tags = soup('span')
#print(tags)

sum = 0
# Filters your search further and prints the specific part as string
for tag in tags:
    print(tag.contents[0])
    #ChangeToInt = int(tag.contents[0])
    #sum =+ ChangeToInt
#print(sum)
A few pointers: sum is a Python built-in for summing iterables of numbers, so it's best not to use it as a variable name. Also, the syntax for adding to a variable is +=, but your code has =+. Your code works with just that syntax change (I have also renamed the variable from sum to total and print the total only after the loop):
total = 0
for tag in tags:
    print(tag.contents[0])
    ChangeToInt = int(tag.contents[0])
    total += ChangeToInt
print(total)
Alternatively, you could write this using Python's built-in sum and a list comprehension to generate the numbers:
total = sum([int(tag.contents[0]) for tag in tags])
print(total)
Additionally, you can check this question for the difference between += and =+.
You simply have your increment syntax wrong:
sum =+ ChangeToInt
should instead be:
sum += ChangeToInt
Your code worked just fine for me after I fixed that.

Scraping returning only one value

I wanted to scrape something as my first program, just to learn the basics really, but I'm having trouble showing more than one result.
The premise is: go to a forum (http://blackhatworld.com), scrape all thread titles, and compare each with a string. If it contains the word "free" it will print; otherwise it won't.
Here's the current code:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
n = 0
for x in range(len(threadtitles)):
    test = list(threadtitles)[n]
    test2 = list(test)[0]
    if test2.find('free') == -1:
        n = n + 1
    else:
        print(test2)
        n = n + 1
This is the result of running the program:
https://i.gyazo.com/6cf1e135b16b04f0807963ce21b2b9be.png
As you can see, it's checking for the word "free" and it works, but it only shows the first result while there are several more on the page.
By default, string comparison is case-sensitive (FREE != free). To solve your problem, first you need to put test2 in lowercase:
test2 = list(test)[0].lower()
To solve your problem and simplify your code, try this:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
count = 0
for title in threadtitles:
    if "free" in title.get_text().lower():
        print(title.get_text())
    else:
        count += 1
print(count)
Bonus: Print value of href:
for title in threadtitles:
    print(title["href"])

Returns a list with only 20 entries. Does not go beyond that

#importing the libraries
import urllib.request as urllib2
from bs4 import BeautifulSoup

#getting the page url
quote_page = "https://www.quora.com/What-is-the-best-advice-you-can-give-to-a-junior-programmer"
page = urllib2.urlopen(quote_page)

#parsing the html
soup = BeautifulSoup(page, "html.parser")

# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})

#finding all the tags in the page
ans = name_box.find_all("div", attrs={"class": "u-serif-font-main--large"}, recursive=True)

#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100]

#extracting all the answers and putting into a list
finalans = []
l = 0
for i in chunk:
    stri = chunk[l]
    finalans.append(stri.text)
    l += 1
    continue
final_string = '\n'.join(finalans)

#final output
print(final_string)
I am not able to get more than 20 entries into this list. What is wrong with this code? (I am a beginner and I used some references to write this program.)
Edit: I have added the URL I want to scrape.
You try to break ans into smaller chunks, but notice that each iteration of this loop discards the previous content of chunk, so you lose all but the last chunk of data:
#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100]  # overwrites the previous chunk
This is why you only get 20 items in the list: it's only the final chunk. Since you want final_string to hold all of the text nodes, there is no need to chunk at all, so I just removed it.
Next, and this is just tightening up the code, you don't need to both iterate over the values of the list and track an index just to get the same value you are indexing. Working on ans, because we are no longer chunking,
finalans = []
l = 0
for i in ans:
    stri = ans[l]
    finalans.append(stri.text)
    l += 1
    continue
becomes
finalans = []
for item in ans:
    finalans.append(item.text)
or, more succinctly,
finalans = [item.text for item in ans]
So the program is
#importing the libraries
import urllib.request as urllib2
from bs4 import BeautifulSoup

#getting the page url
quote_page = "https:abcdef.com"
page = urllib2.urlopen(quote_page)

#parsing the html
soup = BeautifulSoup(page, "html.parser")

# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})

#finding all the tags in the page
ans = name_box.find_all("div", attrs={"class": "u-serif-font-main--large"}, recursive=True)

#extracting all the answers and putting into a list
finalans = [item.text for item in ans]
final_string = '\n'.join(finalans)

#final output
print(final_string)

Extracting web elements using Pyquery, Requests and Gadget selector

I am able to extract table values from this website with the following code.
from pyquery import PyQuery as pq
import requests
url = "https://finviz.com/screener.ashx"
content = requests.get(url).content
doc = pq(content)
Tickers = doc(".screener-link-primary").text()
print(Tickers)
But I am able to extract only the first 20 values. There is a 'next' button at the end of the page which links to the next set of values.
How can I extract this link automatically, fetch the new page, extract the new set of values, and append them to my existing list?
You can iterate through all pages like:
counter = 1
while True:
    url = "https://finviz.com/screener.ashx?v=111&r=%d" % counter
    content = requests.get(url).content
    counter += 20
Note that for the first page the r parameter (which I guess stands for the starting entry index) will be 1, for the second 21, for the third 41... so I used a +20 increment for counter.
You should also add a break for the moment when the last page is reached. Usually one checks whether new data to scrape is available and, if not, breaks.
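Putting those pieces together, a sketch of the full pagination loop might look like the following (the stop condition, which checks whether a page repeats the previous batch or returns nothing, is an assumption about how the site signals the last page):

from pyquery import PyQuery as pq
import requests

tickers = []
prev_batch = None
counter = 1
while True:
    url = "https://finviz.com/screener.ashx?v=111&r=%d" % counter
    content = requests.get(url).content
    doc = pq(content)
    # .text() joins the text of all matched elements with spaces
    batch = doc(".screener-link-primary").text().split()
    # stop when the page is empty or repeats (assumed end-of-data signal)
    if not batch or batch == prev_batch:
        break
    tickers.extend(batch)
    prev_batch = batch
    counter += 20

print(tickers)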
