I am able to extract table values from this website with the following code.
from pyquery import PyQuery as pq
import requests
url = "https://finviz.com/screener.ashx"
content = requests.get(url).content
doc = pq(content)
Tickers = doc(".screener-link-primary").text()
print(Tickers)
However, I can only extract the first 20 values. There is a 'next' button at the end of the page that links to the next set of values.
How can I extract this link automatically, fetch the new page, extract the new set of values, and append them to my existing list?
You can iterate through all the pages like this:
counter = 1
while True:
    url = "https://finviz.com/screener.ashx?v=111&r=%d" % counter
    content = requests.get(url).content
    counter += 20
Note that the r parameter (which I guess stands for the starting entry index) is 1 for the first page, 21 for the second, 41 for the third, and so on, which is why the counter is incremented by 20.
You should also add a break for when the last page is reached. Usually you check whether there is new data to scrape and, if not, break out of the loop.
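Putting it together with the selector from your question, a minimal sketch could look like this. It assumes the v=111 view from above and treats a page whose first ticker was already collected as the signal that the last page has been reached (that stopping heuristic is my assumption, not something the site documents):
import requests
from pyquery import PyQuery as pq

tickers = []
counter = 1
while True:
    url = "https://finviz.com/screener.ashx?v=111&r=%d" % counter
    doc = pq(requests.get(url).content)
    page_tickers = doc(".screener-link-primary").text().split()  # tickers on this page
    if not page_tickers or page_tickers[0] in tickers:  # nothing new -> last page reached
        break
    tickers.extend(page_tickers)
    counter += 20
print(tickers)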
Related
I am trying to scrape some pages and count occurrences of a word on each page. I have to go through different sets of links to reach the final set of pages, and I used for loops to collect and iterate through the links.
As the website is slow, I put the final iteration inside a while True loop. But each time I run the code, it loops through the final set of links differently: for example, it goes through 20 links and then repeats those same 20 links while ignoring another 20. The number varies every run, and sometimes within a single iteration it repeats and omits a random number of links.
The website is really slow, so unless I put in the while True loop, the program stops in the middle. Could someone please look through the code and point out what I am doing wrong?
from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen
import re
import pandas as pd
import io
import requests
import time
import csv

d = open('Wyd 20-21.csv', 'w')
writer = csv.writer(d, lineterminator='\n')

URL = "http://mnregaweb4.nic.in/netnrega/writereaddata/citizen_out/MAT_DTL_1603_MATD_eng2021.html"
soup = bs(requests.get(URL).content, "html.parser")
base_url = "http://mnregaweb4.nic.in/netnrega/"

linksblocks = []
for tag in soup.select("td:nth-of-type(2) a"):
    linksblocks.append(tag["href"])
print(linksblocks)
Allblocks = [base_url + e[6:] for e in linksblocks]
print(Allblocks)  # This is the first set of links. I have to iterate through each one of them to get to the second set of links

links = []
for each in Allblocks:
    soup = bs(requests.get(each).content, "html.parser")
    for tag in soup.select("td:nth-of-type(2) a"):
        links.append(tag["href"])
AllGPs = [base_url + e[6:] for e in links]
print(AllGPs)  # This is the second set of links. I have to iterate through each one of them to get to the final set of links

gp = 0
for each in AllGPs:
    res = requests.get(each)
    soup = bs(res.text, 'html.parser')
    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))
    inte = urls[1:-1]
    each_bill = [base_url + e[6:] for e in inte]  # This is the final set of links. I have to iterate through each one of them to get to the final pages and look for the occurrence of the word in each of the page.
    q = len(each_bill)
    print("no of bills is: ", q)
    gp += 1
    x = 0
    while True:
        try:
            for each in each_bill:
                r = requests.get(each)
                y = r.text.count('Display Board')
                print(y)
                soup = bs(r.text, 'html.parser')
                table_soup = soup.findAll('table')
                trow = []
                for tr in table_soup[3]:
                    trow.append(tr)
                text = trow[1].text
                b = text[13:]
                print(b)
                writer.writerow((y, b))
                x += 1
                print("Now Wyd ", x, "th bill in", gp, " th GP")
                if x == q:
                    break
            if x == q:
                break
        except requests.exceptions.RequestException as e:
            print("exception error: ", e)
            time.sleep(5)
            continue

d.close()
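What I think I actually want is to retry only the request that failed instead of restarting the whole for loop, roughly like the sketch below. The fetch_with_retry helper is something I made up for illustration and have not run; each_bill is the list built above:
import requests
import time

def fetch_with_retry(url, retries=5, wait=5):
    # retry a single request instead of restarting the whole bill loop on failure
    for _ in range(retries):
        try:
            return requests.get(url)
        except requests.exceptions.RequestException as e:
            print("exception error: ", e)
            time.sleep(wait)
    return None  # give up on this bill after the retries

for x, each in enumerate(each_bill, start=1):
    r = fetch_with_retry(each)
    if r is None:
        continue
    # ...then count 'Display Board', parse the tables and write the row exactly as before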
My goal here is for this Python script to open a block's page on blockchain.info and grab the correct table. That table is then searched for a range of values, and the results are printed.
This one works on https://blockchain.info/search?search=422407, finding the associated "0.02269362":
import numpy as np
from bs4 import BeautifulSoup
import urllib2
#create a list of numbers that will be used in search
list = np.arange(0.9749999312,0.9793780897,0.000001)
#open webpage, get table
web = urllib2.urlopen("https://blockchain.info/search?search=422407").read()
#whole page
soup = BeautifulSoup(web, "lxml")
table = soup.findAll("table", {'class':'table table-striped'}) #Correct table
#Go through all numbers created; check if found; print found
for i in list:
    j = str(round((i * 0.023223),8))
    for line in table: #if you see a line in the table
        if line.get_text().find(j) > -1 : #and you find the specific string
            print(line.prettify().encode("utf-8")) #print it
            print j
I'm having difficulty doing this for other blocks. The script below is supposed to go to block 422245 and find "0.02972821", but it does not print anything. Ideally it would print anything that matches [x.xxxx]yz and so on.
import numpy as np
from bs4 import BeautifulSoup
import urllib2
#create a list of numbers that will be used in search
list = np.arange(0.9749999312,0.9793780897,0.000001)
#open webpage, get table
web = urllib2.urlopen("https://blockchain.info/search?search=422245").read() #whole page
soup = BeautifulSoup(web, "lxml")
table = soup.findAll("table", {'class':'table table-striped'}) #Correct table
#Go through all numbers created; check if found; print found
for i in list:
    j = str(round((i * 0.03044589),8))
    for line in table: #if you see a line in the table
        if line.get_text().find(j) > -1 : #and you find the specific string
            print(line.prettify().encode("utf-8")) #print it
            print j
When I tried testing just the finding part of the script with the code below, it also did not work. But if you go to https://blockchain.info/search?search=422245 and search the page for "0.02972821", the value is there. I am confused about why this is not working.
import numpy as np
from bs4 import BeautifulSoup
import urllib2
#create a list of numbers that will be used in search
list = np.arange(0.9749999312,0.9793780897,0.000001)
#open webpage, get table
web = urllib2.urlopen("https://blockchain.info/search?search=422245").read() #whole page
soup = BeautifulSoup(web, "lxml")
table = soup.findAll("table", {'class':'table table-striped'}) #Correct table
#Go through all numbers created; check if found; print found
j = "0.02972821"
for line in table: #if you see a line in the table
    if line.get_text().find(j) > -1 : #and you find the specific string
        print(line.prettify().encode("utf-8")) #print it
        print j
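One way I could narrow this down would be to check whether the value appears in the raw HTML at all and how many tables the class filter actually matches. This is a quick diagnostic sketch, not part of the script above:
import urllib2
from bs4 import BeautifulSoup

web = urllib2.urlopen("https://blockchain.info/search?search=422245").read()
print("0.02972821" in web)  # is the value anywhere in the raw HTML?
soup = BeautifulSoup(web, "lxml")
tables = soup.findAll("table", {'class': 'table table-striped'})
print(len(tables))  # how many tables the class filter matched
for t in tables:
    print("0.02972821" in t.get_text())  # is the value inside any matched table?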
I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages within a set of pages.
Here is an example starting page.
There are 29 subpages within that leading page, so ideally the function would return 29.
By subpage I mean page 1 of 29, page 2 of 29, and so on.
This is the HTML snippet which contains the last page information, from the link posted above.
<div id="paging-wrapper-btm" class="paging-wrapper">
<ol class="page-nos"><li ><span class="selected">1</span></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</a></li></ol>
I have the following code, which finds all ol tags, but I can't figure out how to access the contents contained within each 'a'.
a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
< further processing >
Any help/suggestions much appreciated.
Ah.. I found a simple solution.
for item in soup.select("ol a"):
    x = item.text
    print x
I can then sort and select the largest number.
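That last step looks roughly like the sketch below; it assumes the page numbers appear as plain digits in the li text (the currently selected page sits in a span, so I select li entries rather than only a tags):
# soup is the BeautifulSoup object for the listing page, as above
numbers = [int(item.text) for item in soup.select("ol li") if item.text.strip().isdigit()]
print(max(numbers))  # the largest page number, e.g. 29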
Try this:
ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols] # Finds all a's inside each ol in the ols list
all_as = []
for a in list_of_as: # This is to expand each sublist of a's and put all of them in one list
    all_as.extend(a)
print all_as
The following would extract the last page number:
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text)
ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
last_page = pages[-2]
print last_page
Which for your website will display:
30
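If a category ever has only a single page (no pagination block at all), a slightly guarded version of the same idea could look like this sketch (not tested against such a page):
# soup as above
ol = soup.find('ol', class_='page-nos')
if ol is None:
    last_page = 1  # no pagination block, so assume a single page
else:
    last_page = int(ol.find_all('li')[-2].text)
print(last_page)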
I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.
I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that teams ranks from all links.
I graciously found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong, because I'm getting 0 as the total rank.
Here's my code:
import requests
from bs4 import BeautifulSoup
import time
url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")
stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text)
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank
print total_rank
Check out that link to double-check that I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, but I'm not sure.
I don't use Python much, so I'm unfamiliar with its debugging tools. If anyone wants to point me to one of those, that would be great!
First, the team stats and player stats sections are contained in a div with class 'column large-2'; the team stats are in the first occurrence. Then you can find all of the href tags within it. I've combined both steps into a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of navigation links) so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
'/ncaa-basketball/stat/average-scoring-margin',
'/ncaa-basketball/stat/offensive-efficiency',
'/ncaa-basketball/stat/floor-percentage',
'/ncaa-basketball/stat/1st-half-points-per-game',
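From there, a rough sketch of accumulating Oklahoma's ranks could look like this. The table.tr-table selector and the rank/team column order are assumptions taken from your question, not verified against the site:
import requests
from bs4 import BeautifulSoup

base = "https://www.teamrankings.com"
total_rank = 0
for link in links:  # links is the list of relative hrefs built above
    page = BeautifulSoup(requests.get(base + link).text, "html.parser")
    for row in page.select("table.tr-table tr"):
        cells = row.find_all("td")
        if len(cells) > 1 and cells[1].text.strip() == "Oklahoma":
            total_rank += int(cells[0].text.strip())  # the rank is text, so convert before summing
print(total_rank)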
I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up being an empty list; therefore the iteration over stat_links is never carried out and total_rank never gets incremented. I suggest you fiddle around with the way you find the list elements.
I want to use BeautifulSoup to retrieve specific URLs at specific positions repeatedly. You may imagine that there are 4 different URL lists, each containing 100 different URL links.
I always need to get and print the 3rd URL on every list; that URL (e.g. the 3rd URL on the first list) leads to the 2nd list, where I again need to get and print the 3rd URL, and so on until the 4th retrieval.
Yet my loop only achieves the first result (the 3rd URL on list 1), and I don't know how to feed the new URL back into the while loop and continue the process.
Here is my code:
import urllib.request
import json
import ssl
from bs4 import BeautifulSoup
num=int(input('enter count times: ' ))
position=int(input('enter position: ' ))
url='https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print (url)
count=0
order=0
while count<num:
    context = ssl._create_unverified_context()
    htm=urllib.request.urlopen(url, context=context).read()
    soup=BeautifulSoup(htm)
    for i in soup.find_all('a'):
        order+=1
        if order ==position:
            x=i.get('href')
            print (x)
    count+=1
    url=x
print ('done')
This is a good problem to solve with recursion. Try calling a recursive function to do it:
import requests
from bs4 import BeautifulSoup

def retrieve_urls_recur(url, position, index, deepness):
    if index >= deepness:
        return True
    else:
        plain_text = requests.get(url).text
        soup = BeautifulSoup(plain_text)
        links = soup.find_all('a')
        desired_link = links[position].get('href')
        print(desired_link)
        return retrieve_urls_recur(desired_link, position, index + 1, deepness)
and then call it with the desired parameters, in your case:
retrieve_urls_recur(url, 2, 0, 4)
2 is the URL's index in the list of URLs, 0 is the counter, and 4 is how deep you want to go recursively.
PS: I am using requests instead of urllib, and I didn't test this, although I recently used a very similar function with success.
Just get the link from find_all() by index:
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm)
    url = soup.find_all('a')[position].get('href')
    count += 1
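A self-contained version of that loop could look like the sketch below. Note that find_all() indexing is zero-based, so if position is entered as 1-based (as with the counter in your original code), position - 1 is the index to use; that adjustment is my assumption:
import ssl
import urllib.request
from bs4 import BeautifulSoup

num = int(input('enter count times: '))
position = int(input('enter position: '))
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'

count = 0
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()
    soup = BeautifulSoup(htm, 'html.parser')
    url = soup.find_all('a')[position - 1].get('href')  # zero-based index for a 1-based position
    print(url)
    count += 1
print('done')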