Python script to extract data from an HTML page

I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.
I have tried to write a script that scans all the desired links (all the Team Stats links) on this page, finds the rank of a specified team (an input), and returns the sum of that team's ranks across all the links.
Luckily, I found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong, because I'm getting 0.
Here's my code:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text)
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

print total_rank
Check out that link to double-check that I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, but I'm not sure.
I don't normally use Python, so I'm unfamiliar with its debugging tools. If anyone wants to point me to one of those, that would be great!

First, the team stats and player stats sections are contained in a <div class="column large-2">. The team stats are in the first occurrence. Then you can find all of the href attributes within it. I've combined both steps in a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contain "#" (part of navigation links), so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
'/ncaa-basketball/stat/average-scoring-margin',
'/ncaa-basketball/stat/offensive-efficiency',
'/ncaa-basketball/stat/floor-percentage',
'/ncaa-basketball/stat/1st-half-points-per-game',
 ...]
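From there, one way to finish the original task is to request each stat page and sum the team's ranks. Here is a minimal sketch, assuming the relative links need the site's base URL prepended and that each stat table row holds the rank in its first cell and the team name in its second (both assumptions should be verified against the live page):
base_url = 'https://www.teamrankings.com'
total_rank = 0
for link in links:
    r = requests.get(base_url + link)  # the scraped hrefs are relative, so prepend the domain
    page = BeautifulSoup(r.text, 'html.parser')
    for row in page.select('table.datatable tr'):  # assumed table selector; check the page source
        cells = row.find_all('td')
        if len(cells) > 1 and cells[1].text.strip() == 'Oklahoma':
            total_rank += int(cells[0].text.strip())  # rank comes back as text, so convert before summing
print(total_rank)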

I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up empty. The iteration over stat_links therefore never runs, and total_rank never gets incremented. I suggest you fiddle around with the way you find the list elements.
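Since you asked about debugging tools: the simplest aid is a print() call inside the loop (for example, print(len(table_cells))) to see where the data disappears. For something interactive, the standard library ships with pdb; dropping this line anywhere pauses execution so you can inspect variables:
import pdb; pdb.set_trace()  # execution pauses here; inspect variables, then type 'c' to continue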

Webscraping - Beautifulsoup4 - Accessing indexed item in a find_all loop

How do I make it so that I can choose an item in the list in that for loop?
When I print it without an index, I get the full list, and every element seems to be the proper item that I need:
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname)
However, when I print it by specifying an index, whether inside the loop or outside it, I get "list index out of range":
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname[2])
What's my problem here? How do I change my code so that I can scrape all the h3 names, yet at the same time be able to choose specific indexed h3 names when I want to?
Here's the entire code:
import requests
from bs4 import BeautifulSoup

source = requests.get("https://ca1lib.org/s/ginger")  # gets the source of the site and returns it
soup = BeautifulSoup(source.text, 'html5lib')

for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname[2])
At first glance, assuming each h3 element contains several book names ("book1" \n "book2" \n "book3"), the problem could be that some h3 elements yield fewer than three items, so bookname[2] can't access an element of those shorter lists.
On the other hand, if each h3 element holds only one name (<h3>book1</h3>), then you are iterating over all the h3 tags one at a time (in your first iteration you'll have "book1", in your second "book2"), so each bookname list has a single element and index 2 is always out of range. In that case you should build a list of all the h3.a.text values first, then access the desired one, as in the sketch below.
Hope this helps!
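For instance, a minimal sketch of that collect-then-index approach, using the same page and selector as the question:
titles = [h3.a.text.strip() for h3 in soup.find_all('h3', itemprop="name")]
print(titles)     # the full list of book names
print(titles[2])  # safe as long as the page yields at least three titles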
I forgot to append. I figured it out.
Here's my final code:
import requests
from bs4 import BeautifulSoup

source = requests.get("https://ca1lib.org/s/ginger")  # gets the source of the site and returns it
soup = BeautifulSoup(source.text, 'html.parser')

liste = []
for h3_tag in soup.find_all('h3', itemprop="name"):
    liste.append(h3_tag.a.text.split("\n"))
    #bookname = h3.a.text #string
    #bookname = bookname.split('\n') #becomes list

print(liste[5])
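One caveat worth noting: split("\n") returns a list, so append makes liste a list of lists, and liste[5] is itself a list. If you want a flat list of strings instead, extend is a small variation (my addition, not part of the original code):
liste = []
for h3_tag in soup.find_all('h3', itemprop="name"):
    liste.extend(h3_tag.a.text.split("\n"))  # extend flattens the split parts into one list
print(liste[5])  # now a single string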

Scraping returning only one value

I wanted to scrape something as my first program, just to learn the basics really, but I'm having trouble showing more than one result.
The premise is to go to a forum (http://blackhatworld.com), scrape all the thread titles, and compare each one with a string. If a title contains the word "free" it will be printed; otherwise it won't.
Here's the current code:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')

n = 0
for x in range(len(threadtitles)):
    test = list(threadtitles)[n]
    test2 = list(test)[0]
    if test2.find('free') == -1:
        n = n + 1
    else:
        print(test2)
        n = n + 1
This is the result of running the program:
https://i.gyazo.com/6cf1e135b16b04f0807963ce21b2b9be.png
As you can see, it checks for the word "free" and it works, but it only shows the first result even though there are several more on the page.
By default, string comparison is case-sensitive (FREE != free). To solve your problem, first you need to put test2 in lowercase:
test2 = list(test)[0].lower()
To fix the bigger issue and simplify your code, try this:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')

count = 0
for title in threadtitles:
    if "free" in title.get_text().lower():
        print(title.get_text())
    else:
        count += 1
print(count)
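If you'd rather collect the matching titles in a list instead of printing them one by one, a list comprehension does the same filtering (a small variation on the code above):
free_titles = [t.get_text() for t in threadtitles if "free" in t.get_text().lower()]
print(free_titles)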
Bonus: print the value of each href:
for title in threadtitles:
    print(title["href"])
See also this.

Extracting only the bullet points after a 'strong' title from a website using python

I want to extract only the points listed as bullets under the title 'WHAT RESPONDENTS ARE SAYING …' on this webpage.
I am able to achieve it with this code:
import requests
URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
r = requests.get(URL)
page = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'lxml')
strong_el = soup.find('strong',text='WHAT RESPONDENTS ARE SAYING …')
strong_el.find_all_next('li')[9]
But the problem here is that I have to know how many bullet points are listed (there are 10 in this case, so it returns valid values up to [9]). What is the best way to extract all of the bullet points without knowing how many are listed? Also, I need only the text, not the HTML.
You can use find_next_sibling to get the ul element next to the strong element; it contains the li elements. Then iterate over the children of the ul, which are the li elements:
ul_tag = strong_el.find_next_sibling('ul')
for li_tag in ul_tag.children:
    print li_tag.string
You should find the ul tag first; it contains all the li tags:
In [3]: ul = strong_el.find_next('ul')
In [4]: for li in ul.find_all('li'):
   ...:     print(li.text)
out:
“Demand very steady to start the year.” (Chemical Products)
“January revenue target slightly lower following a big December shipment month.” (Computer & Electronic Products)
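Putting the pieces together, a minimal end-to-end sketch using the question's URL and heading text (the exact markup around the list is assumed, so verify against the page):
import requests
from bs4 import BeautifulSoup

URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
soup = BeautifulSoup(requests.get(URL).text, 'lxml')
strong_el = soup.find('strong', text='WHAT RESPONDENTS ARE SAYING …')
bullets = [li.get_text(strip=True) for li in strong_el.find_next('ul').find_all('li')]  # text only, no HTML
for bullet in bullets:
    print(bullet)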

get last page number - web scraping

I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages within a set of pages.
Here is an example starting page.
There are 29 subpages within that leading page; ideally, the function would therefore return 29.
By subpage I mean page 1 of 29, page 2 of 29, etc.
This is the HTML snippet which contains the last page information, from the link posted above.
<div id="paging-wrapper-btm" class="paging-wrapper">
<ol class="page-nos"><li ><span class="selected">1</span></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</li></ol>
I have the following code, which will find all ol tags, but I can't figure out how to access the contents contained within each 'a'.
a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
< further processing >
Any help/suggestions much appreciated.
Ah.. I found a simple solution.
for item in soup.select("ol a"):
    x = item.text
    print x
I can then sort and select the largest number.
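For that last step, here is a sketch of picking the largest page number, filtering out non-numeric link texts such as "Weiter »" (my addition, not part of the original answer):
page_numbers = [int(item.text) for item in soup.select("ol a") if item.text.strip().isdigit()]
print(max(page_numbers))  # the highest page number found in the pagination links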
Try this:
ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols]  # finds all a's inside each ol in the ols list
all_as = []
for a in list_of_as:  # expand each sublist of a's and put them all in one list
    all_as.extend(a)
print all_as
The following would extract the last page number:
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text)
ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
last_page = pages[-2]
print last_page
Which for your website will display:
30

Python webscraping and getting contents of first div tag of its class

I'm working with Python 3.3 and this website:
http://www.nasdaq.com/markets/ipos/
My goal is to read only the companies that are in the Upcoming IPO section. It is in a div tag with class="genTable thin floatL". There are two divs with this class, and the target data is in the first one.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)

for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]:  # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))
I'd like it to return only
3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.
But it prints all div classes matching the re.match specification, and multiple times as well. I tried inserting [0] in the divparent loop to retrieve only the first one, but this causes the repeating problem instead.
EDIT: Here is the updated code according to warunsl's solution. This works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)

divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table = divparent.find('table')

for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))
You mentioned that there are two elements that fit the 'class':'genTable thin floatL' criteria, so running a for loop over the first element does not make sense.
So replace your outer for loop with
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
Now you need not do a soup.find_all again; doing so would search the entire document. You need to restrict the search to divparent, so you do:
table = divparent.find('table')
The remainder of the code to extract the dates and the company name would be the same, except that they will be with reference to the table variable.
for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)  # the question uses Python 3, so print needs parentheses
Hope it helps.
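If you'd rather have each row's cells grouped together instead of printed one at a time, here is a small variation (my sketch; it assumes the first tr is a header row and that the date and company name are the first two columns):
for row in table.find_all('tr')[1:]:  # skip the assumed header row
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(' - '.join(cells[:2]))  # e.g. 3/7/2014 - RECRO PHARMA, INC.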
