How do I make it so that I can choose an item from the list inside that for loop?
When I print it without brackets, I get the full list, and every index seems to hold the proper item that I need:
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname)
However, when I print it by specifying an index, whether inside the loop or outside it, I get "list index out of range":
for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname[2])
What's my problem here? How do I change my code so that I can scrape all the h3 names, yet at the same time be able to choose specific indexed h3 names when I want to?
Here's the entire code:
import requests
from bs4 import BeautifulSoup

source = requests.get("https://ca1lib.org/s/ginger")  # gets the source of the site and returns it
soup = BeautifulSoup(source.text, 'html5lib')

for h3 in soup.find_all('h3', itemprop="name"):
    bookname = h3.a.text
    bookname = bookname.split('\n')
    print(bookname[2])
At first glance, assuming each h3 element contains several book names ("book1" \n "book2" \n "book3"), your problem could be that some h3 elements split into fewer than three pieces, so bookname[2] tries to access an element that a shorter list doesn't have.
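If that's the case, a length check before indexing avoids the IndexError; here's a minimal sketch built on the loop from your question (the index 2 is just the one you used):

for h3 in soup.find_all('h3', itemprop="name"):
    parts = h3.a.text.split('\n')
    if len(parts) > 2:      # only index when the split produced enough pieces
        print(parts[2])
    else:
        print(parts)        # shorter entries: fall back to the whole list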
On the other hand, if each h3 element holds only one item (<h3>book1</h3>), then you are iterating over all the h3 tags and handling them one at a time (in your first iteration you'll have "book1", in your second "book2", and so on). In that case you should build a list with all the h3.a.text elements first, then access the desired value.
Hope this helps!
I forgot to append. I figured it out.
Here's my final code:
import requests
from bs4 import BeautifulSoup
source = requests.get("https://ca1lib.org/s/ginger") #gets the source of the site and returns it
soup = BeautifulSoup(source.text, 'html.parser')
liste = []
for h3_tag in soup.find_all('h3', itemprop="name"):
    liste.append(h3_tag.a.text.split("\n"))
    # bookname = h3.a.text  # string
    # bookname = bookname.split('\n')  # becomes list

print(liste[5])
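One caveat: liste[5] still assumes the search returned at least six results, so a length check keeps it safe (a minimal sketch):

if len(liste) > 5:
    print(liste[5])
else:
    print("only", len(liste), "results")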
Related
I want to parse the table at this URL and export it as a CSV:
http://www.bde.es/webbde/es/estadis/fi/ifs_es.html
If I do this:
sauce = urlopen(url_bank).read()
soup = bs.BeautifulSoup(sauce, 'html.parser')
and then this:
resto = soup.find_all('td')
lista_text = []

for elements in resto:
    lista_text = lista_text + [elements.string]
I get all the elements well parsed except the last column, 'Códigos Isin', and this is because there is a line break ('<br/>') in the HTML code. I do not know what to do with it; I have tried this, but it still does not work:
lista_text = lista_text + [str(elements.string).replace('<br/>','')]
After that I take the list to a np.array and then to a dataframe to export it as .csv. That part is already done; I only have to fix this issue.
Thanks in advance!
It's just that you need to be careful about what .string does: if there are multiple child elements, it returns None, as in the case with <br>:
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None
Use .get_text() instead:
for elements in resto:
    lista_text = lista_text + [elements.get_text(strip=True)]
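To see the difference, here is a small self-contained check (the cell contents are made up to mimic the 'Códigos Isin' column):

from bs4 import BeautifulSoup

# a <td> with a <br/> inside, like the problematic column
cell = BeautifulSoup("<td>ES0113211835<br/>ES0113307039</td>", "html.parser").td

print(cell.string)                # None: the tag has more than one child
print(cell.get_text(strip=True))  # 'ES0113211835ES0113307039'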
I want to extract only the points listed as bullets under the title 'WHAT RESPONDENTS ARE SAYING …' in this webpage.
I am able to achieve it with this code:
import requests
URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
r = requests.get(URL)
page = r.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'lxml')
strong_el = soup.find('strong',text='WHAT RESPONDENTS ARE SAYING …')
strong_el.find_all_next('li')[9]
But the problem here is that I have to know how many bullet points are listed (there are 10 in this case, hence it returns valid values up to [9]). What is the best way to extract all of the bullet points without knowing how many of them are listed? Also, I need only the text, not the HTML.
You can use find_next_sibling to get the ul element next to strong, which contains these li elements. Then get all the children of ul, which are li elements:
ul_tag = strong_el.find_next_sibling('ul')

for li_tag in ul_tag.children:
    print(li_tag.string)
You should find the ul tag first; it contains all the li tags:
In [3]: ul = strong_el.find_next('ul')

In [4]: for li in ul.find_all('li'):
   ...:     print(li.text)

Out:
“Demand very steady to start the year.” (Chemical Products)
“January revenue target slightly lower following a big December shipment month.” (Computer & Electronic Products)
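Putting it together, a minimal end-to-end sketch (assuming the page structure from the question still holds) that grabs every bullet as plain text without knowing the count:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
soup = BeautifulSoup(requests.get(URL).text, 'lxml')

strong_el = soup.find('strong', text='WHAT RESPONDENTS ARE SAYING …')
# find_all returns however many li tags there are, so no hard-coded [9]
for li in strong_el.find_next('ul').find_all('li'):
    print(li.get_text(strip=True))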
I am trying to find a way to count the number of divs with the id "blue". Is this possible in BeautifulSoup? Here is my code:
import BeautifulSoup
scanning = True
soup = BeautifulSoup.BeautifulSoup("<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>")
blues = []
blues.append(soup.find("div", {"id": "blue"}))
print len(blues)
The find method will only fetch the first occurrence, hence the output of 1. If you use find_all, it will literally find all of the occurrences, saving the results to a list on your behalf. In this case divs becomes a list of every div with id="blue", and you can check the length of that.
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>", 'html.parser')
divs = soup.find_all("div", {"id": "blue"})
print(len(divs))
If divs are the only elements with the id blue, then you could just use a CSS selector:
divs = soup.select("#blue")
blues = len(divs)
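For reference, select() takes CSS selectors, and it matches every element with that id even though duplicate ids are technically invalid HTML. A quick check against the markup from the question:

from bs4 import BeautifulSoup

html = "<html><body><div id='blue'></div><div id='blue'></div><div id='purple'></div></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select("div#blue")))  # 2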
I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages within a set of pages.
Here is an example starting page.
There are 29 subpages within that leading page; ideally the function would therefore return 29.
By subpage I mean page 1 of 29, page 2 of 29, etc.
This is the HTML snippet which contains the last page information, from the link posted above.
<div id="paging-wrapper-btm" class="paging-wrapper">
<ol class="page-nos"><li><span class="selected">1</span></li><li><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</a></li></ol>
I have the following code, which will find all ol tags, but I can't figure out how to access the contents contained within each 'a'.
a = soup.find_all('ol')
b = [x['a'] for x in a] <-- this part returns an error.
< further processing >
Any help/suggestions much appreciated.
Ah.. I found a simple solution.
for item in soup.select("ol a"):
    x = item.text
    print(x)
I can then sort and select the largest number.
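A sketch of that "sort and select the largest" step, assuming the numeric page links are the digit-only entries in the list:

# keep only the link texts that are pure digits, then take the maximum
page_numbers = [int(item.text) for item in soup.select("ol a")
                if item.text.strip().isdigit()]
if page_numbers:
    print(max(page_numbers))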
Try this:
ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols]  # finds all a's inside each ol in the ols list

all_as = []
for a in list_of_as:  # expand each sublist of a's and put them all in one list
    all_as.extend(a)

print(all_as)
The following would extract the last page number:
from bs4 import BeautifulSoup
import requests

html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text, 'html.parser')

ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
last_page = pages[-2]
print(last_page)
The [-2] index skips the trailing 'Weiter »' link, so for your website this will display:
30
I'm trying to do a massive data accumulation on college basketball teams. This link: https://www.teamrankings.com/ncb/stats/ has a TON of team stats.
I have tried to write a script that scans all the desired links (all Team Stats) from this page, finds the rank of the specified team (an input), then returns the sum of that teams ranks from all links.
I graciously found this: https://gist.github.com/phillipsm/404780e419c49a5b62a8
...which is GREAT!
But I must have something wrong because I'm getting 0
Here's my code:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []
for table_row in soup.select(".expand-section li"):
    table_cells = table_row.findAll('li')
    if len(table_cells) > 0:
        link = table_cells[0].find('a')['href']
        stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text)
    team_rows = soup.select(".tr-table datatable scrollable dataTable no-footer tr")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

print(total_rank)
Check out that link to double-check that I have the correct class specified. I have a feeling the problem might be in the first for loop, where I select an li tag and then select all li tags within that first tag, but I'm not sure.
I don't normally use Python, so I'm unfamiliar with its debugging tools. If anyone wants to point me to one of those, that would be great!
First, the team stats and player stats sections are contained in a div with class='column large-2'. The team stats are in the first occurrence. Then you can find all of the tags with an href attribute within it. I've combined both in a one-liner.
teamstats = soup(class_='column large-2')[0].find_all(href=True)
The teamstats list contains all of the 'a' tags. Use a list comprehension to extract the links. A few of the hrefs contained "#" (part of navigation links) so I excluded them.
links = [a['href'] for a in teamstats if a['href'] != '#']
Here is a sample of output:
links
Out[84]:
['/ncaa-basketball/stat/points-per-game',
'/ncaa-basketball/stat/average-scoring-margin',
'/ncaa-basketball/stat/offensive-efficiency',
'/ncaa-basketball/stat/floor-percentage',
'/ncaa-basketball/stat/1st-half-points-per-game',
 ...]
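From there, a hedged sketch of the summing step built on the links list above; the hrefs are relative, so the domain prefix is needed, and the 'tr-table' class and cell order on the stat pages are assumptions on my part:

import requests
from bs4 import BeautifulSoup

total_rank = 0
for link in links:
    r = requests.get('https://www.teamrankings.com' + link)
    page = BeautifulSoup(r.text, 'html.parser')
    for row in page.select('table.tr-table tr'):
        cells = row.find_all('td')
        if len(cells) > 1 and cells[1].get_text(strip=True) == 'Oklahoma':
            total_rank += int(cells[0].get_text(strip=True))  # rank comes back as text

print(total_rank)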
I ran your code on my machine, and the line table_cells = table_row.findAll('li') always returns an empty list, so stat_links ends up being an empty list; therefore the iteration over stat_links never gets carried out and total_rank never gets incremented. I suggest you fiddle around with the way you find all the list elements.