Printing all occurrences of certain document elements of a webpage - python

I was scraping this webpage, https://www.zomato.com/srijata , for all the "restaurant reviews" (not the self-comments on her own reviews) posted by user "Sri".
zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)
soup = BeautifulSoup(open('zomato_info.html'))
soup.find('div','mtop0 rev-text').text
This prints her first restaurant review, i.e. "Sri reviewed Big Straw - Chew On This", as:
u'Rated This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol..'
I also tried another selector :-
I have this question:
How can I print the next restaurant review? I tried findNextSiblings and similar methods, but none of them seem to work.

First of all, you don't need to write the output to a file; pass the result of the urlopen() call directly to the BeautifulSoup constructor.
To get the reviews, iterate over all div tags with class rev-text and get the .next_sibling of the div element inside each one:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('https://www.zomato.com/srijata'))
for div in soup.find_all('div', class_='rev-text'):
    print div.div.next_sibling
Prints:
This is situated right in the heart of the city. The items on the menu are alright and I really had to compromise for bubble tea. The tapioca was not fresh. But the latte and the soda pop my friends tried was good. Another issue which I faced was mosquitos... They almost had me.. Lol..
The ambience is good. The food quality is good. I Didn't find anything to complain. I wanted to visit the place fir a very long time and had dinner today. The meals are very good and if u want the better quality compared to other Andhra restaurants then this is the place. It's far better than nandhana. The staffs are very polite too.
...
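To see why .next_sibling returns only the review body, here is a minimal Python 3 sketch run against a hand-written fragment. The markup is an assumption that mimics the structure implied above (a nested rating div followed by the bare review text); it is not copied from the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each review div holds a rating div, then the review text.
html = """
<div class="mtop0 rev-text"><div class="rating">Rated</div>The tapioca was not fresh.</div>
<div class="rev-text"><div class="rating">Rated</div>The ambience is good.</div>
"""

soup = BeautifulSoup(html, "html.parser")

reviews = []
for div in soup.find_all("div", class_="rev-text"):
    # div.div is the first nested <div>; its next sibling is the bare review text.
    reviews.append(div.div.next_sibling.strip())

for review in reviews:
    print(review)
```

Note that class_="rev-text" also matches the first div, whose class attribute is "mtop0 rev-text", because class filters are tested against each class name individually.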

You should make a for loop and use find_all instead of find:
zomato_ind = urllib2.urlopen('https://www.zomato.com/srijata')
zomato_info = zomato_ind.read()
open('zomato_info.html', 'w').write(zomato_info)
soup = BeautifulSoup(open('zomato_info.html'))
for div in soup.find_all('div','rev-text'):
    print div.text
Also, one question: why are you saving the HTML to a file and then reading the file into a soup object?
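For reference, the intermediate file should not be required, since BeautifulSoup accepts any object with a .read() method. A small Python 3 sketch, in which an io.BytesIO object stands in for the urlopen() response and the markup is made up for illustration:

```python
import io
from bs4 import BeautifulSoup

# Stand-in for the object returned by urlopen(): anything with .read() works.
fake_response = io.BytesIO(
    b"<div class='rev-text'>First</div><div class='rev-text'>Second</div>"
)

soup = BeautifulSoup(fake_response, "html.parser")

# find() stops at the first match; find_all() returns every match.
first = soup.find("div", "rev-text").text
all_reviews = [div.text for div in soup.find_all("div", "rev-text")]

print(first)
print(all_reviews)
```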

Can I search multiple HTML elements within the soup.find_all() function?

I'm trying to scrape a website for most viewed headlines. The class selector of the text I want shares common words with other items on the page. For example, I want the text inside the <a> tag with class "black_color". Other items use the same tag and have the class "color_black hover_color_gray_90", and I don't want these included. I was thinking I could use more HTML elements to be more specific, but I'm not sure how to incorporate them.
import requests
from bs4 import BeautifulSoup
def getHeadlines():
    url = "https://www.bostonglobe.com/"
    source_code = requests.get(url)
    plainText = source_code.text
    soup = BeautifulSoup(plainText, "html.parser")
    #results = soup.find_all("h2",{"class":"headline"})
    results = soup.find_all("a",{"class":"black_color"})
    # Write each matched link's text to the file, separated by a blank line
    with open("headlines.txt", "w", encoding="utf-8") as f:
        for i in results:
            f.write(str(i.text + ' \n' + '\n'))
getHeadlines()
I think looking at the <a> tag may actually be harder than using the matching <h2>, which has a 'headline' class.
Try this:
soup = BeautifulSoup(source_code.text, "html.parser")
for headline in soup.find_all("h2", class_="headline"):
    print(headline.text)
Output:
Boston College outbreak worries epidemiologists, students, community
Nantucket finds ‘community spread’ of COVID-19 among tradespeople The town’s Select Board and Board of Health will hold an emergency meeting at 10 a.m. Monday to consider placing restrictions on some of those trades, officials said Friday.
Weddings in a pandemic: Welcome to the anxiety vortexNewlyweds are spending their honeymoons praying they don’t hear from COVID‐19 contact tracers. Relatives are agonizing over “damned if we RSVP yes, damned if we RSVP no” decisions. Wedding planners are adding contract clauses specifying they’ll walk off the job if social distancing rules are violated.
Fauci says US should plan to ‘hunker down’ for fall and winter
Of struggling area, Walsh says, ‘We have to get it better under control’
...
After looking at it for a while, I think the <a> tag may actually have some classes added dynamically, which BeautifulSoup will not pick up. Just searching for the color_black class and excluding the color_black hover_color_gray_90 class yields the headline classes that you don't want (e.g. 'Sports'), even though when I look at the actual web page source code, I see it's differentiated in the way you've indicated.
That's usually a good sign that there are post-load CSS changes being made to a page. (I may be wrong about that, but in either case I hope the <h2> approach gets you what you need.)
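If the classes are present in the downloaded HTML, one way to keep only links whose class attribute is exactly one class is to pass a function to find_all(). This is a Python 3 sketch on made-up markup: the class names follow the discussion above, but the HTML itself is hypothetical. It cannot help if the classes are only added after page load.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may differ.
html = """
<a class="color_black" href="/a">Headline one</a>
<a class="color_black hover_color_gray_90" href="/sports">Sports</a>
<a class="color_black" href="/b">Headline two</a>
"""

soup = BeautifulSoup(html, "html.parser")

def has_only_color_black(tag):
    # tag.get("class") is a list of class names, or None if the attribute is absent.
    return tag.name == "a" and tag.get("class") == ["color_black"]

headlines = [a.get_text(strip=True) for a in soup.find_all(has_only_color_black)]
print(headlines)
```

A plain class_="color_black" filter would match all three links, because it is tested against each class name individually.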

How to scrape text in title attribute using beautiful soup?

I am scraping this website using beautiful soup and I need to get the product name completely.
When I use the h2 tag, I end up getting names such as "NIVEA Soft Light Moisturizing Cream Berry Blossom Fragrance ...".
I don't want these dots at the end, just the complete name.
Here is my code snippet for scraping the data:
div_soup=data_soup.findAll('div',{'class':'product-list-box card desktop-cart'})
table_rows=[]
for div in div_soup:
    current_row=[]
    product_name=div.findAll('h2',{})
    product_price=div.findAll('span',{'class':'post-card__content-price-offer'})
    for idx,data in enumerate(product_name):
        current_row.append(data.text)
    for idx,data in enumerate(product_price):
        current_row.append(data.text)
    table_rows.append(current_row)
I can't figure out the appropriate tag to use and also if I should pass something in the dictionary.
URL of the website I am scraping: https://www.nykaa.com/skin/moisturizers/face-moisturizer-day-cream/c/8394?root=nav_3
for idx,data in enumerate(product_name):
    if data.get('title') is not None:
        current_row.append(data['title'])
That should do what you want.
It might also be best to refactor your code as
product_name=div.find('h2', {'title': True}).get('title')
That way you just look for an h2 tag with a title attribute and can avoid the for loop.
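Putting the title lookup together, here is a self-contained Python 3 sketch. The product card, product name and price are invented for illustration, and the assumption is that the full name lives in the h2 tag's title attribute while the visible text is truncated.

```python
from bs4 import BeautifulSoup

# Hypothetical product card: visible text is truncated, the title attribute is not.
html = """
<div class="product-list-box card desktop-cart">
  <h2 title="Example Day Cream With A Long Name">Example Day Cream ...</h2>
  <span class="post-card__content-price-offer">499</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.find_all("div", class_="product-list-box"):
    heading = card.find("h2", title=True)  # only <h2> tags that carry a title attribute
    price = card.find("span", class_="post-card__content-price-offer")
    # Use None instead of raising when either element is missing.
    name = heading["title"] if heading else None
    rows.append([name, price.get_text(strip=True) if price else None])

print(rows)
```

Guarding against a missing heading avoids the AttributeError that a chained .find(...).get('title') raises when find() returns None.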

scraping part of website in python

I want to scrape part of this website, http://warframe.wikia.com/wiki/ , especially the description. The command should reply with
"Equipped with a diverse selection of arrows, deft hands, shroud of stealth and an exalted bow, the crafty Ivara infiltrates hostile territory with deception and diversion, and eliminates threats with a shot from beyond. Ivara emerged in Update 18.0."
and nothing else. Maybe it would be useful to select the specific <p> I want to print.
For now I have this, but it doesn't return what I want.
import requests
from bs4 import BeautifulSoup
req = requests.get('http://warframe.wikia.com/wiki/Ivara')
soup = BeautifulSoup(req.text, "lxml")
for sub_heading in soup.find_all('p'):
    print(sub_heading.text)
You can use the index of the target paragraph to get the required text:
print(soup.select('p')[4].text.strip())
or fetch by previous paragraph with text "Release Date:":
print(soup.findAll('b', text="Release Date:")[0].parent.next_sibling.text.strip())
Using the solution @Andersson provided (it won't work for all heroes, as not every one has a release date) and @SIM's comment, I'm giving you a generalized solution for any hero/champion (or whatever you call them in that game).
name = 'Ivara'
url = 'http://warframe.wikia.com/wiki/' + name
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
main_text = soup.find('div', class_='tabbertab')
print(main_text.find('b', text=name).parent.text.strip())
Output:
Equipped with a diverse selection of arrows, deft hands, shroud of stealth and an exalted bow, the crafty Ivara infiltrates hostile territory with deception and diversion, and eliminates threats with a shot from beyond. Ivara emerged in Update 18.0.
For other heroes, just change the name variable.
One more example with name = 'Volt', Output:
Volt has the power to wield and bend electricity. He is highly versatile, armed with powerful abilities that can damage enemies, provide cover and complement the ranged and melee combat of his cell. The electrical nature of his abilities make him highly effective against the Corpus, with their robots in particular. He is one of three starter options for new players.
Explanation:
If you inspect the page, you can see that there is only 1 <b>hero-name</b> inside the <div class="tabbertab" ... > tag. So you can use the <b>...</b> to find the text you need.
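The same idea can be tried offline on a small fragment. In this Python 3 sketch the HTML, the describe() helper and the sample sentence are all hypothetical. It uses the string= argument, the newer spelling of text= in Beautiful Soup, and returns None when the name is not found instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a single <b>name</b> inside the tabbertab div.
html = """
<div class="tabbertab" title="Main">
  <p>Some unrelated paragraph.</p>
  <p>The crafty <b>Ivara</b> strikes from the shadows.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

def describe(soup, name):
    """Return the text of the paragraph whose <b> tag contains exactly `name`."""
    tab = soup.find("div", class_="tabbertab")
    bold = tab.find("b", string=name) if tab else None
    return bold.parent.get_text().strip() if bold else None

print(describe(soup, "Ivara"))
print(describe(soup, "Volt"))
```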

How to select a specific tag within this html?

How would I select all of the titles in this page
http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext
for example: I'm trying to get all the lines similar to this:
AFAS C1001 Introduction to African-American Studies. 3 points.
main_page is iterating through all of the school classes from here so I can grab all of the titles like above:
http://bulletin.columbia.edu/columbia-college/departments-instruction/
for page in main_page:
    sub_abbrev = page.find("div", {"class": "courseblock"})
I have this code but I can't figure out exactly how to select all of the ('strong') tags of the first child.
Using latest python and beautiful soup 4 to web-scrape.
Lmk if there is anything else that is needed.
Thanks
Iterate over elements with courseblock class, then, for every course, get the element with courseblocktitle class. Working example using select() and select_one() methods:
import requests
from bs4 import BeautifulSoup
url = "http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
for course in soup.select(".courseblock"):
    title = course.select_one("p.courseblocktitle").get_text(strip=True)
    print(title)
Prints:
AFAS C1001 Introduction to African-American Studies.3 points.
AFAS W3030 African-American Music.3 points.
AFAS C3930 (Section 3) Topics in the Black Experience: Concepts of Race and Racism.4 points.
AFAS C3936 Black Intellectuals Seminar.4 points.
AFAS W4031 Protest Music and Popular Culture.3 points.
AFAS W4032 Image and Identity in Contemporary Advertising.4 points.
AFAS W4035 Criminal Justice and the Carceral State in the 20th Century United States.4 points.
AFAS W4037 (Section 1) Third World Studies.4 points.
AFAS W4039 Afro-Latin America.4 points.
A good follow-up question from @double_j:
In the OP's example, he has a space between the points. How would you keep that? That's how the data shows on the site, even though it's not really in the source code.
I thought about using the separator argument of the get_text() method, but that would also add an extra space before the last dot. Instead, I would join the strong element texts via str.join():
for course in soup.select(".courseblock"):
    title = " ".join(strong.get_text() for strong in course.select("p.courseblocktitle > strong"))
    print(title)
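The difference between the two approaches can be checked on a small fragment. This Python 3 sketch assumes the title and the points sit in adjacent strong tags with no whitespace between them; the real page's markup may be different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two adjacent <strong> tags with no space between them.
html = """
<div class="courseblock">
  <p class="courseblocktitle"><strong>AFAS C1001 Introduction to African-American Studies.</strong><strong>3 points.</strong></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
title_tag = soup.select_one(".courseblock p.courseblocktitle")

# get_text(strip=True) concatenates the pieces with nothing in between.
squashed = title_tag.get_text(strip=True)

# Joining the <strong> texts explicitly puts one space between them.
joined = " ".join(s.get_text() for s in title_tag.select("strong"))

print(squashed)
print(joined)
```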

How can I extract all text between tags?

I would like to extract a random poem from this book.
Using BeautifulSoup, I have been able to find the title and prose.
print soup.find('div', class_="pre_poem").text
print soup.find('table', class_="poem").text
But I would like to find all the poems and pick one.
Should I use a regex and match all between
<h3> and </span></p> ?
Use an HTML document parser instead. It's safer in terms of unintended consequences.
The reason programmers discourage parsing HTML with regex is that the HTML markup of a page is not static, especially if your source HTML is a webpage. Regex is better suited to plain strings.
Use regex at your own risk.
Assuming you already have a suitable soup object to work with, the following might help you get started:
import random
poem_ids = []
# Collect the href of every entry in the table of contents
for section in soup.find_all('ol', class_="TOC"):
    poem_ids.extend(li.find('a').get('href') for li in section.find_all('li'))
# Strip the leading '#' from each href and skip the last entry
poem_ids = [id[1:] for id in poem_ids[:-1] if id]
poem_id = random.choice(poem_ids)
poem_start = soup.find('a', id=poem_id)
poem = poem_start.find_next()
poem_text = []
# Walk forward through the document until the next <h3> heading
while True:
    poem = poem.next_element
    if poem.name == 'h3':
        break
    # Text nodes have no tag name
    if poem.name is None:
        poem_text.append(poem.string)
print '\n'.join(poem_text).replace('\n\n\n', '\n')
This first extracts a list of the poems from the table of contents at the top of the page. These contain unique IDs to each of the poems. Next a random ID is chosen and the matching poem is then extracted based on that ID.
For example, if the first poem was selected, you would see the following output:
"The Arrow and the Song," by Longfellow (1807-82), is placed first in
this volume out of respect to a little girl of six years who used to
love to recite it to me. She knew many poems, but this was her
favourite.
I shot an arrow into the air,
It fell to earth, I knew not where;
For, so swiftly it flew, the sight
Could not follow it in its flight.
I breathed a song into the air,
It fell to earth, I knew not where;
For who has sight so keen and strong
That it can follow the flight of song?
Long, long afterward, in an oak
I found the arrow, still unbroke;
And the song, from beginning to end,
I found again in the heart of a friend.
Henry W. Longfellow.
This is done by using BeautifulSoup to extract all of the text from each element until the next <h3> tag is found, and then removing any extra line breaks.
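The "walk forward until the next heading" step can be tried in isolation. Below is a Python 3 sketch on an invented two-poem fragment; the ids, titles and the collect_until_next_heading() helper are hypothetical. Unlike the code above, it starts right at the anchor, so the poem's title is included in the result.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical page: each poem starts at an <h3> containing an anchor with an id.
html = """
<h3><a id="poem1"></a>First Poem</h3>
<p>Line one<br/>Line two</p>
<h3><a id="poem2"></a>Second Poem</h3>
<p>Other line</p>
"""

soup = BeautifulSoup(html, "html.parser")

def collect_until_next_heading(soup, anchor_id):
    """Gather non-empty text nodes after the anchor, stopping at the next <h3>."""
    anchor = soup.find("a", id=anchor_id)
    pieces = []
    for element in anchor.next_elements:
        if getattr(element, "name", None) == "h3":
            break
        if isinstance(element, NavigableString) and element.strip():
            pieces.append(element.strip())
    return pieces

print(collect_until_next_heading(soup, "poem1"))
print(collect_until_next_heading(soup, "poem2"))
```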
