I am downloading some data to help improve my Spanish. On this webpage I am able to download the table of conjugations; however, I can't seem to get the English translation and the box beneath it.
At the top of the page there is a Spanish flag, and to the right of it a Union Jack flag. I'm trying to get the text next to it, which is "laugh; smile; giggle;..."
Beneath that, there's a box with the following values I'm also trying to get:
Infinitivo reír Gerundio riendo Participio Pasado reído
The code I have used to get the other tables is below; I'm not sure how to find the other elements mentioned above.
import requests
from bs4 import BeautifulSoup

URL = 'https://conjugator.reverso.net/conjugation-spanish-verb-reír.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='ch_divSimple')
verb_tbls = results.find_all('ul', class_='wrap-verbs-listing')
You might want to try this:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://conjugator.reverso.net/conjugation-spanish-verb-reír.html')
soup = BeautifulSoup(page.content, 'html.parser')
conjugations = soup.find_all('div', class_='blue-box-wrap')
for form in conjugations:
    print(form.find("p").getText().upper() if form.find("p") else "N/A")
    for row in form.find_all("li"):
        print(row.getText())
    print("-" * 80)
Output:
PRESENTE
yo río
tú ríes
él/ella/Ud. ríe
nosotros reímos
vosotros reís
ellos/ellas/Uds. ríen
--------------------------------------------------------------------------------
FUTURO
yo reiré
tú reirás
él/ella/Ud. reirá
nosotros reiremos
vosotros reiréis
ellos/ellas/Uds. reirán
--------------------------------------------------------------------------------
and so on...
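The Infinitivo / Gerundio / Participio Pasado box from the question should also be in the static HTML. A hedged sketch, reusing the soup object from the script above — the 'word-forms' class name is a guess, so check the page source for the real one:

# 'word-forms' is a guessed class name -- inspect the page to confirm
box = soup.find('div', class_='word-forms')
if box is not None:
    print(box.get_text(' ', strip=True))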
As for the English words, these are generated dynamically and BeautifulSoup won't see them.
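If you do need those translations, one option is to let a real browser render the page and then parse the result. A minimal sketch with Selenium (assumes Chrome with a matching driver; the id of the translation strip is hypothetical, so inspect the live page for the real one):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://conjugator.reverso.net/conjugation-spanish-verb-reír.html')
html = driver.page_source  # HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
translations = soup.find(id='list-translations')  # hypothetical id
if translations is not None:
    print(translations.get_text(' ', strip=True))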
I'm a noob trying to work on a news aggregation mini assignment using Python (in VS Code).
I'm at the step where I'm supposed to grab all the news article URLs from this link: https://indianexpress.com/article/technology/
This is the basic code, which gives me every single link, but I only need the news article URLs.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://indianexpress.com/section/technology/')
bSoup = BeautifulSoup(page.content, 'html.parser')
links_list = bSoup.find_all('a')

for link in links_list:
    if 'href' in link.attrs:
        print(str(link.attrs['href']) + "\n")
You can filter the links so that only the ones containing article/ remain. (Python's list comprehension is my favorite syntax for filtering.)
A possible solution:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://indianexpress.com/section/technology/')
bSoup = BeautifulSoup(page.content, 'html.parser')
links_list = bSoup.find_all('a')

# filter links for articles:
article_links_all = [
    l.get('href') for l in links_list
    if 'href' in l.attrs and 'article/' in l.attrs['href']
]

# get rid of duplicates:
article_links = list(set(article_links_all))

for link in article_links:
    print(link)
Then the output will be:
https://indianexpress.com/article/technology/mobile-tabs/older-apple-ipads-to-get-stage-manager-support-with-ipados-16-1-beta-8177892/
https://indianexpress.com/article/sports/sport-others/in-gujarat-the-national-games-garba-and-the-start-of-election-fever-8179176/
https://indianexpress.com/article/lifestyle/health/covid-may-increase-heart-ailments-if-it-becomes-endemic-expert-8179186/
https://indianexpress.com/article/lifestyle/health/patient-administer-self-cpr-heart-attack-cardiac-arrest-cardiopulmonary-resuscitation-8178675/
https://indianexpress.com/article/technology/tech-news-technology/google-search-new-features-multisearch-food-shopping-google-maps-8179203/
https://indianexpress.com/article/technology/tech-news-technology/erbium-is-a-malware-that-steals-credit-card-details-passwords-and-hacks-crypto-wallets-8177544/
https://indianexpress.com/article/opinion/columns/what-the-rajasthan-political-crisis-highlights-congress-effectively-has-no-high-command-8176760/
https://indianexpress.com/article/cities/thiruvananthapuram/home-ministry-lists-murders-and-attacks-by-pfi-cadres-in-ban-notification-8178553/
https://indianexpress.com/article/technology/tech-news-technology/tech-indepth-understanding-nvidia-dlss-and-dlss-3-8178684/
https://indianexpress.com/article/horoscope/horoscope-today-26-september-2022-check-astrological-prediction-for-scorpio-sagittarius-cancer-aries-and-other-signs-8162319/
https://indianexpress.com/article/what-is/what-is-the-protection-of-children-from-sexual-offences-act-2012/
https://indianexpress.com/article/technology/tech-news-technology/apple-ditches-iphone-production-increase-after-demand-falters-8177169/
https://indianexpress.com/article/technology/science/ancient-armoured-bristly-worm-wufengella-phylum-evolution-lophophore-8178150/
https://indianexpress.com/article/technology/tech-news-technology/intel-and-samsung-display-showcase-slidable-pc-during-intels-innovation-keynote-8177212/
https://indianexpress.com/article/technology/tech-reviews/realme-9i-5g-review-8159696/
https://indianexpress.com/article/sports/cricket/watch-suresh-raina-takes-a-stunner-against-australia-legends-in-road-safety-world-series-semi-8179194/
https://indianexpress.com/article/technology/techook/amazon-flipkart-sales-best-budget-tws-earbuds-to-buy-under-rs-3000-8177816/
https://indianexpress.com/article/entertainment/bollywood/ranbir-kapoor-trade-decodes-the-stars-box-office-journey-the-good-the-bad-the-blockbuster-8176244/
https://indianexpress.com/article/opinion/editorials/after-the-ban-the-political-challenge-posed-by-pfi-still-needs-tackling-8179138/
https://indianexpress.com/article/india/code-of-ethics-for-digital-news-websites-6758543/
https://indianexpress.com/article/sports/cricket/chahar-arshdeep-send-back-half-sa-side-in-15-balls-as-india-secure-victory-in-first-t20i-8179221/
https://indianexpress.com/article/technology/tech-news-technology/amazons-big-september-2022-echo-event-tonight-heres-what-to-expect-8177264/
https://indianexpress.com/article/technology/mobile-tabs/nothing-phone-1-gets-another-camera-centric-update-8178907/
https://indianexpress.com/article/technology/techook/nova-to-niagara-the-top-android-launchers-to-try-out-on-your-smartphone-8175729/
https://indianexpress.com/article/technology/science/italian-space-agency-releases-liciacube-images-of-dart-crash-8177728/
https://indianexpress.com/article/explained/pfi-terrorist-organisation-mha-ban-explained-8177409/
https://indianexpress.com/article/cities/delhi/delhi-ncr-news-live-updates-rains-aap-arvind-kejriwal-bjp-8173016/
https://indianexpress.com/article/technology/tech-reviews/oppo-enco-buds-2-review-8153028/
https://indianexpress.com/article/technology/tech-news-technology/intel-announces-13th-gen-intel-core-processor-heres-everything-you-need-to-know-8177290/
https://indianexpress.com/article/sports/cricket/ind-vs-sa-1st-t20-surya-woke-up-after-he-got-hit-8179164/
https://indianexpress.com/article/technology/tech-reviews/vivo-v25-pro-review-8177339/
https://indianexpress.com/article/trending/trending-in-india/rhinoceros-walks-on-nepal-road-unmindful-of-people-touching-it-8178819/
https://indianexpress.com/article/technology/gaming/netflix-launches-game-handles-for-its-games-all-you-need-to-know-8178368/
https://indianexpress.com/article/india/senior-advocate-r-venkataramani-is-new-attorney-general-of-india-8179072/
https://indianexpress.com/article/cities/delhi/delhi-yamuna-flowing-warning-level-heavy-rain-basin-states-8173217/
https://indianexpress.com/article/technology/tech-news-technology/amazons-september-2022-product-launch-event-announcements-8179069/
https://indianexpress.com/article/technology/tech-reviews/samsung-galaxy-watch-5-pro-review-8151008/
https://indianexpress.com/article/entertainment/movie-review/blonde-movie-review-ana-de-armas-imarilyn-monroe-biopic-netflix-8175733/
https://indianexpress.com/article/business/economy/rupee-value-us-dollar-falls-low-news-8173192/
https://indianexpress.com/article/technology/gadgets/sony-india-festive-sale-deals-discounts-on-audio-headphones-tws-speakers-tvs-8178107/
https://indianexpress.com/article/trending/trending-in-india/women-perform-garba-inside-mumbai-local-train-8178884/
https://indianexpress.com/article/india/jaishankar-sullivan-discuss-bilateral-ties-indo-pacific-8179191/
https://indianexpress.com/article/technology/tech-news-technology/dating-apps-thrive-in-china-but-not-just-for-romance-8177218/
https://indianexpress.com/article/technology/science/nasa-spacex-crew-5-launch-dragon-endurance-falcon-9-8178453/
https://indianexpress.com/article/cities/ahmedabad/two-firs-lodged-over-clashes-between-nsui-abvp-members-8179172/
https://indianexpress.com/article/technology/science/how-about-them-apples-research-orchards-chart-a-fruits-future-8178701/
https://indianexpress.com/article/cities/baroda/borsad-municipality-president-loses-no-confidence-motion-as-bjp-members-defy-party-whip-8179197/
https://indianexpress.com/article/technology/tech-reviews/viewsonic-xg2431-gaming-monitor-review-8140688/
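Note that the section page also links to articles from other sections (sports, lifestyle, opinion, ...), as the output above shows. If you only want the technology stories, you could tighten the substring in the same list comprehension:

# keep only technology articles
article_links_all = [
    l.get('href') for l in links_list
    if 'href' in l.attrs and 'article/technology/' in l.attrs['href']
]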
I want to delete advertisement text from scraped data, but after I decompose it I get an error saying
list index out of range
I think it's because after decompose() there is a blank space or something. Without decompose() the loop works fine.
import requests
from bs4 import BeautifulSoup

url = 'https://www.marketbeat.com/insider-trades/ceo-share-buys-and-sales/'
companyName = 'title-area'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find_all('table')[0].tbody.find_all('tr')

# delete advertisement
soup.find("tr", class_="bottom-sort").decompose()

for el in table:
    print(el.find_all('td')[0].text)
You can use tag.extract() to delete the tag. Also, delete the tag before you find all <tr> tags:
import requests
from bs4 import BeautifulSoup

url = "https://www.marketbeat.com/insider-trades/ceo-share-buys-and-sales/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# delete advertisement
for tr in soup.select("tr.bottom-sort"):
    tr.extract()

table = soup.find_all("table")[0].tbody.find_all("tr")
for el in table:
    print(el.find_all("td")[0].text)
Prints:
...
TZOOTravelzoo
NEOGNeogen Co.
RKTRocket Companies, Inc.
FINWFinWise Bancorp
WMPNWilliam Penn Bancorporation
There is nothing wrong with using decompose(); you only have to pay attention to the order of operations:
# first delete the advertisement
soup.find("tr", class_="bottom-sort").decompose()

# then select the table rows
table = soup.find_all('table')[0].tbody.find_all('tr')
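For completeness, a minimal end-to-end sketch of the decompose() route (same URL as above; find_all is used so it also copes with more than one ad row):

import requests
from bs4 import BeautifulSoup

url = 'https://www.marketbeat.com/insider-trades/ceo-share-buys-and-sales/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# first delete the advertisement row(s)
for tr in soup.find_all('tr', class_='bottom-sort'):
    tr.decompose()

# then select and print the table rows
for el in soup.find_all('table')[0].tbody.find_all('tr'):
    print(el.find_all('td')[0].text)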
So I have been going to a website to get NDC codes (https://ndclist.com/?s=Solifenacin), and I need the 10-digit NDC codes, but the current webpage only shows 8-digit NDC codes, like in the picture below.
So I click on the underlined NDC code and get this webpage.
So I copy and paste these two NDC codes into an Excel sheet, and repeat the process for the rest of the codes on the first webpage I've shown. This process takes a good bit of time, and I was wondering if there is a library in Python that could copy and paste the 10-digit NDC codes for me, or store them in a list that I could print once I'm finished with all the 8-digit NDC codes on the first page. Would BeautifulSoup work, or is there a better library to achieve this?
EDIT <<<<
I actually need to go another level deep, and I've been trying to figure it out but failing. Apparently the last level of webpage is a plain HTML table, and I only need one element of it. Here is the last webpage after you click on the second-level codes.
Here is the code that I have, but it's returning a tr and None objects when I run it.
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        link_url2 = b['href']
        print('Processing link {}...'.format(link_url2))
        soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
        for link in soup3.findAll('tr', limit=7)[1]:
            print(link.name)
            all_data.append(link.name)

print('Trospium')
print(all_data)
Yes, BeautifulSoup is ideal in this case. This script will print all the 10-digit codes from the page:
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for link in soup2.select('#product-packages a'):
        print(link.text)
        all_data.append(link.text)

# In all_data you have all codes, uncomment to print them:
# print(all_data)
Prints:
Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56
0093-5263-98
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56
0093-5264-98
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19
Processing link https://ndclist.com/ndc/27241-037...
27241-037-03
27241-037-09
... and so on.
EDIT: (Version where I get the description too):
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for code, desc in zip(soup2.select('a > h4'), soup2.select('a + p.gi-1x')):
        code = code.get_text(strip=True).split(maxsplit=1)[-1]
        desc = desc.get_text(strip=True).split(maxsplit=2)[-1]
        print(code, desc)
        all_data.append((code, desc))

# in all_data you have all codes:
# print(all_data)
Prints:
Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5263-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5264-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19 90 TABLET, FILM COATED in 1 BOTTLE
...and so on.
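The question's EDIT goes one level deeper and isn't covered above. The tr/None output happens because iterating over a Tag yields all of its children, including plain text nodes, and .name on a text node is None. A possible fix for that innermost loop is to read the cell text explicitly; a sketch under the question's own assumption that the wanted value sits in the second <tr>:

# replaces the innermost loop from the question's code
soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
rows = soup3.find_all('tr', limit=7)
if len(rows) > 1:
    # collect the text of every cell in the second row
    cells = [td.get_text(strip=True) for td in rows[1].find_all('td')]
    print(cells)
    all_data.extend(cells)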
I am trying to scrape the bitcoin price off of Coinbase and cannot find the proper syntax. When I run the program (without the line with question marks) I get the block of HTML that I need, but I don't know how to narrow it down and retrieve the price itself. Any help appreciated, thanks.
import requests
from bs4 import BeautifulSoup

url = 'https://www.coinbase.com/charts'
data = requests.get(url)
nicedata = data.text
soup = BeautifulSoup(nicedata, 'html.parser')
prettysoup = soup.prettify()

bitcoin = soup.find('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
price = bitcoin.find('???')
print(price)
The attached image contains the html
To get the text from an item:
price = bitcoin.text
But this page has many <h4> items with this class, and find() gets only the first one, whose text is Bitcoin, not the price from your image. You may need find_all() to get a list of all the items; then you can use an index [index] or slicing [start:end] to pick some of them, or a for loop to work with every item in the list.
import requests
from bs4 import BeautifulSoup

url = 'https://www.coinbase.com/charts'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

all_h4 = soup.find_all('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
for h4 in all_h4:
    print(h4.text)
It can be easier to work with the data if you keep it in a list of lists, an array, or a DataFrame. To create a list of lists, it is easier to find the rows (<tr>) and then search for <h4> inside every row:
import requests
from bs4 import BeautifulSoup

url = 'https://www.coinbase.com/charts'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

all_tr = soup.find_all('tr')

data = []
for tr in all_tr:
    row = []
    for h4 in tr.find_all('h4'):
        row.append(h4.text)
    if row:  # skip empty rows
        data.append(row)

for row in data:
    print(row)
This way it doesn't need the class to get all the <h4> tags.
BTW: this page uses JavaScript to append new rows as you scroll, but requests and BeautifulSoup can't run JavaScript, so if you need all the rows you may need Selenium to control a web browser that does run JavaScript.
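A minimal sketch of that Selenium route (assumes Chrome with a matching driver on PATH; the five scroll passes and the two-second pause are arbitrary choices):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.coinbase.com/charts')

# scroll a few times so JavaScript appends more rows
for _ in range(5):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the page time to load the new rows

# hand the rendered HTML to BeautifulSoup as before
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for tr in soup.find_all('tr'):
    row = [h4.text for h4 in tr.find_all('h4')]
    if row:
        print(row)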
This is for Python 3.5.x.
What I'm looking for is to find the header; the relevant piece of the HTML is:
<h3 class="title-link__title"><span class="title-link__text">News Here</span>
with urllib.request.urlopen('http://www.bbc.co.uk/news') as r:
    HTML = r.read()

HTML = list(HTML)
for i in range(len(HTML)):
    HTML[i] = chr(HTML[i])
How can I extract just the header, as that's all I need? I'll try and add detail in any way I can.
Fetching information from webpages is called web scraping.
One of the best tools to do this job is the BeautifulSoup library.
from bs4 import BeautifulSoup
import urllib.request

# opening the page
r = urllib.request.urlopen('http://www.bbc.co.uk/news').read()

# creating the soup
soup = BeautifulSoup(r, 'html.parser')

# useful for understanding the layout of your page info
# print(soup.prettify())

# creating a ResultSet with all h3 tags that contain a class named 'title-link__title'
a = soup.find_all("h3", {"class": "title-link__title"})

# counting occurrences
len(a)
# result = 44

# get text of first header
a[0].text
# result = '\nMay v Leadsom to be next UK PM\n'

# get text of second header
a[1].text
# result = '\nVideo shows US police shooting aftermath\n'
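The surrounding newlines can be stripped with get_text(strip=True):

# same header, without the leading/trailing whitespace
a[0].get_text(strip=True)
# result = 'May v Leadsom to be next UK PM'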