I have a script that parses the references section of a Wikipedia article. It currently returns the URL of every item in the references section.
I'm trying to get it to output both the link (which it already does) and the text of the link, either on a single line:
https://this.is.the.url "And this is the article header"
or over consecutive lines:
https://this.is.the.url
"And this is the article header"
Link Sample
<a
rel="nofollow"
class="external text"
href="https://www.mmajunkie.usatoday.com/2020/08/gerald-meerschaert-tests-positive-covid-19-ed-herman-fight-off-ufc-on-espn-plus-31/amp">
"Gerald Meerschaert tests positive for COVID-19; Ed Herman fight off UFC on ESPN+ 31"
</a>
Scraper
import requests
import sys
from bs4 import BeautifulSoup
session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"
if "wikipedia" in selectWikiPage:
    html = session.post(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    links = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    title = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    for link in links:
        print(link)
else:
    print("Error: Please enter a valid Wikipedia URL")
Fixed it:
import requests
import sys
from bs4 import BeautifulSoup
session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"
if "wikipedia" in selectWikiPage:
    # Fetch the page with GET; simply loading a page does not need POST
    html = session.get(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    for a in href.find_all("a", class_="external text", href=True):
        # Pair each URL with the visible text of its link
        listitem = [a["href"], a.getText()]
        print(listitem)
else:
    print("Error: Please enter a valid Wikipedia URL")
Instead of only getting the href attribute of the anchor tag you can also get the text of the link.
This can be done simply by
links = [(a["href"], a.text)
         for a in href.find_all("a", class_="external text", href=True)]
for link, title in links:
    print(link, title)
Now each element of links is a tuple containing the link and its title.
You can now display it however you want.
a.text can also be written as a.getText() or a.get_text(), so choose whichever suits your code style.
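For reference, here is a minimal offline sketch of the same idea that prints each pair in the single-line format from the question. The inline HTML and the extract_links helper are assumptions made for illustration; a real Wikipedia page may be structured differently.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the link in the question (assumed structure).
html = """
<ol class="references">
  <li><a rel="nofollow" class="external text"
         href="https://this.is.the.url">And this is the article header</a></li>
</ol>
"""


def extract_links(markup):
    """Return (url, text) pairs for external links inside the references list."""
    soup = BeautifulSoup(markup, "html.parser")
    references = soup.find("ol", class_="references")
    if references is None:
        # No references list on the page: nothing to report.
        return []
    return [(a["href"], a.get_text(strip=True))
            for a in references.find_all("a", class_="external text", href=True)]


pairs = extract_links(html)
for url, text in pairs:
    # One line per reference: the URL followed by the quoted link text.
    print(f'{url} "{text}"')
```

Searching inside the references element directly also avoids re-parsing str(references) into a second soup.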
Related
I am looking to download the "Latest File" from the URL below:
https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product
The file I want to download is at the following location:
https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product/sep-2022#data-downloads
For example, the file name is "Table 1".
How can I download this with BeautifulSoup when I am only given the base URL above?
I am unable to figure out how to work through the nested URLs within the HTML page to find the one I need to download.
First you need to get the latest link:
latest_link = 'https://www.abs.gov.au/' + soup.find('span', class_='flag_latest').find_previous('a').get('href')
Then find the document to download. My example uses the "download all" link, but you can change it:
download_all_link = 'https://www.abs.gov.au/' + soup.find('div', class_='anchor-button-wrapper').find('a').get('href')
The last step is to download it.
FULL CODE:
import requests
from bs4 import BeautifulSoup
url = 'https://www.abs.gov.au/statistics/economy/national-accounts/australian-national-accounts-national-income-expenditure-and-product'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
latest_link = 'https://www.abs.gov.au/' + soup.find('span', class_='flag_latest').find_previous('a').get('href')
response = requests.get(latest_link)
soup = BeautifulSoup(response.text, 'lxml')
download_all_link = 'https://www.abs.gov.au/' + soup.find('div', class_='anchor-button-wrapper').find('a').get('href')
file_data = requests.get(download_all_link).content
with open(download_all_link.split("/")[-1], 'wb') as handler:
    handler.write(file_data)
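One thing to watch in the code above: if the href already starts with a slash, plain string concatenation gives a double slash, and split("/")[-1] would keep any query string in the file name. A small sketch using only the standard library, where the href value and both helper names are hypothetical:

```python
import posixpath
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://www.abs.gov.au/"


def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the site root."""
    return urljoin(base, href)


def local_filename(url, default="download.bin"):
    """Derive a file name from the URL path, ignoring query and fragment."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default


# Hypothetical href, standing in for whatever the page actually contains.
href = "/statistics/economy/example/sep-2022/All_time_series_workbooks.zip?v=1"

full_url = absolute_url(href)
print(full_url)
print(local_filename(full_url))
```

urljoin also leaves an already absolute href untouched, so the same helper works for both kinds of link.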
I've never used BeautifulSoup before. Pretty cool stuff. This seems to do it for me:
from bs4 import BeautifulSoup
with open("demo.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
# let's look for the span with the 'flag_latest' class attribute
for span in soup.find_all('span'):
    if span.get('class', None) and 'flag_latest' in span['class']:
        # step up a level to the div and grab the a tag
        print(span.parent.a['href'])
So we just look for the span with the 'flag_latest' class and then step up a level in the tree (a div) and then grab the first a tag and extract the href.
Check out the docs and read the sections on "Navigating the Tree" and "Searching the Tree"
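As a rough illustration of both navigation styles used in the answers (stepping up to the parent, or walking backwards with find_previous), here is a sketch on made-up markup. The HTML is an assumption; the real page may nest the span and the link differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each release sits in a div, and the newest one
# carries a span with the 'flag_latest' class.
html = """
<div class="release"><a href="/stats/sep-2022">September 2022</a>
  <span class="flag_latest">Latest release</span></div>
<div class="release"><a href="/stats/jun-2022">June 2022</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches a single class directly, so no manual loop over spans is needed.
flag = soup.find("span", class_="flag_latest")

via_parent = None
via_previous = None
if flag is not None:
    # Step up to the enclosing div and take its first <a> tag.
    via_parent = flag.parent.a["href"]
    # Or walk backwards in document order to the nearest preceding <a> tag.
    via_previous = flag.find_previous("a")["href"]

print(via_parent, via_previous)
```

Which of the two is more robust depends on the page: parent navigation assumes the link shares a container with the flag, while find_previous only assumes the link comes before it.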
I want to find every hyperlink whose text includes "article" on https://www.geeksforgeeks.org/
For example, at the bottom of this webpage:
Write an Article
Improve an Article
I want to collect all of these hyperlinks and print them, so I tried:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import os
import re
url = 'https://www.geeksforgeeks.org/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
links = []
for link in soup.findAll('a',href = True):
    #print(link.get("href")
    if re.search('/article$', href):
        links.append(link.get("href"))
However, I get [] as the result. How can I solve this?
Here is something you can try.
Note that the page you provided has more links with the text "article" than the two you listed, but this shows how you can deal with it.
In this case I just check whether the word "article" is in the text of the tag. You could use a regex search there, but for this example it is overkill.
import requests
from bs4 import BeautifulSoup
url = 'https://www.geeksforgeeks.org/'
res = requests.get(url)
if res.status_code != 200:
    # Report a failed request instead of silently evaluating a bare string
    print('request failed')
soup = BeautifulSoup(res.content, "html.parser")
links_with_article = soup.find_all(lambda tag: tag.name == "a" and "article" in tag.text.lower())
EDIT:
If you know that there is a word in the href, i.e. in the link itself:
soup.select("a[href*=article]")
This searches for the word "article" in the href of every a element.
Edit: get only href:
hrefs = [link.get('href') for link in links_with_article]
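To see how the two approaches differ, here is a small offline sketch. The anchors are invented for illustration, and the names by_text and by_href are not from the answer.

```python
from bs4 import BeautifulSoup

# Invented anchors: two mention "article" in the text, two in the href.
html = """
<a href="/write-an-article">Write an Article</a>
<a href="/improve">Improve an Article</a>
<a href="/articles-list">Browse</a>
<a href="/jobs">Jobs</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Match on the visible link text, case-insensitively.
by_text = [a["href"] for a in soup.find_all("a", href=True)
           if "article" in a.get_text().lower()]

# Match on the href attribute itself with a CSS substring selector.
by_href = [a["href"] for a in soup.select('a[href*="article"]')]

print(by_text)
print(by_href)
```

The two lists overlap only where both the text and the URL contain the word, so pick the test that matches what you actually want to find.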
I'm web scraping the Monster job site with the search aimed at "Software Developer" and my aim is to simply print out only the jobs that have "python" listed in their description in the Python terminal, while discarding all the other jobs for Java, HTML, CSS etc. However when I run this code I end up printing all the jobs on the page.
To solve this I created a variable (called 'search') that searches for all jobs with 'python' and converts it to lowercase. Also I created a variable (called 'python_jobs') that includes all the job listings on the page.
Then I made a "for" loop that looks for every instance where 'search' is found in 'python_jobs'. However this gives the same result as before and prints out every job listing on the page anyways. Any suggestions?
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.com/jobs/search/?q=Software-Developer"
page = requests.get(URL)
print(page)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")
search = results.find_all("h2", string=lambda text: "python" in text.lower())
python_jobs = results.find_all("section", class_="card-content")
print(len(search))
for search in python_jobs:
    title = search.find("h2", class_="title")
    company = search.find("div", class_="company")
    if None in (title, company):
        continue
    print(title.text.strip())
    print(company.text.strip())
    print()
Your problem is that you have two separate lists, search and python_jobs, which are not related to each other, and you never use the search list afterwards (the loop variable even overwrites it). Instead, take each item from python_jobs and search for "python" inside that item.
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.com/jobs/search/?q=Software-Developer"
page = requests.get(URL)
print(page)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")
all_jobs = results.find_all("section", class_="card-content")
for job in all_jobs:
    # Guard against h2 tags whose .string is None (for example, tags with nested children)
    python = job.find("h2", string=lambda text: text and "python" in text.lower())
    if python:
        title = job.find("h2", class_="title")
        company = job.find("div", class_="company")
        print(title.text.strip())
        print(company.text.strip())
        print()
or
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.com/jobs/search/?q=Software-Developer"
page = requests.get(URL)
print(page)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")
all_jobs = results.find_all("section", class_="card-content")
for job in all_jobs:
    title = job.find("h2")
    if title:
        title = title.text.strip()
        if 'python' in title.lower():
            company = job.find("div", class_="company").text.strip()
            print(title)
            print(company)
            print()
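The per-card filtering can be checked offline with a sketch like the following. The markup only imitates the structure used in the question (it is not Monster's real HTML), and python_jobs here is a hypothetical helper function.

```python
from bs4 import BeautifulSoup

# Imitation of the structure from the question; not the site's real markup.
html = """
<div id="ResultsContainer">
  <section class="card-content">
    <h2 class="title"><a href="#">Senior Python Developer</a></h2>
    <div class="company">Acme</div>
  </section>
  <section class="card-content">
    <h2 class="title"><a href="#">Java Engineer</a></h2>
    <div class="company">Globex</div>
  </section>
  <section class="card-content"><div class="company">No title here</div></section>
</div>
"""


def python_jobs(markup):
    """Return (title, company) pairs for cards whose title mentions python."""
    soup = BeautifulSoup(markup, "html.parser")
    results = soup.find(id="ResultsContainer")
    jobs = []
    if results is None:
        return jobs
    for card in results.find_all("section", class_="card-content"):
        title = card.find("h2", class_="title")
        company = card.find("div", class_="company")
        if title is None or company is None:
            # Skip incomplete cards such as ads or placeholders.
            continue
        title_text = title.get_text(strip=True)
        if "python" in title_text.lower():
            jobs.append((title_text, company.get_text(strip=True)))
    return jobs


for title, company in python_jobs(html):
    print(title)
    print(company)
    print()
```

Using get_text() on the h2 works even when the title is wrapped in a nested a tag.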
I'm trying to loop over the a tags and get each URL. I've managed to extract the href, but I need the full URL to follow the link. This is my code at the moment:
import requests
from bs4 import BeautifulSoup
webpage_response = requests.get('http://www.harness.org.au/racing/results/?activeTab=tab')
webpage_response.content
webpage_response = requests.get
soup = BeautifulSoup(webpage, "html.parser")
#only finding one track
#soup.table to find all links for days racing
harness_table = soup.table
#scraps a href that is an incomplete URL that im trying to get to
for link in soup.select(".meetingText > a"):
    link.insert(0, "http://www.harness.org.au")
    webpage = requests.get(link)
    new_soup = BeautifulSoup(webpage.content, "html.parser")
    #work through table to get links to tracks
    print(new_soup)
You can store the base URL of the website in a variable. Once you get the href from a link, join the two to create the next URL.
import requests
from bs4 import BeautifulSoup
base_url = "http://www.harness.org.au"
webpage_response = requests.get('http://www.harness.org.au/racing/results/?activeTab=tab')
soup = BeautifulSoup(webpage_response.content, "html.parser")
# only finding one track
# soup.table to find all links for days racing
harness_table = soup.table
# scraps a href that is an incomplete URL that im trying to get to
for link in soup.select(".meetingText > a"):
    webpage = requests.get(base_url + link["href"])
    new_soup = BeautifulSoup(webpage.content, "html.parser")
    # work through table to get links to tracks
    print(new_soup)
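If some hrefs on the page turn out to be absolute already, base_url + href would produce a broken URL. A hedged alternative is urllib.parse.urljoin, sketched here on invented markup. The table rows and track names are assumptions; the mc codes are borrowed from the result list in the other answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "http://www.harness.org.au/racing/results/?activeTab=tab"

# Invented rows: one relative href and one that is already absolute.
html = """
<table>
  <tr><td class="meetingText"><a href="/racing/fields/race-fields/?mc=BA040320">Track A</a></td></tr>
  <tr><td class="meetingText"><a href="http://www.harness.org.au/racing/fields/race-fields/?mc=BH040320">Track B</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# urljoin resolves relative hrefs against the page URL and leaves absolute ones alone.
urls = [urljoin(PAGE_URL, a["href"]) for a in soup.select(".meetingText > a[href]")]

for url in urls:
    print(url)
```

Each resulting URL can then be passed to requests.get() in the loop, as in the answer above.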
Try this solution. Maybe you'll like this library.
from simplified_scrapy import SimplifiedDoc,req
url = 'http://www.harness.org.au/racing/results/?activeTab=tab'
html = req.get(url)
doc = SimplifiedDoc(html)
links = [doc.absoluteUrl(url,ele.a['href']) for ele in doc.selects('td.meetingText')]
print(links)
Result:
['http://www.harness.org.au/racing/fields/race-fields/?mc=BA040320', 'http://www.harness.org.au/racing/fields/race-fields/?mc=BH040320', 'http://www.harness.org.au/racing/fields/race-fields/?mc=RE040320']
I'm trying to extract the title of a link using BeautifulSoup. The code that I'm working with is as follows:
url = "http://www.example.com"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "lxml")
for link in soup.findAll('a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'}):
    title = link.get('title')
    print(title)
Now, an example link element contains the following:
<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>
However, nothing gets displayed after I run the above code. How can I extract the value stored inside the title attribute of the anchor tag stored in link?
It seems you have put two spaces between s-access-detail-page and a-text-normal, so the search does not find any matching link. Try it with the correct number of spaces, then print the number of links found. You can also print the tag itself with print(link).
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.in/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=python"
source_code = requests.get(url)
plain_text = source_code.content
soup = BeautifulSoup(plain_text, "lxml")
links = soup.find_all('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})
print(len(links))
for link in links:
    title = link.get('title')
    print(title)
You are searching for an exact string here, by using multiple classes. In that case the class string has to match exactly, with single spaces.
See the Searching by CSS class section in the documentation:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
You'd have a better time searching for individual classes:
soup.find_all('a', class_='a-link-normal')
If you must match more than one class, use a CSS selector:
soup.select('a.a-link-normal.s-access-detail-page.a-text-normal')
and it won't matter in what order you list the classes.
Demo:
>>> from bs4 import BeautifulSoup
>>> plain_text = u'<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>'
>>> soup = BeautifulSoup(plain_text, "html.parser")
>>> for link in soup.find_all('a', class_='a-link-normal'):
...     print(link.text)
...
Introduction To Computation And Programming Using Python
>>> for link in soup.select('a.a-link-normal.s-access-detail-page.a-text-normal'):
...     print(link.text)
...
Introduction To Computation And Programming Using Python
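Since the question asks for the title attribute rather than the link text, here is a short Python 3 sketch of the same point. The anchor below is a shortened, reordered variant of the one in the question, made up to show that the CSS selector does not care about class order.

```python
from bs4 import BeautifulSoup

# Same classes as in the question, deliberately listed in a different order.
html = ('<a class="a-text-normal a-link-normal s-access-detail-page" href="#" '
        'title="Introduction To Computation And Programming Using Python">'
        '<h2>Introduction</h2></a>')

soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: fails because the order differs.
exact = soup.find_all("a", class_="a-link-normal s-access-detail-page a-text-normal")

# CSS selector: requires all three classes, in any order.
by_selector = soup.select("a.a-link-normal.s-access-detail-page.a-text-normal")

titles = [a.get("title") for a in by_selector]
print(len(exact), titles)
```

link.get('title') returns None when the attribute is missing, so it is safe to call on every matched tag.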