Beautiful Soup HREF not found when capitalized - Python

Hi, I have something along the lines of:
from BeautifulSoup import BeautifulSoup as bs
import urllib2
url = 'http://www.blah.com'
soup = bs(urllib2.urlopen(url))
for link in soup.findAll('a', href=True):
    print link
So the problem is that the website uses both href and HREF (capitalized) for its links, and this script only pulls the lowercase href. How would I modify the code to also get the links with HREF?
Thanks
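One possible approach, sticking with the question's BeautifulSoup 3 / urllib2 setup, is to walk every <a> tag and check both attribute spellings. This is only a sketch (it assumes BS3's Tag.get behaves like a dict lookup and returns None for a missing attribute) and is not tested against the real site:

from BeautifulSoup import BeautifulSoup as bs
import urllib2

url = 'http://www.blah.com'
soup = bs(urllib2.urlopen(url))

# Check both attribute spellings on every <a> tag; get() returns
# None when an attribute is absent, so anchors without a link are skipped.
for link in soup.findAll('a'):
    href = link.get('href') or link.get('HREF')
    if href:
        print href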

Related

How can I get href in this a tag?

I want to get the href (https://www.dcard.tw/forum/popular) from https://www.dcard.tw/. This href is under <a href="/forum/popular">.
my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.dcard.tw/"
soup = BeautifulSoup(requests.get(url).text,'lxml')
for link in soup.find_all('a'):
    print(link.get('href'))
You can try the examples from this thread; it has several working options:
How can I get href links from HTML using Python?
Regards,
Alex
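For this specific link, a minimal sketch along those lines (assuming the <a href="/forum/popular"> element is actually present in the HTML that requests receives; parts of Dcard may be rendered by JavaScript) could be:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.dcard.tw/"
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Look the anchor up by its exact href value, then resolve the
# relative path against the base URL.
link = soup.find('a', href="/forum/popular")
if link is not None:
    print(urljoin(url, link['href']))  # https://www.dcard.tw/forum/popular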

Web scraping IMDB with Python's Beautiful Soup

I am trying to parse this page "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1", but I can't find the href that I need (href="/title/tt0068112/episodes?ref_=tt_eps_sm").
I tried with this code:
url="https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
page(requests.get(url)
soup=BeautifulSoup(page.content,"html.parser")
for a in soup.find_all('a'):
print(a['href'])
What's wrong with this? I also tried to check "manually" with print(soup.prettify()), but it seems that link is hidden or something like that.
You can get the page HTML with requests; the href is in there, no need for special APIs. I tried this and it worked:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1")
soup = BeautifulSoup(page.content, "html.parser")
scooby_link = ""
for item in soup.findAll("a", href="/title/tt0068112/episodes?ref_=tt_eps_sm"):
    print(item["href"])
    scooby_link = "https://www.imdb.com" + "/title/tt0068112/episodes?ref_=tt_eps_sm"
print(scooby_link)
I'm assuming you also wanted to save the link to a variable for further scraping so I did that as well. 🙂
To get the link with Episodes, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one("a:-soup-contains(Episodes)")["href"])
Prints:
/title/tt0068112/episodes?ref_=tt_eps_sm
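Both examples print a relative path. If you need the absolute address for a follow-up request, one option (a sketch using urllib.parse.urljoin instead of concatenating the domain by hand) is:

from urllib.parse import urljoin

base = "https://www.imdb.com/title/tt0068112/?ref_=fn_al_tt_1"
href = "/title/tt0068112/episodes?ref_=tt_eps_sm"

# urljoin resolves the relative href against the page it came from.
print(urljoin(base, href))  # https://www.imdb.com/title/tt0068112/episodes?ref_=tt_eps_sm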

How to extract url/links that are contents of a webpage with BeautifulSoup

So the website I am using is: https://keithgalli.github.io/web-scraping/webpage.html and I want to extract all the social media links on the webpage.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content)
links = soup.find_all('a', {'class':'socials'})
actual_links = [link['href'] for link in links]
I get an error, specifically:
KeyError: 'href'
For a different example and webpage, I was able to use the same code to extract the webpage link, but for some reason this time it is not working and I don't know why.
I also tried to see what the problem was specifically, and it appears that links is a nested array: links[0] outputs the entire content of the ul tag that has class="socials", so it is not iterable so to speak, since the first element contains all the links rather than each social li tag being a separate element inside links.
Here is a solution using CSS selectors:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content, 'lxml')
links = soup.select('ul.socials li a')
actual_links = [link['href'] for link in links]
print(actual_links)
Output:
['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/#keithgalli']
Why not try something like:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content)
links = soup.find_all('a', {'class':'socials'})
actual_links = [link['href'] for link in links if 'href' in link.keys()]
After gaining some new information from you and visiting the webpage, I've realized that you made the following mistake:
The socials class is never used on any a-element, so you won't find any such elements in your script. Instead, you should look for the li-elements with the class "social".
Thus your code should look like:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')
soup = bs(r.content, "lxml")
link_list_items = soup.find_all('li', {'class':'social'})
links = [item.find('a').get('href') for item in link_list_items]
print(links)

How do I selectively scrape hrefs from div tags?

I'm trying to scrape URLS from a news website. Specifically, they are the URLs of news articles listed in the search results for a specific search term.
I'm new to BeautifulSoup, and I don't know how to selectively scrape just the hrefs that take me to an article (when I try to scrape for child hrefs in div tags, I just get an empty set, and when I scrape a tags, I get way more URLs than I want).
Any thoughts?
Here's a link to the webpage:
https://www.thenational.ae/search?q=aramco
Here's the code I'm using.
import requests, random, re
from bs4 import BeautifulSoup as bs
url = "https://www.thenational.ae/search?q=aramco"
webpage = requests.get(url)
soup = bs(webpage.text, "html.parser")
for link in soup.find_all('h1'):
    print(link.get('href'))
You need to understand the structure of the HTML. From the structure, you can see that the hrefs you need are children of the same div with class small-article-desc. So basically you do it this way:
import requests, random, re
from bs4 import BeautifulSoup as bs
url = "https://www.thenational.ae/search?q=aramco"
webpage = requests.get(url)
soup = bs(webpage.text, "html.parser")
for div in soup.find_all('div', {"class": "small-article-desc"}):
    a = div.find_all('a')
    print(a[0].get('href'))
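If you want every article link rather than only the first anchor in each div, a CSS-selector variant might look like this (a sketch; the small-article-desc class is taken from the answer above, and the site's markup may have changed since):

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.thenational.ae/search?q=aramco"
soup = bs(requests.get(url).text, "html.parser")

# Select every anchor inside a small-article-desc div that
# actually carries an href attribute.
for a in soup.select("div.small-article-desc a[href]"):
    print(a.get('href'))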

getting Video URL using Python Scripting

I am working with Beautiful Soup to extract URLs. I can get all of the href attributes, but I want to get only a specific URL.
Here is my code:
import requests
from bs4 import BeautifulSoup
page=requests.get("https://www.youtube.com/results?search_query=cooking")
soup = BeautifulSoup(page.content ,'html.parser')
for a_tag in soup.findAll("a"):
    if a_tag.has_attr("href"):
        print(a_tag['href'])
but I want only this:
watch?v=nTe_44ao82w
/watch?v=nTe_44ao82w
A more minimal version of the first answer:
import requests
from bs4 import BeautifulSoup
page=requests.get("https://www.youtube.com/results?search_query=cooking")
soup = BeautifulSoup(page.content ,'html.parser')
for a_tag in soup.findAll("a"):
    if 'watch' in a_tag.get('href', ''):
        print(a_tag['href'])
This checks whether the href attribute contains the string watch; using .get with a default skips any anchors that have no href at all.
Hope this helps!
There doesn't really seem to be a good way to differentiate those a tags other than by the URL itself (they don't have any unique classes or anything) so I would probably just check if the URL contains "watch":
...
for a_tag in soup.findAll("a"):
    if a_tag.has_attr("href") and "watch" in a_tag["href"]:
        print(a_tag['href'])
Outputs
/watch?v=cbxe1ANrfDo
/watch?v=nTe_44ao82w
/watch?v=v1wIThmCams
/watch?v=FTociictyyE
/watch?v=dw2QHkAtB_Y
/watch?v=ej9UHVwlQqk
/watch?v=KGAj8IhnR3c
/watch?v=G8A73R_gZdM
/watch?v=XPQW_2YOmjY
/watch?v=J0pS2lhH0Vc
/watch?v=5aU5qrbCsF4
/watch?v=kvAJ_mc9NXs
/watch?v=kKiYVLIk_9s
/watch?v=G2jYIGdmC6I
/watch?v=jMW5ZDQviOA
/watch?v=iTmcGy9CWhE
/watch?v=66Ck_5SePZg
/watch?v=lyD9t3uhHio
