I'm trying to get all of the a links with class=fl. I'm using mechanize to fetch the raw HTML output and then BeautifulSoup to parse out the links.
The value of rawGatheredGoogleOutput looks like the following (which is just a Google result page):
The highlighted portion shows what I'm trying to grab: the a.fl links.
To find a elements with a class=fl attribute, you call find_all like this:
getAdditionalGooglePages = beautifulSoupObj.find_all('a', attrs={"class": "fl"})
For other attributes, it's simpler - for example, with id=fl it would be:
getAdditionalGooglePages = beautifulSoupObj.find_all('a', id="fl")
... but that doesn't work with class, because it's a Python reserved word.
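As an aside, bs4 also accepts a class_ keyword argument (note the trailing underscore) precisely to work around the reserved word, so the following should be equivalent; a minimal sketch, assuming the same soup object as above:
getAdditionalGooglePages = beautifulSoupObj.find_all('a', class_="fl")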
I am trying to retrieve all strings from a webpage using BeautifulSoup and return a list of all the retrieved strings.
I have 2 approaches in mind:
Find all elements that have non-empty text, append each text to a result list, and return it. I am having a hard time implementing this, as I couldn't find any way to do it in BeautifulSoup.
Use BeautifulSoup's "find_all" method to find all the tags I am looking for, such as "p" for paragraphs, "a" for links, etc. The problem I am facing with this approach is that, for some reason, find_all returns duplicated output. For example, if a website has a link with the text "Get Hired", I receive "Get Hired" more than once in the output.
I am honestly not sure how to proceed from here, and I have been stuck for several hours trying to figure out how to get all strings from a webpage.
Would really appreciate your help.
Use .stripped_strings to get all the strings with surrounding whitespace stripped off. This also avoids the duplicates you saw with find_all: each text node is yielded exactly once, whereas extracting text from both a tag and another tag nested inside it returns the same string twice.
.stripped_strings - Read the Docs.
Here is the code that returns a list of strings present inside the <body> tag.
import requests
from bs4 import BeautifulSoup
url = 'YOUR URL GOES HERE...'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# Restrict the search to <body> so <head> content (titles, scripts) is excluded.
b = soup.find('body')
# .stripped_strings yields every text node with surrounding whitespace removed.
list_of_strings = [s for s in b.stripped_strings]
list_of_strings will contain all the strings present in the page.
Post the code that you've used.
If I remember correctly, something like this should get the complete page into one variable, page, with the page's raw HTML available as page.text:
import requests
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
print(page.text)
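Note that page.text is the raw HTML source, tags included. To reduce it to just the visible strings, you would still hand it to BeautifulSoup, for example with get_text; a minimal sketch building on the snippet above:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
# get_text collapses the document to its text nodes, joined by spaces.
print(soup.get_text(separator=' ', strip=True))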
I'm trying to request data from Wikipedia in Python using XPath.
I'm getting an empty list. What am I doing wrong?
import requests
from lxml import html
pageContent = requests.get(
    'https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo'
)
tree = html.fromstring(pageContent.content)
name = tree.xpath('//*[@id="mw-content-text"]/div/table[1]/tbody/tr[2]/td[2]/a[1]/text()')
print(name)
This is a very common mistake when copying an XPath from the browser for table tags: the browser normally adds the tbody tag inside tables when it builds the DOM, but it doesn't actually exist in the response body.
So just remove it, and the expression should be:
'//*[@id="mw-content-text"]/div/table[1]//tr[2]/td[2]/a[1]/text()'
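Putting it together, the question's code with only the XPath corrected would look like this (a sketch, unchanged otherwise):
import requests
from lxml import html
pageContent = requests.get(
    'https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo'
)
tree = html.fromstring(pageContent.content)
# // skips the browser-inserted tbody and matches any descendant row.
name = tree.xpath('//*[@id="mw-content-text"]/div/table[1]//tr[2]/td[2]/a[1]/text()')
print(name)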
I'm scraping some sites with Selenium and bs4, and I'm in need of some elegant code to do the following. I have some text inside a tag:
<td><span class="hp">1</span>SJK Seinajoen</td>
If I do this
find('td').get_text()
What I get is
1SJK Seinajoen
as it gets all the text, including what is in the span tag. My question is: is there any way to get, in a Pythonic way, just the text that comes after the span tag?
I say Pythonic because I could always split the resulting string, but that is not very elegant.
This is from another post about this issue:
If you are using bs4, you can use .strings:
" ".join(result.strings)
In lxml.html you can use the code below to get the required output:
from lxml import html
source = """<td><span class="hp">1</span>SJK Seinajoen</td>"""
tree = html.fromstring(source)  # pass web page HTML source code as "source"
# //td/text() matches only text directly under td, skipping the span's text.
print(tree.xpath("//td/text()")[0])
Output
"SJK Seinajoen"
Scraping a page and trying to get all the URLs from the first column. When I call .text I get everything in the div, which I understand. But when I specifically target the URL, I only get the first one. How do I get all of them, separated for storage?
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = "http://www.heavyliftpfi.com/news/"
html = urlopen(base_url)
soup = BeautifulSoup(html.read().decode('latin-1', 'ignore'), "lxml")
main_div = soup.select_one("div.fullWidth")
div_sub = main_div.select_one("div.leftcol")
print(div_sub.text)  # I get that this gets everything as .text
print(div_sub.h2.a['href'])  # alternate - with only one 'href' returned
Since you are navigating the parse tree via tag names, if multiple elements match a name, only the first one is returned. This is expected behavior. Try using find_all() to search for them instead.
from the BS4 docs:
"Using a tag name as an attribute will give you only the first tag by
that name."
"If you need to get all the tags, or anything more complicated
than the first tag with a certain name, you’ll need to use one of the
methods described in Searching the tree, such as find_all()"
see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
It was the find_all, but I needed to move up the tree:
for a in main_div.find_all('a', href=True):
    print(a['href'])
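If you want only the headline links rather than every anchor in the div, a CSS selector can narrow the match; a sketch, assuming the h2 > a structure from the question's code:
for a in main_div.select("h2 a[href]"):
    print(a['href'])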
I'm starting to make progress on a website scraper, but I've run into two snags. Here is the code first:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.nytimes.com")
# Name a parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(r.text, "html.parser")
headlines = soup.find_all(class_="story-heading")
for headline in headlines:
    print(headline)
Questions
Why do you have to use find_all(class_=blahblahblah)
instead of just find_all(blahblahblah)? I realize that story-heading is a class of its own, but can't I just search all the HTML using find_all and get the same results? The BeautifulSoup notes show find_all("a") returning all the anchor tags in an HTML document, so why won't find_all("story-heading") do the same?
Is it because, if I try to do that, it will just look for "story-heading" tags in the HTML and return those? I am trying to get Python to return everything in that tag. That's my best guess.
Why do I get all this extra junk code? Shouldn't my find_all request just show me everything within the story-heading tag? I'm getting a lot more text than what I am trying to specify.
Beautiful Soup also lets you use CSS selectors; look in the docs for "CSS selectors". Note that selector strings go through the select method, not find_all.
You can find all elements with class "story-heading" like so:
soup.select(".story-heading")
If instead you're looking for an id, just do:
soup.select("#id-name")