BeautifulSoup get href after find_all - python

I'm scraping a vendor link directory. I've created a soup and isolated all the data I want using the find_all method. However, the string I need is nested further within the soup. I understand that find_all returns a list, but I need to distill the list further to get what I need. Thanks for the help, because I'm about to chuck my laptop across the room. Below is my current code.
I'm new to the coding world, with a decent understanding of Python but only a basic understanding of Beautiful Soup.
from requests import get
from bs4 import BeautifulSoup
URL = get('https://www......') # fetching the URL I want to work over
soup = BeautifulSoup(URL.text, 'html.parser') # making the soup
IsoUrl = soup.find_all('a', class_='xmd-listing-company-name') # isolates the tags of the links I need
This is more or less where I get stuck. From the above isolation I get a list composed of the following. Below is only one item of the list.
<a class="xmd-listing-company-name" href="/rated.company.html" itemprop="url"><span itemprop="name">Company</span></a>
There are 10+ of the above strings in the list. I want to scrape out '/rated.company.html' from each string & append them to a list to iterate through.
Any guidance is greatly appreciated. If I need to clarify anything, please let me know.

You can simply loop over the results of find_all and extract the href from each tag, like below:
results = [iso['href'] for iso in IsoUrl]
# >>> ["/rated.company.html", ...]
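The hrefs in this example are site-relative, so they need the site's base address before they can be requested in a loop. Here is a minimal sketch using the standard library's urljoin. The base URL and the second list item are placeholders, since the question elides the real address.

```python
from urllib.parse import urljoin

# Placeholder base address (assumption); substitute the directory's real URL.
BASE_URL = "https://www.example.com/vendors/"

# Stand-in for the list produced by the comprehension above.
results = ["/rated.company.html", "/other.company.html"]

# urljoin resolves each relative href against the base, the way a browser would.
absolute_urls = [urljoin(BASE_URL, href) for href in results]

for url in absolute_urls:
    print(url)
```

If some anchors might lack an href, iso.get('href') returns None instead of raising a KeyError, so those entries can be filtered out before joining.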

Related

Unable to locate elements using requests and BeautifulSoup

I am writing a script in Python using the modules 'requests' and 'BeautifulSoup' to scrape results from football matches found in the links from the following page:
https://www.premierleague.com/results?co=1&se=363&cl=-1
The task consists of two steps (taking the first match, Arsenal against Brighton, as an example):
1) Extract and navigate to the href "https://www.premierleague.com/match/59266" found in the element: div data-template-rendered data-href.
2) Navigate to the "Stats" tab and extract the information found in the element: tbody class = "matchCentreStatsContainer".
I have already tried things like
page = requests.get("https://www.premierleague.com/match/59266")
soup = BeautifulSoup(page.text, "html.parser")
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
but I am not able to locate any of the elements in step 1) or 2) (empty list is returned).
Instead of this:
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
Use this:
soup.findAll(attrs={"class" : "matchCentreStatsContainer"})
Without the "div" tag name, the search matches an element with that class whatever its tag is.
In this case the problem is simply that you are looking for the wrong thing. There is no <div class="matchCentreStatsContainer"> on that page; that element is a <tbody>, so it doesn't match. If you want the div, do:
divs = soup.find_all("div", class_="statsSection")
Otherwise search for the tbodys:
soup.find_all("tbody", class_="matchCentreStatsContainer")
Incidentally the Right Way (TM) to match classes is with class_, which takes either a list or a string (for a single class). This was added to bs4 a while back, but the old syntax is still floating around a lot.
Do note your first url as posted here is invalid: it needs a http: or https: before it.
Update
Please note that I would not parse this particular file like this. It likely already has everything you want as JSON. I would just do:
import json
data = json.loads(soup.find("div", class_="mcTabsContainer")["data-fixture"])
print(json.dumps(data, indent=2))
Note that data is just a dictionary: I'm only using json.dumps at the end to pretty print it.
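As a rough illustration of that approach, the sketch below runs the same lookup against a small inline snippet. The HTML and the keys inside data-fixture are invented for the example; only the class name and the attribute name come from the answer, and the real page's payload will differ.

```python
import json
from bs4 import BeautifulSoup

# Invented stand-in for the match page (assumption): only the class name and
# the data-fixture attribute name come from the answer above.
html = """
<div class="mcTabsContainer"
     data-fixture='{"id": 59266, "teams": [{"name": "Arsenal"}, {"name": "Brighton"}]}'>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="mcTabsContainer")

# The attribute value is a JSON string; json.loads turns it into a dict.
data = json.loads(container["data-fixture"])

# Once parsed, the data is navigated like any nested dict/list structure.
team_names = [team["name"] for team in data["teams"]]
print(data["id"], team_names)
```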

Python, extract text from webpage

I am working on a project where I am crawling thousands of websites to extract text data. The end use case is natural language processing.
EDIT: Since I am crawling hundreds of thousands of websites, I cannot tailor the scraping code to each one, which means I cannot search for specific element ids. The solution I am looking for is a general one.
I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the page, much of it irrelevant to the page's main topic. For the most part a page is dedicated to a single main topic, but on the sides, top, and bottom there may be links or text about other subjects, promotions, or other content.
The .get_text() function returns all the text on the page in one go. The problem is that it combines everything, the relevant parts with the irrelevant ones. Is there another function similar to .get_text() that returns all the text as a list, where every list item is a specific section of the text, so that it is known where new subjects start and end?
As a bonus, is there a way to identify the main body of text on a web page?
Below are some snippets you could use to query the data in the way you want, using BeautifulSoup4 and Python 3:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')
# Print the first child of <body> (soup.body.contents is the full list of children)
print(soup.body.contents[0])
# Print the first div found on the page
print(soup.find('div'))
# Print all divs on the page as a list
print(soup.find_all('div'))
# Print the element whose id is 'required_element_id'
print(soup.find(id='required_element_id'))
# Print all elements that match a CSS selector, as a list
# ('div.some-class' is a placeholder selector)
print(soup.select('div.some-class'))
# Print the value of one attribute of an element
print(soup.find(id='someid').get("attribute-name"))
# You can also break one large query into several smaller ones
parent = soup.find(id='someid')
# getText() returns the text between the opening and closing tags
print(parent.select(".some-class")[0].getText())
For more advanced requirements, you can look at Scrapy as well. Let me know if you run into any problems implementing this or if your requirement is something else.
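For the original question of getting the text as a list of sections rather than one string, one possible sketch is to detach elements that rarely hold main content and then collect the text of each remaining block-level tag. The tag lists and the text_sections helper are heuristic assumptions for illustration, not a general main-content detector.

```python
from bs4 import BeautifulSoup

# Tags that usually hold navigation, promotions, or non-visible content
# (a heuristic assumption; tune it for the sites being crawled).
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside"]
# Tags treated as one "section" of text each (also an assumption).
BLOCK_TAGS = ["h1", "h2", "h3", "p"]


def text_sections(html):
    """Return the page text as a list, one entry per block-level element."""
    soup = BeautifulSoup(html, "html.parser")
    # Detach noisy elements (and everything inside them) from the tree.
    for tag in soup.find_all(NOISE_TAGS):
        tag.extract()
    sections = []
    for block in soup.find_all(BLOCK_TAGS):
        # Join the block's text fragments with spaces and trim whitespace.
        text = block.get_text(" ", strip=True)
        if text:
            sections.append(text)
    return sections


sample = """
<html><body>
  <nav><a href="/">Home</a> <a href="/deals">Deals</a></nav>
  <h1>Growing tomatoes</h1>
  <p>Tomatoes need <b>full sun</b> and regular watering.</p>
  <footer>Copyright notice</footer>
</body></html>
"""

for section in text_sections(sample):
    print(section)
```

Pages that keep their main text outside h1-h3 and p tags would need a different BLOCK_TAGS list, so this is a starting point rather than a complete answer to the "main body" bonus question.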

Python Web Scraping with lxml

I am trying to scrape column names (player, cost, sel., form, pts) from the page below:
https://fantasy.premierleague.com/a/statistics/total_points
However, I am failing to do so.
Before I go further, let me show you what I have done.
from lxml import html
import requests
page = 'https://fantasy.premierleague.com/a/statistics/total_points'
# Fetch the page and parse the HTML
page = requests.get(page)
tree = html.fromstring(page.content)
# Using the page's CSS classes, select the header cell that should read "Player"
Location = tree.cssselect('.ism-thead-bold tr .ism-table--el-stats__name')
When I do this, Location should be a list that contains a string "Player".
However, it returns an empty list which means cssselect did not capture anything.
Though each column name has a different 'th class', I used one of them (ism-table--el-stats__name) for this specific trial just to make it simple.
When this problem is fixed, I want to use regex since every class has different suffix after two underscores.
If anyone can help me with these two tasks, I would really appreciate it!
Thank you.

Python scraping deep nested divs whose classes change

I'm somewhat new to Python, and I'm working on the first part of a project where I need to get the link(s) on a FanDuel page. I've been spinning my tires trying to get the 'href'.
Here's what Inspect Element shows:
[Screenshot of the Inspect Element panel]
What I'm trying to get to is highlighted above.
I see that the seems to be the parent, but as you go down the tree, the classes listed with lettering (e.g. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' that I need has a constant "data-test-id" that does not change, so I was trying to use that as my way to find what I need, but it does not seem to be working.
I'm not sure how far, or if, I need to drill down further to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup
url = "https://www.fanduel.com/contests/mlb/96"
#authentication might not be necessary, it was a test, still getting the same results
site = requests.get(url, cookies={'X-Auth-Token':'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})
#If i use this, i get an error
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, use View Page Source instead of Inspect Element and see whether the HTML is already there (this is the HTML you get when you call requests.get()). I have already checked, and that is the case here. To resolve this, you would have to use Selenium to render the JavaScript on the page; you can then get the page source from Selenium after it has built the elements in the DOM.
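One way to see the difference the answer describes is to run the same lookup on HTML that already contains the link and on an empty JavaScript shell. Both snippets below are invented stand-ins, not FanDuel's actual markup, and contest_links is a hypothetical helper. The sketch uses 'html.parser' rather than 'lxml' only to avoid an extra dependency. It also shows that the href has to be read from each tag, since find_all returns a list and a list cannot be indexed with 'href'.

```python
from bs4 import BeautifulSoup

# Invented examples (assumptions): what a server-rendered page might contain
# versus the bare shell that requests.get() receives from a JavaScript-built page.
rendered_html = '<div><a data-test-id="ContestCardEnterLink" href="/contests/1">Enter</a></div>'
shell_html = '<div id="root"></div><script src="/app.js"></script>'


def contest_links(html):
    """Return the href of every anchor carrying the stable data-test-id."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", attrs={"data-test-id": "ContestCardEnterLink"})
    # Index each Tag for its href; the list itself has no 'href' key.
    return [a["href"] for a in anchors if a.has_attr("href")]


print(contest_links(rendered_html))  # links are present in the markup
print(contest_links(shell_html))     # nothing to find until JavaScript runs
```

An empty result from the raw response, while the browser's inspector shows the element, is a strong hint that the page needs to be rendered before it can be parsed.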

Strip Html Tags Findall + Beautiful Soup

Well I have done probably 2 hours of searching and I believe my brain is probably just fried. Today is my first day with BeautifulSoup (so please be gentle). The source code for the website that I am scraping has a format that is as follows:
$100
I feel pretty dumb because I am getting the whole <a> tags when writing to a file, and I have a sneaking suspicion that there is a simple solution, but I cannot seem to find it.
Currently I'm using the following:
soup = BeautifulSoup(page.content, 'html.parser')
prices = soup.find_all(class_="price")
passed.append(prices)
How can I target just the content with matching classes between specific tags?
prices = soup.find_all(class_="price")
for a in prices:
    passed.append(int(a.text.strip().replace('$', '')))  # append the numeric price to the list
This should help.
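If the prices can contain thousands separators or cents, int() on the stripped text will raise a ValueError. Here is a small sketch of a more tolerant conversion. parse_price is a hypothetical helper, not part of BeautifulSoup, and the sample strings stand in for the a.text values from the matched tags.

```python
from decimal import Decimal


def parse_price(text):
    """Convert text such as ' $1,250.50 ' to a Decimal amount."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


# Stand-ins for a.text values taken from the matched tags (assumption).
raw_prices = ["$100", " $1,250.50 ", "$7"]
passed = [parse_price(raw) for raw in raw_prices]
print(passed)
```

Decimal is used here instead of float so that amounts with cents are kept exactly as written on the page.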
