I am quite new to Python and web scraping, and I am trying to pull the following text ($1.74), along with all the other relevant odds on the page, from a website:
HTML text that I am trying to pull
For similar situations previously I have been successful using a for loop inside another for loop, but on those occasions I was searching by 'class'. I cannot search by class here, as there are a lot of other 'td's with the same class that are not the odds I want. Here I would like to search via 'data-bettype' (and I am not sure if that is possible). The reason I am trying to search via that, and not 'data compid data-bettype', is that when I print out the full HTML in Python, it looks like so:
HTML printed to Python
The relevant part of my code here is:
soup_playup = BeautifulSoup(source_playup, 'lxml')
#print(soup_playup.prettify())
for odds_a in soup_playup.find_all('td',{'data-bettype','Awin'}):
    for odds in odds_a.find_all('div'):
        print(odds.text)
I am not receiving any errors when I run this code, but it seems as though it just will not find the text.
The correct format for looking up attributes is a dictionary of key-value pairs like so:
soup_playup.find_all('td',attrs={'data-bettype':'Awin'})
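A minimal, self-contained illustration of the attrs dictionary (the HTML here is a made-up stand-in for the real page, not the actual markup from the betting site):

```python
from bs4 import BeautifulSoup

# Stand-in for the real page: cells carrying the data-bettype attribute.
html = """
<table>
  <tr>
    <td class="odds" data-bettype="Awin"><div>$1.74</div></td>
    <td class="odds" data-bettype="Bwin"><div>$2.10</div></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
# attrs takes a dict mapping attribute name to value, so only the Awin cell matches.
for odds_a in soup.find_all('td', attrs={'data-bettype': 'Awin'}):
    for odds in odds_a.find_all('div'):
        print(odds.text)   # $1.74
```

Passing a set (`{'data-bettype','Awin'}`) instead silently matches nothing useful, which is why the original loop printed no output rather than raising an error.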
Right now I'm trying to scrape the dividend yield from a chart using the following code.
df = pd.read_html('https://www.macrotrends.net/stocks/charts/BMO/Bank-of-Montreal/dividend-yield-history')
df = df[0].dropna()
But the code won't pick up the chart's data.
Any suggestions on pulling it from the website?
Here is the specific link I'm trying to use: https://www.macrotrends.net/stocks/charts/BMO/Bank-of-Montreal/dividend-yield-history
I've used the code for picking up the book values but the objects they're using for the dividends and book values must be different.
Maybe I could use Beautiful Soup?
Sadly that website is rendered dynamically, so there's nothing in the HTML pandas receives for it to scrape: the chart's data is fetched and inserted only after the page loads. Scraping the raw page manually won't help you here, because the data simply isn't there.
You can either find an API which provides the data (best, and quite possible given the content), work out where the page is fetching its data from and see if you can hit that endpoint directly (better, if possible), or use something like Selenium to control a real browser, render the page, get the HTML, and then scrape that.
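The Selenium route can be sketched like this. The rendering step assumes a working chromedriver install; the parsing step is plain pandas and works on any HTML you hand it:

```python
from io import StringIO

import pandas as pd


def render_page(url):
    """Open the page in a real browser so JavaScript-built content ends up in the HTML."""
    from selenium import webdriver  # assumes a working chromedriver install
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source  # the DOM *after* scripts have run
    finally:
        driver.quit()


def extract_tables(html):
    """pandas parses whatever HTML it is handed, rendered or not."""
    return pd.read_html(StringIO(html))


# dfs = extract_tables(render_page(
#     'https://www.macrotrends.net/stocks/charts/BMO/Bank-of-Montreal/dividend-yield-history'))
# df = dfs[0].dropna()
```

The point is that `pd.read_html` never runs JavaScript itself; it only sees whatever HTML string reaches it, so the browser has to do the rendering first.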
On this specific page (or any 'matches' page) there are names you can select to view individual statistics for a match. How do I grab the 'kills' stat for example using webscraping?
In most of the tutorials I've used, web scraping seems simple. However, when inspecting this site, specifically the 'kills' item, you see something like
<span data-v-71c3e2a1 title="Kills" class="name">
Question 1.) What is the 'data-v-71c3e2a1'? I've never seen anything like this in my HTML, CSS, or web scraping tutorials. It appears in different variations all over the site.
Question 2.) More importantly, how do I grab the number of kills in this section? I've tried using scrapy and grabbing by xpath:
scrapy shell https://cod.tracker.gg/warzone/match/1424533688251708994?handle=PatrickPM
response.xpath("//*[@id="app"]/div[3]/div[2]/div/main/div[3]/div[2]/div[2]/div[6]/div[2]/div[3]/div[2]/div[1]/div/div[1]/span[2]").get()
but this raises a syntax error
response.xpath("//*[@id="app"]
SyntaxError: invalid syntax
Grabbing by response.css("").get() is also difficult. Should I be using selenium? Or just regular requests/bs4? Nothing I do can grab it.
Thank you.
Does this return the data you need?
import requests
endpoint = "https://api.tracker.gg/api/v1/warzone/matches/1424533688251708994"
r = requests.get(endpoint, params={"handle": "PatrickPM"})
data = r.json()["data"]
In any case I suggest using the API if there's one available. It's much easier than using BeautifulSoup or Selenium.
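The exact shape of the tracker.gg JSON isn't documented here, so rather than guess the nesting, a small recursive helper can locate a stat wherever it lives in the response (the key name "kills" is an assumption; print `data` first to check what the API actually returns):

```python
def find_key(obj, key):
    """Depth-first search a nested JSON structure for the first occurrence of `key`."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        for value in obj.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for item in obj:
            found = find_key(item, key)
            if found is not None:
                return found
    return None


# e.g. kills = find_key(data, "kills")
```

This is just an exploration aid; once you know the real path into the JSON, index it directly.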
Okay so.
The title might make it seem like this question has already been asked, but I had no luck finding an answer for it.
I need help with making a link-extracting program in Python.
Actually, it works. It finds all <a> elements on a webpage, takes their href="" values and puts them in an array, then exports them to a CSV file. Which is what I want.
But I can't get a hold of one thing.
The website is dynamic so I am using the Selenium webdriver to get JavaScript results.
The code for the program is pretty simple. I open a website with webdriver and then get its content. Then I get all links with
results = driver.find_elements_by_tag_name('a')
Then I loop through results with for loop and get href with
result.get_attribute("href")
I store results in an array and then print them out.
But the problem is that I can't get the names of the links.
This leads to Google
Is there any way to get the 'This leads to Google' string?
I need it for every link that is stored in an array.
Thank you for your time
UPDATE!!!!!
As it seems, it only gets dynamic links. I just noticed this, which is really strange. For hard-coded items, it returns an empty string; for a dynamic link, it returns its name.
Okay. So. The answer is that instead of using .text you should use get_attribute("textContent"). It works better than get_attribute("innerHTML").
Thanks KunduK for this answer. You saved my day :)
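Putting the fix together, the loop might look like the sketch below. It keeps the same `find_elements_by_tag_name` call the question uses and assumes `driver` is an already-created webdriver that has loaded the page:

```python
def extract_links(driver):
    """Return a list of (href, name) pairs for every <a> on the loaded page."""
    links = []
    for result in driver.find_elements_by_tag_name('a'):
        href = result.get_attribute("href")
        # .text can come back empty for elements Selenium considers not
        # "displayed"; "textContent" reads the DOM node's text directly,
        # so it works for both hard-coded and dynamically added links.
        name = result.get_attribute("textContent").strip()
        links.append((href, name))
    return links
```

From there, writing the pairs out with the csv module is the same as before.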
Basically I am trying to make a program that searches for a movie online and spits out information. So far I have used Selenium to interact with the web, but I only get as far as locating the right element. I did that with:
synopsis = driver.find_element_by_xpath('//*[@id="movieSynopsis"]')
print(synopsis)
Problem is, whenever I do that it doesn't print out the text; instead I get a bunch of stuff that I don't want. This is the output I receive:
<selenium.webdriver.remote.webelement.WebElement (session="3ba7827a8523622eb0e81cbe1d20ecbe", element="0.234680993959405-1")>
How do I make it so that it prints the information I want?
By the way, I am trying to make it print the synopsis of Deadpool on rotten tomatoes, URL is --> https://www.rottentomatoes.com/m/deadpool
Thanks in advance.
Check out BeautifulSoup, a Python library built specifically to help web scrapers parse HTML. I think it will make this a lot easier for you.
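For example, you can feed the browser's rendered HTML to BeautifulSoup and pull the element's text by its id (the id comes from the XPath in the question; the sample HTML in the test is a stand-in for the real page):

```python
from bs4 import BeautifulSoup


def get_synopsis(page_source):
    """Parse rendered HTML and return the synopsis text, or None if absent."""
    soup = BeautifulSoup(page_source, 'html.parser')
    element = soup.find(id="movieSynopsis")
    return element.get_text(strip=True) if element else None


# With Selenium, pass in the rendered page:
# print(get_synopsis(driver.page_source))
```

(Selenium's own `synopsis.text` would also print the string directly; BeautifulSoup just gives you a fuller parsing toolkit on top of the page source.)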
I'm new to programming. I'm trying out my first web crawler program, which will help me with my job. I'm trying to build a program that will scrape tr/td table data from a web page, but I'm having difficulty getting it to work. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
def start(url):
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code)
    for table_data in soup.find_all('td', {'class': 'sorting_1'}):
        print(table_data)

start('http://www.datatables.net/')
My goal is to print out each line and then export it to an excel file.
Thank you,
-Cire
My recommendation is that if you are new to Python, play with things in an IPython notebook (interactive prompt) to get things working first and to get a feel for them before you try writing a script or a function. On the plus side, all variables stick around and it is much easier to see what is going on.
From the screenshot here, you can see immediately that the find_all call is not finding anything: an empty list [] is being returned. By using IPython you can easily try other variants of a call on a previously defined variable, for example soup.find_all('td').
Looking at the source of http://www.datatables.net, I do not see any instances of the text sorting_1, so I wouldn't expect a search for all table cells of that class to return anything.
Perhaps that class appeared on a different URL associated with the DataTables website, in which case you would need to use that URL in your code. It's also possible that that class only appears after certain JavaScript has been run client-side (i.e. after certain actions with the sample tables, perhaps), and not on the initially loaded page.
I'd recommend starting with tags you know are on the initial page (seen by looking at the page source in your browser).
For example, currently, I can see a div with class="content". So the find_all code could be changed to the following:
for table_data in soup.find_all('div', {'class': 'content'}):
    print(table_data)
And that should find something.
Response to comments from OP:
The precise reason why you're not finding that tag/class pairing in this case is that DataTables renders the table client-side via JavaScript, generally after the DOM has finished loading (although it depends on the page and where the DataTables init code is placed). That means the HTML associated with the base URL does not contain this content. You can see this if you curl the base URL and look at the output.
However when loading it in a browser, once the JavaScript for DataTables fires, the table is rendered and the DOM is dynamically modified to add the table, including cells with the class for which you're looking.
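A minimal before-and-after illustration of that point. Both snippets are made-up stand-ins: the first mimics the empty container the server actually sends, the second mimics the DOM after DataTables has run and filled the table in:

```python
from bs4 import BeautifulSoup

# What the server sends: an empty container the script fills in later.
raw_html = '<div class="content"><table id="example"></table></div>'

# What the DOM looks like in the browser after DataTables initialises.
rendered_html = '''
<div class="content">
  <table id="example">
    <tr><td class="sorting_1">Tiger Nixon</td><td>System Architect</td></tr>
  </table>
</div>
'''

for label, html in (("raw", raw_html), ("rendered", rendered_html)):
    soup = BeautifulSoup(html, 'html.parser')
    print(label, soup.find_all('td', {'class': 'sorting_1'}))
```

The raw HTML yields an empty list, exactly as in the screenshot, while the rendered HTML matches; that is why requests/BeautifulSoup alone cannot see the cells and a browser (or the site's underlying data endpoint) is needed.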