How do I click an image like below using Python mechanize?
<img name="next" id="next" src="...">
I know the name and id of the image I want to click. I need to somehow identify the parent link and click it. How can I do that?
Bonus Question: How can I check if there is such an image or not?
Rather than using mechanize, it's very simple to do with bs4 (beautifulsoup 4).
from bs4 import BeautifulSoup
import urllib2

text = urllib2.urlopen("http://yourwebpage.com/").read()
soup = BeautifulSoup(text)

# find the image by its id attribute
img = soup.find_all('img', {'id': 'next'})
if img:
    # the enclosing <a> tag is simply the image's parent
    a_tag = img[0].parent
    href = a_tag.get('href')
    print href
Retrieving the parent tag is very easy with bs4: once you've found the tag with find_all, you simply access .parent. Since find_all returns a list, it's best to check if img: before indexing into it; even if the image is always present on your page, the check is safe to keep. See below.
EDIT: I have changed the code to include the "Bonus question", which is what I described above as an alternative.
For your bonus question: you can use BeautifulSoup to check whether the img element is present, and urllib to see if the image itself exists (at least, whether the server will return it to you - otherwise you'll get an error back).
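For example, a rough sketch of that check (img_url below is just a placeholder for whatever the img src points at):
import urllib2

img_url = "http://yourwebpage.com/images/next.png"  # hypothetical URL taken from the img src
try:
    urllib2.urlopen(img_url)
    print "image is reachable"
except urllib2.HTTPError as e:
    # the server answered, but with an error status (404, 403, ...)
    print "server returned an error:", e.code
except urllib2.URLError as e:
    # the request itself failed (bad host, no connection, ...)
    print "request failed:", e.reason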
You can also check out this thread that someone more intelligent than I answered - it seems to discuss a library called SpiderMonkey and the inability of mechanize to click a button.
Well, I don't know how to do it using Mechanize, however I know how to do it using lxml:
Let's assume that our webpage has this code:
<img name="bla bla" id="next" src="Cat.jpg">. Using lxml we would write this code:
from lxml import html
import urllib2

page = urllib2.urlopen('http://example.com')
tree = html.fromstring(page.read())
link = tree.xpath('//img[@id="next"]/ancestor::a/attribute::href')
Most of the magic happens in the tree.xpath function, where you define the image you're looking for first with //img[@id="next"], then you specify that you're looking for an enclosing a tag: /ancestor::a, and that you want specifically its href attribute: /attribute::href. The link variable will now contain a list of strings matching that query - in this case link[0] will be page2.html - which you can urlopen(), thus effectively clicking it.
For the //img[@id="next"] part, you can use another attribute instead, for example //img[@name="bla bla"], and it's going to work perfectly fine. You just need to think about which attribute is better for this situation.
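To actually "click" the resulting link, a minimal sketch (assuming link is non-empty and the href may be relative to the page) could look like this:
import urllib2
import urlparse

if link:
    # resolve a possibly relative href against the page URL
    next_url = urlparse.urljoin('http://example.com', link[0])
    next_page = urllib2.urlopen(next_url).read()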
I know this answer doesn't use Mechanize, however I hope it's a helpful pointer. Good luck!
I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those functions really work. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, and so the string argument can't find it?? An example of a section on the page that I AM able to find has html like the one below, and I'm easily able to find it using the string argument and re.compile using "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking. Just remove the regular expression part, take the text and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want. Like so:
soup.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression will match any text that has sentinel in it. Be careful that you may have to account for leading characters such as whitespace, which is why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here:
https://regex101.com/
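A rough end-to-end sketch of that idea (assuming url is the page you're scraping, and using find_parent to climb back up from the matched text node to its section) might look like this:
import re
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# find the text node containing the partial filename match
match = soup.find(string=re.compile('CIcyano'))
if match:
    # climb up to the enclosing section, then pull the download link out of it
    section = match.find_parent('section', attrs={'class': 'onecol habonecol'})
    print(section.find('a').get('href'))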
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class':'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break

links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.
I'm admittedly beginner to intermediate with Python and a novice at BeautifulSoup/web-scraping. However, I have successfully built a couple of scrapers. Normal tags = no problem (e.g., div, a, li, etc.).
However, I can't figure out how to reference this tag with .select or .find or attrs="" or anything:
..........
<react type="sad" msgid="25314120" num="2"
..........
I ultimately want what looks like the "num" attribute from whatever this ghastly thing is ... a "react" tag (though I don't think that's a thing?)?
.find() works the same way as it does for other tags such as div, p and a. So we search for the 'react' tag.
react_tag = soup.find('react')
Then, access the num attribute like so.
num_value = react_tag['num']
Should print out:
2
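A minimal self-contained sketch of the above, using a made-up bit of markup just to illustrate:
from bs4 import BeautifulSoup

html = '<react type="sad" msgid="25314120" num="2"></react>'  # illustrative markup only
soup = BeautifulSoup(html, 'html.parser')

react_tag = soup.find('react')   # unknown tag names work just like div, p, a, ...
num_value = react_tag['num']
print(num_value)  # prints 2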
As per the bs4 documentation, .find('tag') returns the first matching tag and .find_all('tag') returns a list of all matching tags in the html.
In your case, if there are multiple react tags, use this:
for reactTag in soup.find_all('react'):
print(reactTag.get('num'))
To get only the first tag, use this:
print(soup.find('react').get('num'))
The user "s n" was spot on! These tags are created dynamically by JavaScript, which I didn't know anything about, but it was pretty easy to figure out. Using the Selenium library in Python together with a "headless" Chrome WebDriver, you can use Selenium selectors like XPath (and many others) to find these tags.
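A rough sketch of that kind of setup (the URL and XPath below are placeholders, not the actual page I was scraping):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)

driver.get('https://example.com/some-page')  # hypothetical URL
react_tag = driver.find_element_by_xpath('//react[@msgid="25314120"]')  # XPath works on these dynamic tags
print(react_tag.get_attribute('num'))
driver.quit()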
I am trying to parse https://www.tandfonline.com/toc/icbi20/current for the titles of all articles. The HTML is divided into Volumes and Issues. Each Volume has an Issue that corresponds to a Month. So for Volume 36 there would be 12 Issues. In the current Volume (37) there are 4 Issues and I would like to parse through each Issue and get each Article's name.
To accomplish this and automate the search I need to fetch the href links for each Issue. Initially I chose the parent's div id: id = 'tocList'.
import requests
from bs4 import BeautifulSoup, SoupStrainer
chronobiology = requests.get("https://www.tandfonline.com/toc/icbi20/current")
chrono_coverpage = chronobiology.content
issues = SoupStrainer(id ='tocList')
issues_soup = BeautifulSoup(chrono_coverpage, 'html.parser', parse_only = issues)
for issue in issues_soup:
    print(issue)
This returns a bs4 object, BUT only with href links from the Volume div. What's worse is that this div should encompass both the Volume div and the Issue div.
So, I decided to try to reduce my search space and make it more specific by choosing the div containing the Issue href links (class_='issues').
This time Jupyter thinks for a bit but won't return ANYTHING. Just blank. Nothing. Zippo. BUT if I ask what type of "nothing" has been returned, Jupyter informs me it is a "String"??? I just don't know what to make of this.
So, my first question: why is it that the Issue div element does not respond to the parsing?
When I try running print(BeautifulSoup(chrono_coverpage, 'html.parser').prettify()) the same thing occurs: the Issue div does not appear (when I use Inspect Element on the html page, it appears immediately beneath the final Volume span):
So I suspect that it must be javascript oriented or something, not so much HTML oriented. Or maybe the class = 'open' has something to do with it.
Any clarification would be kindly appreciated. Also, how would one parse through JavaScript-generated links to get them?
Okay, so I've "resolved" the issue though I need to fill in some theoretical gaps:
Firstly, this snippet holds the key to the beginning of the solution:
As can be seen, the <div class = 'container'> is immediately followed by a ::before pseudo-element, and the links I am interested in are contained inside a div immediately beneath this pseudo-element. This last div is then closed with the ::after pseudo-element.
Firstly I realized that my problem was that I needed to select a pseudo-element. I found this to be quite impossible with BeautifulSoup's soup.select(), since apparently BeautifulSoup uses Soup Sieve, which "aims to allow users to target XML/HTML elements with CSS selectors. It implements many pseudo-classes [...]."
The last part of the paragraph states:
"Soup Sieve also will not match anything for pseudo classes that are only relevant in a live, browser environment, but it will gracefully handle them if they've been implemented;"
So this got me thinking that I had no idea what "pseudo classes that are only relevant in a live browser environment" means. But then I said to myself, "but it also said that had they been implemented, BS4 should be able to parse them". And since I can definitely see the div elements containing my href links of interest using the Inspect tool, I thought they must be implemented.
The first part of that phrase got me thinking: "But do I need a live browser for this to work?"
So that brought me to Selenium's web driver:
import requests
from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver
driver = webdriver.Chrome()
url_chronobiology = driver.get("https://www.tandfonline.com/toc/icbi20/current")
chronobiology_content = driver.page_source
chronobiology_soup = BeautifulSoup(chronobiology_content)
chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]: []
Clearly this result made me sad because I thought I had understood what was going on. But then I thought that if I 'clicked' one of the issues in the previously opened browser, it would work (for some reason; to be honest, I'm pretty sure desperation led me to that thought).
Well, surprise surprise. It worked: after clicking on "Issue 4" and re-running the script, I got what I was looking for:
UNANSWERED QUESTIONS?
1 - Apparently these pseudo-elements only "exist" when clicked upon, because otherwise the code doesn't recognize they are there. Why?
2 - What code must be run in order to make an initial click and activate these pseudo-elements, so the code can automatically open these links and parse the information I want? (title of articles)
UPDATE
Question 2 is answered using Selenium's ActionChain:
import requests
from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")

# hover over the issues scroller so the issue links get rendered
action = ActionChains(driver)
action.move_to_element(driver.find_element_by_xpath('//*[@id="tocList"]/div/div/div[3]/div[2]/div')).perform()

# read the page source only after the interaction, so the now-rendered links are included
chronobiology_content = driver.page_source
chronobiology_soup = BeautifulSoup(chronobiology_content, 'html.parser')
chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]:
[<div class="loi-issues-scroller">
<a class="open" href="/toc/icbi20/37/4?nav=tocList">Issue<span>4</span></a>
<a class="" href="/toc/icbi20/37/3?nav=tocList">Issue<span>3</span></a>
<a class="" href="/toc/icbi20/37/2?nav=tocList">Issue<span>2</span></a>
<a class="" href="/toc/icbi20/37/1?nav=tocList">Issue<span>1</span></a>
</div>]
The only downside is that one must stay on the page so that Selenium's ActionChains.perform() can actually click the element, but at least I've automated this step.
If someone could answer question 1 that would be great
I'm somewhat new to Python, and working on the first part of a project where I need to get the link(s) on a FanDuel page. I've been spinning my tires trying to get the 'href'.
Here's what the Inspect Element shows:
What I'm trying to get to is highlighted above.
I can see what seems to be the parent, but as you go down the tree, the classes listed with lettering (i.e. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need has a constant "data-test-id" that does not change, so I was trying to use that as my way to find what I need, but it does not seem to be working.
I'm not sure how much farther, if at all, I need to drill down to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup
url = "https://www.fanduel.com/contests/mlb/96"
#authentication might not be necessary, it was a test, still getting the same results
site = requests.get(url, cookies={'X-Auth-Token':'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})
#If i use this, i get an error
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Page Source and see whether the HTML is already constructed there (that is the HTML you get when you do requests.get()). I've already checked this, and it is not: the HTML is built by JavaScript. To resolve this, you have to use Selenium to render the JavaScript on the page, and then you can get the page source from Selenium after it has constructed the elements in the DOM.
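A rough sketch of that approach (a minimal example, assuming Chrome and keeping your data-test-id selector):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.fanduel.com/contests/mlb/96")

# parse the JavaScript-rendered page source instead of the raw requests response
soup = BeautifulSoup(driver.page_source, 'lxml')
links = [a.get('href') for a in soup.find_all('a', {'data-test-id': 'ContestCardEnterLink'})]
print(links)
driver.quit()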
I have been breaking my head against this wall for a couple days now, so I thought I would ask the SO community. I want a python script that, among other things, can hit 'accept' buttons on forms on websites in order to download files. To that end, though, I need to get access to the form.
This is an example of the kind of file I want to download. I know that within it, there is an unnamed form with an action to accept the terms and download the file. I also know that the div that form can be found in is the main-content div.
However, whenever I BeautifulSoup parse the webpage, I cannot get the main-content div. The closest I've managed to get is the main_content link right before it, which does not provide me any information through BeautifulSoup's object.
Here's a bit of code from my script:
web_soup = soup(urllib2.urlopen(url))
parsed = list(urlparse(url))
ext = extr[1:]

for downloadable in web_soup.findAll("a"):
    encode = unicodedata.normalize('NFKD', downloadable.text).encode('UTF-8', 'ignore')
    if ext in str.lower(encode):
        if downloadable['href'] in url:
            return ("http://%s%s" % (parsed[1], downloadable['href']))

for div in web_soup.findAll("div"):
    if div.has_key('class'):
        print(div['class'])
        if div['class'] == "main-content":
            print("Yep")
return False
url is the URL I am looking at (the one I posted earlier). extr is the type of file I am hoping to download in the form .extension, but that is not really relevant to my question. The code that is relevant is the second for loop, the one where I am attempting to loop through the divs. The first bit of code (the first for loop) goes through and grabs download links in another case (when the url the script is given is a 'download link' marked by a file extension such as .zip with a content type of text/html), so feel free to ignore it. I added it in just for context.
I hope I provided enough detail, though I am sure I did not. Let me know if you need any more information on what I am doing and I will be happy to oblige. Thanks, Stack.
Here's the code for getting main-content div and form action:
import re
import urllib2
from bs4 import BeautifulSoup as soup
url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
web_soup = soup(urllib2.urlopen(url))
# get main-content div
main_div = web_soup.find(name="div", attrs={'class': 'main-content'})
print main_div
# get form action
form = web_soup.find(name="form", attrs={'action': re.compile('.*\.zip.*')})
print form['action']
Though, if you need, I can provide examples for lxml, mechanize or selenium.
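For instance, a rough lxml equivalent of the same extraction (just a sketch, assuming the page structure stays the same) could look like:
import urllib2
from lxml import html

url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
tree = html.fromstring(urllib2.urlopen(url).read())

# get main-content div
main_div = tree.xpath('//div[@class="main-content"]')
# get the form action pointing at the .zip file
form_action = tree.xpath('//form[contains(@action, ".zip")]/@action')
print main_div
print form_action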
Hope that helps.