Reading information from a page without web locators (Selenium, Python)

xpath_id = '/html/body'
conf_code = driver.find_element(By.XPATH, (xpath_id))
code_list = []
for c in range(len(conf_code)):
    code_list.append(conf_code[c].text)
As seen above, I chose the XPath locator, but I can't locate the text. That is because this particular webpage is almost completely blank: it has only text in the <body>.
The HTML of the page is below:
<html><head></head><body>text that i want to read and save</body></html>
How can I read this text and then store it in a variable?

Your question is not clear enough.
Anyway, in case there are multiple elements containing texts on that page you can use something like this:
xpath_id = '/html/body/*'
conf_code = driver.find_elements(By.XPATH, (xpath_id))
code_list = []
for c in conf_code:
    code_list.append(c.text)
Don't forget to add a delay so that the page is completely loaded before you get these elements from it.

If you're really just grabbing a website that is so simple, you don't need selenium. Grab the website with requests and split the result on the body tags to get the text. Much simpler code and avoids the overhead of the selenium driver.
import requests
url = "http://your-url-here.com"
content = requests.get(url).text
the_string_youre_looking_for = content.split('<body>')[1].split('</body>')[0]
Is this what you're looking for? If not, maybe try and reword your question, because it's a bit hard to understand what you want your code to do and in what context.

Resolved using
print(driver.page_source)
I got the full HTML content, and because it is so simple, it was easy to extract the required content within the <body> tag.
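The extraction step is not shown above, so here is a minimal sketch of one way to do it with only the standard library. It assumes the page source looks like the HTML described in the question; `BodyTextParser`, `body_text` and `sample_source` are hypothetical names introduced for illustration, and in a real session you would pass `driver.page_source` instead of the sample string.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collects the text that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)


def body_text(page_source):
    """Return the stripped text content found inside the <body> element."""
    parser = BodyTextParser()
    parser.feed(page_source)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical page source, shaped like the page described in the question.
sample_source = "<html><head></head><body>text that i want to read and save</body></html>"
conf_code = body_text(sample_source)
print(conf_code)
```

If the Selenium driver is already open, `driver.find_element(By.TAG_NAME, "body").text` is probably the shorter route to the same string. The parser above is mainly useful when you only have the raw HTML, and it would also pick up script or style text if the body contained any.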

Related

How to access text element in selenium if it is split by body tags

I have a problem while trying to access some values on the website during the process of web scraping the data. The problem is that the text I want to extract is in the class which contains several texts separated by tags (these body tags also have texts which are also important for me).
So firstly, I tried to look for the tag with the text I needed ('Category' in this case) and then extract the exact category from the text below this body tag assignment. I could use precise XPath but here it is not the case because other pages I need to web scrape contain a different amount of rows in this sidebar so the locations, as well as XPaths, are different.
The expected output is 'utility' - the category in the sidebar.
The website and the text I need to extract look like this (look at the sidebar on the right containing 'Category'):
The element looks like that:
And the code I tried:
driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()
the link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.
Thank you!
You might be better off using regex here, as the whole text comes under the 'company-sidebar-body' class, where only some text is between b tags and some are not.
So, you can get the text of the element with that class first:
sidebartext = driver.find_element_by_class_name("company-sidebar-body").text
That will give you the following:
"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
You can then use regex to target the category:
import re
c = re.search(r"Category:\s\w+", sidebartext).group()
print(c)
c will result in 'Category: Utility' which you can then work with. This will also work if the value of the category ('Utility') is different on other pages.
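Since the expected output is just the category itself, a capture group can return the value without the "Category:" prefix. The following is a small sketch of that idea, not the answerer's code: `sidebar_field` is a hypothetical helper, and the sample text is a shortened copy of the sidebar text shown above.

```python
import re

# Shortened copy of the sidebar text from the answer above.
sidebartext = (
    "EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\n"
    "Category: Utility\r\nAsking Deal\r\nEquity: 10%\r\n"
    "Equity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%"
)


def sidebar_field(text, label):
    """Return the value that follows 'label:' on its line, or None if absent."""
    pattern = r"^" + re.escape(label) + r":[ \t]*(.*?)\s*$"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None


category = sidebar_field(sidebartext, "Category")
print(category.lower() if category else None)
```

Because `re.search` stops at the first match, a label that occurs more than once (such as "Equity" here) returns only its first value. Returning `None` for a missing label avoids the `AttributeError` that `.group()` raises when a page has no such row.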
There are easier ways when it's a MediaWiki website. You could, for instance, access the page data through the API with a JSON request and parse it with a much more limited DOM.
Any particular reason you want to scrape my website?

Crawl a webpage which is generated by Javascript

I want to crawl the data from this website
I only need the text "Pictograph - A spoon 勺 with something 一 in it"
I checked Network -> Doc and I think the information is hidden here.
Because I found this line there:
i.length > 0 && (r += '<span>» Formation: <\/span>' + i + _Eb)
And I think this code generates part of the page that we can see from the link.
However, I don't know what kind of code this is. It has HTML, but it also contains many function() definitions.
Update
If the code is JavaScript, I would like to know how I can crawl the website without using Selenium.
Thanks!
This page uses JavaScript to add this element. Using Selenium, I can get the HTML after the element has been added and then search for the text in it. The HTML has a strange construction: all the text is in a single tag, so this part has no dedicated tag to find it by. But it is the last text in that tag and it starts with "Formation:", so I use BeautifulSoup to get all the text, including all subtags, with get_text(), and then I use split('Formation:') to get the text after this marker.
import selenium.webdriver
from bs4 import BeautifulSoup as BS
driver = selenium.webdriver.Firefox()
driver.get('https://www.archchinese.com/chinese_english_dictionary.html?find=%E4%B8%8E')
soup = BS(driver.page_source, 'html.parser')  # name the parser explicitly to avoid bs4's warning
text = soup.find('div', {'id': "charDef"}).get_text()
text = text.split('Formation:')[-1]
print(text.strip())
Selenium may run slower, but it was faster to create a solution with it.
If I could find the URL used by JavaScript to load the data, I would use it without Selenium, but I didn't see this information in the XHR responses. A few responses were compressed (probably gzip) or encoded, and this text may have been in them, but I didn't try to decompress or decode them.

List links of xls files using Beautifulsoup

I'm trying to retrieve a list of downloadable xls files on a website.
I'm a bit reluctant to provide full links to the website in question.
Hopefully I'm able to provide all necessary details all the same.
If this is useless, please let me know.
Download .xls files from a webpage using Python and BeautifulSoup is a very similar question, but the details below will show that the solution most likely will have to be different since the links on that particular site are tagged with a href anchor:
And the ones I'm trying to get are not tagged the same way.
On the webpage, the files that are available for downloading are listed like this:
A simple mousehover gives these further details:
I'm following the setup here with a few changes to produce the snippet below that provides a list of some links, but not to any of the xls files:
from bs4 import BeautifulSoup
import urllib.request  # "import urllib" alone does not reliably expose urllib.request
import re

def getLinks(url):
    with urllib.request.urlopen(url) as response:
        html = response.read()
    soup = BeautifulSoup(html, "lxml")
    links = []
    # keep only anchors whose href starts with http://
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

links1 = getLinks("https://SOMEWEBSITE")
A further inspection using ctrl+shift+I in Google Chrome reveals that those particular links do not have a href anchor tag, but rather a ng-href anchor tag:
So I tried changing that in the snippet above, but with no success.
And I've tried different combinations with re.compile("^https://"), attrs={'ng-href' and links.append(link.get('ng-href')), but still with no success.
So I'm hoping someone has a better suggestion!
EDIT - Further details
It seems it's a bit problematic to read these links directly.
When I use ctrl+shift+I and the Select an element in the page to inspect it Ctrl+Shift+C, this is what I can see when I hover over one of the links listed above:
And what I'm looking to extract here is the information associated with the ng-href tag. But if I right-click the page and select Show Source, the same tag only appears once, along with some metadata(?):
And I guess this is why my rather basic approach is failing in the first place.
I'm hoping this makes sense to some of you.
Update:
using selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
driver.get('http://.....')
# wait max 15 second until the links appear
xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@ng-href, ".xls")]'))
# Or
# xls_links = WebDriverWait(driver, 15).until(lambda d: d.find_elements_by_xpath('//a[contains(@href, ".xls")]'))
links = []
for link in xls_links:
    url = "https://SOMEWEBSITE" + link.get_attribute('ng-href')
    print(url)
    links.append(url)
Assuming ng-href is not dynamically generated: from your last image I see that the URL does not start with https:// but with a slash /, so you can try a regex that matches URLs containing .xls:
for link in soup.findAll('a', attrs={'ng-href': re.compile(r"\.xls")}):
    xls_link = "https://SOMEWEBSITE" + link['ng-href']
    print(xls_link)
    links.append(xls_link)
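The same filtering idea can be sketched with only the standard library, which makes it easy to try against a saved copy of the page. Everything named here is an assumption for illustration: `XlsLinkParser`, the sample HTML, and the `https://example.com` base URL are hypothetical. As noted in the thread, this can only work if the ng-href values are already present in the HTML you downloaded rather than filled in by AngularJS at runtime.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XlsLinkParser(HTMLParser):
    """Collects absolute URLs of <a> tags whose ng-href or href mentions .xls."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Prefer ng-href and fall back to a plain href.
        target = attr_map.get("ng-href") or attr_map.get("href")
        if target and ".xls" in target.lower():
            # urljoin resolves site-relative paths such as "/files/a.xls".
            self.links.append(urljoin(self.base_url, target))


# Hypothetical markup; the real page structure is not shown in the question.
sample_html = """
<a ng-href="/files/report_2018.xls">Report 2018</a>
<a href="http://example.com/about">About</a>
<a href="/files/archive.xls">Archive</a>
"""

parser = XlsLinkParser("https://example.com")
parser.feed(sample_html)
parser.close()
xls_links = parser.links
print(xls_links)
```

Using `urljoin` instead of string concatenation keeps the result correct whether the attribute holds a relative path or an already absolute URL.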
My guess is that the data you are trying to crawl is created dynamically: ng-href is one of AngularJS's constructs. You could try using Google Chrome's network inspection as you already did (ctrl+shift+I) and see if you can find the URL that is queried (open the Network tab and reload the page). The query should typically return JSON with the links to the xls files.
There is a thread about a similar problem here. Perhaps that helps you: Unable to crawl some href in a webpage using python and beautifulsoup

web scraping in python

I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')
results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")
    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass
    # Find the next page to scrape
    prev = soup.find("prev_signature")
    # Check if another page of result exists - if not break from loop
    if prev is None:
        break
    # Get the previous URL
    url = prev.contents[0]
    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))
# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data to go through, and I have no idea whether this is against the website's terms of service, so you would need to check that.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on a site in a short amount of time slowing down legitimate users requests. Not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer
Scrape smartly
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request, it'll most likely be using some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data because it will be in a nice format.
Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
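The two pieces of advice above (space your requests, respect robots.txt) can be sketched with the standard library. This is only an illustration under stated assumptions: `ASSUMED_ROBOTS_TXT` is reconstructed from the single statement that /xml/ is disallowed and is not the site's actual file, and `build_robot_rules`, `allowed_urls` and `fetch_politely` are hypothetical helper names. In real use you would load the live file with `RobotFileParser.set_url()` and `read()`.

```python
import time
from urllib.robotparser import RobotFileParser

# Assumption: minimal robots.txt based only on the remark that /xml/ is disallowed.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /xml/
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt content that is already available as a string."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(urls, rules, user_agent="*"):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


rules = build_robot_rules(ASSUMED_ROBOTS_TXT)
candidates = [
    "http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html",
    "http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml",
]
permitted = allowed_urls(candidates, rules)
print(permitted)
```

Passing the download function in as `fetch` keeps the pacing logic separate from the HTTP library, so the same loop works with `urllib`, `requests`, or a stub in tests.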
What do you mean by not working? An empty list or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.

How to find links with all uppercase text using Python (without a 3rd party parser)?

I am using BeautifulSoup in a simple function to extract links that have all uppercase text:
def findAllCapsUrls(page_contents):
    """ given HTML, returns a list of URLs that have ALL CAPS text
    """
    soup = BeautifulSoup.BeautifulSoup(page_contents)
    all_urls = soup.findAll(name='a')
    # if the text for the link is ALL CAPS then add the link to good_urls
    good_urls = []
    for url in all_urls:
        text = url.find(text=True)
        if text.upper() == text:
            good_urls.append(url['href'])
    return good_urls
Works well most of the time, but a handful of pages will not parse correctly in BeautifulSoup (or lxml, which I also tried) due to malformed HTML on the page, resulting in an object with no (or only some) links in it. A "handful" might sound like not-a-big-deal, but this function is being used in a crawler so there could be hundreds of pages that the crawler will never find...
How can the above function be refactored to not use a parser like BeautifulSoup? I've searched around for how to do this using regex, but all the answers say "use BeautifulSoup." Alternatively, I started looking at how to "fix" the malformed HTML so that it parses, but I don't think that is the best route...
What is an alternative solution, using re or something else, that can do the same as the function above?
If the HTML pages are malformed, there are not many solutions that can really help you. BeautifulSoup or another parsing library is the way to go for parsing HTML files.
If you want to avoid the library path, you could use a regexp to match all your links (see regular-expression-to-extract-url-from-an-html-link), using a range of [A-Z].
When I need to parse really broken HTML and speed is not the most important factor, I automate a browser with Selenium and WebDriver.
This is the most robust way of parsing HTML that I know.
Check this tutorial; it shows how to extract Google suggestions using WebDriver (the code is in Java, but it can be ported to Python).
I ended up with a combination of regex and BeautifulSoup:
def findAllCapsUrls2(page_contents):
    """ returns a list of URLs that have ALL CAPS text, given
    the HTML from a page. Uses a combo of RE and BeautifulSoup
    to handle malformed pages.
    """
    # get all anchors on page using regex
    p = r'<a\s+href\s*=\s*"([^"]*)"[^>]*>(.*?(?=</a>))</a>'
    re_urls = re.compile(p, re.DOTALL)
    all_a = re_urls.findall(page_contents)
    # if the text for the anchor is ALL CAPS then add the link to good_urls
    good_urls = []
    for a in all_a:
        href = a[0]
        a_content = a[1]
        a_soup = BeautifulSoup.BeautifulSoup(a_content)
        text = ''.join([s.strip() for s in a_soup.findAll(text=True) if s])
        if text and text.upper() == text:
            good_urls.append(href)
    return good_urls
This is working for my use cases so far, but I wouldn't guarantee it to work on all pages. Also, I only use this function if the original one fails.
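For the original request (no third-party parser at all), the inner-text step can also be done with `re` alone. The sketch below is a variation on the function above, not a drop-in replacement: `find_all_caps_urls`, `ANCHOR_RE` and `TAG_RE` are hypothetical names, the pattern accepts href in any attribute position and with either quote style, and it uses `str.isupper()`, so link text without any letters (such as "123") is not counted as ALL CAPS. Like any regex over HTML, it can miss unusual markup such as unquoted attribute values.

```python
import re
from html import unescape

# An <a ...> tag with a quoted href anywhere among its attributes, then its content.
ANCHOR_RE = re.compile(
    r"""<a\s(?:[^>]*?\s)?href\s*=\s*(["'])(?P<href>.*?)\1[^>]*>(?P<inner>.*?)</a>""",
    re.DOTALL | re.IGNORECASE,
)
# Any tag, used to strip markup such as <b> from the anchor content.
TAG_RE = re.compile(r"<[^>]+>")


def find_all_caps_urls(page_contents):
    """Return hrefs of anchors whose visible text is entirely upper case."""
    good_urls = []
    for match in ANCHOR_RE.finditer(page_contents):
        # Drop nested tags, decode entities like &amp;, and trim whitespace.
        text = unescape(TAG_RE.sub("", match.group("inner"))).strip()
        if text.isupper():
            good_urls.append(match.group("href"))
    return good_urls


# Hypothetical, deliberately sloppy markup with unclosed tags.
sample_page = """<p><a href="/NEWS"><b>BREAKING NEWS</b></a>
<a class="x" href='/about'>About us</a>
<a href="/SALE">SALE &amp; MORE</a><div>
<a href="/123">123</a>"""

print(find_all_caps_urls(sample_page))
```

Because nothing here builds a document tree, unclosed or misnested tags elsewhere on the page do not affect the anchors that the pattern can see.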
