Scraping with lxml and Python requests - python

Okay, I am at it again and really trying to figure this stuff out with lxml and Python. The last time I asked a question I was using XPath and had to figure out how to handle the case where the direct XPath itself might change, so I have edited my code to target the class instead. I keep running into a problem where it pulls up the object's address in memory rather than the text I want. Before anyone says there is a library for what I want to do: this is not about that, but rather about helping me understand this code. Here is what I have so far, but when I print it out I get an error, and even if I add [0] so it reads print(prices[0].text), it still gives me nothing. Any help would be cool.
from lxml import html
import requests
import time
while True:
page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
content = html.fromstring(page.content)
#This will create a list of prices:
prices = content.find_class('price')
print(prices.text)
time.sleep(.5)

Probably a formatting issue from posting, but your while loop's body is not indented.
Try my code below:
from lxml import html
import requests
import time

while True:
    page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
    content = html.fromstring(page.content)
    prices = content.find_class('price')
    # You need to access the 'text_content' method
    text = [p.text_content() for p in prices]
    for t in text:
        if t.strip():  # Prevents the multiple blank lines
            print(t)
    time.sleep(0.5)

Related

Reading information from a page without Web Locators, Selenium, Python

xpath_id = '/html/body'
conf_code = driver.find_element(By.XPATH, (xpath_id))
code_list = []
for c in range(len(conf_code)):
    code_list.append(conf_code[c].text)
As seen above, I chose the XPath locator, but I can't locate the text. That is because this particular webpage is completely blank, with nothing but text inside the <body> tag.
The HTML of the page is below:
<html><head></head><body>text that i want to read and save</body></html>
How do I read this text and then store it in a variable?
Your question is not clear enough.
Anyway, in case there are multiple elements containing text on that page, you can use something like this:
xpath_id = '/html/body/*'
conf_code = driver.find_elements(By.XPATH, (xpath_id))
code_list = []
for c in conf_code:
    code_list.append(c.text)
Don't forget to add some delay so the page is completely loaded before you get all of these elements from it.
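For example, a minimal sketch of such a wait using Selenium's explicit wait helpers (the 10-second timeout is an arbitrary choice, and driver is assumed to be the WebDriver instance from your code):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one element to appear inside <body>
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '/html/body/*'))
)
conf_code = driver.find_elements(By.XPATH, '/html/body/*')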
If you're really just grabbing a website that is so simple, you don't need selenium. Grab the website with requests and split the result on the body tags to get the text. Much simpler code and avoids the overhead of the selenium driver.
import requests
url = "http://your-url-here.com"
content = requests.get(url).text
the_string_youre_looking_for = content.split('<body>')[1].split('</body>')[0]
Is this what you're looking for? If not, maybe try and reword your question, because it's a bit hard to understand what you want your code to do and in what context.
Resolved using
print(driver.page_source)
I got the full HTML content, and due to its simplicity it was easy to extract the required content within the <body> tag.
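For reference, a minimal sketch of that approach, assuming the page really is nothing but plain text inside a bare <body> tag (the URL is a placeholder):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://your-url-here.com')  # placeholder URL

# page_source returns the full HTML; slice out whatever sits between the body tags.
# Note: this simple split assumes the tag is literally '<body>' with no attributes.
source = driver.page_source
text = source.split('<body>')[1].split('</body>')[0].strip()
print(text)

driver.quit()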

Python web scraping code won't open links

This is from the book "Automate the Boring Stuff with Python".
At first I made a .bat file and ran it with arguments from cmd; it didn't open any pages in Chrome. I looked it up on here and changed the code, and it still executes perfectly and prints the print line, but it doesn't open tabs as it should.
What am I doing wrong? Thanks in advance.
#! python3
# lucky.py opens several google search matches
import requests,sys,webbrowser,bs4
searchTerm1 = 'python'
print('Googling...')
res = requests.get('https://www.google.com/search?={0}'.format(searchTerm1))
res.raise_for_status()
#retrieve top search result links
soup = bs4.BeautifulSoup(res.text,"html.parser")
#open a browser tab for each result.
linkElems = soup.select('.r a')
numOpen = min(5,len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
The short answer is that your URL is not returning results. Here's a URL that provides results: https://www.google.com/search?q=python.
I changed the one line in your code to use this template: "https://www.google.com/search?q={0}" and I saw linkElems was non-trivial.
In short, webbrowser is not opening any pages because numOpen is 0, so the for loop tries to iterate over 0 items, which means the code within that for loop block (webbrowser.open) never gets executed.
The longer, more detailed explanation of why numOpen is 0 is that a redirect occurs with the initial GET request given your custom Google query. See this answer for how to circumvent these issues, as there are numerous ways; the easiest is probably to use the Google Search API.
As a result of the redirect, your BeautifulSoup search will not return any successful results, causing the numOpen variable to be set to 0 as there will be no list elements. As there are no list elements, the for loop does not execute.
You can debug things like this on your own in a quick and dirty (but not perfect) way by simply adding print statements throughout the script to see which print statements fail to execute, as well as by looking at the variables and their returned values.
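For example, a rough sketch of the one-line fix described above plus a quick sanity-check print (not the full script):
import requests, bs4

searchTerm1 = 'python'
# The query string needs the 'q=' parameter; without it Google returns no results
res = requests.get('https://www.google.com/search?q={0}'.format(searchTerm1))
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
# If this prints 0, the request or the selector is the problem
print(len(linkElems))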
As an aside, the shebang should also be set to #!/usr/bin/env python3 rather than simply #! python3. Reference here.
Hope this helps

Why does BeautifulSoup return an empty list (but not when run in the interpreter)?

I am writing a simple program to scrape flight data off of travel websites.
When I test this in the Python interpreter, it works fine. However, when I run it as a .py file, it returns an empty list.
I know that the program is able to get the data from the site, because I can print the entire source no problem. However, for some reason the find_all method is returning an empty list.
The part that should work:
import urllib2
from bs4 import BeautifulSoup
raw_page_data = urllib2.urlopen(query_url)
soup = BeautifulSoup(raw_page_data, 'html.parser')
raw_prices = soup.find_all(class_='price-whole')
for item in raw_prices:
    for character in item:
        if character.isdigit() == True:
            print character
Usually, in the interpreter, this returns a list of prices. I cannot print raw_prices at all, as it is empty. I believe that the error is with the find_all method, because the output is [].
I have been told that, perhaps, the JavaScript on the site is not loading the elements I am trying to find. However, this makes no sense, since the entirety of the page source is loaded and printable. If I try print soup, it prints the entire source (which contains the elements I am trying to find).
I tried with a different parser as well, to check if there was a bug.
Any help greatly appreciated cheers!

web scraping in python

I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')
results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")
    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass
    # Find the next page to scrape
    prev = soup.find("prev_signature")
    # Check if another page of results exists - if not, break from the loop
    if prev == None:
        break
    # Get the previous URL
    url = prev.contents[0]
    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data there to go through, and I have no idea if this is against the terms of service of the website, so you would need to check that out.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on a site in a short amount of time, slowing down legitimate users' requests. Not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer (see the sketch after this list)
Scrape smartly
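A minimal sketch of spacing out requests (the URL list here is hypothetical):
import time
import requests

urls = ['http://example.com/page1', 'http://example.com/page2']  # hypothetical URLs

for url in urls:
    response = requests.get(url)
    # ... process response.text here ...
    time.sleep(2)  # pause between requests to keep the load on the server low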
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be using some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data because it will be in a nice format.
Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
What do you mean by "not working"? An empty list or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
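One quick way to check is to fetch the page and look for the class name in the raw HTML before parsing; if it is not there, the content is most likely added by JavaScript after the page loads (a rough sketch, reusing the URL from the question):
import urllib2

raw_html = urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read()
# False here means the class never appears in the served HTML
print('name_location' in raw_html)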

Screen scraping in lxml with Python -- extract specific data

I've been trying to write a program for the last several hours that does what I thought would be an incredibly simple task:
Program asks for user input (let's say they type 'happiness')
Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
Program returns first quote from the website.
I've tried using XPath with lxml, but I have no experience and every single construction comes back with a blank array.
The actual meat of the quote appears to be contained in the class "sqq."
If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a text node attribute "wholeText" or "textContent", but I don't know how to use that knowledge programmatically.
Any ideas?
import lxml.html
import urllib
site = 'http://thinkexist.com/search/searchquotation.asp'
userInput = raw_input('Search for: ').strip()
url = site + '?' + urllib.urlencode({'search':userInput})
root = lxml.html.parse(url).getroot()
quotes = root.xpath('//a[@class="sqq"]')
print quotes[0].text_content()
... and if you enter 'Shakespeare', it returns
In real life, unlike in Shakespeare, the sweetness
of the rose depends upon the name it bears. Things
are not only what they are. They are, in very important
respects, what they seem to be.
If it's not necessary for you to implement this via XPath, you may use the BeautifulSoup library like this (let the myXml variable contain the page HTML source):
from bs4 import BeautifulSoup

soup = BeautifulSoup(myXml)
for a in soup.findAll('a', {'class': 'sqq'}):
    # this is your quote
    print a.contents
Anyway, read the BS documentation; it may be very useful for some scraping needs that don't require the power of XPath.
You could open the html source to find out the exact class you are looking for. For example, to grab the first StackOverflow username encountered on the page you could do:
#!/usr/bin/env python
from lxml import html
url = 'http://stackoverflow.com/questions/4710307'
tree = html.parse(url)
path = '//div[@class="user-details"]/a[@href]'
print tree.findtext(path)
# -> Parseltongue
# OR to print text including the text in children
a = tree.find(path)
print a.text_content()
# -> Parseltongue
