I'm trying to get data from the webpage https://bitinfocharts.com/comparison/price-btc.html and I have the code:
doc = SimplifiedDoc(html)
js = doc.getElementByText('new Dygraph', tag='script').html
js = js[js.find('document.getElementById("container"),') + len('document.getElementById("container"),'):]
js = js[:js.find(', {labels:')] # Get data part
js = js.replace('[new Date("', '').replace('")', '')[1:-2]
data = [kv.split(',') for kv in js.split('],')]
which works for other pages on the same website, but on the price page it returns an AttributeError: 'NoneType' object has no attribute 'find'. The code for this page seems to be the same as the rest, so I don't know why this one returns an error. For example, on https://bitinfocharts.com/comparison/transactions-btc.html it works perfectly as intended.
Something is off with that page; I don't know why. You can use the following instead.
from simplified_scrapy import SimplifiedDoc, utils, req
html = req.get('https://bitinfocharts.com/comparison/price-btc.html')
doc = SimplifiedDoc(html)
# js = doc.getElementByText('new Dygraph', tag='script').html
js = doc.selects('script').contains('new Dygraph')[0].html # Change to this
js = js[js.find('document.getElementById("container"),') +
len('document.getElementById("container"),'):]
js = js[:js.find(', {labels:')] # Get data part
js = js.replace('[new Date("', '').replace('")', '')[1:-2]
data = [kv.split(',') for kv in js.split('],')]
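If the workaround succeeds, data should be a list of [date, value] rows. A quick sanity check (plain Python, assuming the parsing above worked):

for row in data[:3]:  # peek at the first few parsed rows
    print(row)
print(len(data), 'rows parsed')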
I am trying to parse a picture off of this page. Specifically, I am trying to parse the image under the div class "gOenxf". When you inspect the webpage, the HTML elements show an "encrypted" image URL, which is useful to me and what I am trying to retrieve. However, when I parse that same page/class, the image comes back as a "Data URL", which is not very useful to me. I am using requests_html because I need something faster than Selenium. I am also using BeautifulSoup because it is easier than requests_html's built-in .find system. Does anyone know why this is happening, or a solution to the problem?
from requests_html import HTMLSession
from bs4 import BeautifulSoup

def google_initiate():
    url = 'https://www.google.com/search?tbm=shop&q=desk'
    session = HTMLSession()
    data = session.get(url)
    google_soup = BeautifulSoup(data.text, features='html.parser')
    google_parsed = google_soup.find_all('div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']})
    google_initiate.google_parse_page = url  # was `URL` (undefined name); presumably the `url` above
    session.close()
    return google_parsed

for google_post in google_initiate():
    post_image_url = str(google_post.find(class_='gOenxf'))
    post_image_url = post_image_url[post_image_url.find('src="') + len('src="'):post_image_url.rfind('"')]
    print(post_image_url)
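One likely cause (an assumption, not verified against Google's markup): the raw HTML ships a small placeholder as a Data URL and JavaScript swaps in the real src afterwards, so DevTools (which shows the live DOM) and requests_html (which sees only the raw HTML) disagree. If so, rendering the page first may help; a minimal sketch using requests_html's own render():

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
data = session.get('https://www.google.com/search?tbm=shop&q=desk')
data.html.render()  # runs the page's JavaScript via pyppeteer (downloads Chromium on first use)
google_soup = BeautifulSoup(data.html.html, features='html.parser')  # parse the rendered DOM
session.close()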
I'm trying to extract data using web scraping with Python. The page contains a table of the movie's release dates and nations. After requesting the page and parsing it with BeautifulSoup, it printed out a blank []. I don't know how to fix it... Here is my code:
soup = BeautifulSoup(response)
element_dates = ".ipl-zebra-list ipl-zebra-list--fixed-first release-dates-table-test-only" # css selector (date release table)
select_datesTag = soup.select(element_dates)
result = [i.text for i in select_datesTag]
print(result)
>>>[]
Edit:
Thank you all for trying to help me. The blank printed result showed that the extraction was unsuccessful.
The cause was the wrong CSS selector I picked for element_dates: instead of the ".ipl-..." classes it should be ".release-date-item__date".
Here is the page I was working on (in the code below), together with the fix:
import requests
from bs4 import BeautifulSoup
target_url = "https://www.imdb.com/title/tt4154796/releaseinfo"
target_params = {"ref_": "tt_ov_inf"}
response = requests.get(target_url, params = target_params)
response = response.text
soup = BeautifulSoup(response, "html.parser")
element_dates = ".release-date-item__date"
result = [i.text for i in soup.select(element_dates)]
print(result)  # successfully prints all the release dates
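More generally, when a selector comes back empty it helps to check the match count before building on it. A small sketch with the variables above:

matches = soup.select(element_dates)
print(len(matches))  # 0 means the selector does not match this page
if matches:
    print(matches[0].get_text(strip=True))  # peek at the first hit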
Can you please help me with my Python code? I want to parse several homepages with Beautiful Soup; they are given in the list html and processed by the function stars.
html=["https://www.onvista.de/aktien/fundamental/Adidas-Aktie-DE000A1EWWW0", "https://www.onvista.de/aktien/fundamental/Allianz-Aktie-DE0008404005", "https://www.onvista.de/aktien/fundamental/BASF-Aktie-DE000BASF111"]
def stars(html):
    bsObj = BeautifulSoup(html.read())
    starbewertung = bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16]
    str_cells = str(starbewertung)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    print(cleantext)
    lst = []
    lst.append(cleantext)

stars(html)
Instead I am getting an error "AttributeError: 'list' object has no attribute 'read'"
As some of the comments mentioned you need to use the requests library to actually grab the content of each link in your list.
import requests
from bs4 import BeautifulSoup
html=["https://www.onvista.de/aktien/fundamental/Adidas-Aktie-DE000A1EWWW0", "https://www.onvista.de/aktien/fundamental/Allianz-Aktie-DE0008404005", "https://www.onvista.de/aktien/fundamental/BASF-Aktie-DE000BASF111"]
def stars(html):
    for url in html:
        resp = requests.get(url)
        bsObj = BeautifulSoup(resp.content, 'html.parser')
        print(bsObj)  # Should print the entire html document.
        # Do other stuff with bsObj here.

stars(html)
The IndexError from bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16] is something you'll need to figure out yourself.
You have a couple of errors here.
1. You are trying to load the whole list of pages into BeautifulSoup. You should process the pages one by one.
2. You should fetch the source code of each page before processing it.
3. There is no "section" element on the page you are loading, so you will get an exception when you try to take the element at index 8. You need to check whether you actually found anything (see the sketch after the code below).
import requests
from bs4 import BeautifulSoup

def stars(html):
    request = requests.get(html)
    if request.status_code != 200:
        return
    page_content = request.content
    bsObj = BeautifulSoup(page_content, "html.parser")
    starbewertung = bsObj.findAll("section")[8].findAll("div")[1].findAll("span")[16]
    str_cells = str(starbewertung)
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()
    print(cleantext)

for page in html:
    stars(page)
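A minimal sketch of the guard from point 3, to drop into stars() before the index chain (the exact [8]/[1]/[16] path is still something to verify against the live page):

sections = bsObj.findAll("section")
if len(sections) < 9:
    print("unexpected page layout for", html)  # nothing at index 8 to dig into
    return
starbewertung = sections[8].findAll("div")[1].findAll("span")[16]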
I am a beginner in web crawling, and I have a question about crawling multiple URLs.
I am using CNBC in my project. I want to extract news titles and URLs from its home page, and I also want to crawl the content of the news articles from each URL.
This is what I've got so far:
import requests
from lxml import html
import pandas

url = "http://www.cnbc.com/"
response = requests.get(url)
doc = html.fromstring(response.text)
headlineNode = doc.xpath('//div[@class="headline"]')
len(headlineNode)

result_list = []
for node in headlineNode:
    url_node = node.xpath('./a/@href')
    title = node.xpath('./a/text()')
    soup = BeautifulSoup(url_node.content)
    text = [''.join(s.findAll(text=True)) for s in soup.findAll("div", {"class": "group"})]
    if (url_node and title and text):
        result_list.append({'URL': url + url_node[0].strip(),
                            'TITLE': title[0].strip(),
                            'TEXT': text[0].strip()})
print(result_list)
len(result_list)
I keep getting an error saying 'list' object has no attribute 'content'. I want to create a dictionary that contains the title, the URL, and the article content for each headline. Is there an easier way to approach this?
Great start on the script. However, soup = BeautifulSoup(url_node.content) is wrong: url_node is a list. You need to form the full news URL, use requests to get the HTML, and then pass it to BeautifulSoup.
Apart from that, there are a few things I would look at:
I see import issues: BeautifulSoup is not imported. Add from bs4 import BeautifulSoup to the top. Are you using pandas? If not, remove the import.
Some of the news divs on CNBC with the big banner picture will yield a zero-length list when you query url_node = node.xpath('./a/@href'). You need to find the appropriate logic and selectors to get those news URLs as well. I will leave that up to you.
Check this out:
import requests
from lxml import html
from bs4 import BeautifulSoup

# Note: trailing slash removed
url = "http://www.cnbc.com"
response = requests.get(url)
doc = html.fromstring(response.text)
headlineNode = doc.xpath('//div[@class="headline"]')
print(len(headlineNode))

result_list = []
for node in headlineNode:
    url_node = node.xpath('./a/@href')
    title = node.xpath('./a/text()')
    # Figure out logic to get that pic banner news URL
    if len(url_node) == 0:
        continue
    else:
        news_html = requests.get(url + url_node[0])
        soup = BeautifulSoup(news_html.content, 'html.parser')
        text = [''.join(s.findAll(text=True)) for s in soup.findAll("div", {"class": "group"})]
        if (url_node and title and text):
            result_list.append({'URL': url + url_node[0].strip(),
                                'TITLE': title[0].strip(),
                                'TEXT': text[0].strip()})
print(result_list)
len(result_list)
Bonus debugging tip:
Fire up an ipython3 shell and do %run -d yourfile.py. Look up ipdb and the debugging commands. It's quite helpful to check what your variables are and if you're calling the right methods.
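The ipdb commands you will use most are the standard pdb ones, for example:

# At the ipdb prompt after %run -d yourfile.py (line number is illustrative):
#   b 12         set a breakpoint at line 12
#   c            continue until the next breakpoint
#   n / s        step over / step into
#   p url_node   print a variable
#   q            quit the debugger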
Good luck.
The game plan is to extract the main images and display them as thumbnails on the index page. I'm having a lot of trouble with this functionality; there seems to be no example for it on the internet.
I found three options:
1. BeautifulSoup // it seems people use this approach the most, but I have no idea how BeautifulSoup would find the representative image... it also requires the most work, I think.
2. python-goose // this looks legit. The documentation says it extracts the main image, so I guess I have to trust their word. The problem is I don't know how to use it in Django.
3. embedly // ...maybe the wrong choice for the functionality I need.
I'm thinking of using python-goose for this project.
My question is: how would you approach this problem? Do you know of any examples, or can you provide one I can look at? For extracting an image from the images users upload to my page I can probably use sorl-thumbnail (right?), but for a posted link...??
Edit 1: Using python-goose, (main) image scraping seems very simple. The problem is that I'm not sure how to wire the script into my app, how to turn that image into the right thumbnail, and how to display it on my index.html...
Here is my media.py (not sure if it works yet):
import json
from goose import Goose

def extract(request):
    url = request.args.get('url')
    g = Goose()
    article = g.extract(url=url)
    response = {'image': article.top_image.src}
    return json.dumps(response)
source: https://blog.openshift.com/day-16-goose-extractor-an-article-extractor-that-just-works/
the blog example uses Flask; I tried to adapt the script for people using Django
Edit 2: OK, here is my approach. I really think this is right, but unfortunately it doesn't give me anything: no error and no image, even though the Python syntax is right... If anyone knows why it's not working, please let me know.
Models.py

class Post(models.Model):
    url = models.URLField(max_length=250, blank=True, null=True)

    def extract(request, url):
        url = requests.POST.get('url')
        g = Goose()
        article = g.extract(url=url)
        response = {'image': article.top_image.src}
        return json.dumps(response)
Index.html

{% if posts %}
    {% for post in posts %}
        {{ post.extract }}
    {% endfor %}
{% endif %}
BeautifulSoup would be the way to go for this, and is actually remarkably easy.
To begin, an image in HTML looks like this:
<img src="http://www.url.to/image.png">
We can use BeautifulSoup to extract all img tags and then find the src of the img tag. This is achieved as shown below.
from bs4 import BeautifulSoup  # Import stuff
import requests

r = requests.get("http://www.site-to-extract.com/")  # Download website source
data = r.text  # Get the website source as text
soup = BeautifulSoup(data)  # Setup a "soup" which BeautifulSoup can search
links = []
for link in soup.find_all('img'):  # Cycle through all 'img' tags
    imgSrc = link.get('src')  # Extract the 'src' from those tags
    links.append(imgSrc)  # Append the source to 'links'
print(links)  # Print 'links'
I don't know how you plan on deciding which image to use as the thumbnail, but you can then loop through the list of URLs and extract the one you want.
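For example, img srcs often come back relative, so you might normalize them before choosing (a sketch using only the standard library, reusing the page URL from the snippet above):

from urllib.parse import urljoin

page_url = "http://www.site-to-extract.com/"
candidates = [urljoin(page_url, src) for src in links if src]  # resolve relative srcs, skip empty ones
thumbnail = candidates[0] if candidates else None  # naive choice: the first image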
Update
I know you said Django, but I would highly recommend Flask. It's a lot simpler, yet still very functional.
I wrote this, which simply displays the 1st image of whatever webpage you give it.
from bs4 import BeautifulSoup  # Import stuff
import requests
from flask import Flask

app = Flask(__name__)

def getImages(url):
    r = requests.get(url)  # Download website source
    data = r.text  # Get the website source as text
    soup = BeautifulSoup(data)  # Setup a "soup" which BeautifulSoup can search
    links = []
    for link in soup.find_all('img'):  # Cycle through all 'img' tags
        imgSrc = link.get('src')  # Extract the 'src' from those tags
        links.append(imgSrc)  # Append the source to 'links'
    return links  # Return 'links'

@app.route('/<site>')
def page(site):
    image = getImages("http://" + site)[0]  # Here I find the 1st image on the page
    if image[0] == "/":
        image = "http://" + site + image  # This creates a URL for the image
    return '<img src="%s">' % image  # Return the image in an HTML "img" tag

if __name__ == '__main__':
    app.run(debug=True, host="0.0.0.0")  # Run the Flask webserver
This hosts a web server on http://localhost:5000/
To input a site, do http://localhost:5000/yoursitehere, for example http://localhost:5000/www.google.com