The game plan is to extract the main image from each posted link and display it as a thumbnail on the index page. I'm having a lot of trouble with this functionality; there seems to be no example of it anywhere on the internet.
I found three options:
1. BeautifulSoup // it seems people use this approach the most, but I have no idea how BeautifulSoup would find the representative image... it also looks like the most work.
2. python-goose // this looks legit. The documentation says it extracts the main image, so I guess I have to trust their word. The problem is I don't know how to use it in Django.
3. embedly // ...maybe the wrong choice for the functionality I need.
I'm thinking of using python-goose for this project.
My question is: how would you approach this problem? Do you know of any example I can look at, or can you provide one? For extracting images that users upload to my page I can probably use sorl-thumbnail (right?), but for a posted link...?
Edit 1: Using python-goose, (main) image scraping seems very simple. The problem is that I'm not sure how to wire the script into my app, how to turn that image into the right thumbnail, and how to display it on my index.html...
Here is my media.py (not sure if it works yet):
import json
from goose import Goose

def extract(request):
    # in Django the query string lives on request.GET (request.args is Flask)
    url = request.GET.get('url')
    g = Goose()
    article = g.extract(url=url)
    response = {'image': article.top_image.src}
    return json.dumps(response)
source: https://blog.openshift.com/day-16-goose-extractor-an-article-extractor-that-just-works/
The blog example uses Flask; I tried to adapt the script for people using Django.
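For reference, here is a minimal sketch of how that Flask snippet might be wired up as a Django view (my own adaptation; the view name and URL pattern are made up for illustration):

import json
from django.http import HttpResponse
from goose import Goose

def extract_image(request):
    # hypothetical endpoint: GET /extract/?url=http://example.com/article
    url = request.GET.get('url')
    article = Goose().extract(url=url)
    payload = {'image': article.top_image.src}
    return HttpResponse(json.dumps(payload), content_type='application/json')

In urls.py, something like url(r'^extract/$', extract_image) would route to it.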
Edit 2: OK, here is my approach. I really think this is right, but unfortunately it doesn't give me anything: no error and no image, even though the Python syntax looks right... If anyone knows why it's not working, please let me know.
models.py
import json
from django.db import models
from goose import Goose

class Post(models.Model):
    url = models.URLField(max_length=250, blank=True, null=True)

    def extract(self):
        # template calls like {{ post.extract }} take no arguments,
        # so read the URL from self rather than from a request
        g = Goose()
        article = g.extract(url=self.url)
        response = {'image': article.top_image.src}
        return json.dumps(response)
index.html
{% if posts %}
    {% for post in posts %}
        {{ post.extract }}
    {% endfor %}
{% endif %}
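For the record, {{ post.extract }} would just print the raw JSON string. What I ultimately want is something like this sketch, where extract_image_url is a hypothetical method on Post that returns only the URL string:

{% for post in posts %}
    <img src="{{ post.extract_image_url }}" alt="thumbnail">
{% endfor %}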
BeautifulSoup would be the way to go for this, and is actually remarkably easy.
To begin, an image in HTML looks like this:
<img src="http://www.url.to/image.png">
We can use BeautifulSoup to extract all img tags and then find the src of the img tag. This is achieved as shown below.
from bs4 import BeautifulSoup  # Import stuff
import requests

r = requests.get("http://www.site-to-extract.com/")  # Download website source
data = r.text  # Get the website source as text
soup = BeautifulSoup(data, "html.parser")  # Set up a "soup" which BeautifulSoup can search
links = []
for link in soup.find_all('img'):  # Cycle through all 'img' tags
    imgSrc = link.get('src')  # Extract the 'src' from those tags
    links.append(imgSrc)  # Append the source to 'links'
print(links)  # Print 'links'
I don't know how you plan on deciding which image to use as the thumbnail, but you can then go through the list of URLs and extract the one you want.
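If you want a heuristic for picking the "main" image, one common convention (my suggestion, not something the question specifies) is to prefer the page's og:image meta tag and fall back to the first img tag. A rough sketch:

from bs4 import BeautifulSoup
import requests

def guess_main_image(url):
    # hypothetical helper: og:image is a widespread convention for a
    # page's representative image; fall back to the first <img> if absent
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    og = soup.find("meta", property="og:image")
    if og and og.get("content"):
        return og["content"]
    img = soup.find("img")
    return img.get("src") if img else None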
Update
I know you said Django, but I would highly recommend Flask. It's a lot simpler, yet still very functional.
I wrote this, which simply displays the first image of whatever webpage you give it.
from bs4 import BeautifulSoup  # Import stuff
import requests
from flask import Flask

app = Flask(__name__)

def getImages(url):
    r = requests.get(url)  # Download website source
    data = r.text  # Get the website source as text
    soup = BeautifulSoup(data, "html.parser")  # Set up a "soup" which BeautifulSoup can search
    links = []
    for link in soup.find_all('img'):  # Cycle through all 'img' tags
        imgSrc = link.get('src')  # Extract the 'src' from those tags
        links.append(imgSrc)  # Append the source to 'links'
    return links  # Return 'links'

@app.route('/<site>')
def page(site):
    image = getImages("http://" + site)[0]  # Here I find the 1st image on the page
    if image[0] == "/":
        image = "http://" + site + image  # This creates a URL for the image
    return '<img src="%s">' % image  # Return the image in an HTML "img" tag

if __name__ == '__main__':
    app.run(debug=True, host="0.0.0.0")  # Run the Flask webserver
This hosts a web server on http://localhost:5000/
To input a site, do http://localhost:5000/yoursitehere, for example http://localhost:5000/www.google.com
Related
I'm creating a Python program that collects images from this website by Google.
The images on the website change after a certain number of seconds, and the image URL also changes with time. This change is handled by a script on the website. I have no idea how to get the image links from it.
I tried using BeautifulSoup and the requests library to get the image links from the site's HTML code:
import requests
from bs4 import BeautifulSoup

url = 'https://clients3.google.com/cast/chromecast/home'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
tags = soup('img')
for tag in tags:
    print(tag)
But the code only returns tags whose image src ("ng-src") is the '{{backgroundUrl}}' template placeholder.
For example:
<img class="S9aygc-AHe6Kc" id="picture-background" image-error-handler="" image-index="0" ng-if="backgroundUrl" ng-src="{{backgroundUrl}}"/>
How can I get the image links from a dynamically changing site? Can BeautifulSoup handle this? If not, what library will do the job?
import requests
import re

def main(url):
    r = requests.get(url)
    # the image URL sits in the page source in escaped form; grab it...
    match = re.search(r"(lh4\.googl.+?mv)", r.text).group(1)
    # ...then strip the backslash escapes and decode u003d back to '='
    match = match.replace("\\", "").replace("u003d", "=")
    print(match)

main("https://clients3.google.com/cast/chromecast/home")
Just a minor addition to the answer by αԋɱҽԃ αмєяιcαη (ahmed american), in case anyone is wondering:
The subdomain (lhx) in lhx.google.com is also dynamic. As a result, the link can be lh3 or lh4, et cetera.
This code fixes the problem:
import requests
import re
r = requests.get("https://clients3.google.com/cast/chromecast/home").text
match = re.search(r"(lh.\.googl.+?mv)", r).group(1)
match = match.replace('\\', '').replace("u003d", "=")
print(match)
The major difference is that the lh4 in the code by ahmed american has been replaced with "lh." so that all images can be collected no matter the url.
EDIT: If this chained line gives you trouble:
match = match.replace('\\', '').replace("u003d", "=")
replace it with two separate statements:
match = match.replace("\\", "")
match = match.replace("u003d", "=")
None of the provided answers worked for me. The issues may be related to using an older version of Python and/or the source page changing some things around.
Also, this version returns all matches instead of only the first.
Tested in Python 3.9.6.
import requests
import re

url = 'https://clients3.google.com/cast/chromecast/home'
r = requests.get(url)
for match in re.finditer(r"(ccp-lh\..+?mv)", r.text, re.S):
    image_link = 'https://%s' % (match.group(1).replace("\\", "").replace("u003d", "="))
    print(image_link)
I am currently working on HTML-scraping Baka-Updates.
However, the div class names are duplicated.
Since my goal is CSV or JSON output, I would like to use the information in [sCat] as the column name and store what is in [sContent].
Is there a way to scrape this kind of website?
Thanks.
Sample
https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)

#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')

print('sCat: ', sCat)
print('sContent: ', sContent)
I tried, but this is all I could come up with.
@Jasper Nichol M Fabella
I tried to edit your code and got the following output. Maybe it will help.
from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)

#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')

print('sCat: ', len(sCat))
print('sContent: ', len(sContent))

json_dict = {}
for i in range(0, len(sCat)):
    # print(''.join(i.itertext()))
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text

print(json_dict)
This printed the element counts and the assembled dictionary.
Hope it helps.
You can use XPath expressions to create an absolute path to what you want to scrape.
Here is an example with the requests and lxml libraries:
from lxml import html
import requests

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
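Since the stated goal is CSV or JSON, one possible final step (a sketch of my own, assuming the two lists line up one-to-one) is to zip them into a dict and dump it, continuing from the snippet above:

import json

# pair each category label with its content, assuming equal-length lists
record = dict(zip(sCat, sContent))
print(json.dumps(record, indent=2))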
What are you using to scrape?
If you are using BeautifulSoup, you can search for all content on the page with the find_all method and a class identifier, then iterate through the results. You can use the special class_ designator.
Something like:
import bs4

soup = bs4.BeautifulSoup(html_source, 'html.parser')  # html_source: the page's HTML text
divs = soup.find_all('div', class_='sCat')
# do the rest of your logic work here
Edit: I was typing on my mobile from a cached page before you made the edits, so I didn't see the changes. I see you are using the raw lxml library to parse. Yes, that's faster, but I'm not too familiar with it; I've only used raw lxml for one project. I think you can chain two search methods to distill something equivalent, as in the sketch below.
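Something like this in BeautifulSoup might be roughly equivalent to the lxml answer above (a sketch on my part, not tested against the live page):

import bs4
import requests

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = bs4.BeautifulSoup(r.text, 'html.parser')
# run two searches and pair the results, mirroring the lxml version
cats = [d.get_text(strip=True) for d in soup.find_all('div', class_='sCat')]
contents = [d.get_text(strip=True) for d in soup.find_all('div', class_='sContent')]
print(dict(zip(cats, contents)))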
Does anyone know how to use BeautifulSoup in Python?
I have this search engine with a list of different URLs.
I want to get only the HTML tag containing a video embed URL, and get the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the HTML tag of the video (or object), or the exact link of the video?
I need it to put in my iframe. I will integrate the Python with my PHP: get the video link with Python, echo it, and load it in my iframe.
You need to get the HTML of the page, not just the URL.
Use the built-in lib urllib like this:
import urllib
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")
Also, with the site you are using, I noticed that to get the embed link you can just change "details" in the link to "embed", so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do it for multiple URLs without having to parse, just do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
EDIT
To get the embed link for the other links you provided in your edit, you need to look through the HTML of the page you are parsing until you find the link, then get the tag it's in, then the attribute.
for
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
The iframe, because the link is in the iframe tag; the .get("src"), because the link is the src attribute.
You can try the next one yourself, because you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL. BeautifulSoup is used to parse an HTML page. Retrieve the page first:
import urllib2
data = urllib2.urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']
This is a one-liner to get all the downloadable MP4 files on that page, in case you need it:
import bs4, urllib2

url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
# tag.get() avoids a KeyError on <a> tags that have no href
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag.get('href', ''))]
print links
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; you can join them with the domain to get absolute paths.
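If it helps, here is a small sketch (my addition) of gluing them onto the domain with the standard library, continuing from the snippet above:

from urlparse import urljoin  # Python 2, matching the code above

base = 'https://archive.org'
absolute_links = [urljoin(base, link) for link in links]
print absolute_links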
My question is similar to the one asked here:
https://stackoverflow.com/questions/14599485/news-website-comment-analysis
I am trying to extract comments from any news article. E.g., I have a news URL here:
http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/
I am trying to use BeautifulSoup in Python to extract the comments. However, it seems the comment section is either embedded within an iframe or loaded through JavaScript. Viewing the source through Firebug does not reveal the source of the comments section, but explicitly viewing the source of the comments through the browser's view-source feature does. How do I go about extracting the comments, especially when they come from a different URL embedded within the news web page?
This is what I have done till now, although this is not much:
import urllib2
from bs4 import BeautifulSoup

opener = urllib2.build_opener()
url = 'http://www.cnn.com/2013/08/28/health/stem-cell-brain/index.html'
urlContent = opener.open(url).read()
soup = BeautifulSoup(urlContent)
title = soup.title.text
print title
body = soup.findAll('body')
outfile = open("brain.txt", "w+")
for i in body:
    i = i.text.encode('ascii', 'ignore')
    outfile.write(i + '\n')
Any help in what I need to do or how to go about it will be much appreciated.
It's inside an iframe. Check for a frame with id="dsq2".
The iframe has a src attribute, which is a link to the actual site that holds the comments.
So in Beautiful Soup: css_soup.select("#dsq2"), and get the URL from the src attribute. It will lead you to a page that has only comments.
To get the actual comments, after you fetch the page from src, you can use this CSS selector: .post-message p
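Putting those two steps together, a rough sketch (my own illustration; the selectors are from above, the article URL is the one in the question):

import urllib2
from bs4 import BeautifulSoup

article_url = 'http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/'
page = BeautifulSoup(urllib2.urlopen(article_url).read())
frame_src = page.select("#dsq2")[0]["src"]  # the iframe that holds the comments
comments = BeautifulSoup(urllib2.urlopen(frame_src).read())
for p in comments.select(".post-message p"):
    print p.get_text()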
And if you want to load more comments: when you click the more-comments button, it seems to send this:
http://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=1660715220&forum=cnn&order=popular&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
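That looks like a plain JSON endpoint, so a sketch (an untested guess on my part, and the field names are my assumption about the Disqus payload) would be to fetch it directly and page through by adjusting the cursor parameter:

import json
import urllib2

api_url = ('http://disqus.com/api/3.0/threads/listPostsThreaded'
           '?limit=50&thread=1660715220&forum=cnn&order=popular'
           '&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F')
data = json.load(urllib2.urlopen(api_url))
for post in data.get('response', []):
    print post.get('raw_message')  # assumed field holding the comment text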
This is my code to get a web page's image URLs.
For some web pages it works very well, while for others it doesn't work.
This is my code:
#!/usr/bin/python
import urllib2
import re

bufOne = urllib2.urlopen(r"http://vgirl.weibo.com/5show/user.php?fid=17262", timeout=4).read()
bufTwo = urllib2.urlopen(r"http://541626.com/pages/38307", timeout=4).read()
jpgOne = re.findall(r'http://[\w/]*?jpg', bufOne, re.IGNORECASE)
jpgTwo = re.findall(r'http://[\w/]*?jpg', bufTwo, re.IGNORECASE)
print jpgOne
print jpgTwo
bufOne works well, but bufTwo didn't. So how do I write a pattern that makes bufTwo work as well?
Don't use regex to parse HTML. Rather, use Beautiful Soup to find all img tags and then get the src attributes.
from BeautifulSoup import BeautifulSoup
#...
soup = BeautifulSoup(bufTwo)
imgTags = soup.findAll('img')
img = [tag['src'] for tag in imgTags]
I'll take this chance ddk gave to show you an easier way of getting all the images, using Beautiful Soup like this:
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(bufTwo)  # reuse the downloaded page source from above
all_imgs = soup.findAll("img", {"src": re.compile(r'http://[\w/]*?jpg')})
That will already give you a list of all the matching img tags; take tag['src'] from each to get the image URLs.