This is my code:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read())
rows = soup.findAll("table", attrs = {'class': 'data borderTop'})[0].tbody.findAll("tr")[2:]
for row in rows:
    fields = row.findAll("td")
    if len(fields) >= 3:
        anchor = row.findAll("td")[1].find("a")
        if anchor:
            print anchor
Instead of printing out an image, it gives me where the image is in the page source. Any reasons as to why?
According to BeautifulSoup documentation, soup.findAll returns a list of tags or NavigableStrings.
So you have to use the navigation attributes and methods, such as .contents, to get at what is inside a tag.
See the "Navigating the Parse Tree" section of http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html to find what you need in this case.
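For example, a minimal sketch built on the question's code (assuming, as the printed output suggests, that the logo <img> sits inside the anchor):
for row in rows:
    fields = row.findAll("td")
    if len(fields) >= 3:
        anchor = fields[1].find("a")
        if anchor:
            img = anchor.find("img")
            if img:
                # a tag behaves like a dictionary for its attributes
                print(img["src"])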
It looks like you want the team logo thumbnails?
import urllib2
import BeautifulSoup
url = 'http://www.cbssports.com/nba/draft/mock-draft'
txt = urllib2.urlopen(url).read()
bs = BeautifulSoup.BeautifulSoup(txt)
# get the main table
t = bs.findAll('table', attrs={'class': 'data borderTop'})[0]
# get the thumbnail urls
imgs = [im["src"] for im in t.findAll('img') if "logos" in im["src"]]
imgs now looks like
[u'http://sports.cbsimg.net/images/nba/logos/30x30/NO.png',
u'http://sports.cbsimg.net/images/nba/logos/30x30/CHA.png',
u'http://sports.cbsimg.net/images/nba/logos/30x30/WAS.png',
u'http://sports.cbsimg.net/images/nba/logos/30x30/CLE.png',
etc. These are the file locations for each logo, which is all the HTML actually contains; if you want the actual pictures, you have to get each one separately.
The list contains duplicate references to each logo; the quickest way to remove duplicates is
imgs = list(set(imgs))
Alternatively, the list does not include every team; if you had a full list of team name contractions, you could build the logo-url list directly.
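For example, a sketch with a hypothetical, incomplete list of abbreviations (the names are taken from the URLs above; you would need to fill in all 30 teams yourself):
# hypothetical, incomplete list of team abbreviations
teams = ['BOS', 'CHA', 'CLE', 'DAL', 'NO', 'WAS']
base = 'http://sports.cbsimg.net/images/nba/logos/30x30/%s.png'
imgs = [base % team for team in teams]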
Also, looking at the site, each 30x30 logo has a corresponding 90x90 logo which you might prefer - much larger and clearer. If so,
imgs = [im.replace('30x30', '90x90') for im in imgs]
imgs now looks like
[u'http://sports.cbsimg.net/images/nba/logos/90x90/BOS.png',
u'http://sports.cbsimg.net/images/nba/logos/90x90/CHA.png',
u'http://sports.cbsimg.net/images/nba/logos/90x90/CLE.png',
u'http://sports.cbsimg.net/images/nba/logos/90x90/DAL.png',
etc.
Now, for each url, we download the image and save it:
import os
savedir = 'c:\\my documents\\logos' # assumes this dir actually exists!
for im in imgs:
    fname = im.rsplit('/', 1)[1]
    fname = os.path.join(savedir, fname)
    with open(fname, 'wb') as outf:
        outf.write(urllib2.urlopen(im).read())
and you have your logos.
Related
I have the following link, which represents the exact graph I want to scrape: https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1
I simply can't understand whether it is an XML or an SVG graph, or how to scrape the data. I think I need to use bs4 and requests, but I don't know how.
Could anyone help?
You will load HTML like this:
import requests
url = "https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
resp = requests.get(url)
html = resp.text
Then you will create a BeautifulSoup object with this HTML.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, features="html.parser")
After this, it is usually very subjective how to parse out what you want. The candidate codes may vary a lot. This is how I did it:
Using BeautifulSoup, I parsed all "rect"s and checked whether "onmouseover" exists in each rect.
rects = soup.svg.find_all("rect")
yx_points = []
for rect in rects:
    if rect.has_attr("onmouseover"):
        text = rect["onmouseover"]
        x_start_index = text.index("'") + 1
        y_finish_index = text[x_start_index:].index("'") + x_start_index
        yx = text[x_start_index:y_finish_index].split()
        print(text[x_start_index:y_finish_index])
        yx_points.append(yx)
I scraped the onmouseover= part and extracted those 02.2015 155,1 parts.
This is what yx_points looks like now:
[['12.2009', '100,0'], ['01.2010', '101,8'], ['02.2010', '103,7'], ...]
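If you want to compute or plot with these values, you could convert the comma-decimal strings to floats; a small sketch, assuming every pair has the 'MM.YYYY value' shape shown above:
# the values use a comma as the decimal separator, so swap it for a dot
points = [(date, float(value.replace(',', '.'))) for date, value in yx_points]
print(points[:3])  # [('12.2009', 100.0), ('01.2010', 101.8), ('02.2010', 103.7)]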
from bs4 import BeautifulSoup
import requests
import re
#First get all the text from the url.
url="https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
response = requests.get(url)
html = response.text
#Find all the tags in which the data is stored.
soup = BeautifulSoup(html, 'lxml')
texts = soup.findAll("rect")
final = []
for each in texts:
    names = each.get('onmouseover')
    try:
        q = re.findall(r"'(.*?)'", names)
        final.append(q[0])
    except Exception as e:
        print(e)
#The details are appended to the final variable
I am trying to download some images from NASS Case Viewer. An example of a case is
https://www-nass.nhtsa.dot.gov/nass/cds/CaseForm.aspx?xsl=main.xsl&CaseID=149006692
The link to the image viewer for this case is
https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?ImageView&ImageID=497001669&Desc=FRONT&Title=Vehicle+1+-+Front&Version=1&Extend=jpg
which may not be viewable, I assume because of the https. However, this is simply the second "Front" image.
The actual link to the image is (or should be?)
https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?Image&ImageID=497001669&CaseID=149006692&Version=1
This will simply download aspx binaries.
My problem is that I do not know how to store these binaries to proper jpg files.
Example of code I've tried is
import requests
test_image = "https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?Image&ImageID=497001669&CaseID=149006692&Version=1"
pull_image = requests.get(test_image)
with open("test_image.jpg", "wb+") as myfile:
    myfile.write(str.encode(pull_image.text))
But this does not result in a proper jpg file. I've also inspected pull_image.raw.read() and saw that it's empty.
What could be the issue here? Are my URL's improper? I've used Beautifulsoup to put these URLs together and reviewed them by inspecting the HTML code from a few pages.
Am I saving the binaries incorrectly?
.text decodes the response content to a string, so your image file will be corrupted.
Instead you should use .content which holds the binary response content.
import requests
test_image = "https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?Image&ImageID=497001669&CaseID=149006692&Version=1"
pull_image = requests.get(test_image)
with open("test_image.jpg", "wb+") as myfile:
    myfile.write(pull_image.content)
.raw.read() also returns bytes, but in order to use it you must set the stream parameter to True.
pull_image = requests.get(test_image, stream=True)
with open("test_image.jpg", "wb+") as myfile:
    myfile.write(pull_image.raw.read())
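For large images you may also prefer iter_content, which streams the body in chunks instead of loading it all into memory at once; a sketch:
pull_image = requests.get(test_image, stream=True)
with open("test_image.jpg", "wb+") as myfile:
    # write the response body chunk by chunk
    for chunk in pull_image.iter_content(chunk_size=8192):
        myfile.write(chunk)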
I wanted to follow up on @t.m.adam's answer to provide a complete answer for anyone who is interested in using this data for their own projects.
Here is my code to pull all images for a sample of Case IDs. It's fairly rough code, but I think it gives you what you may need to get started.
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
CaseIDs = [149006673, 149006651, 149006672, 149006673, 149006692, 149006693]
url_part1 = 'https://www-nass.nhtsa.dot.gov/nass/cds/'
data = []
with requests.Session() as sesh:
    for caseid in tqdm(CaseIDs):
        url_full = f"https://www-nass.nhtsa.dot.gov/nass/cds/CaseForm.aspx?ViewText&CaseID={caseid}&xsl=textonly.xsl&websrc=true"
        #print(url_full)
        source = sesh.get(url_full).text
        soup = BeautifulSoup(source, 'lxml')
        tr_tags = soup.find_all('tr', style="page-break-after: always")
        for tag in tr_tags:
            #print(tag)
            """
            try:
                vehicle = [x for x in tag.text.split('\n') if 'Vehicle' in x][0]  ## return the first element
            except IndexError:
                vehicle = [x for x in tag.text.split('\n') if 'Scene' in x][0]  ## return the first element
            """
            tag_list = tag.find_all('tr', class_='label')
            test = [x.find('td').text for x in tag_list]
            #print(test)
            img_id, img_type, part_name = test
            img_id = img_id.replace(":", "")
            img = tag.find('img')
            #part_name = img.get('alt').replace(":", "").replace("/", "")
            part_name = part_name.replace(":", "").replace("/", "")
            image_name = " ".join([img_type, part_name, img_id]) + ".jpg"
            url_src = img.get('src')
            img_url = url_part1 + url_src
            print(img_url)
            pull_image = sesh.get(img_url, stream=True)
            with open(image_name, "wb+") as myfile:
                myfile.write(pull_image.content)
I have a textfile containing the URLs of some RSS feeds. I would like to find out which of these URLs has a title or description (or any other tag) containing certain strings (a list of words).
As of now, I am able to get the URL, title and headline (and whatever else). Not really sure how to proceed though. I guess I would check the tags with regex. If I checked a URL's title and found a word match, how would I then retrieve the URL again? The URL needs to be connected to the tags, like in a .csv. Bit confused here. Maybe someone can point me in the right direction?
My path so far:
import requests
from bs4 import BeautifulSoup
rssfeed = open('input.txt')
rss_source = rssfeed.read()
rss_sources = rss_source.split()
i=0
while i<len(rss_sources):
    get_rss = requests.get(rss_sources[i])
    rss_soup = BeautifulSoup(get_rss.text, 'html.parser')
    rss_urls = rss_soup.find_all('link')
    i = i + 1

for url in rss_urls:
    rss_all_urls = url.text
    open_urls = requests.get(rss_all_urls)
    target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
    urls_titles = target_urls_soup.title
    urls_headlines = target_urls_soup.h1
    print(rss_all_urls, urls_titles, urls_headlines)
So you want to have an array of URLs.
That array should contain certain URLs based on some conditions:
- if the title of that URL matches one of the strings contained in an array
So first you need your arrays:
titlesToMatch = ['title1', 'title2', 'title3']
finalArrayWithURLs = []
Then, when you have your rss_all_urls, urls_titles and urls_headlines for a URL, you include in finalArrayWithURLs just those URLs whose title matches one of the titles in titlesToMatch:
for url in rss_urls:
    rss_all_urls = url.text
    open_urls = requests.get(rss_all_urls)
    target_urls_soup = BeautifulSoup(open_urls.text, 'html.parser')
    urls_titles = target_urls_soup.title
    urls_headlines = target_urls_soup.h1
    if urls_titles and any(item in urls_titles.text for item in titlesToMatch):
        finalArrayWithURLs.append(rss_all_urls)
So after that, finalArrayWithURLs will hold just those URLs whose title matches one of the titles in your titlesToMatch array.
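Since you mentioned connecting each URL to its tags like in a .csv, here is a minimal sketch for writing the matches out; it assumes you appended (rss_all_urls, urls_titles.text) tuples above instead of just the URL:
import csv

# write each matched URL together with its title
with open('matches.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title'])
    for matched_url, matched_title in finalArrayWithURLs:
        writer.writerow([matched_url, matched_title])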
I'm using BeautifulSoup to get a HTML page from IMDb, and I would like to extract the poster image from the page. I've got the image based on one of the attributes, but I don't know how to extract the data inside it.
Here's my code:
url = 'http://www.imdb.com/title/tt%s/' % (id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for src in soup.find(itemprop="image"):
print("inside FOR")
print(link.get('src'))
You're almost there - just a couple of mistakes. soup.find() gets the first element that matches, not a list, so you don't need to iterate over it. Once you have got the element, you can get its attributes (like src) using dictionary access. Here's a reworked version:
film_id = '0423409'
url = 'http://www.imdb.com/title/tt%s/' % (film_id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
link = soup.find(itemprop="image")
print(link["src"])
# output:
http://ia.media-imdb.com/images/M/MV5BMTg2ODMwNTY3NV5BMl5BanBnXkFtZTcwMzczNjEzMQ##._V1_SY317_CR0,0,214,317_.jpg
I've changed id to film_id, because id() is a built-in function, and it's bad practice to mask those.
I believe your example is very close. You need to use findAll() instead of find(), and your loop variable has to match what you use inside the loop (you iterate over src but then reference link). In the example below I renamed it to tag.
This code is working for me with BeautifulSoup4:
url = 'http://www.imdb.com/title/tt%s/' % (id,)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print "before FOR"
for tag in soup.findAll(itemprop="image"):
print "inside FOR"
print(tag['src'])
If I understand correctly, you are looking for the src of the image, in order to extract it afterwards.
First you need to find (using the inspector) where the image sits in the HTML. For example, in my particular case, where I was scraping soccer team shields, I needed:
from urllib.request import urlopen as uOpen  # assuming uOpen aliases urlopen
from bs4 import BeautifulSoup as BS

m_url = 'http://www.marca.com/futbol/primera/equipos.html'
client = uOpen(m_url)
page = client.read()
client.close()
page_soup = BS(page, 'html.parser')
teams = page_soup.findAll('li', {'id': 'nombreEquipo'})
for team in teams:
    name = team.h2.text
    shield_url = team.img['src']
Then you need to process the image. You have two options.
1st: using numpy and OpenCV:
import numpy as np
import cv2

def url_to_image(url):
    '''
    Extract an image from a URL.
    '''
    resp = uOpen(url)
    image = np.asarray(bytearray(resp.read()), dtype='uint8')
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

shield = url_to_image(shield_url)
2nd: using the scikit-image library (which you will probably need to install):
from skimage import io

shield = io.imread('http:' + shield_url)
Note: just in this particular example I needed to add http: at the beginning.
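Either way, once you have the decoded array you can write it back to disk; a short sketch using OpenCV (imwrite picks the format from the file extension):
import cv2

# save the decoded image array back to disk; the .png extension
# selects the format (note: cv2 assumes BGR channel order, which
# is what url_to_image above produces)
cv2.imwrite('shield.png', shield)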
Hope it helps!
Here's a full working example with gazpacho:
Step 1 - import everything and download the html:
from pathlib import Path
from urllib.request import urlretrieve as download
from gazpacho import Soup
id = 'tt5057054'
url = f"https://www.imdb.com/title/{id}"
soup = Soup.get(url)
Step 2 - find the src url for the image asset:
image = (soup
    .find("div", {"id": "title-overview"})
    .find("div", {"class": "poster"})
    .find("img")
    .attrs['src']
)
Step 3 - save it to your machine:
directory = "images"
Path(directory).mkdir(exist_ok=True)
extension = image.split('.')[-1]
download(image, f"{directory}/{id}.{extension}")
So I wanted to get all of the pictures on this page(of the nba teams).
http://www.cbssports.com/nba/draft/mock-draft
However, my code gives a bit more than that. It gives me,
<img src="http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png" alt="Orlando Magic" width="30" height="30" border="0" />
How can I shorten it to only give me, http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png.
My code:
import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read())
rows = soup.findAll("table", attrs = {'class': 'data borderTop'})[0].tbody.findAll("tr")[2:]
for row in rows:
    fields = row.findAll("td")
    if len(fields) >= 3:
        anchor = row.findAll("td")[1].find("a")
        if anchor:
            print anchor
I know this can be "traumatic", but for those automatically generated pages, where you just want to grab the damn images away and never come back, a quick-n-dirty regular expression that takes the desired pattern tends to be my choice (no Beautiful Soup dependency is a great advantage):
import urllib, re
source = urllib.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read()
## every image name is an abbreviation composed of capital letters, so...
## the loop below just prints each link; if you want to actually
## download, set the flag to True
actually_download = False
for link in re.findall(r'http://sports\.cbsimg\.net/images/nba/logos/30x30/[A-Z]*\.png', source):
    print link
    if actually_download:
        filename = link.split('/')[-1]
        urllib.urlretrieve(link, filename)
Hope this helps!
To save all the images on http://www.cbssports.com/nba/draft/mock-draft,
import urllib2
import os
from BeautifulSoup import BeautifulSoup
URL = "http://www.cbssports.com/nba/draft/mock-draft"
default_dir = os.path.join(os.path.expanduser("~"),"Pictures")
opener = urllib2.build_opener()
urllib2.install_opener(opener)
soup = BeautifulSoup(urllib2.urlopen(URL).read())
imgs = soup.findAll("img",{"alt":True, "src":True})
for img in imgs:
    img_url = img["src"]
    filename = os.path.join(default_dir, img_url.split("/")[-1])
    img_data = opener.open(img_url)
    f = open(filename, "wb")
    f.write(img_data.read())
    f.close()
To save any particular image on http://www.cbssports.com/nba/draft/mock-draft,
use
soup.find("img",{"src":"image_name_from_source"})
You can use these functions to get the list of all image URLs from a URL.
import re
import requests

#
# get_url_images_in_text()
#
# @param html - the html to extract image urls from.
# @param protocol - the protocol of the website, to prepend to urls that do not start with a protocol.
#
# @return list of image urls.
#
def get_url_images_in_text(html, protocol):
    urls = []
    all_urls = re.findall(r'((http\:|https\:)?\/\/[^"\' ]*?\.(png|jpg))', html, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
    for url in all_urls:
        if not url[0].startswith("http"):
            urls.append(protocol + url[0])
        else:
            urls.append(url[0])
    return urls
#
# get_images_from_url()
#
# @param url - the url to extract image urls from.
#
# @return list of image urls.
#
def get_images_from_url(url):
    protocol = url.split('/')[0]
    resp = requests.get(url)
    return get_url_images_in_text(resp.text, protocol)
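Usage might look like this, with the mock-draft page from the earlier question as the target:
# print every image url found on the page
for image_url in get_images_from_url('http://www.cbssports.com/nba/draft/mock-draft'):
    print(image_url)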