I want to download a couple songs off of http://www.youtube-mp3.org/. I'm using urllib2 and BeautifulSoup.
The problem is that when I open the site with urllib2 with my video ID plugged in, http://www.youtube-mp3.org/?c#v=lV7r8PiuecQ, I get the page, but they are tricky about it and load the info after the initial page load with some JS ajax stuff. So when I try to scrape the URL of the download link, it literally isn't on the page yet because it hasn't been loaded.
Does anyone know how I can trigger this JS loader from my Python script, or work around it?
Here is the relevant empty HTML, BEFORE the content that I want is loaded into it:
<div id="link_box" style="display:none">
<div id="link_box_title" style="font-weight:bold; text-decoration:underline">
</div>
<div class="row">
<div id="link_box_bb_code_title" style="font-weight:bold">
</div>
<input type="text" id="BBCodeLink" onclick="sAll(this)" />
</div>
<div class="row">
<div id="link_box_html_code_title" style="font-weight:bold">
</div>
<input type="text" id="HTMLLink" onclick="sAll(this)" />
</div>
<div class="row">
<div id="link_box_direct_code_title" style="font-weight:bold">
</div>
<input type="text" id="DirectLink" onclick="sAll(this)" />
</div>
</div>
<div id="v-ads">
</div>
<div id="dl_link">
</div>
<div id="progress">
</div>
<div id="loader">
<img src="ajax-loader-b.gif" alt="loading.." width="16" height="11" />
</div>
</div>
<div class="clear">
</div>
</div>
The API is JSON-based, so the contents of the HTML files won't give you any clue about where to find the files. A good idea when exploring web services like this one is to open the Network tab in Chrome's developer tools and watch what requests the page makes as you interact with it. That exercise showed me that two URLs in particular seem interesting:
http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DKMU0tzLwhbE&xy=trve&r=1314700829128
http://www.youtube-mp3.org/api/itemInfo/?video_id=KMU0tzLwhbE&adloc=&r=1314700829314
The first URL appears to queue a video for processing; the second gets the status of the processing job.
The second URL takes three GET parameters. The first, video_id, is the id of the video on youtube (http://www.youtube.com/watch?v=KMU0tzLwhbE), and the call returns the status of the decoding job. The other two (adloc and r) seem irrelevant for this purpose, which you can verify by loading the URL with and without the extra parameters.
The content of the page is:
info = { "title" : "Developers",
"image" : "http://i4.ytimg.com/vi/KMU0tzLwhbE/default.jpg",
"length" : "3", "status" : "serving", "progress_speed" : "",
"progress" : "", "ads" : "",
"h" : "a0aa17294103c638fa7f5e0606f839d3" };
Which happens to be JSON data assigned to a JavaScript variable. The interesting bit in it is "a0aa17294103c638fa7f5e0606f839d3", which looks like a hash that the web service uses to refer to the decoded mp3 file. Also check out how the download link on the front page looks:
http://www.youtube-mp3.org/get?video_id=KMU0tzLwhbE&h=a0aa17294103c638fa7f5e0606f839d3
Now we have all the pieces of the puzzle. First, we take the URL of a youtube video (http://www.youtube.com/watch?v=iKP7DZmqdbU), URL-quote it, and feed it to the API using this URL:
http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DiKP7DZmqdbU&xy=trve
Then we wait a few moments until the decoding job is done:
http://www.youtube-mp3.org/api/itemInfo/?video_id=iKP7DZmqdbU
Then we take the hash found in the info response to construct the download URL:
http://www.youtube-mp3.org/get?video_id=iKP7DZmqdbU&h=2e4b61b6ddc8bf83f5a0e4e4ee0635bb
Note that it is possible that the webmaster of the site does not want to be scraped and will take countermeasures if people start to (in the webmaster's eyes) abuse the site. For example, it seems to use referer protection, so clicking the links in this post won't work; you have to copy them and load them in a new browser window.
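If the referer check also covers the API calls, a hedged workaround is to send a Referer header yourself; here is a sketch using urllib2 (whether the site accepts its own front page as a valid referer is an assumption):
import urllib2

req = urllib2.Request('http://www.youtube-mp3.org/api/itemInfo/?video_id=iKP7DZmqdbU')
# Assumption: the site's own front page passes the referer check.
req.add_header('Referer', 'http://www.youtube-mp3.org/')
print urllib2.urlopen(req).read()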
Test code:
from re import findall
from time import sleep
from urllib import urlopen, quote

yt_code = 'gijypDkEqUA'
yt_url = 'http://www.youtube.com/watch?v=%s' % yt_code

push_url_fmt = 'http://www.youtube-mp3.org/api/pushItem/?item=%s&xy=trve'
info_url_fmt = 'http://www.youtube-mp3.org/api/itemInfo/?video_id=%s'
download_url_fmt = 'http://www.youtube-mp3.org/get?video_id=%s&h=%s'

# Queue the video for decoding.
push_url = push_url_fmt % quote(yt_url)
data = urlopen(push_url).read()

# Crude wait for the decoding job to finish; polling the info url until
# "status" is "serving" would be more robust.
sleep(10)

# Fetch the job info and pull the hash out of the JSON-ish response.
info_url = info_url_fmt % yt_code
data = urlopen(info_url).read()
res = findall('"h" : "([^"]*)"', data)

# Construct the final download url from the video id and the hash.
download_url = download_url_fmt % (yt_code, res[0])
print 'Download here:', download_url
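From there, a final hedged step is to fetch the mp3 itself, for example with urllib.urlretrieve (assuming the /get URL serves the file directly and makes no further checks):
from urllib import urlretrieve

# Assumption: the download url streams the mp3 without further checks.
urlretrieve(download_url, '%s.mp3' % yt_code)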
You could use Selenium to interact with the JS stuff and then combine it with BeautifulSoup, or do everything with Selenium, just as you prefer.
http://seleniumhq.org/
Selenium is a tool for browser automation and has bindings for a few languages, including Python. It takes a running instance of Firefox/IE/Chrome and lets you script it (I suggest using the Selenium WebDriver for this simple problem, not the whole Selenium server).
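A minimal sketch of that approach; the #dl_link selector comes from the question's HTML, but the assumption that the final link lands there as an <a> tag, and the 10-second wait, are guesses:
from time import sleep
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.youtube-mp3.org/?c#v=lV7r8PiuecQ')

# Crude wait for the ajax call to fill in the page; polling would be more robust.
sleep(10)

# Assumption: the generated download link ends up as an <a> inside #dl_link.
link = driver.find_element_by_css_selector('#dl_link a')
print link.get_attribute('href')

driver.quit()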
You're going to have to work through http://www.youtube-mp3.org/client.js and figure out the exact information that is being passed around. That would let you post a request, parse the response, and download from the correct scraped URL.
Related
I'm a beginner in programming altogether and am working on a project of mine. For it, I'm trying to parse data from a website to build a tool that uses the data. I found that BeautifulSoup and Requests are common tools to do this, but unfortunately I cannot seem to make it work. It always returns the value None, or an error that says:
"TypeError: 'NoneType' object is not callable"
Did I do anything wrong? Is it maybe not possible to parse some websites' data, and am I being restricted access or something?
If there are other ways to access the data, I'm happy to hear them as well.
Here is my code:
from bs4 import BeautifulSoup
import requests
pickrates = {} # dict to store winrate of champions for each position
source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
soup = BeautifulSoup(source, "lxml")
value = soup.find("div", class_="content-section champion-ranking-stats")
print(value.prettify())
Remember that when you request a webpage with the requests module, you only get the HTML of that page; the module is not capable of rendering JavaScript.
Try this code:
import requests
source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
print(source)
Then search by hand (Ctrl + F) for the class names you provided: there are no such elements at all. That means they are generated by other requests, such as ajax calls, after the initial HTML page is loaded. So before BeautifulSoup even comes into the picture, you can't get them, not even in the .text attribute of the response object.
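You can verify this from Python as well; a quick programmatic version of the Ctrl + F check, using the class name from the question:
import requests

source = requests.get("http://u.gg/lol/champions/aatrox/build?role=top").text
# Prints False: the class from the question never appears in the raw HTML.
print("champion-ranking-stats" in source)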
One way of handling this is to use Selenium, or any other library that handles the JS.
As in this question (can't find html tag when I scrape web using beautifulsoup), the problem is likely caused by the content being generated with JavaScript. I would suggest you use Selenium to handle this issue: let Selenium send the request and get back the page source, then use BeautifulSoup to parse it.
Don't forget to download a browser driver from https://www.selenium.dev/documentation/getting_started/installing_browser_drivers/ and place it in the same directory as your code.
The example code below uses Selenium with Firefox:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

URL = 'http://u.gg/lol/champions/aatrox/build?role=top'

browser = webdriver.Firefox()
browser.get(URL)

# Give the JavaScript a moment to render the stats before grabbing the source.
time.sleep(1)
soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()

value = soup.find("div", class_="content-section champion-ranking-stats")
print(value.prettify())
Your expected output would look like:
>>> print(value.prettify())
<div class="content-section champion-ranking-stats">
<div class="win-rate meh-tier">
<div class="value">
48.4%
</div>
<div class="label">
Win Rate
</div>
</div>
<div class="overall-rank">
<div class="value">
49 / 58
</div>
<div class="label">
Rank
</div>
</div>
<div class="pick-rate">
<div class="value">
3.6%
</div>
<div class="label">
Pick Rate
</div>
</div>
<div class="ban-rate">
<div class="value">
2.3%
</div>
<div class="label">
Ban Rate
</div>
</div>
<div class="matches">
<div class="value">
55,432
</div>
<div class="label">
Matches
</div>
</div>
</div>
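A fixed time.sleep is fragile; here is a hedged alternative sketch using Selenium's explicit waits (the class name is taken from the question, and the 10-second timeout is an arbitrary choice):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Firefox()
browser.get('http://u.gg/lol/champions/aatrox/build?role=top')

# Block until the stats div has actually been rendered (up to 10 seconds).
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'champion-ranking-stats')))

soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.close()
print(soup.find("div", class_="content-section champion-ranking-stats").prettify())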
I am trying to scrape an Instagram page, and I want to get/access the div-tags present inside a span-tag, but I can't! The HTML of the Instagram page looks like this:
<head>--</head>
<body>
<span id="react-root" aria-hidden="false">
<form enctype="multipart/form-data" method="POST" role="presentation">…</form>
<section class="_9eogI E3X2T">
<main class="SCxLW o64aR" role="main">
<div class="v9tJq VfzDr">
<header class=" HVbuG">…</header>
<div class="_4bSq7">…</div>
<div class="fx7hk">…</div>
</div>
</main>
</section>
</body>
I do it as:
from bs4 import BeautifulSoup
import urllib.request as urllib2
html_page = urllib2.urlopen("https://www.instagram.com/cherrified_/?hl=en")
soup = BeautifulSoup(html_page,"lxml")
span_tag = soup.find('span') # returns the span-tag correctly
span_tag.find_all('div') # returns an empty list, why?
Please also specify an example.
Instagram is a Single Page Application powered by React, which means its source is just a simple "empty" page that loads JavaScript to dynamically generate the content in the browser after downloading.
Click "View source" or go to view-source:https://www.instagram.com/cherrified_/?hl=en in Chrome. This is the HTML you download with urllib.request.
You can see that there is a single <span> tag, which does not include a <div> tag. (Note: <div> inside a <span> is not allowed).
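You can confirm this from Python, too; a small check on the downloaded source, reusing the question's own fetch (results depend on what Instagram serves to non-browser clients):
import urllib.request as urllib2

from bs4 import BeautifulSoup

html_page = urllib2.urlopen("https://www.instagram.com/cherrified_/?hl=en")
soup = BeautifulSoup(html_page, "lxml")

# Per the above: one span in the raw (pre-JavaScript) HTML, no divs inside it.
print(len(soup.find_all('span')))
print(len(soup.find('span').find_all('div')))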
Scraping instagram.com this way is not possible. It also might not be legal (I am not a lawyer).
Notes:
your HTML code example doesn't include a closing tag for <span>.
your HTML code example doesn't match the link you provide in the python snippet.
in the last line of the python snippet you probably meant span_tag.find_all('div') (note the variable name and the singular 'div').
I am trying to get some data from a wine website.
But I cannot access the data; I get a usage violation message instead.
The url: https://www.wine-searcher.com/find/drc/2013
The prettify() result is something like this:
<div id="bodycontainer">
<div class="colmask contentparent">
<div id="colheader">
<div class="colmask articlecontainer">
<div class="colmidtemp3">
<div class="collefttemp3">
<div class="col1wraptemp3">
<div class="col1temp3">
<div>
<h1 style="margin:50px 0 0">
Usage Violation
</h1>
<div style="margin-bottom:50px;padding:50px 10px;background-color:#FFFACD">
<h2 style="font-size:1.4em">
Blocked
</h2>
<p style="font-size:1.2em">
The IP Address [xx.xxx.xxx.xx] you are using has been used in violation of Wine-Searcher's usage guidelines.
<b>
If you think you have received this message in error restart your web browser and retry accessing wine-searcher.com.
</b>
</p>
<p style="font-size:1.2em">
To re-gain access to Wine-Searcher please
<a href="mailto:wsexcessiveuse#wine-searcher.com?subject=Blocked IP=1 ID=PVBXC7PJCM80025">
Contact Us
</a>
.
</p>
</div>
</div>
</div>
</div>
Are there any possible ways to get the data from the url? Thank you so much.
My code is here:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

# Output file for the prettified HTML.
name = "Wine.txt"
k = open(name, "w", encoding='utf-8')

Stat_url = "https://www.wine-searcher.com/find/drc/2012"
page = requests.get(Stat_url)

soup = BeautifulSoup(page.text, 'lxml')
k.write(soup.prettify())
k.close()
It looks like they added some protection to their page to prevent exactly what you are trying to do ;)
They sell access via an API (https://www.wine-searcher.com/api.lml) and the trial allows 100 calls in 5 days. Maybe this is enough for you?
I would guess that your script is making too many requests in too little time. (Maybe limit it to one per 10 seconds and let it run overnight?)
Can the user agent be changed to something more common, like a regular browser's user agent?
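Yes, requests lets you set arbitrary headers. Here is a hedged sketch combining a browser-like User-Agent with a 10-second throttle (whether this is enough to avoid the block is an assumption, and it may be against the site's terms):
import time
import requests

# A browser-like User-Agent string; any current browser's string would do.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

urls = ["https://www.wine-searcher.com/find/drc/2012",
        "https://www.wine-searcher.com/find/drc/2013"]

for url in urls:
    page = requests.get(url, headers=headers)
    print(page.status_code)
    time.sleep(10)  # throttle: at most one request per 10 seconds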
So I'm pretty new to this, and I haven't been able to find anything on Google about this question.
I'm using requests and lxml with Python. I've seen that there are a lot of different modules for web scraping, but is there any reason to choose one over another? Can you do the same stuff with requests/lxml as you can with, for example, BeautifulSoup?
Anyway, here's my actual question.
This is my code:
import requests
from lxml import html

# Login data
inputUrl = 'http://forum.mytestsite.com/login'
usr = 'myusername'
pwd = 'mypassword'
payload = dict(login=usr, password=pwd)

# Open session
with requests.Session() as s:
    # Login
    s.post(inputUrl, data=payload)
    # Get page data
    pageResult = s.get('http://forum.mytestsite.com/icons/', allow_redirects=False)
    pageResult = html.fromstring(pageResult.content)
    pageIcons = pageResult.xpath('//script[@id="table-icons"]/text()')
    print pageIcons[0]
The result when printing pageIcons[0]:
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
This is the website/js code that generates the icons:
<script id="table-icons" type="text/x-handlebars-template">
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
</script>
And here's the result on the page:
<ul id="icons">
<li data-handle="558FSTBI" class="">
<img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
</li>
<li data-handle="310AYTZI">
<img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
</li>
<li data-handle="669PQXBI" class="">
<img src="http://testsite.com/icons/669PQXBI.1.png" alt="Icon 3" title="Icon 3">
</li>
</ul>
My goal:
What I would like to do is retrieve all of the li data-handles, but I haven't been able to figure out how to get at this data. So my goal is to retrieve all of the icon paths and their titles. Could anyone help me out here? I'd really appreciate any help :)
You aren't parsing the li or ul elements.
Start with this XPath:
//ul[@id='icons']/li/img
And from those elements, you can extract the individual information.
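A hedged sketch of that extraction, assuming you already have the rendered HTML in a string (as noted below, the raw response only contains the Handlebars template, so you'd need a headless browser to obtain the rendered markup):
from lxml import html

# rendered_html: the browser-rendered page source (e.g. from a headless
# browser); the raw requests response only contains the template.
tree = html.fromstring(rendered_html)
for li in tree.xpath("//ul[@id='icons']/li"):
    img = li.xpath('./img')[0]
    print li.get('data-handle'), img.get('src'), img.get('title')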
Regarding the first question: BeautifulSoup optionally uses lxml as its parser. If you don't think you need it and are comfortable with XPath, don't worry about it.
However, since it's JavaScript that generates the page content, you need a headless browser rather than the requests library.
Get page generated with Javascript in Python
Reading dynamically generated web pages using python
This is the link where I am trying to fetch data: flipkart
And this is the relevant part of the markup:
<div class="toolbar-wrap line section">
<div class="ratings-reviews-wrap">
<div itemprop="aggregateRating" itemscope="" itemtype="http://schema.org/AggregateRating" class="ratings-reviews line omniture-field">
<div class="ratings">
<meta itemprop="ratingValue" content="1">
<div class="fk-stars" title="1 stars">
<span class="unfilled">★★★★★</span>
<span class="rating filled" style="width:20%">
★★★★★
</span>
</div>
<div class="count">
<span itemprop="ratingCount">2</span>
</div>
</div>
</div>
</div>
</div>
Here I have to fetch "1 stars" from title="1 stars" and 2 from <span itemprop="ratingCount">2</span>.
I tried the following code:
x = link_soup.find_all("div",class_='fk-stars')[0].get('title')
print x, " product_star"
y = link_soup.find_all("span",itemprop="ratingCount")[0].string.strip()
print y
but it gives:
IndexError: list index out of range
The content that you see in the browser is not actually present in the raw HTML that is retrieved from this URL.
When loaded in a browser, the page executes AJAX calls to load additional content, which is then dynamically inserted into the page. One of those calls gets the ratings info that you are after; specifically, one URL returns the HTML that is inserted as the "action bar" (its shape is described below).
But if you retrieve the main page using Python, e.g. with requests, urllib, et al., the dynamic content is not loaded, and that is why BeautifulSoup can't find the tags.
You could analyse the main page to find the actual link, retrieve that, and then run it through BeautifulSoup. The link looks like it begins with /p/pv1/spotList1/spot1/actionBar, so that prefix, or perhaps just actionBar, should be sufficient to locate it.
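A hedged sketch of that first approach; the actionBar prefix comes from the paragraph above, but where exactly the URL sits in the main page's source is an assumption:
import re
import requests

# Placeholder for the flipkart product page URL from the question.
product_url = 'http://www.flipkart.com/...'
main_page = requests.get(product_url).text

# Assumption: the action-bar ajax URL appears verbatim in the page source.
match = re.search(r'["\'](/p/pv1/spotList1/spot1/actionBar[^"\']*)["\']', main_page)
if match:
    action_bar_url = 'http://www.flipkart.com' + match.group(1)
    action_bar_html = requests.get(action_bar_url).text
    # action_bar_html should now contain the fk-stars and ratingCount markup.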
Or you could use Selenium to load the page and then grab and process the rendered HTML.
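A minimal sketch of the Selenium route, reusing the lookups from the question (the fixed sleep is a crude assumption about load time, and product_url is again a placeholder for the product page):
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get(product_url)  # placeholder: the flipkart product page
sleep(5)  # crude wait for the ajax-inserted action bar

link_soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# The same lookups as in the question, now run against the rendered HTML.
x = link_soup.find_all("div", class_='fk-stars')[0].get('title')
y = link_soup.find_all("span", itemprop="ratingCount")[0].string.strip()
print x, " product_star"
print y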