Identify Webpage - python

Hi, I am trying to parse a webpage in Python. The page is in a restricted area, so I cannot share the link. On this page you can run queries, and the results are published in a table that is added to the same page, but with a new URL. When I parse the page I get everything except the table.
I have noticed that no matter what my query is, the URL is always the same. So my parser always returns the same result: the page without the query result (the table). But if I inspect the page in Chrome, the table and its contents are included in the HTML. My parser looks like this:
import urllib.request
with urllib.request.urlopen("http://www.home_page.com") as url:
    s = url.read()
    # I'm guessing this would output the html source code?
    print(s)
So my question is: is there some other way to identify the webpage so that I receive everything that is published on it?

Well, based on your question, I think you are looking for web scraping techniques. Here is what I suggest.
You could use regular expressions to extract data that follows a specific pattern. For example:
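import re
import urllib.request
# Download the page and decode the bytes into text (Python 3)
site_content = urllib.request.urlopen("http://example.com").read().decode("utf-8")
# Match each <b> tag together with the letters, digits and spaces that follow it
bold_words = re.findall(r"<b>[\w\d ]+", site_content)
print("Bold words are:")
print(bold_words)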
In this case you will need to learn more about regex (regular expressions) and write your own pattern.
In some cases you may have to deal with the client side, for example when you have to submit queries through pop-up pages created by JavaScript or dismiss a JavaScript alert. You then need a browser automation API; Selenium can handle this kind of issue.
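The regular-expression approach above can be tried offline before pointing it at a real site. This is a minimal sketch, not a general HTML parser: SAMPLE_HTML and the helper name text_inside_tag are made up for illustration, and the pattern assumes the tag is not nested inside itself.

```python
import re

# Hypothetical snippet standing in for a downloaded page.
SAMPLE_HTML = "<p>Total: <b>42 rows</b> in <b>3 tables</b></p>"


def text_inside_tag(html, tag):
    """Return the text found between <tag ...> and </tag> for every match."""
    # \b stops "b" from also matching "<body>"; (.*?) is a lazy capture so
    # each match ends at the nearest closing tag.
    pattern = r"<{0}\b[^>]*>(.*?)</{0}>".format(re.escape(tag))
    return re.findall(pattern, html, flags=re.IGNORECASE | re.DOTALL)


print(text_inside_tag(SAMPLE_HTML, "b"))
```

A pattern like this only sees what is in the downloaded text, so it will not find a table that the browser adds later with JavaScript.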

Related

How to webscrape the correct element from a stat tracking website (cod.tracker.gg) using Python

On this specific page (or any 'matches' page) there are names you can select to view individual statistics for a match. How do I grab the 'kills' stat for example using webscraping?
In most of the tutorials I use the webscraping seems simple. However, when inspecting this site, specifically the 'kills' item, you see something like
<span data-v-71c3e2a1 title="Kills" class ="name".
Question 1.) What is the 'data-v-71c3e2a1'? I've never seen anything like this in my html,css, or webscraping tutorials. It appears in different variations all over the site.
Question 2.) More importantly, how do I grab the number of kills in this section? I've tried using scrapy and grabbing by xpath:
scrapy shell https://cod.tracker.gg/warzone/match/1424533688251708994?handle=PatrickPM
response.xpath("//*[@id="app"]/div[3]/div[2]/div/main/div[3]/div[2]/div[2]/div[6]/div[2]/div[3]/div[2]/div[1]/div/div[1]/span[2]").get()
but this raises a syntax error
response.xpath("//*[@id="app"]
SyntaxError: invalid syntax
Grabbing by response.css("").get() is also difficult. Should I be using selenium? Or just regular requests/bs4? Nothing I do can grab it.
Thank you.
Does this return the data you need?
import requests
endpoint = "https://api.tracker.gg/api/v1/warzone/matches/1424533688251708994"
r = requests.get(endpoint, params={"handle": "PatrickPM"})
data = r.json()["data"]
In any case, I suggest using the API if one is available. It's much easier than using BeautifulSoup or Selenium.
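On the SyntaxError in the question: the XPath string is delimited with double quotes and also contains double quotes around app, so Python ends the string early. Delimiting the string with single quotes avoids that. The sketch below shows the quoting with the standard library's ElementTree rather than Scrapy, on a made-up, simplified snippet (SAMPLE is an assumption, not the site's real markup).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the rendered page (not the real markup).
SAMPLE = (
    '<html><body><div id="app"><div>'
    '<span title="Kills" class="name">Kills</span>'
    '<span class="value">7</span>'
    '</div></div></body></html>'
)

# Single quotes delimit the Python string, so the double quotes inside the
# XPath expression no longer terminate it.
KILLS_XPATH = './/*[@id="app"]/div/span[@class="value"]'

root = ET.fromstring(SAMPLE)
kills = root.find(KILLS_XPATH).text
print(kills)
```

The same quoting applies to response.xpath('//*[@id="app"]/...') in the Scrapy shell. It only fixes the syntax error; whether the element is present in the HTML that Scrapy downloads is a separate question.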

Python - Scrapy ecommerce website

I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
With the following code but it returns an empty array
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.
The site is dynamic: when I opened the response with the view(response) command in the scrapy shell, the price info did not come out.
Solutions:
1. Splash
2. Selenium + PhantomJS
It may also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values
The price is later added by the browser which renders the page using javascript code found in the html. If you disable javascript in your browser, you would notice that the page would look a bit different. Also, take a look at the page source, usually that's unaltered, to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any javascript code. It receives the plain html and that's what you have to work with.
If you want to extract data from pages as they look in the browser, I recommend using a headless browser like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it, and select the data points you're interested in.
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this url by taking a look at all the XMLHttpRequest (XHR) requests sent in the Network tab found in Developers Tools (on Google Chrome).
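As a rough sketch, the stock/price URL above can be rebuilt for another product ID with the standard library. The function name build_stockprice_url is made up for illustration, the parameter values are simply the ones visible in that URL, and it is an assumption that the endpoint still exists and accepts other IDs or an unchanged keyStoreDataversion value.

```python
from urllib.parse import urlencode

# Base path copied from the URL in the answer above.
STOCKPRICE_ENDPOINT = "http://www.asos.com/api/product/catalogue/v2/stockprice"


def build_stockprice_url(product_id, currency="AUD", store="AU",
                         data_version="0ggz8b-4.1"):
    """Build the query URL; the defaults are the values seen in the answer."""
    params = {
        "productIds": product_id,
        "currency": currency,
        "keyStoreDataversion": data_version,
        "store": store,
    }
    return STOCKPRICE_ENDPOINT + "?" + urlencode(params)


url = build_stockprice_url(9065343)
print(url)
```

The resulting URL can then be fetched with requests.get(url) or urllib.request.urlopen(url) and the body decoded as JSON; the layout of that JSON is not shown in the answer, so inspect it before relying on any key.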
You can try to find JSON inside HTML (using regular expression) and parse it:
import json
# Grab the string passed to view('...') in the inline script, then decode it as JSON
json_string = response.xpath('//script[contains(., "function (view) {")]/text()').re_first(r'view\(\'([^\']+)')
data = json.loads(json_string)
price = data["price"]["current"]

Python - How to scrape Tr/Td table data using 'requests & BeautifulSoup'

I'm new to programming. I'm trying out my first Web Crawler program that will help me with my job. I'm trying to build a program that will scrape tr/td table data from a web page, but am having difficulties succeeding. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
def start(url):
    source_code = requests.get(url).text
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(source_code, "html.parser")
    for table_data in soup.find_all('td', {'class': 'sorting_1'}):
        print(table_data)
start('http://www.datatables.net/')
My goal is to print out each line and then export it to an excel file.
Thank you,
-Cire
My recommendation is that if you are new to Python, play with things in the IPython notebook (interactive prompt) to get them working and to get a feel for them before you try writing a script or a function. On the plus side, all variables stick around and it is much easier to see what is going on.
Running your code there shows immediately that the find_all function is not finding anything: an empty list [] is returned. With IPython you can easily try other variants of a function on a previously defined variable, for example soup.find_all('td').
Looking at the source of http://www.datatables.net, I do not see any instances of the text sorting_1, so I wouldn't expect a search for all table cells of that class to return anything.
Perhaps that class appeared on a different URL associated with the DataTables website, in which case you would need to use that URL in your code. It's also possible that that class only appears after certain JavaScript has been run client-side (i.e. after certain actions with the sample tables, perhaps), and not on the initially loaded page.
I'd recommend starting with tags you know are on the initial page (seen by looking at the page source in your browser).
For example, currently, I can see a div with class="content". So the find_all code could be changed to the following:
for table_data in soup.find_all('div', {'class': 'content'}):
    print(table_data)
And that should find something.
Response to comments from OP:
The precise reason why you're not finding that tag/class pairing in this case is that DataTables renders the table client-side via JavaScript, generally after the DOM has finished loading (although it depends on the page and where the DataTables init code is placed). That means the HTML associated with the base URL does not contain this content. You can see this if you curl the base URL and look at the output.
However when loading it in a browser, once the JavaScript for DataTables fires, the table is rendered and the DOM is dynamically modified to add the table, including cells with the class for which you're looking.
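One way to check this before writing the scraper is to count how often a tag/class pair occurs in the HTML the server actually sends. The sketch below uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs; STATIC_HTML is a made-up stand-in for the downloaded page, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags with a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # A class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_tag_with_class(html, tag, css_class):
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical static HTML: the table is empty until JavaScript fills it in.
STATIC_HTML = """
<html><body>
<div class="content"><table id="example"></table></div>
<script src="datatables.js"></script>
</body></html>
"""

print(count_tag_with_class(STATIC_HTML, "td", "sorting_1"))
print(count_tag_with_class(STATIC_HTML, "div", "content"))
```

A count of zero for a class you can see in the browser's inspector is a strong hint that the element is created client-side, so requests alone will not see it.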

Scraping Biography.com using urllib2

So I've scraped websites before, but this time I am stumped. I am attempting to search for a person on Biography.com and retrieve his/her biography. But whenever I search the site using urllib2 and query the URL: http://www.biography.com/search/ I get a blank page with no data in it.
When I look into the source generated in the browser by clicking View Source, I still do not see any data. When I use Chrome's developer tools, I find some data but still no links leading to the biography.
I have tried changing the User Agent, adding referrers, using cookies in Python but to no avail. If someone could help me out with this task it would be really helpful.
I am planning to use this text for my NLP project and worst case, I'll have to manually copy-paste the text. But I hope it doesn't come to that.
Chrome/Chromium's Developer Tools (or Firebug) is definitely your friend here. I can see that the initial search on Biography's site is made via a call to a Google API, e.g.
https://www.googleapis.com/customsearch/v1?q=Barack%20Obama&key=AIzaSyCMGfdDaSfjqv5zYoS0mTJnOT3e9MURWkU&cx=011223861749738482324%3Aijiqp2ioyxw&num=8&callback=angular.callbacks._0
The search term I used is in the q= part of the query string: q=Barack%20Obama.
This returns JSON inside of which there is a key link with the value of the article of interest's URL.
"link": "http://www.biography.com/people/barack-obama-12782369"
Visiting that page shows me that this is generated by a request to:
http://api.saymedia-content.com/:apiproxy-anon/content-sites/cs01a33b78d5c5860e/content-customs/#published/#by-custom-type/ContentPerson/#by-slug/barack-obama-12782369
which returns JSON containing HTML.
So, replacing the last part of the link barack-obama-12782369 with the relevant info for the person of interest in the saymedia-content link may well pull out what you want.
To implement:
1. Use urllib2 (or requests) to do the search via their Google API call, using urllib2.urlopen(url) or requests.get(url). Replace Barack%20Obama with a URL-escaped search string, e.g. Bill%20Clinton.
2. Parse the JSON using Python's json module to extract the string that gives you the http://www.biography.com/people link. From this, extract the part of the link of interest (barack-obama-12782369 above).
3. Use urllib2 or requests to do a saymedia-content API request, replacing barack-obama-12782369 after #by-slug/ with whatever you extracted in step 2; i.e. do another urllib2.urlopen on this URL.
4. Parse the JSON from the response of this second request to extract the content you want.
(Caveat: This is provided that there are no session-based strings in those two API calls that might expire.)
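The URL and JSON handling in the steps above could be sketched as follows, without making any network calls. This is Python 3 (urllib.parse instead of urllib2), all function names are made up for illustration, SAMPLE_RESPONSE is a hypothetical layout rather than a captured response, and leaving out the callback parameter is assumed to give plain JSON instead of a JSONP wrapper.

```python
import json
from urllib.parse import quote, urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
PEOPLE_PREFIX = "http://www.biography.com/people/"


def build_search_url(name, key, cx):
    """Step 1: build the search URL with a URL-escaped search string."""
    params = {"q": name, "key": key, "cx": cx}
    # quote_via=quote encodes spaces as %20, as in the URL shown above.
    return SEARCH_ENDPOINT + "?" + urlencode(params, quote_via=quote)


def find_people_links(node):
    """Step 2: walk decoded JSON and yield every "link" to a /people/ page."""
    if isinstance(node, dict):
        for key, value in node.items():
            if (key == "link" and isinstance(value, str)
                    and value.startswith(PEOPLE_PREFIX)):
                yield value
            else:
                yield from find_people_links(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_people_links(item)


def slug_from_link(link):
    """Step 2, continued: keep only the last path segment of the link."""
    return link.rstrip("/").rsplit("/", 1)[-1]


def build_content_url(content_api_prefix, slug):
    """Step 3: append the slug to the content API prefix given in the answer."""
    return content_api_prefix + slug


# Hypothetical response layout, used only to exercise the helpers.
SAMPLE_RESPONSE = json.loads(
    '{"items": [{"title": "Barack Obama", '
    '"link": "http://www.biography.com/people/barack-obama-12782369"}]}'
)

links = list(find_people_links(SAMPLE_RESPONSE))
slug = slug_from_link(links[0])
print(build_search_url("Bill Clinton", "YOUR_KEY", "YOUR_CX"))
print(slug)
```

Searching recursively for the link key avoids depending on the exact nesting of the response. For step 4, fetch the URL from build_content_url with urlopen or requests.get and decode the body with json.loads.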
Alternatively, you can use Selenium to visit the website, do the search and then extract the content.
You will most likely need to manually copy and paste, as biography.com is a completely javascript-based site, so it can't be scraped with traditional methods.
You can discover an API URL with HttpFox (a Firefox add-on), for example http://www.biography.com/.api/item/search?config=published&query=marx
This returns JSON that you can process, searching for /people/ to retrieve biography links.
Or you can use a browser automation tool like Selenium.

Querying web pages with Python

I am learning web programming with Python, and one of the exercises I am working on is the following: I am writing a Python program to query the website "orbitz.com" and return the lowest airfare. The departure and arrival cities and dates are used to construct the URL.
I am doing this using the urlopen command, as follows:
(search_str contains the URL)
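from lxml.html import parse
from urllib2 import urlopen
# search_str holds the query URL built from the cities and dates
parsed = parse(urlopen(search_str))
doc = parsed.getroot()
links = doc.findall('.//a')
# Look at the visible text of each link in turn
for link in links:
    the_link = link.text_content().strip()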
The idea is to retrieve all the links from the query results and search for strings such as "Delta", "United" etc, and read off the dollar amount next to the links.
It worked until today. It looks like orbitz.com has changed its output page. Now, when you enter the travel details on the orbitz.com website, a page appears showing a wheel and a message like "looking up itineraries". This is just a filler page and contains no real information. After a few seconds, the real results page is displayed. Unfortunately, the Python code returns the links for the filler page each time, and I never obtain the real results.
How can I get around this? I am a relative beginner to web programming, so any help is greatly appreciated.
This kind of thing is normal in the world of crawlers.
What you need to do is figure out which URL the "looking up itineraries" page redirects to, and request that URL directly from your script.
Then check whether they have changed the final search results page too; if so, modify your script to accommodate those changes.
