I want to scrape the data at this link:
http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json
I am not sure what type of link this is: is it HTML, JSON, or something else? Sorry for my limited web knowledge. I tried the following code to scrape it:
import requests
url = 'http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source = requests.get(url).text
The type of source is unicode. I also tried to scrape it with urllib2, like:
import urllib2
source2 = urllib2.urlopen(url).read()
The type of source2 is string. I am not sure which method is better, because the link is not like a normal webpage containing different tags. If I want to clean the scraped data and form a DataFrame (like a pandas DataFrame), what method or process should I follow?
Thanks.
The returned response is text containing valid JSON data inside it. You can validate it yourself using a service like http://jsonlint.com/ if you want; just copy the code within the brackets of
return_json("JSON code to copy")
In order to make use of that data you just need to parse it in your program. Here's an example: https://docs.python.org/2/library/json.html
The response is text, but it does contain JSON; you just need to extract it:
import json
import requests

url = 'http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
# strip the leading "return_json(" and the trailing ");" to get bare JSON
strip_len = len("return_json(")
source = requests.get(url).text[strip_len:-2]
source = json.loads(source)
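Since the question asks about ending up with a pandas DataFrame: a minimal sketch, assuming the parsed JSON contains a list of records somewhere inside. The 'poll' and 'rcp_avg' keys below are guesses at this feed's layout, not confirmed, so inspect the structure first.
import pandas as pd

# inspect the top-level structure before relying on any key names
print(source.keys())

# key names below are assumptions about this particular feed; adjust after inspecting
rows = source.get('poll', {}).get('rcp_avg', [])
df = pd.DataFrame(rows)  # a list of dicts becomes one row per dict
print(df.head())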
I am trying to scrape a list of dates from: https://ca.finance.yahoo.com/quote/AAPL/options
The dates are located in a drop-down menu right above the option chain. I've scraped text from this website before, but this text uses 'select' and 'option' tags. How would I adjust my code to gather this type of text? I have tried many variations of the code below, but am having no luck.
Thank you very much.
import requests
from bs4 import BeautifulSoup

datesLink = 'https://ca.finance.yahoo.com/quote/AAPL/options'
datesPage = requests.get(datesLink)
datesSoup = BeautifulSoup(datesPage.text, 'lxml')
datesQuote = datesSoup.find('div', {'class': 'Cf Pt(18px)controls'}).find('option').text
The reason you can't seem to extract this drop-down list is that it is generated dynamically. The easiest way to confirm this is to save your HTML content into a file and give it a manual look in a text editor.
You CAN, however, parse those dates out of the script source code, which is in the same HTML file, using a somewhat ugly regex approach. For example, this seems to work:
import requests, re
from datetime import datetime, timezone

content = requests.get('https://ca.finance.yahoo.com/quote/AAPL/options').content.decode()
# the expiration dates live as epoch timestamps inside an embedded JS store
match = re.search(r'"OptionContractsStore".*?"expirationDates".*?\[(.*?)\]', content)
dates = [datetime.fromtimestamp(int(x), tz=timezone.utc) for x in match.group(1).split(',')]
for d in dates:
    print(d.strftime('%Y-%m-%d'))
It should be obvious that parsing stuff in such a nasty way isn't fool-proof and is likely to break sooner rather than later. But the same can be said about almost any kind of web scraping.
You can simply read the HTML directly into pandas:
import pandas as pd
URI = 'https://ca.finance.yahoo.com/quote/AAPL/options'
df = pd.read_html(URI)[0]  # or [1], depending on which table you want
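Note that pd.read_html only sees <table> elements present in the raw HTML; if the page builds its tables with JavaScript (as noted above for the drop-down), read_html may come back empty and you would need one of the approaches above instead.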
Hi, I would like to know how to extract HTML / web browser data so I can add it to a string.
import requests
r = requests.get("http://somesite.com")
print(r.text)
You can save r.text into a variable, perform a split() on it to put it into a list, or use regular expressions to extract data between tags. There are a million possibilities once you have that string. Have fun!
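For example, a minimal sketch that pulls the text between <title> tags out of the string (the tag is just for illustration; for anything non-trivial, a real HTML parser like BeautifulSoup is safer):
import re
import requests

r = requests.get("http://somesite.com")
# grab whatever sits between the <title> tags of the page
match = re.search(r"<title>(.*?)</title>", r.text, re.DOTALL)
if match:
    print(match.group(1).strip())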
I tried to scrape Twitter followers using the requests library. Finally, I tried to save the response of the required page in JSON format and then searched for the required parts. The thing is, how do I find the required elements in the JSON object?
My code is:
import json
import requests
from bs4 import BeautifulSoup

s = requests.session()
# payload and headers (login form data) are defined earlier
res = s.post("https://twitter.com/sessions", data=payload, headers=headers)
r = s.get("https://twitter.com/akhiltaker619/following/users?include_available_features=1&include_entities=1&max_position=1590310744326457266&reset_error_state=false")
dp = r.text
dp1 = json.loads(dp)
x = json.dumps(dp1)
print(res.status_code)
soup = BeautifulSoup(x, "html.parser")
x1 = soup.find_all("b", {"class": "u-linkComplex-target"})
for i in x1:
    print(i.text)
At the end, the parsing part is wrong, as I am trying to run BeautifulSoup over a JSON object, which is not possible. When I print the JSON object, I get this:
The attached link contains the output of the JSON object.
Now, from this object, I want the "class: u-linkComplex-target" elements present in the "item_html" of this JSON object. How do I get them? Or is there any way to get the same content without using the JSON object (this content is the followers list page on Twitter)? I used JSON in order to load the dynamic content of the page.
The Beautiful Soup library is for parsing HTML and similar tagged languages, not JSON.
If your requests return JSON responses then you should call the r.json() method. This will return a dictionary of the JSON structure. Suppose you used
j = r.json()
then you probably want the HTML fragment stored under j['item_html'] or something similar. If you explore the dictionary interactively you will probably find what you want.
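Putting the two tools together: a sketch that parses the JSON body, pulls out the HTML fragment, and only then hands it to BeautifulSoup. The "item_html" key name is taken from the question and may differ in the real response, so inspect j.keys() first.
from bs4 import BeautifulSoup

# r is the response from the followers endpoint in the question's code
j = r.json()                # parse the JSON body into a dict
print(j.keys())             # confirm the real key names first
fragment = j["item_html"]   # key name taken from the question; verify it
soup = BeautifulSoup(fragment, "html.parser")
for b in soup.find_all("b", {"class": "u-linkComplex-target"}):
    print(b.text)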
I'm fairly new to programming, and I am trying to take data from a webpage and use it in my Python code. Basically, I'm trying to get the price of an item in a game by having Python grab the data whenever I run my code, if that makes sense. Here's what I'm struggling with in particular:
The HTML page I'm using is for runescape, namely
http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151
This page provides me with a bunch of dictionaries from which I am trying to extract the price of the item in question. All I really want to do is get all of this data into Python so I can then manipulate it. My current code is:
import urllib2
response = urllib2.urlopen('http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151')
print response
And it outputs:
<addinfourl at 49631760 whose fp = <socket._fileobject object at 0x02F4B2F0>>
whereas I just want it to display exactly what is at the URL in question.
Any ideas? I'm sorry if my formatting is terrible, and if it sounds like I have no idea what I'm talking about, it's because I don't.
If the webpage returns JSON-encoded data, then do something like this:
import urllib2
import json
response = urllib2.urlopen("http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151")
data = json.load(response)
print(data)
Extract the relevant keys from the data variable to get the values you want.
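For instance, a sketch of drilling into the nested dictionaries. The "item" / "current" / "price" keys below are an assumption about this API's layout; print data first and adjust to what you actually see.
import urllib2
import json

url = "http://services.runescape.com/m=itemdb_oldschool/api/catalogue/detail.json?item=4151"
data = json.load(urllib2.urlopen(url))

# assumed structure: {"item": {"current": {"price": ...}, ...}}; verify with data.keys()
price = data["item"]["current"]["price"]
print(price)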
I have a site that I want to extract data from. The data retrieval is very straightforward.
It takes parameters via HTTP POST and returns a JSON object. So I have a list of queries that I want to run, and then repeat at certain intervals to update a database. Is scrapy suitable for this, or should I be using something else?
I don't actually need to follow links but I do need to send multiple requests at the same time.
What does the POST request look like? There are many variations, like simple query parameters (?a=1&b=2), a form-like payload (the body contains a=1&b=2), or some other kind of payload (the body contains a string in some format, like JSON or XML).
In scrapy it is fairly straightforward to make POST requests; see: http://doc.scrapy.org/en/latest/topics/request-response.html#request-usage-examples
For example, you may need something like this:
# Warning: take care of the undefined variables and modules!
def start_requests(self):
    payload = {"a": 1, "b": 2}
    yield Request(url, self.parse_data, method="POST", body=urllib.urlencode(payload))

def parse_data(self, response):
    # do stuff with data...
    data = json.loads(response.body)
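As a side note, scrapy's FormRequest can do the form-encoding for you; a sketch under the same assumptions (url undefined, methods living inside your spider class):
from scrapy.http import FormRequest

def start_requests(self):
    payload = {"a": "1", "b": "2"}  # FormRequest expects string values
    # FormRequest defaults to method="POST" when formdata is given
    yield FormRequest(url, formdata=payload, callback=self.parse_data)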
For handling requests and retrieving responses, scrapy is more than enough. And to parse the JSON, just use the json module from the standard library:
import json
data = ...
json_data = json.loads(data)
Hope this helps!
Based on my understanding of the question, you just want to fetch/scrape data from a web page at certain intervals. Scrapy is generally used for crawling.
If you just want to make HTTP POST requests, you might consider using the Python requests library.
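For completeness, a minimal sketch of the same POST with requests (the endpoint and payload are placeholders):
import requests

url = "http://example.com/api"  # placeholder endpoint
payload = {"a": 1, "b": 2}

# data= form-encodes the dict; use json=payload instead for a JSON body
r = requests.post(url, data=payload)
print(r.json())  # parse the JSON response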