Web scraping, getting empty list - python

I have a hard time figuring out a correct path with my web scraping code.
I am trying to scrape different info from http://financials.morningstar.com/company-profile/c.action?t=AAPL.
I have tried several paths, and some seem to work and some not.
I am interested in the CIK under Operation Details.
import requests
from lxml import html

page = requests.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
tree = html.fromstring(page.text)
#desc = tree.xpath('//div[@class="r_title"]/span[@class="gry"]/text()') #works
#desc = tree.xpath('//div[@class="wrapper"]//div[@class="headerwrap"]//div[@class="h_Logo"]//div[@class="h_Logo_row1"]//div[@class="greeter"]/text()') #works
#desc = tree.xpath('//div[@id="OAS_TopLeft"]//script[@type="text/javascript"]/text()') #works
desc = tree.xpath('//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tbody//tr//th[@class="row_lbl"]/text()')
I can't figure out the last path. It seems like I am following the path correctly, but I get an empty list.

The problem is that the Operation Details section is loaded separately with an additional GET request. Simulate it in your code while maintaining a web-scraping session:
import requests
from lxml import html

with requests.Session() as session:
    page = session.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
    tree = html.fromstring(page.text)

    # get the operational details
    response = session.get("http://financials.morningstar.com/company-profile/component.action", params={
        "component": "OperationDetails",
        "t": "XNAS:AAPL",
        "region": "usa",
        "culture": "en-US",
        "cur": "",
        "_": "1444848178406"
    })
    tree_details = html.fromstring(response.content)
    print(tree_details.xpath('.//th[@class="row_lbl"]//text()'))
Old answer:
It's just that you should remove tbody from the expression:
//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tr//th[@class="row_lbl"]/text()
tbody is an element that is inserted by the browser to define the data rows in a table.
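A quick way to see the point (a minimal sketch using a hand-written table fragment): lxml's tree contains only what is in the served markup, so there is no tbody to match:

from lxml import html

snippet = html.fromstring('<table><tr><th class="row_lbl">CIK</th></tr></table>')
print(snippet.xpath('//tbody'))  # [] - no tbody in the parsed source
print(snippet.xpath('//tr/th[@class="row_lbl"]/text()'))  # ['CIK']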

API - Web Scrape

How do I get access to this API?
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, however I can't seem to get it right because I'm running into a 403 status code.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me, but I'm unable to grab them.
Later I'll use these categories to iterate over the products API.
(screenshot: API Category)
Note: please be gentle, it's my first post here =]
To get the data shown in your image, the following headers and endpoint are needed:
import requests

headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
    'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
Not sure exactly what your issue is here, but if you want to see the content of the response and not just the 200/400 status, you need to add '.content' to your print. E.g.:
import requests

# Create session
s = requests.Session()

# Example connection variables, probably not required for your use case
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request (otherUrl, otherBodyJson and otherHeadersJson are placeholders for your own values)
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)
print(p)            # Print response status (200 etc.)
#print(p.headers)
#print(p.content)   # Print the content of the response
#print(s.cookies)
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install it and import it, it's just a matter of continuing what you were doing to actively get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, and so you can get your data from it based on CSS selectors like this:
site_data = soup.select('selector')
site_data is a list of elements matching that 'selector', so a simple for loop and a list to collect your items would suffice (as an example, getting links for each book on a bookstore site). For example, if I was trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.select("a")  # list of all items with this selector
for link in links:
    sites.append(link)
Also, a helpful tip: when you inspect the page (right click and press 'Inspect'), you can see the code for the page. Go to the HTML, find the data you want, right click it and select Copy -> Copy selector. This will make it really easy for you to get the data you want on that site.
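For instance, a copied selector can be dropped straight into select() (the selector string below is made up for illustration):

# hypothetical selector copied from DevTools via Copy -> Copy selector
for link in soup.select("#menu > ul > li.category > a"):
    print(link.get_text(strip=True), link.get("href"))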
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/

Unable to grab some information displayed in a new tab

I've created a script in python using the requests module to fetch some information displayed upon filling in a form using this email africk2@nd.edu. The problem is that when I hit the search button, I can see a new tab containing all the information I wish to grab. Moreover, I don't see any link in the All tab under the Network section within Chrome dev tools. So I'm at a loss as to how I can get the information using the requests module.
website address
Steps to populate the result manually:
Put this email address africk2@nd.edu into the Email address input box and hit the Search button.
I've tried with:
import requests
from bs4 import BeautifulSoup
url = "https://eds.nd.edu/search/index.shtml"
post_url = "https://eds.nd.edu/cgi-bin/nd_ldap_search.pl"
res = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
payload = {item['name']:item.get('value','') for item in soup.select('input[name]')}
payload['email'] = 'africk2@nd.edu'
del payload['clear']
resp = requests.post(post_url,data=payload)
print(resp.content)
The above script is a faulty approach. However, I can't figure out how to grab the information connected to that email.
P.S. I'm not after a selenium-oriented solution.
Ok, solved it:
from urllib.parse import quote

import requests


def get_contact_html(email: str):
    encoded = quote('o="University of Notre Dame", '
                    'st=Indiana, '
                    'c=US?displayName,edupersonaffiliation,ndTitle,ndDepartment,postalAddress,telephoneNumber,mail,searchGuide,labeledURI,'
                    'uid?'
                    'sub?'
                    f'(&(ndMail=*{email}*))')
    data = {
        "ldapurl": f'LDAP://directory.nd.edu:389/{encoded}',
        "ldaphost": "directory.nd.edu",
        "ldapport": '389',
        "ldapbase": 'o="University of Notre Dame", st=Indiana, c=US',
        "ldapfilter": f'(&(ndMail=*{email}*))',
        "ldapheadattr": "displayname",
        "displayformat": "nd",
        "ldapmask": "",
        "ldapscope": "",
        "ldapsort": "",
        "ldapmailattr": "",
        "ldapurlattr": "",
        "ldapaltattr": "",
        "ldapjpgattr": "",
        "ldapdnattr": "",
    }
    res = requests.post('https://eds.nd.edu/cgi-bin/nd_ldap_search.pl',
                        data=data)
    res.raise_for_status()
    return res.text


if __name__ == '__main__':
    html = get_contact_html('africk2@nd.edu')
    print(html)
output:
...
Formal Name:
...
Aaron D Frick
...
This will give you the HTML for the page.
The trick was converting the encoded spaces (+) to real spaces in the "ldapbase": 'o="University of Notre Dame", st=Indiana, c=US' field and letting the requests module encode the value itself. Otherwise the + signs get double-encoded.
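A minimal sketch of the encoding behaviour (illustrative only, using urlencode, which mirrors what requests does with form data):

from urllib.parse import urlencode

# requests encodes each value once, so pass the raw string:
print(urlencode({"ldapbase": 'o="University of Notre Dame", st=Indiana, c=US'}))
# -> ldapbase=o%3D%22University+of+Notre+Dame%22%2C+st%3DIndiana%2C+c%3DUS

# if the value is already plus/percent-encoded, encoding it again mangles it:
print(urlencode({"ldapbase": 'o=%22University+of+Notre+Dame%22'}))
# -> ldapbase=o%3D%2522University%2Bof%2BNotre%2BDame%2522 (note %25 and %2B: double-encoded)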

Python data scraping - Elementary concepts

I am trying to get my head around how data scraping works when you look past HTML (i.e. DOM scraping).
I've been trying to write a simple Python code to automatically retrieve the number of people that have seen a specific ad: the part where it says '3365 people viewed Peter's place this week.'
At first I tried to see if that was displayed in the HTML code but could not find it. I did some research and saw that not everything will be in the code, as it can be processed by the browser through JavaScript or other languages that I don't quite understand yet. I then inspected the element and realised that I would need to use the Python libraries 'requests' and 'lxml.html'. So I wrote this code:
import requests
import lxml.html
response = requests.get('https://www.airbnb.co.uk/rooms/501171')
resptext = lxml.html.fromstring(response.text)
final = resptext.text_content()
finalu = final.encode('utf-8')
file = open('file.txt', 'w')
file.write(finalu)
file.close()
With that, I get a file with all the text on the web page, but not the text that I am looking for! Which is the magic number 3365.
So my question is: how do I get it? I have thought that maybe I am not using the correct language to get the DOM, maybe it is done with JavaScript and I am only using lxml. However, I have no idea.
The DOM element you are looking at is updated after page load with what looks like an AJAX call with the following request URL:
https://www.airbnb.co.uk/rooms/501171/personalization.json
If you GET that URL, it will return the following JSON data:
{
    "extras_price":"£30",
    "preview_bar_phrases":{
        "steps_remaining":"<strong>1 step</strong> to list"
    },
    "flag_info":{
    },
    "user_is_admin":false,
    "is_owned_by_user":false,
    "is_instant_bookable":true,
    "instant_book_reasons":{
        "within_max_lead_time":null,
        "within_max_nights":null,
        "enough_lead_time":true,
        "valid_reservation_status":null,
        "not_country_or_village":true,
        "allowed_noone":null,
        "allowed_everyone":true,
        "allowed_socially_connected":null,
        "allowed_experienced_guest":null,
        "is_instant_book_host":true,
        "guest_has_profile_pic":null
    },
    "instant_book_experiments":{
        "ib_max_nights":14
    },
    "lat":51.5299601405844,
    "lng":-0.12462748035984603,
    "localized_people_pricing_description":"£30 / night after 2 guests",
    "monthly_price":"£4200",
    "nightly_price":"£150",
    "security_deposit":"",
    "social_connections":{
        "connected":null
    },
    "staggered_price":"£4452",
    "weekly_price":"£1050",
    "show_disaster_info":false,
    "cancellation_policy":"Strict",
    "cancellation_policy_link":"/home/cancellation_policies#strict",
    "show_fb_cta":true,
    "should_show_review_translations":false,
    "listing_activity_data":{
        "day":{
            "unique_views":226,
            "total_views":363
        },
        "week":{
            "unique_views":3365,
            "total_views":5000
        }
    },
    "should_hide_action_buttons":false
}
If you look under "listing_activity_data" you will find the information you seek. Appending /personalization.json to any room URL seems to return this data (for now).
Update per the user agent issues
It looks like they are filtering requests to this URL based on user agent. I had to set the user agent on the urllib request in order to fix this:
import urllib2
import json

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('http://www.airbnb.co.uk/rooms/501171/personalization.json', None, headers)
data = json.load(urllib2.urlopen(req))  # avoid naming this 'json', which would shadow the module
print(data['listing_activity_data']['week']['unique_views'])
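For reference, an equivalent sketch using the requests module instead of urllib2 (assuming the endpoint still responds the same way):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get('http://www.airbnb.co.uk/rooms/501171/personalization.json', headers=headers)
print(resp.json()['listing_activity_data']['week']['unique_views'])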
So first of all you need to figure out whether that section of the page has any unique tags. If you look at the HTML tree, you have:
html > body > #room > ....... > #book-it-urgency-commitment > div > div > ... > div#media-body > b
The data you need is stored in a 'b' tag. I'm not sure about using lxml, but I usually use BeautifulSoup for my scraping.
You can reference http://www.crummy.com/software/BeautifulSoup/bs4/doc/ - it's pretty straightforward.
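A minimal sketch of that approach with BeautifulSoup (the selector is derived from the tree above; note that, per the accepted answer, this block is filled in by an AJAX call, so it may not be present in the raw HTML at all):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.airbnb.co.uk/rooms/501171').text
soup = BeautifulSoup(html, 'html.parser')
urgency = soup.select_one('#book-it-urgency-commitment #media-body b')
if urgency is not None:
    print(urgency.get_text())  # the view count, when present in the raw HTML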

Missing source page information using urllib2

I'm trying to scrape "game tag" data (not the same as HTML tags) from games listed on the digital game distribution site, Steam (store.steampowered.com). This information isn't available via the Steam API, as far as I can tell.
Once I have the raw source data for a page, I want to pass it into BeautifulSoup for further parsing, but I have a problem - urllib2 doesn't seem to be reading the information I want (requests doesn't work either), even though it's obviously in the source page when viewed in the browser.
For example, I might download the page for the game "7 Days to Die" (http://store.steampowered.com/app/251570/). When viewing the browser source page in Chrome, I can see the following relevant information regarding the game's "tags"
near the end, starting at line 1615:
<script type="text/javascript">
$J( function() {
InitAppTagModal( 251570,
{"tagid":1662,"name":"Survival","count":283,"browseable":true},
{"tagid":1659,"name":"Zombies","count":274,"browseable":true},
{"tagid":1702,"name":"Crafting","count":248,"browseable":true},...
In InitAppTagModal, there are the tags "Survival", "Zombies", "Crafting", etc. that contain the information I'd like to collect.
But when I use urllib2 to get the page source:
import urllib2
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page
page = urllib2.urlopen(url).read()
The part of the source page that I'm interested in is not saved in my "page" variable; instead, everything below line 1555 is simply blank until the closing body and html tags, resulting in this (carriage returns included):
</div><!-- End Footer -->
</body>
</html>
The blank space is where the source code I need (along with other code) should be.
I've tried this on several different computers with different installs of python 2.7 (Windows machines and a Mac), and I get the same result on all of them.
How can I get the data that I'm looking for?
Thank you for your consideration.
Well, I don't know if I'm missing something, but it's working for me using requests:
import requests
# Getting html code
url = "http://store.steampowered.com/app/251570/"
html = requests.get(url).text
What's more, the requested data is in JSON format, so it's easy to extract it this way:
import json

# Extracting the JavaScript object (a JSON-like object)
start_tag = 'InitAppTagModal( 251570,'
end_tag = '],'
startIndex = html.find(start_tag) + len(start_tag)
endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
raw_data = html[startIndex:endIndex]

# Load raw data as a Python json object
data = json.loads(raw_data)
You will see a beautiful JSON object like this (this is the info that you need, right?):
[
    {
        "count": 283,
        "browseable": true,
        "tagid": 1662,
        "name": "Survival"
    },
    {
        "count": 274,
        "browseable": true,
        "tagid": 1659,
        "name": "Zombies"
    },
    {
        "count": 248,
        "browseable": true,
        "tagid": 1702,
        "name": "Crafting"
    }......
I hope it helps....
UPDATED:
Ok, I see your problem now: it seems the problem is with page 224600. In this case the webpage requires you to confirm your age before it shows you the game's info. Anyway, it's easy to solve by just posting the form that confirms the age. Here is the updated code (and I created a function):
import json

import requests


def extract_info_games(page_id):
    # Create session
    session = requests.session()

    # Get initial html
    html = session.get("http://store.steampowered.com/app/%s/" % page_id).text

    # Check whether we landed on the age-check page (just look for the age-check form in the html code)
    if ('<form action="http://store.steampowered.com/agecheck/app/%s/"' % page_id) in html:
        # We are being redirected to the age-check page,
        # so let's confirm the age with a POST:
        post_data = {
            'snr': '1_agecheck_agecheck__age-gate',
            'ageDay': 1,
            'ageMonth': 'January',
            'ageYear': '1960'
        }
        html = session.post('http://store.steampowered.com/agecheck/app/%s/' % page_id, post_data).text

    # Extracting the JavaScript object (a JSON-like object)
    start_tag = 'InitAppTagModal( %s,' % page_id
    end_tag = '],'
    startIndex = html.find(start_tag) + len(start_tag)
    endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
    raw_data = html[startIndex:endIndex]

    # Load raw data as a Python json object
    data = json.loads(raw_data)
    return data
And to use it:
extract_info_games(224600)
extract_info_games(251570)
Enjoy!
When using urllib2 and read(), you will have to read repeatedly in chunks till you hit EOF, in order to read the entire HTML source.
import urllib2

url = "http://store.steampowered.com/app/224600/"  # 7 Days to Die page
url_handle = urllib2.urlopen(url)
data = ""
while True:
    chunk = url_handle.read()
    if not chunk:
        break
    data += chunk
An alternative would be to use the requests module as:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/app/251570/')
soup = BeautifulSoup(r.text, "html.parser")

How to export the contents of a blog to JSON?

I am working on a blog and learning web development at the same time. I want to learn more about JSON so I am trying to implement a way to export the entire contents of my blog to JSON and later XML. I am hitting a lot of problems on the way, the biggest one being getting the url of the page which I want to render as JSON/XML dynamically. The code for my website can be found here. I still need to comment more and I have to implement a lot of functionalities. The main class which is responsible for exporting the contents to JSON is as follows :
class JSONHandler(BaseHandler):
    #TODO: get a way to get the url from the request
    def get(self):
        self.response.headers['Content-Type'] = 'application/json'
        url = "http://www.bigb-myapp.appspot.com/blog"
        #url = self.request.path_url
        logging.info(url)
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        subject_list = []
        day_list = []
        content_list = []
        subjects = soup.findAll('div', {'class' : 'subject-title'})
        days = soup.findAll('div', {'class' : 'day'})
        contents = soup.findAll('div', {'class' : 'post'})
        for subject in subjects:
            subject_list.append(subject.findAll(text = True))
        for day in days:
            day_list.append(day.findAll(text = True))
        for content in contents:
            content_list.append(content.findAll(text = True))
        i = 0
        for s, d, c in subject_list, day_list, content_list:
            json_text = json.dumps({'subject': s[i][i], 'day': d[i][i], 'content': c[i][i]})
            i += 1
            self.write(json_text)
I am also sure that the printing function is erroneous, but that is the easy part. As I said, getting the url is proving to be a major difficulty.
I have tried to get the url from the environment variable, and I have also tried webapp2's request handlers such as self.request.path_url, to no avail.
I am working with Google App engine and use the jinja2 template engine.
Thanks.
self.request.url or self.request.path should do the trick.
However, the better way to do this is something similar to what you used in the permalink section: just parse the post-id from the request. Meaning you should separate JSONHandler into handling two things - a) return the entire blog, b) return an individual post.
I'd also suggest not using the method you're currently using to get the blog posts... In the Mainpage class you do it so elegantly with GQL, so why do it with urllib2 and BeautifulSoup?
And as for the last question about the response: the correct way is self.response.out.write("something").
EDITED TO ADD:
I meant to split the JSONHandler into two parts, such that there'd be two handlers:
('/blog/(\d+).json', PermalinkJSONHandler),
('/blog.json', FullJSONHandler),
...
Both should be about the same (even using the same function for dumping the json) just with different GQLs to get the correct information.
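A rough sketch of that split (BaseHandler is the question's base class; the Post model, its fields, and the GQL queries are assumptions for illustration):

import json

from google.appengine.ext import db

class FullJSONHandler(BaseHandler):
    def get(self):
        # GQL instead of urllib2 + BeautifulSoup, per the suggestion above;
        # the Post model and its fields are assumed, not taken from the question
        posts = db.GqlQuery("SELECT * FROM Post ORDER BY created DESC")
        self.response.headers['Content-Type'] = 'application/json'
        self.response.out.write(json.dumps(
            [{'subject': p.subject, 'day': str(p.created), 'content': p.content}
             for p in posts]))

class PermalinkJSONHandler(BaseHandler):
    def get(self, post_id):  # post_id is captured by the (\d+) group in the route
        post = Post.get_by_id(int(post_id))
        self.response.headers['Content-Type'] = 'application/json'
        self.response.out.write(json.dumps(
            {'subject': post.subject, 'day': str(post.created), 'content': post.content}))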
