unable to retrieve table through lxml, xpath (unanswered) - python

I"m trying to scrape http://www.sgx.com/wps/portal/sgxweb/home/company_disclosure/stockfacts for this table, and return the Company Name, Code and industry into a list, for all 15 pages of it.
And i've been trying to work with lxml.html, xpath and Beautifulsoup to try and get this information, but i'm stuck.
I realised that this information seems to be a #html embedded within the website, but i'm not sure how I can build a module to retrieve it.
Any thoughts? Or if I should be using a different module/technique?
Edit
I found out that this link was embedded into the website, which consists of the #html that I was talking about previously: http://sgx.wealthmsi.com/index.html#http%3A%2F%2Fwww.sgx.com%2Fwps%2Fportal%2Fsgxweb%2Fhome%2Fcompany_disclosure%2Fstockfacts
When I tried to use BeautifulSoup to pull the data out:
r = requests.get('http://sgx.wealthmsi.com/index.html#http%3A%2F%2Fwww.sgx.com%2Fwps%2Fportal%2Fsgxweb%2Fhome%2Fcompany_disclosure%2Fstockfacts')
wb = BeautifulSoup(r.text, "html.parser")
print(wb.findAll('div', attrs={'class': 'table-wrapper results-display'}))
It returns the result below:
[<div class="table-wrapper results-display">
<table>
<thead>
<tr></tr>
</thead>
<tbody></tbody>
</table>
</div>]
But that's different from what is shown on the website. Any thoughts?

You might want to approach this problem another way.
By looking at the server calls (Chrome -> F12 -> Network tab), you can figure out which URL you should actually call to get a JSON response instead.
Apparently, you could use a URL that starts like this:
http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search?callback=json&json=???? (you'll need to do some reverse engineering to figure out the actual JSON query, but it doesn't look too difficult)
Sorry I did not look much further into the JSON query, but I hope this helps you keep going :)
Note: I based my answer on this url
#!/usr/bin/env python
import requests
url = "http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search"
params = {
    'callback': 'json',
    'json': {
        # Key/value pairs defining your actual query to the server.
        # You need to figure this out yourself depending on the data you
        # want to retrieve.
        # I usually look at Chrome's Network tab (F12), find the proper URL
        # that queries for the data, and reverse engineer the key/value pairs.
    }
}
response = requests.get(url, params=params)
print(response.json())
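One caveat, as a hedged sketch: endpoints like this usually expect the json parameter to be a JSON-encoded string rather than a Python dict, and the callback wrapper may mean response.json() fails, in which case inspect response.text instead. The query keys below are placeholders, not the real schema:
#!/usr/bin/env python
import json
import requests

url = "http://sgx-api-lb-195267723.ap-southeast-1.elb.amazonaws.com/sgx/search"

# Hypothetical query body; the real keys must be reverse engineered
# from the Network tab as described above.
query = {"criteria": "...", "size": 20, "start": 0}

params = {
    'callback': 'json',
    'json': json.dumps(query),  # serialize the query to a JSON string
}
response = requests.get(url, params=params)
print(response.text)  # may be JSONP-wrapped, i.e. json(...)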

Related

API - Web Scrape

how to get access to this API:
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, however I can't seem to get it right because I'm running into a 403 error.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me, but I'm unable to extract them.
Later I'll use these categories to iterate over the products API.
[Screenshot: API category data]
Note: please be gentle, it's my first post here =]
To get the data as shown in your image, the following headers and endpoint are needed:
import requests
headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
    'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
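From there you can drill into the returned JSON. A minimal sketch continuing from the snippet above, assuming the menu response is a list of category entries (the key names are hypothetical; inspect r.json() to see the real structure):
data = r.json()

# Hypothetical keys for illustration; check the actual response structure.
for category in data:
    print(category.get('Nome'), '->', category.get('Url'))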
Not sure exactly what your issue is here.
But if you want to see the content of the response, and not just the 200/400 status codes, you need to add '.content' to your print.
Eg.
import requests

# Create a session
s = requests.Session()

# Example connection variables; probably not required for your use case.
setCookieUrl = 'https://www...'
headersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request
p = s.get(setCookieUrl, json=bodyJson, headers=headersJson)
print(p)            # prints just the response status (200 etc.)
# print(p.headers)  # the response headers
# print(p.content)  # the body of the response
# print(s.cookies)  # the session cookies
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install it and import it, it's just continuing what you were doing to actively get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, and so you can pull your data out of the page based on CSS selectors, like this:
site_data = soup.select('selector')
site_data is a list of the elements matching that selector, so a simple for loop and a list to collect your items would suffice (as an example, getting the links for each book on a bookstore site).
For example, if I were trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.select("a")  # list of all items with this selector
for link in links:
    sites.append(link.get('href'))  # collect each anchor's link
Also, a helpful tip: when you inspect the page (right-click and press 'Inspect' at the bottom), you can see the code for the page. Go to the HTML, find the data you want, right-click it and select Copy -> Copy selector. This will make it really easy for you to get the data you want on that site.
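For instance, a copied selector can be dropped straight into select_one (the selector string below is a made-up example of what DevTools might hand you):
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://example.com').text, 'html.parser')

# Hypothetical selector as produced by Copy -> Copy selector in DevTools.
selector = '#content > div.menu > ul > li:nth-child(2) > a'
element = soup.select_one(selector)
if element is not None:
    print(element.get_text(strip=True), element.get('href'))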
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/

send a post request to a website with multiple form tags using requests in python

Good evening,
I'm trying to write a program that extracts the sell price of certain stocks and shares on a website called hl.co.uk.
As you can imagine, you have to search for the stock you want to see the sale price of.
My code so far is as follows:
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.hl.co.uk/shares"
page = requests.get(url)
parsed_html = soup(page.content, 'html.parser')
form = parsed_html.find('form', id="stock_search")
input_tag = form.find('input').get('name')
submit = form.find('input', id="stock_search_submit").get('alt')
post_data = {input_tag: "fgt", "alt": submit}
I have been able to extract the correct form tag and the input names I require, but the website has multiple forms on this page.
How can I submit a POST request to this website using the data I have in "post_data", to that specific form, so it searches for the stock/share I desire and then gives me the next page?
Thanks in advance
Actually, when you submit the form from the homepage, it redirects you to the target page with a URL looking like this: "https://www.hl.co.uk/shares/search-for-investments?stock_search_input=abc&x=56&y=35&category_list=CEHGINOPW". So in my opinion, instead of submitting the homepage form, you should directly call the target page with your own GET parameters; the URL you're supposed to call will look like this: https://www.hl.co.uk/shares/search-for-investments?stock_search_input=[your_keywords]
Hope this helped you
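A minimal sketch of that idea, assuming the stock_search_input parameter is all the endpoint really needs (the x/y/category_list parameters from the captured URL may be optional):
import requests
from bs4 import BeautifulSoup

# Call the results page directly instead of submitting the homepage form.
params = {'stock_search_input': 'fgt'}  # your search keywords
r = requests.get('https://www.hl.co.uk/shares/search-for-investments', params=params)
soup = BeautifulSoup(r.text, 'html.parser')
# ...parse the result rows out of soup from here.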
This is a pretty general problem which you can use Google Chrome's DevTools to solve. Basically:
1- Navigate to the page where you have a form and a bunch of fields.
In your case that's the hl.co.uk shares page with the stock search form.
2- Then choose the XHR tab under the Network tab, which will filter out all Fetch and XHR requests. These requests are generally sent after a form submission, and most of the time they return a JSON with the resulting data.
3- Make sure you enable the Preserve log checkbox at the top left so the list doesn't refresh when the form is submitted.
4- Submit the form; you'll then see a bunch of requests being made. Inspect them to hopefully find what you're looking for.
In this case I found this URL endpoint, which returns the results as its response.
https://www.hl.co.uk/ajax/funds/fund-search/search?investment=&companyid=1324&sectorid=132&wealth=&unitTypePref=&tracker=&payment_frequency=&payment_type=&yield=&standard_ocf=&perf12m=&perf36m=&perf60m=&fund_size=&num_holdings=&start=0&rpp=20&lo=0&sort=fd.full_description&sort_dir=asc&
You can see all the query parameters here, such as companyid and sectorid; what you need to do is change those and just make a request to that URL. Then you'll get the relevant information.
To retrieve those companyid and sectorid values, you can send a GET request to the page https://www.hl.co.uk/shares/search-for-investments?stock_search_input=ftg&x=17&y=23&category_list=CEHGINOPW, which has those dropdowns, and filter the HTML to find the values.
You can check the BS4 documentation for finding tags inside the HTML source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
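A minimal sketch of calling that endpoint with requests (untested; the companyid/sectorid values are just the ones from the captured URL above, and start/rpp appear to control paging):
import requests

params = {
    'companyid': 1324,
    'sectorid': 132,
    'start': 0,
    'rpp': 20,
    'lo': 0,
    'sort': 'fd.full_description',
    'sort_dir': 'asc',
}
r = requests.get('https://www.hl.co.uk/ajax/funds/fund-search/search', params=params)
print(r.json())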

Beautifulsoup - Submit form data

I am trying to programmatically download (open) data from a website using BeautifulSoup.
The website uses a PHP form where you need to submit input data, and it then outputs the resulting links, apparently within this form.
My approach was as follows
Step 1: post form data via request
Step 2: parse resulting links via BeautifulSoup
However, it seems this is not working / I am doing something wrong: the POST does not seem to have any effect, so Step 2 is not even possible as no results are available.
Here is my code:
from bs4 import BeautifulSoup
import requests

def get_text_link(soup):
    'Returns list of links to individual legal texts'
    ergebnisse = soup.findAll(attrs={"class": "einErgebnis"})
    if ergebnisse:
        links = [el.find("a", href=True).get("href") for el in ergebnisse]
    else:
        links = []
    return links

url = "https://www.justiz.nrw.de/BS/nrwe2/index.php#solrNrwe"

# Post a specific date range to get one year of data
params = {'von': '01.01.2018',
          'bis': '31.12.2018',
          'absenden': 'Suchen'}

response = requests.post(url, data=params)
content = response.content
soup = BeautifulSoup(content, "lxml")
resultlinks_to_parse = get_text_link(soup)  # is always an empty list
# proceed from here....
# proceed from here....
Can someone tell me what I am doing wrong? I am not really familiar with requests' POST. The form field for "bis", e.g., looks as follows:
<input id="bis" type="text" name="bis" size="10" value="">
If my approach is flawed, I would appreciate any hint on how to deal with this kind of site.
Thanks!
I've found the issue in your request.
My investigation gives the following params as available:
gerichtstyp:
gerichtsbarkeit:
gerichtsort:
entscheidungsart:
date:
von: 01.01.2018
bis: 31.12.2018
validFrom:
von2:
bis2:
aktenzeichen:
schlagwoerter:
q:
method: stem
qSize: 10
sortieren_nach: relevanz
absenden: Suchen
advanced_search: true
I think the qSize param is mandatory for your POST request.
So, you have to replace your params by:
params = {
    'von': '01.01.2018',
    'bis': '31.12.2018',
    'absenden': 'Suchen',
    'qSize': 10
}
Doing this, here are my results when I print resultlinks_to_parse
print(resultlinks_to_parse)
OUTPUT:
[
'http://www.justiz.nrw.de/nrwe/lgs/detmold/lg_detmold/j2018/03_S_69_18_Urteil_20181031.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1122_17_Urteil_20180126.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/13_TaBV_10_18_Beschluss_20181123.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1810_17_Urteil_20180629.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/10_Sa_1811_17_Urteil_20180629.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_Sa_1196_17_Urteil_20180118.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_Sa_1775_17_Urteil_20180614.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/11_SaGa_9_18_Urteil_20180712.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/12_Sa_748_18_Urteil_20181009.html',
'http://www.justiz.nrw.de/nrwe/arbgs/hamm/lag_hamm/j2018/12_Sa_755_18_Urteil_20181106.html'
]

Python3 Parse more than 30 videos at a time from youtube

I recently decided to get into parsing with Python. I made up a project where I need to get data from all of a YouTuber's videos. I decided it would be easy to just go to the videos tab on their channel and parse it for all of its links. However, when I do parse it, I can only get 30 videos at a time. I was wondering why this is, since the link never seems to change when you load more, and whether there is a way around it.
Here is my code
import bs4 as bs
import requests

# Use the channel's videos page URL here (placeholder)
page = requests.get("https://www.youtube.com/user/<channel>/videos")
soup = bs.BeautifulSoup(page.text, 'html.parser')

# The video links carry these classes on the old YouTube layout
k = soup.find_all("a", "yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2")

storage = open('data.csv', 'a')
for link in k:
    storage.write(link.get('href') + '\n')
storage.close()
Any help is appreciated, thanks
I should first say that I agree with @jonrsharpe. Using the YouTube API is the more sensible choice.
However, if you must do this by scraping, here's a suggestion.
Let's take MKBHD's videos page as an example. The Load more button at the bottom of the page has a button tag with this attribute (You can use your browser's 'inspect element' feature to see this value):
data-uix-load-more-href="/browse_ajax?action_continuation=1&continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"
When you click the Load more button, it makes an AJAX request to this /browse_ajax url. The response is a JSON object that looks like this:
{
content_html: "the html for the videos",
load_more_widget_html: " \n\n\n\n \u003cbutton class=\"yt-uix-button yt-uix-button-size-default yt-uix-button-default load-more-button yt-uix-load-more browse-items-load-more-button\" type=\"button\" onclick=\";return false;\" aria-label=\"Load more\n\" data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk03Z0JBQSUzRCUzRA%253D%253D\" data-uix-load-more-target-id=\"channels-browse-content-grid\"\u003e\u003cspan class=\"yt-uix-button-content\"\u003e \u003cspan class=\"load-more-loading hid\"\u003e\n \u003cspan class=\"yt-spinner\"\u003e\n \u003cspan class=\"yt-spinner-img yt-sprite\" title=\"Loading icon\"\u003e\u003c\/span\u003e\n\nLoading...\n \u003c\/span\u003e\n\n \u003c\/span\u003e\n \u003cspan class=\"load-more-text\"\u003e\n Load more\n\n \u003c\/span\u003e\n\u003c\/span\u003e\u003c\/button\u003e\n\n\n"
}
The content_html contains the HTML for the new page of videos. You can parse that to get the videos on that page. To get to the next page, you need to use the load_more_widget_html value and extract the URL, which again looks like:
data-uix-load-more-href="/browse_ajax?action_continuation=1&continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"
The only thing in that URL that changes is the value of the continuation parameter. You can keep making requests to this 'continuation' URL until the returned JSON object does not have the load_more_widget_html.
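Putting that together, a rough sketch of the loop (this mirrors the old YouTube layout described in this answer, so it may no longer work as-is; the a.yt-uix-tile-link selector and the mkbhd URL are assumptions for illustration):
import html
import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}

# Grab the first page of videos and its initial continuation URL.
start = requests.get('https://www.youtube.com/user/mkbhd/videos', headers=headers)
soup = BeautifulSoup(start.text, 'html.parser')
video_links = [a.get('href') for a in soup.select('a.yt-uix-tile-link')]
match = re.search(r'data-uix-load-more-href="([^"]+)"', start.text)

while match:
    # Each AJAX response is JSON; the next page of videos is in content_html.
    ajax_url = 'https://www.youtube.com' + html.unescape(match.group(1))
    data = requests.get(ajax_url, headers=headers).json()
    page = BeautifulSoup(data['content_html'], 'html.parser')
    video_links.extend(a.get('href') for a in page.select('a.yt-uix-tile-link'))

    # The next continuation URL lives in load_more_widget_html, if present.
    widget = data.get('load_more_widget_html', '')
    match = re.search(r'data-uix-load-more-href="([^"]+)"', widget)

print(len(video_links), 'video links collected')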

Python data scraping - Elementary concepts

I am trying to get my head around how data scraping works when you look past the HTML (i.e. DOM scraping).
I've been trying to write a simple Python script to automatically retrieve the number of people that have seen a specific ad: the part where it says '3365 people viewed Peter's place this week.'
At first I tried to see if that was displayed in the HTML code, but I could not find it. I did some research and saw that not everything will be in the code, as it can be processed by the browser through JavaScript or other languages that I don't quite understand yet. I then inspected the element and realised that I would need to use the Python libraries 'requests' and 'lxml.html'. So I wrote this code:
import requests
import lxml.html
response = requests.get('https://www.airbnb.co.uk/rooms/501171')
resptext = lxml.html.fromstring(response.text)
final = resptext.text_content()
finalu = final.encode('utf-8')
file = open('file.txt', 'wb')  # 'wb' because finalu holds utf-8 encoded bytes
file.write(finalu)
file.close()
With that, I get a file with all the text on the web page, but not the text that I am looking for! Which is the magic number 3365.
So my question is: how do I get it? I have thought that maybe I am not using the correct language to get the DOM; maybe it is done with JavaScript and I am only using lxml. However, I have no idea.
The DOM element you are looking at is updated after page load with what looks like an AJAX call with the following request URL:
https://www.airbnb.co.uk/rooms/501171/personalization.json
If you GET that URL, it will return the following JSON data:
{
    "extras_price": "£30",
    "preview_bar_phrases": {
        "steps_remaining": "<strong>1 step</strong> to list"
    },
    "flag_info": {},
    "user_is_admin": false,
    "is_owned_by_user": false,
    "is_instant_bookable": true,
    "instant_book_reasons": {
        "within_max_lead_time": null,
        "within_max_nights": null,
        "enough_lead_time": true,
        "valid_reservation_status": null,
        "not_country_or_village": true,
        "allowed_noone": null,
        "allowed_everyone": true,
        "allowed_socially_connected": null,
        "allowed_experienced_guest": null,
        "is_instant_book_host": true,
        "guest_has_profile_pic": null
    },
    "instant_book_experiments": {
        "ib_max_nights": 14
    },
    "lat": 51.5299601405844,
    "lng": -0.12462748035984603,
    "localized_people_pricing_description": "£30 / night after 2 guests",
    "monthly_price": "£4200",
    "nightly_price": "£150",
    "security_deposit": "",
    "social_connections": {
        "connected": null
    },
    "staggered_price": "£4452",
    "weekly_price": "£1050",
    "show_disaster_info": false,
    "cancellation_policy": "Strict",
    "cancellation_policy_link": "/home/cancellation_policies#strict",
    "show_fb_cta": true,
    "should_show_review_translations": false,
    "listing_activity_data": {
        "day": {
            "unique_views": 226,
            "total_views": 363
        },
        "week": {
            "unique_views": 3365,
            "total_views": 5000
        }
    },
    "should_hide_action_buttons": false
}
If you look under "listing_activity_data" you will find the information you seek. Appending /personalization.json to any room URL seems to return this data (for now).
Update per the user agent issues
It looks like they are filtering requests to this URL based on user agent. I had to set the user agent on the urllib request in order to fix this:
import urllib2
import json

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('http://www.airbnb.co.uk/rooms/501171/personalization.json', None, headers)
data = json.load(urllib2.urlopen(req))  # avoid shadowing the json module
print(data['listing_activity_data']['week']['unique_views'])
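For completeness, a Python 3 equivalent using requests (same endpoint, just a different HTTP client):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('http://www.airbnb.co.uk/rooms/501171/personalization.json', headers=headers)
data = r.json()
print(data['listing_activity_data']['week']['unique_views'])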
So first of all you need to figure out if that section of code has any unique tags. If you look at the HTML tree, you have:
html > body > #room > ....... > #book-it-urgency-commitment > div > div > ... > div#media-body > b
The data you need is stored in a 'b' tag. I'm not sure about using lxml, but I usually use BeautifulSoup for my scraping.
You can reference http://www.crummy.com/software/BeautifulSoup/bs4/doc/. It's pretty straightforward.
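A rough sketch of that approach with BeautifulSoup (with the caveat from the answer above: this element is populated by an AJAX call after page load, so it will only work if the value is present in the server-rendered HTML; the selector is an assumption based on the tag tree):
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.airbnb.co.uk/rooms/501171', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')

# Assumed selector built from the tag tree above; the element may be
# missing because it is filled in client-side.
tag = soup.select_one('#book-it-urgency-commitment #media-body b')
print(tag.get_text(strip=True) if tag else 'not present in the server HTML')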
