How to download a web table using a Python or R script

I am trying to download data from the link below.
Link: https://dataminer2.pjm.com/feed/act_sch_interchange
I used the Python code below, but it is not working.
import urllib.request
from pprint import pprint
from html_table_parser import HTMLTableParser
import pandas as pd

def url_get_contents(url):
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()

xhtml = url_get_contents('https://dataminer2.pjm.com/feed/act_sch_interchange')
p = HTMLTableParser()
p.feed(xhtml)
pprint(p.tables[1])
print("\n\nPANDAS DATAFRAME\n")
print(pd.DataFrame(p.tables[1]))
I am a beginner in coding, so please let me know if I did anything wrong in the code.
In addition to downloading the data, I want to be able to download the table after changing the dates and text boxes on the page.
Is this possible? Any help is appreciated, thanks in advance.

Downloading the HTML from that URL will unfortunately not give you the data.
What the page contains (simplified) is some very basic HTML plus JavaScript code that loads the data in the background.
One way to see what the website actually loads is Chrome's developer tools (watch the Network tab while the page loads). Having had a quick look (no, just putting the URL in the browser will not work), it seems that in order to load the JSON that contains the data you will need to construct HTTP requests with the correct headers (e.g. Ocp-Apim-Subscription-Key) to query the API where the data is exposed directly.

You can try this:
import requests
import pandas as pd

# The Ocp-Apim-Subscription-Key header is what the Data Miner page sends when it queries its API.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0",
    "Ocp-Apim-Subscription-Key": "d408630449804e23b07148259c96b24a",
    "Connection": "keep-alive",
}

r = requests.get("https://api.pjm.com/api/v1/act_sch_interchange", headers=headers)
df = pd.DataFrame(r.json()["items"])
If you want to change the date range, you should change the URL:
url = "https://api.pjm.com/api/v1/act_sch_interchange?rowCount=25&sort=datetime_beginning_utc&order=Asc&startRow=1&isActiveMetadata=true&fields=actual_flow,datetime_beginning_ept,datetime_beginning_utc,datetime_ending_ept,datetime_ending_utc,inadv_flow,sched_flow,tie_line&datetime_beginning_ept=1/27/2021 00:00to2/1/2021 23:59"

Related

Web scraping (Protein Data Bank) highly nested tags using beautifulsoup and Python3

I am trying to create a CSV file of all protein names, their PDB (Protein Data Bank) IDs, and the experimental method, based on an advanced search query on RCSB PDB. There are 444 search results and I wanted to create a neat CSV file. Here is the link of the search.
I have written the following script to extract information about the first search result, but the output is None.
import requests
from bs4 import BeautifulSoup
source = requests.get(url) # url is same as mentioned above
soup = BeautifulSoup(source.text, 'lxml')
item1 = soup.find('div', class_='row results-item')
The HTML code of the page seems to be highly nested and confusing.
TL;DR
I'm trying to get the following into a CSV, but the HTML is highly nested :(
1) PDB ID (4 digit alphanumeric code)
2) Protein complex name (Ex : The Fk1 domain of FKBP51....)
3) Method (X-ray diffraction, NMR etc)
Any help or advice will be highly appreciated!
Thank you in advance :)
Actually, you can't scrape this kind of website with BeautifulSoup alone. The site loads its search results from a backend API rather than embedding them in the HTML. However, here is a solution that fetches the data directly in JSON format:
import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/80.0.3987.162 Mobile Safari/537.36")
}

# The same JSON payload the search page posts to its backend.
payload = {
    "query": {"type": "group", "logical_operator": "and", "nodes": [
        {"type": "group", "logical_operator": "and", "nodes": [
            {"type": "group", "logical_operator": "and", "nodes": [
                {"type": "terminal", "service": "text", "parameters":
                    {"negation": False, "value": "plasmodium falciparum"}, "node_id": 0},
                {"type": "group", "logical_operator": "and", "nodes": [
                    {"type": "terminal", "service": "text", "parameters":
                        {"operator": "exact_match", "negation": False, "value": "Homosapiens",
                         "attribute": "rcsb_entity_source_organism.ncbi_scientific_name"},
                     "node_id": 1}]}]}],
         "label": "text"}],
     "label": "query-builder"},
    "return_type": "entry",
    "request_options": {"scoring_strategy": "combined",
                        "sort": [{"sort_by": "score", "direction": "desc"}],
                        "pager": {"start": 0, "rows": 100}},
    "request_info": {"src": "ui", "query_id": "6878ab86935e083352a6914232c8b2e5"},
}

response = requests.post('https://www.rcsb.org/search/data', headers=headers, json=payload)
print(response.json())
You can also adjust the payload values to change the search and get different responses. Hope this helps!
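If the end goal is still a CSV, here is a minimal post-processing sketch. It assumes the search response contains a result_set list whose entries carry an identifier field (verify the actual shape in your browser's Network tab); the search endpoint typically returns only IDs and scores, so the protein name and experimental method would need a separate per-ID lookup:
import csv

# Assumed response shape: {"result_set": [{"identifier": "1ABC", ...}, ...]}
results = response.json().get("result_set", [])

with open("pdb_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["pdb_id"])
    for entry in results:
        writer.writerow([entry.get("identifier")])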

Crawling a site with iframe

I am trying to crawl data from this site. It uses multiple iframes for different components.
When I try to open one of the iframe URLs in the browser, it opens in that particular session, but in another incognito/private session it doesn't. The same happens when I try it via requests or wget.
I have tried using requests together with a Session, but that doesn't work either. Here is my code snippet:
import requests
s = requests.Session()
s.get('https://www.epc.shell.com/')
r = s.get('https://www.epc.shell.com/welcome.asp')
r.text
The last line only returns JavaScript text with an error saying the URL is invalid.
I know Selenium can solve this problem but I am considering it as last option.
Is it possible to crawl this URL with requests (or without using Javascript)? If yes, any help would be appreciated. If no, is there any alternative lightweight Javascript library in Python that can achieve this?
Your issue can be easily solved by adding custom headers to your requests. All in all, your code should look like this:
import requests
s = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Language": "en-US,en;q=0.5"}
s.get('https://www.epc.shell.com/', headers = headers)
r = s.get('https://www.epc.shell.com/welcome.asp', headers = headers)
print(r.text)
(Do note that it is almost always recommended to use headers when sending requests).
I hope this helps!

Real page content isn't what I get with Requests and BeautifulSoup

As sometimes happens to me, I can't access with requests everything that I can see on the page in the browser, and I would like to know why. On these pages I am particularly interested in the comments. Does anyone have an idea how to access those comments, please? Thanks!
import requests
from bs4 import BeautifulSoup
import re
url='https://aukro.cz/uzivatel/paluska_2009?tab=allReceived&type=all&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
searched = soup.find_all('td', class_='col1')
print(searched)
It is worth knowing that you can get the scoring info for this user as JSON with a POST request. Handle the JSON as you require.
import requests
import pandas as pd

headers = {
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}

url = 'https://aukro.cz/backend/api/users/profile?username=paluska_2009'
response = requests.post(url, headers=headers, data="")
response.raise_for_status()

# Flatten the nested JSON into a single-row table.
data = pd.json_normalize(response.json())
df = pd.DataFrame(data)
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8', index=False)
I ran your code and analyzed the content you get back for that page.
It seems aukro.cz is built with Angular (it uses ng-app), so the content is rendered dynamically and apparently can't be loaded using requests alone. You could try Selenium in headless mode to scrape the part of the content you are looking for; a minimal sketch is shown below.
Let me know if you need further instructions for it.
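A minimal headless-Selenium sketch along those lines, assuming Chrome and a matching chromedriver are installed; the crude sleep and the CSS selector at the end are illustrative only and depend on the rendered markup:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")          # run Chrome without opening a window

driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
driver.get("https://aukro.cz/uzivatel/paluska_2009?tab=allReceived&type=all&page=1")
time.sleep(5)                               # crude wait for Angular to render; WebDriverWait would be more robust

# Grab the final DOM after JavaScript has run and parse it as usual.
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Illustrative selector; inspect the rendered page to find the real one for the comments.
for cell in soup.find_all("td", class_="col1"):
    print(cell.get_text(strip=True))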
To address your curiosity about QHarr's answer: if you load the URL in the Chrome browser and trace the network calls, you will find a POST request to https://aukro.cz/backend/api/users/profile?username=paluska_2009 whose response is a JSON document containing your desired information.
This is a common way of scraping data. On most sites you will find that part of the page is loaded through separate API calls, and Chrome's Network tool is handy for finding the URL and POST parameters of those requests.
Let me know if you need any further details.

How to use the requests module (Python 2.7) to crawl a JavaScript website?

I am trying to crawl real-time stock info from the Taiwan Stock Exchange, so I used their API to access the desired information, but something strange happened.
For example, I can access the information with the following API link, which returns nice JSON for me in the browser. But it does not return JSON in my program.
My code is the following:
import requests

url = "http://mis.twse.com.tw/stock/api/getStockInfo.jsp?ex_ch=tse_t00.tw|otc_o00.tw|tse_1101.tw|tse_2330.tw&json=1&delay=0&_=1516681976742"
print url

def get_data(query_url):
    headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0",
               'Accept-Language': 'en-US'
               #'Accept-Language': 'zh-tw'
               }
    req = requests.session()
    req.get('http://mis.twse.com.tw/stock/index.jsp', headers=headers)
    #print req.cookies['JSESSIONID']
    #print req.cookies.get_dict()
    response = req.get(query_url, cookies=req.cookies.get_dict(), headers=headers)
    return response.text  # json.loads(response.text)

a = get_data(query_url=url)
And it will simply return u'\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n'.
Is there something wrong in my code? Or is it simply not possible to access this kind of web page with the requests module?
Any other suggestions? Thanks a lot!
PS: Their API has the format http://mis.twse.com.tw/stock/api/getStockInfo.jsp?ex_ch=[tickers]&[time]
PPS: I tried the selenium module and its WebDriver. It worked but was very slow, which is why I would like to use requests.
As seen in the attached image, the text is not English, so what gets printed is probably its ASCII representation. Decode the result as UTF-8 and you should see a different result.
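A minimal sketch of forcing that decoding, assuming the response body really is UTF-8; it mirrors the session setup and URL from the question:
import json
import requests

url = "http://mis.twse.com.tw/stock/api/getStockInfo.jsp?ex_ch=tse_t00.tw|otc_o00.tw|tse_1101.tw|tse_2330.tw&json=1&delay=0&_=1516681976742"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0'}

s = requests.session()
s.get('http://mis.twse.com.tw/stock/index.jsp', headers=headers)  # pick up the session cookies first

resp = s.get(url, headers=headers)
resp.encoding = 'utf-8'            # override the encoding requests guessed
data = json.loads(resp.text)
print(data)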

How to download webpage in python, url includes GET parameters

I have a URL such as
http://www.example-url.com/content?param1=1&param2=2
in particular I am testing in on
http://ws.parlament.ch/votes/councillors?concillorNumberFilter=2565&format=json
How do I get the content of such a URL so that the GET parameters are taken into account as well?
How can I save it to a file?
How can I access multiple URLs like this, either in parallel or asynchronously (saving to a file in a response-received callback, like in JavaScript)?
I have tried
import urllib
urllib.urlretrieve("http://ws.parlament.ch/votes/councillors?concillorNumberFilter=2565&format=json", "file.json")
but I am getting the content of http://ws.parlament.ch/votes/councillors instead of the JSON I want.
You can use urllib, but there are other libraries I know of that make it a lot easier in different situations. For example, if you also want user authentication handled, you can use Requests.
For this situation you can use httplib2, for example; here is a clean, small piece of code which takes the GET into consideration (source).
import httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://example.org/", "GET")
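If you go with the Requests library mentioned above instead, here is a minimal sketch that fetches the same endpoint with its GET parameters and saves the response to a file. The parameters are the ones from the question's URL, the output filename is just an example, and a User-Agent header is set because, as the next answer notes, the service appears to refuse requests without one:
import requests

params = {"concillorNumberFilter": 2565, "format": "json"}
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get("http://ws.parlament.ch/votes/councillors", params=params, headers=headers)
resp.raise_for_status()

# Write the raw JSON body to disk (filename is illustrative).
with open("councillors_2565.json", "w") as f:
    f.write(resp.text)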
It seems that you need to set the user agent of the connection, otherwise it will refuse to give you the data. I also use urllib2.Request() instead of the standard urlretrieve() or urlopen(), mostly because that function allows GET and POST requests and lets the programmer set the user agent.
import urllib2, json
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
header = { 'User-Agent' : user_agent }
fullurl = "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json"
response = urllib2.Request(fullurl, headers=header)
data = urllib2.urlopen(response)
print json.loads(data.read())
Some extra information about headers in python
If you want to keep using httplib2, here is the code for that:
import httplib2
header = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }
fullurl = "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json"
http = httplib2.Http(".cache")
response, content = http.request(fullurl, "GET", headers=header)
print content
The data printed by my examples can be saved to a file with json.dump() (note the argument order: first the data object, then an open file handle).
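A minimal sketch of that last step, where data is the parsed JSON from the urllib2 example above and the filename is illustrative:
import json

# 'data' comes from json.loads(...) in the urllib2 example above.
with open("councillor_2565.json", "w") as f:
    json.dump(data, f)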
