I am trying to parse data from this website.
In the Network tab of the browser's developer tools I found the link https://busfor.pl/api/v1/searches, which is used for a POST request that returns the JSON I am interested in.
But to make this POST request there is a request payload containing a dictionary.
I assumed it worked like the normal form data we use to build a FormRequest in Scrapy, but it returns a 403 error.
I have already tried the following.
url = "https://busfor.pl/api/v1/searches"
formdata = {"from_id" : d_id
,"to_id" : a_id
,"on" : '2019-10-10'
,"passengers" : 1
,"details" : []
}
yield scrapy.FormRequest(url, callback=self.parse, formdata=formdata)
This returns a 403 error.
I also tried the following, based on a Stack Overflow post.
url = "https://busfor.pl/api/v1/searches"
payload = [{"from_id" : d_id
,"to_id" : a_id
,"on" : '2019-10-10'
,"passengers" : 1
,"details" : []
}]
yield scrapy.Request(url, self.parse, method = "POST", body = json.dumps(payload))
But even this returns the same error.
Can someone help me figure out how to parse the required data using Scrapy?
The latter is the correct way to send a POST request with JSON data, but you are passing the wrong JSON to the site: it expects a dictionary, not a list of dictionaries.
So instead of:
payload = [{"from_id" : d_id
,"to_id" : a_id
,"on" : '2019-10-10'
,"passengers" : 1
,"details" : []
}]
You should use:
payload = {"from_id" : d_id
,"to_id" : a_id
,"on" : '2019-10-10'
,"passengers" : 1
,"details" : []
}
Another thing you missed are the headers passed to the POST request. Sites sometimes use IDs and hashes to control access to their API; in this case I found two values that appear to be needed: X-CSRF-Token and X-NewRelic-ID. Luckily for us, both values are available on the search page.
Here is a working spider; the search results are available in the method self.parse_search.
import json

import scrapy


class BusForSpider(scrapy.Spider):
    name = 'busfor'
    start_urls = ['https://busfor.pl/autobusy/Sopot/Gda%C5%84sk?from_id=62113&on=2019-10-09&passengers=1&search=true&to_id=3559']
    search_url = 'https://busfor.pl/api/v1/searches'

    def parse(self, response):
        payload = {
            "from_id": '62113',
            "to_id": '3559',
            "on": '2019-10-10',
            "passengers": 1,
            "details": [],
        }
        # Both tokens are embedded in the search page itself.
        csrf_token = response.xpath('//meta[@name="csrf-token"]/@content').get()
        newrelic_id = response.xpath('//script/text()').re_first(r'xpid:"(.*?)"')
        headers = {
            'X-CSRF-Token': csrf_token,
            'X-NewRelic-ID': newrelic_id,
            'Content-Type': 'application/json; charset=UTF-8',
        }
        yield scrapy.Request(
            self.search_url,
            callback=self.parse_search,
            method="POST",
            body=json.dumps(payload),
            headers=headers,
        )

    def parse_search(self, response):
        data = json.loads(response.text)
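The structure of the JSON returned by the API is not shown here, so as a hypothetical sketch of how parse_search could continue (the 'trips' key is an assumption; inspect the actual response to find the real keys):

    def parse_search(self, response):
        data = json.loads(response.text)
        # 'trips' is a hypothetical key -- check the actual JSON structure
        for trip in data.get('trips', []):
            yield trip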
The documentation of the API is here, and I am trying to implement this request in Python:
//retrieve entries created on a specific day (use the date_created field)
//this example returns entries created on September 10, 2019
https://localhost/wp-json/gf/v2/entries?search={"field_filters": [{"key":"date_created","value":"09/10/2019","operator":"is"}]}
But when I try to do this in Python with the following code, I get an error:
import json
import oauthlib
from requests_oauthlib import OAuth1Session

consumer_key = ""
client_secret = ""
session = OAuth1Session(
    consumer_key,
    client_secret=client_secret,
    signature_type=oauthlib.oauth1.SIGNATURE_TYPE_QUERY,
)
url = 'https://localhost/wp-json/gf/v2/entries?search={"field_filters": [{"key":"date_created","value":"09/01/2023","operator":"is"}]}'
r = session.get(url)
print(r.content)
The error message is :
ValueError: Error trying to decode a non urlencoded string. Found invalid characters: {']', '['} in the string: 'search=%7B%22field_filters%22:%20[%7B%22key%22:%22date_created%22,%22value%22:%2209/01/2023%22,%22operator%22:%22is%22%7D]%7D'. Please ensure the request/response body is x-www-form-urlencoded.
One solution is to parameterize the url:
import requests
import json

url = 'https://localhost/wp-json/gf/v2/entries'
params = {
    "search": {"field_filters": [{"key": "date_created", "value": "09/01/2023", "operator": "is"}]}
}
headers = {'Content-type': 'application/json'}
response = session.get(url, params=params, headers=headers)
print(response.json())
But the retrieved entries are not filtered by the specified date.
In the official documentation they give a date in the format "09/01/2023", but in my dataset the format is "2023-01-10 19:16:59".
Do I have to transform the format? I tried a different format for the date:
from datetime import datetime

date_created = "09/01/2023"
date_created = datetime.strptime(date_created, "%d/%m/%Y").strftime("%Y-%m-%d %H:%M:%S")
What alternative solutions can I test?
What if you use the urllib.parse.urlencode function? Your code would then look like:
import json
import oauthlib
import urllib.parse
from requests_oauthlib import OAuth1Session

consumer_key = ""
client_secret = ""
session = OAuth1Session(
    consumer_key,
    client_secret=client_secret,
    signature_type=oauthlib.oauth1.SIGNATURE_TYPE_QUERY,
)
params = {
    "search": {"field_filters": [{"key": "date_created", "value": "09/01/2023", "operator": "is"}]}
}
encoded_params = urllib.parse.urlencode(params)
url = f'https://localhost/wp-json/gf/v2/entries?{encoded_params}'
r = session.get(url)
print(r.content)
Hope that helps.
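For reference, urlencode called on a raw nested dict stringifies the inner dictionary with Python's repr, so the search argument ends up single-quoted. A standalone sketch:

import urllib.parse

params = {
    "search": {"field_filters": [{"key": "date_created", "value": "09/01/2023", "operator": "is"}]}
}
print(urllib.parse.urlencode(params))
# -> search=%7B%27field_filters%27%3A+%5B%7B%27key%27%3A+%27date_created%27%2C+...
# %27 is a single quote: the nested dict was stringified with Python's repr.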
I had the same problem and found a solution with this code:
import json
import urllib.parse

params = {
    'search': json.dumps({
        'field_filters': [
            {'key': 'date_created', 'value': '2023-01-01', 'operator': 'is'}
        ],
        'mode': 'all'
    })
}
encoded_params = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
# number of results per page forced manually
url = 'http://localhost/depot_git/wp-json/gf/v2/forms/1/entries?' + encoded_params + '&paging[page_size]=999999999'
I'm not really sure what made it work, as I'm an absolute beginner with Python, but I found that you need double quotes ( " ) in the URL instead of single quotes ( ' ), so the solution by William Castrillon wasn't enough. json.dumps serializes the dictionary with the double quotes the API expects, whereas urlencode on a raw dict stringifies it with Python's single-quoted repr.
As for the date format, Gravity Forms seems to understand DD/MM/YYYY. It doesn't need a time either.
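Putting the pieces together, here is a minimal sketch of the whole request, assuming the localhost endpoint and the OAuth1Session setup from the question (the date value follows the DD/MM/YYYY format mentioned above):

import json
import urllib.parse
import oauthlib
from requests_oauthlib import OAuth1Session

session = OAuth1Session(
    "",  # consumer_key
    client_secret="",
    signature_type=oauthlib.oauth1.SIGNATURE_TYPE_QUERY,
)
# json.dumps produces the double quotes the API expects;
# quote_via=urllib.parse.quote percent-encodes them for the URL.
params = {
    'search': json.dumps({
        'field_filters': [
            {'key': 'date_created', 'value': '01/09/2023', 'operator': 'is'}
        ],
        'mode': 'all'
    })
}
encoded_params = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
url = 'https://localhost/wp-json/gf/v2/entries?' + encoded_params
r = session.get(url)
print(r.json())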
I am sending a POST request to the TAP payment gateway in order to save the card. The URL expects two things: the source (the recently generated token) in the body, and the {customer_id} inside the URL itself. I am trying string concatenation, but it returns an error like Invalid JSON request.
views.py:
ifCustomerExits = CustomerIds.objects.filter(email=email)
totalData = ifCustomerExits.count()
if totalData > 1:
    for data in ifCustomerExits:
        customerId = data.customer_id
        print("CUSTOMER_ID CREATED ONE:", customerId)
        tokenId = request.session.get('generatedTokenId')
        payload = {
            "source": tokenId
        }
        headers = {
            'authorization': "Bearer sk_test_**********************",
            'content-type': "application/json"
        }
        # HERE DOWN IS THE url of TAP COMPANY'S API:
        url = "https://api.tap.company/v2/card/%7B" + customerId + "%7D"
        response = requests.request("POST", url, data=payload, headers=headers)
        json_data3 = json.loads(response.text)
        card_id = json_data3["id"]
        return sponsorParticularPerson(request, sponsorProjectId)
Their expected url = https://api.tap.company/v2/card/{customer_id}
Their documentation link: https://tappayments.api-docs.io/2.0/cards/create-a-card
Try this: first convert the dict into JSON, then send the POST request with requests.post:
import json
...
customerId = str(data.customer_id)
print("CUSTOMER_ID CREATED ONE:", customerId)
tokenId = request.session.get('generatedTokenId')
payload = {
    'source': tokenId
}
headers = {
    'authorization': "Bearer sk_test_**************************",
    'content-type': "application/json"
}
pd = json.dumps(payload)
# HERE DOWN IS THE url of TAP COMPANY'S API:
url = "https://api.tap.company/v2/card/%7B" + customerId + "%7D"
response = requests.post(url, data=pd, headers=headers)
json_data3 = json.loads(response.text)
card_id = json_data3["id"]
return sponsorParticularPerson(request, card_id)
Please tell me whether this works or not.
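As a side note, requests can do the serialization for you: passing the dict with the json keyword encodes it and sets the Content-Type header automatically. A sketch of an equivalent call, assuming the same url and payload as above:

response = requests.post(url, json=payload, headers={'authorization': "Bearer sk_test_..."})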
I am parsing Patient metadata scraped from a URL, and I am trying to access the 'PatientID' field. However, there is also an 'OtherPatientIDs' field, which is also grabbed by my search.
I have tried looking into regular expressions, but I am unclear on how to match an EXACT string or how to incorporate one into my code.
So at the moment, I have done:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
PatientID = "PatientID"
lines = soup.decode('utf8').split("\n")
for line in lines:
    if "PatientID" in line:
        PatientID = line.split(':')[1].split('\"')[1].split('\"')[0]
        print(PatientID)
This successfully finds the values of both the PatientID AND OtherPatientIDs fields. How do I specify that I only want the PatientID field?
EDIT:
I was asked to give an example of what I get with response.text, and it's of the form:
{
    "ID" : "shqowihdojcoughwoeh",
    "LastUpdate" : "20190507",
    "MainTags" : {
        "OtherPatientIDs" : "0304992098",
        "PatientBirthDate" : "29/04/1803",
        "PatientID" : "92879837",
        "PatientName" : "LASTNAME^FIRSTNAME"
    },
    "Type" : "Patient"
}
Why not use the json library instead?
import json
import requests
response = requests.get(url)
data = json.loads(response.text)
print(data['MainTags']['PatientID'])
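For what it's worth, requests can also decode the JSON body itself, so the json.loads step can be shortened to:

data = response.json()
print(data['MainTags']['PatientID'])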
I am working on updating a Confluence page via Python and the Confluence API.
I have found a function to write the data to my page; unfortunately, it creates a new page with my data and the old page becomes an archived version.
I have been searching the reference material and cannot see why I am getting a new page instead of having the data appended to the end of the existing page. The only solution I can think of is to copy the body, append the new data to it, and then create the new page ... but I am thinking there should be a way to append.
The write function code I found / am leveraging is as follows:
def write_data(auth, html, pageid, title=None):
    info = get_page_info(auth, pageid)
    print("after get page info")
    ver = int(info['version']['number']) + 1
    ancestors = get_page_ancestors(auth, pageid)
    anc = ancestors[-1]
    del anc['_links']
    del anc['_expandable']
    del anc['extensions']
    if title is not None:
        info['title'] = title
    data = {
        'id': str(pageid),
        'type': 'page',
        'title': info['title'],
        'version': {'number': ver},
        'ancestors': [anc],
        'body': {
            'storage': {
                'representation': 'storage',
                'value': str(html),
            }
        }
    }
    data = json.dumps(data)
    pprint(data)
    url = '{base}/{pageid}'.format(base=BASE_URL, pageid=pageid)
    r = requests.put(
        url,
        data=data,
        auth=auth,
        headers={'Content-Type': 'application/json'},
    )
    r.raise_for_status()
I am starting to think copying / appending to the body is the option, but I hope someone else has encountered this issue.
Not an elegant solution, but I went with the copy-the-old-body-and-append option.
Basically, I wrote a simple function to return the existing body:
def get_page_content(auth, pageid):
    url = '{base}/{pageid}?expand=body.storage'.format(
        base=BASE_URL,
        pageid=pageid)
    r = requests.get(url, auth=auth)
    r.raise_for_status()
    return r.json()['body']['storage']['value']
In this example, just appending (+=) a new string to the existing body.
def write_data(auth, html, pageid, title=None):
    info = get_page_info(auth, pageid)
    page_content = get_page_content(auth, pageid)
    ver = int(info['version']['number']) + 1
    ancestors = get_page_ancestors(auth, pageid)
    anc = ancestors[-1]
    del anc['_links']
    del anc['_expandable']
    del anc['extensions']
    page_content += "\n\n"
    page_content += html
    if title is not None:
        info['title'] = title
    data = {
        'id': str(pageid),
        'type': 'page',
        'title': info['title'],
        'version': {'number': ver},
        'ancestors': [anc],
        'body': {
            'storage': {
                'representation': 'storage',
                'value': str(page_content),
            }
        }
    }
    data = json.dumps(data)
    url = '{base}/{pageid}'.format(base=BASE_URL, pageid=pageid)
    r = requests.put(
        url,
        data=data,
        auth=auth,
        headers={'Content-Type': 'application/json'},
    )
    r.raise_for_status()
    print("Wrote '%s' version %d" % (info['title'], ver))
    print("URL: %s%d" % (VIEW_URL, pageid))
I should note that since this posts to a Confluence body, the text being passed is HTML in storage format: '\n' for a newline does not work; you need to pass '<br />' etc.
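For example, a minimal sketch of passing storage-format HTML to the function above (the text is just a placeholder):

new_html = "<br /><p>Appended on this update.</p>"  # placeholder storage-format HTML
write_data(auth, new_html, pageid)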
If anyone has a more elegant solution, would welcome the suggestions.
Dan.
I am trying to create a merge request using the GitLab Merge Request API with Python and the Python requests package. This is a snippet of my code:
import requests, json

MR = 'http://www.gitlab.com/api/v4/projects/317/merge_requests'
id = '317'
gitlabAccessToken = 'MySecretAccessToken'
sourceBranch = 'issue110'
targetBranch = 'master'
title = 'title'
description = 'description'
header = {
    'PRIVATE-TOKEN': gitlabAccessToken,
    'id': id,
    'title': title,
    'source_branch': sourceBranch,
    'target_branch': targetBranch
}
reply = requests.post(MR, headers=header)
status = json.loads(reply.text)
but I keep getting the following message in the reply
{'error': 'title is missing, source_branch is missing, target_branch is missing'}
What should I change in my request to make it work?
Apart from PRIVATE-TOKEN, all the parameters should be passed as form-encoded parameters, meaning:
header = {'PRIVATE-TOKEN': gitlabAccessToken}
params = {
    'id': id,
    'title': title,
    'source_branch': sourceBranch,
    'target_branch': targetBranch
}
reply = requests.post(MR, data=params, headers=header)
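A quick way to verify the call succeeded, using only what requests already provides:

reply = requests.post(MR, data=params, headers=header)
reply.raise_for_status()  # raises an HTTPError on 4xx/5xx responses
print(reply.json())       # the created merge request, decoded from JSON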