How to match an exact word in a JSON soup? - Python

I am parsing through Patient Metadata scraped from a URL, and I am trying to access the 'PatientID' field. However, there is also an 'OtherPatientIDs' field, which my search also grabs.
I have tried looking into using regular expressions, but I am unclear on how to match an EXACT string or how to incorporate that into my code.
So at the moment, I have done:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
PatientID = "PatientID"
lines = soup.decode('utf8').split("\n")
for line in lines:
    if "PatientID" in line:
        PatientID = line.split(':')[1].split('\"')[1].split('\"')[0]
        print(PatientID)
This successfully finds the values of both the PatientID AND the OtherPatientIDs fields. How do I specify that I only want the PatientID field?
EDIT:
I was asked to give an example of what I get with response.text, and it's of the form:
{
    "ID" : "shqowihdojcoughwoeh",
    "LastUpdate" : "20190507",
    "MainTags" : {
        "OtherPatientIDs" : "0304992098",
        "PatientBirthDate" : "29/04/1803",
        "PatientID" : "92879837",
        "PatientName" : "LASTNAME^FIRSTNAME"
    },
    "Type" : "Patient"
}

Why not use the json library instead?
import json
import requests

response = requests.get(url)
data = json.loads(response.text)
# PatientID lives under the nested MainTags object
print(data['MainTags']['PatientID'])
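If you do want to keep the line-by-line matching from the question, anchoring a regex on the quoted key avoids the partial hit on OtherPatientIDs. A minimal sketch, reusing the lines list built above:
import re

# The quoted key '"PatientID"' can never appear inside "OtherPatientIDs",
# so this only matches the exact field and captures its value
pattern = re.compile(r'"PatientID"\s*:\s*"([^"]*)"')

for line in lines:
    match = pattern.search(line)
    if match:
        print(match.group(1))  # -> 92879837 for the sample above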

Related

Gravity Forms API with Python

The documentation of the API is here, and I am trying to implement this request in Python:
//retrieve entries created on a specific day (use the date_created field)
//this example returns entries created on September 10, 2019
https://localhost/wp-json/gf/v2/entries?search={"field_filters": [{"key":"date_created","value":"09/10/2019","operator":"is"}]}
But when I try to do this in Python with the following code, I get an error:
import json
import oauthlib
from requests_oauthlib import OAuth1Session

consumer_key = ""
client_secret = ""
session = OAuth1Session(consumer_key,
                        client_secret=client_secret,
                        signature_type=oauthlib.oauth1.SIGNATURE_TYPE_QUERY)
url = 'https://localhost/wp-json/gf/v2/entries?search={"field_filters": [{"key":"date_created","value":"09/01/2023","operator":"is"}]}'
r = session.get(url)
print(r.content)
The error message is:
ValueError: Error trying to decode a non urlencoded string. Found invalid characters: {']', '['} in the string: 'search=%7B%22field_filters%22:%20[%7B%22key%22:%22date_created%22,%22value%22:%2209/01/2023%22,%22operator%22:%22is%22%7D]%7D'. Please ensure the request/response body is x-www-form-urlencoded.
One solution is to parameterize the URL:
import requests
import json

url = 'https://localhost/wp-json/gf/v2/entries'
params = {
    "search": {"field_filters": [{"key":"date_created","value":"09/01/2023","operator":"is"}]}
}
headers = {'Content-type': 'application/json'}
response = session.get(url, params=params, headers=headers)
print(response.json())
But the retrieved entries are not filtered by the specified date.
In the official documentation they give a date in the format "09/01/2023", but in my dataset the format is "2023-01-10 19:16:59".
Do I have to transform the format? I tried a different format for the date:
from datetime import datetime

date_created = "09/01/2023"
date_created = datetime.strptime(date_created, "%d/%m/%Y").strftime("%Y-%m-%d %H:%M:%S")
What alternative solutions can I test?
What if you use the urllib.parse.urlencode function? Your code would then look like:
import json
import oauthlib
from requests_oauthlib import OAuth1Session
import urllib.parse

consumer_key = ""
client_secret = ""
session = OAuth1Session(consumer_key,
                        client_secret=client_secret,
                        signature_type=oauthlib.oauth1.SIGNATURE_TYPE_QUERY)
params = {
    "search": {"field_filters": [{"key":"date_created","value":"09/01/2023","operator":"is"}]}
}
encoded_params = urllib.parse.urlencode(params)
url = f'https://localhost/wp-json/gf/v2/entries?{encoded_params}'
r = session.get(url)
print(r.content)
Hope that helps.
I had the same problem and found a solution with this code:
import json
import urllib.parse

params = {
    'search': json.dumps({
        'field_filters': [
            { 'key': 'date_created', 'value': '2023-01-01', 'operator': 'is' }
        ],
        'mode': 'all'
    })
}
encoded_params = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
url = 'http://localhost/depot_git/wp-json/gf/v2/forms/1/entries?' + encoded_params + '&paging[page_size]=999999999'  # page size forced manually
I'm not really sure why it works, as I'm an absolute beginner with Python, but I found that you need double quotes ( " ) in the URL instead of single quotes ( ' ), so the solution by William Castrillon wasn't enough.
As for the date format, Gravity Forms seems to understand DD/MM/YYYY. It doesn't need a time either.
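Putting the two answers together, a minimal sketch of the full request, keeping the question's placeholder credentials and localhost URL; the json.dumps step is what produces the double quotes Gravity Forms appears to require:
import json
import urllib.parse

import oauthlib
from requests_oauthlib import OAuth1Session

consumer_key = ""      # your Gravity Forms API key
client_secret = ""     # your Gravity Forms API secret

session = OAuth1Session(consumer_key,
                        client_secret=client_secret,
                        signature_type=oauthlib.oauth1.SIGNATURE_TYPE_QUERY)

# json.dumps serializes the filter with double quotes, which Gravity Forms
# parses; a str()-ed Python dict would use single quotes and be ignored
search = json.dumps({
    'field_filters': [
        {'key': 'date_created', 'value': '2023-01-01', 'operator': 'is'}
    ],
    'mode': 'all'
})

# quote_via=quote percent-encodes the brackets and braces that made
# oauthlib reject the query string in the original error
encoded = urllib.parse.urlencode({'search': search}, quote_via=urllib.parse.quote)
url = f'https://localhost/wp-json/gf/v2/entries?{encoded}'
r = session.get(url)
print(r.json())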

How can I pass the body parameters in requests.put in Python?

I am working with an API to create a survey. The API URL doesn't mention any 'data' parameter. The docs have a body parameter, but I am not sure if I should send a string or JSON.
data = {
    'Id' : '60269c21-b4f8-4f58-b070-262179269d64',
    'Json' : '{ "pages" :[{ "name" : "page1" , "elements" :[{ "type" : "text" , "name" : "question1" }]}]}'
}
dddd = requests.put(f'https://api.surveyjs.io/private/Surveys/changeJson?accessKey={access_key}', data = data)
# dddd.status_code is 200 for this code.
But there is no change in the survey at all.
Can someone help me with this?
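One thing worth checking, as a guess rather than a confirmed fix: requests' data= parameter sends the payload form-encoded, while many JSON APIs expect an application/json body. A minimal sketch using the json= parameter instead, with access_key as in the question:
import requests

payload = {
    'Id': '60269c21-b4f8-4f58-b070-262179269d64',
    # The survey definition itself stays a JSON string, as in the question
    'Json': '{ "pages" :[{ "name" : "page1" , "elements" :[{ "type" : "text" , "name" : "question1" }]}]}'
}

# json= serializes the dict into a JSON request body and sets
# Content-Type: application/json; data= would send form fields instead
response = requests.put(
    f'https://api.surveyjs.io/private/Surveys/changeJson?accessKey={access_key}',
    json=payload,
)
print(response.status_code, response.text)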

Python BeautifulSoup Find data inside a variable

I am trying to use BeautifulSoup to get some data from a website. The data is returned as follows:
window._sharedData = {
    "config": {
        "csrf_token": "DMjhhPBY0i6ZyMKYQPjMjxJhRD0gkRVQ",
        "viewer": null,
        "viewerId": null
    },
    "country_code": "IN",
    "language_code": "en",
    "locale": "en_US"
}
How can I load this into json.loads so I can extract the data?
You need to convert it to valid JSON first by removing the variable assignment, then parse the remaining string:
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
text = soup.find('script').text
text = text.replace('window._sharedData = ', '')
data = json.loads(text)
country_code = data['country_code']
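If the script text doesn't start exactly with that assignment (a trailing semicolon, or other statements in the same tag), a regex can cut out just the object literal. A sketch, assuming the same html variable:
import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# Find the script tag that contains the assignment
script = soup.find('script', string=re.compile(r'window\._sharedData'))

# Capture everything between the '=' and the final closing brace
match = re.search(r'window\._sharedData\s*=\s*(\{.*\})\s*;?\s*$',
                  script.text, re.DOTALL)
data = json.loads(match.group(1))
print(data['country_code'])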
Or you can use the eval function to turn it into a Python dictionary. For that you need to replace the JSON literals with their Python equivalents and evaluate the remaining string:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
text = soup.find('script').text
# str.replace needs a string, not None; true/false would need the same
# treatment ('True'/'False') if they appeared in the data
text = text.replace('null', 'None')
text = text.replace('window._sharedData = ', '')
data = eval(text)
country_code = data['country_code']
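eval will execute anything that happens to be in the page, so on untrusted input a safer variant (my suggestion, not part of the original answer) is ast.literal_eval, which only accepts Python literals:
import ast
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
text = soup.find('script').text
text = text.replace('window._sharedData = ', '').replace('null', 'None')

# literal_eval parses dict/list/str/None literals but refuses function
# calls and other executable expressions, unlike eval
data = ast.literal_eval(text.strip().rstrip(';'))
print(data['country_code'])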

How to make a POST request in Scrapy that requires a request payload

I am trying to parse data from this website.
In the Network section of inspect element I found the link https://busfor.pl/api/v1/searches, which is used for a POST request that returns the JSON I am interested in.
But making this POST request requires a request payload with some dictionary.
I assumed it worked like the normal formdata we use to make a FormRequest in Scrapy, but it returns a 403 error.
I have already tried the following.
url = "https://busfor.pl/api/v1/searches"
formdata = {
    "from_id": d_id,
    "to_id": a_id,
    "on": '2019-10-10',
    "passengers": 1,
    "details": []
}
yield scrapy.FormRequest(url, callback=self.parse, formdata=formdata)
This returns a 403 error.
I also tried this by referring to one of the StackOverflow posts:
url = "https://busfor.pl/api/v1/searches"
payload = [{
    "from_id": d_id,
    "to_id": a_id,
    "on": '2019-10-10',
    "passengers": 1,
    "details": []
}]
yield scrapy.Request(url, self.parse, method="POST", body=json.dumps(payload))
But even this returns the same error.
Can someone help me figure out how to parse the required data using Scrapy?
The way to send POST requests with JSON data is the latter, but you are passing the wrong JSON to the site: it expects a dictionary, not a list of dictionaries.
So instead of:
payload = [{
    "from_id": d_id,
    "to_id": a_id,
    "on": '2019-10-10',
    "passengers": 1,
    "details": []
}]
You should use:
payload = {
    "from_id": d_id,
    "to_id": a_id,
    "on": '2019-10-10',
    "passengers": 1,
    "details": []
}
Another thing you didn't notice is the headers passed with the POST request. Sometimes a site uses IDs and hashes to control access to its API; in this case I found two values that appear to be needed, X-CSRF-Token and X-NewRelic-ID. Luckily for us, these two values are available on the search page.
Here is a working spider; the search result is available in the method self.parse_search.
import json
import scrapy


class BusForSpider(scrapy.Spider):
    name = 'busfor'
    start_urls = ['https://busfor.pl/autobusy/Sopot/Gda%C5%84sk?from_id=62113&on=2019-10-09&passengers=1&search=true&to_id=3559']
    search_url = 'https://busfor.pl/api/v1/searches'

    def parse(self, response):
        payload = {
            "from_id": '62113',
            "to_id": '3559',
            "on": '2019-10-10',
            "passengers": 1,
            "details": []
        }
        # Both tokens are embedded in the search page itself
        csrf_token = response.xpath('//meta[@name="csrf-token"]/@content').get()
        newrelic_id = response.xpath('//script/text()').re_first(r'xpid:"(.*?)"')
        headers = {
            'X-CSRF-Token': csrf_token,
            'X-NewRelic-ID': newrelic_id,
            'Content-Type': 'application/json; charset=UTF-8',
        }
        yield scrapy.Request(self.search_url, callback=self.parse_search,
                             method="POST", body=json.dumps(payload), headers=headers)

    def parse_search(self, response):
        data = json.loads(response.text)
        # process the search results in `data` here

Unable to see whole pdf content after indexing pdf file in ES

Below is my code to index a PDF URL in Elasticsearch:
import base64

import requests
from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "description" : "Extract attachment information",
    "processors" : [
        {
            "attachment" : {
                "field" : "data"
            }
        }
    ]
}
es.index(index='_ingest', doc_type='pipeline', id='attachment', body=body)

url = 'https://pubs.vmware.com/nsx-63/topic/com.vmware.ICbase/PDF/nsx_63_cross_vc_install.pdf'
response = requests.get(url)
data = base64.b64encode(response.content).decode('ascii')

result2 = es.index(index='my_index', doc_type='my_type', pipeline='attachment',
                   body={'data': data})
doc = es.get(index='my_index', doc_type='my_type', id=result2['_id'],
             _source_exclude=['data'])
print(doc['_source']['attachment']['content'])
The last line prints the contents of the PDF file only up to page 63 out of 126.
Do I need to change any settings somewhere? (I already tried increasing the console output; it didn't help.)
Please provide pointers on this.
There is a limit of 100000 characters extracted.
You can change it in the pipeline definition by setting indexed_chars.
See https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html
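For example, the pipeline definition from the question could be recreated with indexed_chars added; a sketch, where -1 removes the limit entirely (any positive number caps extraction at that many characters):
from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "description" : "Extract attachment information",
    "processors" : [
        {
            "attachment" : {
                "field" : "data",
                "indexed_chars" : -1  # -1 = unlimited; the default is 100000
            }
        }
    ]
}
es.index(index='_ingest', doc_type='pipeline', id='attachment', body=body)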
