Scraping dynamic website with unchanging urls


I need to scrape the data of all dental clinics. What's the next step? Can someone help me out? I now have two versions of the code:
1. Here I don't know how to set up a for loop over all the pages:
from requests_html import HTMLSession
url = "https://www.dent.cz/zubni-lekari"
s = HTMLSession()
r = s.get(url)
r.html.render(sleep=1)  # run the page's JavaScript before querying it
for x in range(1, 31):
    base = f'//*[@id="main"]/div/div[3]/div[1]/div/div[{x}]'
    clinic = r.html.xpath(f'{base}/h3', first=True)
    adress = r.html.xpath(f'{base}/p[1]', first=True)
    phone = r.html.xpath(f'{base}/p[1]/strong[1]', first=True)
    email = r.html.xpath(f'{base}/p[1]/strong[2]', first=True)
    # xpath(..., first=True) returns None when nothing matches,
    # so test for None instead of wrapping the call in try/except
    clinics_list = {
        "Clinic": clinic.text if clinic is not None else "None",
        "Adress": adress.text if adress is not None else "None",
        "Phone": phone.text if phone is not None else "None",
        "Email": email.text if email is not None else "None",
    }
    print(clinics_list)
2. Here I don't know how to get the rest of the data (addresses, phone, email):
import requests
api_url = "https://is-api.dent.cz/api/v1/web/workplaces"
payload = {
    "deleted": False,
    "filter": "accepts_new_patients=false",
    "fulltext": "",
    "page": 1,  # <--- you can implement pagination via this parameter
    "per_page": 30,
    "sort_fields": "name",
}
data = requests.post(api_url, json=payload).json()
for item in data["data"]:
    print(item["name"])

You just need to change the page number and extract the information from the JSON response:
import requests
api_url = "https://is-api.dent.cz/api/v1/web/workplaces"
payload = {
    "deleted": False,
    "filter": "accepts_new_patients=false",
    "fulltext": "",
    "page": 1,  # <--- you can implement pagination via this parameter
    "per_page": 30,
    "sort_fields": "name",
}
PAGES = 233
# range() excludes its upper bound, so add 1 to include the last page
for i in range(1, PAGES + 1):
    payload['page'] = i
    response = requests.post(api_url, json=payload)
    data = response.json()
The output looks like this:
{'data': [{'id': 'df313eba-7447-4496-bca5-abd8a840394a',
'name': '#staycool s.r.o.',
'regional_chamber': {'id': 'ce0d8c8a-99db-46ed-85ff-87b6650c677a',
'name': 'OSK UHERSKÉ HRADIŠTĚ',
'checked': False,
'code': '',
'tooltip': ''},
'provider': {'id': '256f41bd-a9f1-452b-99e7-63f77005ecfa',
'name': '#staycool s.r.o.',
'is_also_member': False,
'registration_number': '11982861',
'identification_number': '',
'type_cares': []},
'accepts_new_patients': False,
'address': {'city': 'Uherské Hradiště',
'state': '',
'country_name': '',
'print': 'J.E.Purkyně 365, 686 06 Uherské Hradiště',
'street': 'J.E.Purkyně 365',
'postcode': '686 06',
'name': ''},
'contact': {'email1': '',
'email2': '',
'full': '',
'phone1': '',
'phone2': '',
'web': '',
'deleted': False},
'membes': [],
'insurance_companies': []},
So we just need to extract the data from each dictionary inside the data list:
# range() excludes its upper bound, so add 1 to include the last page
for i in range(1, PAGES + 1):
    payload['page'] = i
    response = requests.post(api_url, json=payload)
    data = response.json()
    for item in data['data']:
        clinic = item['name']
        address_city = item['address']['city']
        address_street = item['address']['street']
        address_postcode = item['address']['postcode']
        phone = item['contact']['phone1']
        email = item['contact']['email1']
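To keep the extracted values instead of overwriting them on every iteration, one option is to flatten each entry into a row and collect the rows. The following is only a sketch: the helper names `flatten_clinic` and `rows_to_csv` and the column names are made up for illustration, and the sample entry is a trimmed copy of the output shown above, so no request is sent here.

```python
import csv
import io


def flatten_clinic(item):
    """Turn one entry of the API's "data" list into a flat row."""
    address = item.get("address") or {}
    contact = item.get("contact") or {}
    return {
        "clinic": item.get("name", ""),
        "street": address.get("street", ""),
        "city": address.get("city", ""),
        "postcode": address.get("postcode", ""),
        "phone": contact.get("phone1", ""),
        "email": contact.get("email1", ""),
    }


def rows_to_csv(rows):
    """Serialise flat rows to CSV text (write it to a file if needed)."""
    fieldnames = ["clinic", "street", "city", "postcode", "phone", "email"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample entry trimmed from the output shown above.
sample_page = {
    "data": [
        {
            "name": "#staycool s.r.o.",
            "address": {
                "city": "Uherské Hradiště",
                "street": "J.E.Purkyně 365",
                "postcode": "686 06",
            },
            "contact": {"email1": "", "phone1": ""},
        }
    ]
}

rows = [flatten_clinic(item) for item in sample_page["data"]]
print(rows_to_csv(rows))
```

In the paging loop, the same idea would be `rows.extend(flatten_clinic(item) for item in data['data'])` for each page, with the CSV written once at the end.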

Related

Create a new dictionary from a nested JSON output after parsing

In Python 3 I need to get a JSON response from an API call and parse it so that I get a dictionary that contains only the data I need.
The final dictionary I expect to get is as follows:
{'Severity Rules': ('cc55c459-eb1a-11e8-9db4-0669bdfa776e', ['cc637182-eb1a-11e8-9db4-0669bdfa776e']), 'auto_collector': ('57e9a4ec-21f7-4e0e-88da-f0f1fda4c9d1', ['0ab2470a-451e-11eb-8856-06364196e782'])}
the JSON response returns the following output:
{
'RuleGroups': [{
'Id': 'cc55c459-eb1a-11e8-9db4-0669bdfa776e',
'Name': 'Severity Rules',
'Order': 1,
'Enabled': True,
'Rules': [{
'Id': 'cc637182-eb1a-11e8-9db4-0669bdfa776e',
'Name': 'Severity Rule',
'Description': 'Look for default severity text',
'Enabled': False,
'RuleMatchers': None,
'Rule': '\\b(?P<severity>DEBUG|TRACE|INFO|WARN|ERROR|FATAL|EXCEPTION|[I|i]nfo|[W|w]arn|[E|e]rror|[E|e]xception)\\b',
'SourceField': 'text',
'DestinationField': 'text',
'ReplaceNewVal': '',
'Type': 'extract',
'Order': 21520,
'KeepBlockedLogs': False
}],
'Type': 'user'
}, {
'Id': '4f6fa7c6-d60f-49cd-8c3d-02dcdff6e54c',
'Name': 'auto_collector',
'Order': 4,
'Enabled': True,
'Rules': [{
'Id': '2d6bdc1d-4064-11eb-8856-06364196e782',
'Name': 'auto_collector',
'Description': 'DO NOT CHANGE!! Created via API coralogix-blocker tool',
'Enabled': False,
'RuleMatchers': None,
'Rule': 'AUTODISABLED',
'SourceField': 'subsystemName',
'DestinationField': 'subsystemName',
'ReplaceNewVal': '',
'Type': 'block',
'Order': 1,
'KeepBlockedLogs': False
}],
'Type': 'user'
}]
}
I was able to create a dictionary that contains the name and the RuleGroupsID, like that:
response = requests.get(url, headers=headers)
output = response.json()
outputlist = output["RuleGroups"]
groupRuleName = [li['Name'] for li in outputlist]
groupRuleID = [li['Id'] for li in outputlist]
# Create a dictionary of NAME + ID
ruleDic = {}
for key in groupRuleName:
    for value in groupRuleID:
        ruleDic[key] = value
        groupRuleID.remove(value)
        break
Which gave me a simple dictionary:
{'Severity Rules': 'cc55c459-eb1a-11e8-9db4-0669bdfa776e', 'Rewrites': 'ddbaa27e-1747-11e9-9db4-0669bdfa776e', 'Extract': '0cb937b6-2354-d23a-5806-4559b1f1e540', 'auto_collector': '4f6fa7c6-d60f-49cd-8c3d-02dcdff6e54c'}
but when I tried to parse it as nested JSON things just didn't work.

In the end, I managed to write a function that returns this dictionary.
It breaks the JSON into three lists by the needed elements (Name, Id, and Rules from the first level of nesting), then builds another list from the nested JSON (everything listed under Rules) that keeps only the "Id" values.
Finally, it creates the dictionary by zipping the lists created earlier.
def get_filtered_rules() -> dict:
    # outputlist is the "RuleGroups" list parsed from the response above
    groupRuleName = [li['Name'] for li in outputlist]
    groupRuleID = [li['Id'] for li in outputlist]
    ruleIDList = [li['Rules'] for li in outputlist]
    ruleIDListClean = []
    ruleClean = []
    ruleDic = {}  # returned as-is if there are no rule groups
    for sublist in ruleIDList:
        try:
            lstRule = [item['Rule'] for item in sublist]
            ruleClean.append(lstRule)
            ruleContent = list(zip(groupRuleName, ruleClean))
            ruleContentDictionary = dict(ruleContent)
            lstID = [item['Id'] for item in sublist]
            ruleIDListClean.append(lstID)
            # Create a dictionary of NAME + ID + RuleID
            ruleDic = dict(zip(groupRuleName, zip(groupRuleID, ruleIDListClean)))
        except Exception as e:
            print(e)
    return ruleDic
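For comparison, the same name to (group id, rule ids) mapping can probably be built in one pass with a dictionary comprehension, without the intermediate lists. This is a minimal sketch, not the poster's code: the function name `rule_ids_by_group` is invented here, and the sample is a trimmed copy of the response shown in the question.

```python
def rule_ids_by_group(rule_groups):
    """Map each group name to (group id, [ids of the rules in that group])."""
    return {
        group["Name"]: (group["Id"], [rule["Id"] for rule in group["Rules"]])
        for group in rule_groups
    }


# Trimmed copy of the response from the question (only the keys used here).
output = {
    "RuleGroups": [
        {
            "Id": "cc55c459-eb1a-11e8-9db4-0669bdfa776e",
            "Name": "Severity Rules",
            "Rules": [{"Id": "cc637182-eb1a-11e8-9db4-0669bdfa776e"}],
        },
        {
            "Id": "4f6fa7c6-d60f-49cd-8c3d-02dcdff6e54c",
            "Name": "auto_collector",
            "Rules": [{"Id": "2d6bdc1d-4064-11eb-8856-06364196e782"}],
        },
    ]
}

rule_dic = rule_ids_by_group(output["RuleGroups"])
print(rule_dic)
```

Because each group is handled on its own, a name can never be paired with another group's rule ids, which is a risk when separate lists are zipped together.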

How to extract values from list and store it as dictionary(key-value pair)?

I need to extract two values from this list of dictionaries and store them as key-value pairs.
Sample data is attached below. I need to extract "Name" and "Service" from this input and store them in a dictionary, where "Name" is the key and the corresponding "Service" is its value.
Input:
response = {
'Roles': [
{
'Path': '/',
'Name': 'Heera',
'Age': '25',
'Policy': 'Policy1',
'Start_Month': 'January',
'PolicyDocument':
{
'Date': '2012-10-17',
'Statement': [
{
'id': '',
'RoleStatus': 'New_Joinee',
'RoleType': {
'Service': 'Service1'
},
'Action': ''
}
]
},
'Duration': 3600
},
{
'Path': '/',
'Name': 'Prem',
'Age': '40',
'Policy': 'Policy2',
'Start_Month': 'April',
'PolicyDocument':
{
'Date': '2018-11-27',
'Statement': [
{
'id': '',
'RoleStatus': 'Senior',
'RoleType': {
'Service': ''
},
'Action': ''
}
]
},
'Duration': 2600
},
]
}
From this input, I need output as a dictionary type.
Output Format: { Name : Service }
Output:
{ "Heera":"Service1","Prem" : " "}
My try:
Role_name = []
# response = {...}  (the input specified above)
roles = response['Roles']
for role in roles:
    Role_name.append(role['Name'])
print(Role_name)
I need to pair each name with its corresponding service. Any help would be much appreciated.
Thanks in advance.

You just have to write one long chain of keys that reaches down to the 'Service' key.
You also have a syntax error on the lines 'Start_Month': 'January') and 'Start_Month': 'April'): each has a closing parenthesis with no matching opening one.
Fix that and run the following code:
output_dict = {}
for r in response['Roles']:
    output_dict[r["Name"]] = r['PolicyDocument']['Statement'][0]['RoleType']['Service']
print(output_dict)
Output:
{'Heera': 'Service1', 'Prem': ''}

You can do it like this (note that this builds a list of one-item dictionaries rather than a single dictionary):
liste = []
for role in response['Roles']:
    liste.append(
        {
            role['Name']: role['PolicyDocument']['Statement'][0]['RoleType']['Service'],
        }
    )
print(liste)

Your input data is structured rather strangely, and I am not sure what the ) characters are doing next to the months, since they make the input invalid. Here is a working script, assuming you remove the parentheses from your input.
response = {
'Roles': [
{
'Path': '/',
'Name': 'Heera',
'Age': '25',
'Policy': 'Policy1',
'Start_Month': 'January',
'PolicyDocument':
{
'Date': '2012-10-17',
'Statement': [
{
'id': '',
'RoleStatus': 'New_Joinee',
'RoleType': {
'Service': 'Service1'
},
'Action': ''
}
]
},
'Duration': 3600
},
{
'Path': '/',
'Name': 'Prem',
'Age': '40',
'Policy': 'Policy2',
'Start_Month': 'April',
'PolicyDocument':
{
'Date': '2018-11-27',
'Statement': [
{
'id': '',
'RoleStatus': 'Senior',
'RoleType': {
'Service': ''
},
'Action': ''
}
]
},
'Duration': 2600
},
]
}
output = {}
for i in response['Roles']:
    output[i['Name']] = i['PolicyDocument']['Statement'][0]['RoleType']['Service']
print(output)

This should give you what you want in a variable called role_services:
role_services = {}
for role in response['Roles']:
    for st in role['PolicyDocument']['Statement']:
        role_services[role['Name']] = st['RoleType']['Service']
This goes through all of the statements in the data structure, but be aware that it overwrites a key-value pair whenever the same name appears in more than one entry or a role has more than one statement.
A reference on for loops might be helpful here: using if statements inside the loop lets you extend this to check whether an item already exists.
Hope that helps.
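If overwriting is a concern, one way around it is to collect every service per name in a list. The sketch below assumes that a list per name is an acceptable output shape, which the question does not ask for; the function name `services_by_role` and the second statement for Heera are invented to show the effect.

```python
def services_by_role(response):
    """Map each role name to the list of every service found for it."""
    role_services = {}
    for role in response.get("Roles", []):
        statements = role.get("PolicyDocument", {}).get("Statement", [])
        for statement in statements:
            service = statement.get("RoleType", {}).get("Service", "")
            # setdefault creates the list the first time a name is seen,
            # so later statements are appended instead of overwriting.
            role_services.setdefault(role["Name"], []).append(service)
    return role_services


# Trimmed sample; the second statement for Heera is hypothetical.
sample = {
    "Roles": [
        {
            "Name": "Heera",
            "PolicyDocument": {
                "Statement": [
                    {"RoleType": {"Service": "Service1"}},
                    {"RoleType": {"Service": "Service2"}},
                ]
            },
        },
        {
            "Name": "Prem",
            "PolicyDocument": {"Statement": [{"RoleType": {"Service": ""}}]},
        },
    ]
}

print(services_by_role(sample))
```

The `.get(...)` calls also mean a role without a PolicyDocument or RoleType is skipped or given an empty service instead of raising a KeyError.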

Process JSON Responses, Editing and Sending

I am working with an API. I get a response from the API which looks like this:
from oauthlib.oauth2 import BackendApplicationClient
from requests.auth import HTTPBasicAuth
from requests_oauthlib import OAuth2Session
auth = HTTPBasicAuth(client_id, client_secret)
client = BackendApplicationClient(client_id=client_id)
oauth = OAuth2Session(client=client)
token = oauth.fetch_token(token_url=token_url, auth=auth)
client = OAuth2Session(client_id, token=token, auto_refresh_url=token_url,token_updater=token_saver)
token_saver = []
device_policy = client.get('{URL}/v1?ids='+ids)
I get this response
[{'id': '',
'name': 'A Name',
'description': '',
'platform_name': 'Windows',
'groups': [],
'enabled': True,
'created_by': 'An Email',
'created_timestamp': '2019-03-28T12:51:30.989736386Z',
'modified_by': 'An Email',
'modified_timestamp': '2019-11-19T21:14:53.0189419Z',
'settings': {'enforcement_mode': 'MONITOR_ENFORCE',
'end_user_notification': 'SILENT',
'classes': [{'id': 'ANY', 'action': 'FULL_ACCESS', 'exceptions': []},
{'id': 'IMAGING', 'action': 'FULL_ACCESS', 'exceptions': []},
{'id': 'MASS_STORAGE', 'action': 'BLOCK_ALL', 'exceptions': []},
{'id': 'MOBILE', 'action': 'BLOCK_ALL', 'exceptions': []},
{'id': 'PRINTER', 'action': 'FULL_ACCESS', 'exceptions': []},
{'id': 'WIRELESS', 'action': 'BLOCK_ALL', 'exceptions': []}]}}]
Each class has a list that holds exceptions. The API accepts a patch (not really a patch): if this data is resubmitted with the exceptions field holding the contents of the dictionary below, the exception is accepted.
file_info = {
    "class": "ANY",
    "vendor_name": "",
    "product_name": "",
    "serial_number": serial_number,
    "combined_id": "",
    "action": "FULL_ACCESS",
    "match_method": "VID_PID_SERIAL"
}
The challenge I have is taking the first document and adding the exception to it to create this patch request. I can "walk" the document but cannot work out how to build the new body to send. I think I want to do something like this, but without append, as it throws an error:
new_walk_json = walk_json.append(['classes'][0]['exceptions']['Test'])

I realised the .update function can be used.
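The answer above does not show any code, so here is one possible way to build the new body. It is a sketch under assumptions: it uses the list's own append on a copy of the policy rather than .update, the helper name `add_exception` and the serial number are invented, and the exact body the API expects is not shown in the question. Note that `list.append` modifies the list in place and returns None, so its result should not be assigned to a new variable.

```python
import copy


def add_exception(policy, class_id, exception):
    """Return a copy of policy with exception added to the matching class."""
    patched = copy.deepcopy(policy)  # leave the original API response untouched
    for device_class in patched["settings"]["classes"]:
        if device_class["id"] == class_id:
            device_class["exceptions"].append(exception)
            return patched
    raise KeyError(f"no class with id {class_id!r}")


# Trimmed copy of one policy from the response shown above.
policy = {
    "name": "A Name",
    "settings": {
        "enforcement_mode": "MONITOR_ENFORCE",
        "classes": [
            {"id": "ANY", "action": "FULL_ACCESS", "exceptions": []},
            {"id": "MASS_STORAGE", "action": "BLOCK_ALL", "exceptions": []},
        ],
    },
}

file_info = {
    "class": "ANY",
    "vendor_name": "",
    "product_name": "",
    "serial_number": "ABC123",  # hypothetical serial number
    "combined_id": "",
    "action": "FULL_ACCESS",
    "match_method": "VID_PID_SERIAL",
}

body = add_exception(policy, file_info["class"], file_info)
print(body["settings"]["classes"][0])
```

The resulting `body` is an ordinary dictionary, so it could be sent with the `json=` argument of the session's request method.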

Telegram wrong URL host

I'm making bots. I have tried everything, and this is my code:
def send_photo(chat_id, location, reply_markup=None):
    url = URL + "sendPhoto?chat_id={}&photo={}".format(chat_id, open('1.jpg', 'rb'))
    if reply_markup:
        url += "&reply_markup={}".format(reply_markup)
    print(get_url(url))
    get_url(url)
My file is in the same folder as my .py file and I double-checked everything. I even tried Telegram photo IDs and URLs, and I'm still getting:
{"ok":false,"error_code":400,"description":"Bad Request: wrong URL host"}

# https://core.telegram.org/bots/api#sendphoto
import requests
import json
token = 'Token'
def send_photo(chat_id, photo, caption='', parse_mode=None, disable_notification=False, reply_to_message_id=0, reply_markup=None):
    # Upload the local file in the request body instead of putting it in the URL
    with open(photo, 'rb') as file:
        response = requests.post(
            'https://api.telegram.org/bot{token}/sendPhoto?'.format(token=token),
            data={
                'chat_id': chat_id,  # Integer or String
                'caption': caption,  # String
                'parse_mode': parse_mode,  # String https://core.telegram.org/bots/api#formatting-options
                'disable_notification': disable_notification,  # Boolean
                'reply_to_message_id': reply_to_message_id,  # Integer
                'reply_markup': json.dumps(reply_markup) if reply_markup is not None else reply_markup,  # List
            },
            files={
                'photo': file.read()
            }
        )
    # The with block closes the file, so no explicit file.close() is needed
    if response.status_code == 200:
        return json.loads(response.text)
reply_markup = {
    'inline_keyboard': [
        [
            {'text': 'stackoverflow', 'url': 'https://stackoverflow.com'}
        ]
    ]
}
print(send_photo('802959264', 'test.png', caption='caption', reply_markup=reply_markup))
This is the output
{'ok': True, 'result': {'message_id': 10,
  'from': {'id': 1157936984, 'is_bot': True, 'first_name': 'test', 'username': 'RoomSupervisorBot'},
  'chat': {'id': 802959264, 'first_name': 'milad', 'username': 'milad_dev', 'type': 'private'},
  'date': 1597892647,
  'photo': [
    {'file_id': 'AgACAgQAAxkDAAMKXz3oJ8X-3jkfsP8GgT_oAtOUhUwAArO0MRuwP_BR0owk2ZSqQRdwWPEiXQADAQADAgADbQADaNoEAAEbBA', 'file_unique_id': 'AQADcFjxIl0AA2jaBAAB', 'file_size': 8226, 'width': 320, 'height': 180},
    {'file_id': 'AgACAgQAAxkDAAMKXz3oJ8X-3jkfsP8GgT_oAtOUhUwAArO0MRuwP_BR0owk2ZSqQRdwWPEiXQADAQADAgADeAADadoEAAEbBA', 'file_unique_id': 'AQADcFjxIl0AA2naBAAB', 'file_size': 35836, 'width': 800, 'height': 450},
    {'file_id': 'AgACAgQAAxkDAAMKXz3oJ8X-3jkfsP8GgT_oAtOUhUwAArO0MRuwP_BR0owk2ZSqQRdwWPEiXQADAQADAgADeQADatoEAAEbBA', 'file_unique_id': 'AQADcFjxIl0AA2raBAAB', 'file_size': 78830, 'width': 1280, 'height': 720},
    {'file_id': 'AgACAgQAAxkDAAMKXz3oJ8X-3jkfsP8GgT_oAtOUhUwAArO0MRuwP_BR0owk2ZSqQRdwWPEiXQADAQADAgADdwADZtoEAAEbBA', 'file_unique_id': 'AQADcFjxIl0AA2baBAAB', 'file_size': 88002, 'width': 1366, 'height': 768}],
  'caption': 'caption',
  'reply_markup': {'inline_keyboard': [[{'text': 'stackoverflow', 'url': 'https://stackoverflow.com'}]]}}}

Use an HTTP URL instead of a local path and it will work.
Otherwise, upload the file first, get its file_id, and then send that file_id to the bot.
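The two cases above (a string the Telegram servers can resolve versus a file that has to be uploaded) can be handled by one helper that prepares the arguments for requests.post. This is only a sketch: the function name `build_photo_kwargs` is made up, the example.com URL is a placeholder, and nothing is sent to Telegram here.

```python
import os


def build_photo_kwargs(chat_id, photo):
    """Build keyword arguments for a requests.post(...) call to sendPhoto.

    photo may be a local file path, an HTTP(S) URL, or a Telegram file_id.
    """
    data = {"chat_id": chat_id}
    if os.path.isfile(photo):
        # Local file: its bytes go into the multipart body, not the URL.
        with open(photo, "rb") as handle:
            return {"data": data, "files": {"photo": handle.read()}}
    # URL or file_id: send the string as an ordinary form field.
    data["photo"] = photo
    return {"data": data}


# Placeholder URL, used only to show the shape of the result.
print(build_photo_kwargs(802959264, "https://example.com/photo.jpg"))
```

A call would then look like `requests.post('https://api.telegram.org/bot{token}/sendPhoto'.format(token=token), **build_photo_kwargs(chat_id, 'test.png'))`, with `token` and `chat_id` supplied by the caller.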

How to send a local image instead of URL to Microsoft Cognitive Vision API(analyze an image) using Python?

I am trying out the Vision API (analyze an image) of Microsoft Cognitive Services. I am wondering how to send a local image to the Vision API through REST API calls and request the results from it using Python. Can anyone help me with this, please?
The testing option provided by Microsoft on their site only takes a URL. I tried to convert my local path to a URL and give it as input, but that doesn't work.

You can see the full code here: https://github.com/miparnisari/Cognitive-Vision-Python/blob/master/Jupyter%20Notebook/Computer%20Vision%20API%20Example.ipynb
But the gist of it:
import requests  # pip3 install requests
region = "YOUR-API-REGION"  # For example, "westus"
api_key = "YOUR-API-KEY"
path_to_file = "C:/Users/mparnisari/Desktop/test.jpg"
# Read file
with open(path_to_file, 'rb') as f:
    data = f.read()
# Set request headers
headers = dict()
headers['Ocp-Apim-Subscription-Key'] = api_key
headers['Content-Type'] = 'application/octet-stream'
# Set request querystring parameters
params = {'visualFeatures': 'Color,Categories,Tags,Description,ImageType,Faces,Adult'}
# Make request and process response
response = requests.request('post', "https://{}.api.cognitive.microsoft.com/vision/v1.0/analyze".format(region), data=data, headers=headers, params=params)
result = None  # stays None if the response has no recognised body
if response.status_code == 200 or response.status_code == 201:
    if 'content-length' in response.headers and int(response.headers['content-length']) == 0:
        result = None
    elif 'content-type' in response.headers and isinstance(response.headers['content-type'], str):
        if 'application/json' in response.headers['content-type'].lower():
            result = response.json() if response.content else None
        elif 'image' in response.headers['content-type'].lower():
            result = response.content
    print(result)
else:
    print("Error code: %d" % response.status_code)
    print("Message: %s" % response.json())
This will print something like this:
{
'categories': [{
'name': 'others_',
'score': 0.0078125
}, {
'name': 'outdoor_',
'score': 0.0078125
}, {
'name': 'people_',
'score': 0.4140625
}],
'adult': {
'isAdultContent': False,
'isRacyContent': False,
'adultScore': 0.022686801850795746,
'racyScore': 0.016844550147652626
},
'tags': [{
'name': 'outdoor',
'confidence': 0.9997920393943787
}, {
'name': 'sky',
'confidence': 0.9985970854759216
}, {
'name': 'person',
'confidence': 0.997259259223938
}, {
'name': 'woman',
'confidence': 0.944902777671814
}, {
'name': 'posing',
'confidence': 0.8417303562164307
}, {
'name': 'day',
'confidence': 0.2061375379562378
}],
'description': {
'tags': ['outdoor', 'person', 'woman', 'snow', 'posing', 'standing', 'skiing', 'holding', 'lady', 'photo', 'smiling', 'top', 'wearing', 'girl', 'mountain', 'sitting', 'young', 'people', 'sun', 'slope', 'hill', 'man', 'covered', 'umbrella', 'red', 'white'],
'captions': [{
'text': 'a woman posing for a picture',
'confidence': 0.9654204679303702
}]
},
'metadata': {
'width': 3264,
'height': 1836,
'format': 'Jpeg'
},
'faces': [{
'age': 26,
'gender': 'Female',
'faceRectangle': {
'left': 597,
'top': 2151,
'width': 780,
'height': 780
}
}],
'color': {
'dominantColorForeground': 'White',
'dominantColorBackground': 'White',
'dominantColors': ['White', 'Grey'],
'accentColor': '486E83',
'isBWImg': False
},
'imageType': {
'clipArtType': 0,
'lineDrawingType': 0
}
}
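Once the result is a dictionary like the one above, the interesting parts can be pulled out with plain dictionary access. The sketch below is illustrative only: the function name `summarise_analysis` and the 0.5 confidence threshold are arbitrary choices, and the sample is a trimmed copy of the output printed above.

```python
def summarise_analysis(result, min_confidence=0.5):
    """Pick the most confident caption and the tags above a threshold."""
    captions = result.get("description", {}).get("captions", [])
    best = max(captions, key=lambda c: c["confidence"], default=None)
    tags = [
        tag["name"]
        for tag in result.get("tags", [])
        if tag["confidence"] >= min_confidence
    ]
    return {"caption": best["text"] if best else None, "tags": tags}


# Trimmed copy of the response printed above.
sample_result = {
    "tags": [
        {"name": "outdoor", "confidence": 0.9997920393943787},
        {"name": "sky", "confidence": 0.9985970854759216},
        {"name": "person", "confidence": 0.997259259223938},
        {"name": "woman", "confidence": 0.944902777671814},
        {"name": "posing", "confidence": 0.8417303562164307},
        {"name": "day", "confidence": 0.2061375379562378},
    ],
    "description": {
        "captions": [
            {"text": "a woman posing for a picture", "confidence": 0.9654204679303702}
        ]
    },
}

print(summarise_analysis(sample_result))
```

Using `.get(...)` with defaults keeps the helper usable when only some visualFeatures were requested and the corresponding keys are missing.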
