I've now tried everything for the past few hours but I can't extract a specific thing from the HTML below. I want to grab the "sessionCartId" but I can't figure out how....
That's what I tried so far:
# Grab the third <script type="text/javascript"> tag from the page.
sessioncartid = BeautifulSoup(response.text, "html.parser").findAll("script", {"type":"text/javascript"})[2]
# NOTE(review): this fails — the tag's text is JavaScript ("var ACC = ...; var OCC = {...}"),
# not a bare JSON document, so json.loads() cannot parse it as-is; the OCC object
# literal has to be cut out of the script text first.
data = json.loads(sessioncartid.text)
print(data)
^^ This gives me the correct script tag, but I can't transform it into JSON nor get the sessionCartId.
<script type="text/javascript">
/*<![CDATA[*/
var ACC = {config: {}};
ACC.config.contextPath = "";
ACC.config.encodedContextPath = "/de/web";
ACC.config.commonResourcePath = "/_ui/20220811221438/responsive/common";
ACC.config.themeResourcePath = "/_ui/20220811221438/responsive/theme-gh";
ACC.config.siteResourcePath = "/_ui/20220811221438/responsive/site-ghstore";
ACC.config.rootPath = "/_ui/20220811221438/responsive";
ACC.config.CSRFToken = "81b0156a-5a78-4969-b52e-e5080473fb83";
ACC.pwdStrengthVeryWeak = 'password.strength.veryweak';
ACC.pwdStrengthWeak = 'password.strength.weak';
ACC.pwdStrengthMedium = 'password.strength.medium';
ACC.pwdStrengthStrong = 'password.strength.strong';
ACC.pwdStrengthVeryStrong = 'password.strength.verystrong';
ACC.pwdStrengthUnsafePwd = 'password.strength.unsafepwd';
ACC.pwdStrengthTooShortPwd = 'password.strength.tooshortpwd';
ACC.pwdStrengthMinCharText = 'password.strength.minchartext';
ACC.accessibilityLoading = 'aria.pickupinstore.loading';
ACC.accessibilityStoresLoaded = 'aria.pickupinstore.storesloaded';
ACC.config.googleApiKey = "";
ACC.config.googleApiVersion = "3.7";
ACC.autocompleteUrl = '/de/web/search/autocompleteSecure';
ACC.config.loginUrl = '/de/web/login';
ACC.config.authenticationStatusUrl = '/de/web/authentication/status';
/*]]>*/
var OCC =
{
"token": "1799248c-8de0-4199-b5fe-1d610452010a",
"currentUser": "test#gmail.com",
"sessionCartGuid": "2323121232323",
"sessionCartId": "121212123435324",
"sessionLanguageIso": "de",
"sessionCountryIso": "DE",
"urlPosCode": "web",
"isASM": false,
"intermediaryID": "",
"isASMCustomerEmulated": false,
"siteId": "ghstore",
"OCCBaseUrl": "/ghcommercewebservices/v2/ghstore",
"availablePointsOfService": "BUD,FRA,DTM,HAM,GRZ,HAJ,SZG,VIE,WEB,BER",
"primaryPointOfSevice": "WEB",
"clientChannel": "web-eu"
};
</script>
This is how you can extract that dictionary:
from bs4 import BeautifulSoup
import json
import re

html = '''
<script type="text/javascript">
/*<![CDATA[*/
var ACC = {config: {}};
ACC.config.contextPath = "";
ACC.config.encodedContextPath = "/de/web";
ACC.config.commonResourcePath = "/_ui/20220811221438/responsive/common";
ACC.config.themeResourcePath = "/_ui/20220811221438/responsive/theme-gh";
ACC.config.siteResourcePath = "/_ui/20220811221438/responsive/site-ghstore";
ACC.config.rootPath = "/_ui/20220811221438/responsive";
ACC.config.CSRFToken = "81b0156a-5a78-4969-b52e-e5080473fb83";
ACC.pwdStrengthVeryWeak = 'password.strength.veryweak';
ACC.pwdStrengthWeak = 'password.strength.weak';
ACC.pwdStrengthMedium = 'password.strength.medium';
ACC.pwdStrengthStrong = 'password.strength.strong';
ACC.pwdStrengthVeryStrong = 'password.strength.verystrong';
ACC.pwdStrengthUnsafePwd = 'password.strength.unsafepwd';
ACC.pwdStrengthTooShortPwd = 'password.strength.tooshortpwd';
ACC.pwdStrengthMinCharText = 'password.strength.minchartext';
ACC.accessibilityLoading = 'aria.pickupinstore.loading';
ACC.accessibilityStoresLoaded = 'aria.pickupinstore.storesloaded';
ACC.config.googleApiKey = "";
ACC.config.googleApiVersion = "3.7";
ACC.autocompleteUrl = '/de/web/search/autocompleteSecure';
ACC.config.loginUrl = '/de/web/login';
ACC.config.authenticationStatusUrl = '/de/web/authentication/status';
/*]]>*/
var OCC =
{
"token": "1799248c-8de0-4199-b5fe-1d610452010a",
"currentUser": "test#gmail.com",
"sessionCartGuid": "2323121232323",
"sessionCartId": "121212123435324",
"sessionLanguageIso": "de",
"sessionCountryIso": "DE",
"urlPosCode": "web",
"isASM": false,
"intermediaryID": "",
"isASMCustomerEmulated": false,
"siteId": "ghstore",
"OCCBaseUrl": "/ghcommercewebservices/v2/ghstore",
"availablePointsOfService": "BUD,FRA,DTM,HAM,GRZ,HAJ,SZG,VIE,WEB,BER",
"primaryPointOfSevice": "WEB",
"clientChannel": "web-eu"
};
</script>
'''

soup = BeautifulSoup(html, 'html.parser')
# select_one() takes only a CSS selector; the string= filter belongs to
# find()/find_all() and is ignored (or rejected) by select_one().  Use find()
# so the filter actually selects the <script> tag mentioning sessionCartGuid.
info = soup.find('script', string=re.compile('sessionCartGuid'))
# The tag's text is JavaScript, not JSON, but the OCC object literal itself is
# valid JSON: keep everything between "var OCC =" and the terminating ";".
json_obj = json.loads(info.text.split('var OCC =')[1].split(';')[0])
# print(json_obj)
print(json_obj['token'])
print(json_obj['currentUser'])
print(json_obj['sessionCartId'])
Result:
1799248c-8de0-4199-b5fe-1d610452010a
test#gmail.com
121212123435324
BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Related
import scrapy
import logging


class AssetSpider(scrapy.Spider):
    """Spider scraping asset rows from the NREGA asset report table."""
    name = 'asset'
    start_urls = ['http://mnregaweb4.nic.in/netnrega/asset_report_dtl.aspx?lflag=eng&state_name=WEST%20BENGAL&state_code=32&district_name=NADIA&district_code=3201&block_name=KRISHNAGAR-I&block_code=&panchayat_name=DOGACHI&panchayat_code=3201009009&fin_year=2020-2021&source=national&Digest=8+kWKUdwzDQA1IJ5qhD8Fw']

    def parse(self, response):
        i = 4
        while i<2236:
            # NOTE(review): tr['i'] is a literal string predicate, not the loop
            # variable — i is never interpolated into the query, so every
            # iteration selects the same nodes (this is the bug being asked about).
            assetid = response.xpath("//table[2]//tr['i']/td[2]/text()")
            assetcategory = response.xpath("//table[2]//tr['i']/td[3]/text()")
            schemecode = response.xpath("//table[2]//tr['i']/td[5]/text()")
            # NOTE(review): XPath attribute syntax is @href; '#href' looks like a
            # markdown-mangled '@href'.
            link = response.xpath("//table[2]//tr['i']/td[6]/a/#href")
            schemename = response.xpath("//table[2]//tr['i']/td[7]/text()")
            yield {
                'assetid' : assetid,
                'assetcategory' : assetcategory,
                'schemecode' : schemecode,
                'link' : link,
                'schemename' : schemename
            }
            i += 1
I want to use the 'i' variable to loop over tr[position] in the XPath from 4 to 2235. I just don't know whether it is possible — and if it is, what is the right way to do it? Mine does not work.
Sure, it is possible and widely used.
You can format the string with variable.
There are several syntaxes for that.
For example you can do it like this:
i = 4
while i<2236:
assetid_path = "//table[2]//tr[{1}]/td[2]/text()".format(i)
assetcategory_path = "//table[2]//tr[{1}]/td[3]/text()".format(i)
schemecode_path = "//table[2]//tr[{1}]/td[5]/text()".format(i)
link_path = "//table[2]//tr[{1}]/td[6]/a/#href".format(i)
schemename_path = "//table[2]//tr[{1}]/td[7]/text()".format(i)
assetid = response.xpath(assetid_path)
assetcategory = response.xpath(assetcategory_path)
schemecode = response.xpath(schemecode_path)
link = response.xpath(link_path)
schemename = response.xpath(schemename_path)
yield {
'assetid' : assetid,
'assetcategory' : assetcategory,
'schemecode' : schemecode,
'link' : link,
'schemename' : schemename
}
i += 1
While the above can be shortened like this:
i = 4
while i<2236:
root_path = "//table[2]//tr[{1}]".format(i)
assetid_path = root_path + "/td[2]/text()"
assetcategory_path = root_path + "/td[3]/text()"
schemecode_path = root_path + "/td[5]/text()"
link_path = root_path + "/td[6]/a/#href"
schemename_path = root_path + "/td[7]/text()"
assetid = response.xpath(assetid_path)
assetcategory = response.xpath(assetcategory_path)
schemecode = response.xpath(schemecode_path)
link = response.xpath(link_path)
schemename = response.xpath(schemename_path)
yield {
'assetid' : assetid,
'assetcategory' : assetcategory,
'schemecode' : schemecode,
'link' : link,
'schemename' : schemename
}
i += 1
But the better way is to use bind variable. As following:
i = 4
while i<2236:
assetid = response.xpath("//table[2]//tr[$i]/td[2]/text()",i=i))
assetcategory = response.xpath("//table[2]//tr[$i]/td[3]/text()",i=i))
schemecode = response.xpath("//table[2]//tr[$i]/td[5]/text()",i=i)
link = response.xpath("//table[2]//tr[$i]/td[6]/a/#href",i=i)
schemename = response.xpath("//table[2]//tr[$i]/td[7]/text()",i=i)
yield {
'assetid' : assetid,
'assetcategory' : assetcategory,
'schemecode' : schemecode,
'link' : link,
'schemename' : schemename
}
i += 1
You send string to xpath so I would suggest to use formating... eg.:
response.xpath(f"//table[2]//tr[{i}]/td[2]/text()")
I have SCIM search request body like this,
{
"schemas": ["urn:ietf:params:scim:api:messages:2.0:SearchRequest"],
"attributes": ["displayName", "userName"],
"excludedAttributes": ["emails"],
"filter":"displayName sw \"smith\"",
"startIndex": 1,
"count": 10,
"sortBy": "userName",
"sortOrder": "ascending"
}
all the above attributes are optional, except for "schemas" attribute.
Because all the attributes are optional, I have to construct the query accordingly.
below is the code for this handling, as you can see there are conditions which make code look untidy.
data = request.get_json()
# Projection dict handed to Mongo: 1 = include field, 0 = exclude field.
a = {}
attributes = data.get('attributes',[])
sortby = data.get('sortBy',None)
sortorder = data.get('sortOrder',None)
if not attributes:
    pass
else:
    for i in attributes:
        if i not in a:
            a[i]=1
excludedAttributes = data.get('excludedAttributes',[])
if not excludedAttributes:
    pass
else:
    for i in excludedAttributes:
        if i not in a:
            a[i]=0
# One find() call per combination of (projection, sortBy, sortOrder) —
# this is the combinatorial explosion the question complains about.
if not a and not sortby:
    result = mongo.db.test.find({}, )
if a and not sortby:
    result = mongo.db.test.find({}, a)
if not a and sortby and not sortorder:
    result = mongo.db.test.find({}, ).sort([(sortby,flask_pymongo.ASCENDING)])
if a and sortby and not sortorder:
    result = mongo.db.test.find({}, a).sort([(sortby, flask_pymongo.ASCENDING)])
if not a and sortby and sortorder=='ascending':
    result = mongo.db.test.find({}, ).sort([(sortby, flask_pymongo.ASCENDING)])
if a and sortby and not sortorder=='ascending':
    result = mongo.db.test.find({}, a).sort([(sortby, flask_pymongo.ASCENDING)])
if not a and sortby and sortorder=='descending':
    result = mongo.db.test.find({}, ).sort([(sortby, flask_pymongo.DESCENDING)])
if a and sortby and not sortorder=='descending':
    result = mongo.db.test.find({}, a).sort([(sortby, flask_pymongo.DESCENDING)])
# NOTE(review): full_data is never initialised in this snippet — presumably
# defined earlier in the view function; verify against the full source.
for i in result:
    full_data.append(i)
resp = jsonify(json.loads(dumps(full_data)))
return resp
if i include even pagination, there will be even more conditions piling up.
How do i construct these queries effectively.
data = request.get_json()
a = {}
attributes = data.get('attributes',[])
sortby = data.get('sortBy',None)
sortorder = data.get('sortOrder',None)
if not attributes:
pass
else:
for i in attributes:
if i not in a:
a[i]=1
excludedAttributes = data.get('excludedAttributes',[])
if not excludedAttributes:
pass
else:
for i in excludedAttributes:
if i not in a:
a[i]=0
result = mongo.db.test.find({}, )
if a:
result = mongo.db.test.find({}, a)
if sortby:
if sortorder == "descending":
sortorder = flask_pymongo.DESCENDING
else:
sortorder = flask_pymongo.ASCENDING
result = result.sort([(sortby, sortorder)])
for i in result:
full_data.append(i)
resp = jsonify(json.loads(dumps(full_data)))
return resp
Folks,
I am trying to parse log file into json format.
I have a lot of logs, there is one of them
How can I parse this?
03:02:03.113 [info] ext_ref = BANK24AOS_cl_reqmarketcreditorderstate_6M8I1NT8JKYD_1591844522410384_4SGA08M8KIXQ reqid = 1253166 type = INREQ channel = BANK24AOS sid = msid_1591844511335516_KRRNBSLH2FS duration = 703.991 req_uri = marketcredit/order/state login = 77012221122 req_type = cl_req req_headers = {"accept-encoding":"gzip","connection":"close","host":"test-mobileapp-api.bank.kz","user-agent":"okhttp/4.4.1","x-forwarded-for":"212.154.169.134","x-real-ip":"212.154.169.134"} req_body = {"$sid":"msid_1591844511335516_KRRNBSLH2FS","$sid":"msid_1591844511335516_KRRNBSLH2FS","app":"bank","app_version":"2.3.2","channel":"aos","colvir_token":"GExPR0lOX1BBU1NXT1JEX0NMRUFSVEVYVFNzrzh4Thk1+MjDKWl/dDu1fQPsJ6gGLSanBp41yLRv","colvir_commercial_id":"-1","colvir_id":"000120.335980","openway_commercial_id":"6247520","openway_id":"6196360","$lang":"ru","ekb_id":"923243","inn":"990830221722","login":"77012221122","bank24_id":"262"} resp_body = {"task_id":"","status":"success","data":{"state":"init","applications":[{"status":"init","id":"123db561-34a3-4a8d-9fa7-03ed6377b44f","name":"Sulpak","amount":101000,"items":[{"name":"Switch CISCO x24","price":100000,"count":1,"amount":100000}]}],"segment":{"range":{"min":6,"max":36,"step":1},"payment_day":{"max":28,"min":1}}}}
Into this type of json, or any other format (but I guess json is best one)
{
"time":"03:02:03.113",
"class_req":"info",
"ext_ref":"BANK24AOS_cl_reqmarketcreditorderstate_6M8I1NT8JKYD_1591844522410384_4SGA08M8KIXQ",
"reqid":"1253166",
"type":"INREQ",
"channel":"BANK24AOS",
"sid":"msid_1591844511335516_KRRNBSLH2FS",
"duration":"703.991",
"req_uri":"marketcredit/order/state",
"login":"77012221122",
"req_type":"cl_req",
"req_headers":{
"accept-encoding":"gzip",
"connection":"close",
"host":"test-mobileapp-api.bank.kz",
"user-agent":"okhttp/4.4.1",
"x-forwarded-for":"212.154.169.134",
"x-real-ip":"212.154.169.134"
},
"req_body":{
"$sid":"msid_1591844511335516_KRRNBSLH2FS",
"$sid":"msid_1591844511335516_KRRNBSLH2FS",
"app":"bank",
"app_version":"2.3.2",
"channel":"aos",
"colvir_token":"GExPR0lOX1BBU1NXT1JEX0NMRUFSVEVYVFNzrzh4Thk1+MjDKWl/dDu1fQPsJ6gGLSanBp41yLRv",
"colvir_commercial_id":"-1",
"colvir_id":"000120.335980",
"openway_commercial_id":"6247520",
"openway_id":"6196360",
"$lang":"ru",
"ekb_id":"923243",
"inn":"990830221722",
"login":"77012221122",
"bank24_id":"262"
},
"resp_body":{
"task_id":"",
"status":"success",
"data":{
"state":"init",
"applications":[
{
"status":"init",
"id":"123db561-34a3-4a8d-9fa7-03ed6377b44f",
"name":"Sulpak",
"amount":101000,
"items":[
{
"name":"Switch CISCO x24",
"price":100000,
"count":1,
"amount":100000
}
]
}
],
"segment":{
"range":{
"min":6,
"max":36,
"step":1
},
"payment_day":{
"max":28,
"min":1
}
}
}
}
}
I am trying to split first whole text, but there I met another problem is to match keys to values depending on '=' sign. Also there might be some keys with empty values. For ex.:
type = INREQ channel = sid = duration = 1.333 (to get to know that there is an empty value, you need to pay attention on number of spaces. Usually there is 1 space between prev.value and next key). So this example should look like this:
{
"type":"INREQ",
"channel":"",
"sid":"",
"duration":"1.333"
}
Thanks ahead!
Note one thing about the duplicate key "$sid":"msid_1591844511335516_KRRNBSLH2FS" — json.loads keeps only the last occurrence.
import re
import json

text = """03:02:03.113 [info] ext_ref = reqid = 1253166 type = INREQ channel = BANK24AOS sid = msid_1591844511335516_KRRNBSLH2FS duration = 703.991 req_uri = marketcredit/order/state login = 77012221122 req_type = cl_req req_headers = {"accept-encoding":"gzip","connection":"close","host":"test-mobileapp-api.bank.kz","user-agent":"okhttp/4.4.1","x-forwarded-for":"212.154.169.134","x-real-ip":"212.154.169.134"} req_body = {"$sid":"msid_1591844511335516_KRRNBSLH2FS","$sid":"msid_1591844511335516_KRRNBSLH2FS","app":"bank","app_version":"2.3.2","channel":"aos","colvir_token":"GExPR0lOX1BBU1NXT1JEX0NMRUFSVEVYVFNzrzh4Thk1+MjDKWl/dDu1fQPsJ6gGLSanBp41yLRv","colvir_commercial_id":"-1","colvir_id":"000120.335980","openway_commercial_id":"6247520","openway_id":"6196360","$lang":"ru","ekb_id":"923243","inn":"990830221722","login":"77012221122","bank24_id":"262"} resp_body = {"task_id":"","status":"success","data":{"state":"init","applications":[{"status":"init","id":"123db561-34a3-4a8d-9fa7-03ed6377b44f","name":"Sulpak","amount":101000,"items":[{"name":"Switch CISCO x24","price":100000,"count":1,"amount":100000}]}],"segment":{"range":{"min":6,"max":36,"step":1},"payment_day":{"max":28,"min":1}}}}"""

# Re-label the leading '03:02:03.113 [info]' prefix as two ordinary
# "key = value" fields so the whole line parses uniformly.
index1 = text.index('[')
index2 = text.index(']')
# Keep the space after ']' (the original sliced index2+2 and fused
# "info" with the next key).
new_text = 'time = ' + text[:index1 - 1] + ' class_req = ' + text[index1 + 1:index2] + text[index2 + 1:]

# A key is a run of non-space characters followed by " = ".  A value runs
# (lazily) until the next " key = " boundary or the end of the line; the
# optional leading space in the lookahead handles empty values, where the next
# key starts immediately after "= ".  The embedded JSON blobs contain no
# spaces, so the boundary cannot fire inside them.
pairs = re.findall(r'(\S+) = (.*?)(?= ?\S+ = |$)', new_text)

res = {}
for key, value in pairs:
    if value.startswith('{'):
        # req_headers / req_body / resp_body are embedded JSON objects.
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            pass  # keep the raw string if it is not valid JSON
    res[key] = value
You can try regular expressions in Python.
Here is what I wrote; it works for your problem.
For convenience I deleted the text before "ext_ref..."; you can directly truncate the raw string.
import re
import json

# Split the log line into three regions (before req_headers, the two JSON
# request blobs, and resp_body) and regex-parse each separately.
string = 'ext_ref = BANK24AOS_cl_reqmarketcreditorderstate_6M8I1NT8JKYD_1591844522410384_4SGA08M8KIXQ reqid = 1253166 type = INREQ channel = BANK24AOS sid = msid_1591844511335516_KRRNBSLH2FS duration = 703.991 req_uri = marketcredit/order/state login = 77012221122 req_type = cl_req req_headers = {"accept-encoding":"gzip","connection":"close","host":"test-mobileapp-api.bank.kz","user-agent":"okhttp/4.4.1","x-forwarded-for":"212.154.169.134","x-real-ip":"212.154.169.134"} req_body = {"$sid":"msid_1591844511335516_KRRNBSLH2FS","$sid":"msid_1591844511335516_KRRNBSLH2FS","app":"bank","app_version":"2.3.2","channel":"aos","colvir_token":"GExPR0lOX1BBU1NXT1JEX0NMRUFSVEVYVFNzrzh4Thk1+MjDKWl/dDu1fQPsJ6gGLSanBp41yLRv","colvir_commercial_id":"-1","colvir_id":"000120.335980","openway_commercial_id":"6247520","openway_id":"6196360","$lang":"ru","ekb_id":"923243","inn":"990830221722","login":"77012221122","bank24_id":"262"} resp_body = {"task_id":"","status":"success","data":{"state":"init","applications":[{"status":"init","id":"123db561-34a3-4a8d-9fa7-03ed6377b44f","name":"Sulpak","amount":101000,"items":[{"name":"Switch CISCO x24","price":100000,"count":1,"amount":100000}]}],"segment":{"range":{"min":6,"max":36,"step":1},"payment_day":{"max":28,"min":1}}}}'
position = re.search("req_headers",string) # position of req_headers
resp_body_pos = re.search("resp_body",string)
resp_body = string[resp_body_pos.span()[0]:]
res1 = {}
res1.setdefault(resp_body.split("=")[0],resp_body.split("=")[1])
print(res1)
before = string[:position.span()[0]]
after = string[position.span()[0]:resp_body_pos.span()[0]] # handle req_body separately
# Raw strings: "\S" in a plain literal is an invalid escape sequence
# (SyntaxWarning in modern Python, a future error).
res2 = re.findall(r"(\S+) = (\S+)",before)
print(res2)
res3 = re.findall(r"(\S+) = ({.*?})",after)
print(res3)
#res1 type: dict{'resp_body':'...'} content in resp_body
#res2 type: list[(),()..] content before req_head
#res3 type: list[(),()..] the rest content
and now you can do what you want to do with the data(.e.g. transform it into json respectively)
Hope this is helpful
On giving youtube video url, I first download video page and extract javascript object between
<script>var ytplayer = ytplayer ..... </script>
I got
{
"args": {
"is_listed": "1",
"account_playback_token": "QUFFLUhqbWdXR1NfQjRiRmNzWVhRVTM0ajlNcnM1alVUd3xBQ3Jtc0tsVi01WFp5VmV2MTU3RnpkYUVkRzVqR1ZTNUI4T2JaQzk1ckxPejdVNkYzUk5zOTdjZnNmb1BYZHNLQ05nblZZbFk2ZWJXNHRPNVFoNVVNc2RjTE1YekdKSGY4dlVhSnlCU1ctNFZJdXBKbWhIRG1TZw==",
"ptk": "RajshriEntertainment",
"focEnabled": "1",
"tag_for_child_directed": false,
"adaptive_fmts": ......,
"probe_url": .....,
"rmktEnabled": "1",
"allow_ratings": "1",
"dbp": "ChoKFk5RNTV5UGs5bDZmSk5wSjQ4a3RiSHcQARABGAI",
"cc3_module": "1",
"no_get_video_log": "1",
"fmt_list": ......,
"title":..........,
"invideo": true,
"sffb": true,
"iurlmq_webp": ,
"cosver": "10_8_4",
"url_encoded_fmt_stream_map": .................,
"max_dynamic_allocation_ad_tag_length": "2040",
"innertube_api_key": "AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8",
"timestamp": "1446586407",
"cc_asr": "1",
"apiary_host_firstparty": "",
"adsense_video_doc_id": "yt_Vd4iNPuRlx4",
"innertube_context_client_version": "1.20151102",
"mpu": true,
"tmi": "1",
"ldpj": "-19",
"fade_out_duration_milliseconds": "1000",
.........
}
}
i found key adaptive_fmts and url_encoded_fmt_stream_map contain multiple url in percent-encoded form.
i take one url from url_encoded_fmt_stream_map it look like this
https://r1---sn-o3o-qxal.googlevideo.com/videoplayback?
ratebypass=yes&
signature=982E413BBE08CA5801420F9696E0F2ED691B99FA.D666D39D1A0AF066F76F12632A10D3B8076076CE&
lmt=1443906393476832&
expire=1446604919&
fexp=9406983%2C9408710%2C9414764%2C9416126%2C9417707%2C9421410%2C9422596%2C9423663&
itag=22&
dur=128.801&
source=youtube&
upn=pk2CEhVBeFM&
sver=3&
key=yt6&
id=o-AK-OlE5NUsbkp51EZY2yKuz5vsSGofgUvrvTtOrhC72e&
sparams=dur%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cpl%2Cratebypass%2Crequiressl%2Csource%2Cupn%2Cexpire&
mime=video%2Fmp4&
ipbits=0&
pl=21&
ip=x.y.z.a&
initcwndbps=5405000&
requiressl=yes&
mn=sn-o3o-qxal&
mm=31&
ms=au&
mv=m&
mt=1446583222&
itag=22&
type=video/mp4
But when I paste this (above) URL into a browser, nothing happens — it does not work.
Please help me.
Also
What is difference between adaptive_fmts and url_encoded_fmt_stream_map containing urls?
In python2.7, this works:
# NOTE(review): Python 2 only (print statements, urlparse/urllib2 modules);
# performs live network I/O against YouTube's get_video_info endpoint.
import urlparse, urllib2

vid = "vzS1Vkpsi5k"
save_title = "YouTube SpaceX - Booster Number 4 - Thaicom 8 06-06-2016"
# get_video_info returns the player metadata as a URL-encoded query string.
url_init = "https://www.youtube.com/get_video_info?video_id=" + vid
resp = urllib2.urlopen(url_init, timeout=10)
data = resp.read()
info = urlparse.parse_qs(data)
title = info['title']
print "length: ", info['length_seconds'][0] + " seconds"
# adaptive_fmts is a comma-separated list of per-stream query strings.
stream_map = info['adaptive_fmts'][0]
vid_info = stream_map.split(",")
mp4_filename = save_title + ".mp4"
for video in vid_info:
    item = urlparse.parse_qs(video)
    #print 'quality: ', item['quality'][0]
    #print 'type: ', item['type'][0]
    url_download = item['url'][0]
    resp = urllib2.urlopen(url_download)
    print resp.headers
    length = int(resp.headers['Content-Length'])
    my_file = open(mp4_filename, "w+")
    done, i = 0, 0
    # Stream the payload to disk in 1 KiB chunks, printing progress
    # roughly every 1000 chunks.
    buff = resp.read(1024)
    while buff:
        my_file.write(buff)
        done += 1024
        percent = done * 100.0 / length
        buff = resp.read(1024)
        if not i%1000:
            percent = done * 100.0 / length
            print str(percent) + "%"
        i += 1
    # Only the first stream is downloaded.
    break
I want to get latitude and longitude from a webpage using beautifulsoup but they are in a script:
//<![CDATA[
theForm.oldSubmit = theForm.submit;
theForm.submit = WebForm_SaveScrollPositionSubmit;
theForm.oldOnSubmit = theForm.onsubmit;
theForm.onsubmit = WebForm_SaveScrollPositionOnSubmit;
var GMapsProperties={};function getGMapElementById(mapId,GMapElementId){var _mapId=typeof(mapId)=='string'? mapId : mapId.getDiv().id;var overlayArray=GMapsProperties[_mapId]['overlayArray'];for(var i=0;i < overlayArray.length;i++){if(overlayArray[i][0]==GMapElementId){return overlayArray[i][1];}}return null;}function removeGMapElementById(mapId,GMapElementId){var _mapId=typeof(mapId)=='string'? mapId : mapId.getDiv().id;var overlayArray=GMapsProperties[_mapId]['overlayArray'];for(var i=0;i < overlayArray.length;i++){if(overlayArray[i][0]==GMapElementId){overlayArray.splice(i,1);return;}}}function closeWindows(mapId){for(var i=0;i<GMapsProperties[mapId]['windowArray'].length;i++){GMapsProperties[mapId]['windowArray'][i][1].close();}}var _sg=_sg ||{};_sg.cs=(function(){var p={};p.createMarker=function(opt,id){var m=new google.maps.Marker(opt);if(id && m.getMap())GMapsProperties[m.getMap().getDiv().id]['overlayArray'].push([id,m]);return m;};p.createPolyline=function(opt,id){var m=new google.maps.Polyline(opt);if(id && m.getMap())GMapsProperties[m.getMap().getDiv().id]['overlayArray'].push([id,m]);return m;};p.createPolygon=function(opt,id){var m=new google.maps.Polygon(opt);if(id && m.getMap())GMapsProperties[m.getMap().getDiv().id]['overlayArray'].push([id,m]);return m;};return p;})();function addEvent(el,ev,fn){if(el.addEventListener)el.addEventListener(ev,fn,false);else if(el.attachEvent)el.attachEvent('on'+ev,fn);else el['on'+ev]=fn;}GMapsProperties['subgurim_GoogleMapControl'] = {}; var GMapsProperties_subgurim_GoogleMapControl = GMapsProperties['subgurim_GoogleMapControl']; GMapsProperties_subgurim_GoogleMapControl['enableStore'] = false; GMapsProperties_subgurim_GoogleMapControl['overlayArray'] = new Array(); GMapsProperties_subgurim_GoogleMapControl['windowArray'] = new Array();var subgurim_GoogleMapControl;function load_subgurim_GoogleMapControl(){var mapDOM = document.getElementById('subgurim_GoogleMapControl'); if (!mapDOM) 
return;subgurim_GoogleMapControl = new google.maps.Map(mapDOM);function subgurim_GoogleMapControlupdateValues(eventId,value){var item=document.getElementById('subgurim_GoogleMapControl_Event'+eventId);item.value=value;}google.maps.event.addListener(subgurim_GoogleMapControl, 'addoverlay', function(overlay) { if(overlay) { GMapsProperties['subgurim_GoogleMapControl']['overlayArray'].push(overlay); } });google.maps.event.addListener(subgurim_GoogleMapControl, 'clearoverlays', function() { GMapsProperties['subgurim_GoogleMapControl']['overlayArray'] = new Array(); });google.maps.event.addListener(subgurim_GoogleMapControl, 'removeoverlay', function(overlay) { removeGMapElementById('subgurim_GoogleMapControl',overlay.id) });google.maps.event.addListener(subgurim_GoogleMapControl, 'maptypeid_changed', function() { var tipo = subgurim_GoogleMapControl.getMapTypeId(); subgurim_GoogleMapControlupdateValues('0', tipo);});google.maps.event.addListener(subgurim_GoogleMapControl, 'dragend', function() { var lat = subgurim_GoogleMapControl.getCenter().lat(); var lng = subgurim_GoogleMapControl.getCenter().lng(); subgurim_GoogleMapControlupdateValues('2', lat+','+lng); });google.maps.event.addListener(subgurim_GoogleMapControl, 'zoom_changed', function() { subgurim_GoogleMapControlupdateValues('1', subgurim_GoogleMapControl.getZoom()); });subgurim_GoogleMapControl.setOptions({center:new google.maps.LatLng(35.6783546483511,51.4196634292603),disableDefaultUI:true,keyboardShortcuts:false,mapTypeControl:false,mapTypeId:google.maps.MapTypeId.ROADMAP,scrollwheel:false,zoom:14});var marker_subgurim_920435_=_sg.cs.createMarker({position:new google.maps.LatLng(35.6783546483511,51.4196634292603),clickable:true,draggable:false,map:subgurim_GoogleMapControl,raiseOnDrag:true,visible:true,icon:'/images/markers/Site/Tourism/vase.png'}, 'marker_subgurim_920435_');}addEvent(window,'load',load_subgurim_GoogleMapControl);//]]>
and I want information in this part:
{position:new google.maps.LatLng(35.6783546483511,51.4196634292603)
is it possible to access that information by using beautifulsoup or any other web-scraper?
Use Regular expression for this purpose.
import re
import ast

# Suppose the script is stored in the variable script_file.
# Raw string so "\(" is a regex escape, not a (invalid) string escape.
m = re.search(r'LatLng(\(.+?\))', script_file)
latlng = m.group(1)
# ast.literal_eval safely parses the "(lat, lng)" tuple literal; eval() on
# text scraped from a web page would execute arbitrary code.
latlng = ast.literal_eval(latlng)
print(latlng) #(35.6783546483511, 51.4196634292603)
import re

# Pull the latitude/longitude pair out of the LatLng(...) call.
s = 'position:new google.maps.LatLng(35.6783546483511,51.4196634292603)'
coords = re.search(r'\(([^,]+),([^)]+)', s)
lat = float(coords.group(1))
lng = float(coords.group(2))
If you want to get Latitude and Longitude separately, use regex expression in this way:
import re

# Extract both numeric groups from LatLng(lat,lng) and convert them to floats.
s = 'position:new google.maps.LatLng(35.6783546483511,51.4196634292603)'
_match = re.search(r'LatLng\(([\d.]+),([\d.]+)\)', s)
Lat, Lng = [float(value) for value in _match.groups()]