I'm scraping a site, and the data I want is inside a script tag of the HTML page. I wrote a regular expression to match it, but it seems I am doing it the wrong way.
Hub = {};
Hub.config = {
config: {},
get: function(key) {
if (key in this.config) {
return this.config[key];
} else {
return null;
}
},
set: function(key, val) {
this.config[key] = val;
}
};
Hub.config.set('sku', {
valCartInfo : {
itemId : '576938415361',
cartUrl: '//cart.mangolane.com/cart.htm'
},
apiRelateMarket : '//tui.mangolane.com/recommend?appid=16&count=4&itemid=576938415361',
apiAddCart : '//cart.mangolane.com/add_cart_item.htm?item_id=576938415361',
apiInsurance : '',
wholeSibUrl : '//detailskip.mangolane.com/service/getData/1/p1/item/detail/sib.htm?itemId=576938415361&sellerId=499095250&modules=dynStock,qrcode,viewer,price,duty,xmpPromotion,delivery,upp,activity,fqg,zjys,amountRestriction,couponActivity,soldQuantity,page,originalPrice,tradeContract',
areaLimit : '',
bigGroupUrl : '',
valPostFee : '',
coupon : {
couponApi : '//detailskip.mangolane.com/json/activity.htm?itemId=576938415361&sellerId=499095250',
couponWidgetDomain: '//assets.mgcdn.com',
cbUrl : '/cross.htm?type=weibo'
},
valItemInfo : {
defSelected: -1,
skuMap : {";20549:103189693;1627207:811754571;":{"price":"528.00","stock":"2","skuId":"4301611864655","oversold":false},
";20549:59280855;1627207:412796441;":{"price":"528.00","stock":"2","skuId":"4432149803707","oversold":false},
";20549:59280855;1627207:196576508;":{"price":"528.00","stock":"2","skuId":"4018119863100","oversold":false},
";20549:72380707;1627207:28341;":{"price":"528.00","stock":"2","skuId":"4166690818570","oversold":false},
";20549:418624880;1627207:28341;":{"price":"528.00","stock":"2","skuId":"4166690818566","oversold":false},
";20549:418624880;1627207:196576508;":{"price":"528.00","stock":"2","skuId":"4018119863098","oversold":false},
";20549:72380707;1627207:3224419;":{"price":"528.00","stock":"2","skuId":"4166690818571","oversold":false},
";20549:147478970;1627207:196576508;":{"price":"528.00","stock":"2","skuId":"4018119863094","oversold":false},
";20549:72380707;1627207:384366805;":{"price":"528.00","stock":"2","skuId":"4432149803708","oversold":false},
";20549:296172561;1627207:811754571;":{"price":"528.00","stock":"2","skuId":"4301611864659","oversold":false},
";20549:72380707;1627207:1150336209;":{"price":"528.00","stock":"2","skuId":"4301611864664","oversold":false},
";20549:147478970;1627207:93586002;":{"price":"528.00","stock":"2","skuId":"4018119863095","oversold":false}}
,propertyMemoMap: {"1627207:811754571":"黑色单里(预售) 年后2.29发货","1627207:93586002":"黑色加绒 现货","1627207:412796441":"黑色(兔毛) 现货","1627207:384366805":"米白色(兔毛) 现货","1627207:3224419":"驼色 现货","1627207:1150336209":"驼色单里(预售) 年后2.29发货","1627207:28341":"黑色 现货","1627207:196576508":"驼色加绒 现货"}
}
});
I need to get only the data passed to Hub.config.set('sku', ...).
I tried this, but it didn't work:
config_base_str = re.findall("Hub.config.set ({[\s\S]*?});", config)
where config is the string of data.
The period and parentheses have special meanings in regex. To match the literal characters, you need to escape them with a backslash. Your pattern also expects a space followed directly by { after set, whereas the source has set('sku', { at that point.
For example assuming the string:
config = """
Hub.config.set('sku', {
valCartInfo : {
itemId : '576938415361',
cartUrl: '//cart.mangolane.com/cart.htm'
},
.........
});
"""
If you only want the key, you can do something like this:
config_base_str = re.findall(r"Hub\.config\.set\('(\w*)", config) # ['sku']
If you want everything after the key within the brackets, you can do something like this instead:
config_base_str = re.findall(r"Hub\.config\.set\('\w*',\s*({[\s\S]*})", config) # ["{\n valCartInfo : {} ...}"]
https://regex101.com/r/QHdaG2/3/
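As a quick check of the escaping advice above, here is a minimal sketch that captures both the key and the object literal in one pass. The sample string is a shortened stand-in for the real script contents, and the lazy match assumes that the text }); does not occur inside the object itself.

```python
import re

# Shortened stand-in for the page's <script> contents (hypothetical sample).
config = """
Hub.config.set('sku', {
    valCartInfo : {
        itemId : '576938415361',
        cartUrl: '//cart.mangolane.com/cart.htm'
    },
    apiInsurance : ''
});
"""

# Escaped dots and parentheses match literally; [\s\S]*? spans newlines lazily
# and stops at the first closing brace that is followed by ");".
pattern = re.compile(r"Hub\.config\.set\('(\w+)',\s*(\{[\s\S]*?\})\s*\);")

key = body = None
match = pattern.search(config)
if match:
    key, body = match.group(1), match.group(2)
    print(key)
    print(body)
```

Note that the captured text is a JavaScript object literal (unquoted keys, single-quoted strings), so it will generally not parse with json.loads without further cleanup.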
Related
I have a MongoDB NoSQL database, the name is baike, there is a collection named baike_items with the following format:
id:
title:
baike_id
page_url
text
All other fields are fine except the page_url. Some of the urls are normal like:
'https://baike.baidu.hk/item/%E5%A5%91%E4%B8%B9%E6%97%8F/2390374'
But some URLs end with the string #viewPageContent, like:
https://baike.baidu.hk/item/%E5%E6%97%8F/11435374#viewPageContent
My intention is to write a MongoDB query that removes the #viewPageContent string from all the URLs while keeping the rest of each string.
https://baike.baidu.hk/item/123#viewPageContent
https://baike.baidu.hk/item/456#viewPageContent
.
.
.
to
https://baike.baidu.hk/item/123
https://baike.baidu.hk/item/456
.
.
.
Any suggestions? thanks.
The following PyMongo call should do it. It uses an aggregation-pipeline update with $replaceOne, so it needs MongoDB 4.4 or newer.
db.baike_items.update_many(
{ "page_url": { "$regex": "#viewPageContent"} },
[{
"$set": { "page_url": {
"$replaceOne": { "input": "$page_url", "find": "#viewPageContent", "replacement": "" }
}}
}]
)
old_url = "https://baike.baidu.hk/item/%E7%89%A9%E7%90%86%E5%85%89%E5%AD%B8/61334055#viewPageContent"
new_url = old_url.replace("#viewPageContent", "")
print(old_url)
>>> https://baike.baidu.hk/item/%E7%89%A9%E7%90%86%E5%85%89%E5%AD%B8/61334055#viewPageContent
print(new_url)
>>> https://baike.baidu.hk/item/%E7%89%A9%E7%90%86%E5%85%89%E5%AD%B8/61334055
import re
a = "https://baike.baidu.hk/item/%E7%89%A9%E7%90%86%E5%85%89%E5%AD%B8/61334055#viewPageContent"
print(re.sub(r"#viewPageContent", '', a))
output: https://baike.baidu.hk/item/%E7%89%A9%E7%90%86%E5%85%89%E5%AD%B8/61334055
Hope I could help you!
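Both str.replace and re.sub as used above remove the text wherever it occurs in the string. If the marker should only be removed when it is a trailing suffix, a small helper along these lines may be safer. This is a sketch; the function name strip_view_fragment is made up for illustration.

```python
def strip_view_fragment(url, suffix="#viewPageContent"):
    """Remove the suffix only when it appears at the very end of the URL."""
    if url.endswith(suffix):
        return url[: -len(suffix)]
    return url


urls = [
    "https://baike.baidu.hk/item/123#viewPageContent",
    "https://baike.baidu.hk/item/456",
]
cleaned = [strip_view_fragment(u) for u in urls]
print(cleaned)
```

The same idea applies on the MongoDB side: anchoring the filter as "#viewPageContent$" limits the update to documents where the marker is at the end of page_url.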
I have a JSON file where I need to replace the UUID and update it with another one. I'm having trouble replacing the deeply nested keys and values.
Below is my JSON file that I need to read in python, replace the keys and values and update the file.
JSON file - myfile.json
{
"name": "Shipping box"
"company":"Detla shipping"
"description":"---"
"details" : {
"boxes":[
{
"box_name":"alpha",
"id":"a3954710-5075-4f52-8eb4-1137be51bf14"
},
{
"box_name":"beta",
"id":"31be3763-3d63-4e70-a9b6-d197b5cb6929"
}
]
}
"container": [
"a3954710-5075-4f52-8eb4-1137be51bf14":[],
"31be3763-3d63-4e70-a9b6-d197b5cb6929":[]
]
"data":[
{
"data_series":[],
"other":50
},
{
"data_series":[],
"other":40
},
{
"data_series":
{
"a3954710-5075-4f52-8eb4-1137be51bf14":
{
{
"dimentions":[2,10,12]
}
},
"31be3763-3d63-4e70-a9b6-d197b5cb6929":
{
{
"dimentions":[3,9,12]
}
}
},
"other":50
}
]
}
I want to achieve something like the following:
"details" : {
"boxes":[
{
"box_name":"alpha"
"id":"replace_uuid"
},
}
.
.
.
"data":[ {
"data_series":
{
"replace_uuid":
{
{
"dimentions":[2,10,12]
}
}
]
In a deeply nested dictionary like this, how can we replace every occurrence of a UUID, whether it appears as a key or as a value, with another string (here replace_uuid)?
I tried with pop() and dotty_dict but I wasn't able to replace the nested list.
I was able to achieve it in the following way:
import json
import uuid

def uuid_change():  # generate a random UUID string
    return str(uuid.uuid4())

with open('myfile.json') as f:
    data = json.load(f)  # named data to avoid shadowing the built-in dict

for box in data['details']['boxes']:
    old_id = box['id']
    replace_id = uuid_change()
    box['id'] = replace_id
    # 'container' is treated as a list of dicts keyed by UUID
    for entry in data['container']:
        if old_id in entry:
            entry[replace_id] = entry.pop(old_id)  # replace the key
    # look the key up directly instead of mutating the dict while iterating it
    series = data['data'][2]['data_series']
    if old_id in series:
        series[replace_id] = series.pop(old_id)  # replace the key
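The approach above hard-codes where the UUIDs live. A more general option is a recursive walk that swaps matching strings wherever they appear, as keys or as values. The following is only a sketch: replace_ids is a hypothetical helper, and doc is a trimmed-down, valid stand-in for the file in the question.

```python
import json
import uuid


def replace_ids(node, mapping):
    """Return a copy of node with mapped strings swapped in keys and values."""
    if isinstance(node, dict):
        return {mapping.get(k, k): replace_ids(v, mapping) for k, v in node.items()}
    if isinstance(node, list):
        return [replace_ids(item, mapping) for item in node]
    if isinstance(node, str):
        return mapping.get(node, node)
    return node  # numbers, booleans and None are left unchanged


old_id = "a3954710-5075-4f52-8eb4-1137be51bf14"
# Hypothetical, trimmed-down document with the same kind of nesting.
doc = {
    "details": {"boxes": [{"box_name": "alpha", "id": old_id}]},
    "container": [{old_id: []}],
    "data": [{"data_series": {old_id: {"dimentions": [2, 10, 12]}}, "other": 50}],
}

mapping = {old_id: str(uuid.uuid4())}  # old UUID -> freshly generated UUID
updated = replace_ids(doc, mapping)
print(json.dumps(updated, indent=2))
```

Because the function builds a new structure instead of editing in place, it avoids changing a dictionary while iterating over it, and the original doc stays available for comparison.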
Following on from "Update json nodes in Python using jsonpath", I would like to know how to update the JSON data based on the current value of a node.
So, say we pick the exact same JSON example:
{
"SchemeId": 10,
"nominations": [
{
"nominationId": 1
}
]
}
But this time I would like to double the original value, so I need some lambda function that takes the current node value into account.
No need for lambdas; for example, to double SchemeId, something like this should work:
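import json
from jsonpath_ng import parse  # assuming the jsonpath-ng package

data = json.loads('{"SchemeId": 10, "nominations": [{"nominationId": 1}]}')  # the JSON above
jsonpath_expr = parse('$.SchemeId')
val = jsonpath_expr.find(data)[0].value  # current value of the matched node
jsonpath_expr.update(data, val * 2)  # write back the doubled value
print(json.dumps(data, indent=2))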
Output:
{
"SchemeId": 20,
"nominations": [
{
"nominationId": 1
}
]
}
Here is an example with a lambda expression:
import json
from jsonpath_ng import parse
settings = '''{
"choices": {
"atm": {
"cs": "Strom",
"en": "Tree"
},
"bar": {
"cs": "Dům",
"en": "House"
},
"sea": {
"cs": "Moře",
"en": "Sea"
}
}
}'''
json_data = json.loads(settings)
pattern = parse('$.choices.*')
def magic(f: dict, to_lang='cs'):
    return f[to_lang]

pattern.update(json_data,
               lambda data_field, data, field: data.update({field: magic(data[field])}))
json_data
returns
{
'choices': {
'atm': 'Strom',
'bar': 'Dům',
'sea': 'Moře'
}
}
How can I make a string from json text when the json text contains many, many quotation marks and string escapes?
For example, the following works:
json_string = """
{
"styles":[
{
"label":"Style",
"target":{
"label":"Target"
},
"overrides":{
"materialProperties":{
"CRYPTO_ID":{
"script":{
"binding":"name"
}
}
}
}
}
]
}
"""
However this does not, due to the escapes:
new_string = """
{
"styles":[
{
"label":"Style",
"target":{
"label":"Target",
"objectName":"*"
},
"overrides":{
"materialProperties":{
"perObj":{
"script":{
"code":"cvex myFn(string myObj=\"\"; export string perObj=\"\") { perObj = myObj; } ",
"bindings":{
"myObj":"myObj"
}
}
}
}
}
}
]
}
"""
Is there a smart way to break this up? I've had no luck breaking it out into chunks and re-assembling to form the same thing when joined and printed.
Your string per se is valid JSON. However, Python treats \ as an escape character inside a normal string literal, so each \" is turned into a bare " before json.loads ever sees it.
Use a raw string by prefixing your string with r:
import json
new_string = r"""
{
"styles":[
{
"label":"Style",
"target":{
"label":"Target",
"objectName":"*"
},
"overrides":{
"materialProperties":{
"perObj":{
"script":{
"code":"cvex myFn(string myObj=\"\"; export string perObj=\"\") { perObj = myObj; } ",
"bindings":{
"myObj":"myObj"
}
}
}
}
}
}
]
}
"""
json.loads( new_string )
Or escape your \ characters:
import json
new_string = """
{
"styles":[
{
"label":"Style",
"target":{
"label":"Target",
"objectName":"*"
},
"overrides":{
"materialProperties":{
"perObj":{
"script":{
"code":"cvex myFn(string myObj=\\"\\"; export string perObj=\\"\\") { perObj = myObj; } ",
"bindings":{
"myObj":"myObj"
}
}
}
}
}
}
]
}
"""
json.loads( new_string )
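To see the difference between the two literal forms in isolation, here is a small sketch using a one-line JSON document. The shortened "code" value is invented for illustration.

```python
import json

# Normal literal: Python turns \" into a bare quote before json sees it.
plain_text = '{"code": "myFn(string myObj=\"\")"}'
# Raw literal: the backslashes survive, so the JSON escapes stay intact.
raw_text = r'{"code": "myFn(string myObj=\"\")"}'

try:
    json.loads(plain_text)
    plain_ok = True
except json.JSONDecodeError:
    plain_ok = False

parsed = json.loads(raw_text)
print(plain_ok)
print(parsed["code"])
```

Only the raw version parses; the plain version reaches the JSON parser with unescaped quotes inside the string value and is rejected.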
I would recommend reading from an actual JSON file rather than embedding it into your Python code:
with open('path/to/file.json') as f:
    json_string = f.read()
Or, if you need the JSON parsed into Python objects (dicts, lists etc.):
import json
with open('path/to/file.json') as f:
    json_data = json.load(f)
I'm retrieving a document like this:
user = db.users.find_one( { '_id' : ObjectId( 'anID' ) } )
But I can't figure out how to update the document if I want to change the value of 'gender'. This doesn't work:
newValue = {
'gender' : gender
}
db.users.update( user, newValue, False )
Is my syntax wrong? What's the best way to update user?
Your update syntax is not correct. It should be:
update(spec, document, upsert=False, multi=False, ...)
Where spec is the same filter that you used for the find, i.e. { '_id' : ObjectId( 'anID' ) }
You can either update the document by replacing it with a modified document or use a targeted update to change only a certain value. The advantage of the targeted update is that it saves you the first round trip to the server to get the user document.
Replacement update:
user = db.users.find_one( { '_id' : ObjectId( 'anID' ) } )
user['gender'] = newGender
db.users.update( { '_id' : user['_id'] }, user, False)
Targeted update:
db.users.update( { '_id' : ObjectId( 'anID' ) }, \
{ '$set': { 'gender' : newGender } }, False )
If you don't want to replace the entire document you should use the $set operator as:
db.users.update( { '_id': user['_id'] }, { '$set': newValue }, False )
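The answers above use Collection.update, which was deprecated in PyMongo 3 and removed in PyMongo 4. On current versions, a targeted update can be written with update_one, roughly as sketched below. The helper name set_gender is made up for illustration, and users is assumed to be a PyMongo collection such as db.users.

```python
def set_gender(users, user_id, new_gender):
    """Set only the 'gender' field of one user document.

    users is expected to be a PyMongo Collection (for example db.users),
    and user_id the document's _id, already converted to an ObjectId.
    Returns the number of documents that were actually modified.
    """
    result = users.update_one(
        {'_id': user_id},                  # same filter as in find_one
        {'$set': {'gender': new_gender}},  # change just this field
    )
    return result.modified_count
```

For the replacement-style update, replace_one(filter, document) plays the corresponding role.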