Scraping element <script> for strings in Python


I'm trying to check the stock of a size Small on this page (it is currently 0). Specifically, I want to retrieve the inventory quantity for size Small from this data:
<script>
(function($) {
var variantImages = {},
thumbnails,
variant,
variantImage;
variant = {"id":18116649221,"title":"XS","option1":"XS","option2":null,"option3":null,"sku":"BGT16073100","requires_shipping":true,"taxable":true,"featured_image":null,"available":true,"name":"Iron Lords T-Shirt - XS","public_title":"XS","options":["XS"],"price":2499,"weight":136,"compare_at_price":null,"inventory_quantity":16,"inventory_management":"shopify","inventory_policy":"deny","barcode":""};
if ( typeof variant.featured_image !== 'undefined' && variant.featured_image !== null ) {
variantImage = variant.featured_image.src.split('?')[0].replace('http:','');
variantImages[variantImage] = variantImages[variantImage] || {};
if (typeof variantImages[variantImage]["option-0"] === 'undefined') {
variantImages[variantImage]["option-0"] = "XS";
}
else {
var oldValue = variantImages[variantImage]["option-0"];
if ( oldValue !== null && oldValue !== "XS" ) {
variantImages[variantImage]["option-0"] = null;
}
}
}
variant = {"id":18116649285,"title":"Small","option1":"Small","option2":null,"option3":null,"sku":"BGT16073110","requires_shipping":true,"taxable":true,"featured_image":null,"available":false,"name":"Iron Lords T-Shirt - Small","public_title":"Small","options":["Small"],"price":2499,"weight":159,"compare_at_price":null,"inventory_quantity":0,"inventory_management":"shopify","inventory_policy":"deny","barcode":""};
if ( typeof variant.featured_image !== 'undefined' && variant.featured_image !== null ) {
variantImage = variant.featured_image.src.split('?')[0].replace('http:','');
variantImages[variantImage] = variantImages[variantImage] || {};
if (typeof variantImages[variantImage]["option-0"] === 'undefined') {
variantImages[variantImage]["option-0"] = "Small";
}
else {
var oldValue = variantImages[variantImage]["option-0"];
if ( oldValue !== null && oldValue !== "Small" ) {
variantImages[variantImage]["option-0"] = null;
}
}
}
How can I tell Python to locate the <script> tag and then the specific "inventory_quantity":0 entry, so that it returns the stock of the product for size Small?

You can find it using a regex:
import re
s = 'some sample text in which "inventory_quantity":0 appears'
occurrences = re.findall(r'"inventory_quantity":(\d+)', s)
print(occurrences[0])
0
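Edit:
I assume you can get the whole content of <script>...</script> into a variable t (using lxml, xml.etree, BeautifulSoup, or simply re).
Before we start, let's define some names so that the JavaScript literals evaluate in Python:
true = True
false = False
null = None
Then use a regex to find each object literal as text and convert the first one to a dict via eval (only do this with input you trust, since eval executes arbitrary code):
r = re.findall(r'variant = (\{.*\});', t)
if r:
    # findall returns a list of matches, so evaluate a single element
    variant = eval(r[0])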
This is what you get:
>>> variant
{'available': True,
'barcode': '',
'compare_at_price': None,
'featured_image': None,
'id': 18116649221,
'inventory_management': 'shopify',
'inventory_policy': 'deny',
'inventory_quantity': 16,
'name': 'Iron Lords T-Shirt - XS',
'option1': 'XS',
'option2': None,
'option3': None,
'options': ['XS'],
'price': 2499,
'public_title': 'XS',
'requires_shipping': True,
'sku': 'BGT16073100',
'taxable': True,
'title': 'XS',
'weight': 136}
Now you can easily get any information you need.
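As a hedged alternative to eval: the object literals in the question look like valid JSON, so json.loads should be able to parse them directly, without defining true/false/null. The sketch below assumes the script text is already in a string; script_text is a shortened sample and stock_for_size is an illustrative helper name, neither comes from the page itself.

```python
import json
import re

# Shortened sample of the <script> body from the question (assumed input).
script_text = '''
variant = {"id":18116649221,"title":"XS","available":true,"featured_image":null,"inventory_quantity":16};
if ( typeof variant.featured_image !== 'undefined' ) { }
variant = {"id":18116649285,"title":"Small","available":false,"featured_image":null,"inventory_quantity":0};
'''

# One object literal per "variant = {...};" line; "." does not cross newlines.
VARIANT_RE = re.compile(r'variant = (\{.*\});')


def stock_for_size(text, size):
    """Return inventory_quantity for the variant whose title equals size, or None."""
    for raw in VARIANT_RE.findall(text):
        variant = json.loads(raw)  # handles true/false/null without eval
        if variant.get("title") == size:
            return variant.get("inventory_quantity")
    return None


print(stock_for_size(script_text, "Small"))
```

This looks the variant up by its title, so it does not depend on the order in which the sizes appear in the script.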

Neither of the current answers addresses the problem of locating the inventory_quantity for the desired size, which is not straightforward at first glance.
The idea is to avoid heavy string parsing: extract the complete sca_product_info JS object into a Python dict via json.loads(), then filter its list of variants by the desired size. Of course, we first need to locate the desired JS object, and for this we'll use a regular expression. Remember, this is not HTML parsing at this point, so doing it with a regular expression is perfectly fine; this famous answer does not apply in this case.
Complete implementation:
import json
import re
import requests
DESIRED_SIZE = "XS"
pattern = re.compile(r"freegifts_product_json\s*\((.*?)\);", re.MULTILINE | re.DOTALL)
url = "http://bungiestore.com/collections/featured/products/iron-lords-t-shirt-men"
response = requests.get(url)
match = pattern.search(response.text)
# load the extracted string representing the "sca_product_info" JS object into a Python dict
product_info = json.loads(match.group(1))
# look up the desired size in the list of product variants
for variant in product_info["variants"]:
    if variant["title"] == DESIRED_SIZE:
        print(variant["inventory_quantity"])
        break
Prints 16 at the moment.
By the way, we could have also used a JavaScript parser, like slimit - here is a sample working solution:
Extracting text from script tag using BeautifulSoup in Python

Assuming you can get the block of code into a string (mystr below), and assuming the format of the code doesn't change too much, you could do something like this. Note that it returns the quantity of the first variant found in the string:
before = '"inventory_quantity":'
after = ',"inventory_management"'
start = mystr.index(before) + len(before)
# search for the closing marker only after the opening one
end = mystr.index(after, start)
print(mystr[start:end])
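To target a particular size with the same slicing idea, one option is to start the search at that size's "title" entry instead of at the top of the string. This is only a sketch under the assumption that "title" always precedes "inventory_quantity" inside each variant, as it does in the question's data; quantity_for_size is a hypothetical helper name.

```python
def quantity_for_size(text, size):
    """Slice out the inventory_quantity that follows the given size's title."""
    title_marker = '"title":"%s"' % size
    before = '"inventory_quantity":'
    after = ',"inventory_management"'
    # Start searching from the chosen variant, not from the top of the text.
    variant_pos = text.index(title_marker)
    start = text.index(before, variant_pos) + len(before)
    end = text.index(after, start)
    return int(text[start:end])


# Shortened sample in the shape of the question's data (assumed input).
mystr = (
    'variant = {"id":18116649221,"title":"XS","inventory_quantity":16,"inventory_management":"shopify"};'
    'variant = {"id":18116649285,"title":"Small","inventory_quantity":0,"inventory_management":"shopify"};'
)
print(quantity_for_size(mystr, "Small"))
```

str.index raises ValueError when a marker is missing, so a caller may want to catch that if the page layout can change.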

Related

How to check if a Firestore collection exists without knowing document name/ID with Python [duplicate]

Is there a way to check if a sub collection exists in firestore for nodejs?
Currently I am using doc.exists for documents, but I need to check whether a subcollection exists within a document in order to decide whether to write some data.

Yes, there is. You can use docs.length to check whether the subcollection exists.
I made a sample to guide you; hope it helps.
this.db.collection('users').doc('uid')
  .get().then(
    doc => {
      if (doc.exists) {
        // limit(1) belongs on the subcollection query, not on the document read
        this.db.collection('users').doc('uid').collection('friendsSubcollection').limit(1).get().
          then(sub => {
            if (sub.docs.length > 0) {
              console.log('subcollection exists');
            }
          });
      }
    });

Mateus' answer didn't help me; the API has probably changed over time.
.collection(..).get() returns a QuerySnapshot, which has a size property, so I just did:
admin.firestore()
  .collection('users')
  .doc('uid')
  .collection('sub-collection')
  .limit(1)
  .get()
  .then(query => query.size);

To be more precise:
const querySnapshot = await admin.firestore().collection('users').doc('uid').collection('sub-collection').limit(1).get()
if (querySnapshot.empty) { console.log('sub-collection does not exist') }
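Since the thread title asks about Python, here is a rough equivalent of the limit(1) check. It is a sketch that assumes doc_ref is a DocumentReference from the google-cloud-firestore client (for example client.collection("users").document("uid")); subcollection_has_documents is an illustrative name. As far as I know, Firestore does not store a collection separately from its documents, so "exists" here means "contains at least one document".

```python
def subcollection_has_documents(doc_ref, name):
    """Return True if the named subcollection under doc_ref holds at least one document.

    doc_ref is assumed to be a google-cloud-firestore DocumentReference.
    """
    # limit(1) keeps the read cheap: one document is enough to answer the question.
    query = doc_ref.collection(name).limit(1)
    return any(True for _ in query.stream())
```

A call might look like subcollection_has_documents(client.collection("users").document("uid"), "sub-collection"), where client is an already configured Firestore client.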

This is how I was able to check whether a collection exists.
I target the document path first; if the document exists, it means the collection after it exists and I can access it.
db.collection("collection_name").document("doc_name").get()
    .addOnCompleteListener(new OnCompleteListener<DocumentSnapshot>() {
        @Override
        public void onComplete(@NonNull Task<DocumentSnapshot> task) {
            if (task.isSuccessful()) {
                DocumentSnapshot result = task.getResult();
                if (result.exists()) {
                    // This means the document exists, hence the collection
                    // after doc_name will exist; now I can access the collection.
                    db.collection("collection_name").document("doc_name").collection("collection_name2").get()
                        .addOnCompleteListener(task1 -> {
                            if (task1.isSuccessful()) {
                                ...
                            }
                        });
                }
            }
        }
    });

The empty property of a QuerySnapshot is true if there are no documents in the QuerySnapshot.
Thus you can simply check whether empty is true or false.
import { collection, getDocs } from "firebase/firestore";
const subcolRef = collection(db, "parentCollectionTitle", "parentDocId", "subcollectionTitle")
const subcolSnapshot = await getDocs(subcolRef)
if (!subcolSnapshot.empty) {
  console.log("subcol does exist!");
} else {
  console.log("subcol does NOT exist!");
}
(Firebase v9)

This is Next.js (React) code for checking whether a sub-collection "history" exists under collection "users" > doc > user ID.
If it exists, the data in history is loaded; otherwise havehistory stays false.
You can then use {havehistory ? <></> : <></>} to show different info depending on the data.
const [history, setHistory] = useState([])
const [havehistory, setHavehistory] = useState(false)
if (user) {
  onSnapshot(query(collection(db, "users", user.uid, "history")), (querySnapshot) => {
    // an empty snapshot means the sub-collection has no documents
    if (!querySnapshot.empty) {
      const historyBro = querySnapshot.docs.map((doc) => {
        return { ...doc.data(), id: doc.id };
      });
      setHistory(historyBro)
      setHavehistory(true)
    }
  })
}
Make sure you have imported the required modules, e.g.
import { useState } from "react";
import { db } from '../firebase'
import { collection, doc, query, onSnapshot } from "firebase/firestore";
import Link from "next/link";

Pattern for re not retrieving any results

I'm trying to create a re pattern in python to extract this pattern of text.
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08'
contentId: 'a887526b-ff19-4409-91ff-e1679e418922'
The content ID is 36 characters long and is a mix of lowercase letters and digits, with dashes at (zero-based) positions 8, 13, 18, and 23.
Any help with this would be much appreciated as I just can't seem to get the results right now.
r1 = re.findall(r'^[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]*{36}$',f.read())
print(r1)
Below is the file I'm trying to pull from
Object.defineProperty(e, '__esModule', { value: !0 }), e.default = void 0;
var t = r(d[0])(r(d[1])), n = r(d[0])(r(d[2])), o = r(d[0])(r(d[3])), c = r(d[0])(r(d[4])), l = r(d[0])(r(d[5])), u = function (t) {
return [
{
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',
prettyId: 'super',
style: { height: 0.5 * t }
},
{
contentId: 'a887526b-ff19-4409-91ff-e1679e418922',
prettyId: 'zap',
style: { height: t }
}
];
},

Is there a typo in the regex in your question? *{36} after the bracket ] that closes the character group causes an error: multiple repeat. Did you mean r'^[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]{36}$'?
Fixing that, you get no results because ^ anchors the match to the start of the line, and $ to the end of the line, so you'd only get results if this pattern was alone on a single line.
Removing these anchors, we get lots of matches because it matches any string of those characters that is 36-long:
r1 = re.findall(r'[a-zA-Z0-9~##$^*()_+=[\]{}|\\,.?: -]{36}',t)
r1: ['var t = r(d[0])(r(d[1])), n = r(d[0]',
')(r(d[2])), o = r(d[0])(r(d[3])), c ',
'= r(d[0])(r(d[4])), l = r(d[0])(r(d[',
'2301ae56-3b9c-4653-963b-2ad84d06ba08',
' style: { height: 0.5',
'a887526b-ff19-4409-91ff-e1679e418922',
' style: { height: t }']
To only match your ids, only look for alphanumeric characters or dashes.
r1 = re.findall(r'[a-zA-Z0-9\-]{36}',t)
r1: ['2301ae56-3b9c-4653-963b-2ad84d06ba08',
'a887526b-ff19-4409-91ff-e1679e418922']
To make it even more specific, you could specify the positions of the dashes:
r1 = re.findall(r'[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}', t, re.IGNORECASE)
r1: ['2301ae56-3b9c-4653-963b-2ad84d06ba08',
'a887526b-ff19-4409-91ff-e1679e418922']
Specifying the re.IGNORECASE flag removes the need to look for both upper- and lower-case characters.
Note:
You should read the file into a variable and use that variable if you're going to use its contents more than once, since f.read() won't give anything after the first .read() unless you f.seek(0)
To avoid creating a new file on disk with those contents, I just defined
t = """Object.defineProperty(e, '__esModule', { value: !0 }), e.default = void 0;
var t = r(d[0])(r(d[1])), n = r(d[0])(r(d[2])), o = r(d[0])(r(d[3])), c = r(d[0])(r(d[4])), l = r(d[0])(r(d[5])), u = function (t) {
return [
{
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',
prettyId: 'super',
style: { height: 0.5 * t }
},
{
contentId: 'a887526b-ff19-4409-91ff-e1679e418922',
prettyId: 'zap',
style: { height: t }
}
];
},"""
and used t in place of f.read() from your question.
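Putting the pieces together, a slightly stricter variant is to anchor the match on the contentId key itself, so that other 36-character tokens in the file cannot match. This is a minimal sketch; text is a shortened stand-in for the file contents and CONTENT_ID_RE is an illustrative name.

```python
import re

# Sample text in the shape shown in the question (assumed input).
text = """
contentId: '2301ae56-3b9c-4653-963b-2ad84d06ba08',
prettyId: 'super',
contentId: 'a887526b-ff19-4409-91ff-e1679e418922',
prettyId: 'zap',
"""

# 8-4-4-4-12 groups of letters/digits, captured only when they follow "contentId:".
CONTENT_ID_RE = re.compile(
    r"contentId:\s*'([a-z0-9]{8}(?:-[a-z0-9]{4}){3}-[a-z0-9]{12})'",
    re.IGNORECASE,
)

content_ids = CONTENT_ID_RE.findall(text)
print(content_ids)
```

Because the pattern has a single capturing group, findall returns just the IDs without the contentId prefix or the quotes.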

Regular Expression find a match in JS objects

I'm scraping a site, and the data I want is inside a script tag of an HTML page. I wrote a regex to find a match, but it seems I'm doing it the wrong way.
Hub = {};
Hub.config = {
config: {},
get: function(key) {
if (key in this.config) {
return this.config[key];
} else {
return null;
}
},
set: function(key, val) {
this.config[key] = val;
}
};
Hub.config.set('sku', {
valCartInfo : {
itemId : '576938415361',
cartUrl: '//cart.mangolane.com/cart.htm'
},
apiRelateMarket : '//tui.mangolane.com/recommend?appid=16&count=4&itemid=576938415361',
apiAddCart : '//cart.mangolane.com/add_cart_item.htm?item_id=576938415361',
apiInsurance : '',
wholeSibUrl : '//detailskip.mangolane.com/service/getData/1/p1/item/detail/sib.htm?itemId=576938415361&sellerId=499095250&modules=dynStock,qrcode,viewer,price,duty,xmpPromotion,delivery,upp,activity,fqg,zjys,amountRestriction,couponActivity,soldQuantity,page,originalPrice,tradeContract',
areaLimit : '',
bigGroupUrl : '',
valPostFee : '',
coupon : {
couponApi : '//detailskip.mangolane.com/json/activity.htm?itemId=576938415361&sellerId=499095250',
couponWidgetDomain: '//assets.mgcdn.com',
cbUrl : '/cross.htm?type=weibo'
},
valItemInfo : {
defSelected: -1,
skuMap : {";20549:103189693;1627207:811754571;":{"price":"528.00","stock":"2","skuId":"4301611864655","oversold":false},
";20549:59280855;1627207:412796441;":{"price":"528.00","stock":"2","skuId":"4432149803707","oversold":false},
";20549:59280855;1627207:196576508;":{"price":"528.00","stock":"2","skuId":"4018119863100","oversold":false},
";20549:72380707;1627207:28341;":{"price":"528.00","stock":"2","skuId":"4166690818570","oversold":false},
";20549:418624880;1627207:28341;":{"price":"528.00","stock":"2","skuId":"4166690818566","oversold":false},
";20549:418624880;1627207:196576508;":{"price":"528.00","stock":"2","skuId":"4018119863098","oversold":false},
";20549:72380707;1627207:3224419;":{"price":"528.00","stock":"2","skuId":"4166690818571","oversold":false},
";20549:147478970;1627207:196576508;":{"price":"528.00","stock":"2","skuId":"4018119863094","oversold":false},
";20549:72380707;1627207:384366805;":{"price":"528.00","stock":"2","skuId":"4432149803708","oversold":false},
";20549:296172561;1627207:811754571;":{"price":"528.00","stock":"2","skuId":"4301611864659","oversold":false},
";20549:72380707;1627207:1150336209;":{"price":"528.00","stock":"2","skuId":"4301611864664","oversold":false},
";20549:147478970;1627207:93586002;":{"price":"528.00","stock":"2","skuId":"4018119863095","oversold":false}}
,propertyMemoMap: {"1627207:811754571":"黑色单里（预售） 年后2.29发货","1627207:93586002":"黑色加绒 现货","1627207:412796441":"黑色（兔毛） 现货","1627207:384366805":"米白色（兔毛） 现货","1627207:3224419":"驼色 现货","1627207:1150336209":"驼色单里（预售） 年后2.29发货","1627207:28341":"黑色 现货","1627207:196576508":"驼色加绒 现货"}
}
});
I need to get only the data passed to Hub.config.set('sku', ...).
I tried this, but it didn't work:
config_base_str = re.findall("Hub.config.set ({[\s\S]*?});", config)
where config is the string of data.

The period and parenthesis have a special meaning in regex. If you want to search for the literal characters, you will need to escape them first with a backslash.
For example assuming the string:
config = """
Hub.config.set('sku', {
valCartInfo : {
itemId : '576938415361',
cartUrl: '//cart.mangolane.com/cart.htm'
},
.........
});
"""
If you only want the key, you can do something like this (raw strings keep the backslashes intact):
config_base_str = re.findall(r"Hub\.config\.set\('(\w*)", config) # ['sku']
If you want everything after the key within the braces, you can do something like this instead:
config_base_str = re.findall(r"Hub\.config\.set\('\w*',\s*({[\s\S]*})", config) # ["{\n valCartInfo : {} ...}"]
https://regex101.com/r/QHdaG2/3/
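The captured text is a JavaScript object literal with unquoted keys, so json.loads will not accept it as a whole. The skuMap value inside it does look like plain JSON, though, and one way to pull out just that part is json.JSONDecoder.raw_decode, which parses a JSON value starting at a given index and ignores whatever follows. The sketch below uses a shortened copy of the question's data; extract_json_after is a hypothetical helper name.

```python
import json

# Shortened sample of the script body from the question (assumed input).
config = """
Hub.config.set('sku', {
    valItemInfo : {
        defSelected: -1,
        skuMap : {";20549:103189693;1627207:811754571;":{"price":"528.00","stock":"2","skuId":"4301611864655","oversold":false},
";20549:59280855;1627207:412796441;":{"price":"528.00","stock":"2","skuId":"4432149803707","oversold":false}}
        ,propertyMemoMap: {"1627207:811754571":"x"}
    }
});
"""


def extract_json_after(text, key):
    """Decode the JSON object that starts at the first '{' after key."""
    start = text.index("{", text.index(key))
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


sku_map = extract_json_after(config, "skuMap")
print(len(sku_map))
```

This avoids having to write a regex that balances nested braces; it only works for the parts of the script that are strict JSON.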

Python Hug REST API consumed in .NET, JSON looks weird

When consuming a Hug REST endpoint from .NET, the JSON has embedded escape characters. A complete failing example is posted below. Any help is greatly appreciated.
Python
@hug.post('/test')
def test(response, body=None):
    input = body.get('input')
    print('INSIDE TEST ' + input)
    if input:
        dict = {"lastname":"Jordan"}
        dict["firstname"] = input
        return json.dumps(dict, sort_keys=True, default=str)
.NET (can only use .net 3.5)
private static object GetParsedData(string data)
{
var posturl = "http://localhost:8000/test";
try
{
using (var client = new WebClient())
{
// upload values is the POST verb
var values = new NameValueCollection()
{
{ "input", data },
};
var response = client.UploadValues(posturl, values);
var responseString = Encoding.UTF8.GetString(response);
var settings = new JsonSerializerSettings
{
NullValueHandling = NullValueHandling.Ignore,
MissingMemberHandling = MissingMemberHandling.Ignore
};
JObject rss = JObject.Parse(responseString);
Console.WriteLine((string)rss["lastname"]);
}
}
catch (WebException ex)
{
if (ex.Response is HttpWebResponse)
{
var code = ((HttpWebResponse)ex.Response).StatusCode;
var desc = ((HttpWebResponse)ex.Response).StatusDescription;
}
//_logger.Error(ex.Message);
}
return false;
}
responseString looks like this:
"\"{\\\"firstname\\\": \\\"Mike\\\", \\\"lastname\\\": \\\"Jordan\\\"}\""
JObject.Parse throws error:
Newtonsoft.Json.JsonReaderException:
'Error reading JObject from JsonReader. Current JsonReader item is not an object: String. Path '', line 1, position 53.
Workaround - If I do something horrible like this to responseString JObject parses correctly:
str = str.Replace("\\", "");
str = str.Substring(1, len - 2);
What's going on?

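The default hug output format is JSON; it is not necessary to call json.dumps on return values, since hug will do this automatically. Because the handler already returns a JSON string, that string gets serialized a second time, which is where the extra quotes and backslashes come from. Return the dict itself instead.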
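The double encoding can be reproduced with the json module alone, without hug. The snippet below is only an illustration of the effect; the variable names are made up for the example.

```python
import json

payload = {"firstname": "Mike", "lastname": "Jordan"}

# What the handler in the question returns: a str that is already JSON.
encoded_once = json.dumps(payload, sort_keys=True)

# Serializing that str again (as a JSON-emitting framework would) wraps it
# in quotes and escapes the inner quotes.
encoded_twice = json.dumps(encoded_once)

print(encoded_once)
print(encoded_twice)

# Decoding once yields a string, not an object; a second decode is needed.
decoded_once = json.loads(encoded_twice)
print(type(decoded_once).__name__)
```

This matches the error in the question: the parser sees a JSON string where it expected an object.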

Best way to refresh graph without page refresh (Python Django, ajax)

A bit of a general question - I am looking for ways to refresh a graph on a Django page based on user choices. The page has a graph, a few drop boxes where you can select parameters and a refresh button. Currently, I can capture the selections via ajax to my Django view and generate new data from database for the graph. I now need to feed that newly-generated data back into the graph and refresh it without a page refresh. Could anyone recommend the best methods of doing this?

Use jQuery to refresh the graph without refreshing the page.
I am using Chart.js to create the graph. First create the chart; then, on the change event, fetch the updated data with an Ajax call and assign the values to the chart's datasets.
/** Graph Start Here */
window.chart = null;
$(document).on('change', '.graph-year-earning', function () {
var year = $(this).val();
$.get($('.graph-ajaxload-context').data('href'), { 'year': year, 'number': Math.floor(Math.random() * (1000000 - 10 + 1) + 10) }, function (response) {
window.chart.data.labels = response.labels;
window.chart.data.datasets[0].soldProductLabel = response.product_sold_label;
window.chart.data.datasets[0].totalCommissionLabel = response.monthly_commission_label;
window.chart.data.datasets[0].dataLabel = response.your_share_label;
if (response.total_commission == 0) {
window.chart.options.scales.yAxes[0].ticks.suggestedMin = 0;
window.chart.options.scales.yAxes[0].ticks.suggestedMax = 140000;
} else {
window.chart.options.scales.yAxes[0].ticks.suggestedMin = '';
window.chart.options.scales.yAxes[0].ticks.suggestedMax = '';
}
$.each(response.data, function (index, value) {
window.chart.data.datasets[0].soldProduct[index] = value[2];
window.chart.data.datasets[0].data[index] = Math.round(value[0]);
});
window.chart.update();
$(".txt-total-commission-by-year").html(response.total_commission)
$('.graph-ajaxload-context .inline-loader').hide();
});
});
if ($('.graph-ajaxload-context').length > 0) {
showLoader()
$('.graph-year-earning').trigger('change');
var ctx = $('#userEarningGraph');
window.chart = new Chart(ctx, {
type: 'bar',
data: {
labels: [],
datasets: [{
soldProductLabel: '',
soldProduct: [],
dataLabel: '',
data: [],
backgroundColor: '#ADAEB1',
hoverBackgroundColor: '#48C6B9'
}]
},
options: {
legend: {
display: false
},
scales: {
yAxes: [{
ticks: {
beginAtZero: true,
maxTicksLimit: 8,
userCallback: function (value, index, values) {
value = value.toString();
value = value.split(/(?=(?:...)*$)/);
value = value.join(',');
var currency_code = ' ₩'
if ($('.graph-ajaxload-context').data('currency-code') && $('.graph-ajaxload-context').data('currency-code') != 'None') {
currency_code = $('.graph-ajaxload-context').data('currency-code')
}
return value + ' ' + currency_code;
}
},
}]
},
tooltips: {
mode: 'label',
callbacks: {
label: function (tooltipItem, data) {
var soldProduct = data.datasets[tooltipItem.datasetIndex].soldProduct[tooltipItem.index];
var soldProductLabel = data.datasets[tooltipItem.datasetIndex].soldProductLabel;
var dataPro = data.datasets[tooltipItem.datasetIndex].data[tooltipItem.index];
var dataLabel = data.datasets[tooltipItem.datasetIndex].dataLabel;
return [soldProductLabel + ':' + soldProduct, dataLabel + ':' + dataPro + ' ₩',];
}
}
}
}
});
}
$(document).on('click', '.showgraph', function (e) {
$('.graph-year-earning').trigger('change');
});
/** Graph End Here */
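On the Django side, the view only needs to return the new data as JSON for the callback above to consume. Below is a minimal sketch of building such a payload in plain Python; the field names (labels, data, total) and the rows sample are assumptions for illustration and are simpler than the response shape used in the JavaScript above. In a real view the dict would typically be returned with django.http.JsonResponse.

```python
import json


def build_chart_payload(rows):
    """Turn (label, value) rows into a labels/data dict for a chart update."""
    labels = [label for label, _value in rows]
    data = [round(value) for _label, value in rows]
    return {"labels": labels, "data": data, "total": sum(data)}


# Hypothetical query result for the selected year.
rows = [("Jan", 120.4), ("Feb", 98.6), ("Mar", 143.0)]
payload = build_chart_payload(rows)
print(json.dumps(payload))
```

Keeping the payload builder separate from the view makes it easy to test without a running server.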
