Django case-insensitive search in a multilevel JSONB field using ORM methods - python

Here is my sample JSONB field:
{
"name": "XXXXX",
"duedate": "Wed Aug 31 2022 17:23:13 GMT+0530",
"structure": {
"sections": [
{
"id": "0",
"temp_id": 9,
"expanded": true,
"requests": [
{
"title": "entity onboarding form", # I need to lookup at this level (or key)
"agents": [
{
"email": "ak#xxx.com",
"user_name": "Akhild",
"review_status": 0
}
],
"req_id": "XXXXXXXX",
"status": "Requested"
},
{
"title": "onboarding", # I need to lookup at this level (or key)
"agents": [
{
"email": "adak#xxx.com",
"user_name": "gaajj",
"review_status": 0
}
],
"req_id": "XXXXXXXX",
"status": "Requested"
}
],
"labelText": "Pan Card"
}
]
},
"Agentnames": "",
"clientowners": "Admin",
"collectionname": "Bank_duplicate"
}
In this JSON I need to do a case-insensitive match on structure -> sections -> requests (array) -> title inside each object of the requests array.
I have tried this query filter:
(Q(requests__structure__sections__contains=[{'requests': [{"title": query}]}]))
but it is case sensitive. I have also tried
self.get_queryset().annotate(search=SearchVector(Cast('requests__structure__sections', TextField())))
which does give a case-insensitive result, but it also matches keys other than title.
I also tried raw SQL, where I could not get beyond the requests array.
I'm looking for any other method or approach in the Django ORM that can achieve the required result.
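One approach worth sketching (an illustration, not a verified answer) is to push the traversal of the nested arrays into PostgreSQL with a raw query: unnest sections and requests with jsonb_array_elements and compare the titles case-insensitively. The table name myapp_mymodel and the model name MyModel are assumptions; the requests column name is taken from the filters above.

# Hedged sketch, assuming PostgreSQL and a JSONField named "requests".
# Table and model names are placeholders.
sql = """
    SELECT DISTINCT t.*
    FROM myapp_mymodel AS t,
         jsonb_array_elements(t.requests -> 'structure' -> 'sections') AS sec(section),
         jsonb_array_elements(sec.section -> 'requests') AS req(request)
    WHERE lower(req.request ->> 'title') = lower(%s)
"""
matches = MyModel.objects.raw(sql, [query])

For substring rather than exact matching, the lower() equality could be swapped for an ILIKE comparison.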

Related

Python call to JIRA Rest Api search not returning Created (Date) or Resolution fields

Testing out using Python to make a REST API call to pull defects/bugs from my Jira instance. Using the code provided in the API docs, I put together this query:
payload = json.dumps( {
"jql": "issuetype in (Bug,Defect) AND CreatedDate >= 2021\\u002f01\\u002f01",
"maxResults": 1,
#"fieldsByKeys": false,
"fields": [
"summary",
"assignee",
"reporter",
"status",
"resolution"
"created",
"updated"
],
"startAt": 0
} )
The call is successful, except it returns every field but created and resolution. I used /rest/api/3/field to ensure that the field was spelled correctly and it was. Also tried capitalizing Created.
{
"id": "created",
"name": "Created",
"custom": false,
"orderable": false,
"navigable": true,
"searchable": true,
"clauseNames": ["created",
"createdDate"],
"schema": {
"type": "datetime",
"system": "created"
}
}
API output example:
{
"expand": "names,schema",
"issues": [
{
"expand": "operations,versionedRepresentations,editmeta,changelog,renderedFields",
"fields": {
"assignee": null,
"reporter": example,
"status": example,
"summary": "DEFECT: Test 1",
"updated": "2021-03-07T11:14:31.000-0500"
},
"id": "123456",
"key": "Example-4",
"self": "example_link"
}
],
"maxResults": 1,
"startAt": 0,
"total": 100
}
Alternatively, when I leave fields blank I do get all the fields including created and resolution. However, I don't want to do that as we have hundreds of custom fields that get pulled in as well.
I can see a typo in your code sample: you are missing the comma after the resolution field. Your code should be:
payload = json.dumps( {
"jql": "issuetype in (Bug,Defect) AND CreatedDate >= 2021\\u002f01\\u002f01",
"maxResults": 1,
"fields": [
"summary",
"assignee",
"reporter",
"status",
"resolution", # Note the comma here
"created",
"updated"
],
"startAt": 0
} )
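The missing comma matters because Python silently concatenates adjacent string literals, so the list ends up requesting a single nonexistent field and both created and resolution vanish from the response. A quick illustration:

# Adjacent string literals are concatenated at compile time, so without the
# comma Jira is asked for a field called "resolutioncreated" instead of two fields.
fields = ["status", "resolution" "created", "updated"]
print(fields)  # ['status', 'resolutioncreated', 'updated']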

How to load json nested data into bigquery

I'm trying to load JSON data from an API into a BigQuery table on GCP, but I got an issue: the JSON data seems to be missing a square bracket, so it fails with the error '"Repeated record with name trip_update added outside of an array."}]'. I don't know how to fix it.
Here is the data sample:
{
"header": {
"gtfs_realtime_version": "1.0",
"timestamp": 1607630971
},
"entity": [
{
"id": "65.5.17-120-cm1-1.18.O",
"trip_update": {
"trip": {
"trip_id": "65.5.17-120-cm1-1.18.O",
"start_time": "18:00:00",
"start_date": "20201210",
"schedule_relationship": "SCHEDULED",
"route_id": "17-120-cm1-1"
},
"stop_time_update": [
{
"stop_sequence": 1,
"departure": {
"delay": 0
},
"stop_id": "8220B1351201",
"schedule_relationship": "SCHEDULED"
},
{
"stop_sequence": 23,
"arrival": {
"delay": 2340
},
"departure": {
"delay": 2340
},
"stop_id": "8260B1025301",
"schedule_relationship": "SCHEDULED"
}
]
}
}
]
}
Here is a schema and code:
schema
[
{ "name":"header",
"type": "record",
"fields": [
{ "name":"gtfs_realtime_version",
"type": "string",
"description": "version of speed specification"
},
{ "name": "timestamp",
"type": "integer",
"description": "The moment where this dataset was generated on server e.g. 1593102976"
}
]
},
{"name":"entity",
"type": "record",
"mode": "REPEATED",
"description": "Multiple entities can be included in the feed",
"fields": [
{"name":"id",
"type": "string",
"description": "unique identifier for the entity"
},
{"name": "trip_update",
"type": "struct",
"mode": "REPEATED",
"description": "Data about the realtime departure delays of a trip. At least one of the fields trip_update, vehicle, or alert must be provided - all these fields cannot be empty.",
"fields": [
{ "name":"trip",
"type": "record",
"mode": "REPEATED",
"fields": [
{"name": "trip_id",
"type": "string",
"description": "selects which GTFS entity (trip) will be affected"
},
{ "name":"start_time",
"type": "string",
"description": "The initially scheduled start time of this trip instance 13:30:00"
},
{ "name":"start_date",
"type": "string",
"description": "The start date of this trip instance in YYYYMMDD format. Whether start_date is required depends on the type of trip: e.g. 20200625"
},
{ "name":"schedule_relationship",
"type": "string",
"description": "The relation between this trip and the static schedule e.g. SCHEDULED"
},
{ "name":"route_id",
"type": "string",
"description": "The route_id from the GTFS feed that this selector refers to e.g. 10-263-e16-1"
}
]
}
]
},
{ "name":"stop_time_update",
"type": "record",
"mode": "REPEATED",
"description": "Updates to StopTimes for the trip (both future, i.e., predictions, and in some cases, past ones, i.e., those that already happened). The updates must be sorted by stop_sequence, and apply for all the following stops of the trip up to the next specified stop_time_update. At least one stop_time_update must be provided for the trip unless the trip.schedule_relationship is CANCELED - if the trip is canceled, no stop_time_updates need to be provided.",
"fields": [
{"name":"stop_sequence",
"type": "string",
"description": "Must be the same as in stop_times.txt in the corresponding GTFS feed e.g 3"
},
{ "name":"arrival",
"type": "record",
"mode": "REPEATED",
"fields": [
{ "name":"delay",
"type": "string",
"description": "Delay (in seconds) can be positive (meaning that the vehicle is late) or negative (meaning that the vehicle is ahead of schedule). Delay of 0 means that the vehicle is exactly on time e.g 5"
}
]
},
{ "name": "departure",
"type": "record",
"mode": "REPEATED",
"fields": [
{ "name":"delay",
"type": "integer"
}
]
},
{ "name":"stop_id",
"type": "string",
"description": "Must be the same as in stops.txt in the corresponding GTFS feed e.g. 8430B2552301"
},
{"name":"schedule_relationship",
"type": "string",
"description": "The relation between this StopTime and the static schedule e.g. SCHEDULED , SKIPPED or NO_DATA"
}
]
}
]
}
]
function (following google guideline https://cloud.google.com/solutions/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions?authuser=2#before-you-begin)
def _insert_into_bigquery(bucket_name, file_name):
    blob = CS.get_bucket(bucket_name).blob(file_name)
    row = json.loads(blob.download_as_string())
    table = BQ.dataset(BQ_DATASET).table(BQ_TABLE)
    errors = BQ.insert_rows_json(table,
                                 json_rows=row,
                                 ignore_unknown_values=True,
                                 retry=retry.Retry(deadline=30))
    if errors != []:
        raise BigQueryError(errors)
Your schema definition is wrong: trip_update isn't a repeated struct but a nullable record (or not nullable, but in any case not repeated):
{"name": "trip_update",
"type": "record",
"mode": "NULLABLE",
One of the limitations in loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
I believe that your "trip_update" and "trip" fields must contain an array of values (indicated by square brackets), the same as you did for "stop_time_update".
"trip_update": [
{
"trip": [
{
"trip_id
I am not sure that will be enough though to load your data flawlessly.
Your example row has many newline characters in the middle of your JSON row, and when you are loading data from JSON files, the rows must be newline delimited. BigQuery expects newline-delimited JSON files to contain a single record per line (the parser is trying to interpret each line as a separate JSON row) (Reference).
Example of what your JSON data file should look like:
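To make that concrete, a small hedged sketch in Python (the file name and the abbreviated record are placeholders): each record is serialized onto a single line, so the file contains exactly one JSON object per line with no pretty-printed newlines inside a record.

import json

# Hedged sketch: write newline-delimited JSON, one complete record per line.
records = [
    {
        "header": {"gtfs_realtime_version": "1.0", "timestamp": 1607630971},
        "entity": [{"id": "65.5.17-120-cm1-1.18.O", "trip_update": {}}],  # abbreviated
    },
]
with open("feed.ndjson", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # no indent= pretty-printing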

Elastic search: Partial search not working properly

Partial search is not working on multiple fields.
Data: "Sales inquiries generated"
{
"query_string": {
"fields": ["name", "title", "description", "subject"],
"query": search_data+"*"
}
}
Case 1: When I pass the search data as "inquiri" it works fine, but when I pass "inquirie" it does not.
Case 2: When I pass the search data as "sale" it works fine, but when I pass "sales" it does not.
Case 3: When I pass the search data as "generat" it works fine, but when I pass "generate" it does not.
I defined my field this way.
text_analyzer = analyzer("text_analyzer", tokenizer="standard", filter=["lowercase", "stop", "snowball"])
name = Text(analyzer=text_analyzer, fields={"raw": Keyword()})
title = Text(analyzer=text_analyzer, fields={"raw": Keyword()})
subject = Text(analyzer=text_analyzer, fields={"raw": Keyword()})
What is the issue in my code? Any help would be much appreciated!
Thanks in advance.
This is happening due to the use of the snowball token filter, which stems the words; please refer to the official snowball doc for more info.
I created the same analyzer with your settings to see the generated tokens for your text, since in the end a search matches when the indexed tokens match the search-term tokens.
ES provides nice REST APIs and you can easily reproduce the issue:
Create an index with your settings:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"snowball",
"stop"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Once the index is created you can use the _analyze API to see the generated tokens for your text:
POST http://{{hostname}}:{{port}}/{{index-name}}/_analyze
{
"analyzer": "my_analyzer",
"text": "Sales inquiries generated"
}
And the generated tokens in the response:
{
"tokens": [
{
"token": "sale",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "inquiri",
"start_offset": 6,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "generat",
"start_offset": 16,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 2
}
]
}
You can see that the stored tokens are the stemmed forms ("sale", "inquiri", "generat"), which is exactly what your working search terms match and why the unstemmed terms ("sales", "inquirie", "generate") return nothing. If you need those to match as well, query the raw (keyword) part of your text fields instead of the analyzed part.
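If you prefer to run the same _analyze check from Python rather than through the raw REST call above, here is a minimal sketch using the official elasticsearch client (7.x-style body argument; the host and index name are placeholders):

from elasticsearch import Elasticsearch

# Hedged sketch: reproduce the _analyze call from Python and print the tokens.
es = Elasticsearch("http://localhost:9200")
resp = es.indices.analyze(
    index="my-index",
    body={"analyzer": "my_analyzer", "text": "Sales inquiries generated"},
)
print([t["token"] for t in resp["tokens"]])  # ['sale', 'inquiri', 'generat']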

How to get Wikidata ID for an entity of a property ? Is there an API available for python?

The MediaWiki API is able to find ID for an item with the request URL:
/w/api.php?action=query&format=json&prop=pageprops&titles=skype&formatversion=2&ppprop=wikibase_item
The result is:
{
"batchcomplete": true,
"query": {
"normalized": [
{
"fromencoded": false,
"from": "skype",
"to": "Skype"
}
],
"pages": [
{
"pageid": 424589,
"ns": 0,
"title": "Skype",
"pageprops": {
"wikibase_item": "Q40984"
}
}
]
}
}
However, it does not work well when querying about a property, e.g., developer P178. The result is Q409857 rather than the desired P178:
{
"batchcomplete": true,
"query": {
"normalized": [
{
"fromencoded": false,
"from": "developer",
"to": "Developer"
}
],
"pages": [
{
"pageid": 179684,
"ns": 0,
"title": "Developer",
"pageprops": {
"wikibase_item": "Q409857"
}
}
]
}
}
Is there any way to get the ID for an entity which could be an item, a property or even a lexeme?
You could use the search API on Wikidata.
For example, to find properties with the name "developer" inside, use
https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=developer&srnamespace=120
120 is the property namespace. To find a lexeme, use srnamespace=146.
Note: this API guesses your language and adapts the results accordingly. If you don't live in an English-speaking country, the above example may therefore fail.
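If you want to make the same call from Python, here is a minimal sketch with the requests library; the parameters mirror the URL above, and titles in the property namespace come back in the form "Property:P178":

import requests

# Hedged sketch: query the Wikidata search API for properties (namespace 120);
# use srnamespace=146 for lexemes and 0 for items.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "query",
        "list": "search",
        "srsearch": "developer",
        "srnamespace": 120,
        "format": "json",
    },
)
for hit in resp.json()["query"]["search"]:
    print(hit["title"])  # e.g. "Property:P178"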

python: query mongodb database to find a specific value under a unknown field

I am using Python to generate a MongoDB collection and I need to find some specific values in the database. A document looks like:
{
    "_id": ObjectId(215454541245),
    "category": "food",
    "venues": {"Thai Restaurant": 251, "KFC": 124, "Chinese Restaurant": 21, .....}
}
My question is: I want to query this database and find all venues which have a value smaller than 200, so in my example "KFC" and "Chinese Restaurant" would be returned from this query.
Does anyone know how to do that?
If you can change your schema it would be much easier to issue queries against your collection. As it is, having dynamic values as your keys is considered a bad design pattern with MongoDB as they are extremely difficult to query.
A recommended approach would be to follow an embedded model like this:
{
"_id": ObjectId("553799187174b8c402151d06"),
"category": "food",
"venues": [
{
"name": "Thai Restaurant",
"value": 251
},
{
"name": "KFC",
"value": 124
},
{
"name": "Chinese Restaurant",
"value": 21
}
]
}
Thus with this structure you could then issue the query to find all venues which have a value smaller than 200:
db.collection.findOne({"venues.value": { "$lt": 200 } },
{
"venues": { "$elemMatch": { "value": { "$lt": 200 } } },
"_id": 0
});
This will return the result:
/* 0 */
{
"venues" : [
{
"name" : "KFC",
"value" : 124
}
]
}
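Since the question is about Python, here is a hedged PyMongo equivalent of the query above, assuming the restructured (embedded-array) schema; the database and collection names are placeholders:

from pymongo import MongoClient

# Hedged sketch: same query as the shell example, run through PyMongo.
client = MongoClient()
coll = client["mydb"]["venues_collection"]

doc = coll.find_one(
    {"venues.value": {"$lt": 200}},
    {"venues": {"$elemMatch": {"value": {"$lt": 200}}}, "_id": 0},
)
print(doc)  # {'venues': [{'name': 'KFC', 'value': 124}]}

Note that $elemMatch in a projection returns only the first matching array element, which is why the shell output above shows just "KFC".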
