Elasticsearch - Reindex single field with different analyzer using Python

Elasticsearch - Reindex single field with different analyzer using Python - python

I use dynamic mapping in elasticsearch to load my json file into elasticsearch, like this:
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
def extract():
f = open('tmdb.json')
if f:
return json.loads(f.read())
movieDict = extract()
def index(movieDict={}):
for id, body in movieDict.items():
es.index(index='tmdb', id=id, doc_type='movie', body=body)
index(movieDict)
How can I update mapping for single field? I have field title to which I want to add different analyzer.
title_settings = {"properties" : { "title": {"type" : "text", "analyzer": "english"}}}
es.indices.put_mapping(index='tmdb', body=title_settings)
This fails.
I know that I cannot update already existing index, but what is proper way to reindex mapping generated from json file? My file has a lot of fields, creating mapping/settings manually would be very troublesome.
I am able to specify analyzer for an query, like this:
query = {"query": {
"multi_match": {
"query": userSearch, "analyzer":"english", "fields": ['title^10', 'overview']}}}
How do I specify it for index or field?
I am also able to put analyzer to settings after closing and opening index
analysis = {'settings': {'analysis': {'analyzer': 'english'}}}
es.indices.close(index='tmdb')
es.indices.put_settings(index='tmdb', body=analysis)
es.indices.open(index='tmdb')
Copying exact settings for english analyzers doesn't do 'activate' it for my data.
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/analysis-lang-analyzer.html#english-analyzer
By 'activate' I mean, search is not returned in a form processed by english analyzer ie. there are still stopwords.

Solved it with massive amount of googling....
You cannot change analyzer on already indexed data. This includes opening/closing of index. You can specify new index, create new mapping and load your data (quickest way)
Specifying analyzer for whole index isn't good solution, as 'english' analyzer is specific to 'text' fields. It's better to specify analyzer by field.
If analyzers are specified by field you also need to specify type.
You need to remember that analyzers are used at can be used at/or index and search time. Reference Specifying analyzers
Code:
def create_index(movieDict={}, mapping={}):
es.indices.create(index='test_index', body=mapping)
start = time.time()
for id, body in movieDict.items():
es.index(index='test_index', id=id, doc_type='movie', body=body)
print("--- %s seconds ---" % (time.time() - start))
Now, I've got mapping from dynamic mapping of my json file. I just saved it back to json file for ease of processing (editing). That's because I have over 40 fields to map, doing it by hand would be just tiresome.
mapping = es.indices.get_mapping(index='tmdb')
This is example of how title key should be specified to use english analyzer
'title': {'type': 'text', 'analyzer': 'english','fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}

Related

Using additional domain fields in WHMCS with python

I'm trying to register a domain for customers using WHMCS apis with python.
The issue I'm having is with the additional domain fields, in this case for the .eu domain:
additional_domainfields = base64.b64encode(phpserialize.dumps({'Entity Type':'INDIVIDUAL', 'EU Country of Citizenship':'IT'}))
post_data = {
'identifier':'xxx',
'secret':'xxx',
'action':'AddOrder',
'clientid':client_id,
'domain':domain,
'domaintype':"register",
'billingcycle':"annually",
'regperiod':1,
'noinvoice':True,
'noinvoiceemail':True,
'noemail':True,
'paymentmethod':"stripe",
'domainpriceoverride':0,
'domainrenewoverride':0,
'domainfields':additional_domainfields,
'responsetype':'json'
}
postfields = urlencode(post_data)
c.setopt(c.POSTFIELDS, postfields) # type: ignore
c.perform()
c.reset
I can't find what to use for the array of serialized data, so far all of my tests were for nothing.
I tried any combination of fields on the line:
additional_domainfields = base64.b64encode(phpserialize.dumps({'Entity Type':'INDIVIDUAL', 'EU Country of Citizenship':'IT'})).
Also, tried cheching for the php overrides for field names, so far without success.
Any help would be immensely appreciated

How to speed up returning a 20MB Json file from a Python-Flask application?

I am trying to call an API which in turn triggers a store procedure from our sqlserver database. This is how I coded it.
class Api_Name(Resource):
def __init__(self):
pass
#classmethod
def get(self):
try:
engine = database_engine
connection = engine.connect()
sql = "DECLARE #return_value int EXEC #return_value = [dbname].[dbo].[proc_name])
return call_proc(sql, apiname, starttime, connection)
except Exception as e:
return {'message': 'Proc execution failed with error => {error}'.format(error=e)}, 400
pass
call_proc is the method where I return the JSON from database.
def call_proc(sql: str, connection):
try:
json_data = []
rv = connection.execute(sql)
for result in rv:
json_data.append(dict(zip(result.keys(), result)))
return Response(json.dumps(json_data), status=200)
except Exception as e:
return {'message': '{error}'.format(error=e)}, 400
finally:
connection.close()
The problem with the output is the way JSON is returned and the size of it.
At first the API used to take 1minute 30seconds: when the return statement was like this:
case1: return Response(json.dumps(json_data), status=200, mimetype='application/json')
After looking online, I found that the above statement is trying to prettify JSON. So I removed mimetype from the response & made it as
case2: return Response(json.dumps(json_data), status=200)
The API runs for 30seconds, although the JSON output is not aligned properly but its still JSON.
I see the output size of the JSON returned from the API is close 20MB. I observed this on postman response:
Status: 200 OK Time: 29s Size: 19MB
The difference in Json output:
case1:
[ {
"col1":"val1",
"col2":"val2"
},
{
"col1":"val1",
"col2":"val2"
}
]
case2:
[{"col1":"val1","col2":"val2"},{"col1":"val1","col2":"val2"}]
Will the difference in output from the two aforementioned cases are different ? If so, how can I fix the problem ?
If there is no difference, is there any way I speed up this further and reduce the run time further more, like compressing the JSON which I am returning ?

You can use gzip compression to make your plain text weight from Megabytes to even Kilobytes. Or even use flask-compress library for that.
Also I'd suggest to use ujson to make dump() call faster.
import gzip
from flask import make_response
import ujson as json
#app.route('/data.json')
def compress():
compression_level = 5 # of 9 max
data = [
{"col1": "val1", "col2": "val2"},
{"col1": "val1", "col2": "val2"}
]
content = gzip.compress(json.dumps(data).encode('utf8'), compression_level)
response = make_response(content)
response.headers['Content-length'] = len(content)
response.headers['Content-Encoding'] = 'gzip'
return response
Documentation:
https://docs.python.org/3/library/gzip.html
https://github.com/colour-science/flask-compress
https://pypi.org/project/ujson/

First of all, profile: if 90% the time is being spent transferring across the network then optimising processing speed is less useful than optimising transfer speed (for example, by compressing the response as wowkin recommended (though the web server may be configured to do this automatically, if you are using one)
Assuming that constructing the JSON is slow, if you control the database code you could use its JSON capabilities to serialise the data, and avoid doing it at the Python layer. For example,
SELECT col1, col2
FROM tbl
WHERE col3 > 42
FOR JSON AUTO
would give you
[
{
"col1": "foo",
"col2": 1
},
{
"col1": "bar",
"col2": 2
},
...
]
Nested structures can be created too, described in the docs.
If the requester only needs the data, return it as a download using flask's send_file feature and avoid the cost of constructing an HTML response:
from io import BytesIO
from flask import send_file
def call_proc(sql: str, connection):
try:
rv = connection.execute(sql)
json_data = rv.fetchone()[0]
# BytesIO expects encoded data; if you can get the server to encode
# the data instead it may be faster.
encoded_json = json_data.encode('utf-8')
buf = BytesIO(encoded_json)
return send_file(buf, mimetype='application/json', as_attachment=True, conditional=True)
except Exception as e:
return {'message': '{error}'.format(error=e)}, 400
finally:
connection.close()

You need to implement pagination on your API. 19MB is absurdly large and will lead to some very annoyed users.
gzip and clevererness with the JSON responses will sadly not be enough, you'll need to put in a bit more legwork.
Luckily, there's many pagination questions and answers, and Flasks modular approach to things will mean that someone probably wrote up a module that's applicable to your problem. I'd start off by re-implementing the method with an ORM. I heard that sqlalchemy is quite good.

To answer your question:
1 - Both JSON are semantically identical.
You can make use of http://www.jsondiff.com to compare two JSON.
2 - I would recommend you to make chunks of your data and send it across network.
This might help:
https://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html

TL;DR; Try restructuring your JSON payload (i.e. change schema)
I see that you are constructing the JSON response in one of your APIs. Currently, your JSON payload looks something like:
[
{
"col0": "val00",
"col1": "val01"
},
{
"col0": "val10",
"col1": "val11"
}
...
]
I suggest you restructure it in such a way that each (first level) key in your JSON represents the entire column. So, for the above case, it will become something like:
{
"col0": ["val00", "val10", "val20", ...],
"col1": ["val01", "val11", "val21", ...]
}
Here are the results from some offline test I performed.
Experiment variables:
NUMBER_OF_COLUMNS = 10
NUMBER_OF_ROWS = 100000
LENGTH_OF_STR_DATA = 5
#!/usr/bin/env python3
import json
NUMBER_OF_COLUMNS = 10
NUMBER_OF_ROWS = 100000
LENGTH_OF_STR_DATA = 5
def get_column_name(id_):
return 'col%d' % id_
def random_data():
import string
import random
return ''.join(random.choices(string.ascii_letters, k=LENGTH_OF_STR_DATA))
def get_row():
return {
get_column_name(i): random_data()
for i in range(NUMBER_OF_COLUMNS)
}
# data1 has same schema as your JSON
data1 = [
get_row() for _ in range(NUMBER_OF_ROWS)
]
with open("/var/tmp/1.json", "w") as f:
json.dump(data1, f)
def get_column():
return [random_data() for _ in range(NUMBER_OF_ROWS)]
# data2 has the new proposed schema, to help you reduce the size
data2 = {
get_column_name(i): get_column()
for i in range(NUMBER_OF_COLUMNS)
}
with open("/var/tmp/2.json", "w") as f:
json.dump(data2, f)
Comparing sizes of the two JSONs:
$ du -h /var/tmp/1.json
17M
$ du -h /var/tmp/2.json
8.6M
In this case, it almost got reduced by half.
I would suggest you do the following:
First and foremost, profile your code to see the real culprit. If it is really the payload size, proceed further.
Try to change your JSON's schema (as suggested above)
Compress your payload before sending (either from your Flask WSGI app layer or your webserver level - if you are running your Flask app behind some production grade webserver like Apache or Nginx)

For large data that you can't paginate using something like ndjson (or any type of delimited record format) can really reduce the server resources needed since you'd be preventing holding the JSON object in memory. You would need to get access to the response stream to write each object/line to the response though.
The response
[ {
"col1":"val1",
"col2":"val2"
},
{
"col1":"val1",
"col2":"val2"
}
]
Would end up looking like
{"col1":"val1","col2":"val2"}
{"col1":"val1","col2":"val2"}
This also has advantages on the client since you can parse and process each line on it's own as well.
If you aren't dealing with nested data structures responding with a CSV is going to be even smaller.

I want to note that there is a standard way to write a sequence of separate records in JSON, and it's described in RFC 7464. For each record:
Write the record separator byte (0x1E).
Write the JSON record, which is a regular JSON document that can also contain inner line breaks, in UTF-8.
Write the line feed byte (0x0A).
(Note that the JSON text sequence format, as it's called, uses a more liberal syntax for parsing text sequences of this kind; see the RFC for details.)
In your example, the JSON text sequence would look as follows, where \x1E and \x0A are the record separator and line feed bytes, respectively:
\x1E{"col1":"val1","col2":"val2"}\x0A\x1E{"col1":"val1","col2":"val2"}\x0A
Since the JSON text sequence format allows inner line breaks, you can write each JSON record as you naturally would, as in the following example:
\x1E{
"col1":"val1",
"col2":"val2"}
\x0A\x1E{
"col1":"val1",
"col2":"val2"
}\x0A
Notice that the media type for JSON text sequences is not application/json, but application/json-seq; see the RFC.

Google Sheet API - Get Data Validation

I'm trying to set data validation rules for my current spreadsheet. One thing that would help me would to be able to view the rules in JSON from data validation rules I have already set (In the spreadsheet UI or within an API call).
Example.
request = {
"requests": [
{
"setDataValidation": {
"range": {
"sheetId": SHEET_ID,
"startRowIndex": 1,
"startColumnIndex": 0,
"endColumnIndex":1
},
"rule": {
"condition": {
"type": "BOOLEAN"},
"inputMessage": "Value MUST BE BOOLEAN",
"strict": "True"
}
}
}
]
}
service.spreadsheets().batchUpdate(spreadsheetId=SPREADSHEET_ID body=request).execute()
But what API calls do I use to see the Data Validation on these range of cells? This is useful for if I set the Data Validation rules in the spreadsheet and I want to see how google interprets them. I'm having a lot of trouble setting complex Datavalidations through the API.
Thank you

To obtain only the "Data Validation" components of a given spreadsheet, you simply request the appropriate field in the call to spreadsheets.get:
service = get_authed_sheets_service_somehow()
params = {
spreadsheetId: 'your ssid',
#range: 'some range',
fields: 'sheets(data/rowData/values/dataValidation,properties(sheetId,title))' }
request = service.spreadsheets().get(**params)
response = request.execute()
# Example print code (not tested :p )
for sheet in response['sheets']:
for range in sheet['data']:
for r, row in enumerate(range['rowData']):
for c, col in enumerate(row['values']):
if 'dataValidation' in col:
# print "Sheet1!R1C1" & associated data validation object.
# Assumes whole grid was requested (add appropriate indices if not).
print(f'\'{sheet["properties"]["title"]}\'!R{r}C{c}', col['dataValidation'])
By specifying fields, includeGridData is not required to obtain data on a per-cell basis from the range you requested. By not supplying a range, we target the entire file. This particular fields specification requests the rowData.values.dataValidation object and the sheetId and title of the properties object, for every sheet in the spreadsheet.
You can use the Google APIs Explorer to interactively determine the appropriate valid "fields" specification, and additionally examine the response:
https://developers.google.com/apis-explorer/#p/sheets/v4/sheets.spreadsheets.get
For more about how "fields" specifiers work, read the documentation: https://developers.google.com/sheets/api/guides/concepts#partial_responses
(For certain write requests, field specifications are not optional so it is in your best interest to determine how to use them effectively.)

I think I found the answer. IncludeGridData=True in your spreadsheet().get
from pprint import pprint
response = service.spreadsheets().get(
spreadsheetId=SPREADSHEETID, fields='*',
ranges='InputWorking!A2:A',includeGridData=True).execute()
You get a monster datastructure back. So to look at the very first data in your range you could do.
pprint(response['sheets'][0]['data'][0]['rowData'][0]['values'][0]['dataValidation'])
{'condition': {'type': 'BOOLEAN'},
'inputMessage': 'Value MUST BE BOOLEAN',
'strict': True}

RavenDB Object properly saved but some empty attributes when querying

I'm currently trying to save some python objects (websites) via PyRavenDB in a RavenDB database. The problem is that data are saved properly, but when I test it by querying the results, some of the attributes are returned empty.
The code is simple, I can't properly find be the problem.
The JSON object in the database is the following (verified via the DB web UI).
{
"htmlCode": "<code>TEST HTML</code>",
"added": "2017-02-21",
"uniqueid": "262e4584f3e546afa2c67045a0096b54",
"url": "www.example.com",
"myHash": "d41d8cd98f00b204e9800998ecf8427e",
"lastaccessed": "2017-02-21"
}
When I use this code to query
from pyravendb.store import document_store
store = document_store.documentstore(url="http://somewhere:someport", database="websites")
store.initialize()
with store.open_session() as session:
query_result = list(session.query().where_equals("www.example.com", url))
print query_result
print type(query_result)
return query_result
It returns this object :
{
'uniqueid': 'f942e86f965d4709a2d69caca3001f2a',
'url': '',
'myHash': 'd41d8cd98f00b204e9800998ecf8427e',
'htmlCode': '',
'added': '2017-02-21',
'lastaccessed': '2017-02-21'
}
As you can see, url and html code are empty. They should be okey since in DB they are properly stored.
Thanks.

The problem here is that you don't use the where_equal right.
where_equal first argument is the field name you want to query with and then the value (def where_equals(self, field_name, value)).
Just change this line query_result = list(session.query().where_equals("www.example.com", url))
To this query_result = list(session.query().where_equals("url", "www.example.com"))
This will fix your problem

Google Adwords API query the number of conversions per click

I would like to query the number of conversions per click in a google adwords report using the SOAP API. Unfortuately the following query (Python),
# Create report definition.
report = {
'reportName': 'Last 30 days ADGROUP_PERFORMANCE_REPORT',
'dateRangeType': 'LAST_30_DAYS',
'reportType': 'ADGROUP_PERFORMANCE_REPORT',
'downloadFormat': 'CSV',
'selector': {
'fields': ['CampaignId', 'AdGroupId', 'Id',
'Impressions', 'Clicks', 'Cost',
'Conv1PerClick',
'CampaignName','AdGroupName']
},
# Enable to get rows with zero impressions.
'includeZeroImpressions': 'false'
}
results in the following error
AdWordsReportError: HTTP code: 400, type: 'ReportDefinitionError.INVALID_FIELD_NAME_FOR_REPORT', trigger: 'Conv1PerClick', field path: ''
Google documentation (https://developers.google.com/adwords/api/docs/appendix/reports) seems to indicate that such a report should have a conv1PerClick field (I tryed also removing capitalization of the first letter, similar error occurs ).
Does anybody knows a way to query the ad group statistics about conversions per click?

Not sure if I am understanding you correctly, but the field is called Conversions, not Conv1PerClick. If you download the report in XML format, then the corresponding field attribute name used to be conv1PerClick, but this changed in v201402 in line with some changes to the way these metrics are computed:
http://adwords.blogspot.ch/2014/02/a-new-way-to-count-conversions-in.html
https://developers.google.com/adwords/api/docs/guides/migration/v201402

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Elasticsearch - Reindex single field with different analyzer using Python - python

Related

Using additional domain fields in WHMCS with python

How to speed up returning a 20MB Json file from a Python-Flask application?

Google Sheet API - Get Data Validation

RavenDB Object properly saved but some empty attributes when querying

Google Adwords API query the number of conversions per click

Categories

Resources