I am trying to create an ES index with a custom mapping using the Elasticsearch Python client, to increase the allowed size of text in each document:
mapping = {"mapping":{
"properties":{
"Apple":{"type":"text","ignore_above":1000},
"Mango":{"type":"text","ignore_above":1000}
}
}}
Creation:
from elasticsearch import Elasticsearch
es1 = Elasticsearch([{"host":"localhost","port":9200}])
es1.indices.create(index="hello",body=mapping)
Error:
RequestError: RequestError(400, 'mapper_parsing_exception', 'Mapping definition for [Apple] has unsupported parameters: [ignore_above : 1000]')
But when I checked the Elasticsearch documentation on how to increase the text length limit, ignore_above was the option given there:
https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html
Any suggestions on how to rectify this would be great.
The ignore_above setting is only for keyword types, not text, so just change your mapping to this and it will work:
mapping = {"mapping":{
"properties":{
"Apple":{"type":"text"},
"Mango":{"type":"text"}
}
}}
If you absolutely need to be able to specify ignore_above, then you need to change the type to keyword, like this:
mapping = {"mapping":{
"properties":{
"Apple":{"type":"keyword","ignore_above":1000},
"Mango":{"type":"keyword","ignore_above":1000}
}
}}
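For completeness, a minimal end-to-end sketch under the same assumptions as above (local cluster on port 9200; the index name hello is from the question, and the sample document values are made up):

from elasticsearch import Elasticsearch

es1 = Elasticsearch([{"host": "localhost", "port": 9200}])

mapping = {"mappings": {
    "properties": {
        "Apple": {"type": "keyword", "ignore_above": 1000},
        "Mango": {"type": "keyword", "ignore_above": 1000}
    }
}}

# Create the index with the keyword mapping, then index a sample document.
es1.indices.create(index="hello", body=mapping)
es1.index(index="hello", body={"Apple": "granny smith", "Mango": "alphonso"})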
I use dynamic mapping in Elasticsearch to load my JSON file into Elasticsearch, like this:
import json
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

def extract():
    with open('tmdb.json') as f:
        return json.loads(f.read())

movieDict = extract()

def index(movieDict={}):
    for id, body in movieDict.items():
        es.index(index='tmdb', id=id, doc_type='movie', body=body)

index(movieDict)
How can I update the mapping for a single field? I have a field title to which I want to add a different analyzer.
title_settings = {"properties" : { "title": {"type" : "text", "analyzer": "english"}}}
es.indices.put_mapping(index='tmdb', body=title_settings)
This fails.
I know that I cannot update an already existing index, but what is the proper way to reindex a mapping generated from a JSON file? My file has a lot of fields, so creating the mapping/settings manually would be very troublesome.
I am able to specify an analyzer for a query, like this:
query = {"query": {
    "multi_match": {
        "query": userSearch, "analyzer": "english", "fields": ['title^10', 'overview']}}}
How do I specify it for an index or a field?
I am also able to add an analyzer to the settings after closing and opening the index:
analysis = {'settings': {'analysis': {'analyzer': 'english'}}}
es.indices.close(index='tmdb')
es.indices.put_settings(index='tmdb', body=analysis)
es.indices.open(index='tmdb')
Copying the exact settings for the english analyzer doesn't 'activate' it for my data.
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/analysis-lang-analyzer.html#english-analyzer
By 'activate' I mean that search results are not returned in a form processed by the english analyzer, i.e. there are still stopwords.
Solved it with a massive amount of googling...
You cannot change the analyzer on already indexed data. This includes opening/closing the index. The quickest way is to create a new index with a new mapping and load your data into it.
Specifying an analyzer for the whole index isn't a good solution, as the 'english' analyzer is specific to 'text' fields. It's better to specify the analyzer per field.
If analyzers are specified per field, you also need to specify the type.
You need to remember that analyzers can be used at index time and/or search time. Reference: Specifying analyzers
Code:
import time

def create_index(movieDict={}, mapping={}):
    es.indices.create(index='test_index', body=mapping)
    start = time.time()
    for id, body in movieDict.items():
        es.index(index='test_index', id=id, doc_type='movie', body=body)
    print("--- %s seconds ---" % (time.time() - start))
Now, I've got the mapping from the dynamic mapping of my JSON file. I just saved it back to a JSON file for ease of processing (editing). That's because I have over 40 fields to map; doing it by hand would be just tiresome.
mapping = es.indices.get_mapping(index='tmdb')
This is an example of how the title key should be specified to use the english analyzer:
'title': {'type': 'text', 'analyzer': 'english','fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}}
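Putting it together, a minimal sketch of the whole workflow (the file name edited_mapping.json is an assumption; create_index and movieDict are from the snippets above):

import json

# Dump the dynamically generated mapping so it can be edited by hand.
mapping = es.indices.get_mapping(index='tmdb')
with open('edited_mapping.json', 'w') as f:
    json.dump(mapping['tmdb'], f, indent=2)

# ... edit the file, e.g. give the title field "analyzer": "english" ...

# Load the edited mapping and rebuild the index with it.
with open('edited_mapping.json') as f:
    edited_mapping = json.load(f)
create_index(movieDict=movieDict, mapping=edited_mapping)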
I am trying to create a Dataproc cluster with a time to live of 1 day using the Python SDK. For this purpose, v1beta2 of the Dataproc API introduces the LifecycleConfig object, which is a child of the ClusterConfig object.
I use this object in the JSON file which I pass to the create_cluster method. To set the particular TTL, I use the field auto_delete_ttl, which should have the value 86400 seconds (one day).
The documentation of Google Protocol Buffers is rather specific about how to represent a duration in JSON: durations shall be represented as a string with the suffix s for seconds, and there shall be 0, 3, 6, or 9 fractional digits.
However, if I pass the duration using this format, I get the error:
Parameter to MergeFrom() must be instance of same class: expected google.protobuf.Duration got str
This is how I create the cluster:
from google.cloud import dataproc_v1beta2

project = "your_project_id"
region = "europe-west4"
cluster = ""  # see below for the cluster JSON

client = dataproc_v1beta2.ClusterControllerClient(client_options={
    'api_endpoint': '{}-dataproc.googleapis.com:443'.format(region)
})

# Create the cluster
operation = client.create_cluster(project, region, cluster)
The variable cluster holds the JSON object describing the desired cluster:
{
    "cluster_name": "my_cluster",
    "config": {
        "config_bucket": "my_conf_bucket",
        "gce_cluster_config": {
            "zone_uri": "europe-west4-a",
            "metadata": {
                "PIP_PACKAGES": "google-cloud-storage google-cloud-bigquery"
            },
            "subnetwork_uri": "my subnet",
            "service_account_scopes": [
                "https://www.googleapis.com/auth/cloud-platform"
            ],
            "tags": [
                "some tags"
            ]
        },
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-highmem-4",
            "disk_config": {
                "boot_disk_type": "pd-standard",
                "boot_disk_size_gb": 200,
                "num_local_ssds": 0
            },
            "accelerators": []
        },
        "software_config": {
            "image_version": "1.4-debian9",
            "properties": {
                "dataproc:dataproc.allow.zero.workers": "true",
                "yarn:yarn.log-aggregation-enable": "true",
                "dataproc:dataproc.logging.stackdriver.job.driver.enable": "true",
                "dataproc:dataproc.logging.stackdriver.enable": "true",
                "dataproc:jobs.file-backed-output.enable": "true"
            },
            "optional_components": []
        },
        "lifecycle_config": {
            "auto_delete_ttl": "86400s"
        },
        "initialization_actions": [
            {
                "executable_file": "gs://some-init-script"
            }
        ]
    },
    "project_id": "project_id"
}
Package versions I am using:
google-cloud-dataproc: 0.6.1
protobuf: 3.11.3
googleapis-common-protos: 1.6.0
Am I doing something wrong here, is it an issue with wrong package versions, or is it even a bug?
The 100s string format for a duration type is used when you construct the protobuf in a text format (i.e. JSON, etc.). But you are using a Python object to construct the API request body, which is why you need to create a Duration object instead of a string:
duration_message.FromSeconds(86400)
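A minimal sketch of how this fits into the request above, assuming the cluster spec is the Python dict named cluster from the question (worth verifying that your client version accepts a message inside the dict):

from google.protobuf import duration_pb2

# Build a Duration message instead of the "86400s" string.
ttl = duration_pb2.Duration()
ttl.FromSeconds(86400)  # one day

# Replace the string value in the cluster dict with the message,
# then create the cluster as before.
cluster["config"]["lifecycle_config"]["auto_delete_ttl"] = ttl
operation = client.create_cluster(project, region, cluster)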
We built a web app from which we pass variables to Terraform, like below:
terraform apply -input=false -auto-approve -var ami="%ami%" -var region="%region%" -var icount="%count%" -var type="%instance_type%"
The problem is that the backend configuration does not support variables, and I need to pass those values from the app as well.
To resolve this, I found a suggestion that we need to create backend.tf before execution.
But I am unable to figure out how to do it; if anyone has any examples regarding this, please help me.
Thanks in advance.
I need to create the backend.tf file from Python using the variables below, and for each profile I need to substitute the profile into key="${profile}/tfstate".
I am thinking of using a git repo: with git we create the files, pull the values, then commit and execute.
Please help me with some examples and ideas.
My main.tf is like below:
terraform {
  backend "s3" {
    bucket  = "terraform-007"
    key     = "key"
    region  = "ap-south-1"
    profile = "venu"
  }
}

provider "aws" {
  profile = var.aws_profile
  region  = var.aws_region
}

resource "aws_instance" "VM" {
  count         = var.icount
  ami           = var.ami
  instance_type = var.type
  tags = {
    Environment = "${var.env_indicator}"
  }
}
vars.tf is like:
variable "aws_profile" {
  default     = "default"
  description = "AWS profile name, as set in ~/.aws/credentials"
}

variable "aws_region" {
  type        = "string"
  default     = "ap-south-1"
  description = "AWS region in which to create resources"
}

variable "env_indicator" {
  type        = "string"
  default     = "dev"
  description = "What environment are we in?"
}

variable "icount" {
  default = 1
}

variable "ami" {
  default = "ami-54d2a63b"
}

variable "bucket" {
  default = "terraform-002"
}

variable "type" {
  default = "t2.micro"
}
output.tf is like:
output "ec2_public_ip" {
  value = ["${aws_instance.VM.*.public_ip}"]
}

output "ec2_private_ip" {
  value = ["${aws_instance.VM.*.private_ip}"]
}
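For illustration, a minimal Python sketch of the kind of backend.tf generation I have in mind (the function name write_backend_tf and the default bucket/region/profile values are placeholders):

def write_backend_tf(profile, bucket="terraform-007", region="ap-south-1"):
    # Render backend.tf with the state key derived from the profile.
    backend = (
        'terraform {\n'
        '  backend "s3" {\n'
        '    bucket  = "%s"\n'
        '    key     = "%s/tfstate"\n'
        '    region  = "%s"\n'
        '    profile = "%s"\n'
        '  }\n'
        '}\n'
    ) % (bucket, profile, region, profile)
    with open('backend.tf', 'w') as f:
        f.write(backend)

write_backend_tf("venu")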
Since the configuration for the backend cannot use interpolation, we have used a configuration-by-convention approach.
The terraform for all of our state collections (microservices and other infrastructure) use the same S3 bucket for state storage and the same DynamoDB table for locking.
When executing terraform, we use the same IAM role (a dedicated terraform only user).
We define the key for the state via convention, so that it does not need to be generated.
key = "platform/services/{name-of-service}/terraform.tfstate"
I would avoid a process that results in changes to the infrastructure code as it is being deployed, to ensure maximum understandability for the engineers reading and maintaining the code.
EDIT: Adding key examples
For the user service:
key = "platform/services/users/terraform.tfstate"
For the search service:
key = "platform/services/search/terraform.tfstate"
For the product service:
key = "platform/services/products/terraform.tfstate"
I'm using the Elasticsearch Python API, and I found that if the _id is the same, the old data gets overwritten. E.g. I had name="Tom"; now I index the same _id with the field age=30, and I found that name="Tom" was removed after the reindex. The result I hoped for was that age=30 would just be appended to the existing document. Should I tune any parameters?
I'm using the following code:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://10.0.0.1:9200")

# The first write indexed {"name": "Tom"}; this second write with the same id replaces it.
doc = {"age": 30}
res = es.index(index="panavstream", doc_type='panav', id="123", body=doc)
Thanks in advance.
The update function with a script body can append a field to the existing document; see elasticsearch-py update.
A sample:
doc = {
    'script': 'ctx._source.age = 30'
}
es.update(index="panavstream", doc_type='panav', id="123", body=doc)
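If you don't need scripting, a partial-document update achieves the same thing for simple field additions (a minimal sketch; the fields under "doc" are merged into the existing document):

# Merge new fields into the existing document without a script.
es.update(index="panavstream", doc_type='panav', id="123", body={"doc": {"age": 30}})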
In my MongoDB, a bunch of these documents exist:
{ "_id" : ObjectId("5341eaae6e59875a9c80fa68"),
"parent" : {
"tokeep" : 0,
"toremove" : 0
}
}
I want to remove the parent.toremove attribute in every single one.
Using the MongoDB shell, I can accomplish this using:
db.collection.update({},{$unset: {'parent.toremove':1}},false,true)
But how do I do this within Python?
app = Flask(__name__)
mongo = PyMongo(app)
mongo.db.collection.update({},{$unset: {'parent.toremove':1}},false,true)
returns the following error:
File "myprogram.py", line 46
mongo.db.collection.update({},{$unset: {'parent.toremove':1}},false,true)
^
SyntaxError: invalid syntax
Put quotes around $unset, name the parameter you're including (multi), and use the correct syntax for True:
mongo.db.collection.update({}, {'$unset': {'parent.toremove':1}}, multi=True)
It just seems weird to have to attach an arbitrary value to the field being removed, such as a small number (1) or an empty string (''), but it's indeed mentioned in the MongoDB docs, with a sample in JavaScript:
$unset
The $unset operator deletes a particular field. Consider the following syntax:
{ $unset: { field1: "", ... } }
The specified value in the $unset expression (i.e. "") does not impact
the operation.
For Python/PyMongo, I'd rather use the value None:
{'$unset': {'field1': None}}
So, for OP's question, it would be:
mongo.db.collection.update({}, {'$unset': {'parent.toremove': None}}, multi=True)
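Note that in PyMongo 3.x and later, update is deprecated in favor of update_one/update_many; the equivalent with the newer API would be (a minimal sketch):

# update_many applies the $unset to every matching document.
mongo.db.collection.update_many({}, {'$unset': {'parent.toremove': None}})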