Elasticsearch HTTP API or python API - python

I am a newbie with the real-time distributed search engine Elasticsearch, but I would like to ask a technical question.
I have written a Python crawler module that parses a web page and creates JSON objects containing the native information. The next step for my crawler module is to store this information using Elasticsearch.
My real question is the following:
Which approach is better for my use case: the Elasticsearch RESTful API, or the Python client for Elasticsearch (elasticsearch-py)?

If you already have Python code, then the most natural way for you would be to use the elasticsearch-py client.
After installing the elasticsearch-py library via pip install elasticsearch, here is a simple code example to get you going:
# import the elasticsearch library
from elasticsearch import Elasticsearch
# get your JSON data
json_page = {...}
# create a new client to connect to ES running on localhost:9200
es = Elasticsearch()
# index your JSON data
es.index(index="webpages", doc_type="webpage", id=1, body=json_page)
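For comparison, the same indexing call made against the plain REST API with the requests library would look roughly like this; the localhost:9200 address and the placeholder document are assumptions for illustration:
import requests

# example placeholder document
json_page = {"url": "http://example.com", "title": "Example page"}

# PUT the document at /<index>/<type>/<id>, mirroring es.index(...) above
resp = requests.put("http://localhost:9200/webpages/webpage/1", json=json_page)
print(resp.json())
Either way the same HTTP request is sent; the client just saves you the URL building and error handling.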

You may also try elasticsearch_dsl; it is a higher-level wrapper built on top of elasticsearch-py.
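A minimal persistence sketch with elasticsearch_dsl (6.2+; the Webpage class and its fields are made up for illustration) could look like this:
from datetime import datetime
from elasticsearch_dsl import Document, Text, Date, connections

# register a default connection to the local cluster
connections.create_connection(hosts=['localhost'])

class Webpage(Document):
    # hypothetical fields for a crawled page
    url = Text()
    title = Text()
    crawled_at = Date()

    class Index:
        name = 'webpages'

# create the index and mappings in Elasticsearch
Webpage.init()

# index one document
Webpage(url='http://example.com', title='Example', crawled_at=datetime.now()).save()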

Related

connection between kafka and elasticsearch using python

I'm new to using Kafka and Elasticsearch and I've run into some problems. I've set up a docker-compose file with all the images needed to build the environment. Using Kafka, I've produced data into a specific topic, and now I need to take the data from a Kafka consumer into a pub/sub system so it can be ingested into Elasticsearch.
I implemented all of this in Python. Elasticsearch responds on localhost at its port, but the Kibana page only shows the following message:
kibana server is not ready yet
The Python consumer with which I take data from a topic looks something like this:
from kafka import KafkaConsumer
# Import sys module
import sys
# Import json module to deserialize data
import json

# Initialize the consumer and set a value_deserializer to decode JSON
consumer = KafkaConsumer(
    'JSONtopic',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')))

for message in consumer:
    print("Consumer records:\n")
    print(message)
    print("\nReading from JSON data\n")
    # message[6] is the record value (equivalent to message.value)
    print("Name:", message[6]['name'])
    print("Email:", message[6]['email'])
    # Terminate the script
    sys.exit()
The goal is to use Elasticsearch for analysis, so I need it as the backend for visualizing the data in Kibana. A tutorial I could follow to understand how to link these pieces together would also be really appreciated.
(P.S. The data flows without problems from one topic to another; the problem is taking this information, inserting it into Elasticsearch, and being able to visualize it in Kibana.)
If you're pushing data from Kafka to Elasticsearch, then doing it with the Consumer API is typically not a good idea, since existing tools do it much better and handle more functionality.
For example:
Kafka Connect (e.g. 🎥 https://rmoff.dev/kafka-elasticsearch-video)
Logstash
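That said, if you do want to stay with a plain Python consumer, a minimal sketch of wiring the consumer from the question to elasticsearch-py might look like this; the topic name and the people index are assumptions, and older client versions also require a doc_type argument:
import json
from elasticsearch import Elasticsearch
from kafka import KafkaConsumer

# assumed local addresses for the Kafka broker and Elasticsearch
consumer = KafkaConsumer(
    'JSONtopic',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')))
es = Elasticsearch(['localhost:9200'])

for message in consumer:
    # message.value is the decoded JSON dict; index it into a hypothetical "people" index
    es.index(index='people', body=message.value)
Once documents are in the index, you can point a Kibana index pattern at it to visualize them.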

Need help retrieving Google cloudSQL metadata and logs using Python

I am new to Google Cloud and would like to know if there is a way to retrieve Cloud SQL (MySQL) instance metadata and error logs using Python.
I installed the Google Cloud SDK and ran the following command to retrieve metadata, and I got detailed metadata like IP, region, disk, etc.:
gcloud sql instances describe my-instance-id
I need to check this data periodically for compliance. I have worked on AWS, where I use the boto3 Python package for this kind of task. I googled for a boto3 equivalent for Google Cloud, but the docs for the Google API client are really confusing to me.
I also need to fetch the MySQL error logs from the Cloud SQL instance (for alerting in case any errors are found).
Can anyone show me how to perform these operations using the Google API for Python, or point me in the right direction?
Here is sample code showing how to retrieve the Cloud SQL MySQL error logs using the Cloud Logging API. For testing, I logged in with a wrong password to generate error logs.
The filter used is a sample filter from the Cloud Logging docs.
from google.cloud.logging import Client

projectName = 'your-project-here'
myFilter = 'resource.type = "cloudsql_database" AND log_id("cloudsql.googleapis.com/mysql.err")'

client = Client(project=projectName)
entries = client.list_entries(filter_=myFilter)

for entry in entries:
    print(entry)
Here's how to get SQL instance metadata:
import json
from googleapiclient import discovery

# build a client for the Cloud SQL Admin API
service = discovery.build('sqladmin', 'v1beta4')

# list all instances in the project and dump the response
req = service.instances().list(project="project-name")
resp = req.execute()
print(json.dumps(resp, indent=2))
Credit to @AKX; I found the answer at cloud.google.com/sql/docs/mysql/admin-api/libraries#python
No luck on the 2nd part though, i.e. retrieving the MySQL error logs.

Cannot find optimize function for indices in elasticsearch python client

I am using the Elasticsearch Python client 6.4.0.
I want to use the Optimize API:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-optimize.html
But I could not find anything about it in the Elasticsearch Python API docs:
https://elasticsearch-py.readthedocs.io/en/master/api.html
I tried using es.indices.optimize(...) but that function does not exist.
I would prefer to use the Python client instead of a direct API call.
You are looking at the docs for a very old version of Elasticsearch (1.7). The Optimize API was renamed to Force Merge, and under that name it is available in the Python client (docs).
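For example, with elasticsearch-py 6.4.0 (the index name here is a placeholder):
from elasticsearch import Elasticsearch

es = Elasticsearch()
# force merge the index down to one segment (the old "optimize" operation)
es.indices.forcemerge(index='my-index', max_num_segments=1)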

Sending data to document db in python

I'm currently trying to send data to an Azure DocumentDB collection in Python (using the pydocumentdb lib).
I have to send about 100,000 documents to this collection, and it takes a very long time (about 2 hours).
I send each document one by one using:
for document in documents:
    client.CreateDocument(collection_link, document)
Am I doing something wrong? Is there another, faster way to do it, or is it just normal that it takes so long?
Thanks!
On Azure, there are many ways to import data into Cosmos DB faster than using the PyDocumentDB API, which wraps the related REST APIs over HTTP.
First, prepare a JSON file that includes your 100,000 documents for importing; then you can follow the documents below to import the data.
Refer to the document How to import data into Azure Cosmos DB for the DocumentDB API? to import the JSON data file via the DocumentDB Data Migration Tool.
Refer to the document Azure Cosmos DB: How to import MongoDB data? to import the JSON data file via MongoDB's mongoimport tool.
Upload the JSON data file to Azure Blob Storage, then copy the data from Blob Storage to Cosmos DB using Azure Data Factory; see the section Example: Copy data from Azure Blob to Azure Cosmos DB for more details.
If you just want to import the data programmatically, you can try using the Python MongoDB driver to connect to Azure Cosmos DB and import the data via the MongoDB wire protocol; please refer to the document Introduction to Azure Cosmos DB: API for MongoDB.
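For that last option, a rough sketch with pymongo might look like the following; the connection string is a placeholder you would copy from the Azure portal, and insert_many batches the writes instead of making one request per document:
from pymongo import MongoClient

# placeholder Cosmos DB (MongoDB API) connection string from the Azure portal
client = MongoClient('mongodb://<account>:<key>@<account>.documents.azure.com:10255/?ssl=true')
collection = client['mydatabase']['mycollection']

# example payload standing in for your ~100,000 documents
documents = [{'id': str(i), 'value': i} for i in range(100000)]

# send the documents in chunks rather than one round trip each
for i in range(0, len(documents), 1000):
    collection.insert_many(documents[i:i + 1000])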
Hope it helps.

Is it faster to query CouchDB via an adapter instead of REST?

Let's say I have some data in a CouchDB database. The overall size is about 100K docs.
I have a _design doc which stores a 'get all entities' view.
Assuming the requests are done on a local machine against a local database:
via curl: curl -X GET http://127.0.0.1/mydb/_design/myexample/_view/all
via Couchdbkit: entities = Entity.view('mydb/all')
Does option 1 have to perform any additional work compared to option 2 (JSON encoding/decoding, HTTP request parsing, etc.), and how can that affect the performance of querying 'all' entities from the database?
I guess that directly querying the database (option 2) should be faster than wrapping request/response into JSON, but I am not sure about that.
Under the API covers, Couchdbkit uses the restkit package, which is a REST library.
In other words, Couchdbkit is a pythonic API to the CouchDB REST API, and will do the same amount of work as using the REST API yourself.
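To make that concrete, option 1 done from Python is essentially what Couchdbkit does for you internally; the requests call below assumes CouchDB's default port 5984 and the database/view names from the question:
import requests

# the raw REST call, equivalent to the curl command in option 1
resp = requests.get('http://127.0.0.1:5984/mydb/_design/myexample/_view/all')
rows = resp.json()['rows']  # the same HTTP round trip and JSON decoding happen inside Couchdbkit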
