I send logs from a desktop Python application (Python 3.6) to Logstash (7.5.0). When I log an error message, for example with the text ">>>>>>>> ERROR <<<<<<<", I see the following entry in the Logstash log file:
[2020-01-22T13:25:02,330][WARN ][logstash.codecs.line ][main] Received an event that has a different character encoding than you configured. {:text=>"\u0000\u0000\u0000MainThreadq\u001AX\v\u0000\u0000\u0000processNameq\eX\v\u0000\u0000\u0000MainProcessq\u001CX\a\u0000\u0000\u0000processq\u001DM\u001D\xEDu.\u0000\u0000\u0002\u001D}q\u0000(X\u0004\u0000\u0000\u0000nameq\u0001X\b\u0000\u0000\u0000__main__q\u0002X\u0003\u0000\u0000\u0000msgq\u0003X\u0018\u0000\u0000\u0000>>>>>>>> ERROR <<<<<<", :expected_charset=>"UTF-8"}
In Kibana, when I query the received messages, I see that several individual documents (in this case, 6) have been indexed for each log message that I sent (here, ">>>>>>>> ERROR <<<<<<<"), as follows:
{
"_index" : "logstash-2020.01.23",
"_type" : "doc",
"_id" : "lNXhz28BzTlrr0WBIjwA",
"_score" : 1.0,
"_source" : {
"host" : "localhost",
"port" : 50197,
"message" : """\u0000\u0000\u0000stack_infoq\u0011NX\u0006\u0000\u0000\u0000linenoq\u0012K'X\b\u0000\u0000\u0000funcNameq\u0013X\b\u0000\u0000\u0000<module>q\u0014X\a\u0000\u0000\u0000createdq\u0015GA\u05CA;vϯWX\u0005\u0000\u0000\u0000msecsq\u0016G#n\xA2u\xEC\u0000\u0000\u0000X\u000F\u0000\u0000\u0000relativeCreatedq\u0017G#E\u001DM\xD0\u0000\u0000\u0000X\u0006\u0000\u0000\u0000threadq\u0018L4437804480L""",
"#version" : "1",
"#timestamp" : "2020-01-23T00:50:35.362Z"
}
},
{
"_index" : "logstash-2020.01.23",
"_type" : "doc",
"_id" : "k9Xhz28BzTlrr0WBITyc",
"_score" : 1.0,
"_source" : {
"host" : "localhost",
"port" : 50197,
"message" : """threadNameqX""",
"#version" : "1",
"#timestamp" : "2020-01-23T00:50:35.362Z"
}
},
{
"_index" : "logstash-2020.01.23",
"_type" : "doc",
"_id" : "kdXhz28BzTlrr0WBITyc",
"_score" : 1.0,
"_source" : {
"host" : "localhost",
"port" : 50197,
"message" : """MainThreadqXprocessNameqXMainProcessqXprocessqMC0u.""",
"#version" : "1",
"#timestamp" : "2020-01-23T00:50:35.369Z"
}
},
{
"_index" : "logstash-2020.01.23",
"_type" : "doc",
"_id" : "ktXhz28BzTlrr0WBITyc",
"_score" : 1.0,
"_source" : {
"host" : "localhost",
"port" : 50197,
"message" : "X",
"#version" : "1",
"#timestamp" : "2020-01-23T00:50:35.362Z"
}
},
{
"_index" : "logstash-2020.01.23",
"_type" : "doc",
"_id" : "j9Xhz28BzTlrr0WBITyc",
"_score" : 1.0,
"_source" : {
"host" : "localhost",
"port" : 50197,
"message" : """XfilenameqXtest2.pyqXmoduleq
Xtest2qXexc_infoqNXexc_textqNX""",
"#version" : "1",
"#timestamp" : "2020-01-23T00:50:35.345Z"
}
},
{
"_index" : "logstash-2020.01.23",
"_type" : "doc",
"_id" : "kNXhz28BzTlrr0WBITyc",
"_score" : 1.0,
"_source" : {
"host" : "localhost",
"port" : 50197,
"message" : """}q(XnameqX__main__qXmsgqX>>>>>>>> ERROR <<<<<<<qXargsqNX levelnameqXERRORqXlevelnoqK2Xpathnameq X1/Users/e0h014b/PycharmProjects/logstash2/test2.pyq""",
"#version" : "1",
"#timestamp" : "2020-01-23T00:50:35.331Z"
}
}
The Logstash config file I'm using is the following:
input {
  tcp {
    port => 5959
    codec => plain {
      charset => "UTF-8"
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
What should I do to get a properly formatted log in Logstash? Which codec and character encoding are appropriate for this application?
Thanks,
Elahe
If your log messages just contain simple lines, you should go with the default codec, namely line.
I would always start with the default codec, test, verify the indexed data, and then fine-tune or change the codec if necessary.
Refer to this documentation about all available codecs.
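For example, a minimal sketch of the input block using the line codec (port and charset taken from your existing config; the rest of the pipeline stays the same):

input {
  tcp {
    port => 5959
    codec => line {
      charset => "UTF-8"
    }
  }
}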
I hope this helps.
Related
Being new to Python, I am unable to resolve the following issue.
Below is my Python code, which returns different JSON output when each line is read from a file compared to when the list is passed in a variable.
Code with lines read from a file, which produces partial/corrupted output:
import requests
import json
import re

repo_name = "repo"

with open('file_list_new.txt') as file:
    for line in file:
        url = "http://fqdn/repository/{0}/{1}?describe=json".format(repo_name, line)
        response = requests.get(url)
        json_data = response.text
        data = json.loads(json_data)
        print(data)
        for size in data['items']:
            if size['name'] == 'Payload':
                value_size = size['value']['Size']
                if value_size != -1:
                    print(value_size)
Content of file_list_new.txt:
mysql.odbc/5.1.14
mysql.odbc/5.1.11
Corrupted output:
{
"parameters" : {
"path" : "/mysql.odbc/5.1.14\n",
"nexusUrl" : "http://fqdn"
},
"items" : [ {
"name" : "Exception during handler processing",
"type" : "topic",
"value" : "Exception during handler processing"
}, {
"name" : "java.lang.IllegalArgumentException",
"type" : "table",
"value" : {
"Message" : "Illegal character in path at index 40: Packages(Id='mysql.odbc',Version='5.1.14\n')"
}
}, {
"name" : "java.net.URISyntaxException",
"type" : "table",
"value" : {
"Message" : "Illegal character in path at index 40: Packages(Id='mysql.odbc',Version='5.1.14\n')"
}
}, {
"name" : "Request",
"type" : "topic",
"value" : "Request"
}, {
"name" : "Details",
"type" : "table",
"value" : {
"Action" : "GET",
"path" : "/mysql.odbc/5.1.14\n"
}
}, {
"name" : "Parameters",
"type" : "table",
"value" : {
"describe" : "json"
}
}, {
"name" : "Headers",
"type" : "table",
"value" : {
"Accept" : "*/*",
"User-Agent" : "python-requests/2.27.1",
"Connection" : "keep-alive",
"Host" : "fqdn",
"Accept-Encoding" : "gzip, deflate"
}
}, {
"name" : "Attributes",
"type" : "table",
"value" : {
"org.apache.shiro.subject.support.DefaultSubjectContext.SESSION_CREATION_ENABLED" : false,
"Key[type=org.sonatype.nexus.security.SecurityFilter, annotation=[none]].FILTERED" : true,
"authcAntiCsrf.FILTERED" : true,
"nx-authc.FILTERED" : true,
"org.apache.shiro.web.servlet.ShiroHttpServletRequest_SESSION_ID_URL_REWRITING_ENABLED" : true,
"javax.servlet.include.servlet_path" : "/repository/repo/mysql.odbc/5.1.14%0A",
"nx-anonymous.FILTERED" : true,
"org.sonatype.nexus.security.anonymous.AnonymousFilter.originalSubject" : "org.apache.shiro.web.subject.support.WebDelegatingSubject#33c429ba",
"nx-apikey-authc.FILTERED" : true
}
}, {
"name" : "Payload",
"type" : "table",
"value" : {
"Content-Type" : "",
"Size" : -1
}
} ]
}
Code with the list defined in a variable within the code:
import requests
import json
import re

repo_name = "repo"
file_list = ["mysql.odbc/5.1.11", "mysql.odbc/5.1.14"]

for i in file_list:
    url = "http://fqdn/repository/{0}/{1}?describe=json".format(repo_name, i)
    response = requests.get(url)
    json_data = response.text
    data = json.loads(json_data)
    for size in data['items']:
        if size['name'] == 'Payload':
            value_size = size['value']['Size']
            if value_size != -1:
                print(value_size)
Expected output, with the list passed within the code:
{
"parameters" : {
"path" : "/mysql.odbc/5.1.14",
"nexusUrl" : "http://fqdn"
},
"items" : [ {
"name" : "Request",
"type" : "topic",
"value" : "Request"
}, {
"name" : "Details",
"type" : "table",
"value" : {
"Action" : "GET",
"path" : "/mysql.odbc/5.1.14"
}
}, {
"name" : "Parameters",
"type" : "table",
"value" : {
"describe" : "json"
}
}, {
"name" : "Headers",
"type" : "table",
"value" : {
"Accept" : "*/*",
"User-Agent" : "python-requests/2.27.1",
"Connection" : "keep-alive",
"Host" : "fqdn",
"Accept-Encoding" : "gzip, deflate"
}
}, {
"name" : "Attributes",
"type" : "table",
"value" : {
"org.apache.shiro.subject.support.DefaultSubjectContext.SESSION_CREATION_ENABLED" : false,
"Key[type=org.sonatype.nexus.security.SecurityFilter, annotation=[none]].FILTERED" : true,
"authcAntiCsrf.FILTERED" : true,
"nx-authc.FILTERED" : true,
"org.apache.shiro.web.servlet.ShiroHttpServletRequest_SESSION_ID_URL_REWRITING_ENABLED" : true,
"javax.servlet.include.servlet_path" : "/repository/repo/mysql.odbc/5.1.14",
"nx-anonymous.FILTERED" : true,
"org.sonatype.nexus.security.anonymous.AnonymousFilter.originalSubject" : "org.apache.shiro.web.subject.support.WebDelegatingSubject#1433a6c9",
"nx-apikey-authc.FILTERED" : true
}
}, {
"name" : "Payload",
"type" : "table",
"value" : {
"Content-Type" : "",
"Size" : -1
}
}, {
"name" : "Response",
"type" : "topic",
"value" : "Response"
}, {
"name" : "Status",
"type" : "table",
"value" : {
"Code" : 200,
"Message" : ""
}
}, {
"name" : "Headers",
"type" : "table",
"value" : {
"ETag" : "\"df4f013db18103f1b9541cdcd6ba8632\"",
"Content-Disposition" : "attachment; filename=mysql.odbc.5.1.14.nupkg",
"Last-Modified" : "Tue, 13 Oct 2015 03:54:48 GMT"
}
}, {
"name" : "Attributes",
"type" : "table",
"value" : { }
}, {
"name" : "Payload",
"type" : "table",
"value" : {
"Content-Type" : "application/zip",
"Size" : 3369
}
} ]
}
I am not sure if I am doing something wrong or missing something simple.
Any help is much appreciated.
It looks like the newlines are being passed into the URL string:
"Message" : "Illegal character in path at index 40: Packages(Id='mysql.odbc',Version='5.1.14\n')"
You can do something like this to remove them:
with open('file_list_new.txt') as file:
    for line in file:
        url = "http://fqdn/repository/{0}/{1}?describe=json".format(repo_name, line.strip())
I am new to Elasticsearch and trying to write queries.
I have an index where, among other fields, there are two fields Sno and request_sno.
I want to make a query where a document/row with a certain Sno is followed by the doc/row whose request_sno is exactly the same as the previous Sno.
Example:
Sno:1, name:'a', address:'b',..., request_sno:''
Sno:2, name:'', address:'',...., request_sno:1
These two should come together, one row followed by the other.
At first I thought of group by, but I don't want aggregation.
Any help will be highly appreciated.
You can resolve this use case with the collapse functionality of Elasticsearch.
To implement collapse, your documents need a single field to collapse on. So let's create a field called collapse_id which holds the value of request_sno and, if request_sno is empty or null, the value of the sno field instead.
So your final documents will look like this:
{
"_index" : "collapse",
"_type" : "_doc",
"_id" : "kNNEbH4Bb7CAaZKC_vAY",
"_score" : 1.0,
"_source" : {
"sno" : 1,
"name" : "a",
"address" : "b",
"collapse_id" : 1
}
},
{
"_index" : "collapse",
"_type" : "_doc",
"_id" : "kdNFbH4Bb7CAaZKC-PBM",
"_score" : 1.0,
"_source" : {
"sno" : 2,
"name" : "a",
"address" : "b",
"request_sno" : 1,
"collapse_id" : 1
}
}
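As a rough sketch of how collapse_id could be filled at index time from Python (index name and fields as in the example above; the local connection settings are an assumption):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

docs = [
    {"sno": 1, "name": "a", "address": "b", "request_sno": None},
    {"sno": 2, "name": "a", "address": "b", "request_sno": 1},
]

for doc in docs:
    # collapse_id takes request_sno, falling back to sno when it is empty/null
    doc["collapse_id"] = doc["request_sno"] or doc["sno"]
    es.index(index="collapse", body=doc)  # use document=doc on newer client versions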
You can use the query below to get the collapsed result:
POST collapse/_search
{
"_source": false,
"query": {
"match_all": {}
},
"collapse": {
"field": "collapse_id",
"inner_hits": {
"name": "sno_reqsno_match",
"size": 10
}
}
}
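From Python, the same collapse search could be run roughly like this (reusing the client from the sketch above; the inner hits for each group are read from every top-level hit):

resp = es.search(
    index="collapse",
    body={
        "_source": False,
        "query": {"match_all": {}},
        "collapse": {
            "field": "collapse_id",
            "inner_hits": {"name": "sno_reqsno_match", "size": 10},
        },
    },
)

for hit in resp["hits"]["hits"]:
    # each top-level hit carries the matching sno/request_sno pair as inner hits
    pair = hit["inner_hits"]["sno_reqsno_match"]["hits"]["hits"]
    print([d["_source"] for d in pair])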
Your result will look like the following:
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "collapse",
"_type" : "_doc",
"_id" : "kNNEbH4Bb7CAaZKC_vAY",
"_score" : 1.0,
"fields" : {
"collapse_id" : [
1
]
},
"inner_hits" : {
"sno_reqsno_match" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "collapse",
"_type" : "_doc",
"_id" : "kNNEbH4Bb7CAaZKC_vAY",
"_score" : 1.0,
"_source" : {
"sno" : 1,
"name" : "a",
"address" : "b",
"collapse_id" : 1
}
},
{
"_index" : "collapse",
"_type" : "_doc",
"_id" : "kdNFbH4Bb7CAaZKC-PBM",
"_score" : 1.0,
"_source" : {
"sno" : 2,
"name" : "a",
"address" : "b",
"request_sno" : 1,
"collapse_id" : 1
}
}
]
}
}
}
},
{
"_index" : "collapse",
"_type" : "_doc",
"_id" : "ktNNbH4Bb7CAaZKC5PC8",
"_score" : 1.0,
"fields" : {
"collapse_id" : [
3
]
},
"inner_hits" : {
"sno_reqsno_match" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "collapse",
"_type" : "_doc",
"_id" : "ktNNbH4Bb7CAaZKC5PC8",
"_score" : 1.0,
"_source" : {
"sno" : 3,
"name" : "a",
"address" : "b",
"collapse_id" : 3
}
},
{
"_index" : "collapse",
"_type" : "_doc",
"_id" : "k9NObH4Bb7CAaZKCGfAc",
"_score" : 1.0,
"_source" : {
"sno" : 4,
"name" : "a",
"address" : "b",
"request_sno" : 3,
"collapse_id" : 3
}
}
]
}
}
}
}
]
}
Is there a way to configure the elasticsearch analyzer so that it is possible to get unique error messages in different scenarios?
1."...July 2020 23:00:00.674z... same message....."
2. Slight changes in the string:
message1: "....message_details.. (unknown error 20004)
message2: "....message_details.. (unknown error 278945)
OR
message1:"....a::::: message_details ...."
message2:"....a:f23ed:fff:ff:: message_details ...."
The above two messages are the same apart from the character difference.
Here is the query:
GET log_stash_2020.06.16/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"message": "Error"
}
},
{
"match_phrase": {
"type": "lab_id"
}
}
]
}
},
"aggs": {
"log_message": {
"significant_text": {
"field": "message",
"filter_duplicate_text": "true"
}
}
},
"size": 1000
}
I have added sample entries from the log file below.
{
"_index" : "logstash_2020.06.16",
"_type" : "doc",
"_id" : "################",
"_score" : 1.0,
"_source" : {
"logsource" : "router_id",
"timestamp" : "Jun 15 20:00:00",
"program" : "some_program",
"host" : "#############",
"priority" : "27",
"#timestamp" : "2020-06-16T00:00:01.020Z",
"type" : "lab_id",
"pid" : "####",
"message" : ": ############### send failed with error: ENOENT -- Item not found (No error: 0)",
"#version" : "1"
}
}
{
"_index" : "logstash_2020.06.16",
"_type" : "doc",
"_id" : "################",
"_score" : 1.0,
"_source" : {
"host" : "################",
"#timestamp" : "2020-06-16T00:00:02.274Z",
"type" : "####",
"tags" : [
"_grokparsefailure"
],
"message" : "################:Jun 15 20:00:18.908 EDT: mediasvr[2546]: %MEDIASVR-MEDIASVR-4-PARTITION_USAGE_ALERT : High disk usage alert : host ##### exceeded 100% \n",
"#version" : "1"
}
}
Is there a way to do it in Python? (In case Elasticsearch does not have the above-mentioned functionality.)
You can use the Elasticsearch Python client like so:
from elasticsearch import Elasticsearch
es = Elasticsearch(...)
resp = es.search(index="log_stash_2020.06.16", body={<dsl query>})
print(resp)
where <dsl query> is whatever query you want to run, like the one you gave in the question.
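For instance, a sketch that plugs in the exact query from your question (assuming a local, unauthenticated cluster; adjust the connection details to your setup):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed connection settings

query = {
    "query": {
        "bool": {
            "must": [
                {"match_phrase": {"message": "Error"}},
                {"match_phrase": {"type": "lab_id"}},
            ]
        }
    },
    "aggs": {
        "log_message": {
            "significant_text": {
                "field": "message",
                "filter_duplicate_text": True,
            }
        }
    },
    "size": 1000,
}

resp = es.search(index="log_stash_2020.06.16", body=query)

# the significant_text buckets hold the terms Elasticsearch found notable
for bucket in resp["aggregations"]["log_message"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])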
<disclosure: I'm the maintainer of the Elasticsearch client and employed by Elastic>
I'm trying to use the bulk API of Elasticsearch, and I see that this can be done using the following request, which is special because what is given as "data" is not a proper JSON document, but JSON that uses \n as a delimiter.
curl -XPOST 'localhost:9200/_bulk?pretty' -H 'Content-Type: application/json' -d '
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
'
My question is: how can I perform such a request from Python? The authors of Elasticsearch suggest not pretty-printing the JSON, but I'm not sure what that means (see https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html).
I know that this is a valid Python request:
import requests
import json
data = json.dumps({"field": "value"})
r = requests.post("http://localhost:9200/_bulk?pretty", data=data)
But what do I do if the JSON is \n-delimited?
What this really is is a set of individual JSON documents, joined together with newlines. So you could do something like this:
data = [
    { "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } },
    { "field1" : "value1" },
    { "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } },
    { "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } },
    { "field1" : "value3" },
    { "update" : { "_id" : "1", "_type" : "type1", "_index" : "test" } },
    { "doc" : { "field2" : "value2" } }
]

data_to_post = '\n'.join(json.dumps(d) for d in data)
r = requests.post("http://localhost:9200/_bulk?pretty", data=data_to_post)
However, as pointed out in the comments, the Elasticsearch Python client is likely to be more useful.
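For reference, a minimal sketch with the official client's bulk helper, which builds the newline-delimited body for you (connection settings assumed; op types and IDs taken from the question):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

actions = [
    {"_op_type": "index", "_index": "test", "_id": "1", "field1": "value1"},
    {"_op_type": "delete", "_index": "test", "_id": "2"},
    {"_op_type": "create", "_index": "test", "_id": "3", "field1": "value3"},
    {"_op_type": "update", "_index": "test", "_id": "1", "doc": {"field2": "value2"}},
]

# bulk() serializes each action to NDJSON and sends everything in one request
bulk(es, actions)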
As a follow-up to Daniel's answer above, I had to add an additional '\n' to the end of data_to_post, and add a Content-Type: application/x-ndjson header to get it to work in Elasticsearch 6.3.
data_to_post = '\n'.join(json.dumps(d) for d in data) + "\n"
headers = {"Content-Type": "application/x-ndjson"}
r = requests.post("http://localhost:9200/_bulk?pretty", data=data_to_post, headers=headers)
Otherwise, I will receive the error:
"The bulk request must be terminated by a newline [\\n]"
You can use the Python ndjson library to do it.
https://pypi.org/project/ndjson/
It contains JSONEncoder and JSONDecoder classes for easy use with other libraries, such as requests:
import ndjson
import requests
response = requests.get('https://example.com/api/data')
items = response.json(cls=ndjson.Decoder)
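For the bulk request in the question, the encoding direction could look roughly like this (URL and actions taken from the question; note the trailing newline and the x-ndjson content type the bulk API expects):

import ndjson
import requests

actions = [
    {"index": {"_index": "test", "_type": "type1", "_id": "1"}},
    {"field1": "value1"},
]

# ndjson.dumps joins the documents with newlines; the bulk API also wants a final one
payload = ndjson.dumps(actions) + "\n"

r = requests.post(
    "http://localhost:9200/_bulk?pretty",
    data=payload,
    headers={"Content-Type": "application/x-ndjson"},
)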
I'm relatively new to Elasticsearch. I'd really be grateful if anyone could help me with an issue in an aggregation query. I have the default setup of five shards on a single node. My index contains the following mapping:
mapping = {
"my_index": {
"date_detection": False,
"dynamic_templates": [{
"string_fields": {
"mapping": {
"type": "string",
"fields": {
"raw": {
"index": "not_analyzed",
"ignore_above": 256,
"type": "string"
}
}
},
"match_mapping_type": "string",
"match": "*"
}
}]
}
}
I have already created aggregation queries for my requirement as follows:
records = es.search(
    index="my_index",
    doc_type="marksheet",
    body={
        "aggs": {
            "student_name": {
                "terms": {
                    "field": "name.raw",
                    "order": {"total_score": "desc"}
                },
                "aggs": {
                    "total_score": {
                        "sum": {"field": "score"}
                    }
                }
            }
        }
    }
)
This query works perfectly fine, just as I need it, most of the time. But sometimes, for reasons unknown, the same query returns very large or very small results, like 1.4e-322.
I haven't been able to find a proper reason why this happens on its own. I would really appreciate it if someone could help me out with this. Thank you!
UPDATE:
After running the following aggregation:
{"aggs":{"score_stats":{"stats":{"field":"score"}}}}
I get the results in the aggregation key as:
{u'score_stats': {u'count': 1186, u'max': 5e-323, u'sum': 4.5187e-320, u'avg': 4e-323, u'min': 2e-323}}
UPDATE 2:
After running the query as the one below:
curl -XGET 'localhost:9200/my_index/marksheet/_search?_source=score&size=100&pretty&filter_path=hits.hits.score'
The hits key in the output is as follows:
"hits" : [ {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alGT0VlANyomm3HT",
"_score" : 1.0,
"_source":{"score":10}
}, {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alGV0VlANyomm3HV",
"_score" : 1.0,
"_source":{"score":10}
}, {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alGa0VlANyomm3Ha",
"_score" : 1.0,
"_source":{"score":8}
}, {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alGf0VlANyomm3Hh",
"_score" : 1.0,
"_source":{"score":8}
}, {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alGk0VlANyomm3Hn",
"_score" : 1.0,
"_source":{"score":6}
}, {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alGp0VlANyomm3Hu",
"_score" : 1.0,
"_source":{"score":10}
}, {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alGu0VlANyomm3H0",
"_score" : 1.0,
"_source":{"score":10}
}, {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alGz0VlANyomm3H7",
"_score" : 1.0,
"_source":{"score":10}
}, {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alHA0VlANyomm3IN",
"_score" : 1.0,
"_source":{"score":8}
}, {
"_index" : "my_index",
"_type" : "marksheet",
"_id" : "AU61alHD0VlANyomm3IR",
"_score" : 1.0,
"_source":{"score":10}
},
...