When scrolling in Elasticsearch, it is important to provide the latest scroll_id with each scroll request:
The initial search request and each subsequent scroll request returns
a new scroll_id — only the most recent scroll_id should be used.
The following example (taken from here) puzzles me. First, the scrolling initialization:
rs = es.search(index=['tweets-2014-04-12', 'tweets-2014-04-13'],
               scroll='10s',
               search_type='scan',
               size=100,
               preference='_primary_first',
               body={
                   "fields": ["created_at", "entities.urls.expanded_url", "user.id_str"],
                   "query": {
                       "wildcard": {"entities.urls.expanded_url": "*.ru"}
                   }
               })
sid = rs['_scroll_id']
and then the looping:
tweets = []
while True:
    try:
        rs = es.scroll(scroll_id=sid, scroll='10s')
        tweets += rs['hits']['hits']
    except:
        break
It works, but I don't see where sid is updated... I believe it happens internally, in the Python client, but I don't understand how it works...
This is an old question, but for some reason it came up first when searching for "elasticsearch python scroll". The Python module provides a helper method that does all the work for you. It is a generator function that returns each document to you while managing the underlying scroll ids.
https://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan
Here is an example of usage:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

query = {
    "query": {"match_all": {}}
}

es = Elasticsearch(...)
for hit in scan(es, index="my-index", query=query):
    print(hit["_source"]["field"])
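As a side note, the helper does exactly the bookkeeping the original question asks about: it keeps passing the most recent scroll_id between requests for you and, by default, clears the scroll context once iteration finishes.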
Using the Python requests library:
import requests
import json
elastic_url = 'http://localhost:9200/my_index/_search?scroll=1m'
scroll_api_url = 'http://localhost:9200/_search/scroll'
headers = {'Content-Type': 'application/json'}
payload = {
    "size": 100,
    "sort": ["_doc"],
    "query": {
        "match": {
            "title": "elasticsearch"
        }
    }
}
r1 = requests.request(
    "POST",
    elastic_url,
    data=json.dumps(payload),
    headers=headers
)
# first batch of data
try:
    res_json = r1.json()
    data = res_json['hits']['hits']
    _scroll_id = res_json['_scroll_id']
except KeyError:
    data = []
    _scroll_id = None
    print('Error: Elastic Search: %s' % str(r1.json()))
while data:
    print(data)

    # scroll to get the next batch of data
    scroll_payload = json.dumps({
        'scroll': '1m',
        'scroll_id': _scroll_id
    })
    scroll_res = requests.request(
        "POST", scroll_api_url,
        data=scroll_payload,
        headers=headers
    )
    try:
        res_json = scroll_res.json()
        data = res_json['hits']['hits']
        _scroll_id = res_json['_scroll_id']
    except KeyError:
        data = []
        _scroll_id = None
        err_msg = 'Error: Elastic Search Scroll: %s'
        print(err_msg % str(scroll_res.json()))
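Once the loop exits, it's good practice to free the server-side search context instead of waiting for the keep-alive to expire. A minimal sketch, assuming the same scroll_api_url and headers as above (the delete-scroll endpoint accepts the last scroll_id in the request body):

# release the search context once iteration is finished
if _scroll_id:
    requests.delete(
        scroll_api_url,
        data=json.dumps({'scroll_id': _scroll_id}),
        headers=headers
    )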
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#search-request-scroll
In fact the code has a bug in it - in order to use the scroll feature correctly you are supposed to use the new scroll_id returned with each new call in the next call to scroll(), not reuse the first one:
Important
The initial search request and each subsequent scroll request returns
a new scroll_id — only the most recent scroll_id should be used.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
It's working because Elasticsearch does not always change the scroll_id between calls and, for smaller result sets, can return the same scroll_id as was originally returned for some time. This discussion from last year is between two other users seeing the same issue, the same scroll_id being returned for a while:
http://elasticsearch-users.115913.n3.nabble.com/Distributing-query-results-using-scrolling-td4036726.html
So while your code is working for a smaller result set, it's not correct - you need to capture the scroll_id returned in each new call to scroll() and use that for the next call.
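For reference, a corrected version of the loop from the question might look like this (a sketch; it also replaces the bare except with an explicit empty-batch check):

tweets = []
while True:
    rs = es.scroll(scroll_id=sid, scroll='10s')
    hits = rs['hits']['hits']
    if not hits:
        break  # no more results
    tweets += hits
    sid = rs['_scroll_id']  # always use the most recent scroll_id on the next call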
# assumes module-level imports: sys, logging, requests, and pandas as pd
self._elkUrl = "http://Hostname:9200/logstash-*/_search?scroll=1m"
self._scrollUrl = "http://Hostname:9200/_search/scroll"

def GetDataFromELK(self):
    """
    Function to get the data from ELK through the scrolling mechanism
    """
    # implement scrolling to retrieve more than 100,000 records in one search
    # ref: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-scroll.html
    try:
        dataFrame = pd.DataFrame()
        if self._elkUrl is None:
            raise ValueError("_elkUrl is missing")
        if self._username is None:
            raise ValueError("_username for ELK is missing")
        if self._password is None:
            raise ValueError("_password for ELK is missing")
        response = requests.post(self._elkUrl, json=self.body, auth=(self._username, self._password))
        response = response.json()
        if response is None:
            raise ValueError("response is missing")
        sid = response['_scroll_id']
        hits = response['hits']
        total = hits["total"]
        if total is None:
            raise ValueError("total hits from ELK is none")
        total_val = int(total['value'])
        url = self._scrollUrl
        if url is None:
            raise ValueError("scroll url is missing")
        # start scrolling
        while total_val > 0:
            # keep the search context alive for 2m
            scroll = '2m'
            scroll_query = {"scroll": scroll, "scroll_id": sid}
            response1 = requests.post(url, json=scroll_query, auth=(self._username, self._password))
            response1 = response1.json()
            # the result includes a scroll_id, which should be passed to the scroll API to retrieve the next batch of results
            sid = response1['_scroll_id']
            hits = response1['hits']
            data = response1['hits']['hits']
            if len(data) > 0:
                cleanDataFrame = self.DataClean(data)
                dataFrame = dataFrame.append(cleanDataFrame)
            total_val = len(response1['hits']['hits'])
        num = len(dataFrame)
        print('Total records received from ELK =', num)
        return dataFrame
    except Exception as e:
        logging.error('Error while getting the data from ELK', exc_info=e)
        sys.exit()
from elasticsearch import Elasticsearch

elasticsearch_user_name = 'es_username'
elasticsearch_user_password = 'es_password'
es_index = "es_index"

es = Elasticsearch(["127.0.0.1:9200"],
                   http_auth=(elasticsearch_user_name, elasticsearch_user_password))
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "es_datetime": {
                            "gte": "2021-06-21T09:00:00.356Z",
                            "lte": "2021-06-21T09:01:00.356Z",
                            "format": "strict_date_optional_time"
                        }
                    }
                }
            ]
        }
    },
    "fields": ["*"],
    "_source": False,
    "size": 2000,
}
resp = es.search(index=es_index, body=query, scroll="1m")
old_scroll_id = resp['_scroll_id']
results = resp['hits']['hits']
while len(results):
    for i, r in enumerate(results):
        # do something with the data
        pass
    result = es.scroll(
        scroll_id=old_scroll_id,
        scroll='1m'  # length of time to keep the search context alive
    )
    # check if there's a new scroll ID
    if old_scroll_id != result['_scroll_id']:
        print("NEW SCROLL ID:", result['_scroll_id'])
    # keep track of the most recent scroll_id
    old_scroll_id = result['_scroll_id']
    results = result['hits']['hits']
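Once the loop finishes, the search context can also be released explicitly instead of waiting for the keep-alive to lapse; the Python client exposes this as clear_scroll:

# release the server-side search context when done
es.clear_scroll(scroll_id=old_scroll_id)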
I'm looking to convert this Python request to a Swift script.
Here is my working Python script that returns the accessToken!
#!/usr/bin/python
import requests
import json
#MAKE THE REQUEST
URL = "http://this/is/the/url"
headers = {
    'Accept': "application/json",
    "Accept-Language": "en_US"
}
data = {
    "grant_type": "password",
    "username": "GROUP\\SITE\\USERNAME",  # backslashes doubled so Python keeps them literally
    "password": "somepassword"
}
r = requests.get(url = URL, params = headers, data = data)
data = r.json()
accessToken = data['access_token']
print(accessToken)
When I run the Swift Playground for the code below, nothing is returned!
It seems the script exits at guard let data = data else { return }.
How could I get the same results as the Python script above?
I've tried implementing URLComponents using this tutorial...
import UIKit

var url = "http://just/the/url"

extension Dictionary {
    func percentEncoded() -> Data? {
        return map { key, value in
            let escapedKey = "\(key)"
            let escapedValue = "\(value)"
            print(escapedKey + "=" + escapedValue)
            return escapedKey + "=" + escapedValue
        }
        .joined(separator: "&")
        .data(using: .utf8)
    }
}

extension CharacterSet {
    static let urlQueryValueAllowed: CharacterSet = {
        let generalDelimitersToEncode = ":#[]@" // does not include "?" or "/" due to RFC 3986 - Section 3.4
        let subDelimitersToEncode = "$&'()*+,;="
        var allowed = CharacterSet.urlQueryAllowed
        allowed.remove(charactersIn: "\(generalDelimitersToEncode)\(subDelimitersToEncode)")
        return allowed
    }()
}
var request = URLRequest(url: URL(string: url)!)
request.httpMethod = "GET"

let parameters: [String: String] = [
    "grant_type": "password",
    "username": "GROUP\\SITE\\USER",
    "password": "somePassword"
]
request.httpBody = parameters.percentEncoded()
request.setValue("application/x-www-form-urlencoded", forHTTPHeaderField: "Content-Type")
request.setValue("application/XML", forHTTPHeaderField: "Accept")

let config = URLSessionConfiguration.default
URLSession(configuration: config).dataTask(with: request) { (data, response, err) in
    guard let data = data else { return }
    print(data)
    guard let dataAsString = String(data: data, encoding: .utf8) else { return }
    print(dataAsString)
    guard let httpResponse = response as? HTTPURLResponse,
          (200...299).contains(httpResponse.statusCode) else {
        print("Bad Credentials")
        return
    }
    // HTTP status code!
    print("HTTP RESPONSE:" + "\(httpResponse.statusCode)")
}.resume()
If I remember correctly, starting in iOS 13 you can't have an httpBody on a GET call, so you'll either need to switch to POST/PUT or add the params to the URL string (see below).
You also had different Accept headers in your Python vs. Swift: one was XML, the other JSON.
var urlComponents = URLComponents(string: "http://this/is/the/url")
urlComponents?.queryItems = [
    URLQueryItem(name: "grant_type", value: "password"),
    URLQueryItem(name: "username", value: "username"),
    URLQueryItem(name: "password", value: "somepassword")
]
guard let url = urlComponents?.url else { return } // You can print url here to see how it looks

var request = URLRequest(url: url)
request.httpMethod = "GET"
request.setValue("application/json", forHTTPHeaderField: "Accept")
request.setValue("en_US", forHTTPHeaderField: "Accept-Language")

let task = URLSession.shared.dataTask(with: request) { data, response, error in
    guard let data = data,
          let response = response as? HTTPURLResponse,
          error == nil else {
        print("error", error ?? "Unknown error")
        return
    }
    print(response)
    guard (200 ... 299) ~= response.statusCode else {
        print("response = \(response)")
        return
    }
    let responseString = String(data: data, encoding: .utf8)
    print(responseString)
}
task.resume()
The problem was the following...
request.httpMethod = "GET"
I had to change the GET to "POST" and now I have the token!!!!
I was confused because the Python script used GET, but a bash script I had that fetched the token with curl logged the request as a POST.
In short, my Swift Playground above now works by changing request.httpMethod to "POST". THANKS FOR ALL THE HELP
I'm using Python to make requests to the Pipefy GraphQL API.
I already read the documentation and searched the Pipefy forum, but
I could not figure out what is wrong with the query below:
pipeId = '171258'
query = """
{
  "query": "{allCards(pipeId: %s, first: 30, after: 'WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0'){pageInfo{endCursor hasNextPage}edges{node{id title}}}}"
}
""" % (pipeId)
The query worked pretty well until I added the after parameter.
I already tried variations like:
after: "WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0"
after: \"WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0\"
after: \n"WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0\n"
I know the issue is related to the escaping, because the API returns messages like this:
'{"errors":[{"locations":[{"column":45,"line":1}],"message":"token recognition error at: \'\'\'"},{"locations":[{"column":77,"line":1}],"message":"token recognition error at: \'\'\'"}]}\n'
(this message is returned when the request is made with after: 'WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0')
Any help here would be immensely helpful!
Thanks
I had the same problem as you today (and saw your post on Pipefy's Support page). I personally entered in contact with Pipefy's developers but they weren't helpful at all.
I solved it by escaping the query correctly.
Try like this:
query = '{"query": "{ allCards(pipeId: %s, first: 30, after: \\"WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0\\"){ pageInfo{endCursor hasNextPage } edges { node { id title } } } }"}'
Use single quotes to define the string and double backslashes before the double quotes included in the cursor.
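Alternatively, you can sidestep the manual escaping entirely by letting json.dumps build the outer JSON document; it escapes the inner double quotes for you. A sketch using the same pipeId and cursor as above:

import json

pipeId = '171258'
cursor = "WyIxLjAiLCI1ODAuMCIsMzI0OTU0NF0"
graphql = ('{ allCards(pipeId: %s, first: 30, after: "%s")'
           '{ pageInfo { endCursor hasNextPage } edges { node { id title } } } }' % (pipeId, cursor))
query = json.dumps({"query": graphql})  # the payload string to POST to the API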
With the code snippet below you are able to call the function get_card_list, passing the authentication token (as a string) and the pipe_id (as an integer), to retrieve the whole card list of your pipe.
The get_card_list function will call the function request_card_list until hasNextPage is False, updating the cursor in each call.
# assumes date_format (the timestamp format string) is defined elsewhere
import json
from datetime import datetime

import requests

# Function responsible for getting cards from a pipe using Pipefy's GraphQL API
def request_card_list(auth_token, pipe_id, hasNextPage=False, endCursor=""):
    url = "https://api.pipefy.com/graphql"
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer %s' % auth_token
    }
    if not hasNextPage:
        payload = '{"query": "{ allCards(pipeId: %i, first: 50) { edges { node { id title phases_history { phase { name } firstTimeIn lastTimeOut } } cursor } pageInfo { endCursor hasNextPage } } }"}' % pipe_id
    else:
        payload = '{"query": "{ allCards(pipeId: %i, first: 50, after: \\"%s\\") { edges { node { id title phases_history { phase { name } firstTimeIn lastTimeOut } } cursor } pageInfo { endCursor hasNextPage } } }"}' % (pipe_id, endCursor)
    response = requests.request("POST", url, data=payload, headers=headers)
    response_body = response.text
    response_body_dict = json.loads(response_body)
    response_dict_list = response_body_dict['data']['allCards']['edges']
    card_list = []
    for d in response_dict_list:
        for h in d['node']['phases_history']:
            h['firstTimeIn'] = datetime.strptime(h['firstTimeIn'], date_format)
            if h['lastTimeOut']:
                h['lastTimeOut'] = datetime.strptime(h['lastTimeOut'], date_format)
        card_list.append(d['node'])
    return_list = [card_list,
                   response_body_dict['data']['allCards']['pageInfo']['hasNextPage'],
                   response_body_dict['data']['allCards']['pageInfo']['endCursor']]
    return return_list
# Function responsible for getting all cards from a pipe using Pipefy's GraphQL API and pagination
def get_card_list(auth_token, pipe_id):
    card_list = []
    response = request_card_list(auth_token, pipe_id)
    card_list = card_list + response[0]
    while response[1]:
        response = request_card_list(auth_token, pipe_id, response[1], response[2])
        card_list = card_list + response[0]
    return card_list
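A minimal usage sketch (the token and pipe id below are placeholders, not real credentials):

# hypothetical values for illustration only
auth_token = "your_pipefy_api_token"
cards = get_card_list(auth_token, 171258)
print("retrieved %d cards" % len(cards))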
Thanks to Lodi's answer, I was able to do the next step:
how to use a variable to pass the "after" parameter to the query.
As it was quite difficult, I decided to share it here for those facing the same challenge.
end_cursor = 'WyIxLjAiLCI2NTcuMCIsNDgwNDA2OV0'
end_cursor = "\\" + "\"" + end_cursor + "\\" + "\""
# desired output: end_cursor = '\"WyIxLjAiLCI2NTcuMCIsNDgwNDA2OV0\"'

query = """
{
  "query": "{allCards(pipeId: %s, first: 50, after: %s){pageInfo{endCursor hasNextPage}edges{node{id title}}}}"
}
""" % (pipeid, end_cursor)
I'm making a script to pull data from the Google Analytics API v4. The script works fine. However, when validating the data by comparing GA with my fetched data, I can see some discrepancies. They are not too different, but I don't understand why they are not the same.
Just to mention that I'm using dynamic segments in my script, with the exact same condition as the segment I have in my GA view.
The segment just filters spam traffic by only including traffic where session duration > 1 sec.
Here is the structure I'm pulling:
body={
    "reportRequests": [
        {
            "viewId": view_id,
            "dimensions": [
                {"name": "ga:date"},
                {"name": "ga:sourceMedium"},
                {"name": "ga:campaign"},
                {"name": "ga:adContent"},
                {"name": "ga:channelGrouping"},
                {"name": "ga:segment"}
            ],
            "dateRanges": [
                {
                    "startDate": "2018-12-16",
                    "endDate": "2018-12-20"
                }
            ],
            "metrics": [{"expression": "ga:sessions", "alias": "sessions"}],
            "segments": [
                {
                    "dynamicSegment": {
                        "name": "sessions_no_spam",
                        "userSegment": {
                            "segmentFilters": [
                                {
                                    "simpleSegment": {
                                        "orFiltersForSegment": {
                                            "segmentFilterClauses": [
                                                {
                                                    "metricFilter": {
                                                        "metricName": "ga:sessionDuration",
                                                        "operator": "GREATER_THAN",
                                                        "comparisonValue": "1"
                                                    }
                                                }
                                            ]
                                        }
                                    }
                                }
                            ]
                        }
                    }
                }
            ]
        }
    ]
}).execute()
I'm not sure if the answer to my question will be more conceptual than technical, but just in case, I'm also including the function where I bulk-load the results into my database:
import psycopg2

def print_results(no_spam_traffic):
    connection = psycopg2.connect(database='web_insights_data', user='XXXX', password='XXXXX', host='XXX', port='XXXXX')
    cursor = connection.cursor()
    for report in no_spam_traffic.get('reports', []):
        for row in report.get('data', {}).get('rows', []):
            gadate = row['dimensions'][0]
            gadate = gadate[0:4] + '/' + gadate[4:6] + '/' + gadate[6:8]
            gasourcemedium = row['dimensions'][1]
            gacampaign = row['dimensions'][2]
            gaadcontent = row['dimensions'][3]
            gachannel = row['dimensions'][4]
            gasessions = row['metrics'][0]['values'][0]
            cursor.execute("SELECT * from GA_no_spam_traffic where gadate = %s AND sourcemedium = %s AND campaign = %s AND adcontent = %s", (str(gadate), str(gasourcemedium), str(gacampaign), str(gaadcontent)))
            if len(cursor.fetchall()) > 0:  # update old entries
                cursor.execute("UPDATE GA_no_spam_traffic set sessions = %s where gadate = %s AND sourcemedium = %s AND campaign = %s AND adcontent = %s", (str(gasessions), str(gadate), str(gasourcemedium), str(gacampaign), str(gaadcontent)))
                connection.commit()
            else:  # insert new rows
                cursor.execute("INSERT INTO GA_no_spam_traffic (gadate,sourcemedium,campaign,adcontent,channel,sessions) VALUES (%s,%s,%s,%s,%s,%s)", (gadate, gasourcemedium, gacampaign, gaadcontent, gachannel, gasessions))
                connection.commit()
    connection.close()
Any ideas what the issue might be?
Thanks!!
I managed to improve it, although it's still not exact. But well, it's an acceptable discrepancy. I had a problem with the page size, so I increased the pageSize parameter.
Here's the link to the pagination section of the Google guide: https://developers.google.com/analytics/devguides/reporting/core/v4/migration#pagination
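For reference, pageSize is set per reportRequest in the request body; a sketch based on the body above (the v4 default is 1,000 rows per page):

body = {
    "reportRequests": [
        {
            "viewId": view_id,
            "pageSize": 10000,  # default is 1000; a larger page avoids missing paginated rows
            # ... dimensions, dateRanges, metrics and segments as in the question ...
        }
    ]
}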
Thanks
I do not understand why I get this error. Bytes_Written is in the dataset, so why can't Python find it? I am getting this information (see dataset below) from a VM. I want to select Bytes_Written and Bytes_Read, subtract the previous value from the current value, and print a JSON object like this:
{'Bytes_Written': previousValue-currentValue, 'Bytes_Read': previousValue-currentValue}
here is what the data looks like:
{
    "Number of Devices": 2,
    "Block Devices": {
        "bdev0": {
            "Backend_Device_Path": "/dev/disk/by-path/ip-192.168.26.1:3260-iscsi-iqn.2010-10.org.openstack:volume-d1c8e7c6-8c77-444c-9a93-8b56fa1e37f2-lun-010.0.0.142",
            "Capacity": "2147483648",
            "Guest_Device_Name": "vdb",
            "IO_Operations": "97069",
            "Bytes_Written": "34410496",
            "Bytes_Read": "363172864"
        },
        "bdev1": {
            "Backend_Device_Path": "/dev/disk/by-path/ip-192.168.26.1:3260-iscsi-iqn.2010-10.org.openstack:volume-b27110f9-41ba-4bc6-b97c-b5dde23af1f9-lun-010.0.0.146",
            "Capacity": "2147483648",
            "Guest_Device_Name": "vdb",
            "IO_Operations": "93",
            "Bytes_Written": "0",
            "Bytes_Read": "380928"
        }
    }
}
This is the complete code that I am running.
import time

import requests

FIELDS = ("Bytes_Written", "Bytes_Read", "IO_Operation")

def counterVolume_one(state):
    url = 'http://url'
    r = requests.get(url)
    data = r.json()
    for field in FIELDS:
        state[field] += data[field]
    return state

state = {"Bytes_Written": 0, "Bytes_Read": 0, "IO_Operation": 0}

while True:
    counterVolume_one(state)
    time.sleep(1)
    for field in FIELDS:
        print("{field:s}: {count:d}".format(field=field, count=state[field]))
Your returned JSON structure does not have any of these FIELDS = ("Bytes_Written", "Bytes_Read", "IO_Operation") keys directly.
You'll need to modify your code slightly.
data = r.json()
# note: the sample data uses the key "IO_Operations" (plural), so FIELDS may need adjusting to match
for block_device in data['Block Devices']:
    for field in FIELDS:
        state[field] += int(data['Block Devices'][block_device][field])
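To produce the per-interval deltas the question describes, one option is to keep a snapshot of the previous totals and subtract on each cycle. A sketch, assuming the corrected accumulation above (it computes current minus previous, which is usually what's wanted for growing counters):

import json

previous = {"Bytes_Written": 0, "Bytes_Read": 0}

def report_delta(state):
    # difference between the current cumulative totals and the previous snapshot
    delta = {field: state[field] - previous[field] for field in previous}
    previous.update({field: state[field] for field in previous})
    print(json.dumps(delta))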
I'm trying to do full-text search on a MongoDB database with the Elasticsearch engine, but I ran into a problem: no matter what search term I provide (or whether I use query1 or query2), the engine always returns the same results. I think the problem is in the way I make the requests, but I don't know how to solve it.
Here is the code:
import json

import requests
from pprint import pprint

def search(search_term):
    query1 = {
        "fuzzy": {
            "art_text": {
                "value": search_term,
                "boost": 1.0,
                "min_similarity": 0.5,
                "prefix_length": 0
            }
        },
        "filter": {
            "range": {
                "published": {
                    "from": "20130409T000000",
                    "to": "20130410T235959"
                }
            }
        }
    }
    query2 = {
        "match_phrase": {"art_text": search_term}
    }
    es_query = json.dumps(query1)
    uri = 'http://localhost:9200/newsidx/_search'
    r = requests.get(uri, params=es_query)
    results = json.loads(r.text)
    data = [res['_source']['api_id'] for res in results['hits']['hits']]
    print("results: %d" % len(data))
    pprint(data)
The params parameter is not for the data being sent. If you're trying to send data to the server, you should specifically be using the data parameter. If you're trying to send query parameters, then you shouldn't JSON-encode them; just give them to params as a dict.
I suspect your first request should be the following:
r = requests.get(uri, data=es_query)
And before someone downvotes me, yes the HTTP/1.1 spec allows data to be sent with GET requests and yes requests does support it.
search = {'query': {'match': {'test_id': 13}}, 'sort': {'date_utc': {'order': 'desc'}}}
data = requests.get('http://localhost:9200/newsidx/test/_search?&pretty', params=search)
print(data.json())
http://docs.python-requests.org/en/latest/user/quickstart/