Question. I have been tasked with researching how to backfill data in Elasticsearch. So far I am coming up a bit empty. The basic gist is:
Notes: All documents are stored under daily indexes, with ~200k documents per day.
I need to be able to reindex about 60 days worth of data.
For each doc I need to take two fields, payload.time_sec and payload.time_nanosec, take their values, do some math on them (time_sec * 10**9 + time_nanosec), and then write the result back as a single new field in the reindexed document.
I am looking at the Python API documentation with bulk helpers:
http://elasticsearch-py.readthedocs.io/en/master/helpers.html
But I am wondering if this is even possible.
My thoughts were to use:
Bulk helpers to pull a scroll ID (bulk _update?), iterate over each doc ID, pull in the data from the two fields for each doc, do the math, and finish the update request with the new field data.
Has anyone done this? Maybe something with a Groovy script?
Thanks!
Bulk helpers to pull a scroll ID (bulk _update?), iterate over each doc ID, pull in the data from the two fields for each doc, do the math, and finish the update request with the new field data.
Basically, yes:
use /_search?scroll to fetch the docs
perform your operation
send /_bulk update requests
Other options are:
use the /_reindex API (probably not so good if you don't want to create a new index)
use the /_update_by_query API
Both support scripting which, if I understood it correctly, would be the perfect choice, because your update does not depend on external factors, so it could just as well be done directly on the server.
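For illustration, a rough sketch of the _update_by_query route using the Python client and a Painless script (the index name is a placeholder, and on older Elasticsearch versions the script field may be named inline instead of source):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Compute payload.duration = payload.time_sec * 10**9 + payload.time_nanosec
# in place for every document in one daily index (index name assumed).
es.update_by_query(
    index="logs-2017.01.01",
    body={
        "query": {"match_all": {}},
        "script": {
            "lang": "painless",
            "source": "ctx._source.payload.duration = "
                      "ctx._source.payload.time_sec * 1000000000L + "
                      "ctx._source.payload.time_nanosec"
        }
    }
)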
Here is where I am at (roughly):
I've been working with Python and the bulk helpers, and so far I am around here:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

docs = helpers.scan(
    es,
    query={"query": {"match_all": {}}},
    index=INDEX,
    size=1000,
    scroll='5m',
    raise_on_error=False,
)

new_index_data = []
count = 0
for x in docs:
    x['_index'] = NEW_INDEX
    try:
        time_sec = x['_source']['payload']['time_sec']
        time_nanosec = x['_source']['payload']['time_nanosec']
    except KeyError:
        continue  # skip docs that are missing either field
    x['_source']['payload']['duration'] = time_sec * 10**9 + time_nanosec
    count += 1
    new_index_data.append(x)

helpers.bulk(es, new_index_data)
From here I am just using the Python bulk helper to insert into a new index. However, I will experiment with changing this to bulk updates against the existing index.
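A rough sketch (my untested assumption) of what that bulk-update variant could look like; helpers.bulk accepts actions with _op_type set to "update" and a partial doc, and the scan has to be re-run because the generator above is consumed by the first loop:

update_actions = []
for x in helpers.scan(es, query={"query": {"match_all": {}}},
                      index=INDEX, size=1000, scroll='5m'):
    payload = x['_source'].get('payload', {})
    if 'time_sec' not in payload or 'time_nanosec' not in payload:
        continue
    update_actions.append({
        "_op_type": "update",
        "_index": x['_index'],
        "_type": x['_type'],
        "_id": x['_id'],
        # partial doc: Elasticsearch merges this into the existing payload object
        "doc": {"payload": {"duration": payload['time_sec'] * 10**9
                                        + payload['time_nanosec']}},
    })

helpers.bulk(es, update_actions)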
Does this look like the right approach?
Right now I have a problem with making a find query return results faster, and I don't know exactly how to do it. I'm fairly new to both MongoDB and Python, so please bear with me. I have a MongoDB collection with 250,000 objects. Each one is fairly nested. They look something like this:
_id: ObjectId
shopper: Shopperid
data: {
    report: {
        items: {
            accounts: {
                account_1: {
                    mask:
                    type:
                    subtype:
                    inst_id:
                    historical_balances: {
                        object_1
                        object_2
                        ...
                    }
                }
                account_2: {
                    mask:
                    type:
                    subtype:
                    inst_id:
                    historical_balances: {
                        object_1
                        object_2
                        ...
                    }
                }
                ...
            }
        }
    }
}
I need to get that data, plus the sum of the historical balances for each account on each object. Right now it is taking forever and I don't know exactly what to do. I am trying to download the data from MongoDB locally, but every time my internet connection drops everything is lost, and it takes more than 10 hours just to get through half of it. I tried using list(find(query)) so I could deal with it later, but I don't have enough RAM. Right now what I am doing is:
e = []
for k in cursor:
    item = k['data']['report']['items'][0]
    for account in item['accounts']:
        e.append([
            k['_id'], k['shopper'],
            item['institution_id'], item['institution_name'],
            account['mask'], account['type'], account['subtype'],
            len(account['transactions']),
        ])
In summary: each object has a data field, which has a report field, which has an items field, which has an accounts field; each accounts field holds multiple account objects that have a mask, type, subtype, and historical balances. I need all of this data plus a sum of the historical balances for each account.
Right now I am using the code above to get the data, put it into a list, and then turn it into a pandas DataFrame so I can save it as a CSV file, which is what I need. I know it isn't the prettiest code, but it was the first idea I came up with. Any ideas on how I can improve this performance, which is really too slow for my needs?
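One idea worth trying (a sketch under assumptions, not from this thread): push the per-account summing and the projection of the few needed fields to the server with an aggregation pipeline, so far less data crosses the flaky connection. The db/collection names and the amount field inside each historical balance object are guesses based on the structure above:

from pymongo import MongoClient

client = MongoClient()                     # connection details assumed
coll = client['mydb']['mycollection']      # db/collection names assumed

# Unwind each item and account, let the server compute the per-account sum,
# and return only the small fields needed for the CSV.
pipeline = [
    {"$unwind": "$data.report.items"},
    {"$unwind": "$data.report.items.accounts"},
    {"$project": {
        "shopper": 1,
        "institution_id": "$data.report.items.institution_id",
        "institution_name": "$data.report.items.institution_name",
        "mask": "$data.report.items.accounts.mask",
        "type": "$data.report.items.accounts.type",
        "subtype": "$data.report.items.accounts.subtype",
        # 'amount' is a guessed field name inside each historical balance object
        "balance_sum": {"$sum": "$data.report.items.accounts.historical_balances.amount"},
    }},
]

rows = list(coll.aggregate(pipeline, allowDiskUse=True))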
Against my Cloud Bigtable table I have millions of requests per second. For each request I get a unique row key, and then I need to modify that row with an atomic mutation.
When I filter by column to get the key, will it be atomic for each request?
col1_filter = row_filters.ColumnQualifierRegexFilter(b'customerId')
label1_filter = row_filters.ValueRegexFilter('')
chain1 = row_filters.RowFilterChain(filters=[col1_filter, label1_filter])
partial_rows = table.read_rows(filter_=chain1)
for data in partial_rows:
    row_cond = table.row(data.row_key)
    row_cond.set_cell(u'data', b'customerId', b'value', state=True)
    row_cond.commit()
CheckAndMutateRow operations are atomic, BUT it is check and mutate row, not rows. So the way you have this set up won't create an atomic operation.
You need to create a conditional row object using the row key and your filter, supply the modification, then commit. Like so:
col1_filter = row_filters.ColumnQualifierRegexFilter(b'customerId')
label1_filter = row_filters.ValueRegexFilter('')
chain1 = row_filters.RowFilterChain(filters=[col1_filter, label1_filter])
partial_rows = table.read_rows()
for data in partial_rows:
    # Build a ConditionalRow keyed by the row key, with the filter attached
    row_cond = table.row(data.row_key, filter_=chain1)  # Use filter here
    # state=True applies the mutation only when the filter matches
    row_cond.set_cell(u'data', b'customerId', b'value', state=True)
    row_cond.commit()
This means you do a full table scan and apply the filter to each row individually. Since your original code was already applying that filter during a full scan, there shouldn't be a performance difference. As a Cloud Bigtable best practice, though, you want to avoid full table scans. If this is a one-time program you need to run, that is fine; otherwise you may want to figure out a different approach if you're going to be doing this regularly.
Note that we are updating the API to provide more clarity on the different kinds of mutations.
I have a MongoDB database that contains a number of tweets. I want my API to return, as a JSON list, all the tweets that contain a number of hashtags greater than the value specified by the user in the URL (e.g. http://localhost:5000/tweets?morethan=5, which is 5 in this case).
The hashtags are contained inside the entities field of each document, along with other fields such as user_mentions, urls, symbols and media. Here is the code I've written so far, but it doesn't return anything.
#!flask/bin/python
from flask import Flask, request
from pymongo import MongoClient

client = MongoClient()  # connection details were not shown in the post
app = Flask(__name__)

@app.route('/tweets', methods=['GET'])
def get_tweets():
    # Connect to database and pull back collections
    db = client['mongo']
    collection = db['collection']
    parameter = request.args.get('morethan')
    if parameter:
        gt_parameter = int(parameter) + 1  # question said greater than, not greater or equal
        key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
        cursor = collection.find({key_im_looking_for: {"$exists": True}})
        # ... the rest of the handler (building the JSON response) is not shown in the post
EDIT: IT WORKS!
The code in question is this line
cursor = collection.find({"entities": {"hashtags": parameter}})
This answer explains why it is impossible to directly perform what you ask.
mongodb query: $size with $gt returns always 0
That answer also describes potential (but poor) ideas to get around it.
The best suggestion is to modify all your documents to add a "num_hashtags" key somewhere, index it, and query against it.
Using the Twitter JSON API you could update all your documents and put the num_hashtags key in the entities document.
Alternatively, you could solve your immediate problem with a very slow full collection scan on every query, checking via MongoDB dot notation whether an array element exists at the index one greater than your parameter.
gt_parameter = int(parameter) + 1 # question said greater than not greater or equal
key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
# py2.7 => key_im_looking_for = "entities.hashtags.%s" %(gt_parameter)
# in this example it would be "entities.hashtags.6"
cursor = collection.find({key_im_looking_for: {"$exists": True}})
The best answer (and the key reason to use a NoSQL database in the first place) is that you should modify your data to suit your retrieval. If possible, you should perform an in-place update adding the num_hashtags key.
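A minimal sketch of that in-place update and the resulting query (assumes pymongo, a top-level num_hashtags key, and MongoDB 4.2+ for pipeline-style updates; on older servers you would loop over the documents and $set the count yourself):

from pymongo import MongoClient

client = MongoClient()
collection = client['mongo']['collection']   # names taken from the question's code

# One-off backfill: store the hashtag count on every document.
# $ifNull guards documents that have no entities.hashtags array.
collection.update_many(
    {},
    [{"$set": {"num_hashtags": {"$size": {"$ifNull": ["$entities.hashtags", []]}}}}]
)

# Index it so the API query is cheap
collection.create_index("num_hashtags")

# Then the endpoint can simply do:
gt = 5   # value parsed from ?morethan=
cursor = collection.find({"num_hashtags": {"$gt": gt}})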
I would like to periodically update the data in Elasticsearch.
In the file I send in for the update, there may be data that already exist in Elasticsearch (to be updated) and data that are new docs (to be inserted).
Since the data in Elasticsearch is managed with auto-created IDs, I have to look up the ID by a unique column "code" to check whether a doc already exists; if it exists, update it, otherwise insert it.
I wonder if there is any method that is faster than the code I have in mind below.
from elasticsearch import Elasticsearch

es = Elasticsearch()

# get the doc ID by searching for an exact match on "code" to check if the doc exists
res = es.search(index=index_name, doc_type=doc_type, body=body_for_search)
hits = res['hits']['hits']

if hits:
    # if a doc with this code exists, update it by ID
    es.update(index=index_name, doc_type=doc_type, id=hits[0]['_id'],
              body={'doc': body})
else:
    # else insert it with an auto-created ID
    es.index(index=index_name, doc_type=doc_type, body=body)
For example, is there a method where Elasticsearch searches for the exact match on col["code"] for you, so you can simply "upsert" the data without specifying an ID?
Any advice would be much appreciated; thank you for reading.
PS: if we made id = col["code"] it would be much simpler and faster, but for management reasons we can't do that at the current stage.
As @Archit said, use your own ID to look up documents faster.
Use the upsert API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
Be sure your ID structure respects Lucene good practice:
If you are using your own ID, try to pick an ID that is friendly to Lucene. Examples include zero-padded sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential patterns that compress well. In contrast, IDs such as UUID-4 are essentially random and offer poor compression and slow down Lucene.
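For illustration, a minimal sketch of the upsert call once you have such an ID (here I assume, contrary to the PS, that an ID derived from col["code"] is available; doc_as_upsert tells Elasticsearch to index the doc if it does not exist yet):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Single round trip: update if the ID exists, otherwise insert the doc as-is.
es.update(
    index=index_name,
    doc_type=doc_type,
    id=doc_id,                      # e.g. a Lucene-friendly ID derived from col["code"]
    body={
        "doc": body,                # the fields to merge on update
        "doc_as_upsert": True       # insert `body` as a new doc if the ID is missing
    }
)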
I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble when working with result sets that large.
But using py-postgresql and the .prepare() statement, I had hoped I could fetch entries on a "yield" basis and thus avoid filling up my memory with the results from the database, which apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'

db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")

uniqueue_days = []
with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])

print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering results() probably fetches all rows before looping through them?
Is there a way to get the postgresql library to "page" or batch the results, say 60k per round, or perhaps rework the query so the database does more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format before adding them to the uniqueue_days list.
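For reference, a minimal sketch of that conversion (assuming the timestamps are in seconds and row['time'] comes from the loop above):

from datetime import datetime, timezone

# Unix timestamp (seconds) -> 'YYYY-MM-DD' string
day = datetime.fromtimestamp(row['time'], tz=timezone.utc).strftime('%Y-%m-%d')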
If you were using the better-supported psycopg2 extension, you could use a server-side (named) cursor and loop over it, or call fetchone, to get just a few rows at a time, as psycopg2 backs a named cursor with a server-side portal.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so making the switch will take a few more changes. I still recommend doing it.
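A minimal sketch of the psycopg2 approach with a named (server-side) cursor, fetching rows in batches; the connection details and the 60k batch size are taken from the question, everything else is an assumption:

import psycopg2

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb',
                        user='test', password='test')

uniqueue_days = set()
with conn:
    # A named cursor is backed by a server-side portal, so rows are
    # streamed in batches of itersize instead of loaded all at once.
    with conn.cursor(name='time_scan') as cur:
        cur.itersize = 60000
        cur.execute("SELECT time FROM mytable")
        for (time_value,) in cur:
            uniqueue_days.add(time_value)

print(uniqueue_days)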
You could let the database do all the heavy lifting.
For example: instead of reading all the data into Python and then calculating the unique dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce sort order on unique_dates returned then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line:
Ex:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j]
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert the dates back into Unix timestamps.
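A minimal sketch of that chunked read with psycopg2, converting the dates back to Unix timestamps and passing them as placeholders rather than concatenating strings (the table name and connection details come from the question; the helper names and example dates are mine):

from datetime import datetime, timezone

import psycopg2

def day_to_unix(day_str):
    # 'YYYY-MM-DD' -> Unix timestamp (seconds) at midnight UTC
    return datetime.strptime(day_str, '%Y-%m-%d').replace(tzinfo=timezone.utc).timestamp()

def fetch_chunk(conn, start_day, end_day):
    # Placeholders instead of string concatenation, so values are escaped safely
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM mytable WHERE time BETWEEN %s AND %s",
                    (day_to_unix(start_day), day_to_unix(end_day)))
        return cur.fetchall()

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb',
                        user='test', password='test')
rows = fetch_chunk(conn, '2014-01-01', '2014-01-02')  # e.g. two values from UNIQUE_DATES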