I have millions of articles in Bigtable, and to scan 50,000 articles I have used something like:
for key, data in mytable.scan(limit=50000):
    print(key, data)
It works fine for limits up to 10,000, but as soon as I exceed 10,000 I get this error:
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED)
A fix was made for this problem so that the client automatically retries temporary failures like this one. That fix has not been released yet, but will hopefully be released soon.
I had a similar problem, where I had to retrieve data from many rows at the same time. What you're using looks like the HBase client; my solution uses the native one, so I'll try to post both: one which I tested and one which might work.
I never found an example demonstrating how to simply iterate over the rows as they come using the consume_next() method, which is mentioned here, and I didn't manage to figure it out on my own. Calling consume_all() for too many rows yielded the same DEADLINE_EXCEEDED error.
LIMIT = 10000
start_key = None          # or whatever row key you want to start from
previous_start_key = ""   # sentinel so the loop runs at least once
while start_key != previous_start_key:
    previous_start_key = start_key
    row_iterator = bt_table.read_rows(start_key=start_key, end_key=end_key,
                                      filter_=filter_, limit=LIMIT)
    row_iterator.consume_all()
    for _row_key, row in row_iterator.rows.items():
        row_key = _row_key.decode()
        if row_key == previous_start_key:  # Avoid repeated processing
            continue
        # do stuff
        print(row)
        start_key = row_key
So basically you can start with whatever start_key you like, retrieve 10k results, call consume_all(), then retrieve the next batch starting where you left off, and so on, until some reasonable stopping condition applies.
For you it might be something like:
row_start = None
for i in range(5):
    for key, data in mytable.scan(row_start=row_start, limit=10000):
        if key == row_start:  # Avoid repeated processing
            continue
        print(key, data)
        row_start = key
There might be a much better solution, and I really want to know what it is, but this works for me for the time being.
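If you prefer to keep the batching out of your processing code, the same idea can be wrapped in a generator. This is only a sketch built on the HBase-style scan() call from the question; the scan_in_batches name and the batch_size/max_batches parameters are my own, not part of any released client:

def scan_in_batches(table, batch_size=10000, max_batches=None):
    """Yield (key, data) pairs, restarting the scan after each batch
    so no single request runs long enough to hit DEADLINE_EXCEEDED."""
    row_start = None
    batches = 0
    while max_batches is None or batches < max_batches:
        last_key = None
        for key, data in table.scan(row_start=row_start, limit=batch_size):
            if key == row_start:  # first row of a restarted scan was already yielded
                continue
            yield key, data
            last_key = key
        if last_key is None:      # no new rows came back, so we are done
            break
        row_start = last_key
        batches += 1

# Usage (hypothetical):
# for key, data in scan_in_batches(mytable, batch_size=10000):
#     print(key, data)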
I'm reading data from a JSON file using requests and wish to store those values in an Excel file. The issue I'm running into is that I don't know the interval at which this JSON file is being updated, so it's hard to make my scheduler perfect, and I end up with duplicate values. I can't seem to find any library that could check this for me.
In this scenario, I end up with a duplicate at 1:10 and wish to not add it, since it is the same as the one at 1:05. In some instances there may be more than one duplicate at a time, and I want to check this before the row even gets added, rather than adding it and then deleting it.
Is there a simpler way to approach this problem?
Example:
timestamp  stock  buy  sell
1:05       1      10   10
1:10       1      10   10
1:15       1      20   16
import requests

def do_this():
    previous_data = []
    try:
        response = requests.get(JSON_LINK)
        previous_data = response.content
    except Exception:
        print('Failed in main')
    if previous_data == response.content:
        print('same data')
    else:
        print('new data')
I realized this isn't going to work, as previous_data = [] will always end up taking in whatever the current response is, not the previous response data.
You could try searching the existing data for the same record you have. If the record already exists, treat it as an update (or skip it); if not, it's a new record to insert.
You could also cache the previous request. When a new request comes in, compare it to the cache and drop duplicates. This approach only works if you only care about the previous and the next request.
"I don't know the interval at which this JSON file is being updated": you can record the time at which each request was made and store that time as 'latest'. When another request comes in, check its time value; if it's older than 'latest', throw it away.
Consider that the JSON can contain the same values at different times.
There are broadly two ways to check objects for equality: one is character-by-character comparison, and the other is matching hash values.
So one approach is to keep a hash set. Whenever a new response comes in, compute its hash and check whether that hash already exists. If it exists, simply discard the response; if it doesn't, store the hash value in the hash set and the response values in your Excel file.
NOTE: Hash matching takes less time than string matching.
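A minimal sketch of that idea in Python, assuming each row arrives as a parsed JSON record with the fields from the example table (the seen_hashes set, the is_new_record function, and the append_to_excel helper are illustrative names, not part of the original code):

import hashlib
import json

seen_hashes = set()  # hashes of rows already written to the spreadsheet

def is_new_record(record):
    """Return True the first time a set of values is seen, False for duplicates."""
    # Hash only the fields that define "sameness"; the timestamp is excluded,
    # since the duplicates in the example differ only by timestamp.
    key_fields = {k: record[k] for k in ('stock', 'buy', 'sell')}
    payload = json.dumps(key_fields, sort_keys=True).encode('utf-8')
    digest = hashlib.sha256(payload).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

# for record in response.json():
#     if is_new_record(record):
#         append_to_excel(record)   # hypothetical helper that writes the row

Depending on your data, you may prefer to compare only against the most recent row instead of the whole history, so that a genuinely new row with the same prices is not dropped.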
I am trying to get all the data from a view (Lotus Notes) with LotusScript and Python (the noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways of looping through all documents:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1,000 lines of data took 70 seconds, but the view has about 85,000 lines, so getting all the data this way will take far too long, especially since exporting manually via File->Export in Lotus Notes takes only about 2 minutes for the whole view.
I also tried a second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.Getfirstentry()
while ent:
    row = []
    for v in ent.Columnvalues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything I found on the Internet is based on "GetNextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues   '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using GetFirstDocument and GetNextDocument. Using the NotesViewNavigator mentioned in the comments will be slightly better, but not significantly so.
You might get a little more performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
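In the Python/noteslib code from the question, that would be a one-line change before the loop (a sketch; I'm assuming the AutoUpdate property can be assigned the same way through the COM wrapper as in LotusScript):

view = db.GetView('My View')
view.AutoUpdate = False   # stop the view from refreshing while we walk it
doc = view.GetFirstDocument()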
My suggestion: Identify the REAL bottleneck of your code by commenting out single sections to find out when it starts to get slower:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested in your results showing where it starts to become slow.
I suspect the performance issue comes from using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider writing a Notes 'agent' to do this for you instead (in LotusScript or Java, perhaps). Even a basic LotusScript agent can export thousands of documents per minute. A further alternative is to look at the Notes C API (not an easy option, and it requires API calls from Python).
I'm writing a little python app to fetch the followers of a given tumblr, and I think I may have found a bug in the paging logic.
The tumblr I am testing with has 593 followers, and I know the API limits each call to a block of 20. After successful authentication, the fetch logic looks like this:
offset = 0
while True:
    response = client.followers(blog, limit=20, offset=offset)
    bunch = len(response["users"])
    if bunch == 0:
        break
    j = 0
    while j < bunch:
        print response["users"][j]["name"]
        j = j + 1
    offset += bunch
What I observe is that on the third call into the API, with offset=40, the first name returned in the list is one I saw in the previous group; it's actually the 38th name. This behavior (seeing one or more names I've seen before) repeats randomly from that point on, though not on every call to the API; some calls give me a fresh 20 names. It's repeatable across multiple test runs. The sequence I see them in is the same as on Tumblr's site, I just see many of them twice.
An interesting coincidence is that the total number of non-unique followers returned is the same as what the "Followers" count indicates on the blog itself (593), but only 516 of them are unique.
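This is roughly how the unique count can be verified on the client side, reusing the same paging loop as above (the seen and total names are my own, not part of PyTumblr):

seen = set()
total = 0
offset = 0
while True:
    response = client.followers(blog, limit=20, offset=offset)
    users = response["users"]
    if not users:
        break
    for user in users:
        total += 1
        seen.add(user["name"])   # duplicates collapse in the set
    offset += len(users)
print("%d returned, %d unique" % (total, len(seen)))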
For what it's worth, running the query on Tumblr's console page returns the same results regardless of the language I choose, so I'm not inclined to think this is a bug in the PyTumblr client, but something lower, at the API level.
Any ideas?
So I am working on a project in which I have to read a large database (large for me, anyway) of 10 million records. I cannot really filter them, because I have to treat them all individually. For each record I must apply a formula and then write the result into multiple files, depending on certain conditions of the record.
I have implemented a few algorithms, and finishing the whole processing takes around 2-3 days. This is a problem because I am trying to optimise a process that already takes this long; 1 day would be acceptable.
So far I have tried indexes on the database and threading (of the processing of each record, not of the I/O operations). I cannot get a shorter time.
I am using Django, and I fail to measure how long it really takes to start treating the data because of its lazy behaviour. I would also like to know if I can start treating the data as soon as I receive it, rather than waiting for all the data to be loaded into memory before I can actually process it. It could also be my understanding of write operations in Python. Lastly, it could be that I need a better machine (I doubt it; I have 4 cores and 4 GB of RAM, and it should be able to give better speeds).
Any ideas? I really appreciate the feedback. :)
Edit: Code
Explanation:
The records I talked about are IDs of customers (passports), and the conditions are whether there are agreements between the different terminals of the company (countries). The process is a hashing.
The first strategy tries to treat the whole database... At the beginning there is some preparation for the condition part of the algorithm (the agreements between countries), then a large verification of membership in a set.
Since I've been trying to improve it on my own, for the second strategy I tried to cut the problem into parts, treating the query by parts (obtaining the records that belong to a country and writing to the files of the countries that have an agreement with it).
The threaded strategy is not shown, because it was designed for a single country and I got awful results compared with the unthreaded version. I honestly have the intuition that it is a matter of memory and SQL.
def create_all_files(strategy=0):
    if strategy == 0:
        set_countries_agreements = set()
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        for each_country in set_countries:
            set_agreements = frozenset(get_agreements(each_country))
            set_countries_agreements.add(set_agreements)
        print("All agreements obtained")
        set_passports = Passport.objects.all()
        print("All passports obtained")
        for each_passport in set_passports:
            for each_agreement in set_countries_agreements:
                for each_country in each_agreement:
                    if each_passport.nationality == each_country:
                        with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % iter(each_agreement).next()), "a") as f:
                            f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                        print(".")
                print("_")
            print("-")
        print("~")

    if strategy == 1:
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        while len(set_countries) != 0:
            country = set_countries.pop()
            list_countries = get_agreements(country)
            list_passports = Passport.objects.filter(nationality=country)
            for each_passport in list_passports:
                for each_country in list_countries:
                    with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % each_country), "a") as f:
                        f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                    print("r")
                print("c")
            print("p")
        print("P")
What you are describing in your question is an ETL process. I suggest you use an ETL tool.
To mention a Python ETL tool, I can point to Pygrametl, written by Christian Thomsen; in my opinion it runs nicely and its performance is impressive. Test it and come back with results.
I can't post this answer without mentioning MapReduce. This programming model can fit your requirements if you are planning to distribute the task across nodes.
It looks like you have a file for each country that you append hashes to. Instead of opening and closing handles to these files 10 million+ times, you should open each one once and close them all at the end.
countries = {}  # country -> file handle
with open(os.path.join(PROJECT_ROOT, 'list_countries')) as country_file:
    for line in country_file:
        country = line.strip()
        countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")

for country in countries:
    agreements = get_agreements(country)
    for passport in Passport.objects.filter(nationality=country):
        for agreement in agreements:
            countries[agreement].write(generate_hash(passport.nationality + "<" + passport.id_passport, agreement) + "\n")

for country, file in countries.items():
    file.close()
I don't know how big a list of Passport objects Passport.objects.filter(nationality=country) will return; if it is massive and memory is an issue, you will have to start thinking about chunking/paginating the query using limits.
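One possible way to chunk that query in Django, assuming a reasonably recent Django version where QuerySet.iterator() accepts a chunk_size argument (the 2000 value and the handle() helper are illustrative only):

# Stream rows from the database in chunks instead of loading the full
# queryset into memory at once.
for passport in Passport.objects.filter(nationality=country).iterator(chunk_size=2000):
    handle(passport)  # hypothetical per-record processing from the code above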
You are using sets for your list of countries and their agreements. If that is because your file containing the list of countries is not guaranteed to be unique, the dictionary solution may error when you attempt to open another handle to the same file. This can be avoided by adding a simple check to see if the country is already a member of countries.
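The check could be as simple as this (a sketch using the same countries dict as above):

for line in country_file:
    country = line.strip()
    if country in countries:  # skip duplicate entries in list_countries
        continue
    countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")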
Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40 kB, and they are a large (ha) source of the problem. I've included some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""
    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']

    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)
    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn
def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks
    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgable sages?
Database files of GB size are never loaded into memory in their entirety. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages of 1 kB size:
conn.execute("""PRAGMA cache_size = 4000""")
The connection also keeps a cache for the last 100 statements, as you can see in the function description:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
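Both knobs together look roughly like this (a sketch; the 4000-page and 200-statement values are arbitrary examples, and the 1 kB page size assumes the database was created with the default page size):

import sqlite3

# Keep more prepared statements around than the default 100.
conn = sqlite3.connect('/some/file/path.db', cached_statements=200)

# Cache 4000 pages (~4 MB at a 1 kB page size) instead of the default 2000.
conn.execute("PRAGMA cache_size = 4000")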
Apart from setting up the cache size, it is not likely that you will benefit from actively caching statements or pages at application start.