The fields that I have in MongoDB are:
id, website_url, status.
I need to find the website_url and update its status to 3 and add a new field called err_desc.
I have a list of website_urls, their statuses, and their err_desc values.
Below is my code.
from pymongo import MongoClient
from urlparse import urlparse

client = MongoClient('localhost', 9000)
db1 = client['Company_Website_Crawl']
collection1 = db1['All']
posts1 = collection1.posts
bulk = posts1.initialize_ordered_bulk_op()
website_url = ["http://www.example.com", "http://example2.com/"]
err_desc = ["error1", "error2"]
for i in website_url:
    parsed_uri = urlparse(i)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    final_url = domain
    final_url_strip = domain.rstrip("/")
    print i, final_url, final_url_strip, "\n"
    try:
        k = bulk.find({'website_url': i}).upsert().update({'$push': {'err_desc': err_desc, 'status': 3}})
        k = bulk.execute()
        print k
    except Exception as e:
        print "fail"
        print e
Error:
fail batch op errors occurred
fail Bulk operations can only be executed once.
Initially, I used:
k = posts1.update({'website_url':final_url_strip},{'$set':{'err_desc':err_desc,'status':3}},multi=True)
It was too slow for 5M records, so I wanted to use the bulk update option. Kindly help me to use bulk upsert for this scenario.
The error message is telling you that you need to re-initialize the bulk operation after calling execute(). But the thing is, you are doing it wrong: in your case, you need to call execute() once, at the end of the for loop, like this:
from itertools import count

ct = count(1)  # start at 1 so val ends up equal to the number of queued operations
val = 0
for url in website_url:
    ...
    try:
        # queue the operation; note $set, since status and err_desc are plain fields to be set
        bulk.find({'website_url': url}).upsert().update({'$set': {'err_desc': err_desc, 'status': 3}})
        val = next(ct)
    except Exception as e:
        ...
if val > 0:
    bulk.execute()
Also note that the Bulk() API is now deprecated and has been replaced by bulk_write().
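A minimal sketch of the equivalent upsert with bulk_write(), assuming the same parallel website_url and err_desc lists from the question (and using $set, since the goal is to set status to 3 and add an err_desc field):

from pymongo import MongoClient, UpdateOne

client = MongoClient('localhost', 9000)
posts1 = client['Company_Website_Crawl']['All'].posts
requests = [
    UpdateOne({'website_url': url},
              {'$set': {'err_desc': desc, 'status': 3}},
              upsert=True)
    for url, desc in zip(website_url, err_desc)
]
result = posts1.bulk_write(requests, ordered=True)  # the driver batches the requests into as few round trips as possible
print(result.bulk_api_result)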
I have a Python script that correctly updates rows in an Oracle SQL table; however, I am using cursor.execute with try/except, so if one update fails, it kills the whole run.
I want it to run through the whole update, just log any error, and move on to the next row, which is where cursor.executemany comes in.
https://cx-oracle.readthedocs.io/en/latest/user_guide/batch_statement.html
Here is the script. It works great, except for the all-or-nothing error approach.
import datetime
import os

import cx_Oracle

# df (the pandas dataframe), config, and log are defined elsewhere in the script

#oracle sql update statement for SHRTCKN
banner_shrtckn_update = """
UPDATE SATURN.SHRTCKN A
SET A.SHRTCKN_COURSE_COMMENT = :course_comment,
    A.SHRTCKN_REPEAT_COURSE_IND = :repeat_ind,
    A.SHRTCKN_ACTIVITY_DATE = SYSDATE,
    A.SHRTCKN_USER_ID = 'STU00940',
    A.SHRTCKN_DATA_ORIGIN = 'APPWORX'
WHERE A.SHRTCKN_PIDM = gb_common.f_get_pidm(:id) AND
      A.SHRTCKN_TERM_CODE = :term_code AND
      A.SHRTCKN_SEQ_NO = :seqno AND
      A.SHRTCKN_CRN = :crn AND
      A.SHRTCKN_SUBJ_CODE = :subj_code AND
      A.SHRTCKN_CRSE_NUMB = :crse_numb
"""

def main():
    # get time of run and current year
    now = datetime.datetime.now()
    year = str(now.year)+"40"
    # configure connection strings for banner PROD
    db_pass = os.environ['DB_PASSWORD']
    dsn = cx_Oracle.makedsn(host='FAKE', port='1521', service_name='TEST.FAKE.BLAH')
    try: # initiate banner db connection -- PROD
        banner_cnxn = cx_Oracle.connect(user=config.db_test['user'], password=db_pass, dsn=dsn)
        writeLog("---- Oracle Connection Made ----")
        insertCount = 0
        for index, row in df.iterrows():
            shrtcknupdate(row, banner_cnxn)
            insertCount = insertCount + 1
        banner_cnxn.commit()
        banner_cnxn.close()
        writeLog(str(insertCount)+" rows updated")
    except Exception as e:
        print("Error: "+str(e))
        writeLog("Error: "+str(e))

def writeLog(content):
    print(content)
    log.write(str(datetime.date.today())+" "+content+"\n")

#define the variable connection between panda/csv and SHRTCKN table
def shrtcknupdate(row, connection):
    sql = banner_shrtckn_update
    variables = {
        'id': row.Bear_Nbr,
        'term_code': row.Past_Term,
        'seqno': row.Seq_No,
        'crn': row.Past_CRN,
        'subj_code': row.Past_Prefix,
        'crse_numb': row.Past_Number,
        'course_comment': row.Past_Course_Comment,
        'repeat_ind': row.Past_Repeat_Ind
    }
    cursor = connection.cursor()
    cursor.execute(sql, variables)

if __name__ == "__main__":
    writeLog("-------- Process Start --------")
    main()
    writeLog("-------- Process End --------")
With the executemany option, I can turn on batcherrors=True and it will do exactly what I need.
The problem I am running into: if I get rid of the for loop that runs through the pandas dataframe (which is not needed when doing the update in batch), how do I attach the column headers to the SQL update statement?
If I leave the for loop in, I get this error when using executemany:
Error: parameters should be a list of sequences/dictionaries or an integer specifying the number of times to execute the statement
For named binds, you need to provide a list of dictionaries. This list can be obtained by calling to_dict(orient='records'):
‘records’ : list like [{column -> value}, … , {column -> value}]
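For illustration, a tiny frame with two of your columns produces one dictionary per row (hypothetical values):

import pandas as pd

df = pd.DataFrame({'Bear_Nbr': ['123', '456'], 'Past_Term': ['202010', '202040']})
print(df.to_dict('records'))
# [{'Bear_Nbr': '123', 'Past_Term': '202010'}, {'Bear_Nbr': '456', 'Past_Term': '202040'}]

Applied to your script, the batch update then looks like this: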
banner_shrtckn_update = """
UPDATE SATURN.SHRTCKN A
SET A.SHRTCKN_COURSE_COMMENT = :Past_Course_Comment,
    A.SHRTCKN_REPEAT_COURSE_IND = :Past_Repeat_Ind,
    A.SHRTCKN_ACTIVITY_DATE = SYSDATE,
    A.SHRTCKN_USER_ID = 'STU00940',
    A.SHRTCKN_DATA_ORIGIN = 'APPWORX'
WHERE A.SHRTCKN_PIDM = gb_common.f_get_pidm(:Bear_Nbr) AND
      A.SHRTCKN_TERM_CODE = :Past_Term AND
      A.SHRTCKN_SEQ_NO = :Seq_No AND
      A.SHRTCKN_CRN = :Past_CRN AND
      A.SHRTCKN_SUBJ_CODE = :Past_Prefix AND
      A.SHRTCKN_CRSE_NUMB = :Past_Number
"""

def main():
    # get time of run and current year
    now = datetime.datetime.now()
    year = str(now.year)+"40"
    # configure connection strings for banner PROD
    db_pass = os.environ['DB_PASSWORD']
    dsn = cx_Oracle.makedsn(host='FAKE', port='1521', service_name='TEST.FAKE.BLAH')
    try: # initiate banner db connection -- PROD
        banner_cnxn = cx_Oracle.connect(user=config.db_test['user'], password=db_pass, dsn=dsn)
        writeLog("---- Oracle Connection Made ----")
        # batch execute banner_shrtckn_update
        cursor = banner_cnxn.cursor()
        data = df[['Bear_Nbr', 'Past_Term', 'Seq_No', 'Past_CRN', 'Past_Prefix', 'Past_Number', 'Past_Course_Comment', 'Past_Repeat_Ind']].to_dict('records')
        cursor.executemany(banner_shrtckn_update, data, batcherrors=True)
        for error in cursor.getbatcherrors():
            writeLog(f"Error {error.message} at row offset {error.offset}")
        banner_cnxn.commit()
        banner_cnxn.close()
    except Exception as e:
        print("Error: "+str(e))
        writeLog("Error: "+str(e))
This isn't described in detail in the documentation, but you can find an example for named binds in python-cx_Oracle/samples/bind_insert.py.
(Please note that I adjusted the bind variable names in your SQL statement to the dataframe column names, to avoid renaming the columns when creating data.)
I'm using the scan API to query Elasticsearch and return an unlimited number of results (as opposed to the 10k limit). This returns a generator object, which I then need to turn into a dictionary (because the results will then be turned into a pandas data frame).
I attempted logs = {c.name: c.value for c in logs} but it came back empty. I know the results themselves came back because they printed when I tried for i in logs: print(i). How can I turn my results into a dictionary?
Code
import elasticsearch.helpers
from elasticsearch import Elasticsearch, TransportError, RequestsHttpConnection
from os import environ

def check_query(query, index):
    ES_user = environ.get('ES_USER')
    ES_pw = environ.get('ES_PW')
    ES_cluster = environ.get('ES_CLUSTER')
    es = Elasticsearch([ES_cluster],
                       connection_class=RequestsHttpConnection,
                       http_auth=(ES_user, ES_pw),
                       use_ssl=True, verify_certs=False)
    try:
        logs = elasticsearch.helpers.scan(es, query=query, index=index)
        logs = {c.name: c.value for c in logs}  # comes back empty
    except Exception as e:
        logs = 'Please contact the application supporter and provide them the information below! \n'
        logs += 'Error: ' + str(e) + '\n'
        logs += 'Query: ' + query
    return logs
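For what it's worth, scan() yields plain hit dictionaries rather than objects with .name and .value attributes, and a generator can only be consumed once. A minimal sketch, assuming the standard hit layout where the document body sits under the _source key:

import pandas as pd

hits = elasticsearch.helpers.scan(es, query=query, index=index)
docs = [hit['_source'] for hit in hits]  # each hit is a dict with _index, _id, _source, ...
df = pd.DataFrame(docs)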
If I iterate every user.id inside the user collection, I get every user.id printed out correctly:
user_ref = db.collection(u'users')
for user_collection in user_ref.get():
    print(user_collection.id, file=sys.stderr)
Now, when I try to iterate a collection inside each one of the documents in the user collection, the original iteration that prints user.id does not run to completion:
user_ref = db.collection(u'users')
for user_collection in user_ref.get():
    print(user_collection.id, file=sys.stderr)
    s2_ref = user_ref.document(user_collection.id).collection(u'preferences')
    for s2 in s2_ref.get():
        try:
            print(s2.id, file=sys.stderr)
        except google.cloud.exceptions.NotFound:
            pass
I have included an exception handler to bypass empty collections.
How can I complete the iteration correctly?
I just had to collect the first set of results into an array, and then iterate over each id separately:
user_id_array = []
for user_collection in user_ref.get():
    user_id_array.append(user_collection.id)
for user_id in user_id_array:
    try:
        suscription_ref = doc_ref.document(user_id).collection(u'suscriptions').document(user_id).get()
        print(suscription_ref.id, file=sys.stderr)
    except google.cloud.exceptions.NotFound:
        pass
It takes more time, but it'll get you there.
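As a side note, if your client library version exposes stream(), the same two-pass approach can be written with it; a sketch under that assumption, reusing the collection names from the question:

user_ids = [doc.id for doc in db.collection(u'users').stream()]
for user_id in user_ids:
    for pref in db.collection(u'users').document(user_id).collection(u'preferences').stream():
        print(pref.id, file=sys.stderr)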
I am writing Python code with the BigQuery client API, attempting to use the async query code (shown everywhere as a code sample), and it is failing at the fetch_data() method call. Python errors out with:
ValueError: too many values to unpack
So the 3 return values (rows, total_count, page_token) seem to be the incorrect number of return values, but I cannot find any documentation about what this method is supposed to return, besides the numerous code examples that only show these 3 results.
Here is a snippet of code that shows what I'm doing (not including the initialization of the 'client' variable or the imported libraries, which happen earlier in my code).
#---> Set up and start the async query job
job_id = str(uuid.uuid4())
job = client.run_async_query(job_id, query)
job.destination = temp_tbl
job.write_disposition = 'WRITE_TRUNCATE'
job.begin()
print 'job started...'

#---> Monitor the job for completion
retry_count = 360
while retry_count > 0 and job.state != 'DONE':
    print 'waiting for job to complete...'
    retry_count -= 1
    time.sleep(1)
    job.reload()

if job.state == 'DONE':
    print 'job DONE.'
    page_token = None
    total_count = None
    rownum = 0
    job_results = job.results()
    while True:
        # ---- Next line of code errors out...
        rows, total_count, page_token = job_results.fetch_data(max_results=10, page_token=page_token)
        for row in rows:
            rownum += 1
            print "Row number %d" % rownum
        if page_token is None:
            print 'end of batch.'
            break
What are the specific return results I should expect from the job_results.fetch_data(...) method call on an async query job?
Looks like you are right! The code no longer returns these 3 values.
As you can see in this commit from the public repository, fetch_data now returns an instance of the HTTPIterator class (guess I didn't realize this before, as I have a docker image with an older version of the bigquery client installed, where it does return the 3 values).
The only way that I found to return the results was doing something like this:
iterator = job_results.fetch_data()
data = []
for page in iterator._page_iter(False):
    data.extend([page.next() for i in range(page.num_items)])
Notice that we don't have to manage pageTokens anymore; it's been automated for the most part.
[EDIT]:
I just realized you can get results by doing:
results = list(job_results.fetch_data())
Got to admit it's way easier now than it was before!
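And if the result set is large, the iterator can also be consumed lazily instead of materialized into a list; a sketch, assuming the same job_results object as above:

rownum = 0
for row in job_results.fetch_data():  # paging happens behind the scenes
    rownum += 1
    print "Row number %d" % rownum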
I'm trying to fetch results in a Python 2.7 App Engine app using cursors, but each time I use with_cursor() it fetches the same result set.
query = Model.all().filter("profile =", p_key).order('-created')
if r.get('cursor'):
    query = query.with_cursor(start_cursor=r.get('cursor'))
cursor = query.cursor()
objs = query.fetch(limit=10)
count = len(objs)
for obj in objs:
    ...
Each time through, I'm getting the same 10 results. I'm thinking it has to do with using end_cursor, but how do I get that value if query.cursor() is returning the start_cursor? I've looked through the docs, but this is poorly documented.
Your formatting is a bit screwy, by the way. Looking at your code (which is incomplete and therefore potentially leaving something out), I have to assume you have forgotten to store the cursor after fetching results (or return it to the user; I am assuming r is a request?).
So after you have fetched some data, you need to call cursor() on the query. E.g., this function counts all entities using a cursor:
def count_entities(kind):
    c = None
    count = 0
    q = kind.all(keys_only=True)
    while True:
        if c:
            q.with_cursor(c)
        i = q.fetch(1000)
        count = count + len(i)
        if not i:
            break
        c = q.cursor()
    return count
Note how, after fetch() has been called, c = q.cursor() grabs the new cursor, and it is used as the cursor the next time through the loop.
Here's what finally worked:
query = Model.all().filter("profile =", p_key).order('-created')
if request.get('cursor'):
    query = query.with_cursor(request.get('cursor'))
objs = query.fetch(limit=10)
cursor = query.cursor()
for obj in objs:
...