I'm trying to set up a simple table-existence test for a Luigi task using luigi.hive.HiveTableTarget.
First, I create a simple table in Hive just to make sure it is there:
create table test_table (a int);
Next I set up the target with luigi:
from luigi.hive import HiveTableTarget
target = HiveTableTarget(table='test_table')
>>> target.exists()
True
Great, next I try it with a table I know doesn't exist, to make sure it returns False.
target = HiveTableTarget(table='test_table_not_here')
>>> target.exists()
And it raises an exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/site-packages/luigi/hive.py", line 344, in exists
return self.client.table_exists(self.table, self.database)
File "/usr/lib/python2.6/site-packages/luigi/hive.py", line 117, in table_exists
stdout = run_hive_cmd('use {0}; describe {1}'.format(database, table))
File "/usr/lib/python2.6/site-packages/luigi/hive.py", line 62, in run_hive_cmd
return run_hive(['-e', hivecmd], check_return_code)
File "/usr/lib/python2.6/site-packages/luigi/hive.py", line 56, in run_hive
stdout, stderr)
luigi.hive.HiveCommandError: ('Hive command: hive -e use default; describe test_table_not_here
failed with error code: 17', '', '\nLogging initialized using configuration in
jar:file:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/hive-common-0.13.1-
cdh5.2.0.jar!/hive-log4j.properties\nOK\nTime taken: 0.822 seconds\nFAILED:
SemanticException [Error 10001]: Table not found test_table_not_here\n')
(Traceback formatting edited for clarity.)
I don't understand that last line of the exception. Of course the table is not found, that is the whole point of an existence check. Is this the expected behavior or do I have some configuration issue I need to work out?
Okay, so it looks like this may have been a bug in the latest tagged release (1.0.19), but it is fixed on the master branch. The code responsible is:
stdout = run_hive_cmd('use {0}; describe {1}'.format(database, table))
return not "does not exist" in stdout
which has been changed on master to:
stdout = run_hive_cmd('use {0}; show tables like "{1}";'.format(database, table))
return stdout and table in stdout
The latter works fine whereas the former throws a HiveCommandError.
If you want a solution without having to update to the master branch, you could create your own target class with minimal effort:
from luigi.hive import HiveTableTarget, run_hive_cmd

class MyHiveTarget(HiveTableTarget):
    def exists(self):
        stdout = run_hive_cmd('use {0}; show tables like "{1}";'.format(self.database, self.table))
        return self.table in stdout
This will produce the desired output.
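For example (a quick interactive check, assuming the same two tables and the default database as above), the new target should behave like this:

>>> MyHiveTarget(table='test_table').exists()
True
>>> MyHiveTarget(table='test_table_not_here').exists()
False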
I am taking the CS50 class and am currently on Week 7.
Prior to this code, Python was working perfectly fine.
Now I am using SQL commands within a Python file in VS Code.
The cs50 module is working fine through the venv.
When I execute the Python file, I should be asked "Title: " so that I can type any title and see the outcome.
I should get the counter as output, which tracks the number of occurrences of the title from the user input.
import csv
from cs50 import SQL

db = SQL("C:\\Users\\wf user\\Desktop\\CODING\\CS50\\shows.db")

title = input("Title: ").strip()

# Uses a SQL command to return the number of occurrences of the title the user typed (? is a placeholder for title).
rows = db.execute("SELECT COUNT(*) AS counter FROM shows WHERE title LIKE ?", title)

# db.execute always returns a list of rows, even if it's just one row.
# rows[0] is that single row (a dict); the count is stored under its "counter" key.
row = rows[0]

# Passing the key "counter" prints the count itself.
print(row["counter"])
I have shows.db in the path.
But the output just prints "Found"; it doesn't even ask for a title to input.
PS C:\Users\wf user\Desktop\CODING\CS50> python favoritesS.py
Found
I am expecting the program to ask me "Title: ", but instead it prints "Found".
In CS50, the professor encountered the same problem while coding phonebook.py; he solved it by putting the Python file into a separate folder called "tmp".
I tried the same approach, but then I got a long error message:
PS C:\Users\wf user\Desktop\CODING\CS50> cd tmp
PS C:\Users\wf user\Desktop\CODING\CS50\tmp> python favoritesS.py
Traceback (most recent call last):
File "C:\Users\wf user\Desktop\CODING\CS50\tmp\favoritesS.py", line 5, in <module>
db = SQL("C:\\Users\\wf user\\Desktop\\CODING\\CS50\\shows.db")
File "C:\Users\wf user\AppData\Local\Programs\Python\Python311\Lib\site-packages\cs50\sql.py", line 74, in __init__
self._engine = sqlalchemy.create_engine(url, **kwargs).execution_options(autocommit=False, isolation_level="AUTOCOMMIT")
File "<string>", line 2, in create_engine
File "C:\Users\wf user\AppData\Local\Programs\Python\Python311\Lib\site-packages\sqlalchemy\util\deprecations.py", line 309, in warned
return fn(*args, **kwargs)
File "C:\Users\wf user\AppData\Local\Programs\Python\Python311\Lib\site-packages\sqlalchemy\engine\create.py", line 518, in create_engine
u = _url.make_url(url)
File "C:\Users\wf user\AppData\Local\Programs\Python\Python311\Lib\site-packages\sqlalchemy\engine\url.py", line 732, in make_url
return _parse_url(name_or_url)
File "C:\Users\wf user\AppData\Local\Programs\Python\Python311\Lib\site-packages\sqlalchemy\engine\url.py", line 793, in _parse_url
raise exc.ArgumentError(
sqlalchemy.exc.ArgumentError: Could not parse SQLAlchemy URL from string 'C:\Users\wf user\Desktop\CODING\CS50\shows.db'
Here is proof that the code I posted is the same code I am working on.
When I use Start Debugging under the Run menu in VS Code, it works! But it doesn't when I run it without debugging.
Is this the library you are using? https://cs50.readthedocs.io/
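If so, one thing to check (a guess based on the "Could not parse SQLAlchemy URL" error): SQL() expects a SQLAlchemy-style URL rather than a bare Windows path, so the connection line would look something like:

# Hypothetical fix: pass a sqlite URL instead of a raw Windows path,
# assuming shows.db sits in the same folder as the script.
db = SQL("sqlite:///shows.db")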
It may be that one of your intermediate results is not doing what you think it is. I would recommend you put print() statements at every step of the way to see the values of the intermediate variables.
If you have learned how to use a debugger, that is even better.
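For instance, a lightly instrumented version of the middle of your script (using the names from your code) might look like this:

title = input("Title: ").strip()
print("title:", repr(title))  # confirm the prompt actually ran and what was typed

rows = db.execute("SELECT COUNT(*) AS counter FROM shows WHERE title LIKE ?", title)
print("rows:", rows)          # expect something like [{'counter': 2}]

row = rows[0]
print("row:", row)            # expect something like {'counter': 2}

print(row["counter"])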
I have a problem with attempting to pipe some entries into a PostgreSQL database. The loader is this file, movie_loader.py, which was provided to me:
import csv
import sys

"""
This program generates direct SQL statements from the source Netflix Prize files in order
to populate a relational database with those files’ data.

By taking the approach of emitting SQL statements directly, we bypass the need to import
some kind of database library for the loading process, instead passing the statements
directly into a database command line utility such as `psql`.
"""

# The INSERT approach is best used with a transaction. An introductory definition:
# instead of “saving” (committing) after every statement, a transaction waits on a
# commit until we issue the `COMMIT` command.
print('BEGIN;')

# For simplicity, we assume that the program runs where the files are located.
MOVIE_SOURCE = 'movie_titles.csv'

with open(MOVIE_SOURCE, 'r+', encoding='iso-8859-1') as f:
    reader = csv.reader(f)
    for row in reader:
        id = row[0]
        year = 'null' if row[1] == 'NULL' else int(row[1])
        title = ', '.join(row[2:])

        # Watch out---titles might have apostrophes!
        title = title.replace("'", "''")
        print(f'INSERT INTO movie VALUES({id}, {year}, \'{title}\');')

sys.stdout.reconfigure(encoding='UTF08')

# We wrap up by emitting an SQL statement that will update the database’s movie ID
# counter based on the largest one that has been loaded so far.
print('SELECT setval(\'movie_id_seq\', (SELECT MAX(id) from movie));')

# _Now_ we can commit our transaction.
print('COMMIT;')
However, when attempting to pipe this file's output into my database, I get the following error, which seems to be some kind of encoding error. I am using Git Bash as my terminal.
$ python3 movie_loader.py | psql postgresql://localhost/postgres
stdin is not a tty
Traceback (most recent call last):
File "C:\Users\dhuan\relational\movie_loader.py", line 28, in <module>
print(f'INSERT INTO movie VALUES({id}, {year}, \'{title}\');')
OSError: [Errno 22] Invalid argument
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='cp1252'>
OSError: [Errno 22] Invalid argument
It seems as if maybe my dataset has an error? I'm not sure specifically what the error is pointing at. Any insight is appreciated.
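One thing that might be worth trying (a sketch only, prompted by the encoding='cp1252' shown in the traceback, not a confirmed fix): reconfigure stdout to UTF-8 at the very top of the script, before any INSERT statements are printed, and spell the codec name 'utf-8':

import sys

# Assumed tweak: force UTF-8 output before anything is printed,
# so titles with characters outside cp1252 don't trip the stdout encoder.
sys.stdout.reconfigure(encoding='utf-8')  # requires Python 3.7+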
I have an issue with TransactionRollbackError: could not serialize access due to concurrent update while updating a table. I'm using Postgres with Odoo v13, and the issue occurs while updating the pool with a specific recordset and context using the write() method. I'm migrating the code from Odoo v7 to v13, and the same code works in Odoo v7 with no issues. I see no syntax errors, but I still get this. I just want to understand: is this a bug in the version, or is it related to concurrent access to the data?
I have the following line of code, which is part of one function.
self.env.get('pk.status').browse(pk_id).with_context(audit_log=True).write(update_vals)
I have a model named pk.status with a write(self, update_vals) method; based on conditions, it will run x1(update_vals), shown below.
def x1(self, update_vals):
    product_pool = self.env.get('pk.product')
    if update_vals:
        if isinstance(update_vals, int):
            update_vals = [update_vals]
        for bs_obj in self.browse(update_vals).read(['End_Date']):
            product_ids = product_pool.search([('id_pk_status', '=', bs_obj['id']),
                                               ('is_active', '=', 'Y')])
            if product_ids:
                end_date = bs_obj['End_Date'] or date.today()
                force_update = self._context.get('force_update', False)
                product_ids.with_context(audit_log=True, force_update=force_update).write(
                    {'is_active': 'N', 'end_date': end_date})
The product_ids recordset has a write(self, vals) method for the 'pk.product' model.
As part of that write() and its conditions, x2() is executed:
def x2(self, vals, condition=None):
    try:
        status_pool = self.env.get('pk.status')
        product_pool = self.env.get('pk.product')
        result = False
        status_obj = status_pool.browse(vals['id_pk_status']).read()[0]
        product_obj = product_pool.browse(vals['id_pk_product']).read()[0]
        if not product_obj['end_date']:
            product_obj['end_date'] = date.today()
        extra_check = True
        if condition:
            statuses = (status_obj['Is_Active'], product_obj['is_active'])
            extra_check = statuses in condition
        if extra_check:
            result = True
            if isinstance(vals['start_date'], str):
                vals['start_date'] = datetime.strptime(vals['start_date'], '%Y-%m-%d').date()
            if not (result and vals['start_date'] >= status_obj['Start_Date']):
                result = False
    except Exception as e:
        traceback.print_exc()
    return result
The error occurs while executing the line
status_obj = status_pool.browse(vals['id_pk_status']).read()[0]
Complete Error:
2020-08-09 15:39:11,303 4224 ERROR ek_openerp_dev odoo.sql_db: bad query: UPDATE "pk_status" SET "Is_Active"='N',"write_uid"=1,"write_date"=(now() at time zone 'UTC') WHERE id IN (283150)
ERROR: could not serialize access due to concurrent update
Traceback (most recent call last):
File "/current/addons/models/pk_product.py", line 141, in x2()
status_obj = status_pool.browse(vals['id_pk_status']).read()[0]
File "/current/core/addons/nest_migration_utils/helpers/old_cr.py", line 51, in old_cursor
result = method(*args, **kwargs)
File "/current/odoo/odoo/models.py", line 2893, in read
self._read(stored_fields)
File "/current/odoo/odoo/models.py", line 2953, in _read
self.flush(fields, self)
File "/current/odoo/odoo/models.py", line 5419, in flush
process(self.env[model_name], id_vals)
File "/current/odoo/odoo/models.py", line 5374, in process
recs._write(vals)
File "/current/odoo/odoo/models.py", line 3619, in _write
cr.execute(query, params + [sub_ids])
File "/current/odoo/odoo/sql_db.py", line 163, in wrapper
return f(self, *args, **kwargs)
File "/current/odoo/odoo/sql_db.py", line 240, in execute
res = self._obj.execute(query, params)
psycopg2.extensions.TransactionRollbackError: could not serialize access due to concurrent update
I assume the concurrency mentioned in the error means that I'm doing two write operations in a single thread, but I'm not sure about it. I hope this context helps.
Each subprocess needs to have a global connection to the database. If you are using Pool then you can define a function that creates a global connection and a cursor and pass it to the initializer parameter. If you are instead using a Process object then I'd recommend you create a single connection and pass the data via queues or pipes.
Like Klaver said, it would be better if you were to provide code so as to get a more accurate answer.
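A minimal sketch of that Pool initializer pattern (the connection string, table, and function names here are illustrative, not taken from the question):

import psycopg2
from multiprocessing import Pool

conn = None
cur = None

def init_worker():
    # Runs once per worker process, so each subprocess gets its own connection.
    global conn, cur
    conn = psycopg2.connect("dbname=mydb user=me")  # illustrative DSN
    cur = conn.cursor()

def mark_inactive(record_id):
    # Each worker reuses its own global connection and cursor.
    cur.execute("UPDATE pk_status SET is_active = 'N' WHERE id = %s", (record_id,))
    conn.commit()

if __name__ == '__main__':
    with Pool(processes=4, initializer=init_worker) as pool:
        pool.map(mark_inactive, [283150, 283151])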
This happens if the transaction isolation level is set to serializable and two processes are trying to update the same column values.
If your choice is to go with the serializable isolation level, then you have to roll back and retry the transaction.
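A rough sketch of that rollback-and-retry loop with plain psycopg2 (the query, parameters, and retry count are illustrative):

from psycopg2.extensions import TransactionRollbackError

def execute_with_retry(conn, query, params, max_tries=3):
    # Retry the whole transaction when Postgres reports a serialization conflict.
    for attempt in range(max_tries):
        try:
            with conn.cursor() as cr:
                cr.execute(query, params)
            conn.commit()
            return
        except TransactionRollbackError:
            conn.rollback()  # discard the failed transaction and try again
    raise RuntimeError("gave up after %d serialization failures" % max_tries)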
I am using Python to read data from an .xlsm Excel file. I have two files that are nearly identical and are saved in the same directory. When I give the Python program one Excel sheet, it correctly reads the data and solves the problem. However, with the other Excel sheet I get the following error.
(I blocked out my name with ####)
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
solve("updated_excel.xlsm")
File "C:\Documents and Settings\#####\My Documents\GlockNew.py", line 111, in solve
prob.solve()
File "C:\Python27\lib\site-packages\pulp-1.5.4-py2.7.egg\pulp\pulp.py", line 1614, in solve
status = solver.actualSolve(self, **kwargs)
File "C:\Python27\lib\site-packages\pulp-1.5.4-py2.7.egg\pulp\solvers.py", line 1276, in actualSolve
return self.solve_CBC(lp, **kwargs)
File "C:\Python27\lib\site-packages\pulp-1.5.4-py2.7.egg\pulp\solvers.py", line 1343, in solve_CBC
raise PulpSolverError, "Pulp: Error while executing "+self.path
PulpSolverError: Pulp: Error while executing C:\Python27\lib\site-packages\pulp-1.5.4-py2.7.egg\pulp\solverdir\cbc.exe
I don't know what "Pulp: Error while executing " + self.path means, but both files are stored in the same directory, and the problem only appears once I try to solve the problem. Does anyone have an idea as to what could possibly trigger such an error?
EDIT
After further debugging, I have found that the error lies in the solve_CBC method in the COIN_CMD class. The error occurs here:
if not os.path.exists(tmpSol):
    raise PulpSolverError, "Pulp: Error while executing "+self.path
When I run the solver for both Excel sheets, they have the same value for tmpSol: 4528-pulp.sol
However, when I run it for one Excel sheet os.path.exists(tmpSol) returns True, and for the other it returns False. How can that be, when tmpSol has the same value both times?
The name is created using the process ID; if you have some sort of batch job that launches both solver applications from one process, then they will have the same name.
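A tiny illustration of the collision (the "%d-pulp.sol" pattern is the one quoted from pulp's solvers.py in the next answer):

import os

# Every solve launched from this same process builds the same temp-file name,
# e.g. "4528-pulp.sol", because os.getpid() does not change between solves.
pid = os.getpid()
tmpSol = "%d-pulp.sol" % pid
print(tmpSol)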
I experienced the same issue when launching multiple instances of the LPSolver class. The issue is caused by the following lines of code within the solvers.py file of pulp:
pid = os.getpid()
tmpLp = os.path.join(self.tmpDir, "%d-pulp.lp" % pid)
tmpMps = os.path.join(self.tmpDir, "%d-pulp.mps" % pid)
tmpSol = os.path.join(self.tmpDir, "%d-pulp.sol" % pid)
which appears in every solver. The problem is that these paths are deleted later on, but may coincide for different instances of the LPSolver class (as the variable pid is not unique).
The solution is to get a unique path for each instance of LPSolver, using, for example, the current time. Replacing the above lines by the following four will do the trick.
currentTime = time()
tmpLp = os.path.join(self.tmpDir, "%f3-pulp.lp" % currentTime)
tmpMps = os.path.join(self.tmpDir, "%f3-pulp.mps" % currentTime)
tmpSol = os.path.join(self.tmpDir, "%f3-pulp.sol" % currentTime)
Don't forget to
from time import time
Cheers,
Tim
Here is the query in PyMongo:
import mong  # just my library for initializing

collection_1 = mong.init(collect="col_1")
collection_2 = mong.init(collect="col_2")

for name in collection_2.find({"field1": {"$exists": 0}}):
    try:
        to_query = name['something']
        actual_id = collection_1.find_one({"something": to_query})['_id']
        crap_id = name['_id']
        collection_2.update({"_id": id}, {"$set": {"new_name": actual_id}}, upset=True)
    except:
        open('couldn_find_id.txt', 'a').write(name)
All this is doing is taking a field from one collection, finding the id of that field, and updating the id in another collection. It works for about 1000-5000 iterations, but it periodically fails with the following error and then I have to restart the script.
> Traceback (most recent call last):
File "my_query.py", line 6, in <module>
for name in collection_2.find({"field1":{"$exists":0}}):
File "/home/user/python_mods/pymongo/pymongo/cursor.py", line 814, in next
if len(self.__data) or self._refresh():
File "/home/user/python_mods/pymongo/pymongo/cursor.py", line 776, in _refresh
limit, self.__id))
File "/home/user/python_mods/pymongo/pymongo/cursor.py", line 720, in __send_message
self.__uuid_subtype)
File "/home/user/python_mods/pymongo/pymongo/helpers.py", line 98, in _unpack_response
cursor_id)
pymongo.errors.OperationFailure: cursor id '7578200897189065658' not valid at server
^C
bye
Does anyone have any idea what this failure is, and how I can turn it into an exception to continue my script even at this failure?
Thanks
The reason for the problem is described in pymongo's FAQ:
Cursors in MongoDB can timeout on the server if they’ve been open for
a long time without any operations being performed on them. This can
lead to an OperationFailure exception being raised when attempting to
iterate the cursor.
This is because of the timeout argument of collection.find():
timeout (optional): if True (the default), any returned cursor is
closed by the server after 10 minutes of inactivity. If set to False,
the returned cursor will never time out on the server. Care should be
taken to ensure that cursors with timeout turned off are properly
closed.
Passing timeout=False to the find should fix the problem:
for name in collection_2.find({"field1":{"$exists":0}}, timeout=False):
But, be sure you are closing the cursor properly.
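For example (a sketch using the collections from the question), the loop could be wrapped so the no-timeout cursor is always closed:

cursor = collection_2.find({"field1": {"$exists": 0}}, timeout=False)
try:
    for name in cursor:
        pass  # existing update logic goes here
finally:
    cursor.close()  # a no-timeout cursor stays open on the server until explicitly closed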
Also see:
mongodb cursor id not valid error