Deleting a row in Accumulo using Pyaccumulo

I'm trying to delete a row using the RowDeletingIterator. I'm running Accumulo 1.5.0. Here's what I have:
writer = conn.create_batch_writer("my_table")
mut = Mutation("1234")
mut.put(cf="", cq="", cv="", is_delete=True)
writer.add_mutation(mut)
writer.close()
for r in conn.scan("my_table", scanrange=Range(srow="1234", erow="1234"), iterators=[RowDeletingIterator()]):
    print(r)
conn.close()
I'm printing the record to verify that the scanner is scanning the appropriate records. Sadly they don't appear to be getting deleted. I'd appreciate any insight as Pyaccumulo's docs aren't the best.
I'm aware that there's a bug (ACCUMULO-1800) that requires one to use timestamps when deleting over Thrift, but when I specify a ts field, I just see a blank record in addition to the existing ones.

I'm not very familiar with Pyaccumulo, but I can tell you that the RowDeletingIterator doesn't use normal deletion entries. It just uses a key with empty column family, qualifier and visibility and a value of DEL_ROW to indicate that an entire row should be deleted. I would try not setting is_delete=True and giving the entry a value of DEL_ROW and see if that does what you want.

There should be a delete_rows method in pyaccumulo.
def delete_rows(self, table, srow, erow):
    self.client.deleteRows(self.login, table, srow, erow)
You can use it like this:
conn.delete_rows(table, srow, erow)

Related

Snowflake: How to ignore the row number (first column) in the result set

Whenever I run a select query in Snowflake, the result set has an auto-generated row number column as its first column. How can I exclude this column from the code?
Something like: select * from emp ignore row;

If you're referring to the unnamed column just before TABLE_CATALOG in the picture below, I'm pretty sure that's not something we can change. Maybe if you wrote some custom JS to fiddle with the page you could hide it, perhaps by changing the text color to white or something, but that seems like a lot of work.
If you extract the data to a CSV (or any file format) this number does not appear in the payload.

When you query Snowflake, regardless of which client you use, that column won't be returned.
It is a pure UI thing in the Snowflake Editor for readability.

If this is happening in Python, the following may help. Also see https://docs.snowflake.com/en/user-guide/python-connector-example.html#using-cursor-to-fetch-values
result = cur.execute("select * from table")
# Either assign the rows to a variable...
result_var = result.fetchall()
# ...or write the result to a file
result.fetch_pandas_all().to_csv(multiple_table_path, index=False, header=None, sep='|')

Python cursor returns only those rows whose first column is not empty

In Python 3.8, I have a select query.
dbconn.execute("select name, id, date from test_table")
That query always returned the wrong number of rows. After a lot of debugging, I was able to fix it just by swapping the positions of the id and name columns, and it started working normally.
The issue was that the name column was empty for some rows.
This suggests that the Python cursor returns only those rows whose first column is not empty. Am I missing anything in my conclusion?

Check your database integrity: some of the entries may be corrupted and therefore fail. I once had a query return the wrong number of rows because of an integrity issue.
Another thing to try: instead of returning each row as a list/tuple, return it as a dictionary-like object with key-value pairs.
dbconn.row_factory = sqlite3.Row
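A minimal sketch of the row_factory suggestion, assuming the database is SQLite (the question does not say which driver is in use). The in-memory table and its three rows are made up for illustration, and fetch_rows is a hypothetical helper. With this data, sqlite3 returns the rows whose name is empty or NULL as well, so if rows go missing the cause is probably elsewhere.

```python
import sqlite3


def fetch_rows(conn):
    """Return every row of test_table as a list of plain dicts."""
    # sqlite3.Row makes each row addressable by column name.
    conn.row_factory = sqlite3.Row
    cur = conn.execute("select name, id, date from test_table")
    return [dict(row) for row in cur.fetchall()]


# Hypothetical sample data: two of the three rows have an empty or NULL name.
conn = sqlite3.connect(":memory:")
conn.execute("create table test_table (name text, id integer, date text)")
conn.executemany(
    "insert into test_table values (?, ?, ?)",
    [
        ("alice", 1, "2020-01-01"),
        ("", 2, "2020-01-02"),
        (None, 3, "2020-01-03"),
    ],
)

rows = fetch_rows(conn)
for row in rows:
    print(row["id"], repr(row["name"]))
```

Comparing len(rows) with the result of select count(*) from test_table is a quick way to check whether the cursor is really dropping rows.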

Remove duplicates from paired rows before counting values

I have a Python program that I am trying to convert from CSV to SQLite. I have managed to do everything apart from removing duplicates before counting entries. My database is JOINed. I'm reading the database like this:
df = pd.read_sql_query("SELECT d.id AS is, mac.add AS mac etc etc
I have tried df.drop_duplicates('tablename1','tablename2')
and
df.drop_duplicates('row[1],row[3]')
but it doesn't seem to work.
The below code is what I used with the CSV version & I would like to replicate for the Python SQLite script.
for row in reader:
    key = (row[1], row[2])
    if key not in entries:
        writer.writerow(row)
        entries.add(key)
del writer

Have you tried running SELECT DISTINCT col1,col2 FROM table first?
In your case it might be as simple as placing the DISTINCT keyword before your column names.
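As a rough sketch of the SELECT DISTINCT idea, using the standard sqlite3 module: the table name seen, its columns and the sample rows are all hypothetical, since the real query in the question is truncated.

```python
import sqlite3


def distinct_pairs(conn):
    """Return each (mac, ip) combination once, however often it occurs."""
    cur = conn.execute("SELECT DISTINCT mac, ip FROM seen ORDER BY mac, ip")
    return cur.fetchall()


# Hypothetical sample data: rows 1 and 2 share the same (mac, ip) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen (id INTEGER, mac TEXT, ip TEXT)")
conn.executemany(
    "INSERT INTO seen VALUES (?, ?, ?)",
    [
        (1, "aa:aa", "10.0.0.1"),
        (2, "aa:aa", "10.0.0.1"),
        (3, "bb:bb", "10.0.0.2"),
        (4, "aa:aa", "10.0.0.3"),
    ],
)

pairs = distinct_pairs(conn)
print(len(pairs), "distinct pairs")
for mac, ip in pairs:
    print(mac, ip)
```

DISTINCT applies to the whole selected row, so adding a unique column such as id to the select list would make every row distinct again.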

You need to use the subset parameter
df.drop_duplicates(subset=['tablename1','tablename2'])

Thank you piRSquared, the missing subset is all I needed.
I will also look into SELECT DISTINCT, but for now subset works.

Why is querying a table so much slower after sorting it?

I have a Python program that uses Pytables and queries a table in this simple manner:
def get_element(table, somevar):
    rows = table.where("colname == somevar")
    row = next(rows, None)
    if row:
        return elem_from_row(row)
To reduce the query time, I decided to try to sort the table with table.copy(sortby='colname'). This indeed improved the query time (spent in where), but it increased the time spent in the next() built-in function by several orders of magnitude! What could be the reason?
This slowdown occurs only when there is another column in the table, and the slowdown increases with the element size of that other column.
To help me understand the problem and make sure this was not related to something else in my program, I made this minimum working example reproducing the problem:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tables
import time
import sys
def create_set(sort, withdata):
    #Table description with or without data
    tabledesc = {
        'id': tables.UIntCol()
    }
    if withdata:
        tabledesc['data'] = tables.Float32Col(2000)
    #Create table with CSI'ed id
    fp = tables.open_file('tmp.h5', mode='w')
    table = fp.create_table('/', 'myset', tabledesc)
    table.cols.id.create_csindex()
    #Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()
    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')
    fp.flush()
    return fp
def get_element(table, i):
    #By construction, i always exists in the table
    rows = table.where('id == i')
    row = next(rows, None)
    if row:
        return {'id': row['id']}
    return None
sort = sys.argv[1] == 'sort'
withdata = sys.argv[2] == 'withdata'
fp = create_set(sort, withdata)
start_time = time.time()
table = fp.root.myset
for i in xrange(500):
    get_element(table, i)
print("Queried the set in %.3fs" % (time.time() - start_time))
fp.close()
And here is some console output showing the figures:
$ ./timedset.py nosort nodata
Queried the set in 0.718s
$ ./timedset.py sort nodata
Queried the set in 0.003s
$ ./timedset.py nosort withdata
Queried the set in 0.597s
$ ./timedset.py sort withdata
Queried the set in 5.846s
Some notes:
The rows are actually sorted in all cases, so it seems to be linked to the table being aware of the sort rather than just the data being sorted.
If instead of creating the file, I read it from disk, same results.
The issue occurs only when the data column is present, even though I never write to it nor read it. I noticed that the time difference increases "in stages" when the size of the column (the number of floats) increases. The slowdown must be linked with internal data movements or I/O.
If I don't use the next function, but instead use a for row in rows and trust that there is only one result, the slowdown still occurs.
Accessing an element from a table by some sort of id (sorted or not) sounds like a basic feature, so I must be missing the typical way of doing it with pytables. What is it?
And why such a terrible slowdown? Is it a bug that I should report?

I finally understood what's going on.
Long story short
The root cause is a bug and it was on my side: I was not flushing the data before making the copy in case of sort. As a result, the copy was based on data that was not complete, and so was the new sorted table. This is what caused the slowdown, and flushing when appropriate led to a less surprising result:
...
    #Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()
    fp.flush() # <--
    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')
    fp.flush() # <--
    return fp
...
But why?
I realized my mistake when I decided to inspect and compare the structure and data of the tables "not sorted" vs "sorted". I noticed that in the sorted case, the table had fewer rows. The number varied seemingly randomly from 0 to about 450 depending on the size of the data column. Moreover, in the sorted table, the id of all the rows was set to 0. I guess that when creating a table, pytables initializes the columns and may or may not pre-create some of the rows with some initial value. This "may or may not" probably depends on the size of the row and the computed chunksize.
As a result, when querying the sorted table, all queries but the one with id == 0 had no result. I initially thought that raising and catching the StopIteration error was what caused the slowdown, but that would not explain why the slowdown depends on the size of the data column.
After reading some of the code from pytables (notably table.py and tableextension.pyx), I think what happens is the following: when a column is indexed, pytables will first try to use this index to speed up the search. If some matching rows are found, only these rows will be read. But if the index indicates that no row matches the query, for some reason pytables falls back to an "in kernel" search, which iterates over and reads all the rows. This requires reading the full rows from disk in multiple I/Os, and this is why the size of the data column mattered. Also, under a certain size of that column, pytables did not "pre-create" any rows on disk, resulting in a sorted table with no rows at all. This is why on the graph the search is very fast when the column size is under 525: iterating over 0 rows doesn't take much time.
I am not clear on why the iterator falls back to an "in kernel" search. If the searched id is clearly out of the index bounds, I don't see any reason to search for it anyway... Edit: After a closer look at the code, it turns out this is because of a bug. It is present in the version I am using (3.1.1), but has been fixed in 3.2.0.
The irony
What really makes me cry is that I forgot to flush before copying only in the example of the question. In my actual program, this bug is not present! What I also did not know, but found out while investigating the question, is that by default pytables does not propagate indexes when copying a table. This has to be requested explicitly with propindexes=True. This is why the search was slower after sorting in my application...
So moral of the story:
Indexing is good: use it
But don't forget to propagate them when sorting a table
Make sure your data is on disk before reading it...

Any easy way to alter the data that comes from a mysql database?

I'm using MySQL to grab data from a database and feed it into a Python function. I import MySQLdb, connect to the database and run a query like this:
conn.query('SELECT info FROM bag')
x = conn.store_result()
for row in x.fetch_row(100):
    print row
My problem is that my data comes out like this: (1.234234,)(1.12342,)(3.123412,)
when I really want it to come out like this: 1.23424, 1.1341234, 5.1342314 (i.e. without parentheses). I need it this way to feed it into a Python function. Does anyone know how I can grab data from the database in a way that doesn't produce the parentheses?

Rows are returned as tuples, even if there is only one column in the query. You can access the first and only item as row[0]
The first time around in the for loop, row does indeed refer to the first row. The second time around, it refers to the second row, and so on.
By the way, you say that you are using mySQLdb, but the methods that you are using are from the underlying _mysql library (low level, scarcely portable) ... why??

You could also simply use this as your for loop:
for (info, ) in x.fetch_row(100):
    print info
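Both suggestions can be tried without a database. The sketch below uses a hard-coded tuple of 1-tuples as a stand-in for x.fetch_row(100), with the values taken from the question; total is a hypothetical consumer function, and the syntax is Python 3 rather than the Python 2 used above.

```python
# Stand-in for x.fetch_row(100): one 1-tuple per row, as in the question.
rows = ((1.234234,), (1.12342,), (3.123412,))

# Take the only column out of each row tuple.
values = [row[0] for row in rows]

# Equivalent form using tuple unpacking in the loop target.
unpacked = [info for (info,) in rows]


def total(*numbers):
    """Hypothetical consumer that expects plain numbers, not tuples."""
    return sum(numbers)


print(", ".join(str(v) for v in values))
print(total(*values))
```

The parentheses are only how Python prints a tuple; once the value is pulled out with row[0] or unpacking, it is an ordinary float.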
