Remove duplicates from paired rows before counting values - python

I have a Python program I am converting from CSV to SQLite. I have managed to do everything apart from removing duplicates before counting entries. The query JOINs several tables, and I'm reading the database like this:
df = pd.read_sql_query("SELECT d.id AS is, mac.add AS mac etc etc
I have tried df.drop_duplicates('tablename1','tablename2')
and
df.drop_duplicates('row[1],row[3]')
but neither seems to work.
The code below is what I used with the CSV version, and I would like to replicate it in the SQLite script.
for row in reader:
    key = (row[1], row[2])
    if key not in entries:
        writer.writerow(row)
        entries.add(key)
del writer

Have you tried running SELECT DISTINCT col1, col2 FROM table first?
In your case it might be as simple as placing the DISTINCT keyword before your column names.
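For example, a minimal self-contained sketch of the idea (the table and columns here are made-up stand-ins for the JOINed query in the question):
import sqlite3
import pandas as pd

# Illustrative only: a throwaway table standing in for the JOINed query in the question.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE readings (id INTEGER, mac TEXT);
    INSERT INTO readings VALUES (1, 'aa:bb'), (1, 'aa:bb'), (2, 'cc:dd');
""")

# DISTINCT removes duplicate (id, mac) pairs before pandas ever sees the data,
# so no drop_duplicates() call is needed afterwards.
df = pd.read_sql_query("SELECT DISTINCT id, mac FROM readings", con)
print(df)  # two rows: (1, 'aa:bb') and (2, 'cc:dd')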

You need to use the subset parameter:
df.drop_duplicates(subset=['tablename1','tablename2'])
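For instance, a small self-contained example (the column names are placeholders for whatever aliases your query returns):
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2],
    "mac": ["aa:bb", "aa:bb", "cc:dd"],
    "rssi": [-40, -42, -51],
})

# Rows count as duplicates when BOTH subset columns match; other columns are ignored.
deduped = df.drop_duplicates(subset=["id", "mac"])
print(deduped)  # keeps the first (1, 'aa:bb') row plus (2, 'cc:dd')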

Thank you piRSquared, the missing subset was all I needed.
Will also look into SELECT DISTINCT but for now, subset works.

Related

Snowflake - How to ignore the row number (first column) in the result set

Whenever I run any SELECT query in Snowflake, the result set has an auto-generated row number column (as the first column). How can I exclude this column in the code? Something like:
select * from emp ignore row;
If you're referring to the unnamed column just before TABLE_CATALOG in the picture below, I'm pretty sure that's not something we can change. Maybe if you wrote some custom JS to fiddle with the page you could hide it, perhaps by changing the text color to white or something, but that seems like a lot of work.
If you extract the data to a CSV (or any file format) this number does not appear in the payload.
When you query Snowflake, regardless of which client you use, that column won't be returned.
It is a pure UI thing in the Snowflake Editor for readability.
If this is happening in Python, then the snippet below could help. Also refer to https://docs.snowflake.com/en/user-guide/python-connector-example.html#using-cursor-to-fetch-values
result = cur.execute("select * from table")
# assign the rows (a list of tuples) to a variable...
result_var = result.fetchall()
# ...or write them to a file via pandas (use one or the other,
# since fetching consumes the cursor)
result.fetch_pandas_all().to_csv(multiple_table_path, index=False, header=None, sep='|')

How to work with the python sqlite output data?

I need help with a Python 3.7 sqlite3 database. I have created it and inserted some data into it, but I am not able to work with that data afterwards.
E.g.:
test = db.execute("SELECT MESSAGE from TEST")
for row in test:
    print(row[0])
This is the only thing I've found. But what if I want to work with the data? What if I now want to do something like:
if (row[0] == 1):
    ...
I could not do it that way. It does not work. Can you help me? Thank you.
Database queries return an array of rows, and each row is an array of the selected column values. (More precisely, fetchall() gives you a list of tuples, but let's keep it simple.) In your example there could be many rows, each with a single column of data.
First materialise the rows into a list:
test = db.execute("SELECT MESSAGE from TEST").fetchall()
To access the first row:
test[0]
To access the data in the second row:
test[1][0]
Your example:
if test[0][0] == 1:
    ...
I hope that helps.
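Putting that together, a minimal self-contained sketch (the TEST table and its values here are made up for illustration):
import sqlite3

# Stand-in for the existing database and TEST table from the question.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE TEST (MESSAGE INTEGER)")
db.executemany("INSERT INTO TEST VALUES (?)", [(1,), (2,)])

# fetchall() materialises the result into a list of tuples you can index and reuse.
rows = db.execute("SELECT MESSAGE FROM TEST").fetchall()

if rows[0][0] == 1:
    print("the first row's MESSAGE is 1")

for row in rows:
    print(row[0])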

Deleting a row in Accumulo using Pyaccumulo

I'm trying to delete a row using the RowDeletingIterator. I'm running Accumulo 1.5.0. Here's what I have:
writer = conn.create_batch_writer("my_table")
mut = Mutation("1234")
mut.put(cf="", cq="", cv="", is_delete=True)
writer.add_mutation(mut)
writer.close()
for r in conn.scan("my_table", scanrange=Range(srow="1234", erow="1234"), iterators=[RowDeletingIterator()]):
    print(r)
conn.close()
I'm printing the record to verify that the scanner is scanning the appropriate records. Sadly they don't appear to be getting deleted. I'd appreciate any insight as Pyaccumulo's docs aren't the best.
I'm aware that there's a bug (ACCUMULO-1800) that requires one to use timestamps when deleting over Thrift, but when I specify a ts field, I just see a blank record in addition to the existing ones.
I'm not very familiar with Pyaccumulo, but I can tell you that the RowDeletingIterator doesn't use normal deletion entries. It just uses a key with empty column family, qualifier and visibility and a value of DEL_ROW to indicate that an entire row should be deleted. I would try not setting is_delete=True and giving the entry a value of DEL_ROW and see if that does what you want.
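A sketch of that suggestion, reusing the connection from the question (the val keyword name is an assumption about pyaccumulo's Mutation.put signature, so double-check it against your version):
from pyaccumulo import Mutation

writer = conn.create_batch_writer("my_table")
mut = Mutation("1234")
# Write the RowDeletingIterator's sentinel value instead of a normal delete marker.
mut.put(cf="", cq="", cv="", val="DEL_ROW")
writer.add_mutation(mut)
writer.close()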
There should be a delete_rows method in pyaccumulo.
def delete_rows(self, table, srow, erow):
    self.client.deleteRows(self.login, table, srow, erow)
You can use it like this:
conn.delete_rows(table, srow, erow)

What's the most efficient way to get this information from the database?

Related to this question:
Wildcards in column name for MySQL
Basically, there are going to be a variable number of columns with the name "word" in them.
What I want to know is, would it be faster to do a separate database call for each row (via getting the column information from the information schema), with a generated Python query per row, or would it be faster to simply SELECT *, and only use the columns I needed? Is it possible to say SELECT * NOT XYZ? As far as I can tell, no, there is no way to specifically exclude columns.
There aren't going to be many different rows at first - only three. But there's the potential for infinite rows in this. It's basically dependent on how many different types of search queries we want to put on our website. Our whole scalability is based around expanding the number of rows.
If all you are doing is limiting the number of columns returned there is no need to do a dynamic query. The hard work for the database is in selecting the rows matching your WHERE clause; it makes little difference to send you 5 columns out of 10, or all 10.
Just use a "SELECT * FROM ..." and use Python to pick out the columns from the result set. You'll use just one query to the database, so MySQL only has to work once, then filter out your columns:
cursor.execute('SELECT * FROM ...')
cols = [i for i, col in enumerate(cursor.description) if col[0].startswith('word')]
for row in cursor:
    columns = [row[c] for c in cols]
You may have to use for row in cursor.fetchall() instead depending on your MySQL python module.
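If it helps to keep the column names around as well, a small variation on the same idea after running the same SELECT * query as above (purely illustrative):
# Remember both the position and the name of each word* column.
word_cols = [(i, col[0]) for i, col in enumerate(cursor.description)
             if col[0].startswith('word')]

for row in cursor.fetchall():
    record = {name: row[i] for i, name in word_cols}
    # record now maps each word* column name to its value for this row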

How to insert several thousand columns into sqlite3?

Similar to my last question, but I ran into a problem. Say I have a simple dictionary like the one below, but big. When I try inserting a big dictionary using the method below, I get an OperationalError on c.execute(schema) for having too many columns. So what should my alternative be for populating the database's columns? Using the ALTER TABLE command and adding each one individually?
import sqlite3
con = sqlite3.connect('simple.db')
c = con.cursor()
dic = {
    'x1': {'y1': 1.0, 'y2': 0.0},
    'x2': {'y1': 0.0, 'y2': 2.0, 'joe bla': 1.5},
    'x3': {'y2': 2.0, 'y3 45 etc': 1.5}
}
# 1. Find the unique column names.
columns = set()
for _, cols in dic.items():
    for key, _ in cols.items():
        columns.add(key)
# 2. Create the schema.
col_defs = [
    # Start with the column for our key name
    '"row_name" VARCHAR(2) NOT NULL PRIMARY KEY'
]
for column in columns:
    col_defs.append('"%s" REAL NULL' % column)
schema = "CREATE TABLE simple (%s);" % ",".join(col_defs)
c.execute(schema)
# 3. Loop through each row
for row_name, cols in dic.items():
    # Compile the data we have for this row.
    col_names = cols.keys()
    col_values = [str(val) for val in cols.values()]
    # Insert it.
    sql = 'INSERT INTO simple ("row_name", "%s") VALUES ("%s", "%s");' % (
        '","'.join(col_names),
        row_name,
        '","'.join(col_values)
    )
    c.execute(sql)
If I understand you right, you're not trying to insert thousands of rows, but thousands of columns. SQLite has a limit on the number of columns per table (by default 2000), though this can be adjusted if you recompile SQLite. Never having done this, I do not know if you then need to tweak the Python interface, but I'd suspect not.
You probably want to rethink your design. Any non-data warehouse / OLAP application is highly unlikely to need or be terribly efficient with thousands of columns (rows, yes) and SQLite is not a good solution for a data warehouse / OLAP type situation. You may get a bit further with something like an entity-attribute-value setup (not a normal recommendation for genuine relational databases, but a valid application data model and much more likely to accommodate your needs without pushing the limits of SQLite too far).
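For reference, a minimal sketch of the entity-attribute-value idea using the dictionary from the question (the table and column names are illustrative):
import sqlite3

con = sqlite3.connect('simple.db')
c = con.cursor()

# One narrow table instead of thousands of wide columns:
# each (row_name, attribute) pair becomes its own row.
c.execute("""
    CREATE TABLE IF NOT EXISTS simple_eav (
        row_name  TEXT NOT NULL,
        attribute TEXT NOT NULL,
        value     REAL,
        PRIMARY KEY (row_name, attribute)
    )
""")

dic = {
    'x1': {'y1': 1.0, 'y2': 0.0},
    'x2': {'y1': 0.0, 'y2': 2.0, 'joe bla': 1.5},
    'x3': {'y2': 2.0, 'y3 45 etc': 1.5}
}

rows = [(row_name, attr, val)
        for row_name, cols in dic.items()
        for attr, val in cols.items()]
c.executemany("INSERT INTO simple_eav VALUES (?, ?, ?)", rows)
con.commit()
Querying a single "column" then becomes a WHERE attribute = ? filter, and adding a new attribute never changes the schema.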
If you really are adding a massive number of rows and are running into problems, maybe your single transaction is getting too large.
Do a COMMIT (commit()) after a given number of lines (or even after each insert as a test) if that is acceptable.
Thousands of rows should be easily doable with SQLite. Getting to millions and above, at some point you might need something more; it depends on a lot of things, of course.
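As a rough sketch of that batching idea, reusing the simple table and dic from the question (the batch size is arbitrary, and parameter binding is used here instead of building the SQL string by hand):
BATCH = 500  # illustrative; tune to taste

for i, (row_name, cols) in enumerate(dic.items(), start=1):
    col_list = ", ".join('"%s"' % name for name in cols)
    placeholders = ", ".join("?" for _ in cols)
    c.execute(
        'INSERT INTO simple ("row_name", %s) VALUES (?, %s)' % (col_list, placeholders),
        [row_name, *cols.values()],
    )
    if i % BATCH == 0:
        con.commit()  # keep each transaction to a manageable size

con.commit()  # flush the final partial batch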
