How to perform defragmentation on a Cassandra table - Python

I am playing around with Python and some NoSQL databases to build a file store (mainly because of the built-in replication). I tried it with MongoDB and it works, but due to the "write greedy" nature of MongoDB I moved to Cassandra and implemented the same thing. While it works, I want to know (pointing me to docs is fine) how to defragment the data in Cassandra. An example: say I upload a 200 MB file, then a 20 MB file; the data size in Cassandra is now ~220 MB. If I go and delete the 200 MB file, I still see the data size at ~200 MB, so that space is not reclaimed. MongoDB has a command to reclaim that space (reuse it for new files), and I want to know how the same can be achieved in Cassandra. I am also getting confused between compression and compaction.
To store the data I am splitting each file into parts and storing them as "blob" values in a table.
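For reference, a minimal sketch of the chunk-and-store approach described above, assuming the DataStax cassandra-driver package; the keyspace, table and column names are made up for the example:

from cassandra.cluster import Cluster

CHUNK_SIZE = 1024 * 1024  # store the file in 1 MB parts

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("filestore")  # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO file_chunks (file_name, chunk_no, data) VALUES (?, ?, ?)"
)

with open("upload.bin", "rb") as f:
    chunk_no = 0
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        session.execute(insert, ("upload.bin", chunk_no, chunk))  # bytes map to blob
        chunk_no += 1

cluster.shutdown()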

Cassandra cleans up deleted and expired data using a process called compaction.
While you can force compactions yourself using nodetool compact, I would not recommend this as it is better to tune compaction and let it happen in the background.
That may not completely do the trick, as Cassandra has a table property named 'gc_grace_seconds' which prevents data marked as deleted (with a tombstone) from being purged until gc_grace_seconds has passed. The default is 10 days, but you can configure a smaller value, or even set it to 0 so that tombstones become eligible for removal immediately (be careful with 0 on a multi-node cluster, since a node that missed the delete can resurrect the data).
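If you want to lower gc_grace_seconds from Python rather than from cqlsh, a minimal sketch with the DataStax cassandra-driver might look like this (the keyspace and table names are assumptions):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("filestore")  # hypothetical keyspace

# Tombstones on this table become purgeable after 1 hour instead of 10 days;
# the space is reclaimed the next time the SSTables holding the data are compacted.
session.execute("ALTER TABLE file_chunks WITH gc_grace_seconds = 3600")

cluster.shutdown()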

Related

Need code for importing .csv file via python or ruby code to Cassandra 3.11.3 DB (Production use)

We have a 7-node Cassandra 3.11.3 production cluster. We get a dump of ticket details as a .csv file on a mid server, and I need to read this .csv file and import its data into a Cassandra table. I tried Ruby code, which was easy for me to write, but it does not handle all the column values (the .csv contains special characters, embedded newlines, UTF encoding issues, and very long text descriptions, as it comes from a ticketing tool), and the data varies from row to row.
I want to know whether Ruby or Python is a good fit for this task in production, and whether anyone has sample code that mitigates the issues mentioned above and is suitable for a production environment.
Both Ruby and Python are perfectly suited to this kind of task, but if your source file is in a bad format then any tool could fail - there is no magic-button tool that can deduce the context from a broken data file and fix all of the problems for you automatically.
I'd suggest splitting the task into two parts: 1) fix the encoding and data quality problems (and perform any data transformations, if necessary), and then 2) import the clean data.
Task 2 can be done with almost any programming language that has a Cassandra driver available, but if you have a well-formatted CSV source you probably don't need any custom code at all (depending on the use case, of course) - cqlsh supports a COPY ... FROM command that imports data from CSV directly (https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html).
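If you do end up writing the import in Python, a rough sketch with the DataStax cassandra-driver could look like the following; the keyspace, table, column names and file path are placeholders, and it assumes the CSV has already been cleaned and is properly quoted:

import csv
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])
session = cluster.connect("tickets_ks")  # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO tickets (ticket_id, opened_at, description) VALUES (?, ?, ?)"
)

with open("tickets_clean.csv", newline="", encoding="utf-8") as f:
    # csv.DictReader copes with embedded newlines as long as fields are quoted
    for row in csv.DictReader(f):
        session.execute(insert, (row["ticket_id"], row["opened_at"], row["description"]))

cluster.shutdown()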

How can I create an auto-delete function for CouchDB in Python

I'm new to Python and CouchDB and I'm trying to write an auto-trading script. As such, I only care about the most recent 400 documents in a given database/table (trading data). To fill in a few blanks:
I have one Python program that is reading FOREX trading data and writing summary statistics to a CouchDB database. That thing runs every 20 seconds and works great. I'm just creating some large tables (that I don't need), at this point.
I have another Python program that is going to read the top 400 records from that table. At the tail end of this program, I'd like to do some kind of auto-purge that will delete anything older than the top 400 documents.
I have some flexibility in how I do this, as this is just a pet project to learn some new programming technologies. I'm assuming this can be solved with some combination of _id = epoch time and views, but I just want something that is bare-bones easy.
Any suggestions?
As the number of docs remaining is so small compared to the ones you want to purge, I would try a selective replication to a new database: POST to _replicate and pass the 400 relevant ids in the doc_ids array (http://docs.couchdb.org/en/1.6.1/api/server/common.html#post--_replicate).
Then just swap over to the new database and delete the old one. This will be very fast and very robust, as CouchDB only needs to delete the old database and index files instead of finding and deleting thousands of old docs.
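A rough sketch of that approach using the requests library; the server URL and database names are placeholders, and keep_ids stands in for the 400 ids you want to keep:

import requests

COUCH = "http://localhost:5984"
keep_ids = ["doc-id-1", "doc-id-2"]  # in practice, the 400 newest _ids

requests.put(f"{COUCH}/trading_data_new")  # create the target database
requests.post(
    f"{COUCH}/_replicate",
    json={
        "source": "trading_data",
        "target": "trading_data_new",
        "doc_ids": keep_ids,
    },
)

# once the app points at trading_data_new, drop the old database
requests.delete(f"{COUCH}/trading_data")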

Something wrong with using CSV as database for a webapp?

I am using Flask to make a small webapp to manage a group project; on this site I need to track attendance and also meeting reports. I don't have the time to get into SQLAlchemy, so I would like to know what the drawbacks of using CSV as a database might be.
Just don't do it.
The problem with CSV is …
a, concurrency is not possible: What this means is that when two people access your app at the same time, there is no way to make sure that they don't interfere with each other by making changes to each other's data. There is no way to solve this when using a CSV file as a backend.
b, speed: Whenever you make changes to a CSV file, you need to reload and rewrite more or less the whole file. Parsing the file eats up both memory and time.
Databases were made to solve these issues.
I agree however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for is an ORM - an object-relational mapper - which translates Python code into SQL and manages the SQL database for you.
Two worth looking at are Peewee and Pony ORM. Both are easy to use and translate between Python and SQL in both directions. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend Peewee. You can start with SQLite as the backend and, if your app grows larger, plug in MySQL or PostgreSQL easily.
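A minimal Peewee sketch with SQLite as the backend, just to show how little code it takes; the model and field names are made up for the example:

import datetime
from peewee import SqliteDatabase, Model, CharField, DateField, BooleanField

db = SqliteDatabase("project.db")

class Attendance(Model):
    member = CharField()
    meeting_date = DateField()
    present = BooleanField(default=True)

    class Meta:
        database = db

db.connect()
db.create_tables([Attendance])
Attendance.create(member="alice", meeting_date=datetime.date(2024, 10, 1), present=True)
db.close()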
Don't do it, CSV that is.
There are many other possibilities, for instance an SQLite database, Python's shelve module, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as PostgreSQL, for which there are a number of Python libraries.
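For example, sqlite3 ships with Python and needs no separate server; a tiny sketch, with table and column names chosen just for illustration:

import sqlite3

conn = sqlite3.connect("project.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS attendance (member TEXT, meeting_date TEXT, present INTEGER)"
)
conn.execute("INSERT INTO attendance VALUES (?, ?, ?)", ("alice", "2024-10-01", 1))
conn.commit()

for row in conn.execute("SELECT * FROM attendance"):
    print(row)

conn.close()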
I think there's nothing wrong with it as long as you abstract away from it, i.e. make sure you have a clean separation between what you write and how you implement it. That will bloat your code a bit, but it will make sure you can swap out your CSV storage in a matter of days.
I.e. pretend that you can persist your data as if you're keeping it in memory. Don't write "openCSVFile" in your Flask app; use "initPersistence()". Don't write "csvFile.appendRecord()"; use "persister.saveNewReport()". When and if you actually find CSV to be a bottleneck, you can just write a new persister plugin.
There are added benefits: for example, you don't have to use a mock library in tests to make them faster - you just provide another persister.
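To illustrate the idea, here is a toy persister whose interface the Flask code would depend on; the class and method names are invented for the example, and a later SQLite-backed persister would simply implement the same two methods:

import csv

class CsvPersister:
    def __init__(self, path):
        self.path = path

    def save_new_report(self, meeting_date, summary):
        # append one report; the rest of the app never sees the word "CSV"
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([meeting_date, summary])

    def all_reports(self):
        with open(self.path, newline="") as f:
            return list(csv.reader(f))

persister = CsvPersister("reports.csv")
persister.save_new_report("2024-10-01", "Kickoff meeting notes")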
I am absolutely baffled by how many people discourage using CSV as a database storage back-end format.
Concurrency: There is NO reason why CSV cannot be used with concurrency. Just as one database thread can write to one area of a binary file while another thread writes to a different area of the same file, a database can do EXACTLY the same thing with CSV files. And just as a journal is used to maintain the atomic nature of individual transactions, the exact same thing can be done with CSV.
Speed: Why on earth would a database read and write a WHOLE file at a time, when it can do what it does for ALL other storage formats: look up the starting byte of a record in an index file, SEEK to it in constant time, overwrite the data, comment out anything left over, and record the free space in a separate index file for later use - just as it could zero out the bytes of an unneeded area of a binary "row" and record the free space in a separate index file. I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments, etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be read quickly in an emergency... which is the primary reason I never use binary storage for essential data that must stay accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you changed it in a spreadsheet program, but that is no different from having to re-index a binary database after the index or table gets corrupted, deleted, or out of sync.
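For what it's worth, the byte-offset idea sketched above is easy to try in Python; this toy version indexes line starts and reads a single row back with seek(), ignoring quoting for brevity:

offsets = []
with open("data.csv", "rb") as f:
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)  # byte offset where each row starts

def read_row(n):
    with open("data.csv", "rb") as f:
        f.seek(offsets[n])  # constant-time jump to the row
        return f.readline().decode("utf-8").rstrip("\r\n").split(",")

print(read_row(2))  # naive comma split; a real reader would handle quoted fields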

Speeding up document processing and loading into database

I have a few million documents. What I am trying to do is simple, process the documents to extract the information I need and load it into a database. I am doing it in Python and using SQLAlchemy. Also I am using multiprocessing to make use of all the cores on my machine. The documents are XML with huge chunks of text. The database is MySQL with a custom relation schema defined.
However, it runs very slow and loads only about 50k documents in 6-7 hours.
Is there any way that I can speed this task up?
Sometimes an RDBMS is not the answer. One sign of such a situation is when your data items have no relations to one another, for example when every document stands by itself.
If you'd like to make some unstructured data searchable, consider building a searchable index using PyLucene,
or maybe put the data in a non-relational database like MongoDB.
In any case, try to identify which part of your system is slowing the process down. My guess would be the database or the file system; if it is MySQL, all you can really do is throw more hardware at it.
Another way to optimize a system that uses IO heavily is to switch to async programming with a library like Twisted, but that has a learning curve, so make 100% sure it's needed first.
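A quick, low-tech way to see where the time goes is to time the two stages separately on a sample of documents; parse_document() and load_rows() below are placeholders for your own extraction and SQLAlchemy code:

import glob
import time

def parse_document(path):  # placeholder for your XML extraction step
    return []

def load_rows(rows):  # placeholder for your SQLAlchemy insert step
    pass

parse_s = load_s = 0.0
for path in glob.glob("sample_docs/*.xml"):
    t0 = time.perf_counter()
    rows = parse_document(path)
    parse_s += time.perf_counter() - t0

    t0 = time.perf_counter()
    load_rows(rows)
    load_s += time.perf_counter() - t0

print("parsing: %.1fs, loading: %.1fs" % (parse_s, load_s))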

is there a limit to the (CSV) filesize that a Python script can read/write?

I will be writing a little Python script tomorrow to retrieve all the data from an old MS Access database into a CSV file first, and then, after some data cleansing, munging etc., I will import the data into a MySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSV because it is quite simple and straightforward (and I am a Python newbie) - but
I would like to hear from someone who may have done something similar before.
Memory usage for csv.reader and csv.writer isn't proportional to the number of records, as long as you iterate over the rows and don't try to load the whole file into memory. That's one reason the iterator protocol exists. Similarly, csv.writer writes directly to disk; it isn't limited by available memory. You can process any number of records with these without running into memory limits.
For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than to more complicated formats like XML (tip: pulldom is painfully slow).
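A sketch of the row-at-a-time pattern described above; nothing holds the whole file in memory, so half a million rows is not a problem (the file names and the cleanup step are just examples):

import csv

with open("access_dump.csv", newline="", encoding="utf-8") as src, \
     open("cleaned.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:  # one row in memory at a time
        writer.writerow([field.strip() for field in row])  # example cleanup step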
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be operating system file size.
That said, make sure that when you send the data to the new database, you write it a few records at a time; I've seen people try to load the entire file into memory first and then write it all out.
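A sketch of the "few records at a time" approach with pyodbc on both ends (Access directly, MySQL through an ODBC DSN); the connection strings, table and column names are placeholders:

import pyodbc

src = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\data\old.mdb"
)
dst = pyodbc.connect("DSN=mysql_target")  # hypothetical ODBC DSN for MySQL

read_cur = src.cursor()
write_cur = dst.cursor()
read_cur.execute("SELECT field1, field2, field3 FROM AccessTable")

while True:
    batch = read_cur.fetchmany(1000)  # stream in chunks, never the whole table
    if not batch:
        break
    write_cur.executemany(
        "INSERT INTO MySqlTable (field1, field2, field3) VALUES (?, ?, ?)",
        batch,
    )
    dst.commit()

src.close()
dst.close()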
