I am trying to export a Cassandra table to CSV format using Python, but I couldn't do it. However, I am able to execute a SELECT statement from Python. I have used the following code:
from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect('chandan')  # 'chandan' is the name of the keyspace

# the name of the table is 'emp'
session.execute("""COPY emp (id, name) TO 'E:\HANA\emp.csv' WITH HEADER = true""")
print "Exported to the CSV file"
Please help me in this regard.
This is not working for you because COPY is not a part of CQL.
COPY is a cqlsh-only tool.
You can invoke this via command line or script by using the -e flag:
cqlsh 127.0.0.1 -u username -p password -e "copy chandan.emp (id,name) to 'E:\HANA\emp.csv' with HEADER = true"
Edit 20170106:
export Cassandra table to CSV format using Python
Essentially... How do I export an entire Cassandra table?
I get asked this a lot. The short answer...is DON'T.
Cassandra is best used to store millions or even billions of rows. It can do this because it distributes its load (both operational and size) over multiple nodes. What it's not good at are things like deletes, in-place updates, and unbound queries. I tell people not to do things like full exports (unbound queries) for a couple of reasons.
First of all, running an unbound query on a large table in a distributed environment is usually a very bad idea (introducing LOTS of network time and traffic into your query). Secondly, you're taking a large result set that is stored on multiple nodes, and condensing all of that data into a single file...probably also not a good idea.
Bottom line: Cassandra is not a relational database, so why would you treat it like one?
That being said, there are tools out there designed to handle things like this; Apache Spark being one of them.
Please help me to execute the query with session.execute() statement.
If you insist on using Python, then you'll need to do a few things. For a large table, you'll want to query by token range. You'll also want to do that in small batches/pages, so that you don't tip-over your coordinator node. But to keep you from re-inventing the wheel, I'll tell you that there already is a tool (written in Python) that does exactly this: cqlsh COPY
In fact the newer versions of cqlsh COPY have features (PAGESIZE and PAGETIMEOUT) that allow it to avoid timeouts on large data sets. I have used the new cqlsh to successfully export 370 million rows before, so I know it can be done.
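For completeness, if you do go the session.execute() route, here is a minimal paging sketch using the DataStax cassandra-driver. It is a sketch, not a drop-in tool: the keyspace, table, and column names are taken from the question above, the contact point and fetch_size are illustrative, and it simply pages through the whole table rather than splitting the work by token range.

import csv
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('chandan')

# fetch_size controls how many rows the driver pulls per page
query = SimpleStatement("SELECT id, name FROM emp", fetch_size=1000)

with open('emp.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'name'])
    for row in session.execute(query):  # the driver fetches the next page as it is needed
        writer.writerow([row.id, row.name])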
Summary: Don't re-invent the wheel. Write a script that uses cqlsh COPY, and leverages all of those things I just talked about.
Related
I am new to SQL. I am working on a research project; we have years' worth of data from different sources, summing up to hundreds of terabytes. I currently have the data parsed as Python data frames. I need help to literally set up SQL from scratch, and I also need help to compile all our data into a SQL database. Please tell me everything I need to know about SQL as a beginner.
Probably the easiest is to get started with one of the free RDBMS options: MySQL (https://www.mysql.com/) or PostgreSQL (https://www.postgresql.org/).
Once you've got that installed and configured, and have created the tables you wish to load, you can go with one of two routes to get your data in.
Either you can install the appropriate Python libraries to connect to the server you've installed and then INSERT the data in (see the sketch after these two options).
Or, if there is a lot of data, look at dumping the data out into a flat file (.csv) and then use the bulk loader to push it into your tables (this is more hassle, but for larger data sets it will be faster).
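Since you mentioned you already have the data parsed as Python data frames, a minimal sketch of the first route might use pandas together with SQLAlchemy (assuming both are installed; the connection string, table name, and columns here are purely illustrative):

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string; swap in the MySQL or PostgreSQL server you installed
engine = create_engine("postgresql+psycopg2://user:password@localhost/research")

# stand-in for one of your parsed data frames
df = pd.DataFrame({"source": ["a", "b"], "value": [1.2, 3.4]})

# creates the table if it doesn't exist, otherwise appends to it
df.to_sql("measurements", engine, if_exists="append", index=False)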
Everything explained above is in the context of an ETL process. I have a git repository full of SQL files. I need to put all those SQL files (once pulled) into a SQL table with two columns, name and query, so that I can access each file later on using a SQL query instead of loading it from the file path. How can I do this? I am free to use whatever tool I want, but I only know Python and Pentaho.
Maybe the assumption that this method would require less computation time than simply accessing the pulled file on the hard drive is wrong. In that case, let me know.
You can first define the table you're interested in using something along the lines of the following (you did not mention which database you are using):
CREATE TABLE queries (
    name TEXT PRIMARY KEY,
    query TEXT
);
After creating the table, you can use, for example, os.walk to iterate through the files in your repository, and insert both the contents (e.g. file.read()) and the name of each file into the table you created previously.
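A minimal sketch of that idea, using sqlite3 purely for illustration (since the database wasn't specified) and a hypothetical repository path:

import os
import sqlite3

conn = sqlite3.connect("queries.db")
conn.execute("CREATE TABLE IF NOT EXISTS queries (name TEXT PRIMARY KEY, query TEXT)")

repo_path = "/path/to/your/pulled/repo"  # hypothetical; point this at the pulled repository
for root, dirs, files in os.walk(repo_path):
    for filename in files:
        if filename.endswith(".sql"):
            with open(os.path.join(root, filename)) as f:
                # store the file name and its full contents in the table
                conn.execute(
                    "INSERT OR REPLACE INTO queries (name, query) VALUES (?, ?)",
                    (filename, f.read()),
                )
conn.commit()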
It sounds like you're trying to solve a different problem though. It seems like you're interested in speeding up some process, because you asked about whether accessing queries using a table would be faster than opening a file on disk. To investigate that (separate!) question further, see this.
I would recommend that you profile the existing process you are trying to speed up using profiling tools. After that, you can see whether IO is your bottleneck. Otherwise, you may do all of this work without any benefit.
As a side note, if you are looking up queries in this way, it may indicate that you need to rearchitect your application. Please consider that possibility as well.
What method do you use to version-control your database? I've committed all our database tables as separate .sql scripts to our respository (mercurial). In that way, if any member of the team makes a change to the employee table, say, I will immediately know which particular table has been modified when I updated my repository.
Such a method was described in: What are the best practices for database scripts under code control.
Presently, I'm writing a Python script to execute all the .sql files within the database folder; however, foreign-key constraints mean we can't just run the .sql files in any order.
The Python script will generate a file listing the order in which to execute the .sql files, and will then execute them in the order in which they appear in that tableorder.txt file. A table's script cannot be executed until the scripts for its foreign-key tables have been executed, for example:
tableorder.txt
table3.sql
table1.sql
table7.sql and so on
I have already generated the dependency list for each table, in code, by parsing the result of MySQL's "show create table" command. The dependency list may look like this:
tblstate: tblcountry //tblcountry.sql must be executed before tblstate.sql etc
tblemployee: tbldepartment, tblcountry
To generate the content of tableorder.txt, I will need an algorithm that looks something like this:
function print_table(table):
    if table.dependencies.count > 0:
        foreach dependency in table.dependencies:
            print_table(dependency)  // print dependencies first
    print table to tableorder.txt
end function

foreach table in database:
    print_table(table)
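In concrete terms, a minimal Python rendering of that pseudocode might look like the following (the dependency map is just the example above, written out by hand; in practice it would come from parsing the "show create table" output):

# table -> tables it depends on (hypothetical, hand-written for the example)
deps = {
    "tblcountry": [],
    "tblstate": ["tblcountry"],
    "tbldepartment": [],
    "tblemployee": ["tbldepartment", "tblcountry"],
}

written = set()

def print_table(table, out):
    if table in written:
        return
    for dependency in deps[table]:  # write dependencies first
        print_table(dependency, out)
    out.write(table + ".sql\n")
    written.add(table)

with open("tableorder.txt", "w") as f:
    for table in deps:
        print_table(table, f)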
As you will imagine, this involves lots of recursion, and I'm beginning to wonder if it's worth the effort, or if there's a tool out there. What tool (or algorithm) is there to generate a list of the order in which to execute separate .sql tables and views, taking dependencies into consideration? Is it better to version-control a separate .sql file for each table/view, or to version-control the entire database in a single .sql file? I will appreciate any response, as this has taken so many days. Thanks.
I do not use MySQL, but rather SQL Server; however, this is how I version my database:
(This is long, but in the end I hope the reasoning for me abandoning a simple schema dump as the primary way to handle database versioning is made apparent.)
I make a modification to the schema and apply it to a test database.
I generate delta change scripts and a dump of the schema after said scripts. (I use ApexSQL, but there are likely MySQL-specific tools to help.)
The delta change scripts know how to go from the current schema version to the target version: ALTER TABLE existing, CREATE TABLE new, DROP VIEW old, and so on. Multiple operations can occur within the same .sql file, as the delta is what matters.
The dump of the schema is of the target schema version: CREATE TABLE a, CREATE VIEW b, and so on. There is no ALTER or DROP here, because it is just a snapshot of the target schema. There is one .sql file per database object, as the schema is what matters.
I use RoundhousE to apply the delta change scripts. (I do not use the RoundhousE "anytime script" feature as this does not correctly handle relationships.)
I learned the hard way that applying database schema changes cannot be done reliably without a comprehensive step-by-step plan and, similarly (as noted in the question), the order of relationship dependencies is important. Just storing the "current" or "end" schema is not sufficient. There are many changes that cannot be retroactively applied as A->C without knowing A->B->C, and some changes B might involve migration logic or corrections. SQL schema change scripts can capture these changes and allow them to be "replayed".
However, just saving the delta scripts does not, by itself, provide a "simple view" of the target schema. This is why I also dump the full schema as well as the change scripts, and version both. The schema dump could, in theory, be used to construct the database, but due to relationship dependencies (the very kind noted in the question) it may take some work, so I do not use it as part of an automated schema-restore approach. Still, keeping the schema dump in the Hg version control allows quick identification of changes and a view of the target schema at a particular version.
The change deltas thus move forward through the revisions while the schema dump provides a view at the current revision. Because the change deltas are incremental and forward-only it is important to keep the branch dealing with these changes "clean", which is easy to do with Hg.
In one of my projects I am currently at database change number 70 - and happy and productive! - after switching to this setup. (And these are deployed changes, not just development changes!)
Happy coding.
You can use Sqitch. Here is a tutorial for MySQL, but it is actually database-agnostic.
Changes are implemented as scripts native to your selected database engine... Database changes may declare dependencies on other changes—even on changes from other Sqitch projects. This ensures proper order of execution, even when you’ve committed changes to your VCS out-of-order... Change deployment is managed by maintaining a plan file. As such, there is no need to number your changes, although you can if you want. Sqitch doesn’t much care how you name your changes... Up until you tag and release your application, you can modify your change deployment scripts as often as you like. They’re not locked in just because they’ve been committed to your VCS. This allows you to take an iterative approach to developing your database schema. Or, better, you can do test-driven database development.
I'm not sure how well this answers your question, but I tend to just use mysqldump (part of the standard installation). This gives me the sql to create the tables and populate them, effectively serializing the database. Example:
> mysqldump -u username -p yourdatabase > database_dump.sql
To load a database from a dump sql file:
mysql -u username -p -e "source /path/to/database_dump.sql"
To further answer your question: I would version-control each table separately only if multiple people are working on the database in a way that makes conflicts likely with a single version-controlled dump. I've never hit a project where this is the case (the database tends to be one of the least volatile parts of the system after the initial phases of a project), so I just version-control the database dump as a whole rather than each table individually.
I understand the problem, but you cannot version-control databases with git as if they were static code, since that does not work. In the same way, it is not very useful to generate separate files for each programmer, because, as you say, they either collide or lose traceability. I started a project similar to what you describe, but it became one more huge problem when trying to keep control over versions and over collisions between programmers. The solution I arrived at is to build a project that maintains the following structure:
Web login (username/password).
Administration of users and profiles, defining what each user can do.
Commit area: the agreed, current command is sent to the database.
Example: ALTER TABLE users ADD por2 varchar(255);
The commit creates traceability in the control system itself, and the structure is sent to git (starting from the initial structure) for change control.
Change-control area: a view of the commit itself plus the structure generated after the change.
Server configuration area: the server is configured and a GitLab or GitHub repository is attached to it, so version control can be followed more visually without causing problems for developers.
Backup restoration area: send a backup and keep track of each version, i.e. the result of each change to the database structure.
This is the best handling I found without leaving the job to one specific person. I hope it helps you. I built it in Python, using Django, which saves a lot of programming on the administrative side. Greetings.
Is it possible to set up tables for MySQL in Python?
Here's my problem: I have a bunch of .txt files which I want to load into a MySQL database. Instead of creating the tables manually in phpMyAdmin, is it possible to do the following things entirely in Python?
Create the tables, including data type definitions.
Load many files one by one. I only know the LOAD DATA LOCAL INFILE command, which loads a single file.
Many thanks
Yes, it is possible. You'll need to read the data from the files using the csv module:
http://docs.python.org/library/csv.html
And then insert the data using a Python MySQL binding. Here is a good starter tutorial:
http://zetcode.com/databases/mysqlpythontutorial/
If you already know Python, it will be easy.
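A minimal sketch putting those two pieces together, assuming mysql-connector-python is installed (the credentials, table layout, file pattern, and two-column file format are all illustrative):

import csv
import glob
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="password", database="mydb")
cur = conn.cursor()

# 1. create the table, including data type definitions
cur.execute("""
    CREATE TABLE IF NOT EXISTS measurements (
        id INT AUTO_INCREMENT PRIMARY KEY,
        label VARCHAR(255),
        value DOUBLE
    )
""")

# 2. load many files one by one
for path in glob.glob("data/*.txt"):
    with open(path, newline="") as f:
        rows = [(label, float(value)) for label, value in csv.reader(f)]  # assumes two columns per line
    cur.executemany("INSERT INTO measurements (label, value) VALUES (%s, %s)", rows)

conn.commit()
conn.close()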
It is. Typically what you want to do is use an Object-Relational Mapping (ORM) library.
Probably the most widely used in the Python ecosystem is SQLAlchemy, but there is a lot of magic going on in it, so if you want to keep tighter control over your DB schema, or if you are learning about relational DBs and want to follow along with what the code does, you might be better off with something lighter like Canonical's Storm.
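For example, a minimal SQLAlchemy sketch (1.4+ style, assuming a MySQL driver such as pymysql is installed; the connection URL, table, and columns are illustrative) looks like this:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Record(Base):
    __tablename__ = "records"
    id = Column(Integer, primary_key=True)
    label = Column(String(255))

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
Base.metadata.create_all(engine)  # creates the table, data types included

with Session(engine) as session:
    session.add(Record(label="a row from one of the .txt files"))
    session.commit()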
EDIT: Just thought to add: the reason to use ORMs is that they provide a very handy way to manipulate data and interface with the DB. But if all you will ever want to do is write a script to convert textual data to MySQL tables, then you might get along with something even easier. Check the tutorial linked from the official MySQL website, for example.
HTH!
I will be writing a little Python script tomorrow to retrieve all the data from an old MS Access database into a CSV file first, and then, after some data cleansing, munging, etc., I will import the data into a MySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSV because it is quite simple and straightforward (and I am a Python newbie), but I would like to hear from someone who may have done something similar before.
Memory usage for csv.reader and csv.writer isn't proportional to the number of records, as long as you iterate correctly and don't try to load the whole file into memory; that's one reason the iterator protocol exists. Similarly, csv.writer writes directly to disk and isn't limited by available memory. You can process any number of records this way without hitting memory limits.
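As a minimal sketch of that incremental pattern (file names and the cleaning step are illustrative), rows flow from reader to writer one at a time, so memory use stays flat no matter how many records there are:

import csv

with open("access_export.csv", newline="") as src, \
     open("cleaned.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:  # only one row is held in memory at a time
        writer.writerow([field.strip() for field in row])  # stand-in for real cleansing/munging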
For simple data structures, CSV is fine. It's much easier to get fast, incremental access with CSV than with more complicated formats like XML (tip: pulldom is painfully slow).
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be operating system file size.
That said, make sure that when you send the data to the new database, you write it a few records at a time; I've seen people try to load the entire file into memory first and then write it all out.
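Since the question mentions pyodbc, here is a minimal batched sketch of that "a few records at a time" advice, assuming pyodbc (with the Access ODBC driver) and mysql-connector-python are installed; the driver string, credentials, and the table/column names borrowed from the earlier answer are illustrative:

import pyodbc
import mysql.connector

access = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\data\old.accdb"
)
mysql_conn = mysql.connector.connect(host="localhost", user="user",
                                     password="password", database="newdb")

src = access.cursor()
dst = mysql_conn.cursor()

src.execute("SELECT field1, field2, field3 FROM AccessTable")
while True:
    batch = src.fetchmany(500)  # pull a few hundred records at a time
    if not batch:
        break
    dst.executemany(
        "INSERT INTO MySqlTable (field1, field2, field3) VALUES (%s, %s, %s)",
        [tuple(row) for row in batch],
    )
    mysql_conn.commit()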