I am generating load test data in a Python script for Cassandra.
Is it better to insert directly into Cassandra from the script, or to write a CSV file and then bulk-load that into Cassandra?
This is for a couple million rows.
For a few million rows, I'd say just use CSV (assuming the rows aren't huge) and see if it works. If not, inserts it is :)
For more heavy-duty stuff, you might want to create SSTables and use sstableloader.
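For the CSV route, here is a minimal sketch of what the generation side could look like; the keyspace, table, and column names are made up for illustration, and the cqlsh COPY command is shown only as a comment:

import csv
import random
import uuid

# Generate the load-test rows and write them to a CSV file first.
# Table/column names here are placeholders, not from the question.
with open('load_test_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'type', 'value'])          # header row
    for _ in range(2_000_000):
        writer.writerow([uuid.uuid4(), 'sensor', random.random()])

# Then load it from cqlsh (not Python), e.g.:
#   COPY loadtest.results (id, type, value) FROM 'load_test_data.csv' WITH HEADER = TRUE;

If you do end up going the direct-insert route instead, the DataStax Python driver's prepared statements and execute_async are the usual way to keep that reasonably fast.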
I'm doing analysis on data from a MySQL database in Python. I query the database for about 200,000 rows of data, then analyze it in Python using Pandas. I often do many iterations over the same data, changing different variables, parameters, and such. Each time I run the program, I query the remote database (about a 10-second query), then discard the query results when the program finishes. I'd like to save the results of the last query in a local file, check each time I run the program whether the query is the same, and if so just use the saved results. I guess I could just write the Pandas dataframe to a CSV, but is there a better/easier/faster way to do this?
If for any reason the MySQL Query Cache doesn't help, then I'd recommend saving the latest result set either in HDF5 format or in Feather format. Both formats are pretty fast. You may find some demos and tests here:
https://stackoverflow.com/a/37929007/5741205
https://stackoverflow.com/a/42750132/5741205
https://stackoverflow.com/a/42022053/5741205
Just use pickle to write the dataframe to a file, and to read it back out ("unpickle").
https://docs.python.org/3/library/pickle.html
This would be the "easy way".
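As a rough sketch of that "easy way" (the cache directory, the query-hash check, and the helper name are my own additions, not from the answers above):

import hashlib
import os
import pandas as pd

def cached_query(query, con, cache_dir='.query_cache'):
    """Return the query result, reusing a pickled copy if the same SQL was run before."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the SQL text to decide whether the saved result still matches the query.
    cache_file = os.path.join(cache_dir, hashlib.md5(query.encode()).hexdigest() + '.pkl')
    if os.path.exists(cache_file):
        return pd.read_pickle(cache_file)
    df = pd.read_sql(query, con)   # the ~10 s remote query only happens on a cache miss
    df.to_pickle(cache_file)       # swap for df.to_hdf(...) or df.to_feather(...) if preferred
    return df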
First of all, I'm a total noob at this. I've been working on setting up a small GUI app to play with databases, mostly Microsoft Excel files with large numbers of rows. I want to be able to display a portion of the data, and I want to be able to choose the columns I'm working with through the menu so I can perform different tasks very efficiently.
I've been looking into .CSV files. I could create some sort of list or dictionary from them (not sure), or I could just import the Excel table into a database and then do whatever I need to with my GUI. Now my question is: for the type of task I just described, which of the two methods would be best suited? (Feel free to tell me if there is a better one.)
It will depend upon the requirements of your application and how you plan to extend or maintain it in the future.
A few points in favour of sqlite:
a standardized interface, SQL - with CSV you would have to write custom logic to select columns or filter rows
performance on bigger data sets - it might be difficult to load 10M rows of CSV into memory, whereas handling 10M rows in sqlite won't be a problem
sqlite3 is in the python standard library (but then, CSV is too)
That said, also take a look at pandas, which makes working with tabular data that fits in memory a breeze. Plus pandas will happily import data directly from Excel and other sources: http://pandas.pydata.org/pandas-docs/stable/io.html
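A small sketch of that combination, using pandas plus the standard-library sqlite3 module; the workbook, sheet, table, and column names are all illustrative:

import sqlite3
import pandas as pd

# Load the Excel sheet into a DataFrame (file/sheet names are placeholders).
df = pd.read_excel('measurements.xlsx', sheet_name='Sheet1')

# Push it into a local sqlite database so the GUI can query it with SQL.
with sqlite3.connect('app_data.db') as con:
    df.to_sql('measurements', con, if_exists='replace', index=False)

    # Later, select only the columns the user picked in the menu.
    subset = pd.read_sql('SELECT col_a, col_b FROM measurements LIMIT 50', con)
    print(subset)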
I have come across a problem and am not sure which would be the best suitable technology to implement it. Would be obliged if you guys can suggest me some based on your experience.
I want to load data from 10-15 CSV files, each of them fairly large, 5-10 GB. By load data I mean convert the CSV files to XML and then populate around 6-7 staging tables in Oracle using this XML.
The data needs to be populated such that the elements of the XML, and eventually the rows of the table, come from multiple CSV files. So, for example, an element A would have sub-elements whose data comes from CSV file 1, file 2, file 3, etc.
I have a framework built on top of Apache Camel and JBoss on Linux. Oracle 10g is the database server.
Options I am considering,
Smooks - however, the problem is that Smooks serializes one CSV at a time, and I can't afford to hold on to the half-baked Java beans until the other CSV files are read, since I run the risk of running out of memory given the sheer number of beans I would need to create and hold on to before they are fully populated and written to disk as XML.
SQL*Loader - I could skip the XML creation altogether and load the CSVs directly into the staging tables using SQL*Loader. But I am not sure whether I can (a) load multiple CSV files with SQL*Loader into the same tables, updating the records after the first file, and (b) apply some translation rules while loading the staging tables.
A Python script to convert the CSVs to XML (a rough sketch of this idea follows the list below).
SQL*Loader to load a different set of staging tables corresponding to the CSV data, and then stored procedures to load the actual staging tables from this new set (a path I want to avoid given the amount of change it would require to my existing framework).
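For the Python option, a very rough sketch of the merge-by-key idea; the join key, file names, column names, and element names are all hypothetical, and it assumes the lookup files fit in memory as dictionaries, which may not hold at 5-10 GB per file:

import csv
import xml.etree.ElementTree as ET

def index_by_key(path, key='id'):
    """Read one CSV into a dict keyed by a (hypothetical) shared join key."""
    with open(path, newline='') as f:
        return {row[key]: row for row in csv.DictReader(f)}

lookup2 = index_by_key('file2.csv')
lookup3 = index_by_key('file3.csv')

root = ET.Element('records')
with open('file1.csv', newline='') as f:
    for row in csv.DictReader(f):                      # stream the driving file row by row
        rec = ET.SubElement(root, 'A', id=row['id'])
        ET.SubElement(rec, 'fromFile1').text = row['some_col']
        ET.SubElement(rec, 'fromFile2').text = lookup2.get(row['id'], {}).get('other_col', '')
        ET.SubElement(rec, 'fromFile3').text = lookup3.get(row['id'], {}).get('third_col', '')

ET.ElementTree(root).write('staging.xml', encoding='utf-8', xml_declaration=True)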
Thanks in advance. If someone can point me in the right direction or give me some insights from his/her personal experience it will help me make an informed decision.
PS: The CSV files are fairly simple, with around 40 columns each. The depth of objects or relationships between the files would be around 2 to 3.
Unless you can use some full-blown ETL tool (e.g. Informatica PowerCenter, Pentaho Data Integration), I suggest the 4th solution - it is straightforward and the performance should be good, since Oracle will handle the most complicated part of the task.
In Informatica PowerCenter you can import/export XML files larger than 5 GB. As Marek's response suggests, try it, because it works pretty fast. Here is a brief introduction if you are unfamiliar with this tool.
Create a process/script that calls a procedure to load the CSV files into an external Oracle table, and another script to load that into the destination table.
You can also add cron jobs to call these scripts; they can keep track of incoming CSV files in the directory, process them, and move each file to an output/processed folder.
Exceptions can also be handled accordingly, by logging them or sending out an email. Good luck.
I will be writing a little Python script tomorrow to retrieve all the data from an old MS Access database into a CSV file first, and then, after some data cleansing, munging, etc., I will import the data into a MySQL database on Linux.
I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.
The db has IIRC well over half a million rows of data. My questions are:
Is the number of records a cause for concern? (i.e. Will I hit some limits)?
Is there a better file format for the transitory data (instead of CSV)?
I chose CSV because it is quite simple and straightforward (and I am a Python newbie), but I would like to hear from someone who may have done something similar before.
Memory usage for csv.reader and csv.writer isn't proportional to the number of records, as long as you iterate correctly and don't try to load the whole file into memory. That's one reason the iterator protocol exists. Similarly, csv.writer writes directly to disk; it's not limited by available memory. You can process any number of records with these without memory limitations.
For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than to more complicated formats like XML (tip: pulldom is painfully slow).
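To illustrate the streaming point, a sketch that pulls rows from Access via pyodbc and writes them to CSV one at a time; the connection string, table, and column names are placeholders:

import csv
import pyodbc

# Placeholder connection string for the Access ODBC driver.
conn = pyodbc.connect(
    r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\data\legacy.mdb'
)
cursor = conn.cursor()
cursor.execute('SELECT field1, field2, field3 FROM AccessTable')

with open('export.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['field1', 'field2', 'field3'])
    for row in cursor:              # rows are fetched and written incrementally,
        writer.writerow(row)        # so memory use stays flat regardless of row count
conn.close()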
Yet another approach if you have Access available ...
Create a table in MySQL to hold the data.
In your Access db, create an ODBC link to the MySQL table.
Then execute a query such as:
INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;
Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.
I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
The only limit should be operating system file size.
That said, when you send the data to the new database, make sure you're writing it a few records at a time; I've seen people try to load the entire file into memory first and then write it all at once.
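A sketch of the "few records at a time" idea, reading the CSV in chunks and pushing each chunk with executemany; the DSN, table name, and batch size are placeholders:

import csv
import itertools
import pyodbc

BATCH_SIZE = 1000                           # tune to taste; the point is to avoid one giant load

conn = pyodbc.connect('DSN=mysql_target')   # placeholder ODBC DSN for the MySQL database
cursor = conn.cursor()
insert_sql = 'INSERT INTO MySqlTable (field1, field2, field3) VALUES (?, ?, ?)'

with open('export.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                            # skip the header row
    while True:
        batch = list(itertools.islice(reader, BATCH_SIZE))
        if not batch:
            break
        cursor.executemany(insert_sql, batch)
        conn.commit()                       # commit per batch rather than per row or all at once
conn.close()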
I'm using Python to save the data row by row... but this is extremely slow!
The CSV contains 70 million lines, and with my script I can only store about 1,000 rows a second.
This is what my script looks like:
import csv

reader = csv.reader(open('test_results.csv', 'r'))
for row in reader:
    TestResult(type=row[0], name=row[1], result=row[2]).save()  # TestResult is a Django model
I reckon that for testing I might have to consider MySQL or PostgreSQL.
Any idea or tips? This is the first time I deal with such massive volumes of data. :)
For MySQL imports:
mysqlimport [options] db_name textfile1 [textfile2 ...]
For SQLite3 imports:
See: How to import a .sql or .csv file into SQLite?
I don't know if this will make a big enough difference, but since you're dealing with the Django ORM I can suggest the following:
Ensure that DEBUG is False in your Django settings file, since otherwise you're storing every single query in memory.
Put your logic in a main function, and wrap that in the django.db.transaction.commit_on_success decorator. That will prevent each row from needing its own transaction, which will substantially speed up the process.
If you know that all of the rows in the file do not already exist in the database, add force_insert=True to your call to the save() method. This will halve the number of SQLite queries needed.
These suggestions will probably make an even bigger difference if you do find yourself using a client-server DBMS.
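Putting those suggestions together, a rough sketch (it assumes DEBUG is already False, uses the old-style commit_on_success decorator named above, and the model import path is hypothetical):

import csv
from django.db.transaction import commit_on_success   # decorator mentioned in the suggestion above
from myapp.models import TestResult                   # hypothetical import path for the model

@commit_on_success          # one transaction for the whole run instead of one per row
def load_results(path):
    with open(path, 'r') as f:
        for row in csv.reader(f):
            # force_insert skips the existence check, since the rows are known to be new
            TestResult(type=row[0], name=row[1], result=row[2]).save(force_insert=True)

load_results('test_results.csv')

For 70 million rows you would probably still want to commit in chunks (or look at bulk_create on newer Django versions), but that goes beyond the original suggestions.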