Choice of technology for loading large CSV files to Oracle tables - python

I have come across a problem and am not sure which technology would be best suited to solve it. I would be obliged if you could suggest some options based on your experience.
I want to load data from 10-15 CSV files, each of them fairly large (5-10 GB). By "load data" I mean converting the CSV files to XML and then populating around 6-7 staging tables in Oracle from this XML.
The data needs to be populated such that the elements of the XML, and eventually the rows of the tables, come from multiple CSV files. For example, an element A would have sub-elements whose data comes from CSV file 1, file 2, file 3 and so on.
I have a framework built on top of Apache Camel and JBoss on Linux. Oracle 10g is the database server.
Options I am considering:
1) Smooks - the problem is that Smooks serializes one CSV at a time, and I can't afford to hold on to the half-baked Java beans until the other CSV files are read. Given the sheer number of beans I would need to create and keep before they are fully populated and written to disk as XML, I run the risk of running out of memory.
2) SQL*Loader - I could skip the XML creation altogether and load the CSV directly into the staging tables using SQL*Loader. But I am not sure whether I can a) load multiple CSV files into the same tables, updating the records after the first file, and b) apply some translation rules while loading the staging tables.
3) A Python script to convert the CSV to XML.
4) SQL*Loader to load a different set of staging tables corresponding to the CSV data, and then a stored procedure to load the actual staging tables from this new set of staging tables (a path I want to avoid given the number of changes it would require in my existing framework).
Thanks in advance. If someone can point me in the right direction or share insights from personal experience, it will help me make an informed decision.
regards,
-v-
PS: The CSV files are fairly simple, with around 40 columns each. The depth of objects, or of the relationships between the files, would be around 2 to 3.

Unless you can use some full-blown ETL tool (e.g. Informatica PowerCenter, Pentaho Data Integration), I suggest the 4th solution - it is straightforward and the performance should be good, since Oracle will handle the most complicated part of the task.
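The two-stage idea behind the 4th option can be sketched in a few lines. This is only an illustration under loud assumptions: the standard-library sqlite3 module stands in for Oracle, the Python loader stands in for SQL*Loader, and all table names, column names and sample data are hypothetical. The point it shows is that each CSV is bulk-loaded unchanged into its own raw table, and a single set-based SQL statement then combines the files and applies the translation rules.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for two of the large CSV files.
CUSTOMERS_CSV = "cust_id,name\n1,Alice\n2,Bob\n"
ORDERS_CSV = "cust_id,amount\n1,10.5\n1,4.5\n2,7.0\n"


def load_raw(conn, table, csv_text):
    """Bulk-load one CSV, unchanged, into its own raw table."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # skip the header row
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)


# sqlite3 is only a stand-in for the Oracle database here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_customers (cust_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE raw_orders (cust_id INTEGER, amount REAL)")
conn.execute(
    "CREATE TABLE stg_customer_totals (cust_id INTEGER, name TEXT, total REAL)"
)

# Stage 1: one raw table per CSV file, no transformation.
load_raw(conn, "raw_customers", CUSTOMERS_CSV)
load_raw(conn, "raw_orders", ORDERS_CSV)

# Stage 2: the database joins the files and applies translation rules
# (here: upper-casing the name and summing the amounts).
conn.execute(
    """
    INSERT INTO stg_customer_totals (cust_id, name, total)
    SELECT c.cust_id, UPPER(c.name), SUM(o.amount)
    FROM raw_customers c
    JOIN raw_orders o ON o.cust_id = c.cust_id
    GROUP BY c.cust_id, c.name
    """
)

rows = conn.execute(
    "SELECT cust_id, name, total FROM stg_customer_totals ORDER BY cust_id"
).fetchall()
```

In a real setup the second stage would live in a stored procedure, so the rows never leave the database while they are being combined.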

In Informatica PowerCenter you can import and export XML files larger than 5 GB. As Marek's answer suggests, give it a try, because it works pretty fast. Here is a brief introduction if you are unfamiliar with this tool.

Create a process or script that calls a procedure to load the CSV files into an external Oracle table, and another script to load the data from there into the destination table.
You can also add cron jobs that call these scripts, so that they keep track of CSV files arriving in the directory, process them and move each CSV file to an output/processed folder.
Exceptions can be handled accordingly, by logging them or sending out an email. Good luck.
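A minimal sketch of such a script, assuming it is started periodically by cron. The folder layout and the load_file callable are hypothetical: load_file stands for whatever actually triggers the database load and is expected to raise on failure. Sending an email would go in the same place as the logging call.

```python
import logging
import shutil
import tempfile
from pathlib import Path


def process_incoming(incoming_dir, processed_dir, failed_dir, load_file):
    """Load every CSV in incoming_dir, then move it out of the way.

    load_file is a hypothetical callable that performs the actual load
    (for example by calling the database procedure) and raises on failure.
    """
    incoming = Path(incoming_dir)
    targets = {"processed": Path(processed_dir), "failed": Path(failed_dir)}
    for folder in targets.values():
        folder.mkdir(parents=True, exist_ok=True)

    outcome = {"processed": [], "failed": []}
    for csv_path in sorted(incoming.glob("*.csv")):
        try:
            load_file(csv_path)
            status = "processed"
        except Exception:
            # Log the full traceback; an email alert could be sent here too.
            logging.exception("Loading %s failed", csv_path.name)
            status = "failed"
        shutil.move(str(csv_path), str(targets[status] / csv_path.name))
        outcome[status].append(csv_path.name)
    return outcome


def demo():
    """Run the loop once against a temporary folder with a fake loader."""

    def fake_load(path):
        if path.read_text().strip() == "":
            raise ValueError("empty file")

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        incoming = root / "incoming"
        incoming.mkdir()
        (incoming / "a.csv").write_text("id\n1\n")
        (incoming / "b.csv").write_text("")
        outcome = process_incoming(
            incoming, root / "processed", root / "failed", fake_load
        )
        leftover = sorted(p.name for p in incoming.iterdir())
    return outcome, leftover
```

Moving failed files to their own folder keeps a bad file from being retried on every cron run.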

Related

Processing data from a large data grab

I've downloaded a large (>75GB) data grab from archive.org containing most or all of the tweets from June 2020. The archive itself consists of 31 .tar files, each containing nested folders with the lowest level containing several compressed .json files. I need a way to access the data stored in this archive from my Python application. I would like to use MongoDB since its document-based database structure seems well suited to the type of data in this archive. What would be the best way of doing so?
Here is what the archive looks like (you can find it here):
Any help would be appreciated.
Edit - to be clear, I am not set on using MongoDB. I am open to other database solutions as well.
MongoDB is certainly not a good idea, because you need to load the database into RAM. Unless you have a cluster or something similar, you surely do not have enough RAM to host this content.
So you may want to filter the data first if you still want to use MongoDB.
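Whichever database is chosen, the archive can be read and filtered as a stream, without unpacking it to disk or holding it in memory. The sketch below makes several assumptions that the question does not confirm: the compressed members end in .json.gz or .json.bz2, each holds one JSON object per line, and the "lang" field used in the filter example is hypothetical.

```python
import bz2
import gzip
import io
import json
import tarfile
import tempfile
from pathlib import Path


def iter_records(tar_path):
    """Yield one parsed JSON object at a time from a .tar of compressed files."""
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip the nested directory entries
            if member.name.endswith(".json.gz"):
                opener = gzip.open
            elif member.name.endswith(".json.bz2"):
                opener = bz2.open
            else:
                continue
            raw = tar.extractfile(member)
            # Decompress lazily; assumes one JSON document per line.
            with opener(raw, "rt", encoding="utf-8") as lines:
                for line in lines:
                    line = line.strip()
                    if line:
                        yield json.loads(line)


def demo():
    """Build a tiny archive with the assumed layout and read it back."""
    payload = gzip.compress(b'{"id": 1, "lang": "en"}\n{"id": 2, "lang": "fr"}\n')
    with tempfile.TemporaryDirectory() as tmp:
        tar_path = Path(tmp) / "sample.tar"
        with tarfile.open(tar_path, "w") as tar:
            info = tarfile.TarInfo("2020/06/01/00/30.json.gz")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
        # Filter while streaming, so only the wanted records are kept.
        return [rec["id"] for rec in iter_records(tar_path) if rec["lang"] == "en"]
```

Because iter_records is a generator, the filtered records can be handed to a database insert in batches instead of being collected in a list.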

Need code for importing .csv file via python or ruby code to Cassandra 3.11.3 DB (Production use)

We have a 7-node Cassandra 3.11.3 production cluster. We get a dump of ticket details on a mid server, and I need to read this .csv file and import its data into a Cassandra table. I tried Ruby code, which was easy for me to write, but it does not handle all the column values correctly: the .csv contains special characters, line breaks inside fields, UTF encoding issues and long text descriptions (as entered in the ticketing tool), and the data changes from row to row.
I want to know whether Ruby or Python is a good choice for this activity in production. Does anyone have good sample code that mitigates the issues mentioned above and performs this kind of import in a production environment?
Both Ruby and Python are perfect for this kind of task, but if your source file is in bad format then any potential tool could fail - there is no magic button tool that could deduce the context from the (broken) data file and fix all the problems for you automatically.
I'd suggest splitting the task into two parts: 1) fix the encoding and data quality problem(s) (and perform any data transformations if necessary) and then 2) import clean data.
Task 2 can easily be done in almost any programming language that has an appropriate Cassandra driver available. But if you have a well-formatted CSV source, you probably don't need any hacking at all (depending on the use case, of course): cqlsh supports the COPY ... FROM command, which imports data from a CSV file directly (https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlshCopy.html).
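Part 1 of that split can be sketched with Python's standard csv module, which already parses quoted fields containing commas and line breaks. The rules below are assumptions chosen for illustration, not requirements of Cassandra: line breaks and repeated whitespace inside a field are flattened to single spaces, every output field is quoted, and records with an unexpected column count are set aside for manual review instead of being guessed at.

```python
import csv
import io


def clean_csv(src, dst, expected_columns):
    """Copy well-formed records from src to dst and return the rejected ones.

    src and dst are text streams. For real files, open them with
    newline="" and, if the encoding is unreliable, errors="replace".
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    rejected = []
    for record_no, row in enumerate(reader, start=1):
        if len(row) != expected_columns:
            rejected.append((record_no, row))
            continue
        # Flatten embedded line breaks and repeated whitespace.
        writer.writerow([" ".join(field.split()) for field in row])
    return rejected


# Hypothetical ticket dump: one multi-line description, one broken record.
raw = (
    "id,summary,description\n"
    '1,"Printer, 2nd floor","Line one\nLine two"\n'
    "2,broken row\n"
)
cleaned = io.StringIO()
rejected = clean_csv(io.StringIO(raw), cleaned, expected_columns=3)
cleaned_text = cleaned.getvalue()
```

The cleaned output is then a candidate for part 2, whether that is a driver-based import or a direct CSV import.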

How can I reduce the access time on large Excel files?

I would like to process a large data set from a mechanical testing device with Python. The software of this device only allows exporting the data as an Excel file. Therefore, I use the xlrd package, which works fine for small *.xlsx files.
The problem I have is that when I open a typical data set (3-5 MB) with
xlrd.open_workbook(path_wb)
the access time is about 30 s to 60 s. Is there a more effective and faster way to access Excel files?
You could access the file as a database via PyPyODBC instead, which may (or may not) be faster - you'd have to try it out and compare the results.
This method should work for both .xls and .xlsx files. Unfortunately, it comes with a couple of caveats:
As far as I am aware, this will only work on Windows machines, since you're relying on the Microsoft Jet database driver.
The Microsoft Jet database driver can be rather buggy, especially with dates.
It's not possible to create or modify Excel files (a note in the PyPyODBC exceltests.py file says: I have not been able to successfully create or modify Excel files.). Your question seems to indicate that you're only interested in reading files, though, so hopefully this will not be a problem.
I just figured out that the access time was not actually the problem: I was also creating an object in the same step. Now that I create the object separately, everything works fast and nicely.

Something wrong with using CSV as database for a webapp?

I am using Flask to make a small webapp to manage a group project. In this website I need to manage attendance and also meeting reports. I don't have the time to get into SQLAlchemy, so I need to know what the drawbacks of using CSV as a database might be.
Just don't do it.
The problem with CSV is:
a) Concurrency is not possible: when two people access your app at the same time, there is no way to make sure that they don't interfere with each other by changing each other's data. There is no way to solve this when using a CSV file as a backend.
b) Speed: whenever you make changes to a CSV file, you need to reload more or less the whole file. Parsing the file eats up both memory and time.
Databases were made to solve these issues.
I agree however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for is an ORM (object-relational mapper), which translates Python code into SQL and manages the SQL database for you.
Two options are PeeweeORM and PonyORM. Both are easy to use and translate between SQL and Python. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend PeeweeORM. You can start using SQLite as a backend with Peewee, or if your app grows larger, you can plug in MySQL or PostgreSQL easily.
Don't do it, CSV that is.
There are many other possibilities, for instance the sqlite database, python shelve, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as postgres for which there are a number of python libraries.
I think there's nothing wrong with that as long as you abstract away from it, i.e. make sure you have a clean separation between what you write and how you implement it. That will bloat your code a bit, but it will make sure you can swap out your CSV storage in a matter of days.
In other words, pretend that you can persist your data as if you were keeping it in memory. Don't write "openCSVFile" in your Flask app; use initPersistence(). Don't write "csvFile.appendRecord()"; use "persister.saveNewReport()". When and if you actually find CSV to be a bottleneck, you can just write a new persister plugin.
There are added benefits like you don't have to use a mock library in tests to make them faster. You just provide another persister.
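A minimal sketch of that separation, with the answer's names adapted to Python naming conventions. The class names, the report fields (title, body) and the "kind" switch are hypothetical choices for illustration; the answer only names initPersistence() and saveNewReport().

```python
import csv
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class ReportPersister(ABC):
    """What the web app is allowed to know about storage."""

    @abstractmethod
    def save_new_report(self, title, body):
        ...

    @abstractmethod
    def list_reports(self):
        ...


class CsvReportPersister(ReportPersister):
    """CSV-backed storage; the only place that knows about the file."""

    def __init__(self, path):
        self.path = Path(path)

    def save_new_report(self, title, body):
        with self.path.open("a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([title, body])

    def list_reports(self):
        if not self.path.exists():
            return []
        with self.path.open(newline="", encoding="utf-8") as f:
            return [tuple(row) for row in csv.reader(f)]


class InMemoryReportPersister(ReportPersister):
    """Drop-in replacement for tests; nothing touches the disk."""

    def __init__(self):
        self._reports = []

    def save_new_report(self, title, body):
        self._reports.append((title, body))

    def list_reports(self):
        return list(self._reports)


def init_persistence(kind, path=None):
    """Pick a persister; the rest of the app never names CSV directly."""
    if kind == "csv":
        return CsvReportPersister(path)
    if kind == "memory":
        return InMemoryReportPersister()
    raise ValueError(f"unknown persistence kind: {kind}")


def demo():
    """Run the same calls against both back ends."""
    with tempfile.TemporaryDirectory() as tmp:
        persisters = (
            init_persistence("memory"),
            init_persistence("csv", path=Path(tmp) / "reports.csv"),
        )
        results = []
        for persister in persisters:
            persister.save_new_report("Kickoff", "Agreed on weekly meetings")
            results.append(persister.list_reports())
    return results
```

A later SQLite-backed persister would only need to implement the same two methods, and the Flask views would not change.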
I am absolutely baffled by how many people discourage using CSV as a database storage back-end format.
Concurrency: There is NO reason why CSV can not be used with concurrency. Just like how a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same binary file. Databases can do EXACTLY the same thing with CSV files. Just as a journal is used to maintain the atomic nature of individual transactions, the same exact thing can be done with CSV.
Speed: Why on earth would a database read and write a WHOLE file at a time, when it can do what it does for ALL other database storage formats? It can look up the starting byte of a record in an index file, SEEK to it in constant time, overwrite the data, comment out anything left over and record the free space for later use in a separate index file, just like a database could zero out the bytes of any unneeded areas of a binary "row" and record the free space in a separate index file... I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments... etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be accessed quickly in the case of an emergency... which is the primary reason I do not EVER use binary storage for essential data that should be quickly accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you made changes to it in a spreadsheet program, but that is no different from having to re-index a binary database after the index/table gets corrupted/deleted/out-of-sync/etc.
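The index-and-seek lookup described in this answer can be sketched as follows. It is a read-only illustration under simplifying assumptions: the file is UTF-8, no field contains an embedded line break (so one physical line is one record), and the index is kept in a Python list instead of a separate index file. Overwriting records and tracking free space are not shown.

```python
import csv
import tempfile
from pathlib import Path


def build_offset_index(path):
    """Return the starting byte offset of every line in the file."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            position = f.tell()
            if not f.readline():
                break
            offsets.append(position)
    return offsets


def read_record(path, offsets, record_no):
    """Seek straight to one record instead of parsing the whole file."""
    with open(path, "rb") as f:
        f.seek(offsets[record_no])
        line = f.readline().decode("utf-8")
    return next(csv.reader([line]))


def demo():
    """Index a tiny hypothetical file and fetch its last record."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "people.csv"
        path.write_bytes(b"id,name\n1,Ann\n2,Bob\n")
        offsets = build_offset_index(path)
        return offsets, read_record(path, offsets, 2)
```

As the answer notes, the index has to be rebuilt whenever the file is edited outside the program, for example in a spreadsheet.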

sqlite3 or CSV files

First of all, I'm a total noob at this. I've been working on setting up a small GUI app to play with databases, mostly Microsoft Excel files with large numbers of rows. I want to be able to display a portion of the data, and I want to be able to choose the columns I'm working with through the menu so I can perform different tasks very efficiently.
I've been looking into .CSV files. I can create some sort of list or dictionary from them (not sure), or I could just import the Excel table into a database and then do whatever I need to with my GUI. Now my question is: for the type of task I just described, which of the two methods would be best suited? (Feel free to tell me if there is a better one.)
It will depend upon the requirements of your application and how you plan to extend or maintain it in the future.
A few points in favour of sqlite:
standardized interface, SQL - with CSV you would create some custom logic to select columns or filter rows
performance on bigger data sets - it might be difficult to load 10M rows of CSV into memory, whereas handling 10M rows in sqlite won't be a problem
sqlite3 is in the python standard library (but then, CSV is too)
That said, also take a look at pandas, which makes working with tabular data that fits in memory a breeze. Plus pandas will happily import data directly from Excel and other sources: http://pandas.pydata.org/pandas-docs/stable/io.html
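To make the sqlite option concrete, here is a minimal sketch using only the standard library. It assumes the Excel sheet has already been exported to CSV with a trusted header row; the table name, sample data and helper names are hypothetical. Selecting columns and displaying a portion of the rows become one SQL query each.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file):
    """Create a table from the CSV header and load all rows into it."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return header


def fetch_page(conn, table, known_columns, wanted_columns, limit, offset=0):
    """Return one page of rows, restricted to the columns picked in the GUI."""
    unknown = set(wanted_columns) - set(known_columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    columns = ", ".join(f'"{name}"' for name in wanted_columns)
    query = f'SELECT {columns} FROM "{table}" ORDER BY rowid LIMIT ? OFFSET ?'
    return conn.execute(query, (limit, offset)).fetchall()


# Hypothetical export of a small sheet.
sheet = io.StringIO("name,dept,salary\nAnn,QA,10\nBob,Dev,20\nCy,Dev,30\n")
conn = sqlite3.connect(":memory:")
header = load_csv(conn, "sheet", sheet)
page = fetch_page(conn, "sheet", header, ["name", "salary"], limit=2, offset=1)
```

The values come back as text because the columns are created without types; a real app would declare types or convert on load.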
