Big Data with Blaze and Pandas - python

I want to know if this approach would be overkill for a project.
I have a 4 GB file that my computer obviously can't handle in memory. Would using Blaze to split the file into more manageable sizes, opening them with pandas, and visualizing with Bokeh be overkill?
I know pandas has a "chunk" feature, but the reason I want to split the file is that there are specific rows, tied to specific names, that I need to analyze.
Is there a different approach you would take that won't crash my laptop and doesn't require setting up Hadoop or any AWS service?

Pandas chunking with pd.read_csv(..., chunksize=...) works well.
Alternatively, dask.dataframe mimics the pandas interface and handles the chunking for you.
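A minimal sketch of the chunked-filtering approach, assuming a hypothetical large_file.csv with a name column (adjust the path, column, and names to your data):

    import pandas as pd

    # Hypothetical file, column, and names; replace with your own.
    CSV_PATH = "large_file.csv"
    NAMES_OF_INTEREST = {"alice", "bob"}

    filtered_parts = []
    # Read the 4 GB file in 1M-row chunks so memory stays bounded.
    for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
        # Keep only the rows tied to the names you care about.
        filtered_parts.append(chunk[chunk["name"].isin(NAMES_OF_INTEREST)])

    # Concatenate the (much smaller) filtered pieces for analysis or Bokeh plots.
    result = pd.concat(filtered_parts, ignore_index=True)
    print(result.shape)

The dask.dataframe version looks almost identical to plain pandas (dd.read_csv plus the same filter), with the chunking handled for you.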

Related

Is there a command to import an SPSS data frame (.sav) using dask?

I am trying to import multiple dataframes using Dask; however, unlike pandas, Dask doesn't seem to have any command to do so. Or at least I haven't been able to find a way to do it in its documentation/examples.
I know that I could convert the databases to CSV or chunk them, but I'd like to rule out that alternative due to the complexity of the database.
Is there any command or plugin that allows me to parallelize the import with Dask?
Dask dataframe doesn't support pandas.read_spss. There was a user with a similar problem (reading the SAS format with dask) here: https://github.com/dask/dask/issues/1233
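If the goal is mainly to parallelize reading many .sav files, one workaround (just a sketch, not an official dask reader; it assumes pandas.read_spss is available, which needs the pyreadstat package, and uses a hypothetical data/*.sav path) is to wrap pandas.read_spss in dask.delayed and build a dask dataframe from the delayed pieces:

    import glob

    import dask
    import dask.dataframe as dd
    import pandas as pd  # pandas.read_spss requires the pyreadstat package

    # Hypothetical location of the .sav files; all are assumed to share columns.
    sav_files = glob.glob("data/*.sav")

    # Wrap each read in dask.delayed so the files are loaded lazily, in parallel.
    delayed_frames = [dask.delayed(pd.read_spss)(path) for path in sav_files]

    # Build a dask dataframe from the delayed pandas frames.
    ddf = dd.from_delayed(delayed_frames)
    print(ddf.head())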

Best approach to extract 1000 GB of data from Teradata

What is the best approach to download data from Teradata using Python? I have more than 3000 GB of data to migrate from Teradata to another service. Any approach or advice would be great.
So far I have explored -
Using pandas: it is slow and inefficient. I have also used {fn teradata_try_fastexport}, but it seems to make no difference.
Using tdload: I am not able to partition the file, and there is no support for downloading columns with byte, CLOB, or BLOB data types.
Using dask: I am not able to connect to Teradata and pull the data into a dask dataframe. I have read the SQL with pandas and then used dask to write the data out, but that seems to be of little help given the huge amount of data.
PySpark: I am exploring it, but it seems difficult, with too much configuration and too many dependency issues.
Any advice or suggested approach for efficient and effective data extraction?
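One pattern worth trying before Spark/Hadoop (a rough sketch, assuming the teradatasql driver; host, credentials, and table name are placeholders) is to stream the result set in chunks with pandas and write each chunk to Parquet, so no single DataFrame ever has to hold the whole table:

    import pandas as pd
    import teradatasql  # Teradata SQL driver for Python

    # Placeholder query and connection details; the {fn teradata_try_fastexport}
    # escape asks the driver to use FastExport when the query allows it.
    QUERY = "{fn teradata_try_fastexport} SELECT * FROM my_db.my_big_table"

    with teradatasql.connect(host="tdhost", user="user", password="password") as con:
        # Stream the result set in 1M-row chunks instead of loading it all at once.
        # (pandas may warn that the connection is not a SQLAlchemy connectable.)
        for i, chunk in enumerate(pd.read_sql(QUERY, con, chunksize=1_000_000)):
            # Write each chunk to its own Parquet file (needs pyarrow or fastparquet).
            chunk.to_parquet(f"export/part-{i:05d}.parquet", index=False)

Each Parquet part can then be read lazily (for example with dask.dataframe.read_parquet) on the target side.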

How can I optimize file I/O in Python when I process GB-sized files via NFS?

I'm manipulating several files via NFS due to security concerns, and the slow file I/O makes processing anything very painful. The following describes the situation.
I use pandas in Python to do simple processing on the data, so I use read_csv() and to_csv() frequently.
Currently, writing a 10 GB CSV file takes nearly 30 minutes, whereas reading takes 2 minutes.
I have enough CPU cores (> 20) and memory (50-100 GB).
It is hard to ask for more bandwidth.
I need to access the data in a column-oriented manner, frequently. For example, there might be 100M records with 20 columns (most of them numeric). For that data, I frequently read all 100M records but only 3-4 columns' values.
I've tried HDF5, but it produced a larger file and took a similar time to write, and it did not give me column-oriented I/O, so I discarded that option.
I cannot store the files locally; that would violate many security criteria. I'm actually working on a virtual machine, and the file system is mounted via NFS.
I repeatedly read several columns, but only a few of them, not all. The task is something like data analysis.
Which approaches can I consider?
In several cases I use sqlite3 to manipulate data in a simple way and export the results into CSV files. Can I accelerate the I/O by using sqlite3 in Python? If it provides column-wise operations, I reckon it would be a good solution.
Two options: pandas HDF5 or dask.
You can look at the HDF5 format with format='table'.
HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by format='table' or format='t' to append or put or to_hdf.
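A short sketch of the table format in practice (the key name, columns, and compression settings below are just examples; writing HDF5 needs the PyTables package):

    import numpy as np
    import pandas as pd

    # Toy stand-in for the real 100M x 20 numeric table.
    df = pd.DataFrame(np.random.rand(1_000_000, 20),
                      columns=[f"col{i}" for i in range(20)])

    # Write in the queryable 'table' format, compressed with blosc.
    df.to_hdf("data.h5", key="data", format="table",
              data_columns=["col0", "col1"],   # columns you want to filter on
              complib="blosc", complevel=5)

    # Read back only the 3-4 columns you actually need.
    subset = pd.read_hdf("data.h5", key="data", columns=["col0", "col1", "col2"])

    # The table format also supports on-disk row filtering.
    filtered = pd.read_hdf("data.h5", key="data", where="col0 > 0.5")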
You can use dask's read_csv; it reads the data lazily and only does the work when you call compute().
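For example (a sketch with hypothetical file and column names), dask only builds a task graph up front and reads the needed bytes, in parallel chunks, when compute() runs:

    import dask.dataframe as dd

    # Nothing is read yet; this only builds the task graph.
    ddf = dd.read_csv("big.csv", usecols=["col0", "col1", "col2"])

    # The file is only read (chunk by chunk, in parallel) at compute() time.
    means = ddf.mean().compute()
    print(means)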
Purely for improving I/O performance, I think HDF with a compressed format is best.

pandas.read_csv vs other csv libraries for loading CSV into a Postgres Database

I am a relatively new user of Python. What is the best way of parsing and processing a CSV and loading it into a local Postgres Database (in Python)?
It was recommended to me to use the CSV library to parse and process the CSV. In particular, the task at hand says:
The data might have errors (some rows may not be parseable), the
data might be duplicated, the data might be really large.
Is there a reason why I wouldn't be able to just use pandas.read_csv here? Does using the CSV library make parsing and loading it into a local Postgres database easier? In particular, if I just use pandas will I run into problems if rows are unparseable, if the data is really big, or if data is duplicated? (For the last bit, I know that pandas offers some relatively clean solutions for de-dupping).
I feel like pandas.read_csv and pandas.to_sql can do a lot of work for me here, but I'm not sure if using the CSV library offers other advantages.
Just in terms of speed, this post: https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file seems to suggest that pandas.read_csv performs the best?
A quick Google search didn't reveal any serious drawbacks in pandas.read_csv regarding its functionality (parsing correctness, supported types, etc.). Moreover, since you appear to be using pandas to load the data into the DB too, reading directly into a DataFrame is a huge boost in both performance and memory (no redundant copies).
There are only memory issues for very large datasets, but these are not the library's fault. "How to read a 6 GB csv file with pandas" has instructions on how to process a large .csv in chunks with pandas.
Regarding "The data might have errors": read_csv has a few facilities like converters, error_bad_lines, and skip_blank_lines (the specific course of action depends on whether, and from how much, corruption you're supposed to be able to recover).
I had a school project just last week that required me to load data from a CSV and insert it into a Postgres database. So believe me when I tell you this: it's way harder than it has to be unless you use pandas. The issue is sniffing out the data types. If your database columns are all string types, forget what I said, you're golden. But if you have a CSV with an assortment of data types, either you get to sniff them out yourself or you can use pandas, which does it efficiently and automatically. Plus, pandas has a nifty to_sql method which can easily be adapted to work with Postgres via a SQLAlchemy connection, too.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
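And a minimal sketch of the pandas-to-Postgres step (connection string, table, and file names are placeholders; it assumes SQLAlchemy and psycopg2 are installed):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string; adjust user/password/host/database.
    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")

    df = pd.read_csv("input.csv")
    df = df.drop_duplicates()

    # chunksize keeps memory bounded for large files; if_exists controls whether
    # to append to or replace an existing table.
    df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=10_000)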

sqlite3 or CSV files

First of all, I'm a total noob at this. I've been working on setting up a small GUI app to play with databases, mostly Microsoft Excel files with large numbers of rows. I want to be able to display a portion of the data, and I want to be able to choose the columns I'm working with through the menu, so I can perform different tasks very efficiently.
I've been looking into .CSV files. I could build some sort of list or dictionary from them (not sure), or I could just import the Excel table into a database and then do whatever I need to with my GUI. Now my question is: for the type of task I just described, which of the two methods would be best suited? (Feel free to tell me if there is a better one.)
It will depend on the requirements of your application and how you plan to extend or maintain it in the future.
A few points in favour of sqlite:
standardized interface, SQL - with CSV you would have to write custom logic to select columns or filter rows
performance on bigger data sets - it might be difficult to load 10M rows of CSV into memory, whereas handling 10M rows in sqlite won't be a problem
sqlite3 is in the python standard library (but then, CSV is too)
That said, also take a look at pandas, which makes working with tabular data that fits in memory a breeze. Plus pandas will happily import data directly from Excel and other sources: http://pandas.pydata.org/pandas-docs/stable/io.html
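A small sketch of that pandas route (the workbook, sheet, table, and column names are placeholders; reading .xlsx needs the openpyxl package):

    import sqlite3

    import pandas as pd

    # Load the Excel sheet into a DataFrame.
    df = pd.read_excel("workbook.xlsx", sheet_name="Sheet1")

    # Push it into a local sqlite database so it can be queried with SQL.
    with sqlite3.connect("app.db") as con:
        df.to_sql("mytable", con, if_exists="replace", index=False)

        # Pull back only the columns chosen in the GUI (placeholder names).
        subset = pd.read_sql_query("SELECT col_a, col_b FROM mytable LIMIT 100", con)

    print(subset)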
