Is pandas more efficient than the csv module for ETL - python

I have written some Python scripts that load CSV files with hundreds of thousands of rows into a database. They work great, but I was wondering: is it more memory efficient to use the csv module to extract the CSVs as a list of lists than to create a pandas DataFrame?

Pandas DataFrame is definitely more memory efficient than regular Python lists.
You should use Pandas.
Take a look at the slides from Jeffrey Tratner's talk Pandas Under The Hood.
Here is a comparison of a few key points between the pandas approach and the list approach:
DataFrames have a flexible interface. If you choose the bare-bones Python list approach, you will need to write the necessary functions yourself.
Many number-crunching routines in pandas are implemented in C or via specialized numerical libraries (NumPy) and will always be faster than the code you would write over your lists.
Choosing lists also means that, with large data, the memory layout of lists will degrade performance, whereas a DataFrame keeps its data split into blocks of the same type.
A pandas DataFrame has indexes, which let you easily look up, combine, and split data based on conditions you choose. Indexes are implemented in C and specialized for each data type.
Pandas can easily read and write data in many different formats.
There are many more advantages that I probably don't even know about. The key point is: don't reinvent the wheel, use the right tools if you have them.
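As a rough illustration of the difference in effort, here is a minimal sketch of the two approaches side by side (the file name, table and columns are made up for the example, using SQLite as the target database):
import csv
import sqlite3
import pandas as pd
# --- csv module route: you handle types, SQL and batching yourself ---
con = sqlite3.connect('etl.db')
con.execute('CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)')
with open('people.csv', newline='') as f:
    reader = csv.DictReader(f)
    rows = [(r['name'], int(r['age'])) for r in reader]   # manual type conversion, whole file in RAM
con.executemany('INSERT INTO people (name, age) VALUES (?, ?)', rows)
con.commit()
# --- pandas route: type inference, chunking and the insert are built in ---
for chunk in pd.read_csv('people.csv', chunksize=100_000):   # stream in chunks to cap memory
    chunk.to_sql('people', con, if_exists='append', index=False)
con.close()
Note that the pandas route with chunksize never holds more than one chunk in memory, which is usually the deciding factor for files with hundreds of thousands of rows.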

Related

Is pyarrow.Table good for long-term storage of pandas DataFrames?

I'm trying to come up with a solution for quick serialization and long-term storage of pandas DataFrames. The DataFrame content is tabular but provided by the user, so it can be arbitrary: it might have entirely text columns as well as entirely numeric/boolean columns.
Main goals are:
Serialize the DataFrame as quickly as possible in order to dump it to disk.
Use a format that I'll be able to load from disk back into a DataFrame later.
And, ideally, the smallest memory footprint during serialization and a compact output file.
I have run benchmarks comparing different serialization methods, including:
Parquet: df.to_parquet()
Feather: df.to_feather()
JSON: df.to_json()
CSV: df.to_csv()
PyArrow: pyarrow.default_serialization_context().serialize(df)
PyArrow.Table: pyarrow.default_serialization_context().serialize(pyarrow.Table.from_pandas(df))
Speed of serialization and memory footprint during serialization are probably the biggest factors (read: get rid of the data, dump it to disk ASAP).
Strangely, in our benchmarks serializing a pyarrow.Table seems the most balanced and is quite fast.
Questions:
Is there something I'm missing in my understanding of the difference between serializing a DataFrame directly with PyArrow and serializing a pyarrow.Table? Table shines when the DataFrame consists mostly of strings, which is frequent in our cases.
Is pyarrow.Table a valid option for long-term storage of DataFrames? It seems to just work, but most people seem to stick to Parquet or something else.
Parquet/Feather are as good as pyarrow.Table in terms of memory / storage size, but are quite a bit slower (2-3x) on half-text DataFrames. Could I be doing something wrong?
For mixed-type DataFrames, JSON still seems like an option according to our benchmarks.
I can provide the numbers if needed.
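For context, a minimal timing sketch of this kind of comparison looks roughly like the following (simplified: made-up data stands in for the user-provided frames, only the built-in pandas writers are shown, and pyarrow is assumed to be installed for Parquet/Feather):
import os
import tempfile
import time
import numpy as np
import pandas as pd
# made-up mixed-type frame standing in for the arbitrary user data
df = pd.DataFrame({
    'text': np.random.choice(['alpha', 'beta', 'gamma'], size=500_000),
    'num': np.random.rand(500_000),
    'flag': np.random.rand(500_000) > 0.5,
})
def timed(label, writer, suffix):
    # write to a temp file, report elapsed time and on-disk size
    path = os.path.join(tempfile.mkdtemp(), 'df' + suffix)
    start = time.perf_counter()
    writer(path)
    print(f'{label}: {time.perf_counter() - start:.3f}s, {os.path.getsize(path) / 1e6:.1f} MB')
timed('parquet', df.to_parquet, '.parquet')
timed('feather', df.to_feather, '.feather')
timed('csv', df.to_csv, '.csv')
timed('json', df.to_json, '.json')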

Is it a good practice to preallocate an empty dataframe with types?

I'm trying to load around 3 GB of data into a pandas DataFrame, and I figured that I would save some memory by first declaring an empty DataFrame while enforcing that its float columns be 32-bit instead of the default 64-bit. However, the pandas DataFrame constructor does not allow specifying the types of multiple columns on an empty DataFrame.
I found a bunch of workarounds in the replies to this question, but they made me realize that pandas is not designed to be used this way.
This made me wonder whether it was a good strategy at all to declare the empty DataFrame first, instead of reading the file and then downcasting the float columns (which seems inefficient both memory-wise and processing-wise).
What would be the best strategy to design my program?
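For reference, a sketch of the read-then-downcast approach next to the alternative of passing the dtypes to the reader directly (the file and column names here are hypothetical):
import pandas as pd
# option A: read, then downcast -- the full 64-bit columns exist in memory before the cast
df = pd.read_csv('big_file.csv')
float_cols = df.select_dtypes('float64').columns
df[float_cols] = df[float_cols].astype('float32')
# option B: tell the reader the types up front, so the 64-bit columns never need to be built
df = pd.read_csv('big_file.csv', dtype={'price': 'float32', 'quantity': 'int32'})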

What's the purpose of Series instead of lists in Pandas and Python?

Why doesn't Pandas build DataFrames directly from lists? Why was such a thing as a series created in the first place?
Or: If the data in a DataFrame is actually stored in memory as a collection of Series, why not just use a collection of lists?
Yet another way to ask the same question: what's the purpose of Series over lists?
This isn't going to be a very complete answer, but hopefully is an intuitive "general" answer.
Pandas doesn't use a list as the "core" unit that makes up a DataFrame because Series objects make assumptions that lists do not. A list in Python makes very few assumptions about what is inside; it could be pretty much anything, which makes it great as a core component of Python.
However, if you want to build a more specialized package that gives you extra functionality, like pandas, then you want to create your own "core" data object and start building extra functionality on top of that. Compared with lists, you can do a lot more with a custom Series object (as witnessed by pulling a single column from a DataFrame and seeing what methods are available on the output).
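As a toy illustration of that last point, compare what a plain list and a Series give you out of the box (values and labels made up for the example):
import pandas as pd
data = [3, 1, 4, 1, 5]
as_series = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
# with a list you write the logic yourself...
print(sum(data) / len(data))              # 2.8
# ...while a Series carries labels, a dtype and vectorised methods with it
print(as_series.mean())                   # 2.8
print(as_series[as_series > 2])           # label-aware boolean selection
print((as_series * 10).to_dict())         # vectorised arithmetic keeps the index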

Python - What are the major improvements of Pandas over Numpy/Scipy

I have been using numpy/scipy for data analysis. I recently started to learn Pandas.
I have gone through a few tutorials and I am trying to understand what the major improvements of Pandas over NumPy/SciPy are.
It seems to me that the key idea of Pandas is to wrap up different NumPy arrays in a DataFrame, with some utility functions around it.
Is there something revolutionary about Pandas that I just stupidly missed?
Pandas is not particularly revolutionary and does use the NumPy and SciPy ecosystem to accomplish its goals, along with some key Cython code. It can be seen as a simpler API to that functionality, with the addition of key utilities like joins and a simpler group-by capability that are particularly useful for people with table-like data or time series. But, while not revolutionary, Pandas does have key benefits.
For a while I had also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (this is not comprehensive):
Column-wise storage of disparate types (independent storage per column instead of the contiguous record storage of NumPy structured arrays) --- this allows faster processing in many cases.
Simpler interfaces to common operations (file-loading, plotting, selection, and joining / aligning data) make it easy to do a lot of work in little code.
Index arrays, which mean that operations are always aligned instead of you having to keep track of alignment yourself (see the short sketch after this list).
Split-Apply-Combine is a powerful way of thinking about and implementing data-processing
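A toy sketch of the index-alignment point (labels and values made up for the example):
import pandas as pd
# two Series with overlapping but differently ordered indexes
revenue = pd.Series([100, 250, 80], index=['jan', 'feb', 'mar'])
costs = pd.Series([60, 70], index=['feb', 'jan'])
# arithmetic aligns on the labels automatically; 'mar' has no matching cost, so the result there is NaN
print(revenue - costs)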
However, there are downsides to Pandas:
Pandas is basically a user-interface library and not particularly suited for writing library code. The "automatic" features can lull you into repeatedly using them even when you don't need to, slowing down code that gets called over and over again.
Pandas typically takes up more memory, as it is generous with the creation of object arrays to solve otherwise sticky problems such as string handling.
If your use-case is outside the realm of what Pandas was designed to do, it gets clunky quickly. But, within the realms of what it was designed to do, Pandas is powerful and easy to use for quick data analysis.
I feel like characterising Pandas as "improving on" Numpy/SciPy misses much of the point. Numpy/Scipy are quite focussed on efficient numeric calculation and solving numeric problems of the sort that scientists and engineers often solve. If your problem starts out with formulae and involves numerical solution from there, you're probably good with those two.
Pandas is much more aligned with problems that start with data stored in files or databases and that contain strings as well as numbers. Consider the problem of reading data from a database query. In Pandas, you can call read_sql_query directly and have a usable version of the data in one line. There is no equivalent functionality in Numpy/SciPy.
For data featuring strings or discrete rather than continuous data, there is no equivalent to the groupby capability, or the database-like joining of tables on matching values.
For time series, there is the massive benefit of handling time series data using a datetime index, which allows you to resample smoothly to different intervals, fill in values and plot your series incredibly easily.
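A short sketch of the kind of one-liners meant here (the table and column names are made up; the read_sql_query line is shown commented out since it needs a live connection):
import pandas as pd
# pd.read_sql_query('SELECT ts, region, sales FROM orders', connection)  # one line from query to DataFrame
# stand-in data for the groupby / resample examples
df = pd.DataFrame({
    'ts': pd.date_range('2020-01-01', periods=90, freq='D'),
    'region': ['north', 'south', 'east'] * 30,
    'sales': range(90),
})
print(df.groupby('region')['sales'].sum())              # database-style grouping on a discrete column
print(df.set_index('ts')['sales'].resample('W').sum())  # daily values resampled to weekly totals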
Since many of my problems start their lives in spreadsheets, I am also very grateful for the relatively transparent handling of Excel files in both .xls and .xlsx formats with a uniform interface.
There is also a greater ecosystem, with packages like seaborn enabling more fluent statistical analysis and model fitting than is possible with the base numpy/scipy stuff.
A main point is that it introduces new data structures like DataFrames, Panels, etc., and has good interfaces to other structures and libraries. So in general it's more a great extension to the Python ecosystem than an improvement over other libraries. For me it's a great tool among others like numpy and bcolz. Often I use it to reshape my data and get an overview before starting to do data mining, etc.

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType
lines = sc.textFile('tail5.csv')                          # RDD of raw text lines
parts = lines.map(lambda l: l.strip().split('\t'))        # split each line on tabs
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)       # every column is a string
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas().
Does it store the pandas object in local memory?
Is the pandas low-level computation all handled by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes)
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Using Spark to read a CSV file into pandas is quite a roundabout way of achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
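For the tab-separated file in the question, that is roughly the following (assuming, as the Spark snippet suggests, that the file has no header row; the column names here are placeholders for the real list):
import pandas as pd
fnames = ['col_a', 'col_b', 'col_c']   # placeholder for the real column-name list
df = pd.read_csv('tail5.csv', sep='\t', header=None, names=fnames)
print(df.dtypes)   # numeric columns come back as numbers, not strings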
Now to answer your questions:
Does it store the pandas object in local memory?
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is the pandas low-level computation all handled by Spark?
No. Pandas runs its own computations; there's no interplay between Spark and pandas, just some API compatibility.
Does it expose all pandas DataFrame functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
Can I convert it with toPandas and just be done with it, without touching the DataFrame API so much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using a Spark context or Hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run the rdd.collect() method, you end up copying the contents of the RDD from all the worker nodes to the master node's memory. Thus you lose your distributed compute benefits (but can still run the RDD methods).
Similarly with pandas: when you run toPandas(), you copy the data frame from distributed (worker) memory to local (master) memory and lose most of your distributed compute capabilities. So one possible workflow (that I often use) is to pre-munge your data down to a reasonable size using distributed compute methods, and then convert it to a pandas data frame for the rich feature set.
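A minimal sketch of that workflow, assuming ddf is the Spark DataFrame from the question and 'some_key' is a hypothetical column name:
summary = ddf.groupBy('some_key').count()   # aggregation runs in parallel on the workers
pdf = summary.toPandas()                    # only the small aggregated result is copied to the driver
print(pdf.head())                           # from here on it is a regular pandas DataFrame
Hope that helps.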
