Read tabular data (rows and named columns) in Python - best practices - python

What is the best way to read data from txt/csv file, separate values based on columns to arrays (no matter how many columns there are) and how skip for example first row if file looks like this:
Considering existing libraries in python.
So far, I've done it this way:
pareto_front_file = open("Pareto Front.txt")
data_pareto_front = pareto_front_file.readlines()
for pareto_front_row in data_pareto_front:
x_pareto.append(float(pareto_front_row.split(' ')[0]))
y_pareto.append(float(pareto_front_row.split(' ')[1]))
but creating more complicated things I see that this way is not very effective

Use the "Pandas" library (or something similar)
For tabular data, one of the most popular libraries is Pandas. Not only will this allow you to read the data easily, there are also methods for nearly all types of data transformation, filtering, visualization, etc. you can imagine.
Pandas is one of the most popular python packages, and although it may seem daunting at first, it is usually a lot easier than re-inventing the wheel yourself.
In case you are familiar with the R language, pandas covers a lot of the tidyverse functionalities and the DataFrame is similar to R's data.frame.
Transfer your data into a python object
It offers a read_csv method where you can specify a custom delimiter if you need it. Your data will come as a pandas "DataFrame", a datatype specifically designed for the kind of data you describe. Among many other things, it will recognize column names such as the ones you have in your data automatically.
Example for csv:
df = pd.read_csv('file.csv', delimiter=',') # choose delimiter you need
print(df.head()) # show first 5 rows
print(df.summary()) # get an overview over your data

Related

Is pandas more efficient than the csv module for ETL

I have written some python scripts that loads csv files with hundreds of thousands of rows into a database. It is working great but I was wondering if it is more memory efficient to use the csv module to extract the csv's as a list of lists than creating a pandas dataframe?
Pandas DataFrame is definitely more memory efficient than regular Python lists.
You should use Pandas.
Take look at slides from talk by Jeffrey Tratner Pandas Under The Hood
I'm just comparing a few key points between using pandas and lists approach:
DataFrames have flexible interface. If you chose bare bones Pythons list approach you will need to create necessary functions by yourself.
Many number crunching routines in pandas are implemented in C or by using specialized numerical libraries (Numpy) that will be always faster than code you will write in your lists
Deciding to use lists will also mean that with large data lists memory layout will be downgrading performance as opposed to for Dataframe where data are split into blocks of the same types
Pandas Dataframe has indexes which helps you easily lookup/combine/split data based on conditions you choose. Indexes are implemented in C and specialized for each data type.
Pandas can easily read/write data to different formats
There are much more advantages that I probably don't even know about. The key point is: Don't reinvent the wheel, use right tools if you have them

Python - What are the major improvement of Pandas over Numpy/Scipy

I have been using numpy/scipy for data analysis. I recently started to learn Pandas.
I have gone through a few tutorials and I am trying to understand what are the major improvement of Pandas over Numpy/Scipy.
It seems to me that the key idea of Pandas is to wrap up different numpy arrays in a Data Frame, with some utility functions around it.
Is there something revolutionary about Pandas that I just stupidly missed?
Pandas is not particularly revolutionary and does use the NumPy and SciPy ecosystem to accomplish it's goals along with some key Cython code. It can be seen as a simpler API to the functionality with the addition of key utilities like joins and simpler group-by capability that are particularly useful for people with Table-like data or time-series. But, while not revolutionary, Pandas does have key benefits.
For a while I had also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (this is not comprehensive):
Array of Structures (independent-storage of disparate types instead of the contiguous storage of structured arrays in NumPy) --- this will allow faster processing in many cases.
Simpler interfaces to common operations (file-loading, plotting, selection, and joining / aligning data) make it easy to do a lot of work in little code.
Index arrays which mean that operations are always aligned instead of having to keep track of alignment yourself.
Split-Apply-Combine is a powerful way of thinking about and implementing data-processing
However, there are downsides to Pandas:
Pandas is basically a user-interface library and not particularly suited for writing library code. The "automatic" features can lull you into repeatedly using them even when you don't need to and slowing down code that gets called over and over again.
Pandas typically takes up more memory as it is generous with the creation of object arrays to solve otherwise sticky problems of things like string handling.
If your use-case is outside the realm of what Pandas was designed to do, it gets clunky quickly. But, within the realms of what it was designed to do, Pandas is powerful and easy to use for quick data analysis.
I feel like characterising Pandas as "improving on" Numpy/SciPy misses much of the point. Numpy/Scipy are quite focussed on efficient numeric calculation and solving numeric problems of the sort that scientists and engineers often solve. If your problem starts out with formulae and involves numerical solution from there, you're probably good with those two.
Pandas is much more aligned with problems that start with data stored in files or databases and which contain strings as well as numbers. Consider the problem of reading data from a database query. In Pandas, you can read_sql_query directly and have a usable version of the data in one line. There is no equivalent functionality in Numpy/SciPy.
For data featuring strings or discrete rather than continuous data, there is no equivalent to the groupby capability, or the database-like joining of tables on matching values.
For time series, there is the massive benefit of handling time series data using a datetime index, which allows you to resample smoothly to different intervals, fill in values and plot your series incredibly easily.
Since many of my problems start their lives in spreadsheets, I am also very grateful for the relatively transparent handling of Excel files in both .xls and .xlsx formats with a uniform interface.
There is also a greater ecosystem, with packages like seaborn enabling more fluent statistical analysis and model fitting than is possible with the base numpy/scipy stuff.
A main point is that it introduces new data structures like dataframes, panels etc. and has good interfaces to other structure and libs. So in generally its more an great extension to the python ecosystem than an improvement over other libs. For me its a great tool among others like numpy, bcolz. Often i use it to reshape my data, get an overview before starting to do data mining etc.

How can I create a formatted and annotated excel with embedded pandas DataFrames

I want to create a "presentation ready" excel document with embedded pandas DataFrames and additional data and formatting
A typical document will include some titles and meta data, several Data Frames with sum row\column for each data frame.
The DataFrame itself should be formatted
The best thing I found was this which explains how to use pandas with XlsxWriter.
The main problem is that there's no apparent method to get the exact location of the embedded DataFrame to add the summary row below (the shape of the DataFrame is a good estimate, but it might no be exact when rendering complex DataFrames.
If there's a solution that relies on some kind of template, and not hard coding it would be even better.

Using Pandas to create, read, and update hdf5 file structure

We would like to be able to allow the HDF5 files themselves to define their columns, indexes, and column types instead of maintaining a separate file that defines structure of the HDF5 data.
How can I create an empty HDF5 file from Pandas with a specific table structure like:
Columns
id (Int)
name (Str)
update_date (datetime)
some_float (float)
Indexes
id
name
Once the HDF5 is created and saved to disk, how do I retrieve the column and index information without having to open the file completely each time since it will likely contain several GB of data.
Many thanks in advance...
-- UPDATE --
Thanks for the comments. To clarify a bit more:
We do have some experience with Pandas but by no means are really proficient. The part that is tripping us up is creating an empty data structure and reading that structure from a file that you will not want to fully open. In all of the Pandas examples there is data. The Pandas examples also only show two ways to retrieve data/structure which are to read the entire frame into memory or issue a where clause. In this case, we would like to be able to see the table structure without query operations if possible.
I know this is an odd case. Why the heck would you want an empty dataframe?? Well, we want to have a great deal of flexility in moving data around and want to be able to define a target dataframe structure prior to data writing, which could take place much later (e.g. hours or days). Since the HDF5 specification maintains all that information it seems directionally incorrect to store the table structure information separately. Thus our desire to crack the code on this subject.
-- UPDATE 2 --
To add more detail as #jeff requested.
We would like to abstract some of the common Pandas functions like summing data or merging two frames. Thus we would like to be able to ask each frame what their columns are so we can present a view for the user to select the result frame columns.
For example, if we imported a CSV with columns A, B, C, D, and V and saved the frame to HDF5 as my_csv.hdf then we would be able to determine the columns by opening the file.
However, in our use case it is likely that the import frame for the CSV could be cleared periodically and no longer contain the data. The reason knowing that the my_csv frame has certain columns and types is important because we want to enable a user to then select those columns for summing in a downstream operation. Lets say a user wants to sum column V by the values in columns A and B only and save the frame as my_sum. Since we can't ensure my_csv will always have data we would like to ensure it at least contains the structure.
Open to other suggestions obviously. It is also possible to store the table structure info in the user_block. This, again, is not ideal because the structure is now being kept in two different areas but I guess it would be possible to always update the user_block on save using the latest column and index information for the frame, although I believe the to_* operations in Pandas will blow away the user_block so...blah. I feel like I'm talking myself into maintaining a peer structure definition but I REALLY would love some suggestions to not have to do that.

Python: handling a large set of data. Scipy or Rpy? And how?

In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is such:
1) A huge set of financial data are stored in a text file. Loading into Excel is not possible
2) I need to sum a certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (Scipy or Rpy) is best suited for this task?
If so, could you provide me some pointers (e.g. documentation or online example) that can help me to implement a solution?
Speed is a concern. Ideally scipy and Rpy can handle the large files when even when the files are so large that they cannot be fitted into memory
Neither Rpy or Scipy is necessary, although numpy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
Python's File I/O doesn't have bad performance, so you can just use the file module directly. You can see what functions are available in it by typing help (file) in the interactive interpreter. Creating a file is part of the core language functionality and doesn't require you to import file.
Something like:
f = open ("C:\BigScaryFinancialData.txt", "r");
for line in f.readlines():
#line is a string type
#do whatever you want to do on a per-line basis here, for example:
print len(line)
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).
As #gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
How huge is your data, is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load text data into a numpy array. for example:
import numpy as np
with file("data.csv", "rb") as f:
title = f.readline() # if your data have a title line.
data = np.loadtxt(f, delimiter=",") # if your data splitted by ","
print np.sum(data, axis=0) # sum along 0 axis to get the sum of every column
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.
Since this has the R tag I'll give some R solutions:
Overview
http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)

Categories