Stock data storage and calculation using Python and pandas

I am working with stock data that I download as a file every day. The file contains the same number of columns each day, but the rows change depending on which stocks are in or out of the list. I want to compare the files from two dates, find the difference in the total quantity column, and see which stocks entered or left the list between the two files.
I have tried loading the files into pandas DataFrames and storing them in an HDF5 file, then using the DataFrame merge function to find the differences between the two files. I am looking for a more elegant solution, so that I can compare DataFrames and find the differences the way I would with Excel's INDEX and MATCH (or VLOOKUP) functions.

You could use Python's difflib library to compare the files.
From the documentation:
This module provides classes and functions for comparing sequences. It can be used, for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs.
Also, look at the answers to this similar question for some examples. One example that may be useful in your case is this one.
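As an alternative to difflib, the merge approach the asker mentions can be sketched with pandas' `indicator` option, which labels each row as present in one file or both. The column names and data here are made up for illustration; the real files would be read with `pd.read_csv`:

```python
import pandas as pd

# Hypothetical day snapshots; in practice these come from the daily files.
day1 = pd.DataFrame({"symbol": ["AAA", "BBB", "CCC"], "qty": [100, 200, 300]})
day2 = pd.DataFrame({"symbol": ["BBB", "CCC", "DDD"], "qty": [250, 300, 50]})

# Outer merge with indicator=True flags stocks that entered or left the list:
# "left_only" = dropped out, "right_only" = newly added, "both" = still present.
diff = day1.merge(day2, on="symbol", how="outer",
                  suffixes=("_old", "_new"), indicator=True)
diff["qty_change"] = diff["qty_new"].fillna(0) - diff["qty_old"].fillna(0)
print(diff)
```

This gives the same lookup-and-compare result as an INDEX/MATCH in Excel, in one call.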

Related

Is pandas more efficient than the csv module for ETL?

I have written some Python scripts that load CSV files with hundreds of thousands of rows into a database. They work great, but I was wondering whether it is more memory efficient to use the csv module to read the CSVs as lists of lists than to create a pandas DataFrame.
A pandas DataFrame is generally more memory efficient than regular Python lists.
You should use Pandas.
Take a look at the slides from Jeffrey Tratner's talk, Pandas Under The Hood.
I'm just comparing a few key points between the pandas and list approaches:
DataFrames have a flexible interface. If you choose a bare-bones Python list approach, you will need to write the necessary functions yourself.
Many number-crunching routines in pandas are implemented in C or use specialized numerical libraries (NumPy), which will almost always be faster than code you write over lists.
Choosing lists also means that, with large data, the memory layout will degrade performance, whereas a DataFrame splits data into blocks of the same type.
A pandas DataFrame has indexes, which help you easily look up, combine, and split data based on conditions you choose. Indexes are implemented in C and specialized for each data type.
Pandas can easily read and write data in many different formats.
There are many more advantages that I probably don't even know about. The key point is: don't reinvent the wheel; use the right tools when you have them.
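For the memory concern specifically, pandas can also read a CSV in chunks, so only a bounded slice of the file is in memory at a time. A minimal sketch, using an in-memory buffer as a stand-in for the large file on disk:

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk.
csv_data = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n")

total_rows = 0
# chunksize bounds memory use: only one chunk of rows is loaded at a time.
for chunk in pd.read_csv(csv_data, chunksize=2):
    # A real ETL script would insert each chunk into the database here,
    # e.g. with chunk.to_sql(...).
    total_rows += len(chunk)
print(total_rows)
```

This way you keep pandas' fast C parser without ever holding the whole file in memory.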

Making Dataframe Analysis faster

I am using three dataframes to analyze sequential numeric data - basically numeric data captured over time. There are 8 columns and 360k entries. I created three identical dataframes - one holds the raw data, the second is a "scratch pad" for analysis, and the third contains the analyzed outcome. This runs really slowly. Are there ways to make this analysis run faster? Would it be faster if, instead of three separate 8-column dataframes, I had one large 24-column dataframe?
Use cProfile and line_profiler to figure out where the time is being spent.
To get help from others, post your real code and your real profile results.
Optimization is an empirical process. The little tips people have are often counterproductive.
Most probably it doesn't matter because pandas stores each column separately anyway (DataFrame is a collection of Series). But you might get better data locality (all data next to each other in memory) by using a single frame, so it's worth trying. Check this empirically.
Rereading this post, I realize I could have been clearer. I have been using write statements like:
dm.iloc[p,XCol] = dh.iloc[x,XCol]
to transfer individual cells of one dataframe (dh) to a different row of a second dataframe (dm). It ran very slowly, but I needed this specific file sorted, so I just lived with the performance.
According to Learning Pandas by Michael Heydt (p. 146), .iat is faster than .iloc for extracting (or writing) scalar values in a dataframe. I tried it and it works. With my original 300k-row files, the run time was 13 hours(!) using .iloc; the same data file using .iat ran in about 5 minutes.
Net - this is faster:
dm.iat[p,XCol] = dh.iat[x,XCol]
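A minimal, self-contained sketch of the pattern above (the names XCol, p, and x are taken from the snippet; the data is made up). .iat is a dedicated scalar accessor, so it skips the general indexing machinery that .iloc runs on every call:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the real dataframes dh (source) and dm (destination).
dh = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
dm = pd.DataFrame(np.zeros((4, 3)), columns=list("ABC"))

XCol = 1   # column position
p, x = 0, 2  # destination row in dm, source row in dh

# Scalar copy via .iat - same effect as .iloc, much less per-call overhead.
dm.iat[p, XCol] = dh.iat[x, XCol]
print(dm.iat[p, XCol])  # 7
```

When the copies follow a pattern, a single vectorized assignment (e.g. assigning a whole column or a boolean-indexed slice at once) is faster still than any per-cell loop.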

Out-of-memory join of two csv's with varying number of columns

I have a historic CSV that needs to be updated (concatenated) daily with a freshly pulled CSV. The issue is that the new CSV may have a different number of columns from the historic one. If both files were small, I could just read them in and concatenate with pandas. If the number of columns were the same, I could use cat in a command-line call. Unfortunately, neither is true.
So I am wondering whether there is a way to do an out-of-memory concatenation/join with pandas for something like the above, or with one of the command-line tools.
Thanks!
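One way to sketch this with pandas alone: read only the headers to compute the union of columns (cheap, no data loaded), then stream both files in chunks, aligning each chunk to the full column set before appending to the output. The file names and columns below are tiny made-up stand-ins for the real files:

```python
import pandas as pd

# Tiny stand-ins for the real historic and fresh CSVs (hypothetical names).
pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv("historic.csv", index=False)
pd.DataFrame({"a": [5], "c": [6]}).to_csv("new.csv", index=False)

# nrows=0 reads just the header row, so no data is loaded into memory.
cols = list(pd.read_csv("historic.csv", nrows=0).columns)
cols += [c for c in pd.read_csv("new.csv", nrows=0).columns if c not in cols]

first = True
for path in ("historic.csv", "new.csv"):
    # chunksize bounds memory; reindex aligns each chunk to the full
    # column set, filling missing columns with NaN.
    for chunk in pd.read_csv(path, chunksize=100_000):
        chunk.reindex(columns=cols).to_csv(
            "combined.csv", mode="w" if first else "a",
            header=first, index=False)
        first = False

print(pd.read_csv("combined.csv").columns.tolist())  # ['a', 'b', 'c']
```

Only one chunk is ever in memory, so this scales to files far larger than RAM; the chunksize can be tuned to the available memory.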

How can I create a formatted and annotated excel with embedded pandas DataFrames

I want to create a "presentation-ready" Excel document with embedded pandas DataFrames plus additional data and formatting.
A typical document will include some titles and metadata, and several DataFrames, each with a sum row/column.
The DataFrames themselves should be formatted as well.
The best thing I found was this, which explains how to use pandas with XlsxWriter.
The main problem is that there's no apparent method to get the exact location of the embedded DataFrame in order to add the summary row below it (the shape of the DataFrame is a good estimate, but it might not be exact when rendering complex DataFrames).
A solution that relies on some kind of template, rather than hard-coding, would be even better.
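A minimal sketch of the layout arithmetic, assuming XlsxWriter is installed and a simple (non-MultiIndex) frame, where the frame's position follows directly from startrow and len(df). The file name, sheet name, and data are made up:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"product": ["a", "b"], "sales": [10, 20]})

with pd.ExcelWriter("report.xlsx", engine="xlsxwriter") as writer:
    # Leave two rows above the frame for a title.
    startrow = 2
    df.to_excel(writer, sheet_name="Summary", startrow=startrow, index=False)
    book, sheet = writer.book, writer.sheets["Summary"]
    sheet.write(0, 0, "Monthly report", book.add_format({"bold": True}))
    # For a flat frame, the block occupies rows startrow (header) through
    # startrow + len(df) (last data row), so the sum row goes right after.
    sum_row = startrow + len(df) + 1
    sheet.write(sum_row, 0, "Total")
    sheet.write(sum_row, 1, float(df["sales"].sum()))
```

With MultiIndex columns or indexes the header takes extra rows, which is exactly the "might not be exact" problem the question describes; there the offsets have to account for the number of header levels too.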

Using Pandas to create, read, and update hdf5 file structure

We would like the HDF5 files themselves to define their columns, indexes, and column types, instead of maintaining a separate file that defines the structure of the HDF5 data.
How can I create an empty HDF5 file from Pandas with a specific table structure like:
Columns
id (Int)
name (Str)
update_date (datetime)
some_float (float)
Indexes
id
name
Once the HDF5 file is created and saved to disk, how do I retrieve the column and index information without having to open the file completely each time, since it will likely contain several GB of data?
Many thanks in advance...
-- UPDATE --
Thanks for the comments. To clarify a bit more:
We do have some experience with Pandas, but we are by no means proficient. The part that is tripping us up is creating an empty data structure and then reading that structure back from a file without fully opening it. All of the Pandas examples involve data, and they show only two ways to retrieve data/structure: read the entire frame into memory, or issue a where clause. In our case, we would like to be able to see the table structure without query operations, if possible.
I know this is an odd case. Why the heck would you want an empty dataframe? Well, we want a great deal of flexibility in moving data around, and we want to be able to define a target dataframe structure prior to writing the data, which could take place much later (e.g. hours or days). Since the HDF5 specification maintains all of that information, it seems directionally incorrect to store the table structure separately. Thus our desire to crack the code on this subject.
-- UPDATE 2 --
To add more detail, as @jeff requested.
We would like to abstract some of the common Pandas functions like summing data or merging two frames. Thus we would like to be able to ask each frame what their columns are so we can present a view for the user to select the result frame columns.
For example, if we imported a CSV with columns A, B, C, D, and V and saved the frame to HDF5 as my_csv.hdf then we would be able to determine the columns by opening the file.
However, in our use case it is likely that the import frame for the CSV could be cleared periodically and no longer contain the data. Knowing that the my_csv frame has certain columns and types matters because we want a user to be able to select those columns for summing in a downstream operation. Let's say a user wants to sum column V by the values in columns A and B only and save the frame as my_sum. Since we can't ensure my_csv will always have data, we would like to ensure it at least contains the structure.
Open to other suggestions, obviously. It is also possible to store the table structure info in the user_block. Again, this is not ideal because the structure is then kept in two different places, though I suppose we could always update the user_block on save with the latest column and index information for the frame. However, I believe the to_* operations in Pandas will blow away the user_block, so... blah. I feel like I'm talking myself into maintaining a peer structure definition, but I would REALLY love some suggestions so we don't have to do that.
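One way to read the structure without loading the data, assuming PyTables is installed and the frame was written in pandas' 'table' format (which stores queryable metadata in the file): open the store and inspect the storer object rather than the data. The file name, key, and columns below are made up for illustration:

```python
import pandas as pd

# Hypothetical frame written in 'table' format; data_columns become
# individually addressable columns in the stored table.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"],
                   "some_float": [0.1, 0.2]})
df.to_hdf("my_csv.hdf", key="df", format="table",
          data_columns=["id", "name"])

# Inspect the stored structure without reading any rows into memory.
with pd.HDFStore("my_csv.hdf", mode="r") as store:
    storer = store.get_storer("df")
    colnames = storer.table.colnames  # column names from table metadata
    nrows = storer.nrows              # row count from metadata
print(colnames, nrows)
```

Note that only data_columns appear as named columns in the table metadata (the rest are packed into values blocks), so a frame whose structure must be fully introspectable this way would need data_columns=True when written. Writing a zero-row frame with the right dtypes would give the empty-but-typed structure the question asks for.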
