Transforming data in pandas - python

What would be the best way to approach this problem using python and pandas?
I have an excel file of electricity usage. It comes in an awkward structure and I want to transform it so that I can compare it to weather data based on date and time.
The structure looks like this (foo is a string and xx is a number):
100,foo,foo,foo,foo
200,foo,foo,foo,foo,foo,0000,kWh,15
300,20181101,xx,xx,xx,xx...(96 columns)xx,A
... several hundred more 300 type rows
The 100 and 200 rows identify the meter and provide a partial schema, i.e. the data is in kWh at 15-minute intervals. The 300 rows contain a date, 96 columns of 15-minute power consumption (96 = 24 hours × 4 blocks per hour), and one column with a data quality flag.
I have previously processed all the data in other tools but I'm trying to learn how to do it in Python (jupyter notebook to be precise) and tap into the far more advanced analysis, modeling and visualisation tools available.
I think the thing to do is transform the data into a series of datetime and power. From there I can aggregate, filter and compare however I like.
I am at a loss even to know what question to ask or what resource to look up to tackle this problem. I could just import the 300 rows as is and loop through the rows and columns to create a new series in the right structure - easy enough to do. However, I strongly suspect there is a built-in method for doing this sort of thing and I would greatly appreciate any advice on what the best strategies might be. Possibly I don't need to transform the data at all.

You can read the data easily enough into a DataFrame; you just have to step over the metadata rows, e.g.:
df = pd.read_csv(<file>, skiprows=[0,1], index_col=1, parse_dates=True, header=None)
This will read in the CSV, skip the first two lines, make the date column the index, and try to parse it as dates.
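From there, a rough sketch of the reshape into a single datetime/power series might look like the following. The filename and exact column positions are assumptions based on the layout described in the question (column 0 is the 300 marker, column 1 the date, columns 2-97 the 96 readings, the last column the quality flag); the two metadata rows are skipped as in the read_csv call above.

import pandas as pd

df = pd.read_csv("usage.csv", skiprows=[0, 1], header=None)  # hypothetical filename

dates = pd.to_datetime(df[1], format="%Y%m%d")        # column 1 holds the YYYYMMDD date
readings = df.iloc[:, 2:98].copy()                    # the 96 interval columns
readings.index = dates
readings.columns = pd.timedelta_range("0 hours", periods=96, freq="15min")

# Stack into one long Series indexed by the actual timestamp of each reading.
power = readings.stack()
power.index = power.index.get_level_values(0) + power.index.get_level_values(1)
power = power.sort_index().rename("kWh")

With the data in that shape, resampling to hourly or daily totals (e.g. power.resample("h").sum()) and joining against a weather series on the timestamp index becomes straightforward.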

Related

Undersampling large dataset under specific condition applied to other column in python/pandas

I am currently working with a large dataset (about 40 columns and tens of thousands of rows) and I would like to undersample it to be able to work with it more easily.
For the undersampling, unlike the pandas resample method, which resamples according to a timedelta, I'm trying to specify conditions on other columns to determine which data points to keep.
I'm not sure this is clear, but for example, let's say I have 3 columns (index, time and temperature) as follows:
Now for the resampling, I would like to keep a data point every 1 s or every 2 °C; the resulting dataset would look like this:
I couldn't find a simple way of doing this with pandas. The only way I found was to iterate over the rows, but that was very slow because of the size of my datasets.
I thought about using the diff method, but it can only determine the difference over a specified period; the same goes for pct_change, which could have been used to keep only the points in the regions where the variations are largest.
Thanks in advance if you have any suggestions on how to proceed with this resampling.
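One vectorized approximation of that rule, sketched under the assumption that time is in seconds and temperature in °C (the column names and example data below are made up): bucket each row by elapsed time (1 s bins) and by cumulative absolute temperature change (2 °C bins), then keep the first row of every new bucket.

import numpy as np
import pandas as pd

# Made-up example data standing in for the real columns.
df = pd.DataFrame({
    "time": np.arange(0, 10, 0.1),
    "temperature": np.cumsum(np.random.uniform(-0.5, 0.5, 100)),
})

# Bucket by elapsed time (1 s) and by cumulative absolute temperature change (2 °C).
time_bin = (df["time"] - df["time"].iloc[0]) // 1
temp_bin = df["temperature"].diff().abs().fillna(0).cumsum() // 2

# Keep the first row of each new bucket (either condition advancing keeps the row).
keep = (time_bin != time_bin.shift()) | (temp_bin != temp_bin.shift())
undersampled = df[keep]

This avoids a Python-level loop; if the thresholds need to reset exactly at each kept point rather than at fixed bins, a compiled loop (e.g. with numba) is probably the simpler route.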

How to speed up the same calculation on sub-combinations of pandas DataFrame columns?

I'm looking to apply the same function to multiple sub-combinations of a pandas DataFrame. Imagine the full DataFrame having 15 columns; if I want to draw from it sub-frames containing 10 columns, I would have 3003 such sub-frames in total. My current approach is to use multiprocessing, which works well for a full DataFrame with about 20 columns (184,756 combinations); however, the real full frame has 50 columns, leading to more than 10 billion combinations, at which point it will take too long. Is there any library that would be suitable for this type of calculation? I have used dask before and it's incredibly powerful, but dask is only suitable for calculations on a single DataFrame, not different ones.
Thanks.
It's hard to answer this question without a MVCE. The best path forward depends on what you want to do with your 10 billion DataFrame combinations (write them to disk, train models, aggregations, etc.).
I'll provide some high-level advice:
Using a columnar file format like Parquet allows for column pruning (grabbing certain columns rather than all of them), which can be memory efficient.
Persisting the entire DataFrame in memory with ddf.persist() may be a good way for you to handle this combinations problem so you're not constantly reloading it.
Feel free to add more detail about the problem and I can add a more detailed solution.
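A minimal sketch of those two suggestions, assuming the full frame has been written to Parquet (the file name, column names and the per-combination calculation are placeholders, not the asker's real setup):

import dask.dataframe as dd

# Column pruning: read only the columns one combination needs from Parquet.
cols = ["col_a", "col_b", "col_c"]
sub = dd.read_parquet("full_frame.parquet", columns=cols)
result = sub.corr().compute()  # stand-in for the real per-combination calculation

# Alternatively, load the full frame once, keep it in (distributed) memory,
# and slice each combination from it instead of re-reading the file.
ddf = dd.read_parquet("full_frame.parquet").persist()
sub2 = ddf[cols]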

Pandas and complicated filtering and merge/join multiple sub-data frames

I have a seemingly complicated problem and I have a general idea of how I should solve it but I am not sure if it is the best way to go about it. I'll give the scenario and would appreciate any help on how to break this down. I'm fairly new with Pandas so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a dataframe. The example I am working through contains 2742 rows × 136 columns. The rows are variable but the columns are set. I have a set of 23 lookup tables (also CSV files), named per year and quarter (the range is 2015 1st quarter to 2020 3rd quarter). The lookup files are named like this: PPRRVU203.csv, which contains values from the 3rd quarter of 2020. The lookup tables are matched on two columns ('Code' and 'Mod'), and I use three values that are associated in the lookup.
I am trying to filter sections of my data frame, pull the correct values from the matching lookup file, merge back into the original subset, and then replace into the original dataframe.
Thoughts
I can probably abstract this and wrap it in a function, but I'm not sure how to place the results back in. My question, for those who understand pandas better than I do: what is the best method to filter, replace the values, and write the file back out?
The straightforward solution would be to filter the original dataframe into 23 separate dataframes, do the merge on each one, then concat into a new dataframe and output to CSV.
This seems highly inefficient?
I can post code, but I am looking more for high-level thoughts.
Not sure exactly what your DataFrame looks like, but the pandas DataFrame.query() method may prove useful for selecting the data.
name = df.query('columnname == "something"')
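As a rough sketch of the filter / merge / concat flow described in the question (the main file name, the quarter column and the value column names are assumptions; the lookup is assumed to have one row per Code/Mod pair):

import pandas as pd

df = pd.read_csv("main.csv")  # hypothetical main file

pieces = []
for quarter in df["Quarter"].unique():  # hypothetical column, e.g. "203" for 2020 Q3
    lookup = pd.read_csv(f"PPRRVU{quarter}.csv")
    subset = df[df["Quarter"] == quarter]
    merged = subset.merge(
        lookup[["Code", "Mod", "Value1", "Value2", "Value3"]],  # value names assumed
        on=["Code", "Mod"],
        how="left",
    )
    pieces.append(merged)

result = pd.concat(pieces, ignore_index=True)
result.to_csv("main_updated.csv", index=False)

The 23 intermediate frames look wasteful, but each merge is vectorized, so this is usually much faster than updating rows one at a time.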

Making Dataframe Analysis faster

I am using three dataframes to analyze sequential numeric data - basically numeric data captured in time. There are 8 columns and 360k entries. I created three identical dataframes - one is the raw data, the second a "scratch pad" for analysis, and the third contains the analyzed outcome. This runs really slowly. I'm wondering if there are ways to make this analysis run faster? Would it be faster if, instead of three separate 8-column dataframes, I had one large 24-column dataframe?
Use cProfile and line_profiler to figure out where the time is being spent.
To get help from others, post your real code and your real profile results.
Optimization is an empirical process. The little tips people have are often counterproductive.
Most probably it doesn't matter because pandas stores each column separately anyway (DataFrame is a collection of Series). But you might get better data locality (all data next to each other in memory) by using a single frame, so it's worth trying. Check this empirically.
Rereading this post, I realize I could have been clearer. I have been using write statements like:
dm.iloc[p,XCol] = dh.iloc[x,XCol]
to transfer individual cells of one dataframe (dh) to a different row of a second dataframe (dm). It ran very slowly but I needed this specific file sorted and I just lived with the performance.
According to "Learning Pandas" by Michael Heydt (p. 146), ".iat" is faster than ".iloc" for extracting (or writing) scalar values from a dataframe. I tried it and it works. With my original 300k-row files, run time was 13 hours(!) using ".iloc"; the same datafile using ".iat" ran in about 5 minutes.
Net - this is faster:
dm.iat[p,XCol] = dh.iat[x,XCol]
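A small, self-contained comparison of the two access methods on the same scalar-write pattern (the sizes, column index and random row order are arbitrary stand-ins, not the original data):

import time
import numpy as np
import pandas as pd

dh = pd.DataFrame(np.random.rand(50_000, 8))
dm = dh.copy()
XCol = 3
order = np.random.permutation(len(dh))

# Scalar writes with .iloc
start = time.perf_counter()
for p, x in enumerate(order):
    dm.iloc[p, XCol] = dh.iloc[x, XCol]
iloc_time = time.perf_counter() - start

# The same writes with .iat
start = time.perf_counter()
for p, x in enumerate(order):
    dm.iat[p, XCol] = dh.iat[x, XCol]
iat_time = time.perf_counter() - start

print(f".iloc: {iloc_time:.1f} s   .iat: {iat_time:.1f} s")

Where the row mapping allows it, a single vectorized assignment (e.g. dm.iloc[:, XCol] = dh.iloc[order, XCol].to_numpy()) avoids the per-cell Python overhead entirely.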

How to store multiple related time series in Pandas

I'm new to Pandas and would like some insight from the pros. I need to perform various statistical analyses (multiple regression, correlation etc) on >30 time series of financial securities' daily Open, High, Low, Close prices. Each series has 500-1500 days of data. As each analysis looks at multiple securities, I'm wondering if it's preferable from an ease of use and efficiency perspective to store each time series in a separate df, each with date as the index, or to merge them all into a single df with a single date index, which would effectively be a 3d df. If the latter, any recommendations on how to structure it?
Any thoughts much appreciated.
PS. I'm working my way up to working with intraday data across multiple timezones but that's a bit much for my first pandas project; this is a first step in that direction.
Since you're only dealing with OHLC, it's not that much data to process, so that's good.
For these types of things I usually use a MultiIndex (http://pandas.pydata.org/pandas-docs/stable/indexing.html) with symbol as the first level and date as the second. Then you can have just the columns OHLC and you're all set.
To access the MultiIndex, use the .xs function.
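A minimal sketch of that layout (the symbols, dates and prices below are made up):

import numpy as np
import pandas as pd

# (symbol, date) MultiIndex with OHLC columns.
idx = pd.MultiIndex.from_product(
    [["AAPL", "MSFT"], pd.date_range("2023-01-02", periods=3)],
    names=["symbol", "date"],
)
prices = pd.DataFrame(
    np.random.rand(len(idx), 4),
    index=idx,
    columns=["Open", "High", "Low", "Close"],
)

# .xs pulls out one symbol's time series, dropping that index level.
aapl = prices.xs("AAPL", level="symbol")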
Unless you are going to correlate everything with everything, my suggestion is to put this into separate dataframes and put them all in a dictionary, i.e. {"Timeseries1": df1, "Timeseries2": df2, ...}. Then, when you want to correlate some time series together, you can merge them and add suffixes to the columns of each df to differentiate between them.
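A short sketch of that dictionary-of-frames approach (the series names and values are placeholders):

import numpy as np
import pandas as pd

dates = pd.date_range("2023-01-02", periods=5)
frames = {
    "Timeseries1": pd.DataFrame({"Close": np.random.rand(5)}, index=dates),
    "Timeseries2": pd.DataFrame({"Close": np.random.rand(5)}, index=dates),
}

# Merge on the shared date index only when a joint analysis is needed,
# using suffixes to tell the columns apart.
joint = frames["Timeseries1"].merge(
    frames["Timeseries2"],
    left_index=True,
    right_index=True,
    suffixes=("_ts1", "_ts2"),
)
correlation = joint["Close_ts1"].corr(joint["Close_ts2"])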
You may also be interested in the talk "Python for Financial Data Analysis with pandas" by the author of pandas himself.
