Extending a pandas panel frame along the major (timeseries) axis - python

I have some time-series data which is stored in a slightly strange format. I want to parse it into a pandas.Panel.
The data come from various 'locations'. Data from each location are contiguous in the file but the time series from any given location is split into separate 'chunks'. There should be no overlap between the time-chunks for one location.
I have been reading each location-time-chunk into a pandas.Panel with:
Item axis = location
Major axis = DatetimeIndex
I'd like to extend the Panel's axis to accommodate any new chunks of a location's time axis.
import numpy as np
import pandas as pd
# we'll get data like this from the file
time_chunk_1 = pd.date_range(start='2010-10-01T00:00:00', periods=20,
                             freq='10S')
fake_data = np.cumsum(np.random.randn(len(time_chunk_1)))
mars_data_1 = pd.DataFrame(data=fake_data, index=time_chunk_1,
                           columns=['X'])
pluto_data_1 = pd.DataFrame(data=fake_data, index=time_chunk_1,
                            columns=['X'])
# gather the data in a panel
planet_data = pd.Panel(data={'Mars': mars_data_1, 'Pluto': pluto_data_1})
# further down the file we'll encounter data like this
time_chunk_2 = pd.date_range(start='2010-10-01T00:03:20', periods=20,
                             freq='10S')
mars_data_2 = pd.DataFrame(data=fake_data[::-1], index=time_chunk_2,
                           columns=['X'])
# I can make a DataFrame of the whole Mars time-series
mars_data_all = planet_data['Mars'].append(mars_data_2)
# but setting a frame of the panel doesn't extend the major axis
planet_data['Mars'] = mars_data_all
After I've collected the chunks, I'd like the following to be true:
planet_data.Mars.index is mars_data_all.index
I've tried permutations of:
setting a new frame in the panel (planet_data['AllMars'] = mars_data_all)
pandas.Panel.reindex
pandas.Panel.replace
It seems like I'm maybe getting confused between the underlying data and views on it. I've looked at these (1, 2) related questions but I'm still stuck. It feels like I'm probably missing something obvious.

I found that the following works, for some value of 'works'. I didn't have the rep to answer my own question quickly, so this answer first appeared as a comment.
This works:
planet_data = planet_data.reindex(major_axis=mars_data_all.index)
planet_data['Mars'] = mars_data_all
in the sense that it passes:
assert (planet_data.Mars.X == mars_data_all.X).all()
assert planet_data.Mars.index is mars_data_all.index
For a significant dataset, I suspect we would run into the hashtable not getting garbage collected issue raised in this answer. There is probably a better way of doing this.
In real life the data turn out to be much bigger, hairier and more misaligned than those in my example code; so much so that reindexing will not work.
I'll probably end up using a dict of DataFrames rather than a Panel as my outer data structure.
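For reference, here is a minimal sketch of that dict-of-DataFrames idea, reusing the example data from the question (the helper name add_chunk is mine):
# keep one growing DataFrame per location, concatenating new chunks as they are parsed
planet_frames = {'Mars': mars_data_1, 'Pluto': pluto_data_1}

def add_chunk(frames, location, chunk):
    # append a new time-chunk for a location, creating the entry if needed
    if location in frames:
        frames[location] = pd.concat([frames[location], chunk]).sort_index()
    else:
        frames[location] = chunk

add_chunk(planet_frames, 'Mars', mars_data_2)
assert planet_frames['Mars'].index.equals(mars_data_all.index)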

Related

How to limit index column width/height when displaying a pandas dataframe?

I have a dataframe that looks like this:
df = pd.DataFrame(data=list(range(0, 10)),
                  index=pd.MultiIndex.from_product([[str(list(range(0, 1000)))], list(range(0, 10))],
                                                   names=["ind1", "ind2"]),
                  columns=["col1"])
df['col2'] = str(list(range(0, 1000)))
Unfortunately, the display of the above dataframe looks like this:
If I try to set: pd.options.display.max_colwidth = 5, then col2 behaves and it is displayed in a single row, but ind1 doesn't behave:
Since ind1 is part of a MultiIndex, I don't mind that it occupies multiple rows, but I would like to limit its width. If I could also force each row to occupy at most the height of a single line, that would be great as well. I don't mind individual cells being truncated on display, because I prefer to scroll less, in any direction, to see a cell.
I am aware I can create my own HTML display. That's great and all, but I think it's too complex for my use case of just wanting smaller width columns for data analysis in jupyter notebooks. Nevertheless, such a solution might help other similar use cases, if you are inclined to write one.
What I'm looking for is some setting, which I thought was pd.options.display.max_colwidth, that limits the column width even if it's an index. Something that disables wrapping for long texts would probably help with the same issue as well.
I also tried to just print without the index df.style.hide_index(), in combination with pd.options.display.max_colwidth = 5, but then col2 stops behaving:
About now I run out of ideas. Any suggestions?
Here is one way to do it:
import pandas as pd
df = pd.DataFrame(
    data=list(range(0, 10)),
    index=pd.MultiIndex.from_product(
        [[str(list(range(0, 1000)))], list(range(0, 10))], names=["ind1", "ind2"]
    ),
    columns=["col1"],
)
df["col2"] = str(list(range(0, 1000)))
In the next Jupyter cell, run:
df.style.set_properties(**{"width": "10"}).set_table_styles(
    [{"selector": "th", "props": [("vertical-align", "top")]}]
)
This renders the table with narrow, top-aligned columns.
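If the goal is only a narrower display in the notebook, one further idea (untested against your exact data, and purely cosmetic) is to truncate the long ind1 labels on a display-only copy; rename returns a new frame, so the underlying data are untouched:
# shorten the "ind1" labels for display only, then style as above
df_display = df.rename(index=lambda s: str(s)[:30] + "...", level="ind1")
df_display.style.set_properties(**{"width": "10"})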

how to make a sparse pandas DataFrame from a csv file

I have a rather large (1.3 GB, unzipped) csv file, with 2 dense columns and 1.4 K sparse columns, about 1 M rows.
I need to make a pandas.DataFrame from it.
For small files I can simply do:
df = pd.read_csv('file.csv')
For the large file I have now, I get a memory error, clearly due to the DataFrame size (tested by sys.getsizeof(df)).
Based on this document:
https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating
it looks like I can make a DataFrame with mixed dense and sparse columns.
However, I can only see instructions to add individual sparse columns, not a chunk of them all together, from the csv file.
Reading the csv sparse columns one by one and adding them to df using:
for colname_i in names_of_sparse_columns:
    data = pd.read_csv('file.csv', usecols = [colname_i])
    df[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])
works, and df stays very small, as desired, but the execution time is absurdly long.
I tried of course:
pd.read_csv(path_to_input_csv, usecols = names_of_sparse_columns, dtype = "Sparse[float]")
but that generates this error:
NotImplementedError: Extension Array: <class 'pandas.core.arrays.sparse.array.SparseArray'> must implement _from_sequence_of_strings in order to be used in parser methods
Any idea how I can do this more efficiently?
I checked several posts, but they all seem to be after something slightly different from this.
EDIT: adding a small example to clarify.
import numpy as np
import pandas as pd
import sys
# Create an unpivoted sparse dataset
lengths = list(np.random.randint(low = 1, high = 5, size = 10000))
cols = []
for l in lengths:
    cols.extend(list(np.random.choice(100, size = l, replace = False)))
rows = np.repeat(np.arange(10000), lengths)
vals = np.repeat(1, sum(lengths))
df_unpivoted = pd.DataFrame({"row" : rows, "col" : cols, "val" : vals})
# Pivot and save to a csv file
df = df_unpivoted.pivot(index = "row", columns = "col", values = "val")
df.to_csv("sparse.csv", index = False)
This file occupies 1 MB on my PC.
Instead:
sys.getsizeof(df)
# 8080016
This looks like 8 MB to me.
So there is clearly a large increase in size when making a pd.DataFrame from a sparse csv file (in this case I made the file from the data frame, but it's the same as reading in the csv file using pd.read_csv()).
And this is my point: I cannot use pd.read_csv() to load the whole csv file into memory.
Here it's only 8 MB, that's no problem at all; with the actual 1.3 GB csv I referred to, it goes to such a huge size that it crashes our machine's memory.
I guess it's easy to try that, by replacing 10000 with 1000000 and 100 with 1500 in the above simulation.
If I do instead:
names_of_sparse_columns = df.columns.values
df_sparse = pd.DataFrame()
for colname_i in names_of_sparse_columns:
    data = pd.read_csv('sparse.csv', usecols = [colname_i])
    df_sparse[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])
The resulting object is much smaller:
sys.getsizeof(df_sparse)
# 416700
In fact even smaller than the file.
And this is my second point: doing this column-by-column addition of sparse columns is very slow.
I was looking for advice on how to make df_sparse from a file like "sparse.csv" faster / more efficiently.
In fact, while I was writing this example, I noticed that:
sys.getsizeof(df_unpivoted)
# 399504
So maybe the solution could be to read the csv file line by line and unpivot it. The rest of the handling I need to do however would still require that I write out a pivoted csv, so back to square one.
EDIT 2: more information
I should also describe the rest of the handling I need to do.
When I can use a non-sparse data frame, there is an ID column in the file:
df["ID"] = list(np.random.choice(20, df.shape[0]))
I need to make a summary of how many data points exist, per ID, per data column:
df.groupby("ID").count()
The unfortunate bit is that the sparse data frame does not support this.
I found a workaround, but it's very inefficient and slow.
If anyone can advise on that aspect, too, it would be useful.
I would have guessed there would be a way to load the sparse part of the csv into some form of sparse array, and make a summary by ID.
Maybe I'm approaching this completely the wrong way, and that's why I am asking this large competent audience for advice.
I don't have the faintest idea why someone would have made a CSV in that format. I would just read it in as chunks and fix the chunks.
# Read in chunks of data, melt it into a dataframe that makes sense
data = [c.melt(id_vars=dense_columns, var_name="Column_label", value_name="Thing").dropna()
        for c in pd.read_csv('file.csv', iterator=True, chunksize=100000)]
# Concat the data together
data = pd.concat(data, axis=0)
Change the chunksize and the name of the value column as needed. You could also read in chunks and turn the chunks into a sparse dataframe if needed, but it seems that you'd be better off with a melted dataframe for what you want to do, IMO.
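If the eventual goal is the per-ID count from the question's second edit, the melted frame supports it directly. A sketch, assuming the ID column is included in dense_columns:
# non-missing values per ID and per original sparse column,
# reshaped back to one row per ID
summary = (data.groupby(["ID", "Column_label"])["Thing"]
               .count()
               .unstack(fill_value=0))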
You can always chunk it again going the other way as well. Change the number of chunks as needed for your data.
with open('out_file.csv', mode='w') as out:
    for i, chunk in enumerate(np.array_split(df, 100)):
        chunk.iloc[:, 2:] = chunk.iloc[:, 2:].sparse.to_dense()
        chunk.to_csv(out, header=i==0)
The same file.csv should not be read on every iteration; this line of code:
data = pd.read_csv('file.csv', ...)
should be moved ahead of the for-loop.
To iterate through names_of_sparse_columns:
data = pd.read_csv('file.csv', header = 0)
df = data.copy()
for colname_i in names_of_sparse_columns:
    dataFromThisSparseColumn = data[colname_i]
    df[colname_i] = pd.arrays.SparseArray(dataFromThisSparseColumn.values)
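Building on the same point about reading the file only once, here is a hedged sketch (assuming pandas >= 1.0 and numeric sparse columns) that converts each chunk to sparse dtypes as it is read, so only one dense chunk is held in memory at a time:
import numpy as np
import pandas as pd

sparse_dtype = pd.SparseDtype("float", fill_value=np.nan)
chunks = []
for chunk in pd.read_csv('file.csv', chunksize=100000):
    # convert just the sparse columns of this chunk before keeping it
    chunk[names_of_sparse_columns] = chunk[names_of_sparse_columns].astype(sparse_dtype)
    chunks.append(chunk)
df_sparse = pd.concat(chunks, ignore_index=True)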

Should pandas dataframes be nested?

I am creating a python script that drives an old fortran code to locate earthquakes. I want to vary the input parameters to the fortran code in the python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (i.e. a dataframe assigned to an element of a dataframe). So for example:
import pandas as pd
import numpy as np
def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res
# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object) # make sure generic types can be used
# loop over each row, call some_operation and store results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class but I am not sure if it is the proper solution for my application. I would hate to forge ahead and apply the hack shown above to some code and then have it not supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
>>> df2 = pd.DataFrame({'a':[100], 'b':[200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
     a    b
0  100  200
1    2    5
2    3    6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)
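One alternative to nesting, just as a sketch (the level names 'run' and 'sample' are mine): collect every result into a single DataFrame with a MultiIndex keyed by the df_master row that produced it.
# the outer index level is the df_master row label,
# the inner level is the result frame's own row number
all_results = pd.concat(
    {ind: some_operation(row) for ind, row in df_master.iterrows()},
    names=["run", "sample"],
)
all_results.loc[0]  # results for the first parameter set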

Filling missing data with different methods

I have a couple of sets of data with timestamp, value and quality flag. The value and quality flag are missing for some of the timestamps, and need to be filled depending on the surrounding data. I.e.:
If the quality flags on the valid data bracketing the NaN data are different, then set the value and quality flag to the same as the bracketing row with the highest quality flag. In the example below, the first set of NaNs would be replaced with qf=3 and value=3.
If the quality flags are the same, then interpolate the value between the two valid values on either side. In the example, the second set of NaNs would be replaced by qf = 1 and v = 6 and 9.
Code:
from datetime import datetime
import pandas as pd

start = datetime.strptime("2004-01-01 00:00", "%Y-%m-%d %H:%M")
end = datetime.strptime("2004-01-01 03:00", "%Y-%m-%d %H:%M")
df = pd.DataFrame(
    data = {'v' : [1, 2, 'NaN', 'NaN', 'NaN', 3, 2, 1, 5, 3, 'NaN', 'NaN', 12, 43, 23, 12, 32, 12, 12],
            'qf': [1, 1, 'NaN', 'NaN', 'NaN', 3, 1, 5, 1, 1, 'NaN', 'NaN', 1, 3, 4, 2, 1, 1, 1]},
    index = pd.date_range(start, end, freq="10min"))
I have tried to solve this by finding the NA rows and looping through them to fix the first criterion, then using interpolate to solve the second. However, this is really slow, as I am working with a large set.
One approach would just be to do all the possible fills and then choose among them as appropriate. After doing df = df.astype(float) if necessary (your example uses the string "NaN"), something like this should work:
is_null = df.qf.isnull()
fill_down = df.ffill()
fill_up = df.bfill()
df.loc[is_null & (fill_down.qf > fill_up.qf)] = fill_down
df.loc[is_null & (fill_down.qf < fill_up.qf)] = fill_up
df = df.interpolate()
It does more work than is necessary, but it's easy to see what it's doing, and the work that it does do is vectorized and so happens pretty quickly. On a version of your dataset expanded to be ~10M rows (with the same density of nulls), it takes ~6s on my old notebook. Depending on your requirements that might suffice.
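As a quick usage check against the example frame (a sketch; the example stores the literal string 'NaN', so first do df = df.astype(float), then run the six lines above):
# the two NaN runs in the example, after filling
print(df.loc['2004-01-01 00:20':'2004-01-01 00:40'])  # expect v = 3, qf = 3
print(df.loc['2004-01-01 01:40':'2004-01-01 01:50'])  # expect v = 6 and 9, qf = 1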

Saving/loading a table (with different column lengths) using numpy

A bit of context: I am writing code to save the data I plot to a text file. This data should be stored in such a way that it can be loaded back using a script, so it can be displayed again (but this time without performing any calculation). The initial idea was to store the data in columns with a format x1,y1,x2,y2,x3,y3...
I am using a code which would be simplified to something like this (incidentally, I am not sure if using a list to group my arrays is the most efficient approach):
import numpy as np
MatrixResults = []
x1 = np.array([1,2,3,4,5,6])
y1 = np.array([7,8,9,10,11,12])
x2 = np.array([0,1,2,3])
y2 = np.array([0,1,4,9])
MatrixResults.append(x1)
MatrixResults.append(y1)
MatrixResults.append(x2)
MatrixResults.append(y2)
MatrixResults = np.array(MatrixResults)
TextFile = open('/Users/UserName/Desktop/Datalog.txt',"w")
np.savetxt(TextFile, np.transpose(MatrixResults))
TextFile.close()
However, this code gives an error when any of the data sets have different lengths. Reading similar questions:
Can numpy.savetxt be used on N-dimensional ndarrays with N>2?
Table, with the different length of columns
However, these require breaking the format (either by flattening or by padding the shorter columns with filler strings).
My issue summarises as:
1) Is there any method that, while the arrays are transposed, saves them individually as consecutive columns?
2) Or is there any way to append columns to a text file (given a certain number of rows and columns to skip)?
3) Should I try this with another library such as pandas?
Thank you very much for any advice.
Edit 1:
After looking a bit more, it seems that leaving blank spaces is more inefficient than filling the lists.
In the end I wrote my own solution (not sure if there is a numpy function for this) in which I pad the arrays to a common length with "nan" values.
To get the data back I use the genfromtxt method and then I use this line:
x = x[~np.isnan(x)]
to remove these cells from the arrays.
If I find a better solution I will post it :)
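For reference, a minimal sketch of that padding approach, using the arrays and file path from the example above (the helper variables are mine):
# pad every array with np.nan up to the length of the longest one,
# then save column-wise as before
columns = [x1, y1, x2, y2]
n_max = max(len(c) for c in columns)
padded = np.full((len(columns), n_max), np.nan)
for i, c in enumerate(columns):
    padded[i, :len(c)] = c
np.savetxt('/Users/UserName/Desktop/Datalog.txt', padded.T)
# read back and drop the padding again
data = np.genfromtxt('/Users/UserName/Desktop/Datalog.txt')
x2_back = data[:, 2]
x2_back = x2_back[~np.isnan(x2_back)]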
To save your arrays you can use np.savez and read them back with np.load:
# Write to file (arrays of different lengths end up in an object array,
# so reading them back needs allow_pickle=True)
np.savez(filename, matrixResults)
# Read back
matrixResults = np.load(filename + '.npz', allow_pickle=True)['arr_0']
As a side note you should follow naming conventions i.e. only class names start with upper case letters.
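If keeping the arrays separate is acceptable, a variant of the above (a sketch with the example arrays) is to give each array its own name in the archive, which avoids the ragged object array and the pickle requirement altogether:
np.savez('/Users/UserName/Desktop/Datalog.npz', x1=x1, y1=y1, x2=x2, y2=y2)
loaded = np.load('/Users/UserName/Desktop/Datalog.npz')
x1_back = loaded['x1']
y2_back = loaded['y2']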
