How to limit index column width/height when displaying a pandas dataframe? - python

I have a dataframe that looks like this:
import pandas as pd

df = pd.DataFrame(data=list(range(0, 10)),
                  index=pd.MultiIndex.from_product([[str(list(range(0, 1000)))], list(range(0, 10))],
                                                   names=["ind1", "ind2"]),
                  columns=["col1"])
df['col2'] = str(list(range(0, 1000)))
Unfortunately, the display of the above dataframe looks like this:
If I try to set pd.options.display.max_colwidth = 5, then col2 behaves and is displayed on a single row, but ind1 doesn't behave:
Since ind1 is part of a MultiIndex, I don't mind that it occupies multiple rows, but I would like to limit its width. If I could also force each row to occupy at most the height of a single line, that would be great as well. I don't mind individual cells being truncated on display, because I prefer to scroll less, in any direction, to see a cell.
I am aware I can create my own HTML display. That's great and all, but I think it's too complex for my use case of just wanting narrower columns for data analysis in Jupyter notebooks. Nevertheless, such a solution might help other similar use cases, if you are inclined to write one.
What I'm looking for is some setting, which I thought was pd.options.display.max_colwidth, that limits the column width even when it's an index. Something that disables wrapping for long texts would probably help with the same issue as well.
I also tried printing without the index via df.style.hide_index(), in combination with pd.options.display.max_colwidth = 5, but then col2 stops behaving:
At this point I've run out of ideas. Any suggestions?

Here is one way to do it:
import pandas as pd

df = pd.DataFrame(
    data=list(range(0, 10)),
    index=pd.MultiIndex.from_product(
        [[str(list(range(0, 1000)))], list(range(0, 10))], names=["ind1", "ind2"]
    ),
    columns=["col1"],
)
df["col2"] = str(list(range(0, 1000)))
In the next Jupyter cell, run:
df.style.set_properties(**{"width": "10"}).set_table_styles(
    [{"selector": "th", "props": [("vertical-align", "top")]}]
)
Which outputs:
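If you also want to keep long cells from wrapping (so each row stays one line high), something along these lines may work as well. This is only a sketch: the 100px width and the extra CSS properties are my own guesses, not part of the answer above.

df.style.set_properties(**{
    "max-width": "100px",         # cap the width of data cells
    "white-space": "nowrap",      # keep each cell on a single line
    "overflow": "hidden",
    "text-overflow": "ellipsis",  # truncate instead of wrapping
}).set_table_styles([
    {"selector": "th", "props": [("max-width", "100px"),
                                 ("white-space", "nowrap"),
                                 ("overflow", "hidden"),
                                 ("vertical-align", "top")]}
])

Depending on the browser, the table may also need table-layout: fixed for the truncation to take effect.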

Related

How to shift multi-index column without repeating subcolumn names?

Let's say I have a dataframe with a multi-index column:
P = pd.DataFrame(
    [[100, 101],
     [101, 102],
     [ 98,  99]],
    columns=pd.MultiIndex.from_tuples(
        [('price', 'bid'),
         ('price', 'ask')]
    )
)
P
and I want to add a new column which shows me the data from the previous row:
P['price_prev'] = P['price'].shift(1)
This throws the error
ValueError: Cannot set a DataFrame with multiple columns to the single column price_prev
I understand why this happens, and doing
P[[('price_prev', 'bid'), ('price_prev', 'ask')]] = P['price'].shift(1)
gives me what I want without errors:
But is there really no way to do this which avoids repeating the names of the subcolumns? I.e., telling pandas to copy the respective column including all of its subcolumns, renaming the top level to whatever was specified, and then shifting all of the data one row down?
Try this:
P.join(P.shift().rename(lambda x: f'{x}_prev', axis=1, level=0))
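For clarity, here is the same one-liner unpacked step by step (the comments are my reading of it, not from the original answer):

shifted = P.shift()                              # move every row down by one
renamed = shifted.rename(lambda x: f'{x}_prev',  # rename only the top column level:
                         axis=1, level=0)        # 'price' -> 'price_prev'
P_prev = P.join(renamed)                         # columns: ('price', ...) plus ('price_prev', ...)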
@ziying35's answer does work, but only if I want to shift my entire dataframe.
Here's a similar and slightly less verbose version that also works for individual columns (in this case price):
P = P.join(P[['price']].shift(), rsuffix='_prev')
The one drawback of this compared to the explicit
P[[('price_prev', 'bid'), ('price_prev', 'ask')]] = P['price'].shift()
is higher memory usage, so there seems to be a memory leak somewhere when using join. However, this might also just be my Jupyter notebook acting up.
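For completeness, another sketch that avoids repeating the subcolumn names is to let pd.concat build the new top level from a dict key (this is my own variant, not one of the answers above):

P = pd.concat([P, pd.concat({'price_prev': P['price'].shift()}, axis=1)], axis=1)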

Appending the each dataframe from a list of dataframe with another list of dataframes

I have 2 sets of split data frames from a big data frame. Say for example,
import pandas as pd, numpy as np
np.random.seed([3,1415])
ind1 = ['A_p','B_p','C_p','D_p','E_p','F_p','N_p','M_p','O_p','Q_p']
col1 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df1 = pd.DataFrame(np.random.randint(10, size=(10, 7)), columns=col1,index=ind1)
ind2 = ['G_l','I_l','J_l','K_l','L_l','M_l','R_l','N_l']
col2 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df2 = pd.DataFrame(np.random.randint(20, size=(8, 7)), columns=col2,index=ind2)
# Split the dataframes into two parts
pc_1,pc_2 = np.array_split(df1, 2)
lnc_1,lnc_2 = np.array_split(df2, 2)
And now, I need to concatenate each of the split data frames from df1 (pc_1, pc_2) with each of the data frames from df2 (lnc_1, lnc_2). Currently, I am doing it as follows:
# concatenate each split data frame pc_1 with lnc_1, etc.
pc1_lnc_1 = pd.concat([pc_1, lnc_1])
pc1_lnc_2 = pd.concat([pc_1, lnc_2])
pc2_lnc1 = pd.concat([pc_2, lnc_1])
pc2_lnc2 = pd.concat([pc_2, lnc_2])
On every concatenated data frame I need to run a correlation analysis function, for example,
correlation(pc1_lnc_1)
And I wanted to save the results separately, for example,
pc1_lnc1 = correlation(pc1_lnc_1)
pc1_lnc2 = correlation(pc1_lnc_2)
......
pc1_lnc1.to_csv(output, sep='\t')
The question is whether there is a way to automate the concatenation above with some sort of loop, rather than writing a line for every combination. For every concatenated data frame I run the correlation function separately, and I have a pretty long list of split data frames.
You can loop over the split dataframes:
for pc in np.array_split(df1, 2):
    for lnc in np.array_split(df2, 2):
        print(correlation(pd.concat([pc, lnc])))
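If you also need to keep each result around, e.g. to write it to a file as in the question, the same loop can store everything in a dict keyed by which pair of splits produced it. This is just a sketch; correlation and output are the names used in the question.

results = {}
for i, pc in enumerate(np.array_split(df1, 2), start=1):
    for j, lnc in enumerate(np.array_split(df2, 2), start=1):
        # e.g. results["pc1_lnc1"] holds correlation(pd.concat([pc_1, lnc_1]))
        results[f"pc{i}_lnc{j}"] = correlation(pd.concat([pc, lnc]))

results["pc1_lnc1"].to_csv(output, sep='\t')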
Here is another thought,
def correlation(data):
    # do some complex operation..
    return data

# {"pc_1": split_1, "pc_2": split_2}
pc = {f"pc_{i + 1}": v for i, v in enumerate(np.array_split(df1, 2))}
lc = {f"lc_{i + 1}": v for i, v in enumerate(np.array_split(df2, 2))}

for pc_k, pc_v in pc.items():
    for lc_k, lc_v in lc.items():
        # (pc_1, lc_1), (pc_1, lc_2), ...
        correlation(pd.concat([pc_v, lc_v])).to_csv(f"{pc_k}_{lc_k}.csv", sep="\t", index=False)

# creates csv files like pc_1_lc_1.csv, pc_1_lc_2.csv, ... in the current working directory
If you don't have your individual dataframes in an array (and assuming you have a nontrivial number of dataframes), the easiest way (with minimal code modification) would be to throw an eval in with a loop.
Something like
for counter in range(0, n):
    for counter2 in range(0, n):
        exec("pc{}_lnc{} = correlation(pd.concat([pc_{}, lnc_{}]))".format(counter, counter2, counter, counter2))
        eval("pc{}_lnc{}.to_csv(filename, sep='\t')".format(counter, counter2))
The standard disclaimer around eval does still apply (don't do it because it's lazy programming practice and unsafe inputs could cause all kinds of problems in your code).
See here for more details about why eval is bad
Edit: updated the answer for the updated question.

Should pandas dataframes be nested?

I am creating a Python script that drives an old Fortran code to locate earthquakes. I want to vary the input parameters to the Fortran code in the Python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (i.e. a dataframe assigned to an element of a dataframe). So for example:
import pandas as pd
import numpy as np

def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res

# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object)  # make sure generic types can be used

# loop over each row, call some_operation and store the results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class, but I am not sure if it is the proper solution for my application. I would hate to forge ahead and apply the hack shown above to some code and then have it not supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df2 = pd.DataFrame({'a': [100], 'b': [200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
     a    b
0  100  200
1    2    5
2    3    6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)
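If you would rather avoid object-dtype cells altogether, one alternative (a sketch of my own, not part of the original answer) is to keep the parameters in the DataFrame and the result frames in a plain dict keyed by the master index:

# parameters stay in df_master; the result DataFrames live in an ordinary dict
results = {}
for ind, row in df_master[['p1', 'p2']].iterrows():
    results[ind] = some_operation(row)
# results[0] is the DataFrame produced from the parameters in row 0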

Filling missing data with different methods

I have a few sets of data with a timestamp, a value and a quality flag. The value and quality flag are missing for some of the timestamps, and need to be filled with a dependence on the surrounding data. I.e.,
If the quality flags on the valid data bracketing the NaN data are different, then set the value and quality flag to the same as the bracketing row with the highest quality flag. In the example below, the first set of NaNs would be replaced with qf=3 and value=3.
If the quality flags are the same, then interpolate the value between the two valid values on either side. In the example, the second set of NaNs would be replaced by qf = 1 and v = 6 and 9.
Code:
from datetime import datetime
import pandas as pd

start = datetime.strptime("2004-01-01 00:00", "%Y-%m-%d %H:%M")
end = datetime.strptime("2004-01-01 03:00", "%Y-%m-%d %H:%M")
df = pd.DataFrame(
    data={'v':  [1, 2, 'NaN', 'NaN', 'NaN', 3, 2, 1, 5, 3, 'NaN', 'NaN', 12, 43, 23, 12, 32, 12, 12],
          'qf': [1, 1, 'NaN', 'NaN', 'NaN', 3, 1, 5, 1, 1, 'NaN', 'NaN', 1, 3, 4, 2, 1, 1, 1]},
    index=pd.date_range(start, end, freq="10min"))
I have tried to solve this by finding the NA rows and looping through them to fix the first criterion, then using interpolate to solve the second. However, this is really slow, as I am working with a large data set.
One approach would just be to do all the possible fills and then choose among them as appropriate. After doing df = df.astype(float) if necessary (your example uses the string "NaN"), something like this should work:
is_null = df.qf.isnull()
fill_down = df.ffill()
fill_up = df.bfill()
df.loc[is_null & (fill_down.qf > fill_up.qf)] = fill_down
df.loc[is_null & (fill_down.qf < fill_up.qf)] = fill_up
df = df.interpolate()
It does more work than is necessary, but it's easy to see what it's doing, and the work that it does do is vectorized and so happens pretty quickly. On a version of your dataset expanded to be ~10M rows (with the same density of nulls), it takes ~6s on my old notebook. Depending on your requirements that might suffice.
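As a quick sanity check against the two NaN blocks in the question (assuming df was first converted with df.astype(float)):

# first block (00:20-00:40): the bracketing flags differ (1 vs 3), so v=3, qf=3 is used
print(df.loc["2004-01-01 00:20":"2004-01-01 00:40"])
# second block (01:40-01:50): the flags match (1 and 1), so v is interpolated to 6 and 9
print(df.loc["2004-01-01 01:40":"2004-01-01 01:50"])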

Extending a pandas panel frame along the major (timeseries) axis

I have some time-series data which is stored in a slightly strange format. I want to parse it into a pandas.Panel.
The data come from various 'locations'. Data from each location are contiguous in the file but the time series from any given location is split into separate 'chunks'. There should be no overlap between the time-chunks for one location.
I have been reading each location-time-chunk into a pandas.Panel with:
Item axis = location
Major axis = DatetimeIndex
I'd like to extend the Panel's axis to accommodate any new chunks of a location's time axis.
import numpy as np
import pandas as pd

# we'll get data like this from the file
time_chunk_1 = pd.date_range(start='2010-10-01T00:00:00', periods=20, freq='10S')
fake_data = np.cumsum(np.random.randn(len(time_chunk_1)))
mars_data_1 = pd.DataFrame(data=fake_data, index=time_chunk_1, columns=['X'])
pluto_data_1 = pd.DataFrame(data=fake_data, index=time_chunk_1, columns=['X'])

# gather the data in a panel
planet_data = pd.Panel(data={'Mars': mars_data_1, 'Pluto': pluto_data_1})

# further down the file we'll encounter data like this
time_chunk_2 = pd.date_range(start='2010-10-01T00:03:20', periods=20, freq='10S')
mars_data_2 = pd.DataFrame(data=fake_data[::-1], index=time_chunk_2, columns=['X'])

# I can make a DataFrame of the whole Mars time-series
mars_data_all = planet_data['Mars'].append(mars_data_2)

# but setting a frame of the panel doesn't extend the major axis
planet_data['Mars'] = mars_data_all
After I've collected the chunks, I'd like the following to be true:
planet_data.Mars.index is mars_data_all.index
I've tried permutations of:
setting a new frame in the panel (planet_data['AllMars'] = mars_data_all)
pandas.Panel.reindex
pandas.Panel.replace
It seems like I'm maybe getting confused between the underlying data and views on it. I've looked at these (1, 2) related questions, but I'm still stuck. It feels like I'm probably missing something obvious.
I found that the following works, for some value of 'works'. I didn't have the rep to answer my own question quickly, so this answer first appeared as a comment.
This works:
planet_data = planet_data.reindex(major_axis=mars_data_all.index)
planet_data['Mars'] = mars_data_all
in the sense that it passes:
assert (planet_data.Mars.X == mars_data_all.X).all()
assert planet_data.Mars.index is mars_data_all.index
For a significant dataset, I suspect we would run into the hashtable not getting garbage collected issue raised in this answer. There is probably a better way of doing this.
In real life, the data turn out to be much bigger, hairier and more misaligned than those in my example code. So much so that reindexing will not work.
I'll probably end up using a dict of DataFrames rather than a Panel as my outer data structure.
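A minimal sketch of that dict-of-DataFrames structure, using the names from the example above (pd.Panel has since been removed from pandas, so this is also the more durable container):

# one DataFrame per location, keyed by location name
planet_frames = {'Mars': mars_data_1, 'Pluto': pluto_data_1}
# a new time-chunk for a location just gets concatenated onto its frame
planet_frames['Mars'] = pd.concat([planet_frames['Mars'], mars_data_2])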
