way to unpack data within dataframe - python

I have a df that is "packed", and I'm trying to find a way to unpack it into multiple columns and rows:
Input, as a df whose columns each hold lists:
all_labels   values                        labels
[A,B,C]      [[10,1,3],[5,6,3],[0,0,0]]    [X,Y,Z]
Desired output: the unpacked df
X Y Z
A 10 1 3
B 5 6 3
C 0 0 0
I tried this for the all_labels and labels columns, but I'm not sure how to do it for the values column:
df.labels.apply(pd.Series)
df.all_labels.apply(pd.Series)

Setup
packed = pd.DataFrame({
    'all_labels': [['A', 'B', 'C']],
    'values': [[[10, 1, 3], [5, 6, 3], [0, 0, 0]]],
    'labels': [['X', 'Y', 'Z']]
})
Keep It Simple
pd.DataFrame(packed['values'][0], packed['all_labels'][0], packed['labels'][0])
X Y Z
A 10 1 3
B 5 6 3
C 0 0 0
Rename and dict unpacking
The columns are so close to the argument names of the dataframe constructor, I couldn't resist...
rnm = {'all_labels': 'index', 'values': 'data', 'labels': 'columns'}
pd.DataFrame(**packed.rename(columns=rnm).loc[0])
X Y Z
A 10 1 3
B 5 6 3
C 0 0 0
Without rename and list unpacking instead
Make sure to list the column names in the same order the positional arguments are expected by the pandas.DataFrame constructor: data, index, columns.
pd.DataFrame(*packed.loc[0, ['values', 'all_labels', 'labels']])
X Y Z
A 10 1 3
B 5 6 3
C 0 0 0
Bonus Material
With orient='split', the pandas.DataFrame.to_dict method returns a dictionary whose keys mirror the constructor arguments:
df = pd.DataFrame(*packed.loc[0, ['values', 'all_labels', 'labels']])
df.to_dict('split')
{'index': ['A', 'B', 'C'],
'columns': ['X', 'Y', 'Z'],
'data': [[10, 1, 3], [5, 6, 3], [0, 0, 0]]}
We could wrap that in another DataFrame constructor call to get back something very similar to what we started with.
pd.DataFrame([df.to_dict('split')])
index columns data
0 [A, B, C] [X, Y, Z] [[10, 1, 3], [5, 6, 3], [0, 0, 0]]
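If packed ever holds more than one row, the same constructor trick can be applied per row and the pieces concatenated. A sketch, assuming every row is packed the same way (the second row here is made up for illustration):

```python
import pandas as pd

packed = pd.DataFrame({
    'all_labels': [['A', 'B', 'C'], ['D', 'E']],
    'values': [[[10, 1, 3], [5, 6, 3], [0, 0, 0]], [[1, 2, 3], [4, 5, 6]]],
    'labels': [['X', 'Y', 'Z'], ['X', 'Y', 'Z']]
})

# Build one small frame per packed row, then stack them vertically
unpacked = pd.concat(
    pd.DataFrame(row['values'], index=row['all_labels'], columns=row['labels'])
    for _, row in packed.iterrows()
)
```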

Related

Convert column suffixes from pandas join into a MultiIndex

I have two pandas DataFrames with (not necessarily) identical index and column names.
>>> df_L = pd.DataFrame({'X': [1, 3],
...                      'Y': [5, 7]})
>>> df_R = pd.DataFrame({'X': [2, 4],
...                      'Y': [6, 8]})
I can join them together and assign suffixes.
>>> df_L.join(df_R, lsuffix='_L', rsuffix='_R')
X_L Y_L X_R Y_R
0 1 5 2 6
1 3 7 4 8
But what I want is to make 'L' and 'R' sub-columns under both 'X' and 'Y'.
The desired DataFrame looks like this:
>>> pd.DataFrame(columns=pd.MultiIndex.from_product([['X', 'Y'], ['L', 'R']]),
...              data=[[1, 5, 2, 6],
...                    [3, 7, 4, 8]])
X Y
L R L R
0 1 5 2 6
1 3 7 4 8
Is there a way I can combine the two original DataFrames to get this desired DataFrame?
You can use pd.concat with the keys argument along the columns axis:
df = pd.concat([df_L, df_R], keys=['L', 'R'], axis=1).swaplevel(0, 1, axis=1).sort_index(level=0, axis=1)
>>> df
X Y
L R L R
0 1 2 5 6
1 3 4 7 8
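For completeness, the same one-liner as a self-contained snippet, spread over a few lines:

```python
import pandas as pd

df_L = pd.DataFrame({'X': [1, 3], 'Y': [5, 7]})
df_R = pd.DataFrame({'X': [2, 4], 'Y': [6, 8]})

# Concatenate side by side with an 'L'/'R' key level, then move that
# level below the original column names and sort for a clean layout
df = (pd.concat([df_L, df_R], keys=['L', 'R'], axis=1)
        .swaplevel(0, 1, axis=1)
        .sort_index(level=0, axis=1))
```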
For those looking for an answer to the more general problem of joining two data frames with different indices or columns into a multi-index table:
# Prepend a key-level to the column index
# https://stackoverflow.com/questions/14744068
df_L = pd.concat([df_L], keys=["L"], axis=1)
df_R = pd.concat([df_R], keys=["R"], axis=1)
# Join the two dataframes
df = df_L.join(df_R)
# Reorder levels if needed:
df = df.reorder_levels([1,0], axis=1).sort_index(axis=1)
Example:
# Data:
df_L = pd.DataFrame({'X': [1, 3, 5], 'Y': [7, 9, 11]})
df_R = pd.DataFrame({'X': [2, 4], 'Y': [6, 8], 'Z': [10, 12]})
# Result:
# X Y Z
# L R L R R
# 0 1 2.0 7 6.0 10.0
# 1 3 4.0 9 8.0 12.0
# 2 5 NaN 11 NaN NaN
This also solves the special case of the OP with equal indices and columns.
df_L.columns = pd.MultiIndex.from_product([['L'], df_L.columns])
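Spelled out for the OP's data, prepending a key level to each frame and joining might look like this (a sketch):

```python
import pandas as pd

df_L = pd.DataFrame({'X': [1, 3], 'Y': [5, 7]})
df_R = pd.DataFrame({'X': [2, 4], 'Y': [6, 8]})

# Prepend the 'L'/'R' key as the top column level of each frame
df_L.columns = pd.MultiIndex.from_product([['L'], df_L.columns])
df_R.columns = pd.MultiIndex.from_product([['R'], df_R.columns])

# Join, then swap the levels so 'X'/'Y' sit on top
out = df_L.join(df_R).swaplevel(0, 1, axis=1).sort_index(axis=1)
```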

Replicating rows in pandas dataframe by column value and add a new column with repetition index

My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k times. Along with that, I also want to create a column with values 0 to k-1. So, given:
import pandas as pd
df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n':  [1, 2, 3],
    'v':  [10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'n':  [1, 2, 2, 3, 3, 3],
    'v':  [10, 13, 13, 8, 8, 8],
    'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job. I am looking for a pandas way to add the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and take a copy to avoid SettingWithCopyWarning: if you modify values in df1 later, the modifications will not propagate back to the original data (df), and pandas will not raise a warning.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print(df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2
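The same answer can also be written as one method chain with assign, which avoids the intermediate copy entirely (a stylistic alternative, same algorithm):

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C'],
                   'n':  [1, 2, 3],
                   'v':  [10, 13, 8]})

out = (df.loc[df.index.repeat(df.n)]                                  # repeat each row n times
         .assign(repeat_id=lambda d: d.groupby(level=0).cumcount())   # 0..k-1 within each group
         .reset_index(drop=True))
```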

best way to present hierarchical data python pandas

I read about pandas multi index, but the examples I found did not cover this cases:
I have a set of measurements i in 1..n. Each measurement i consists of attributes a, b, c, X, Y, Z. While a, b, c have scalar values, X, Y, Z are arrays.
(X, Y, Z have different length for different measurements, but within one measurement i, the arrays X, Y, Z have same length m).
Question: What's the best way to represent this in a pandas DataFrame?
The multi-index examples I saw would index data e.g. first level by i for the measurement and second level by k for index into X,Y,Z. But what about the attributes a, b, c that have just one value for each measurement, but not m? Should the a, b, c values be repeated? Or only the first row of each measurement i contains values for a, b, c, the other 2..m rows contain NaN?
As alluded to in the comment, much of this will depend on the sort of questions you want to be able to answer at the end of the day. One example of a potentially meaningful representation, which takes advantage of the fact that the arrays have the same length within each row, is to flatten the arrays, repeating the scalar values of the i-th row m_i times:
In [11]: import numpy as np

In [12]: from itertools import chain

In [13]: df = pd.DataFrame({'a': [5, 7, 1], 'b': [1, 3, 2], 'c': [10, 1, 1],
    ...:                    'X': [[10, 10], [10, 20, 30], [4, 5]],
    ...:                    'Y': [[20, 30], [1, 4, 5], [0, 1]],
    ...:                    'Z': [[0, 3], [0, 1, 2], [50, 60]]})
In [14]: df
Out[14]:
X Y Z a b c
0 [10, 10] [20, 30] [0, 3] 5 1 10
1 [10, 20, 30] [1, 4, 5] [0, 1, 2] 7 3 1
2 [4, 5] [0, 1] [50, 60] 1 2 1
In [15]: pd.DataFrame({
    ...:     'a': np.repeat(df.a.values, df.X.apply(len)),
    ...:     'b': np.repeat(df.b.values, df.X.apply(len)),
    ...:     'c': np.repeat(df.c.values, df.X.apply(len)),
    ...:     'X': list(chain.from_iterable(df.X)),
    ...:     'Y': list(chain.from_iterable(df.Y)),
    ...:     'Z': list(chain.from_iterable(df.Z))})
Out[15]:
X Y Z a b c
0 10 20 0 5 1 10
1 10 30 3 5 1 10
2 10 1 0 7 3 1
3 20 4 1 7 3 1
4 30 5 2 7 3 1
5 4 0 50 1 2 1
6 5 1 60 1 2 1
This assumes that the fact that the array lengths match across columns means that the arrays themselves are comparable element-wise.
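In pandas 1.3 and later, DataFrame.explode accepts a list of columns and performs the same flattening directly, provided the list lengths match within each row; the scalar columns a, b, c are repeated automatically:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 7, 1], 'b': [1, 3, 2], 'c': [10, 1, 1],
                   'X': [[10, 10], [10, 20, 30], [4, 5]],
                   'Y': [[20, 30], [1, 4, 5], [0, 1]],
                   'Z': [[0, 3], [0, 1, 2], [50, 60]]})

# Explode the three array columns in lockstep
flat = df.explode(['X', 'Y', 'Z']).reset_index(drop=True)
```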

How to add rows for all missing values of one multi-index's level?

Suppose that I have the following dataframe df, indexed by a 3-level multi-index:
In [52]: df
Out[52]:
C
L0 L1 L2
0 w P 1
y P 2
R 3
1 x Q 4
R 5
z S 6
Code to create the DataFrame:
idx = pd.MultiIndex(levels=[[0, 1], ['w', 'x', 'y', 'z'], ['P', 'Q', 'R', 'S']],
                    codes=[[0, 0, 0, 1, 1, 1], [0, 2, 2, 1, 1, 3], [0, 0, 2, 1, 2, 3]],
                    names=['L0', 'L1', 'L2'])
df = pd.DataFrame({'C': [1, 2, 3, 4, 5, 6]}, index=idx)
The possible values for the L2 level are 'P', 'Q', 'R', and 'S', but some of these values are missing for particular combinations of values for the remaining levels. For example, the combination (L0=0, L1='w', L2='Q') is not present in df.
I would like to add enough rows to df so that, for each combination of values for the levels other than L2, there is exactly one row for each of the L2 level's possible values. For the added rows, the value of the C column should be 0.
IOW, I want to expand df so that it looks like this:
C
L0 L1 L2
0 w P 1
Q 0
R 0
S 0
y P 2
Q 0
R 3
S 0
1 x P 0
Q 4
R 5
S 0
z P 0
Q 0
R 0
S 6
REQUIREMENTS:
the operation should leave the types of the columns unchanged;
the operation should add the smallest number of rows needed to complete only the specified level (L2).
Is there a simple way to perform this expansion?
Assuming the L2 index level already contains all the possible values you need, you can use the unstack/stack trick:
df.unstack('L2', fill_value=0).stack(level=1)
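If the L2 index level does not already carry all the wanted values (for example, when the index was built from tuples), reindexing against the full product is an alternative; fill_value=0 also keeps C an integer column, satisfying the type requirement. A sketch, hard-coding the wanted L2 values:

```python
import pandas as pd

idx = pd.MultiIndex(levels=[[0, 1], ['w', 'x', 'y', 'z'], ['P', 'Q', 'R', 'S']],
                    codes=[[0, 0, 0, 1, 1, 1], [0, 2, 2, 1, 1, 3], [0, 0, 2, 1, 2, 3]],
                    names=['L0', 'L1', 'L2'])
df = pd.DataFrame({'C': [1, 2, 3, 4, 5, 6]}, index=idx)

# Every (L0, L1) pair that occurs, crossed with the full set of L2 values
pairs = df.index.droplevel('L2').unique()
full = pd.MultiIndex.from_tuples(
    [(l0, l1, l2) for l0, l1 in pairs for l2 in ['P', 'Q', 'R', 'S']],
    names=['L0', 'L1', 'L2'])

out = df.reindex(full, fill_value=0)
```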

In Pandas/Numpy, How to implement a rolling function inside each chunk using 2 different columns?

I have a df that is 'divided' by chunks, like this:
A = pd.DataFrame([[1, 5, 2, 0], [2, 4, 4, 0], [3, 3, 1, 1],
                  [4, 2, 2, 0], [5, 1, 4, 0], [2, 4, 4, 1]],
                 columns=['A', 'B', 'C', 'D'], index=[1, 2, 3, 4, 5, 6])
In this example, the chunk size is 3 and there are 2 chunks (the element 1 in column 'D' marks the end of each chunk). I need to perform a rolling calculation inside each chunk that involves 2 columns. Specifically, I need to create a column 'E' equal to column 'B' minus the rolling min of column 'C', as a function:
def retracement(x):
    return x['B'] - pd.rolling_min(x['C'], window=3)
I need to apply the formula above for each chunk. So following this recipe I tried:
chunk_size = 3
A['E'] = A.groupby(np.arange(len(A))//chunk_size).apply(lambda x: retracement(x))
ValueError: Wrong number of items passed 3, placement implies 1
The output would look like:
A B C D E
1 1 5 2 0 3
2 2 4 4 0 2
3 3 3 1 1 2
4 4 2 2 0 0
5 5 1 4 0 -1
6 2 4 4 1 2
Thanks
Update:
Following @EdChum's recommendation didn't work; I got:
TypeError: <lambda>() got an unexpected keyword argument 'axis'
Something like this:
def chunkify(chunk_size):
    df['chunk'] = (df.index.values - 1) // chunk_size
    df['E'] = df.groupby('chunk').apply(lambda x: x.B - pd.expanding_min(x.C)).values.flatten()
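Since the window equals the chunk size, the rolling min never looks past the start of its own chunk, so it reduces to a per-chunk cumulative min. In current pandas (where pd.rolling_min and pd.expanding_min no longer exist), the whole thing might be written as:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 5, 2, 0], [2, 4, 4, 0], [3, 3, 1, 1],
                  [4, 2, 2, 0], [5, 1, 4, 0], [2, 4, 4, 1]],
                 columns=['A', 'B', 'C', 'D'], index=[1, 2, 3, 4, 5, 6])

chunk_size = 3
chunk = np.arange(len(A)) // chunk_size   # 0, 0, 0, 1, 1, 1

# Within each chunk: B minus the running minimum of C
A['E'] = A['B'] - A.groupby(chunk)['C'].cummin()
```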
