Dataframe after reindex does not show all items in multiindex - python

My main goal is to reindex a DataFrame with a new MultiIndex that contains new values.
In[34]: df = pd.DataFrame([[1,2,3], [1,4,2],[2,3,4], [2,2,1]], columns=['a', 'b', 'c'])
In[35]: df = df.set_index(['a', 'b'])
In[36]: df.index
Out[36]:
MultiIndex(levels=[[1, 2], [2, 3, 4]],
labels=[[0, 0, 1, 1], [0, 2, 1, 0]],
names=[u'a', u'b'])
In[37]: df_ri = df.reindex_axis([1,2,3,4], level='b', axis=0)
In[39]: df_ri
Out[39]:
     c
a b
1 2  3
  4  2
2 3  4
  2  1
In[40]: df_ri.index
Out[40]:
MultiIndex(levels=[[1, 2], [1, 2, 3, 4]], #all new values are stored here but are not visible in df
labels=[[0, 0, 1, 1], [1, 3, 2, 1]],
names=[u'a', u'b'])
In the end, the displayed DataFrame is unchanged. When I look at the new index, its levels contain the new values, but no rows are shown for them. I can work around this by creating a new DataFrame with the new index and merging the old DataFrame into it, but that does not seem like the best approach. Any suggestions?
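One possible approach, sketched here under the assumption that the goal is one row for every combination of the existing `a` values and the new `b` values: build the full MultiIndex explicitly with `MultiIndex.from_product` and pass it to `reindex`. The names `full_index` and `df_full` are illustrative only, and the sketch uses `reindex` instead of `reindex_axis`, which was deprecated and later removed from pandas.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [1, 4, 2], [2, 3, 4], [2, 2, 1]],
    columns=["a", "b", "c"],
).set_index(["a", "b"])

# Every combination of the existing 'a' values with the desired 'b' values.
full_index = pd.MultiIndex.from_product(
    [df.index.levels[0], [1, 2, 3, 4]],
    names=["a", "b"],
)

# Rows missing from the original frame are added and filled with NaN.
df_full = df.reindex(full_index)
print(df_full)
```

Because missing rows are filled with NaN, column `c` becomes a float column in this sketch; passing `fill_value=0` to `reindex` is one way to avoid that if a default value makes sense for the data.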

Related

Python pandas grouping a dataframe by the unique value of a column

I have a dataframe in this format
A B
1990-02 1
1990-03 1
1999-05 1
1992-08 2
1996-12 2
2020-01 2
1990-05 3
1995-08 3
1999-11 3
2021-12 3
How can I convert this dataframe into groups based on the unique values of column B?
My results should be in this format:
[[[1990-02, 1],[1990-03, 1],[1999-05, 1]],
[[1992-08, 2],[1996-12, 2],[2020-01, 2]],
[[1990-05, 3],[1995-08, 3],[1999-11, 3],[2021-12, 3]]
]
This should do the job:
import pandas as pd
data = {"A": ["1990-02", "1990-03","1999-05","1992-08","1996-12",
"2020-01","1990-05","1995-08","1999-11", "2021-12"],
"B": [1,1,1,2,2,2,3,3,3,3]}
df = pd.DataFrame(data=data)
out = df.groupby("B")['A'].apply(list)
output = [[[date, b_value] for date in block]
for b_value, block in zip(out.index, out.values)]
print(output)
Here's one way to get an equivalent structure with arrays:
>>> df.groupby("B").apply(pd.DataFrame.to_numpy).values
[array([['1990-02', 1],
['1990-03', 1],
['1999-05', 1]], dtype=object)
array([['1992-08', 2],
['1996-12', 2],
['2020-01', 2]], dtype=object)
array([['1990-05', 3],
['1995-08', 3],
['1999-11', 3],
['2021-12', 3]], dtype=object)]
Here is one way to get exactly what you want:
df.assign(l=df.agg(list, axis=1)).groupby('B')['l'].agg(list).tolist()
output:
[[['1990-02', 1], ['1990-03', 1], ['1999-05', 1]],
[['1992-08', 2], ['1996-12', 2], ['2020-01', 2]],
[['1990-05', 3], ['1995-08', 3], ['1999-11', 3], ['2021-12', 3]]]

Accessing pandas multi-index with a variable

I'm struggling to access a Pandas DataFrame with a multi-index programmatically. Let's say I have
import pandas as pd
df = pd.DataFrame([[0, 0, 0, 1],
[0, 0, 1, 2],
[0, 1, 0, 7],
[0, 1, 1, 9],
[1, 0, 0, 1],
[1, 0, 1, 0],
[1, 1, 0, 1],
[1, 1, 1, 10]], columns=['c1', 'c2', 'c3', 'value'])
sums = df.groupby(['c1', 'c2', 'c3']).value.sum()
I can get the sum which corresponds to the [1, 1, 1] combination of c1, c2 and c3 with
sums[1, 1, 1]
That returns 10 as expected.
But what if I have a variable
q = [1, 1, 1]
how do I get the same value out?
I have tried
sums[q]
which gives
c1  c2  c3
0   0   1     2
        1     2
        1     2
Name: value, dtype: int64
Also I thought star operator could work:
sums[*q]
but that is invalid syntax.
Use Series.xs with tuple:
print (sums.xs((1,1,1)))
10
Or Series.loc:
print (sums.loc[(1,1,1)])
#alternative
#print (sums[(1,1,1)])
10
q = [1, 1, 1]
print (sums.loc[tuple(q)])
#alternative
#print (sums[tuple(q)])
10
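As a self-contained sketch of the tuple-based lookup above: converting the list to a tuple makes pandas read it as one label per index level. The variable names `full` and `partial` are illustrative, and the partial-key lookup is an extra shown on the assumption that selecting on a prefix of the levels may also be useful.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 0, 1], [0, 0, 1, 2], [0, 1, 0, 7], [0, 1, 1, 9],
     [1, 0, 0, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 1, 1, 10]],
    columns=["c1", "c2", "c3", "value"],
)
sums = df.groupby(["c1", "c2", "c3"])["value"].sum()

q = [1, 1, 1]

# A tuple with one label per level selects a single scalar.
full = sums.loc[tuple(q)]

# A shorter tuple selects on the leading levels only and returns
# a Series indexed by the remaining level (c3).
partial = sums.loc[tuple(q[:2])]

print(full)
print(partial)
```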

Extend lists within a pandas Series

I have a pandas series that looks like this:
group
A [1,0,5,4,6,...]
B [2,2,0,1,9,...]
C [3,5,2,0,6,...]
I have similar series that I would like to add to the existing series by extending each of the lists. How can I do this?
I tried
for x in series:
x.extend(series[series.index[x]])
but this isn't working.
Consider the series s
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s
A [1, 0]
B [2, 2]
C [4, 1]
Name: group, dtype: object
You can extend each list with a similar series simply by adding them. pandas will use the underlying objects' __add__ method to combine the elements pairwise. In the case of a list, the __add__ method concatenates the lists.
s + s
A [1, 0, 1, 0]
B [2, 2, 2, 2]
C [4, 1, 4, 1]
Name: group, dtype: object
However, this would not concatenate if the elements were numpy arrays, because adding two arrays sums them elementwise:
import numpy as np
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s = s.apply(np.array)
In this case, I'd make sure they are lists
s.apply(list) + s.apply(list)
A [1, 0, 1, 0]
B [2, 2, 2, 2]
C [4, 1, 4, 1]
Name: group, dtype: object
Solution with add function (borrowed data sample from piRSquared):
s1 = s.add(s)
print (s1)
A [1, 0, 1, 0]
B [2, 2, 2, 2]
C [4, 1, 4, 1]
Name: group, dtype: object
EDIT:
If some index values differ, it is more complicated: you need to reindex both series to the union of all index values and replace the resulting NaN values with empty lists via combine_first:
s = pd.Series([[1, 0], [2, 2], [4, 1]], list('ABC'), name='group')
s1 = pd.Series([[3, 9], [6, 4]], list('AD'), name='group')
idx = s.index.union(s1.index)
s = s.reindex(idx).combine_first(pd.Series([[]], index=idx))
s1 = s1.reindex(idx).combine_first(pd.Series([[]], index=idx))
s2 = s.add(s1)
print (s2)
A [1, 0, 3, 9]
B [2, 2]
C [4, 1]
D [6, 4]
Name: group, dtype: object
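An alternative sketch for the mismatched-index case, assuming plain Python lists as elements: look up each label with `Series.get`, falling back to an empty list when the label is absent, and concatenate the two lists. The name `combined` is illustrative; this is not the method from the answer above, just a variant that avoids the NaN handling.

```python
import pandas as pd

s = pd.Series([[1, 0], [2, 2], [4, 1]], index=list("ABC"), name="group")
s1 = pd.Series([[3, 9], [6, 4]], index=list("AD"), name="group")

# Union of the labels from both series.
idx = s.index.union(s1.index)

# For each label, concatenate the list from each series,
# using an empty list when the label is missing.
combined = pd.Series(
    [list(s.get(key, [])) + list(s1.get(key, [])) for key in idx],
    index=idx,
    name="group",
)
print(combined)
```

Building new lists this way also leaves the lists stored in `s` and `s1` unmodified.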

Summing Two DataFrames by Index

I have the following
df1 = pd.DataFrame([1, 1, 1, 1, 1], index=[ 1, 2, 3, 4 ,5 ], columns=['A'])
df2 = pd.DataFrame([ 1, 1, 1, 1, 1], index=[ 2, 3, 4, 5, 6], columns=['A'])
I want to return the DataFrame which will be the sum of the two for each row:
df = pd.DataFrame([ 1, 2, 2, 2, 2, 1], index=[1, 2, 3, 4, 5, 6], columns=['A'])
Of course, the idea is that I don't know what the actual indices are, so the intersection could be empty and I'd get a concatenation of both DataFrames.
You can concatenate along the columns (which aligns the rows on the index), fill missing values with 0, and sum across each row:
>>> pd.concat([df1, df2], axis=1).fillna(0).sum(axis=1)
1 1
2 2
3 2
4 2
5 2
6 1
dtype: float64
If you want it as a DataFrame, simply do
pd.DataFrame({
'A': pd.concat([df1, df2], axis=1).fillna(0).sum(axis=1)})
(Also note that if you need to do this just for a specific Series such as A, just use
pd.concat([df1.A, df2.A], axis=1).fillna(0).sum(axis=1)
)
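Another way to get the same result, shown as a minimal sketch: `DataFrame.add` aligns both frames on their index, and its `fill_value` argument substitutes a value for rows present in only one of them. The name `result` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([1, 1, 1, 1, 1], index=[1, 2, 3, 4, 5], columns=["A"])
df2 = pd.DataFrame([1, 1, 1, 1, 1], index=[2, 3, 4, 5, 6], columns=["A"])

# Rows missing from one frame are treated as 0 before adding.
result = df1.add(df2, fill_value=0)
print(result)
```

This keeps the result as a DataFrame with the original column name, so no rebuilding step is needed. If the indices do not overlap at all, the result is simply the rows of both frames.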

subset (slice) of pandas.DataFrame has the same index as the original DataFrame

Suppose, I have a DataFrame with MultiIndex like this:
In [1]:d=pnd.DataFrame(range(5),index=pnd.MultiIndex.from_tuples([('A',1),('A',2),('A',3),('A',4),('A',5)]))
In [2]: d
Out[2]:
     0
A 1  0
  2  1
  3  2
  4  3
  5  4
I can create another DataFrame by subsetting:
In [3]: p=d.loc[('A',slice(1,3)),:].copy()
In [4]: p
Out[4]:
     0
A 1  0
  2  1
  3  2
but the index of this new DataFrame still carries the levels of the original DataFrame (all original items remain in 'levels').
In [5]: p.index
Out[5]:
MultiIndex(levels=[[u'A'], [1, 2, 3, 4, 5]],
labels=[[0, 0, 0], [0, 1, 2]])
How do I copy-out a subset which does not 'remember' the index object of the original DataFrame?
The reason I need to do this is that some of my functions access the index object to get metadata, and the fact that the index carries over from the original DataFrame confuses these functions.
If you don't care about the top level of your index in your subset, you can set
p.index = p.index.droplevel()
p.index
Int64Index([1, 2, 3], dtype='int64')
Alternatively if you want to keep the multi-index and just reset the levels you can call set_levels:
p.index = p.index.set_levels(p.index.droplevel(), level=1)
p.index
MultiIndex(levels=[['A'], [1, 2, 3]],
labels=[[0, 0, 0], [0, 1, 2]])
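In pandas versions that provide `MultiIndex.remove_unused_levels`, the same cleanup can be sketched more directly. The example below rebuilds the question's data with the usual `pd` alias instead of `pnd`.

```python
import pandas as pd

d = pd.DataFrame(
    range(5),
    index=pd.MultiIndex.from_tuples([("A", i) for i in range(1, 6)]),
)
p = d.loc[("A", slice(1, 3)), :].copy()

# Drop level values that no longer appear in the subset.
p.index = p.index.remove_unused_levels()
print(p.index.levels)
```

This keeps both index levels and only trims the level values that the subset no longer uses.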
