Adding a DataFrame to a MultiIndex DataFrame - python

I'm trying to create a historical time series of a number of identifiers across a number of different metrics. As part of that, I'm trying to create a MultiIndex DataFrame and then "fill it" with the individual DataFrames.
Multi Index:

       ID1            ID2
       ITEM1  ITEM2   ITEM1  ITEM2
index

DataFrame to insert:

      ITEM1  ITEM2
Date
a
b
c
Looking through the official docs and this website, I found the following relevant material:
Add single index data frame to multi index data frame, Pandas, Python and the associated pandas official docs pages:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
https://pandas.pydata.org/pandas-docs/stable/advanced.html
I've managed with something like:
for i in df1.index:
    for j in df2.columns:
        df1.loc[i, (ID, j)] = df2.loc[i, j]
but it seems highly inefficient when I need to do this across circa 100 DataFrames.
For some reason, a simple
df1.loc[i, (ID)] = df2.loc[i]
doesn't seem to work. Neither does:
df1[ID1] = df1.append(df2)
which returns "Cannot set a frame with no defined index and a value that cannot be converted to a Series".
My understanding from looking around is that this is because I'm effectively leaving half the DataFrame empty (a ragged list?).
Any help on how to iteratively populate my MultiIndex DataFrame would be greatly appreciated.
Let me know if I've missed relevant information.
Cheers.
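As a general note on the "100 dataframes" case: rather than pre-building the MultiIndex frame and filling it cell by cell, it is usually much faster to let pd.concat build the column MultiIndex for you. A minimal sketch, assuming each per-ID DataFrame shares the same date index (the frames dict below is hypothetical):

import pandas as pd

dates = pd.date_range('2018-01-01', periods=3)

# hypothetical per-ID frames, each sharing the same date index and ITEM columns
frames = {
    f'ID{k}': pd.DataFrame({'ITEM1': range(3), 'ITEM2': range(3)}, index=dates)
    for k in (1, 2)  # extend to ~100 IDs
}

# concat along columns: the dict keys become the outer column level,
# giving the (ID, ITEM) MultiIndex in one pass
result = pd.concat(frames, axis=1)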

Setup
import pandas as pd

df1 = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6] * 2] * 3,
    columns=pd.MultiIndex.from_product(['ID1 ID2 ID3'.split(), range(4)])
)
df2 = df1.ID1 * 2
df1
  ID1           ID2           ID3
    0  1  2  3    0  1  2  3    0  1  2  3
0   1  2  3  4    5  6  1  2    3  4  5  6
1   1  2  3  4    5  6  1  2    3  4  5  6
2   1  2  3  4    5  6  1  2    3  4  5  6

df2

   0  1  2  3
0  2  4  6  8
1  2  4  6  8
2  2  4  6  8
The problem is that Pandas is trying to line up indices (or columns in this case). We can do some transpose/join trickery but I'd rather avoid that.
Option 1
Take advantage of the fact that we can assign an array via loc so long as the shape matches up. We'd better make sure it does, and that the order of columns and index is correct; I use align with the 'right' join parameter to do this, then assign the values of the aligned df2.
df1.loc[:, 'ID1'] = df2.align(df1.ID1, 'right')[0].values
df1
  ID1           ID2           ID3
    0  1  2  3    0  1  2  3    0  1  2  3
0   2  4  6  8    5  6  1  2    3  4  5  6
1   2  4  6  8    5  6  1  2    3  4  5  6
2   2  4  6  8    5  6  1  2    3  4  5  6
Option 2
Or, we can give df2 the additional column-index level needed to line it up, then use update to replace the relevant cells in place.
df1.update(pd.concat({'ID1': df2}, axis=1))
df1
  ID1           ID2           ID3
    0  1  2  3    0  1  2  3    0  1  2  3
0   2  4  6  8    5  6  1  2    3  4  5  6
1   2  4  6  8    5  6  1  2    3  4  5  6
2   2  4  6  8    5  6  1  2    3  4  5  6
Option 3
A creative way using stack and assign, followed by unstack:
df1.stack().assign(ID1=df2.stack()).unstack()
  ID1           ID2           ID3
    0  1  2  3    0  1  2  3    0  1  2  3
0   2  4  6  8    5  6  1  2    3  4  5  6
1   2  4  6  8    5  6  1  2    3  4  5  6
2   2  4  6  8    5  6  1  2    3  4  5  6

Related

pandas get first row for each unique value in a column

Given a pandas data frame, how can I get the first row for each unique value in a column?
For example, given:
a b key
0 1 2 1
1 2 3 1
2 3 3 1
3 4 5 2
4 5 6 2
5 6 6 2
6 7 2 1
7 8 2 1
8 9 2 3
The result, when analyzing by the column key, should be:
a b key
0 1 2 1
3 4 5 2
8 9 2 3
P.S. df source:
df = pd.DataFrame([{'a': 1, 'b': 2, 'key': 1},
                   {'a': 2, 'b': 3, 'key': 1},
                   {'a': 3, 'b': 3, 'key': 1},
                   {'a': 4, 'b': 5, 'key': 2},
                   {'a': 5, 'b': 6, 'key': 2},
                   {'a': 6, 'b': 6, 'key': 2},
                   {'a': 7, 'b': 2, 'key': 1},
                   {'a': 8, 'b': 2, 'key': 1},
                   {'a': 9, 'b': 2, 'key': 3}])
drop_duplicates does this. By default it keeps the first of the set, although that can be changed by other parameters.
df = df.drop_duplicates('key')
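An equivalent route, if you prefer groupby semantics (a sketch, using the df defined above):

# head(1) keeps the first row of each key group and, like
# drop_duplicates('key'), preserves the original index
df.groupby('key').head(1)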

New column that counts the frequency that a value occurs in a Pandas dataframe column

I have a dataframe that looks like
ID feature
1 2
1 3
1 4
2 3
2 2
3 5
3 8
3 4
3 2
4 4
4 6
and I want to add a new column n_ID that counts the number of times each element occurs in the column ID, so the desired output should look like
ID feature n_ID
1 2 3
1 3 3
1 4 3
2 3 2
2 2 2
3 5 4
3 8 4
3 4 4
3 2 4
4 4 2
4 6 2
I know the .value_counts() function but I don't know how to make use of this method to make the new column. Thanks in advance
Using value counts... I was thinking of this... #sophocles Thanks for transform... :)
df = pd.DataFrame({"ID": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
                   "feature": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
df1 = pd.DataFrame(df["ID"].value_counts().reset_index())
df1.columns = ["ID", "n_ID"]
df = df.merge(df1, how="left", on="ID")
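The transform mentioned above is presumably along these lines; it broadcasts each group's size back onto its rows and avoids the intermediate frame and the merge (a sketch, using the same df):

# transform('count') returns one value per row, so it assigns directly
df["n_ID"] = df.groupby("ID")["ID"].transform("count")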
Just create the new column and count the occurrences using a lambda:
Code:
df['n_id'] = df.apply(lambda x: df['ID'].tolist().count(x.ID), axis=1)
Output:
ID feature n_id
0 1 1 3
1 1 2 3
2 1 3 3
3 2 4 2
4 2 5 2
5 3 6 4
6 3 7 4
7 3 8 4
8 3 9 4
9 4 10 2
10 4 11 2
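Since the question mentions value_counts specifically: you can also map each ID to its precomputed count, which avoids the per-row list scan in the lambda above (a sketch, using the same df):

# value_counts gives one count per unique ID; map broadcasts it to every row
df['n_id'] = df['ID'].map(df['ID'].value_counts())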

How to use two columns to distinguish data points in a pandas dataframe

I have a dataframe that looks as follows:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[1,2,3],[1,2,3],[1,2,3]], 'c': [[4,5,6],[4,5,6],[4,5,6]]})
I want to explode the dataframe with column b and c. I know that if we only use one column then we can do
df.explode('column_name')
However, I can't find a way to use it with two columns. Here is the desired output:
output = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c': [4,5,6,4,5,6,4,5,6]})
I have tried
df.explode(['a','b'])
but it does not work and gives me a
ValueError: column must be a scalar.
Thanks.
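Note: that ValueError comes from pandas versions before 1.3. Since pandas 1.3, DataFrame.explode accepts a list of columns, provided the listed columns hold lists of matching lengths in each row, so this should work directly:

# pandas >= 1.3: explode both list columns in one call
df.explode(['b', 'c'], ignore_index=True)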
Let us try
df = pd.concat([df[x].explode() for x in ['b', 'c']], axis=1).join(df[['a']]).reindex(columns=df.columns)
Out[179]:
a b c
0 1 1 4
0 1 2 5
0 1 3 6
1 2 1 4
1 2 2 5
1 2 3 6
2 3 1 4
2 3 2 5
2 3 3 6
You can use itertools chain, along with zip, to get your result:
from itertools import chain

# [a] * len(b) repeats the scalar once per list element
pd.DataFrame(chain.from_iterable(zip([a] * len(b), b, c)
                                 for a, b, c in df.to_numpy()))
0 1 2
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
The list comprehension from #Ben is the fastest. However, if you aren't too concerned about speed, you may use apply with pd.Series.explode:
df.set_index('a').apply(pd.Series.explode).reset_index()
Or simply apply; on non-list columns, it will return the original values:
df.apply(pd.Series.explode).reset_index(drop=True)
Out[42]:
a b c
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6

How to identify identical groups using pandas.groupby()?

I'm trying to use pandas to identify sub-sections of a dataframe which are identical. So, for example, if I have a dataframe like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
3 2 1 2
4 2 2 3
5 2 5 6
6 3 8 9
7 3 4 0
8 3 9 7
I want to group by ID, so Rows 0 - 2 would form Group 1, Rows 3 - 5 would form Group 2, and Rows 6 - 8 would form Group 3. I know I can use pd.groupby() to group rows by ID. In the case here, Group 2 is a repetition of Group 1 (Columns A and B are identical in both)
What I then want to do is to remove repeated groups, so in this case I would want to remove the second group. My final dataframe would then look like:
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
Every column in the duplicate groups is the same, except for the ID which is different for each group. I only want to remove a group if it is identical for every row in the group. Any help would be much appreciated!
This is one way using a helper column and pd.Series.drop_duplicates.
The idea is to first create a mapping from id to a tuple of values representing all rows for that id. Then drop duplicates and extract the index of the remainder.
df['C'] = list(zip(df['A'], df['B']))
s = df.groupby('id')['C'].apply(tuple)\
      .drop_duplicates().index
res = df.loc[df['id'].isin(s), ['id', 'A', 'B']]
print(res)
id A B
0 1 1 2
1 1 2 3
2 1 5 6
6 3 8 9
7 3 4 0
8 3 9 7
Check pd.crosstab
s = pd.crosstab(df.id, [df.A, df.B]).drop_duplicates().unstack()
s[s != 0].reset_index().drop(columns=0)
Out[128]:
A B id
0 1 2 1
1 2 3 1
2 4 0 3
3 5 6 1
4 8 9 3
5 9 7 3
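For frames with many value columns, where building per-row tuples gets clumsy, one sketch is to hash each row first and compare per-group tuples of hashes; pd.util.hash_pandas_object does the per-row hashing:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'A':  [1, 2, 5, 1, 2, 5, 8, 4, 9],
                   'B':  [2, 3, 6, 2, 3, 6, 9, 0, 7]})

# hash every row of the value columns, then summarize each id group
# as a tuple of its row hashes
row_hash = pd.util.hash_pandas_object(df[['A', 'B']], index=False)
group_key = row_hash.groupby(df['id']).apply(tuple)

# keep only the ids whose pattern occurs first
keep = group_key.drop_duplicates().index
res = df[df['id'].isin(keep)]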

pandas stack second column below first and vice versa

I have a DataFrame with two columns and I would like to stack the second column below the first and the first below the second.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
A B
0 1 4
1 2 5
2 3 6
Desired output:
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
So far I have tried:
pd.concat([df, df[['B', 'A']].rename(columns={'A': 'B', 'B': 'A'})], ignore_index=True)
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
Is this the cleanest way?
Concat is better if you ask me, but if you have 100 columns, renaming is a pain. As a generalized approach, here's one with numpy fliplr and vstack:
import numpy as np

v = df.values
pd.DataFrame(np.vstack((v, np.fliplr(v))), columns=df.columns)
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
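Another rename-free sketch that stays in pandas: reverse the columns positionally with iloc and relabel them with the original names via set_axis, then stack below the original:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# reverse column order positionally, restore the original labels,
# then append below the original rows with a fresh index
swapped = df.iloc[:, ::-1].set_axis(df.columns, axis=1)
out = pd.concat([df, swapped], ignore_index=True)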
