Given a column in a pandas DataFrame with several rows, I want to create a new column containing a specified number of rows that form sub-levels of each row of the existing column. I'm doing this to build a large data matrix of value ranges as input for a model later on.
As an example I have a small DataFrame as follows:
df:
A
1 1
2 2
3 3
. ..
To this DataFrame I would like to add 3 rows per row of the 'A' column, forming a new column named 'B'. The result should look like this:
df:
A B
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
. .. ..
I have tried various things; a list comprehension combined with an if statement, or iterating over the rows with something like iterrows() and then appending the new rows, seemed most logical to me, but I cannot get it done - especially the duplication of the 'A' column's rows.
Does anyone know how to do this?
Any suggestion is appreciated; many thanks in advance.
I think you need numpy.repeat and numpy.tile with the DataFrame constructor:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.repeat(df['A'].values, 3),
                   'B': np.tile(df['A'].values, 3)})
print(df)
A B
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 3 1
7 3 2
8 3 3
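As a side note, the difference between the two helpers is what produces the grouped 'A' and cycling 'B' patterns: np.repeat repeats each element consecutively, while np.tile repeats the whole array. A minimal sketch:
import numpy as np
a = np.array([1, 2, 3])
print(np.repeat(a, 3))  # [1 1 1 2 2 2 3 3 3] - each element repeated
print(np.tile(a, 3))    # [1 2 3 1 2 3 1 2 3] - whole array cycled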
Here's another NumPy way with np.repeat: create one column, then re-use it for the other -
In [282]: df.A
Out[282]:
1 4
2 9
3 5
Name: A, dtype: int64
In [288]: r = np.repeat(df.A.values[:, None], 3, axis=1)
In [289]: pd.DataFrame(np.c_[r.ravel(), r.T.ravel()], columns=['A', 'B'])
Out[289]:
A B
0 4 4
1 4 9
2 4 5
3 9 4
4 9 9
5 9 5
6 5 4
7 5 9
8 5 5
Related
Given a pandas data frame, how can I get the first row for each unique value in a column?
For example, given:
a b key
0 1 2 1
1 2 3 1
2 3 3 1
3 4 5 2
4 5 6 2
5 6 6 2
6 7 2 1
7 8 2 1
8 9 2 3
The result, when grouping by the key column, should be:
a b key
0 1 2 1
3 4 5 2
8 9 2 3
P.S. DataFrame source:
pd.DataFrame([{'a':1,'b':2,'key':1},
{'a':2,'b':3,'key':1},
{'a':3,'b':3,'key':1},
{'a':4,'b':5,'key':2},
{'a':5,'b':6,'key':2},
{'a':6,'b':6,'key':2},
{'a':7,'b':2,'key':1},
{'a':8,'b':2,'key':1},
{'a':9,'b':2,'key':3}])
drop_duplicates does this. By default it keeps the first row of each set, though that can be changed with the keep parameter.
df = df.drop_duplicates('key')
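For example, keep='last' would retain the final row of each key group instead (a sketch against the sample frame above):
df.drop_duplicates('key', keep='last')
   a  b  key
5  6  6    2
7  8  2    1
8  9  2    3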
I have a dataframe that looks as follows:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[1,2,3],[1,2,3],[1,2,3]], 'c': [[4,5,6],[4,5,6],[4,5,6]]})
I want to explode the dataframe on columns b and c. I know that if we only use one column we can do
df.explode('column_name')
However, I can't find a way to use it with two columns. So here is the desired output.
output = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c': [4,5,6,4,5,6,4,5,6]})
I have tried
df.explode(['a','b'])
but it does not work and gives me a
ValueError: column must be a scalar.
Thanks.
Let us try
df = pd.concat([df[x].explode() for x in ['b', 'c']], axis=1).join(df[['a']]).reindex(columns=df.columns)
Out[179]:
a b c
0 1 1 4
0 1 2 5
0 1 3 6
1 2 1 4
1 2 2 5
1 2 3 6
2 3 1 4
2 3 2 5
2 3 3 6
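Note: on pandas 1.3.0 or newer, DataFrame.explode accepts a list of columns directly, so the workaround is unnecessary (the ValueError in the question comes from older versions). A sketch, assuming 'b' and 'c' hold lists of matching lengths per row:
df.explode(['b', 'c'], ignore_index=True)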
You can use itertools' chain, along with zip, to get your result:
from itertools import chain

pd.DataFrame(chain.from_iterable(zip([a] * df.shape[-1], b, c)  # repeats 'a' once per list element
                                 for a, b, c in df.to_numpy()))
0 1 2
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
The list comprehension from #Ben is the fastest. However, if you aren't too concerned about speed, you may use apply with pd.Series.explode:
df.set_index('a').apply(pd.Series.explode).reset_index()
Or simply apply it to the whole frame; on non-list columns, explode returns the original values:
df.apply(pd.Series.explode).reset_index(drop=True)
Out[42]:
a b c
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
I'm trying to clean a dataset.
Within the last 3 rows, if column "B" is empty, the whole row should be dropped.
I haven't managed to figure out how to use dropna on only certain rows.
A B
1 1 3
2 5
3 6 5
4 2
5 3 6
Needs to become
A B
1 1 3
2 5
3 6 5
5 3 6
Slice the last three rows, apply your condition, and pass the resulting index to drop:
n = 3
df = df.drop(df.tail(n).B.eq('').loc[lambda x: x].index)
A B
1 1 3
2 5
3 6 5
5 3 6
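Note the eq('') check assumes the blanks are empty strings. If they are actually NaN, a sketch of the same idea using isna instead:
n = 3
tail = df.tail(n)
df = df.drop(tail[tail['B'].isna()].index)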
I'm trying to create a historical time series of a number of identifiers for a number of different metrics. As part of that, I'm trying to create a multi-index dataframe and then "fill it" with the individual dataframes.
Multi Index:
ID1 ID2
ITEM1 ITEM2 ITEM1 ITEM2
index
Dataframe to insert
ITEM1 ITEM2
Date
a
b
c
Looking through the official docs and this website, I found the following relevant links:
Add single index data frame to multi index data frame, Pandas, Python and the associated pandas official docs pages:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
https://pandas.pydata.org/pandas-docs/stable/advanced.html
I've managed with something like:
for i in df1.index:
    for j in df2.columns:
        df1.loc[i, (ID, j)] = df2.loc[i, j]
but it seems highly inefficient when I need to do this across circa 100 dataframes.
For some reason a simple
df1.loc[i, (ID)] = df2.loc[i]
doesn't seem to work. Neither does:
df1[ID1] = df1.append(df2)
which returns "Cannot set a frame with no defined index and a value that cannot be converted to a Series".
My understanding from looking around is that this is because I'm effectively leaving half the dataframe empty (a ragged list?).
Any help on how to iteratively populate my multi-index DataFrame would be greatly appreciated.
Let me know if I've missed relevant information.
Cheers.
Setup
df1 = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6] * 2] * 3,
    columns=pd.MultiIndex.from_product(['ID1 ID2 ID3'.split(), range(4)])
)
df2 = df1.ID1 * 2
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 1 2 3 4 5 6 1 2 3 4 5 6
1 1 2 3 4 5 6 1 2 3 4 5 6
2 1 2 3 4 5 6 1 2 3 4 5 6
df2
0 1 2 3
0 2 4 6 8
1 2 4 6 8
2 2 4 6 8
The problem is that Pandas is trying to line up indices (or columns in this case). We can do some transpose/join trickery but I'd rather avoid that.
Option 1
Take advantage of the fact that we can assign an array via loc so long as the shape matches up. We'd better make sure it does, and that the order of columns and index is correct; I use align with join='right' to do this, then assign the values of the aligned df2:
df1.loc[:, 'ID1'] = df2.align(df1.ID1, 'right')[0].values
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
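As a side note, if you already know df2's rows and columns are in the same order as df1's 'ID1' block, the align step can be skipped and the raw values assigned directly. A minimal sketch, assuming the shapes already match:
df1.loc[:, 'ID1'] = df2.values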
Option 2
Or we can give df2 the additional level of column indexing needed to line it up, then use update to replace the relevant cells in place:
df1.update(pd.concat({'ID1': df2}, axis=1))
df1
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
Option 3
A creative way using stack and assign with unstack
df1.stack().assign(ID1=df2.stack()).unstack()
ID1 ID2 ID3
0 1 2 3 0 1 2 3 0 1 2 3
0 2 4 6 8 5 6 1 2 3 4 5 6
1 2 4 6 8 5 6 1 2 3 4 5 6
2 2 4 6 8 5 6 1 2 3 4 5 6
For example, I have a DataFrame A as follows:
A
0
1
2
Now I want to insert every 2 rows of DataFrame B into A after each single row of A; B is as follows:
B
3
3
4
4
5
5
Finally, I want:
A
0
3
3
1
4
4
2
5
5
How can I achieve this?
One option is to take each dataframe's values, reshape, concatenate with np.hstack and then assign to a new dataframe.
In [533]: pd.DataFrame(np.hstack((df1.A.values.reshape(-1, 1),\
df2.B.values.reshape(-1, 2))).reshape(-1, ),\
columns=['A'])
Out[533]:
A
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
Another solution with pd.concat and df.stack:
In [622]: pd.DataFrame(pd.concat([df1.A, pd.DataFrame(df2.B.values.reshape(-1, 2))], axis=1)\
.stack().reset_index(drop=True),\
columns=['A'])
Out[622]:
A
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
Setup
Consider the dataframes a and b
a = pd.DataFrame(dict(A=range(3)))
b = pd.DataFrame(dict(B=np.arange(3).repeat(2) + 3))
Solution
Use interleave from toolz or cytoolz. The trick is to split b into two slices and pass them as separate arguments to interleave:
from cytoolz import interleave
pd.Series(list(interleave([a.A, b.B[::2], b.B[1::2]])))
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
dtype: int64
This is a modification of #root's answer to my question
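If you'd rather avoid the toolz dependency, the same interleaving can be sketched with the standard library's itertools.chain plus zip over the two slices:
from itertools import chain
pd.Series(list(chain.from_iterable(zip(a.A, b.B[::2], b.B[1::2]))))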
Maybe this one?
A = len(df1) + len(df2)
df1.index = list(range(0, A, 3))                             # df1's rows land at positions 0, 3, 6, ...
df2.index = sorted(set(range(0, A)) - set(range(0, A, 3)))   # df2 fills the gaps, in order
df2.columns = ['A']
df = pd.concat([df1, df2], axis=0).sort_index()
df
Out[188]:
A
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
If we first split a into len(a) arrays and b into len(b)/2 arrays of two, we can zip them together, stack, and concatenate.
a = np.split(dfa.A.values, len(dfa.A))
b = np.split(dfb.B.values, len(dfb.B) // 2)  # integer division: groups of two
c = np.concatenate([np.concatenate(pair) for pair in zip(a, b)])
pd.Series(c)
Returns:
0 0
1 3
2 3
3 1
4 4
5 4
6 2
7 5
8 5
dtype: int64