Flatten a dataframe with vector/list elements in Python

Let's say I have a dataframe like this:
A B C Profile
0 1 4 4 [1,2,3,4]
1 2 4 5 [2,2,4,1]
3 2 4 5 [2,2,4,1]
How can I go about making it become this:
A B C Profile[0] Profile[1] Profile[2] Profile[3]
0 1 4 4 1 2 3 4
1 2 4 5 2 2 4 1
3 2 4 5 2 2 4 1
I have tried this:
flat_list = [sublist for sublist in df['Profile']]
flat_df = pd.DataFrame(flat_list)
pd.concat([df.iloc[:,0:3], flat_df], axis=1)
But I have some NaN values, and I need to retain the index for the flattened list. This method just attaches the new columns by position and pushes all the NaNs to the bottom instead of matching indices.
That is, I end up with this:
A B C Profile[0] Profile[1] Profile[2] Profile[3]
0 1 4 4 1 2 3 4
1 2 4 5 2 2 4 1
2 NaN NaN NaN 2 2 4 1
3 2 4 5 NaN NaN NaN NaN
TIA

Change your line so that it passes the original index:
flat_df = pd.DataFrame(flat_list, index = df.index)
out = pd.concat([df.iloc[:,0:3], flat_df], axis = 1)
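
To see why passing the index matters, here is a minimal, self-contained sketch. The frame is a reconstruction of the one in the question (index 0, 1, 3), and the `Profile[i]` column names are only chosen here to match the desired output.

```python
import pandas as pd

# Reconstruction of the question's frame; note that the index skips 2.
df = pd.DataFrame(
    {
        "A": [1, 2, 2],
        "B": [4, 4, 4],
        "C": [4, 5, 5],
        "Profile": [[1, 2, 3, 4], [2, 2, 4, 1], [2, 2, 4, 1]],
    },
    index=[0, 1, 3],
)

# Build the flattened columns on the SAME index, so concat aligns row by row
# instead of falling back to a fresh 0..n-1 index.
flat_df = pd.DataFrame(df["Profile"].tolist(), index=df.index)
flat_df.columns = [f"Profile[{i}]" for i in flat_df.columns]

out = pd.concat([df.drop(columns="Profile"), flat_df], axis=1)
print(out)
```

Without `index=df.index`, `flat_df` would be labelled 0, 1, 2, and `pd.concat` would align on those labels, which is what produces the extra NaN rows shown in the question.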

Related

How to use two columns to distinguish data points in a pandas dataframe

I have a dataframe that looks like the following:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[1,2,3],[1,2,3],[1,2,3]], 'c': [[4,5,6],[4,5,6],[4,5,6]]})
I want to explode the dataframe on columns b and c. I know that with a single column we can do
df.explode('column_name')
However, I can't find a way to do this with two columns. Here is the desired output:
output = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c': [4,5,6,4,5,6,4,5,6]})
I have tried
df.explode(['a','b'])
but it does not work and gives me a
ValueError: column must be a scalar.
Thanks.
Let us try
df=pd.concat([df[x].explode() for x in ['b','c']],axis=1).join(df[['a']]).reindex(columns=df.columns)
Out[179]:
a b c
0 1 1 4
0 1 2 5
0 1 3 6
1 2 1 4
1 2 2 5
1 2 3 6
2 3 1 4
2 3 2 5
2 3 3 6
You can use itertools.chain along with zip to get your result:
from itertools import chain
# repeat a once per element of b so the three iterables line up
pd.DataFrame(chain.from_iterable(zip([a] * len(b), b, c)
                                 for a, b, c in df.to_numpy()))
0 1 2
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
The list comprehension from @Ben is the fastest. However, if speed is not a major concern, you can use apply with pd.Series.explode:
df.set_index('a').apply(pd.Series.explode).reset_index()
Or simply use apply on the whole frame. On non-list columns, explode returns the original values:
df.apply(pd.Series.explode).reset_index(drop=True)
Out[42]:
a b c
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
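
As a further option not mentioned in the answers above: from pandas 1.3 onward, `DataFrame.explode` accepts a list of columns, provided the lists in each row have matching lengths. A minimal sketch, assuming pandas >= 1.3 and the frame from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
        "c": [[4, 5, 6], [4, 5, 6], [4, 5, 6]],
    }
)

# Explode b and c together; each row's lists must be the same length.
# reset_index(drop=True) replaces the repeated 0,0,0,1,1,1,... labels.
result = df.explode(["b", "c"]).reset_index(drop=True)
print(result)
```

The exploded columns come back with object dtype, so an `astype(int)` afterwards may be needed if numeric columns are required.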

split a string into separate columns in pandas

I have a dataframe with lots of data and 1 column that is structured like this:
index var_1
1 a=3:b=4:c=5:d=6:e=3
2 b=3:a=4:c=5:d=6:e=3
3 e=3:a=4:c=5:d=6
4 c=3:a=4:b=5:d=6:f=3
I am trying to structure the data in that column to look like this:
index a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
I have done the following thus far:
df1 = df['var_1'].str.split(':', expand=True)
I can then loop through the cols of df1 and do another split on '=', but then I'll just have loads of disorganised label cols and value cols.
Use list comprehension with dictionaries for each value and pass to DataFrame constructor:
comp = [dict([y.split('=') for y in x.split(':')]) for x in df['var_1']]
df = pd.DataFrame(comp).fillna(0).astype(int)
print (df)
a b c d e f
0 3 4 5 6 3 0
1 4 3 5 6 3 0
2 4 0 5 6 3 0
3 4 5 3 6 0 3
Or use Series.str.split with expand=True to get a DataFrame, reshape it with DataFrame.stack, split again on '=', drop the inner level of the MultiIndex (level=1), append column 0 (the labels) as a new index level, and finally reshape with Series.unstack:
df = (df['var_1'].str.split(':', expand=True)
.stack()
.str.split('=', expand=True)
.reset_index(level=1, drop=True)
.set_index(0, append=True)[1]
.unstack(fill_value=0)
.rename_axis(None, axis=1))
print (df)
a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
Here's one approach using str.get_dummies:
out = df.var_1.str.get_dummies(sep=':')
out = out * out.columns.str[2:].astype(int).values
out.columns = pd.MultiIndex.from_arrays([out.columns.str[0], out.columns])
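# max(axis=1, level=0) no longer works in pandas 2.0+; group on the first column level instead
print(out.T.groupby(level=0).max().T)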
a b c d e f
index
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
You can apply "extractall" and "pivot".
After "extractall" you get:
             0  1
index match
1     0      a  3
      1      b  4
      2      c  5
      3      d  6
      4      e  3
2     0      b  3
      1      a  4
      2      c  5
      3      d  6
      4      e  3
3     0      e  3
      1      a  4
      2      c  5
      3      d  6
4     0      c  3
      1      a  4
      2      b  5
      3      d  6
      4      f  3
And in one step:
rslt= df.var_1.str.extractall(r"([a-z])=(\d+)") \
.reset_index(level="match",drop=True) \
.pivot(columns=0).fillna(0)
       1
0      a  b  c  d  e  f
index
1      3  4  5  6  3  0
2      4  3  5  6  3  0
3      4  0  5  6  3  0
4      4  5  3  6  0  3
#rslt.columns= rslt.columns.levels[1].values
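
For reference, the dictionary-based answer above resets the row labels to 0..3. The sketch below is one possible variation that keeps the original index (1..4) and converts the values to integers while parsing. The helper name `parse_pairs` is made up for this illustration, and the frame is rebuilt from the question's sample.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "var_1": [
            "a=3:b=4:c=5:d=6:e=3",
            "b=3:a=4:c=5:d=6:e=3",
            "e=3:a=4:c=5:d=6",
            "c=3:a=4:b=5:d=6:f=3",
        ]
    },
    index=[1, 2, 3, 4],
)


def parse_pairs(text):
    """Turn 'a=3:b=4' into {'a': 3, 'b': 4} (hypothetical helper)."""
    return {
        key: int(value)
        for key, value in (pair.split("=") for pair in text.split(":"))
    }


wide = (
    pd.DataFrame([parse_pairs(s) for s in df["var_1"]], index=df.index)
    .fillna(0)          # labels missing from a row become 0
    .astype(int)        # fillna leaves floats behind; cast back to int
    .sort_index(axis=1) # order the columns a..f
)
print(wide)
```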

Pandas DataFrame assign hierarchical number to element

I have the following Dataframe:
a b c d
0 1 4 9 2
1 2 5 8 7
2 4 6 2 3
3 3 2 7 5
I want to assign a number to each element in a row according to its order. The result should look like this:
a b c d
0 1 3 4 2
1 1 2 4 3
2 3 4 1 2
3 2 1 4 3
I tried to use the np.argsort function, which doesn't work. Does someone know an easy way to do this? Thanks.
Use DataFrame.rank:
df = df.rank(axis=1).astype(int)
print (df)
a b c d
0 1 3 4 2
1 1 2 4 3
2 3 4 1 2
3 2 1 4 3
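
A self-contained sketch of the same idea, using the question's data. Two things here are assumptions rather than part of the answer: `method="first"` is added so that ties (none in this sample) would still get distinct integer ranks instead of averaged ones, and the double `argsort` is shown only to illustrate why a single `np.argsort` call does not give ranks directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, 4, 3],
        "b": [4, 5, 6, 2],
        "c": [9, 8, 2, 7],
        "d": [2, 7, 3, 5],
    }
)

# Rank within each row; method="first" breaks ties by column order.
ranks = df.rank(axis=1, method="first").astype(int)
print(ranks)

# One argsort returns the ordering of positions, not the rank of each value.
# Applying argsort twice (and adding 1) yields 1-based ranks.
via_argsort = np.argsort(np.argsort(df.to_numpy(), axis=1), axis=1) + 1
print(via_argsort)
```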

How to extract a value from a list in Pandas

Hi, I have a dataframe that looks like this:
0 1
0 0 [03/25/93]
1 1 [6/18/85]
2 2 [7/8/71]
3 3 [9/27/75]
4 4 []
5 5 []
How can I extract the value inside the list into another column of the DataFrame?
0 1
0 0 03/25/93
1 1 6/18/85
2 2 7/8/71
3 3 9/27/75
4 4 NaN
5 5 NaN
Thank you very much.
Use str[0]:
df[1] = df[1].str[0]
print (df)
0 1
0 0 03/25/93
1 1 6/18/85
2 2 7/8/71
3 3 9/27/75
4 4 NaN
5 5 NaN
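
A small runnable sketch of this, with the frame rebuilt from the question. The `.str` accessor indexes into each list and yields NaN where the list is empty. The `explicit` version is just an illustrative equivalent, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        0: range(6),
        1: [["03/25/93"], ["6/18/85"], ["7/8/71"], ["9/27/75"], [], []],
    }
)

# .str[0] takes the first element of each list; empty lists become NaN.
extracted = df[1].str[0]

# Roughly the same thing written out by hand.
explicit = df[1].map(lambda items: items[0] if len(items) > 0 else np.nan)

df[1] = extracted
print(df)
```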

tracking maximum value in dataframe column

I want to produce a column B in a dataframe that tracks the maximum value reached in column A since row Index 0.
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
I want to avoid iterating, so is there a vectorized solution, and if so, what would it look like?
You're looking for cummax:
In [257]:
df['B'] = df['A'].cummax()
df
Out[257]:
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
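
For completeness, a self-contained version using the values from the question. The NumPy line is an optional equivalent added here for comparison, and the column name `B_np` is made up for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 2, 1, 3, 4, 2]})
df.index.name = "Index"

# Running maximum of column A from the first row onward.
df["B"] = df["A"].cummax()

# The same running maximum computed with NumPy.
df["B_np"] = np.maximum.accumulate(df["A"].to_numpy())
print(df)
```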
