How to split a pandas df column that has objects? [duplicate] - python

This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
(13 answers)
Closed 4 years ago.
I have a really simple Pandas dataframe where each cell contains a list. I'd like to split each element of the list into it's own column. I can do that by exporting the values and then creating a new dataframe. This doesn't seem like a good way to do this especially, if my dataframe had a column aside from the list column.
import pandas as pd
df = pd.DataFrame(data=[[[8,10,12]],
[[7,9,11]]])
df = pd.DataFrame(data=[x[0] for x in df.values])
Desired output:
0 1 2
0 8 10 12
1 7 9 11
Follow-up based on #Psidom answer:
If I did have a second column:
df = pd.DataFrame(data=[[[8,10,12], 'A'],
[[7,9,11], 'B']])
How do I not loose the other column?
Desired output:
0 1 2 3
0 8 10 12 A
1 7 9 11 B

You can loop through the Series with apply() function and convert each list to a Series, this automatically expand the list as a series in the column direction:
df[0].apply(pd.Series)
# 0 1 2
#0 8 10 12
#1 7 9 11
Update: To keep other columns of the data frame, you can concatenate the result with the columns you want to keep:
pd.concat([df[0].apply(pd.Series), df[1]], axis = 1)
# 0 1 2 1
#0 8 10 12 A
#1 7 9 11 B

You could do pd.DataFrame(df[col].values.tolist()) - is much faster ~500x
In [820]: pd.DataFrame(df[0].values.tolist())
Out[820]:
0 1 2
0 8 10 12
1 7 9 11
In [821]: pd.concat([pd.DataFrame(df[0].values.tolist()), df[1]], axis=1)
Out[821]:
0 1 2 1
0 8 10 12 A
1 7 9 11 B
Timings
Medium
In [828]: df.shape
Out[828]: (20000, 2)
In [829]: %timeit pd.DataFrame(df[0].values.tolist())
100 loops, best of 3: 15 ms per loop
In [830]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 4.06 s per loop
Large
In [832]: df.shape
Out[832]: (200000, 2)
In [833]: %timeit pd.DataFrame(df[0].values.tolist())
10 loops, best of 3: 161 ms per loop
In [834]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 40.9 s per loop

Related

How can I know the type of values in column names and list? [duplicate]

I'm selecting several columns of a dataframe, by a list of the column names. This works fine if all elements of the list are in the dataframe.
But if some elements of the list are not in the DataFrame, then it will generate the error "not in index".
Is there a way to select all columns which included in that list, even if not all elements of the list are included in the dataframe? Here is some sample data which generates the above error:
df = pd.DataFrame( [[0,1,2]], columns=list('ABC') )
lst = list('ARB')
data = df[lst] # error: not in index
I think you need Index.intersection:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
lst = ['A','R','B']
print (df.columns.intersection(lst))
Index(['A', 'B'], dtype='object')
data = df[df.columns.intersection(lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
Another solution with numpy.intersect1d:
data = df[np.intersect1d(df.columns, lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
Few other ways, and list comprehension is much faster
In [1357]: df[df.columns & lst]
Out[1357]:
A B
0 1 4
1 2 5
2 3 6
In [1358]: df[[c for c in df.columns if c in lst]]
Out[1358]:
A B
0 1 4
1 2 5
2 3 6
Timings
In [1360]: %timeit [c for c in df.columns if c in lst]
100000 loops, best of 3: 2.54 µs per loop
In [1359]: %timeit df.columns & lst
1000 loops, best of 3: 231 µs per loop
In [1362]: %timeit df.columns.intersection(lst)
1000 loops, best of 3: 236 µs per loop
In [1363]: %timeit np.intersect1d(df.columns, lst)
10000 loops, best of 3: 26.6 µs per loop
Details
In [1365]: df
Out[1365]:
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
In [1366]: lst
Out[1366]: ['A', 'R', 'B']
A really simple solution here is to use filter(). In your example, just type:
df.filter(lst)
and it will automatically ignore any missing columns. For more, see the documentation for filter.
As a general note, filter is a very flexible and powerful way to select specific columns. In particular, you can use regular expressions. Borrowing the sample data from #jezrael, you could type either of the following.
df.filter(regex='A|R|B')
df.filter(regex='[ARB]')
Those are trivial examples, but suppose you wanted only columns starting with those letters, then you could type:
df.filter(regex='^[ARB]')
FWIW, in some quick timings I find this to be faster than the list comprehension method, but I don't think speed is really much of a concern here -- even the slowest way should be fast enough, as the speed does not depend on the size of the dataframe, only on the number of columns.
Honestly, all of these ways are fine and you can go with whatever is most readable to you. I prefer filter because it is simple while also giving you more options for selecting columns than a simple intersection.
Use * with list
data = df[[*lst]]
It will give the desired result.
please try this:
syntax : Dataframe[[List of Columns]]
for example : df[['a','b']]
a
Out[5]:
a b c
0 1 2 3
1 12 3 44
X is the list of req columns to slice
x = ['a','b']
this would give you the req slice:
a[x]
Out[7]:
a b
0 1 2
1 12 3
Performance:
%timeit a[x]
333 µs ± 9.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

pandas dataframe to series [duplicate]

This question already has an answer here:
Reshape a pandas DataFrame of (720, 720) into (518400, ) 2D into 1D
(1 answer)
Closed 2 years ago.
I have this dataframe:
pd.DataFrame({"X": [1,2,3,4],
"Y": [5,6,7,8],
"Z": [9,10,11,12]})
And I'm looking for this output:
Currently, the similar problems solved I have found are the opposite: looking from series to dataframe. The most similar I have found is this one, which isn't similar at all. I have tried also with pivot_table() and reshape(), but they require an index column where I'm just looking for one column.
Any suggestions?
PS: You can assume that the dataframe has 100 columns to avoid selecting them one by one, but you call them as they are ordered (e.g. if they are 100 columns, you can do X1:X100)
Use flattening with ravel('F') -
In [14]: pd.Series(df.to_numpy(copy=False).ravel('F'))
Out[14]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
dtype: int64
This series is a view into the input dataframe, which means virtually free runtime and zero memory overhead. Let's verify -
In [20]: s = pd.Series(df.to_numpy(copy=False).ravel('F'))
In [21]: np.shares_memory(s,df)
Out[21]: True
Let's confirm the timings too -
In [2]: df = pd.DataFrame(np.random.rand(100000,3), columns=['X','Y','Z'])
In [3]: %timeit pd.Series(df.to_numpy(copy=False).ravel('F'))
579 µs ± 9.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This is melt:
df.melt()[['value']]
Output:
value
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
One way is to reshape the data from the "wide" to the "tall" format by stacking:
df.T.stack().reset_index(drop=True)
#0 1
#1 2
#2 3
#3 4
#4 5
#5 6
#6 7
#7 8
#8 9
#9 10
#10 11
#11 12
As always, there are many ways to "skin a cat" in Pandas, and then performance may become the criterion. This is a meta-answer that compares the performance:
ravel by Divakar: 80 us
stack by DYZ: 640 us
melt by Quang Hoang: 2.03 ms

merge groupby results directly back to dataframe

Suppose I have the following data:
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
id1 id2 x
0 1 1 10
1 1 2 20
2 1 3 50
3 2 1 15
4 2 2 20
5 2 3 30
6 3 1 40
7 3 2 70
The dataframe is sorted along the two ids. Suppose I'd like to know the value of x of the FIRST observation within each group of id1 observations. The result would be like
id1 id2 x first_x
1 1 10 10
1 2 30 10
1 3 50 10
2 1 15 15
2 2 20 15
2 3 30 15
3 1 40 40
3 2 70 40
How do I achieve this 'subscripting'? Ideally, the new column would be filled for each observation.
I thought along the lines of
df['first_x'] = df.groupby(['id1'])[0]
I think simpliest is transform with first:
df['first_x'] = df.groupby('id1')['x'].transform('first')
Or map by Series created by drop_duplicates:
df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
print (df)
id1 id2 x first_x
0 1 1 10 10
1 1 2 20 10
2 1 3 50 10
3 2 1 15 15
4 2 2 20 15
5 2 3 30 15
6 3 1 40 40
7 3 2 70 40
First is shortest and fastest solution:
np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'id1': np.random.randint(10000,size=N),
'x':np.random.randint(10000,size=N)})
df = df.sort_values('id1').reset_index(drop=True)
print (df)
In [179]: %timeit df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
10 loops, best of 3: 125 ms per loop
In [180]: %%timeit
...: first_xs = df.groupby(['id1']).first().to_dict()['x']
...:
...: df['first_x'] = df['id1'].map(lambda id: first_xs[id])
...:
1 loop, best of 3: 524 ms per loop
In [181]: %timeit df['first_x'] = df.groupby('id1')['x'].transform('first')
10 loops, best of 3: 54.9 ms per loop
In [182]: %timeit df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
10 loops, best of 3: 142 ms per loop
Something like this?
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
df = df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
As you need to consider the entire dataframe when building values for each row, you need an intermediate step.
The following gets your first_x value using a group by, then uses that as a map to add a new column.
import pandas as pd
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
first_xs = df.groupby(['id1']).first().to_dict()['x']
df['first_x'] = df['id1'].map(lambda id: first_xs[id])

Selecting columns by list (and columns are subset of list)

I'm selecting several columns of a dataframe, by a list of the column names. This works fine if all elements of the list are in the dataframe.
But if some elements of the list are not in the DataFrame, then it will generate the error "not in index".
Is there a way to select all columns which included in that list, even if not all elements of the list are included in the dataframe? Here is some sample data which generates the above error:
df = pd.DataFrame( [[0,1,2]], columns=list('ABC') )
lst = list('ARB')
data = df[lst] # error: not in index
I think you need Index.intersection:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
lst = ['A','R','B']
print (df.columns.intersection(lst))
Index(['A', 'B'], dtype='object')
data = df[df.columns.intersection(lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
Another solution with numpy.intersect1d:
data = df[np.intersect1d(df.columns, lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
Few other ways, and list comprehension is much faster
In [1357]: df[df.columns & lst]
Out[1357]:
A B
0 1 4
1 2 5
2 3 6
In [1358]: df[[c for c in df.columns if c in lst]]
Out[1358]:
A B
0 1 4
1 2 5
2 3 6
Timings
In [1360]: %timeit [c for c in df.columns if c in lst]
100000 loops, best of 3: 2.54 µs per loop
In [1359]: %timeit df.columns & lst
1000 loops, best of 3: 231 µs per loop
In [1362]: %timeit df.columns.intersection(lst)
1000 loops, best of 3: 236 µs per loop
In [1363]: %timeit np.intersect1d(df.columns, lst)
10000 loops, best of 3: 26.6 µs per loop
Details
In [1365]: df
Out[1365]:
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
In [1366]: lst
Out[1366]: ['A', 'R', 'B']
A really simple solution here is to use filter(). In your example, just type:
df.filter(lst)
and it will automatically ignore any missing columns. For more, see the documentation for filter.
As a general note, filter is a very flexible and powerful way to select specific columns. In particular, you can use regular expressions. Borrowing the sample data from #jezrael, you could type either of the following.
df.filter(regex='A|R|B')
df.filter(regex='[ARB]')
Those are trivial examples, but suppose you wanted only columns starting with those letters, then you could type:
df.filter(regex='^[ARB]')
FWIW, in some quick timings I find this to be faster than the list comprehension method, but I don't think speed is really much of a concern here -- even the slowest way should be fast enough, as the speed does not depend on the size of the dataframe, only on the number of columns.
Honestly, all of these ways are fine and you can go with whatever is most readable to you. I prefer filter because it is simple while also giving you more options for selecting columns than a simple intersection.
Use * with list
data = df[[*lst]]
It will give the desired result.
please try this:
syntax : Dataframe[[List of Columns]]
for example : df[['a','b']]
a
Out[5]:
a b c
0 1 2 3
1 12 3 44
X is the list of req columns to slice
x = ['a','b']
this would give you the req slice:
a[x]
Out[7]:
a b
0 1 2
1 12 3
Performance:
%timeit a[x]
333 µs ± 9.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Is there any column match or row match function in python?

I have two data frame lets say:
dataframe A with column 'name'
name
0 4
1 2
2 1
3 3
Another dataframe B with two columns i.e. name and value
name value
0 3 5
1 2 6
2 4 7
3 1 8
I want to rearrange the value in dataframe B according to the name column in dataframe A
I am expecting final dataframe similar to this:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Here are two options:
dfB.set_index('name').loc[dfA.name].reset_index()
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Or,
dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
dfA
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Timings:
import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
names = np.arange(10**7)
prng.shuffle(names)
dfA = pd.DataFrame({'name': names})
prng.shuffle(names)
dfB = pd.DataFrame({'name': names, 'value': prng.randint(0, 100, 10**7)})
%timeit dfB.set_index('name').loc[dfA.name].reset_index()
1 loop, best of 3: 2.27 s per loop
%timeit dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
1 loop, best of 3: 1.65 s per loop
%timeit dfB.set_index('name').ix[dfA.name].reset_index()
1 loop, best of 3: 1.66 s per loop

Categories