I'm selecting several columns of a dataframe, by a list of the column names. This works fine if all elements of the list are in the dataframe.
But if some elements of the list are not in the DataFrame, then it will generate the error "not in index".
Is there a way to select all columns which are included in that list, even if not all elements of the list are included in the dataframe? Here is some sample data which generates the above error:
import pandas as pd

df = pd.DataFrame([[0,1,2]], columns=list('ABC'))
lst = list('ARB')
data = df[lst]  # KeyError: "['R'] not in index"
I think you need Index.intersection:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (df)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3
lst = ['A','R','B']
print (df.columns.intersection(lst))
Index(['A', 'B'], dtype='object')
data = df[df.columns.intersection(lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
Another solution with numpy.intersect1d (note that it returns a sorted array, so the selected columns come back in sorted order):
import numpy as np

data = df[np.intersect1d(df.columns, lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
A few other ways, and the list comprehension is much faster. (Note that the & operator on an Index, used in the first variant below, is deprecated in recent pandas in favor of Index.intersection.)
In [1357]: df[df.columns & lst]
Out[1357]:
A B
0 1 4
1 2 5
2 3 6
In [1358]: df[[c for c in df.columns if c in lst]]
Out[1358]:
A B
0 1 4
1 2 5
2 3 6
Timings
In [1360]: %timeit [c for c in df.columns if c in lst]
100000 loops, best of 3: 2.54 µs per loop
In [1359]: %timeit df.columns & lst
1000 loops, best of 3: 231 µs per loop
In [1362]: %timeit df.columns.intersection(lst)
1000 loops, best of 3: 236 µs per loop
In [1363]: %timeit np.intersect1d(df.columns, lst)
10000 loops, best of 3: 26.6 µs per loop
Details
In [1365]: df
Out[1365]:
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3
In [1366]: lst
Out[1366]: ['A', 'R', 'B']
A really simple solution here is to use filter(). In your example, just type:
df.filter(lst)
and it will automatically ignore any missing columns. For more, see the documentation for filter.
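For instance, here is a minimal sketch using only the sample frame the question defines:
import pandas as pd

df = pd.DataFrame([[0, 1, 2]], columns=list('ABC'))
lst = list('ARB')
print (df.filter(lst))   # the missing 'R' is silently ignored
#    A  B
# 0  0  1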
As a general note, filter is a very flexible and powerful way to select specific columns. In particular, you can use regular expressions. Borrowing the sample data from @jezrael, you could type either of the following.
df.filter(regex='A|R|B')
df.filter(regex='[ARB]')
Those are trivial examples, but suppose you wanted only columns starting with those letters, then you could type:
df.filter(regex='^[ARB]')
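As a quick illustration with hypothetical column names (not part of the original question), the anchored pattern keeps only the columns that start with one of those letters:
df2 = pd.DataFrame(columns=['Apple', 'Banana', 'Cherry', 'xA'])
print (df2.filter(regex='^[AB]').columns)
# Index(['Apple', 'Banana'], dtype='object')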
FWIW, in some quick timings I find this to be faster than the list comprehension method, but I don't think speed is really much of a concern here -- even the slowest way should be fast enough, as the speed does not depend on the size of the dataframe, only on the number of columns.
Honestly, all of these ways are fine and you can go with whatever is most readable to you. I prefer filter because it is simple while also giving you more options for selecting columns than a simple intersection.
Use * with the list:
data = df[[*lst]]
Note, however, that [*lst] simply copies the list, so this is the same lookup as df[lst] and still raises a KeyError when some columns are missing; it does not avoid the error in the question.
Please try this:
Syntax: dataframe[[list of columns]]
For example: df[['a','b']]
Here a is the dataframe:
a
Out[5]:
    a  b   c
0   1  2   3
1  12  3  44
x is the list of required columns to slice:
x = ['a','b']
This gives you the required slice:
a[x]
Out[7]:
    a  b
0   1  2
1  12  3
Performance:
%timeit a[x]
333 µs ± 9.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Related
This question already has answers here: Split / Explode a column of dictionaries into separate columns with pandas
I have a really simple Pandas dataframe where each cell contains a list. I'd like to split each element of the list into its own column. I can do that by exporting the values and then creating a new dataframe, but this doesn't seem like a good way to do it, especially if my dataframe had a column aside from the list column.
import pandas as pd
df = pd.DataFrame(data=[[[8,10,12]],
                        [[7,9,11]]])
df = pd.DataFrame(data=[x[0] for x in df.values])
Desired output:
   0   1   2
0  8  10  12
1  7   9  11
Follow-up based on @Psidom's answer:
If I did have a second column:
df = pd.DataFrame(data=[[[8,10,12], 'A'],
                        [[7,9,11], 'B']])
How do I not lose the other column?
Desired output:
   0   1   2  3
0  8  10  12  A
1  7   9  11  B
You can loop through the Series with the apply() function and convert each list to a Series; this automatically expands the list into columns:
df[0].apply(pd.Series)
# 0 1 2
#0 8 10 12
#1 7 9 11
Update: To keep other columns of the data frame, you can concatenate the result with the columns you want to keep:
pd.concat([df[0].apply(pd.Series), df[1]], axis = 1)
# 0 1 2 1
#0 8 10 12 A
#1 7 9 11 B
You could do pd.DataFrame(df[col].values.tolist()), which is much faster (~500x):
In [820]: pd.DataFrame(df[0].values.tolist())
Out[820]:
0 1 2
0 8 10 12
1 7 9 11
In [821]: pd.concat([pd.DataFrame(df[0].values.tolist()), df[1]], axis=1)
Out[821]:
0 1 2 1
0 8 10 12 A
1 7 9 11 B
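Note that both results carry a duplicated column label: the 1 from the expanded list sits next to the original column 1. If unique labels matter, one option (a sketch with hypothetical names item_*/tag) is to rename before concatenating:
expanded = pd.DataFrame(df[0].values.tolist()).add_prefix('item_')
pd.concat([expanded, df[1].rename('tag')], axis=1)
#    item_0  item_1  item_2 tag
# 0       8      10      12   A
# 1       7       9      11   B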
Timings
Medium
In [828]: df.shape
Out[828]: (20000, 2)
In [829]: %timeit pd.DataFrame(df[0].values.tolist())
100 loops, best of 3: 15 ms per loop
In [830]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 4.06 s per loop
Large
In [832]: df.shape
Out[832]: (200000, 2)
In [833]: %timeit pd.DataFrame(df[0].values.tolist())
10 loops, best of 3: 161 ms per loop
In [834]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 40.9 s per loop
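One caveat with the values.tolist() approach: the new frame gets a fresh RangeIndex, so if df has a non-default index the concat above would misalign. A sketch that passes the original index through explicitly:
pd.concat([pd.DataFrame(df[0].tolist(), index=df.index), df[1]], axis=1)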
I have a dataframe with subjects in two different conditions and many value columns.
d = {
    "subject": [1, 1, 2, 2],
    "condition": ["on", "off", "on", "off"],
    "value": [1, 2, 3, 5]
}
df = pd.DataFrame(data=d)
df
   subject condition  value
0        1        on      1
1        1       off      2
2        2        on      3
3        2       off      5
I would like to get new columns which indicate the difference off-on between both conditions. In this case I would like to get:
   subject condition  value  off-on
0        1        on      1       1
1        1       off      2       1
2        2        on      3       2
3        2       off      5       2
How would I best do that?
I could achieve the result using this code:
onoff = (df[df.condition == "off"].value.reset_index()
         - df[df.condition == "on"].value.reset_index()).value
for idx, sub in enumerate(df.subject.unique()):
    df.loc[df.subject == sub, "off-on"] = onoff.iloc[idx]
But it seems quite tedious and slow. I was hoping for a solution without loop. I have many rows and very many value columns. Is there a better way?
Use a pivot combined with map:
df['off-on'] = df['subject'].map(
    df.pivot(index='subject', columns='condition', values='value')
    .eval('off-on')
)
Or with a MultiIndex (more efficient than a pivot):
s = df.set_index(['condition', 'subject'])['value']
df['off-on'] = df['subject'].map(s['off']-s['on'])
Output:
  subject condition  value  off-on
0       1        on      1       1
1       1       off      2       1
2       2        on      3       2
3       2       off      5       2
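To see what the second map consumes, the intermediate per-subject difference looks like this (same names as above):
s = df.set_index(['condition', 'subject'])['value']
print(s['off'] - s['on'])
# subject
# 1    1
# 2    2
# Name: value, dtype: int64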
Timings
On 100k subjects
# MultiIndexing
43.2 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# pivot
77 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Use DataFrame.pivot to reshape the data, then map the subtraction of the off and on columns back to the subjects with Series.map:
df1 = df.pivot(index='subject', columns='condition', values='value')
df['off-on'] = df['subject'].map(df1['off'].sub(df1['on']))
print (df)
  subject condition  value  off-on
0       1        on      1       1
1       1       off      2       1
2       2        on      3       2
3       2       off      5       2
Details:
print (df.pivot(index='subject', columns='condition', values='value'))
condition  off  on
subject
1            2   1
2            5   3
print (df1['off'].sub(df1['on']))
subject
1    1
2    2
dtype: int64
I have two dataframes A, B with NxM shape. I want to multiply them such that each element of A is multiplied with the respective element of B.
e.g:
A,B = input dataframes
C = final dataframe
I want C[i][j] = A[i][j]*B[i][j] for i=1..N and j=1..M
I searched but couldn't find exactly this solution.
I think you can use:
C = A * B
Another solution is mul:
C = A.mul(B)
Sample:
print (A)
   a  b
0  1  3
1  2  4
2  3  7
print (B)
   a  b
0  2  3
1  1  4
2  3  2
print (A * B)
   a   b
0  2   9
1  2  16
2  9  14
print (A.mul(B))
   a   b
0  2   9
1  2  16
2  9  14
Timings, with A and B of length 300k (built by repeating the sample frames):
A = pd.concat([A]*100000).reset_index(drop=True)
B = pd.concat([B]*100000).reset_index(drop=True)
In [218]: %timeit A * B
The slowest run took 4.27 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.57 ms per loop
In [219]: %timeit A.mul(B)
100 loops, best of 3: 3.56 ms per loop
print (A * B)
print (A.mul(B))
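Worth noting: both A * B and A.mul(B) first align on row and column labels, so any label present in only one frame yields NaN. A small sketch (hypothetical frames, not the sample above) showing mul's fill_value, which treats missing entries as a given value:
import pandas as pd

A = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
B = pd.DataFrame({'a': [3, 4]}, index=[1, 2])
print (A * B)                   # only index 1 overlaps; the rest is NaN
print (A.mul(B, fill_value=1))  # missing entries treated as 1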
I need a way to print several lists of varying lengths as columns next to each other, tab-delimited, with the empty cells remaining empty or containing some fill character (e.g. "-").
The methods attempted so far have not worked for lists of varying lengths, and numpy has not been working as I expected.
To summarize:
listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]]
printed as such in a .txt file:
1 4 9
2 5 10
3 6 11
- 7 12
- 8 -
You can use itertools.izip_longest (renamed itertools.zip_longest in Python 3). To fill the None spaces in the longer sequences you can use fillvalue (thanks @szxk):
>>> import itertools
>>> listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]]
>>> for x in itertools.izip_longest(*listname, fillvalue="-"):
... print '\t'.join([str(e) for e in x])
...
1 4 9
2 5 10
3 6 11
- 7 12
- 8 -
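On Python 3 the same approach works unchanged with itertools.zip_longest (izip_longest was renamed there); a sketch:
import itertools

listname = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12]]
for row in itertools.zip_longest(*listname, fillvalue='-'):
    print('\t'.join(str(e) for e in row))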
You can use the zip function in this case; it is more efficient for small lists than itertools.izip. Note, however, that zip stops at the shortest list, so the trailing elements of the longer lists are dropped instead of filled:
listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]]
with open('a.txt', 'w') as f:
    for tup in zip(*listname):
        f.write('\t'.join(map(str, tup)) + '\n')
A benchmark (note that the izip_longest call only builds a lazy iterator, while zip materializes the whole list, so the numbers are not directly comparable):
~$ python -m timeit "import itertools;listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]];itertools.izip_longest(*listname)"
1000000 loops, best of 3: 1.13 usec per loop
~$ python -m timeit "listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]];zip(*listname)"
1000000 loops, best of 3: 0.67 usec per loop
What about using pandas:
In [38]: listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]]
In [39]: import pandas as pd
In [40]: df = pd.DataFrame(listname, dtype=object)
In [41]: df.T
Out[41]:
0 1 2
0 1 4 9
1 2 5 10
2 3 6 11
3 None 7 12
4 None 8 None
[5 rows x 3 columns]
In [42]: df.T.to_csv("my_file.txt", index=False, header=False, sep="\t", na_rep="-")