Suppose I have a DataFrame such as:
df = pd.DataFrame(np.random.randn(10,5), columns = ['a','b','c','d','e'])
and I would like to retrieve the last value in column e. I could do:
df['e'].tail(1)
but this returns a Series that still carries the index label (9). Ideally, I just want the value as a plain number that I can work with directly. I could also do:
np.array(df['e'].tail(1))
but this would then require me to access the 0th element before I can really work with it. Is there a more direct/easy way to do this?
You could try the iloc method of the DataFrame:
In [26]: df
Out[26]:
          a         b         c         d         e
0 -1.079547 -0.722903  0.457495 -0.687271 -0.787058
1  1.326133  1.359255 -0.964076 -1.280502  1.460792
2  0.479599 -1.465210 -0.058247 -0.984733 -0.348068
3 -0.608238 -1.238068 -0.126889  0.572662 -1.489641
4 -1.533707 -0.218298 -0.877619  0.679370  0.485987
5 -0.864651 -0.180165 -0.528939  0.270885  1.313946
6  0.747612 -1.206509  0.616815 -1.758354 -0.158203
7 -2.309582 -0.739730 -0.004303  0.125640 -0.973230
8  1.735822 -0.750698  1.225104  0.431583 -1.483274
9 -0.374557 -1.132354  0.875028  0.032615 -1.131971
In [27]: df['e'].iloc[-1]
Out[27]: -1.1319705662711321
Or if you just want the scalar you could use iat, which is faster. From the docs:
If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures
In [28]: df.e.iat[-1]
Out[28]: -1.1319705662711321
Benchmarking:
In [31]: %timeit df.e.iat[-1]
100000 loops, best of 3: 18 µs per loop
In [32]: %timeit df.e.iloc[-1]
10000 loops, best of 3: 24 µs per loop
Try
df['e'].iloc[[-1]]
Note the double brackets: this returns a one-element Series that keeps the original index label, whereas
df['e'].iloc[-1]
returns a plain scalar.
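A quick sketch of the difference, using the example frame from the answer above:
In [29]: df['e'].iloc[-1]
Out[29]: -1.1319705662711321
In [30]: df['e'].iloc[[-1]]
Out[30]:
9   -1.131971
Name: e, dtype: float64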
We can also access it by indexing df.index and at:
df.at[df.index[-1], 'e']
It's faster than iloc, but slower than iat because of the extra df.index[-1] lookup.
If we decide to assign a value to the last element in column "e", the above method is much faster than the other two options (9-11 times faster):
>>> %timeit df.at[df.index[-1], 'e'] = 1
11.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit df['e'].iat[-1] = 1
107 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df['e'].iloc[-1] = 1
127 µs ± 7.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
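A caveat worth keeping in mind with newer pandas versions: under copy-on-write (opt-in in pandas 2.x via pd.options.mode.copy_on_write = True, and slated to become the default in 3.0), chained assignments such as df['e'].iat[-1] = 1 no longer write back to the original DataFrame, so assigning on the frame itself is the safe form:
df.at[df.index[-1], 'e'] = 1                 # assign the last row of 'e' by label
df.iat[-1, df.columns.get_loc('e')] = 1      # the same assignment, purely by position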
Given a dataframe, I am trying to count how many cells of one column with a specific value share an index with cells of another column holding other specific values.
In this instance the output should be 2, since the condition is df['z'] == 4 and df['x'] == 'C' and only rows 10 and 11 match this requirement.
My code does not produce any result, only this warning message pointing at the comparison in line 5:
:5: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  if (df[df['z']== 4].index.values) == (df[df['x']== 'C'].index.values):
Besides fixing this issue, is there another more 'pythonish' way of doing this without a for loop?
import numpy as np
import pandas as pd

data = [['A', 1, 2, 5, 'blue'],
        ['A', 1, 5, 6, 'blue'],
        ['A', 4, 4, 7, 'blue'],
        ['B', 6, 5, 4, 'yellow'],
        ['B', 9, 9, 3, 'blue'],
        ['B', 7, 9, 1, 'yellow'],
        ['B', 2, 3, 1, 'yellow'],
        ['B', 5, 1, 2, 'yellow'],
        ['C', 2, 10, 9, 'green'],
        ['C', 8, 2, 8, 'green'],
        ['C', 5, 4, 3, 'green'],
        ['C', 8, 4, 3, 'green']]
df = pd.DataFrame(data, columns=['x', 'y', 'z', 'xy', 'color'])

k = 0
print(df[df['z'] == 4].index.values)
print(df[df['x'] == 'C'].index.values)
for i in df['z']:
    if (df[df['z'] == 4].index.values) == (df[df['x'] == 'C'].index.values):
        k += 1
print(k)
Try building the condition as a Boolean mask first:
c = df['z'].eq(4) & df['x'].eq('C')  # your condition
Finally, count the matching rows:
count = df[c].index.size
# OR
count = len(df[c].index)
Output:
print(count)
2
You can do the following:
df[(df['z']==4) & (df['x']=='C')].shape[0]
#2
Assuming just the number is necessary and not the filtered frame, calculating the number of True values in the Boolean Series is faster:
Calculate the conditions as Boolean Series:
m = df['z'].eq(4) & df['x'].eq('C')
Count True values via Series.sum:
k = m.sum()
or via np.count_nonzero:
k = np.count_nonzero(m)
k:
2
Timing Information via %timeit:
All timings exclude creation of the Boolean mask, since every approach uses the same mask and that part of the cost is identical in all cases:
m = df['z'].eq(4) & df['x'].eq('C')
Henry Ecker (This Answer)
%timeit m.sum()
25.6 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.count_nonzero(m)
7 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
IoaTzimas
%timeit df[m].shape[0]
151 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Anurag Dabas
%timeit df[m].index.size
163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit len(df[m].index)
165 µs ± 5.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
SeaBean
%timeit df.loc[m].shape[0]
151 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(Without loc is the same as IoaTzimas)
You can use .loc with the Boolean condition mask to select the matching rows, then use shape[0] to get the row count:
df.loc[(df['z']== 4) & (df['x']== 'C')].shape[0]
The row selection works the same with or without .loc, so this is equivalent to:
df[(df['z']== 4) & (df['x']== 'C')].shape[0]
However, it is good practice to use .loc rather than omitting it. You can refer to this post for further information.
Result:
2
I would expect that indexing one Series by another would result in a new Series whose index is the same as the indexing Series. However it is instead a combination of the index of the indexed and the indexing Series.
Running
A = pandas.Series(range(3), index=list("ABC"))
B = pandas.Series(list("AABBCC"), index=list("XYZWVU"))
print(A[B])
yields
A 0
A 0
B 1
B 1
C 2
C 2
dtype: int64
So the index here has the shape of B.index but the values of A.index.
I would instead have expected that the index for A[B] would be identical to B.index, as if composing two mappings. Why is it not like this? Is there a use to the current setup (which seems to me to be useless)? Is there a 'correct' way to get what I'm looking for?
This problem makes certain operations tricky. For example,
B[A[B] == 2]
would intuitively be how to select the entries of B whose values yield 2 when looked up in A. (This is useful if B is a DataFrame with some object IDs in one column and we want to select rows of B based on a value of that object stored in a secondary table.) However, this raises a ValueError: cannot reindex from a duplicate axis. It works if you drop the index:
B[(A[B] == 2).values]
Is using .values the proper way to do this or is there a better way?
In my opinion, indexing means choosing the corresponding index entries, so it makes sense that A[B] still carries labels from A.index.
If instead you want to map the values, use Series.map, which keeps B's own index.
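For illustration, with the A and B defined in the question, B.map(A) looks each value of B up in A but labels the result with B's own index:
B.map(A)
X    0
Y    0
Z    1
W    1
V    2
U    2
dtype: int64
And it seems faster too: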
%timeit -n 1000 B.map(A)
# 196 µs ± 6.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 A[B]
# 384 µs ± 5.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And:
%timeit -n 1000 B[B.map(A).eq(2)]
# 624 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 B[A[B].eq(2).values]
#779 µs ± 7.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
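Likewise, the map-based selection returns the matching entries of B, labelled by B's own index (same A and B as above):
B[B.map(A).eq(2)]
V    C
U    C
dtype: object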
Assuming I have a pandas dataframe such as
df_p = pd.DataFrame(
    {'name_array':
        [[20130101, 320903902, 239032902],
         [20130101, 3253453, 239032902],
         [65756, 4342452, 32425432523]],
     'name': ['a', 'a', 'c']})
I want to extract a Series containing the flattened arrays from each row, preserving their order.
The expected result is a pandas.core.series.Series
This question is not a duplicate because my expected output is a pandas Series, and not a dataframe.
The solutions using melt are slower than the OP's original method, which they shared in their own answer here, especially after the speedup from my comment on that answer.
I created a larger dataframe to test on:
df = pd.DataFrame({'name_array': np.random.rand(1000, 3).tolist()})
And timing the two solutions using melt on this dataframe yield:
In [16]: %timeit pd.melt(df.name_array.apply(pd.Series).reset_index(), id_vars=['index'],value_name='name_array').drop('variable', axis=1).sort_values('index')
173 ms ± 5.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [17]: %timeit df['name_array'].apply(lambda x: pd.Series([i for i in x])).melt().drop('variable', axis=1)['value']
175 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The OP's method with the speedup I suggested in the comments:
In [18]: %timeit pd.Series(np.concatenate(df['name_array']))
18 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And finally, the fastest solution as provided here but modified to provide a series instead of dataframe output:
In [14]: from itertools import chain
In [15]: %timeit pd.Series(list(chain.from_iterable(df['name_array'])))
402 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This last method is faster than the melt() approaches by more than two orders of magnitude and faster than np.concatenate() by more than an order of magnitude.
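For reference, applying this to the small df_p from the question produces the same flattened values as the OP's np.concatenate approach, just without the column name:
pd.Series(list(chain.from_iterable(df_p['name_array'])))
0       20130101
1      320903902
2      239032902
3       20130101
4        3253453
5      239032902
6          65756
7        4342452
8    32425432523
dtype: int64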
This is the solution I've figured out. I don't know if there are more efficient ways.
df_p = pd.DataFrame(
    {'name_array':
        [[20130101, 320903902, 239032902],
         [20130101, 3253453, 239032902],
         [65756, 4342452, 32425432523]],
     'name': ['a', 'a', 'c']})

data = pd.DataFrame({'column': np.concatenate(df_p['name_array'].values)})['column']
output:
0       20130101
1      320903902
2      239032902
3       20130101
4        3253453
5      239032902
6          65756
7        4342452
8    32425432523
Name: column, dtype: int64
You can use pd.melt:
pd.melt(df_p.name_array.apply(pd.Series).reset_index(),
        id_vars=['index'],
        value_name='name_array') \
  .drop('variable', axis=1) \
  .sort_values('index')
OUTPUT:
index    name_array
0        20130101
0        320903902
0        239032902
1        20130101
1        3253453
1        239032902
2        65756
2        4342452
2        32425432523
You can flatten the column of lists with a comprehension and then build a Series from it, in this way:
pd.Series([element for row in df_p.name_array for element in row])
I have a dataframe where one column holds lists of hash values stored as strings:
'[d85235f50b3c019ad7c6291e3ca58093,03e0fb034f2cb3264234b9eae09b4287]', just to be clear.
The dataframe looks like:
1
0 [8a88e629c368001c18619c7cd66d3e96, 4b0709dd990a0904bbe6afec636c4213, c00a98ceb6fc7006d572486787e551cc, 0e72ae6851c40799ec14a41496d64406, 76475992f4207ee2b209a4867b42c372]
1 [3277ded8d1f105c84ad5e093f6e7795d]
2 [d85235f50b3c019ad7c6291e3ca58093, 03e0fb034f2cb3264234b9eae09b4287]
I'd like to create a list of the unique hash IDs present in this column.
What is an efficient way to do this?
Thank you
Option 1
See timing below for fastest option
You can embed the parsing and flattening in one comprehension
[y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')]
['8a88e629c368001c18619c7cd66d3e96',
'4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc',
'0e72ae6851c40799ec14a41496d64406',
'76475992f4207ee2b209a4867b42c372',
'3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093',
'03e0fb034f2cb3264234b9eae09b4287']
From there, you can use either list(set()), pd.unique, or np.unique
pd.unique([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')])
array(['8a88e629c368001c18619c7cd66d3e96',
'4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc',
'0e72ae6851c40799ec14a41496d64406',
'76475992f4207ee2b209a4867b42c372',
'3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093',
'03e0fb034f2cb3264234b9eae09b4287'], dtype=object)
Option 2
For brevity, use pd.Series.extractall
list(set(df['1'].str.extractall(r'(\w+)')[0]))
['8a88e629c368001c18619c7cd66d3e96',
'4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc',
'0e72ae6851c40799ec14a41496d64406',
'76475992f4207ee2b209a4867b42c372',
'3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093',
'03e0fb034f2cb3264234b9eae09b4287']
@jezrael's list(set()) with my comprehension is the fastest.
Parse Timing
I kept the same list(set()) for purposes of comparing parsing and flattening.
%timeit list(set(np.concatenate(df['1'].apply(yaml.load).values).tolist()))
%timeit list(set([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')]))
%timeit list(set(chain.from_iterable(df['1'].str.strip('[]').str.split(', '))))
%timeit list(set(df['1'].str.extractall(r'(\w+)')[0]))
1.01 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
6.42 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
279 µs ± 8.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
941 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This takes my comprehension and uses various ways to make unique to compare those speeds
%timeit pd.unique([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')])
%timeit np.unique([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')])
%timeit list(set([y for x in df['1'].values.tolist() for y in x.strip('[]').split(', ')]))
57.8 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.5 µs ± 552 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.18 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You need strip with split first, and chain for the flattening:
print (df.columns.tolist())
['col']
#convert strings to lists per row
#change to your column name if necessary
s = df['col'].str.strip('[]').str.split(', ')
print (s)
0 [8a88e629c368001c18619c7cd66d3e96, 4b0709dd990...
1 [3277ded8d1f105c84ad5e093f6e7795d]
2 [d85235f50b3c019ad7c6291e3ca58093, 03e0fb034f2...
Name: col, dtype: object
#check first value
print (type(s.iat[0]))
<class 'list'>
#get unique values - for unique values use set
from itertools import chain
L = list(set(chain.from_iterable(s)))
print (L)
['76475992f4207ee2b209a4867b42c372', '3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093', '4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc', '03e0fb034f2cb3264234b9eae09b4287',
'8a88e629c368001c18619c7cd66d3e96', '0e72ae6851c40799ec14a41496d64406']
from itertools import chain
s = [x.strip('[]').split(', ') for x in df['col'].values.tolist()]
L = list(set(chain.from_iterable(s)))
print (L)
['76475992f4207ee2b209a4867b42c372', '3277ded8d1f105c84ad5e093f6e7795d',
'd85235f50b3c019ad7c6291e3ca58093', '4b0709dd990a0904bbe6afec636c4213',
'c00a98ceb6fc7006d572486787e551cc', '03e0fb034f2cb3264234b9eae09b4287',
'8a88e629c368001c18619c7cd66d3e96', '0e72ae6851c40799ec14a41496d64406']
IIUC, you want to flatten your data. Convert it to a column of lists using yaml.load (with newer PyYAML versions, prefer yaml.safe_load, since yaml.load now requires an explicit Loader).
import yaml
df = df.applymap(yaml.load)
print(df)
1
0 [8a88e629c368001c18619c7cd66d3e96, 4b0709dd990...
1 [3277ded8d1f105c84ad5e093f6e7795d]
2 [d85235f50b3c019ad7c6291e3ca58093, 03e0fb034f2...
The easiest way would be to construct a new dataframe from the old one's values.
out = pd.DataFrame(np.concatenate(df.iloc[:, 0].values.tolist()))
print(out)
0
0 8a88e629c368001c18619c7cd66d3e96
1 4b0709dd990a0904bbe6afec636c4213
2 c00a98ceb6fc7006d572486787e551cc
3 0e72ae6851c40799ec14a41496d64406
4 76475992f4207ee2b209a4867b42c372
5 3277ded8d1f105c84ad5e093f6e7795d
6 d85235f50b3c019ad7c6291e3ca58093
7 03e0fb034f2cb3264234b9eae09b4287
I am trying to determine whether there is an entry in a Pandas column that has a particular value. I tried to do this with if x in df['id']. I thought this was working, except when I fed it a value that I knew was not in the column, 43 in df['id'], it still returned True. When I subset to a data frame only containing entries matching the missing id, df[df['id'] == 43], there are, obviously, no entries in it. How do I determine if a column in a Pandas data frame contains a particular value, and why doesn't my current method work? (FYI, I have the same problem when I use the implementation in this answer to a similar question.)
The in operator applied to a Series checks whether the value is in the index:
In [11]: s = pd.Series(list('abc'))
In [12]: s
Out[12]:
0 a
1 b
2 c
dtype: object
In [13]: 1 in s
Out[13]: True
In [14]: 'a' in s
Out[14]: False
One option is to see if it's in unique values:
In [21]: s.unique()
Out[21]: array(['a', 'b', 'c'], dtype=object)
In [22]: 'a' in s.unique()
Out[22]: True
or a python set:
In [23]: set(s)
Out[23]: {'a', 'b', 'c'}
In [24]: 'a' in set(s)
Out[24]: True
As pointed out by @DSM, it may be more efficient (especially if you're just doing this for one value) to just use in directly on the values:
In [31]: s.values
Out[31]: array(['a', 'b', 'c'], dtype=object)
In [32]: 'a' in s.values
Out[32]: True
You can also use pandas.Series.isin although it's a little bit longer than 'a' in s.values:
In [2]: s = pd.Series(list('abc'))
In [3]: s
Out[3]:
0 a
1 b
2 c
dtype: object
In [3]: s.isin(['a'])
Out[3]:
0 True
1 False
2 False
dtype: bool
In [4]: s[s.isin(['a'])].empty
Out[4]: False
In [5]: s[s.isin(['z'])].empty
Out[5]: True
But this approach can be more flexible if you need to match multiple values at once for a DataFrame (see DataFrame.isin)
>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]})
>>> df.isin({'A': [1, 3], 'B': [4, 7, 12]})
A B
0 True False # Note that B didn't match 1 here.
1 False True
2 True True
found = df[df['Column'].str.contains('Text_to_search')]
print(found.count())
found.count() will contain the number of matches; if it is 0, the string was not found in the column.
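If all you need is a yes/no answer, a slightly more direct variant of the same idea (a sketch, not part of the original answer) reduces the match mask to a single boolean with .any(); na=False guards against missing values:
df['Column'].str.contains('Text_to_search', na=False).any()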
You can try this to check whether a particular value x is present in a particular column named id:
if x in df['id'].values
I did a few simple tests:
In [10]: x = pd.Series(range(1000000))
In [13]: timeit 999999 in x.values
567 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: timeit (x == 999999).any()
6.86 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit x.eq(999999).any()
7.03 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: timeit x.eq(9).any()
7.04 ms ± 60 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: timeit x.isin([999999]).any()
9.54 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: timeit 999999 in set(x)
79.8 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly, it doesn't matter whether you look up 9 or 999999; it seems to take about the same amount of time using the in syntax (it must be using some vectorized computation):
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [25]: timeit 9999 in x.values
647 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]: timeit 999999 in x.values
642 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [27]: timeit 99199 in x.values
644 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: timeit 1 in x.values
667 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Seems like using x.values is the fastest, but maybe there is a more elegant way in pandas?
Or use Series.tolist or Series.any:
>>> s = pd.Series(list('abc'))
>>> s
0 a
1 b
2 c
dtype: object
>>> 'a' in s.tolist()
True
>>> (s=='a').any()
True
Series.tolist makes a list out of the Series; the other approach builds a boolean Series by comparing against the regular Series and then checks whether it contains any True values.
Simple condition:
if any(str(elem) in ['a','b'] for elem in df['column'].tolist()):
Use
df[df['id']==x].index.tolist()
If x is present in id, this returns the list of indices where it occurs; otherwise it gives an empty list.
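Since an empty list is falsy, a small usage sketch of this check (hypothetical variable names) could look like:
idx = df[df['id'] == x].index.tolist()
if idx:
    print('x found at rows', idx)
else:
    print('x not found')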
I had a CSV file to read:
df = pd.read_csv('50_states.csv')
And after trying:
if value in df.column:
    print(True)
which never printed True, even though the value was in the column, I tried:
for values in df.column:
    if value == values:
        print(True)
        # or do something
    else:
        print(False)
which worked. I hope this can help!
Use query() to find the rows where the condition holds and get the number of rows with shape[0]. If there exists at least one entry, this statement is True:
df.query('id == 123').shape[0] > 0
Suppose your dataframe looks like:
Now you want to check whether the filename "80900026941984" is present in the dataframe or not.
You can simply write:
if sum(df["filename"].astype("str").str.contains("80900026941984")) > 0:
    print("found")