Why do we use 'loc' for pandas DataFrames? It seems the following code, with or without using loc, both compiles and runs at a similar speed:
%timeit df_user1 = df.loc[df.user_id=='5561']
100 loops, best of 3: 11.9 ms per loop
or
%timeit df_user1_noloc = df[df.user_id=='5561']
100 loops, best of 3: 12 ms per loop
So why use loc?
Edit: This has been flagged as a duplicate question. But although "pandas iloc vs ix vs loc explanation?" does mention that
you can do column retrieval just by using the data frame's
__getitem__:
df['time'] # equivalent to df.loc[:, 'time']
it does not say why we use loc. It explains lots of features of loc, but my specific question is "why not just omit loc altogether?", for which I have accepted a very detailed answer below.
Also, in that other post the answer (which I do not think is an answer) is buried deep in the discussion, and anyone searching for what I was looking for would find the information hard to locate and would be much better served by the answer provided to my question.
Explicit is better than implicit.
df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:
In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]:
False True
0 3 1
1 4 2
2 5 3
You might want to use df[[True]] to select the True column. Instead it raises a ValueError:
In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
Versus using loc:
In [231]: df.loc[[True]]
Out[231]:
False True
0 3 1
In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as df above:
In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]:
A B
0 1 3
1 2 4
2 3 5
In [259]: df2[['B']]
Out[259]:
B
0 3
1 4
2 5
Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing the details of indexer and df.
df.loc[row_indexer, column_indexer] can select rows and columns. df[indexer] can only select rows or columns, depending on the type of values in indexer and the type of column labels df has (again, are they boolean?).
In [237]: df2.loc[[True,False,True], 'B']
Out[237]:
0 3
2 5
Name: B, dtype: int64
When a slice is passed to df.loc the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:
In [239]: df2.loc[1:2]
Out[239]:
A B
1 2 4
2 3 5
In [271]: df2[1:2]
Out[271]:
A B
1 2 4
Performance consideration of "chained assignment" with and without using .loc
Let me supplement the already very good answers with a consideration of system performance.
The question itself compares the system performance (execution time) of two pieces of code, with and without .loc, and the execution times are roughly the same for the code samples quoted. However, for some other code samples there can be a considerable difference in execution time with and without .loc: e.g. several times slower, or more!
A common case of pandas DataFrame manipulation is creating a new column derived from the values of an existing column. We might use the code below to filter on a condition (based on the existing column) and set a different value in the new column:
df[df['mark'] >= 50]['text_rating'] = 'Pass'
However, this kind of "Chained Assignment" does not work since it could create a "copy" instead of a "view" and assignment to the new column based on this "copy" will not update the original dataframe.
2 options available:
We can either use .loc, or
Code it another way without using .loc
2nd case e.g.:
df['text_rating'][df['mark'] >= 50] = 'Pass'
By placing the filtering last (after selecting the new column first), the assignment works and the original DataFrame is updated.
The solution using .loc is as follows:
df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'
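For reference, the timings below assume a DataFrame roughly along the following lines (just a sketch; the original data isn't shown, so the sizes and values here are made up):
import numpy as np
import pandas as pd
# Hypothetical data: 10,000 integer marks between 0 and 100.
df = pd.DataFrame({'mark': np.random.randint(0, 101, size=10000)})
df['text_rating'] = ''  # new column, to be filled in conditionally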
Now, let's see their execution time:
Without using .loc:
%%timeit
df['text_rating'][df['mark'] >= 50] = 'Pass'
2.01 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
With using .loc:
%%timeit
df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'
577 µs ± 5.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As we can see, with .loc the execution is more than 3x faster!
For a more detailed explanation of "chained assignment", you can refer to another related post, How to deal with SettingWithCopyWarning in pandas?, and in particular the answer by cs95. That post is excellent at explaining the functional differences of using .loc; here I only supplement the system performance (execution time) difference.
In addition to what has already been said (the issues with having True and False as column names without using loc, the ability to select both rows and columns with loc, and the ability to slice row and column selections), another big difference is that you can use loc to assign values to specific rows and columns. If you select a subset of the DataFrame using a boolean Series and attempt to change a value of that subset selection, you will likely get the SettingWithCopy warning.
Let's say you're trying to change the "upper management" column for all the rows whose salary is bigger than 60000.
This:
mask = df["salary"] > 60000
df[mask]["upper management"] = True
throws the warning that "A value is is trying to be set on a copy of a slice from a Dataframe" and won't work because df[mask] creates a copy and trying to update "upper management" of that copy has no effect on the original df.
But this succeeds:
mask = df["salary"] > 60000
df.loc[mask,"upper management"] = True
Note that in both cases you can do df[df["salary"] > 60000] or df.loc[df["salary"] > 60000], but I think storing the boolean condition in a variable first is cleaner.
Related
I have written logic for changing data inside a pivoted table, and I am able to achieve it for a single condition, but I need to add an else condition too. I tried .apply(), but it doesn't produce the dataset I am looking for.
df_s = sorted_data.groupby(["GH","HA","Tipo"]).first().reset_index()
df_s22 = df_s[df_s['Tipo'].eq('1')].assign(Tipo='2').rename(lambda x: x + .5)
I need an else condition in the code above to assign '1' otherwise.
The column "Tipo" is a string the way you handle it currently. Perhaps you could convert it into integers, usually easier to work with. Anyway, you have something with a column Tipo, that are strings, and they are either '1' or '2' (hard to tell if there are other values allowed, which affects the approaches you can take).
import numpy as np
import pandas as pd
df_s = pd.DataFrame({'Tipo': [str(i) for i in np.random.randint(1, 3, size=10)],
                     'other_data': [chr(i) for i in np.random.randint(65, 90, size=10)]})
Method 1
The most direct solution to your problem would be to define a function and apply it row wise, i.e. axis=1 (probably inefficient but does the job):
def fn(row):
    row.loc['Tipo'] = '2' if row.Tipo == '1' else '1'
    return row
df_s22 = df_s.apply(fn, axis=1)
timings: 2.57 ms ± 153 µs per loop
Method 2
Or apply directly to the column of interest
df_s22 = df_s.copy()
df_s22.loc[:,'Tipo'] = df_s22.loc[:,'Tipo'].apply(lambda x: '2' if x=='1' else '1')
timings: 862 µs ± 30.7 µs per loop
Method 3
You can also use the eval method:
df_s22 = df_s.copy()
df_s22.loc[:,'Tipo'] = df_s22.eval("Tipo=='1'").astype(int)+1
timings: 2.45 ms ± 97.3 µs per loop
Here I use the eval method to check whether the Tipo column is '1'. If it is, the comparison is True, which in Python is also interpreted as 1, so adding 1 to the check Tipo=='1' makes the True values 1+1=2, while the other values, which evaluate to False (i.e. 0), become 0+1=1. The output will have the Tipo column as integers, not strings any more.
Method 4
Using the assign method we can use similar checks
df_s22 = df_s.assign(Tipo=((df_s.Tipo == '1') + 1).astype(str))
timings: 783 µs ± 18.3 µs per loop
Tips and remarks:
In all cases you need to keep track of quotation marks: since Tipo holds strings, inside expressions wrapped in double quotes the inner quotes need to be single quotes.
Also remember that you are creating a second DataFrame in memory (df_s22); if your dataset is large and you want to do complex operations on it, they might be slower if memory fills up. Consider just creating a new column, perhaps named Tipo22, in your original DataFrame df_s, as sketched below.
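A minimal sketch of that suggestion, assuming the df_s defined above (the name Tipo22 is just an example):
# Add the recoded values as a new column instead of copying the whole frame.
df_s['Tipo22'] = df_s['Tipo'].apply(lambda x: '2' if x == '1' else '1')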
This question already has answers here:
What is the performance impact of non-unique indexes in pandas?
(2 answers)
Closed 4 years ago.
Suppose I have a pandas Series that I'd like to function as a multimap (multiple values for each index key):
# intval -> data1
a = pd.Series(data=-np.arange(100000),
index=np.random.randint(0, 50000, 100000))
I'd like to select (as quickly as possible) all the values from a
where a's index matches another index b. (Like an inner join. Or a merge but for series).
a may have duplicates in its index.
b has no duplicates and is not necessarily a subset of a's index. To give pandas the best possible chance, let's assume b can also be provided as a sorted index object:
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()
So, we would have something like:
                        target
a            b          result
3  0         3          3  0
3  1         7          8  3
4  2         8          ...
8  3         ...
9  4
...
I'm also only interested in getting the values of the result (index [3,8,...] not needed).
If a did not have duplicates, we would simply do:
a.reindex(b) # Cannot reindex a duplicate axis
Because & maintains the duplicates of a, we can't do:
d = a[a.index & b.index]
d = a.loc[a.index & b.index] # same
d = a.get(a.index & b.index) # same
print d.shape
So I think we need to do something like:
common = (a.index & b.index).unique()
a.loc[common]
... which is cumbersome, but also surprisingly slow. It's not building the list of items to select that's slow:
%timeit (a.index & b).unique()
# 100 loops, best of 3: 3.39 ms per loop
%timeit (a.index & b).unique().sort_values()
# 100 loops, best of 3: 4.19 ms per loop
... so it looks like it's really retrieving the values that's slow:
common = ((a.index & b).unique()).sort_values()
%timeit a.loc[common]
#10 loops, best of 3: 43.3 ms per loop
%timeit a.get(common)
#10 loops, best of 3: 42.1 ms per loop
... That's around 20 operations per second. Not exactly zippy! Why so slow?
Surely there must be a fast way to look up a set of values from a pandas Series? I don't want to get an indexed object out -- really all I'm asking for is a merge over sorted indexes, or (slower) hashed int lookups. Either way, this should be an extremely fast operation -- not a 20-operations-per-second operation on my 3 GHz machine.
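To make the "inner join / merge for Series" phrasing concrete, here is a minimal sketch using merge (not part of the original post; the 'key' and 'val' names are made up for illustration):
import numpy as np
import pandas as pd
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
b = pd.Index(np.unique(np.random.randint(30000, 100000, 100000))).sort_values()
# Inner join on the index: keep every row of a whose label appears in b.
result = pd.merge(a.rename('val').rename_axis('key').reset_index(),
                  b.to_frame(name='key'),
                  on='key', how='inner')['val']
# A mask-based lookup that also preserves a's duplicate labels:
result2 = a[a.index.isin(b)]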
Also:
Profiling a.loc[common] gives:
ncalls tottime percall cumtime percall filename:lineno(function)
# All the time spent here.
40 1.01 0.02525 1.018 0.02546 ~:0(<method 'get_indexer_non_unique' indexing.py:1443(_has_valid_type)
...
# seems to be called a lot.
1500 0.000582 3.88e-07 0.000832 5.547e-07 ~:0(<isinstance>)
PS. I previously posted a similar question about why Series.map is so slow:
Why is pandas.series.map so shockingly slow? The reason there was lazy under-the-hood indexing; that doesn't seem to be what is happening here.
Update:
For a similarly sized a and common, where a's index is unique:
% timeit a.loc[common]
1000 loops, best of 3: 760 µs per loop
... as #jpp points out, the non-unique index is likely to blame.
Repeated indices are guaranteed to slow down your dataframe indexing operations. You can amend your inputs to prove this to yourself:
a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))
%timeit a.loc[common] # 34.1 ms
a = pd.Series(data=-np.arange(100000), index=np.arange(100000))
%timeit a.loc[common] # 6.86 ms
As mentioned in this related question:
When the index is unique, pandas uses a hashtable to map key to value, O(1).
When the index is non-unique but sorted, pandas uses binary search, O(log N).
When the index is randomly ordered, pandas needs to check all the keys in the index, O(N).
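A minimal sketch (not from the original answer) illustrating the quoted point that sorting a non-unique index enables the faster binary-search path:
import numpy as np
import pandas as pd
a = pd.Series(data=-np.arange(100000),
              index=np.random.randint(0, 50000, 100000))
common = a.index.intersection(pd.Index(np.arange(30000, 60000))).unique()
# Sorting the (still non-unique) index lets lookups use binary search.
a_sorted = a.sort_index()
# %timeit a.loc[common]         # unsorted, non-unique: slowest path
# %timeit a_sorted.loc[common]  # sorted, non-unique: typically faster, per the quote above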
Given the update to pandas 0.20.0 and the deprecation of .ix, I am wondering what the most efficient way is to get the same result using the remaining .loc and .iloc. I just answered this question, but the second option (not using .ix) seems inefficient and verbose.
Snippet:
print df.iloc[df.loc[df['cap'].astype(float) > 35].index, :-1]
Is this the proper way to go when using both conditional and index position filtering?
You can stay in the world of a single loc by getting at the index values you need by slicing that particular index with positions.
df.loc[
    df['cap'].astype(float) > 35,
    df.columns[:-1]
]
Generally, you would prefer to avoid chained indexing in pandas (though, strictly speaking, you're actually using two different indexing methods). You can't modify your dataframe this way (details in the docs), and the docs cite performance as another reason (indexing once vs. twice).
For the latter, it's usually insignificant (or rather, unlikely to be a bottleneck in your code), and actually seems to not be the case (at least in the following example):
df = pd.DataFrame(np.random.uniform(size=(100000,10)),columns = list('abcdefghij'))
# Get columns number 2:5 where value in 'a' is greater than 0.5
# (i.e. Boolean mask along axis 0, position slice of axis 1)
# Deprecated .ix method
%timeit df.ix[df['a'] > 0.5,2:5]
100 loops, best of 3: 2.14 ms per loop
# Boolean, then position
%timeit df.loc[df['a'] > 0.5,].iloc[:,2:5]
100 loops, best of 3: 2.14 ms per loop
# Position, then Boolean
%timeit df.iloc[:,2:5].loc[df['a'] > 0.5,]
1000 loops, best of 3: 1.75 ms per loop
# .loc
%timeit df.loc[df['a'] > 0.5, df.columns[2:5]]
100 loops, best of 3: 2.64 ms per loop
# .iloc
%timeit df.iloc[np.where(df['a'] > 0.5)[0],2:5]
100 loops, best of 3: 9.91 ms per loop
Bottom line: If you really want to avoid .ix, and you're not intending to modify values in your dataframe, just go with chained indexing. On the other hand (the 'proper' but arguably messier way), if you do need to modify values, either do .iloc with np.where() or .loc with integer slices of df.index or df.columns.
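For example, a brief sketch of the modification case (the value 0 assigned here is arbitrary, and the DataFrame mirrors the one above):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(size=(100000, 10)), columns=list('abcdefghij'))
# Assign via .loc, picking columns 2:5 by position through df.columns:
df.loc[df['a'] > 0.5, df.columns[2:5]] = 0
# Or via .iloc, converting the boolean mask to positions with np.where():
df.iloc[np.where(df['a'] > 0.5)[0], 2:5] = 0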
How about breaking this into a two-step indexing:
df[df['cap'].astype(float) > 35].iloc[:,:-1]
or even:
df[df['cap'].astype(float) > 35].drop('cap', axis=1)
Pandas removed .ix and encourages you to use .iloc and .loc.
For this, you can refer to the definitions of iloc and loc and how they differ from ix.
This might help you:
How are iloc, ix and loc different?
Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:
# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)
Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it's not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?
Here is an SSCCE demonstrating the issue:
import pandas as pd
import random
# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)
# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')
Running that command takes more than 1 second on my machine, which is thousands of times longer than expected for performing fewer than 1,000 operations.
It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:
series = series.map(lambda x: dictionary.get(x,x))
If you're sure that all keys are in your dictionary you can get a very slight performance boost by not creating a lambda, and directly supplying the dictionary.get function. Any keys that are not present will return NaN via this method, so beware:
series = series.map(dictionary.get)
You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:
series = series.map(dictionary)
Timings
Some timing comparisons using your example data:
%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop
%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop
%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop
%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
.replace can do incomplete substring matches, while .map requires complete values to be supplied in the dictionary (otherwise it returns NaN). The fast but still generic solution (one that can handle substrings) is to first use .replace on a dict of all possible unique values (obtained e.g. with .value_counts().index) and then map every row of the Series through that dict with .map. This combo can handle, for instance, replacement of special national characters (as substrings) on 1M-row columns in a quarter of a second, where .replace alone would take 15 seconds.
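A rough sketch of that replace-on-uniques-then-map combo, assuming the substring replacement is done with regex=True (the data and mapping here are invented):
import pandas as pd
# Long Series with accented substrings; substr_map maps substrings,
# which .map alone cannot handle.
s = pd.Series(['café', 'naïve', 'café', 'plain'] * 250000)
substr_map = {'é': 'e', 'ï': 'i'}
# 1. Run the expensive .replace only on the unique values (a tiny Series).
uniques = pd.Series(s.value_counts().index)
replaced = uniques.replace(substr_map, regex=True)
# 2. Build a full-value lookup and broadcast it over the big Series with .map.
lookup = dict(zip(uniques, replaced))
s_fixed = s.map(lookup)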
Thanks to #root: I ran the benchmark again and found different results on pandas v1.1.4. There, series.map(dictionary) was the fastest; note that it also returns NaN if a key is not present.
I have a data frame and would like to work on a small partition at a time, for particular tuples of values of 'a', 'b', 'c'.
df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
so I chose to use pandas multiindex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)] # 511 us
%timeit df[(df['a'].values == 2) &
(df['b'].values == 1) & (df['c'].values == 7)] # 247 us
I suspect there is some overhead somewhere. My program has ~1k tuples, so it takes 511 µs * 1000 ≈ 0.5 s for one run. How can I improve further?
update:
Hmm, I forgot to mention that the number of tuples is less than the total Cartesian product of distinct values of 'a', 'b', 'c' in df. Wouldn't groupby do an excess amount of work on indices that don't exist in my tuples?
It's not clear what 'work on' means, but I would do this; the applied function can be almost any function:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
Certain operations are cythonized and so are very fast:
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing a selection on a MultiIndex index by index is not efficient.
If you are operating on a very small subset of the total groups, then you might want to index directly into the MultiIndex; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter, which you can use to pre-filter the groups based on some criteria.
As noted above, the Cartesian product of the group indexers is irrelevant. Only the actual groups are iterated by groupby (think of a MultiIndex as a sparse representation of the total possible space).
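A small sketch of the filter suggestion (the criterion here is invented; the DataFrame is the one from the question):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
# Keep only the (a, b, c) groups whose mean 'value' exceeds 50, then work on those rows.
subset = df.groupby(['a', 'b', 'c']).filter(lambda g: g['value'].mean() > 50)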
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.