I have the following DataFrame:
id1  result
2    0.5
3    1.4
4    1.4
7    3.4
2    1.4
I want to check, for every row, whether the value in the 'id1' column is unique.
The output should be:
False
True
True
True
False
The first and the last are False because id 2 exists twice.
I tried bool = df["id1"].is_unique, but that checks whether the whole column is unique. I want to check it for each row.
df['id1'].map(~(df.groupby('id1').size() > 1))
Output
0 False
1 True
2 True
3 True
4 False
Name: id1, dtype: bool
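For reference, a groupby-transform sketch (my rewrite of the same logic) that avoids the separate map step, since transform broadcasts the group sizes back to the original rows:

import pandas as pd

df = pd.DataFrame({'id1': [2, 3, 4, 7, 2],
                   'result': [0.5, 1.4, 1.4, 3.4, 1.4]})

# True where the id1 value occurs exactly once in the column
df.groupby('id1')['id1'].transform('size').eq(1)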
Since I saw you tagged this question with pandas, I'm assuming you're using the pandas package.
You can create an Index (or Series) from the id1 values, then use the duplicated method (available on both pd.Index and pd.Series), as in the following example. Inverting the result with ~ gives the uniqueness mask you asked for. See the pandas docs for duplicated.
import pandas as pd
check_id1_duplicate = pd.Index([2, 3, 4, 7, 2])
check_id1_duplicate.duplicated(keep=False)
# Results would be array([True, False, False, False, True])
To add to @ShiriNmi's answer, the duplicated solution is more intuitive and about 8 times faster, while returning the same result.
%timeit -n 10_000 df['id1'].map(~(df.groupby('id1').size() > 1))
# 697 µs ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit ~df['id1'].duplicated(keep=False)
# 89.5 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I want to modify a single value in a DataFrame. The typical suggestion for doing this is to use df.at[] and reference the position as the index label and the column label, or to use df.iat[] and reference the position as the integer row and the integer column. But I want to reference the position as the integer row and the column label.
Assume this DataFrame:
                           apples oranges bananas
dateindex
2021-01-01 14:00:01.384624      1       X       3
2021-01-05 13:43:26.203773      4       5       6
2021-01-31 08:23:29.837238      7       8       9
2021-02-08 10:23:09.095632      0       1       2
data = [{'apples':1, 'oranges':'X', 'bananas':3},
{'apples':4, 'oranges':5, 'bananas':6},
{'apples':7, 'oranges':8, 'bananas':9},
{'apples':0, 'oranges':1, 'bananas':2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
pd.to_datetime('2021-01-05 13:43:26.203773'),
pd.to_datetime('2021-01-31 08:23:29.837238'),
pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
I want to change the value "X" to "2". I don't know the exact time; I just know that it's the first row. But I do know that I want to change the "oranges" column.
I want to do something like df.at[0,'oranges'], but I can't do that; I get a KeyError.
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
Regarding:
The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?
Yes, it is. And I agree, it is awkward. The old .ix indexer used to support these mixed indexing cases better, but its behaviour depended on the dtype of the axis, making it inconsistent; it has since been deprecated and removed. In the meantime:
The other options, which have been used in the other answers, can all issue the SettingWithCopy warning. The warning is not guaranteed to fire, but it might, depending on the indexing criteria and how the values are assigned.
Referencing Combining positional and label-based indexing and starting with this df, which has dateindex as the index:
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 X 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
Using both options:
with .loc or .at:
df.at[df.index[0], 'oranges'] = -50
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -50 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
with .iloc or .iat:
df.iat[0, df.columns.get_loc('oranges')] = -20
apples oranges bananas
dateindex
2021-01-01 14:00:01.384624 1 -20 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
FWIW, I find approach #1 more consistent since it can handle multiple row indexes without changing the functions/methods used: df.loc[df.index[[0, 2]], 'oranges'] but approach #2 needs a different column indexer when there are multiple columns: df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])].
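For illustration, a short sketch of those two multi-label cases on the example df (same data as above):

# approach 1: row labels built from positions, column referenced by label
df.loc[df.index[[0, 2]], 'oranges'] = -50

# approach 2: positional rows, column labels resolved to positions
df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])] = -20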
Solution with Series.iat
If it doesn't seem more awkward to you, you can use the iat method of pandas Series:
df["oranges"].iat[0] = 2
Time performance comparison with other methods
As this method doesn't raise any warning, it can be interesting to compare its time performance with the other proposed solutions. (Caveat: it is still chained assignment, so with pandas' copy-on-write behaviour, opt-in in 2.x and planned as the default for 3.0, it no longer updates df at all.)
%%timeit
df.at[df.index[0], 'oranges'] = 2
# > 9.91 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df.iat[0, df.columns.get_loc('oranges')] = 2
# > 13.5 µs ± 74.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df["oranges"].iat[0] = 2
# > 3.49 µs ± 16.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The pandas.Series.iat method seems to be the most performant one (I took the median of three runs).
Let's try again with huge DataFrames
With a DatetimeIndex
import numpy as np

# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
df_large.index = pd.date_range(start=0, periods=100000)
# 1970-01-01 to 2243-10-16, a bit unrealistic
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = -2
# > 10.1 µs ± 85.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = -2
# > 13.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = -2
# > 3.31 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
With a RangeIndex
# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = 2
# > 4.5 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = 2
# > 13.5 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df_large["col_55555"].iat[55555] = 2
# > 3.49 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Since this is simple indexing with O(1) complexity, the size of the DataFrame barely changes the results, except for the "at + index" approach; strangely enough, it shows its worst performance on small DataFrames. Thanks to wfaulk, the OP, for spotting that using a RangeIndex decreases the access time of the "at + index" method. With a DatetimeIndex, pd.Series.iat remains the fastest, and its timing stays constant.
You were actually quite close with your initial guess.
You would do it like this:
import pandas as pd
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
{'a': 100, 'b': 200, 'c': 300, 'd': 400},
{'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]
df = pd.DataFrame(mydict)
print(df)
# change the value of column a, row 2
df['a'][2] = 100
# print column a, row 2
print(df['a'][2])
There are lots of different variants such as loc and iloc. Be aware, though, that df['a'][2] = 100 is chained assignment: it can raise SettingWithCopyWarning, and under pandas' copy-on-write mode it no longer modifies df at all.
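If you want to avoid that caveat, a minimal warning-free sketch using .loc (this works here because the example df has the default RangeIndex, so the label 2 is also row 2):

df.loc[2, 'a'] = 100  # single label-based lookup, no chained indexing
print(df.loc[2, 'a'])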
In the OP's example we discovered that loc works, while chained indexing (df['oranges'][0]) throws an error, because 0 is not a label in the DatetimeIndex:
import pandas as pd
data = [{'apples':1, 'oranges':'X', 'bananas':3},
{'apples':4, 'oranges':5, 'bananas':6},
{'apples':7, 'oranges':8, 'bananas':9},
{'apples':0, 'oranges':1, 'bananas':2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
pd.to_datetime('2021-01-05 13:43:26.203773'),
pd.to_datetime('2021-01-31 08:23:29.837238'),
pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)
print(df)
df.loc['2021-01-01 14:00:01.384624','oranges'] = 10
# df['oranges'][0] = 10  # errors: the index has no label 0
print(df)
This works.
You can use the loc method. It receives the row label and the column label you want to change. Note that with this DataFrame the row labels are timestamps, so a plain 0 won't work; use the actual label, or look it up by position.
Changing X to 2: df.loc[df.index[0], 'oranges'] = 2
See: pandas.DataFrame.loc
Let's have a small dataframe: df = pd.DataFrame({'CID': [1,2,3,4,12345, 6]})
When I search for membership the speed is vastly different based on whether I ask to search in df.CID or in df['CID'].
In[25]:%timeit 12345 in df.CID
Out[25]:89.8 µs ± 254 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In[26]:%timeit 12345 in df['CID']
Out[26]:42.3 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In[27]:type( df.CID)
Out[27]: pandas.core.series.Series
In[28]:type( df['CID'])
Out[28]: pandas.core.series.Series
Why is that?
df['CID'] delegates to NDFrame.__getitem__ and it is more obvious you are performing an indexing operation.
On the other hand, df.CID delegates to NDFrame.__getattr__, which has to do some additional heavy lifting, mainly to determine whether 'CID' is an attribute, a function, or a column you're calling using the attribute access (a convenience, but not recommended for production code).
Now, why is it not recommended? Consider,
df = pd.DataFrame({'A': [1, 2, 3]})
df.A
0 1
1 2
2 3
Name: A, dtype: int64
There are no issues referring to column "A" as df.A, because it does not conflict with any attribute or function names in pandas. However, consider the pop function (just as an example).
df.pop
# <bound method NDFrame.pop of ...>
df.pop is a bound method of df. Now, I'd like to create a column called "pop" for various reasons.
df['pop'] = [4, 5, 6]
df
A pop
0 1 4
1 2 5
2 3 6
Great, but,
df.pop
# <bound method NDFrame.pop of ...>
I cannot use the attribute notation to access this column. However...
df['pop']
0 4
1 5
2 6
Name: pop, dtype: int64
Bracket notation still works. That's why this is better.
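Another case where brackets are the only option (my own example, same idea): column labels that are not valid Python identifiers, such as ones containing spaces.

df['my col'] = [7, 8, 9]
# df.my col  -> SyntaxError; attribute access cannot express this label
df['my col']  # bracket notation works fine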
I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".
I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:
If apply is so bad, then why is it in the API?
How and when should I make my code apply-free?
Are there ever any situations where apply is good (better than other possible solutions)?
apply, the Convenience Function you Never Needed
We start by addressing the questions in the OP, one by one.
"If apply is so bad, then why is it in the API?"
DataFrame.apply and Series.apply are convenience functions defined on the DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.
Some of the things apply can do:
Run any user-defined function on a DataFrame or Series
Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
Perform index alignment while applying the function
Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
Perform element-wise transformations
Broadcast aggregated results to original rows (see the result_type argument).
Accept positional/keyword arguments to pass to the user-defined functions.
...Among others. For more information, see Row or Column-wise Function Application in the documentation.
So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory bounded applications.
There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.
Let's address the next question.
"How and when should I make my code apply-free?"
To rephrase, here are some common situations where you will want to get rid of any calls to apply.
Numeric Data
If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).
Contrast the performance of apply for a simple addition operation.
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
df.apply(np.sum)
A 16
B 28
dtype: int64
df.sum()
A 16
B 28
dtype: int64
Performance-wise, there's no comparison; the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.
%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even if you enable passing raw arrays with the raw argument, it's still twice as slow.
%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another example:
df.apply(lambda x: x.max() - x.min())
A 8
B 8
dtype: int64
df.max() - df.min()
A 8
B 8
dtype: int64
%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()
2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, seek out vectorized alternatives if possible.
String/Regex
Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.
A common problem is to check whether a value in a column is present in another column of the same row.
df = pd.DataFrame({
'Name': ['mickey', 'donald', 'minnie'],
'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
'Value': [20, 10, 86]})
df
Name Value Title
0 mickey 20 wonderland
1 donald 10 welcome to donald's castle
2 minnie 86 Minnie mouse clubhouse
This should return the second and third rows, since "donald" and "minnie" are present in their respective "Title" values.
Using apply, this would be done using
df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)
0 False
1 True
2 True
dtype: bool
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
However, a better solution exists using list comprehensions.
df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
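For example, a minimal sketch of such a guard (the helper name is mine):

def safe_contains(title, name):
    # fall back to False for NaN or other non-string values
    try:
        return name.lower() in title.lower()
    except AttributeError:
        return False

df[[safe_contains(x, y) for x, y in zip(df['Title'], df['Name'])]]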
For more information on when list comprehensions should be considered a good option, see my writeup: Are for-loops in pandas really bad? When should I care?.
Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']) over, say, df['date'].apply(pd.to_datetime). Read more in the docs.
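To make this concrete, a quick sketch (my toy data; timings indicative only):

s = pd.Series(['2019-01-01'] * 10000)

%timeit pd.to_datetime(s)        # one vectorized parse over the whole Series
%timeit s.apply(pd.to_datetime)  # one parser call per element; much slower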
A Common Pitfall: Exploding Columns of Lists
s = pd.Series([[1, 2]] * 3)
s
0 [1, 2]
1 [1, 2]
2 [1, 2]
dtype: object
People are tempted to use apply(pd.Series). This is horrible in terms of performance.
s.apply(pd.Series)
0 1
0 1 2
1 1 2
2 1 2
A better option is to listify the column and pass it to pd.DataFrame.
pd.DataFrame(s.tolist())
0 1
0 1 2
1 1 2
2 1 2
%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())
2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Lastly,
"Are there any situations where apply is good?"
Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.
Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.
df = pd.DataFrame(
pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2),
columns=['date1', 'date2'])
df
date1 date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30
df.dtypes
date1 object
date2 object
dtype: object
This is an admissible case for apply:
df.apply(pd.to_datetime, errors='coerce').dtypes
date1 datetime64[ns]
date2 datetime64[ns]
dtype: object
Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.
%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')
5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can make a similar case for other operations such as string operations, or conversion to category.
u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))
versus
u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')
And so on...
Converting Series to str: astype versus apply
This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.
The graph was plotted using the perfplot library.
import numpy as np
import pandas as pd
import perfplot
perfplot.show(
setup=lambda n: pd.Series(np.random.randint(0, n, n)),
kernels=[
lambda s: s.astype(str),
lambda s: s.apply(str)
],
labels=['astype', 'apply'],
n_range=[2**k for k in range(1, 20)],
xlabel='N',
logx=True,
logy=True,
equality_check=lambda x, y: (x == y).all())
With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is of integer type.
GroupBy operations with chained transformations
GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.
One common requirement is to perform a GroupBy and then chain two operations, such as a "lagged cumsum":
df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df
A B
0 a 12
1 a 7
2 b 5
3 c 4
4 c 5
5 c 4
6 d 3
7 d 2
8 e 1
9 e 10
You'd need two successive groupby calls here:
df.groupby('A').B.cumsum().groupby(df.A).shift()
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
Using apply, you can shorten this to a single call.
df.groupby('A').B.apply(lambda x: x.cumsum().shift())
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
Other Caveats
Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.
df = pd.DataFrame({
'A': [1, 2],
'B': ['x', 'y']
})
def func(x):
print(x['A'])
return x
df.apply(func, axis=1)
# 1
# 1
# 2
A B
0 1 x
1 2 y
This behaviour is also seen in GroupBy.apply on pandas versions below 0.25 (it was fixed in 0.25; see the release notes for more information).
Not all applys are alike
The chart below suggests when to consider apply¹. Green means possibly efficient; red means avoid.
Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).
If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.
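For example, a minimal sketch of the raw=True point (my example; the numba route is not shown):

df = pd.DataFrame(np.random.random((10**5, 3)))

%timeit df.apply(np.sum, axis=1)            # each row is passed as a Series
%timeit df.apply(np.sum, axis=1, raw=True)  # each row is passed as an ndarray, with less overhead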
GroupBy.apply: generally favoured
Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.
pd.DataFrame.apply column-wise: a mixed bag
pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it's almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:
# Python 3.7, Pandas 0.23.4
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3))) # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns
# Scenario_1 | Scenario_2
%timeit df.sum() # 800 ms | 109 ms
%timeit df.apply(pd.Series.sum) # 568 ms | 325 ms
%timeit df.max() - df.min() # 1.63 s | 314 ms
%timeit df.apply(lambda x: x.max() - x.min()) # 838 ms | 473 ms
%timeit df.mean() # 108 ms | 94.4 ms
%timeit df.apply(pd.Series.mean) # 276 ms | 233 ms
¹ There are exceptions, but these are usually marginal or uncommon. A couple of examples:
df['col'].apply(str) may slightly outperform df['col'].astype(str).
df.apply(pd.to_datetime), working on strings, doesn't scale well with the number of rows compared to a regular for loop.
For axis=1 (i.e., row-wise functions), you can just use the following function in lieu of apply. I wonder why this isn't the pandas behaviour. (Untested with compound indexes, but it does appear to be much faster than apply.)
def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)
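A quick usage sketch on toy data (my example):

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
faster_df_apply(df, lambda row: row['A'] + row['B'])
# 0    11
# 1    22
# 2    33
# dtype: int64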
Are there ever any situations where apply is good?
Yes, sometimes.
Task: transliterate Unicode strings to ASCII.
import numpy as np
import pandas as pd
import unidecode
s = pd.Series(['mañana','Ceñía'])
s.head()
0 mañana
1 Ceñía
s.apply(unidecode.unidecode)
0 manana
1 Cenia
Update
I was by no means advocating for the use of apply; I just thought that since NumPy cannot deal with the above situation, it could be a good candidate for pandas apply. But I was forgetting the plain ol' list comprehension, thanks to the reminder by @jpp.
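For completeness, the list comprehension alternative presumably looks something like this (my sketch):

pd.Series([unidecode.unidecode(x) for x in s])
# 0    manana
# 1     Cenia
# dtype: object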
I want to get a list of the column headers from a Pandas DataFrame. The DataFrame will come from user input, so I won't know how many columns there will be or what they will be called.
For example, if I'm given a DataFrame like this:
>>> my_dataframe
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
5 4 8 3
6 8 2 8
7 9 9 10
8 6 6 4
9 10 10 7
I would get a list like this:
>>> header_list
['y', 'gdp', 'cap']
You can get the values as a list by doing:
list(my_dataframe.columns.values)
Also you can simply use (as shown in Ed Chum's answer):
list(my_dataframe)
There is a built-in method which is the most performant:
my_dataframe.columns.values.tolist()
.columns returns an Index, .columns.values returns an array and this has a helper function .tolist to return a list.
If performance is not as important to you, Index objects define a .tolist() method that you can call directly:
my_dataframe.columns.tolist()
The difference in performance is obvious:
%timeit df.columns.tolist()
16.7 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.columns.values.tolist()
1.24 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For those who hate typing, you can just call list on df, as so:
list(df)
I did some quick tests, and perhaps unsurprisingly the built-in version using dataframe.columns.values.tolist() is the fastest:
In [1]: %timeit [column for column in df]
1000 loops, best of 3: 81.6 µs per loop
In [2]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 16.1 µs per loop
In [3]: %timeit list(df)
10000 loops, best of 3: 44.9 µs per loop
In [4]: %timeit list(df.columns.values)
10000 loops, best of 3: 38.4 µs per loop
(I still really like the list(dataframe) though, so thanks EdChum!)
It gets even simpler (as of Pandas 0.16.0):
df.columns.tolist()
will give you the column names in a nice list.
Extended Iterable Unpacking (Python 3.5+): [*df] and Friends
Unpacking generalizations (PEP 448) have been introduced with Python 3.5. So, the following operations are all possible.
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(5))
df
A B C
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
If you want a list....
[*df]
# ['A', 'B', 'C']
Or, if you want a set,
{*df}
# {'A', 'B', 'C'}
Or, if you want a tuple,
*df, # Please note the trailing comma
# ('A', 'B', 'C')
Or, if you want to store the result somewhere,
*cols, = df # A wild comma appears, again
cols
# ['A', 'B', 'C']
... if you're the kind of person who converts coffee to typing sounds, well, this is going to consume your coffee more efficiently ;)
P.S.: if performance is important, you will want to ditch the solutions above in favour of
df.columns.to_numpy().tolist()
# ['A', 'B', 'C']
This is similar to Ed Chum's answer, but updated for v0.24, where .to_numpy() is preferred to the use of .values. See this answer (by me) for more information.
Visual Check
Since I've seen this discussed in other answers, you can use iterable unpacking (no need for explicit loops).
print(*df)
A B C
print(*df, sep='\n')
A
B
C
Critique of Other Methods
Don't use an explicit for loop for an operation that can be done in a single line (list comprehensions are okay).
Next, using sorted(df) does not preserve the original order of the columns. For that, you should use list(df) instead.
Next, list(df.columns) and list(df.columns.values) are poor suggestions (as of the current version, v0.24). Both Index (returned from df.columns) and NumPy arrays (returned by df.columns.values) define a .tolist() method, which is faster and more idiomatic.
Lastly, listification i.e., list(df) should only be used as a concise alternative to the aforementioned methods for Python 3.4 or earlier where extended unpacking is not available.
>>> list(my_dataframe)
['y', 'gdp', 'cap']
To list the columns of a dataframe while in debugger mode, use a list comprehension:
>>> [c for c in my_dataframe]
['y', 'gdp', 'cap']
By the way, you can get a sorted list simply by using sorted:
>>> sorted(my_dataframe)
['cap', 'gdp', 'y']
That's available as my_dataframe.columns.
It's interesting: df.columns.values.tolist() is almost three times faster than df.columns.tolist(), even though I thought they were the same:
In [97]: %timeit df.columns.values.tolist()
100000 loops, best of 3: 2.97 µs per loop
In [98]: %timeit df.columns.tolist()
10000 loops, best of 3: 9.67 µs per loop
A DataFrame follows the dict-like convention of iterating over the “keys” of the objects.
my_dataframe.keys()
Create a list of keys/columns, either with the object's to_list() method or the Pythonic way:
my_dataframe.keys().to_list()
list(my_dataframe.keys())
Basic iteration on a DataFrame returns column labels:
[column for column in my_dataframe]
Do not convert a DataFrame into a list just to get the column labels, and do not stop thinking while looking for convenient code samples.
xlarge = pd.DataFrame(np.arange(100000000).reshape(10000,10000))
list(xlarge) # Compute time and memory consumption depend on dataframe size - O(N)
list(xlarge.keys()) # Constant time operation - O(1)
In the Notebook
For data exploration in the IPython notebook, my preferred way is this:
sorted(df)
Which will produce an easy to read alphabetically ordered list.
In a code repository
In code I find it more explicit to do
df.columns
Because it tells others reading your code what you are doing.
%%timeit
final_df.columns.values.tolist()
948 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
list(final_df.columns)
14.2 µs ± 79.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.columns.values)
1.88 µs ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
final_df.columns.tolist()
12.3 µs ± 27.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.head(1).columns)
163 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The simplest option would be:
list(my_dataframe.columns) or my_dataframe.columns.tolist()
No need for the complex stuff above :)
It's very simple:
list(df.columns)
For a quick, neat, visual check, try this:
for col in df.columns:
    print(col)
As answered by Simeon Visser, you could do
list(my_dataframe.columns.values)
or
list(my_dataframe) # For less typing.
But I think the sweet spot is:
list(my_dataframe.columns)
It is explicit and at the same time not unnecessarily long.
I feel the question deserves an additional explanation.
As fixxxer noted, the answer depends on the Pandas version you are using in your project, which you can check with pd.__version__.
If, like me, you are for some reason using a version of Pandas older than 0.16.0 (on Debian 8 (Jessie) I use 0.14.1), then you need to use:
df.keys().tolist(), because there isn't any df.columns attribute implemented there yet.
The advantage of this keys method is that it works even in newer versions of Pandas, so it's more universal.
import pandas as pd
# create test dataframe
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(2))
list(df.columns)
Returns
['A', 'B', 'C']
n = []
for i in my_dataframe.columns:
    n.append(i)
print(n)
This is the easiest way to reach your goal.
my_dataframe.columns.values.tolist()
and if you are lazy, try this:
list(my_dataframe)
If the DataFrame happens to have an Index or MultiIndex and you want those included as column names too:
names = list(filter(None, df.index.names + df.columns.values.tolist()))
It avoids calling reset_index() which has an unnecessary performance hit for such a simple operation.
I've run into needing this more often because I'm shuttling data from databases where the dataframe index maps to a primary/unique key, but is really just another "column" to me. It would probably make sense for pandas to have a built-in method for something like this (totally possible I've missed it).
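For example, a small sketch with a named index (my own toy data):

import pandas as pd

df = pd.DataFrame({'key': [1, 2], 'val': ['a', 'b']}).set_index('key')
names = list(filter(None, df.index.names + df.columns.values.tolist()))
print(names)  # ['key', 'val']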
It's simple code for you:
for i in my_dataframe:
    print(i)
Just do it.
Even though the solution provided previously is nice, I would also expect something like frame.column_names() to be a function in Pandas. Since it is not, it is nice to use the following syntax, which preserves the feeling that you are using pandas properly by calling its tolist function:
frame.columns.tolist()
listHeaders = [colName for colName in my_dataframe]