Pandas iterate over values of single column in data frame - python

I am a beginner to python and pandas.
I have a 5000-row data frame that looks something like this:
INDEX  COL1  COL2  COL3
0      10.0  12.0   15.0
1      14.0  16.0  153.8
2      18.0  20.0   16.3
3      22.0  24.0  101.7
I wish to iterate over the values in COL3 and carry out calculations, such that:
For each row in the data frame, if the value in COL3 is <= 100.0, multiply that value by 10 and assign to variable "New_Value";
Else, multiply the value by 5 and assign to variable "New_Value"
I understand that an if statement cannot be applied directly to a DataFrame column, as it leads to an ambiguous truth-value error. However, I am stuck trying to find the right tool for this task, and would appreciate some guidance.
Cheers

Using np.where:
df['New_Value'] = np.where(df['COL3']<=100,df['COL3']*10,df['COL3']*5)
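For completeness, a minimal runnable sketch of this answer on the question's sample data (the imports and the DataFrame construction are added here and are not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': [10.0, 14.0, 18.0, 22.0],
                   'COL2': [12.0, 16.0, 20.0, 24.0],
                   'COL3': [15.0, 153.8, 16.3, 101.7]})

# Vectorised branch: rows with COL3 <= 100 are scaled by 10, the rest by 5
df['New_Value'] = np.where(df['COL3'] <= 100, df['COL3'] * 10, df['COL3'] * 5)
print(df)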

One-liner:
df['New_Value'] = df.COL3.apply(lambda x: x*10 if x<=100 else 5*x)
For this example you can use apply, which will apply a function to each value of the column.
A lambda is a quick function that you can define inline. It behaves a little differently from a normal function.
The condition x*10 if x<=100 means: for each x less than or equal to 100, multiply it by 10; otherwise, multiply it by 5.

Try this:
df['New_Value']=df.COL3.apply(lambda x: 10*x if x<=100 else 5*x)

Related

Why is the pandas.Series.apply function with np.sum not applied to the entire Series?

I have the following dataframe:
>>> df
a b
0 aaa 22.0
1 bb 33.0
2 4 44.0
3 6 11.0
I want to sum the column b. I know that I can do np.sum(df['b']). But I want to understand, syntax-wise, why I cannot use the following two to get the sum:
>>> df['b'].apply(np.sum, axis=0)
0 22.0
1 33.0
2 44.0
3 11.0
Name: b, dtype: float64
>>> df['b'].apply(np.sum)
0 22.0
1 33.0
2 44.0
3 11.0
Name: b, dtype: float64
Why is the apply function with np.sum not applied to the whole series?
In https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html it says:
Invoke function on values of Series. Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.
np.sum is certainly a "NumPy function". I think I may have misunderstood "Can be ufunc (a NumPy function that applies to the entire Series)": does this mean that if it is a NumPy function, it is applied at each cell value of the entire Series, without aggregation?
You're applying np.sum to each element of the series df['b']. That's why you're not getting a scalar.
The apply() method takes a function as a parameter and applies it to the DataFrame column by column (each column's values are the input). If the function is an aggregating one (e.g. np.sum, like yours), i.e. it takes a list of inputs and returns a single value, then the result is a Series with one element per column.
With that being said, the equivalent of np.sum(df['b']) will be df.apply(np.sum)['b'] which gives 110.0 as well.
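A quick sketch contrasting the two call sites (the frame is reduced to the numeric column here so that np.sum is well-defined for every column):

import numpy as np
import pandas as pd

df = pd.DataFrame({'b': [22.0, 33.0, 44.0, 11.0]})

# Series.apply: np.sum is invoked once per scalar, so every value comes back unchanged
print(df['b'].apply(np.sum).tolist())  # [22.0, 33.0, 44.0, 11.0]

# DataFrame.apply: np.sum receives each whole column, so it actually aggregates
print(df.apply(np.sum)['b'])           # 110.0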
Simple:
From the documentation, np.sum computes the "sum of array elements over a given axis".
On the other hand, Series.apply takes a function and applies it to every single value of the Series.
Thus, when they are combined the way you have done it, the summation happens along the axis of a single value, which changes nothing.
So you are better off just using np.sum(df['b']), which by default sums along axis=0, because that is what a Series is.
The problem is that np.sum is not a universal function (ufunc):
> isinstance(np.sum, np.ufunc)
False
Series.apply handles ufunc with SeriesApply.apply_standard internally
class SeriesApply(NDFrameApply):
    obj: Series
    axis = 0

    def apply_standard(self) -> DataFrame | Series:
        # caller is responsible for ensuring that f is Callable
        f = cast(Callable, self.f)
        obj = self.obj

        with np.errstate(all="ignore"):
            if isinstance(f, np.ufunc):  # <------
                return f(obj)
If f is a NumPy ufunc, apply simply passes obj, the Series itself, to f.
So what you are using is effectively "a Python function that only works on single values".
If you want to sum on the Series, you can use Series.sum() or np.sum(Series).
> df['b'].sum()
110.0
> np.sum(df['b'])
110.0

Difference between Pandas' apply() and Python's map() when creating new column? [duplicate]

Can you tell me when to use these vectorization methods with basic examples?
I see that map is a Series method whereas the rest are DataFrame methods. I got confused about apply and applymap methods though. Why do we have two methods for applying a function to a DataFrame? Again, simple examples which illustrate the usage would be great!
apply works on a row / column basis of a DataFrame
applymap works element-wise on a DataFrame
map works element-wise on a Series
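A compact sketch of all three on a toy frame (the frame and the dict are invented for illustration; note that in pandas 2.1+ DataFrame.applymap was renamed to DataFrame.map, but this thread uses the older name throughout):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

print(df.apply(sum))                      # column-wise reduction: A -> 6, B -> 60
print(df.applymap(lambda x: x * 2))       # element-wise over the whole frame
print(df['A'].map({1: 'one', 2: 'two'}))  # element-wise on a Series; 3 has no key -> NaN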
Straight from Wes McKinney's Python for Data Analysis book, pg. 132 (I highly recommend this book):
Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:
In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [117]: frame
Out[117]:
b d e
Utah -0.029638 1.081563 1.280300
Ohio 0.647747 0.831136 -1.549481
Texas 0.513416 -0.884417 0.195343
Oregon -0.485454 -0.477388 -0.309548
In [118]: f = lambda x: x.max() - x.min()
In [119]: frame.apply(f)
Out[119]:
b 1.133201
d 1.965980
e 2.829781
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:
In [120]: format = lambda x: '%.2f' % x
In [121]: frame.applymap(format)
Out[121]:
b d e
Utah -0.03 1.08 1.28
Ohio 0.65 0.83 -1.55
Texas 0.51 -0.88 0.20
Oregon -0.49 -0.48 -0.31
The reason for the name applymap is that Series has a map method for applying an element-wise function:
In [122]: frame['e'].map(format)
Out[122]:
Utah 1.28
Ohio -1.55
Texas 0.20
Oregon -0.31
Name: e, dtype: object
Comparing map, applymap and apply: Context Matters
First major difference: DEFINITION
map is defined on Series ONLY
applymap is defined on DataFrames ONLY
apply is defined on BOTH
Second major difference: INPUT ARGUMENT
map accepts dicts, Series, or callable
applymap and apply accept callables only
Third major difference: BEHAVIOR
map is elementwise for Series
applymap is elementwise for DataFrames
apply also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depend on the function.
Fourth major difference (the most important one): USE CASE
map is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'}))
applymap is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))
apply is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize)).
Also see When should I (not) want to use pandas apply() in my code? for a writeup I made a while back on the most appropriate scenarios for using apply (note that there aren't many, but there are a few— apply is generally slow).
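To make those use cases concrete, a small sketch (the column names and the mapping dict are invented for illustration):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['  x ', ' y', 'z  '],
                   'C': [' u', 'v ', ' w ']})

print(df['A'].map({1: 'a', 2: 'b', 3: 'c'}))    # value-to-value translation via a dict
print(df[['B', 'C']].applymap(str.strip))       # the same element-wise transform over several columns
print(df['A'].apply(lambda n: list(range(n))))  # a function with no vectorised equivalent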
Footnotes
map when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
applymap in more recent versions has been optimised for some operations. You will find applymap slightly faster than apply in some cases. My suggestion is to test them both and use whatever works better.
map is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.
Series.apply returns a scalar for aggregating operations, Series otherwise. Similarly for DataFrame.apply. Note that apply also has fastpaths when called with certain NumPy functions such as mean, sum, etc.
Quick Summary
DataFrame.apply operates on entire rows or columns at a time.
DataFrame.applymap, Series.apply, and Series.map operate on one element at a time.
Series.apply and Series.map are similar and often interchangeable. Some of their slight differences are discussed in osa's answer below.
Adding to the other answers, in a Series there are also map and apply.
Apply can make a DataFrame out of a series; however, map will just put a series in every cell of another series, which is probably not what you want.
In [40]: p=pd.Series([1,2,3])
In [41]: p
Out[41]:
0 1
1 2
2 3
dtype: int64
In [42]: p.apply(lambda x: pd.Series([x, x]))
Out[42]:
0 1
0 1 1
1 2 2
2 3 3
In [43]: p.map(lambda x: pd.Series([x, x]))
Out[43]:
0 0 1
1 1
dtype: int64
1 0 2
1 2
dtype: int64
2 0 3
1 3
dtype: int64
dtype: object
Also if I had a function with side effects, such as "connect to a web server", I'd probably use apply just for the sake of clarity.
series.apply(download_file_for_every_element)
Map can use not only a function, but also a dictionary or another series. Let's say you want to manipulate permutations.
Take
1 2 3 4 5
2 1 4 5 3
The square of this permutation is
1 2 3 4 5
1 2 5 3 4
You can compute it using map. Not sure if self-application is documented, but it works in 0.15.1.
In [39]: p=pd.Series([1,0,3,4,2])
In [40]: p.map(p)
Out[40]:
0 0
1 1
2 4
3 2
4 3
dtype: int64
@jeremiahbuddha mentioned that apply works on rows/columns, while applymap works element-wise. But it seems you can still use apply for element-wise computation:
frame.apply(np.sqrt)
Out[102]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
frame.applymap(np.sqrt)
Out[103]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
Probably the simplest explanation of the difference between apply and applymap:
apply takes a whole column as its parameter and then assigns the result to that column.
applymap takes a single cell value as its parameter and assigns the result back to that cell.
NB: if apply returns a single value, that value replaces the whole column after assignment, and you end up with just a row instead of a matrix.
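A tiny sketch of that collapse, on invented column names:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# apply: each column is reduced to one scalar, so the frame collapses to a single Series
print(df.apply(lambda col: col.sum()))  # x -> 6, y -> 15

# applymap: each cell is transformed individually, so the shape is preserved
print(df.applymap(lambda v: v + 1))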
Just wanted to point out, as I struggled with this for a bit
def f(x):
    if x < 0:
        x = 0
    elif x > 100000:
        x = 100000
    return x

df.applymap(f)
df.describe()
df.describe()
this does not modify the DataFrame in place; the result has to be reassigned:
df = df.applymap(f)
df.describe()
Based on the answer of cs95
map is defined on Series ONLY
applymap is defined on DataFrames ONLY
apply is defined on BOTH
here are some examples:
In [3]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [4]: frame
Out[4]:
b d e
Utah 0.129885 -0.475957 -0.207679
Ohio -2.978331 -1.015918 0.784675
Texas -0.256689 -0.226366 2.262588
Oregon 2.605526 1.139105 -0.927518
In [5]: myformat=lambda x: f'{x:.2f}'
In [6]: frame.d.map(myformat)
Out[6]:
Utah -0.48
Ohio -1.02
Texas -0.23
Oregon 1.14
Name: d, dtype: object
In [7]: frame.d.apply(myformat)
Out[7]:
Utah -0.48
Ohio -1.02
Texas -0.23
Oregon 1.14
Name: d, dtype: object
In [8]: frame.applymap(myformat)
Out[8]:
b d e
Utah 0.13 -0.48 -0.21
Ohio -2.98 -1.02 0.78
Texas -0.26 -0.23 2.26
Oregon 2.61 1.14 -0.93
In [9]: frame.apply(lambda x: x.apply(myformat))
Out[9]:
b d e
Utah 0.13 -0.48 -0.21
Ohio -2.98 -1.02 0.78
Texas -0.26 -0.23 2.26
Oregon 2.61 1.14 -0.93
In [10]: myfunc=lambda x: x**2
In [11]: frame.applymap(myfunc)
Out[11]:
b d e
Utah 0.016870 0.226535 0.043131
Ohio 8.870453 1.032089 0.615714
Texas 0.065889 0.051242 5.119305
Oregon 6.788766 1.297560 0.860289
In [12]: frame.apply(myfunc)
Out[12]:
b d e
Utah 0.016870 0.226535 0.043131
Ohio 8.870453 1.032089 0.615714
Texas 0.065889 0.051242 5.119305
Oregon 6.788766 1.297560 0.860289
Just for additional context and intuition, here's an explicit and concrete example of the differences.
Assume you have the function seen below. (This label function will arbitrarily split the values into 'High' and 'Low', based on the threshold you provide as the parameter x.)

def label(element, x):
    if element > x:
        return 'High'
    else:
        return 'Low'

In this example, let's assume our dataframe has one column with random numbers.
If you try mapping the label function with map:
df['ColumnName'].map(label, x=0.8)
You will get the following error:
TypeError: map() got an unexpected keyword argument 'x'
Now take the same function and use apply, and you'll see that it works:
df['ColumnName'].apply(label, x=0.8)
Series.apply() can take additional arguments element-wise, while the Series.map() method will return an error.
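If you do want map here anyway, note that it has no parameter pass-through, but a lambda (or functools.partial) can close over the threshold; a sketch reusing the df and label from the example above:

df['ColumnName'].map(lambda element: label(element, 0.8))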
Now, if you're trying to apply the same function to several columns of your dataframe simultaneously, DataFrame.applymap() is used. Since applymap() does not forward extra keyword arguments the way apply() does (at least not in older pandas versions), wrap the threshold in a lambda:
df[['ColumnName','ColumnName2','ColumnName3','ColumnName4']].applymap(lambda e: label(e, 0.8))
Lastly, you can also use the apply() method on a dataframe, but the DataFrame.apply() method has different capabilities. Instead of applying functions element-wise, the df.apply() method applies functions along an axis, either column-wise or row-wise. When we create a function to use with df.apply(), we set it up to accept a series, most commonly a column.
Here is an example:
df.apply(pd.value_counts)
When we applied the pd.value_counts function to the dataframe, it calculated the value counts for all the columns.
Notice (and this is very important) that we used the df.apply() method to transform multiple columns. This works only because the pd.value_counts function operates on a Series. If we tried to use the df.apply() method to apply an element-wise function to multiple columns, we'd get an error.
For example:
def label(element):
    if element > 1:
        return 'High'
    else:
        return 'Low'

df[['ColumnName','ColumnName2','ColumnName3','ColumnName4']].apply(label)
This results in the following error:
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index Economy')
In general, we should only use the apply() method when a vectorized function does not exist. Recall that pandas uses vectorization, the process of applying operations to whole series at once, to optimize performance. When we use the apply() method, we're actually looping through rows, so a vectorized method can perform an equivalent task faster than the apply() method.
Here are some examples of vectorized functions that already exist that you do NOT want to recreate using any type of apply/map methods:
Series.str.split() Splits each element in the Series
Series.str.strip() Strips whitespace from each string in the Series.
Series.str.lower() Converts strings in the Series to lowercase.
Series.str.upper() Converts strings in the Series to uppercase.
Series.str.get() Retrieves the ith element of each element in the Series.
Series.str.replace() Replaces a regex or string in the Series with another string
Series.str.cat() Concatenates strings in a Series.
Series.str.extract() Extracts substrings from the Series matching a regex pattern.
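As a quick illustration of why the built-in path is preferable, both lines below produce the same result, but the first goes through pandas' built-in string machinery while the second loops explicitly in Python (the data is invented):

import pandas as pd

s = pd.Series(['  Apple ', ' Banana', 'CHERRY  '])

clean_fast = s.str.strip().str.lower()             # built-in vectorised string methods
clean_slow = s.apply(lambda t: t.strip().lower())  # element-wise Python loop

assert clean_fast.equals(clean_slow)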
My understanding:
From the function's point of view:
If the function needs to compare values within a column/row, use apply, e.g. lambda x: x.max() - x.mean().
If the function is to be applied to each element:
1. if a single column/row is selected, use apply;
2. if it is applied to the entire dataframe, use applymap.
majority = lambda x: x > 17
df2['legal_drinker'] = df2['age'].apply(majority)

def times10(x):
    if type(x) is int:
        x *= 10
    return x

df2.applymap(times10)
FOMO:
The following example shows apply and applymap applied to a DataFrame.
The map function is something you apply on a Series only; you cannot call map on a DataFrame.
The thing to remember is that apply can do anything applymap can, but apply has eXtra options.
The X factor options are: axis and result_type, where result_type only works when axis=1 (i.e. row-wise, across the columns).
import numpy as np
import pandas as pd

df = pd.DataFrame(1, columns=list('abc'), index=list('1234'))
print(df)

f = lambda x: np.log(x)
print(df.applymap(f))       # applied element-wise over the whole dataframe
print(np.log(df))           # vectorised over the whole dataframe

print(df.applymap(np.sum))  # np.sum of a single scalar is the scalar itself -- no reduction

# apply can take different options (vs. applymap, which cannot)
print(df.apply(f))                        # same result as applymap
print(df.apply(sum, axis=1))              # reducing example: one value per row
print(df.apply(np.log, axis=1))           # element-wise, cannot reduce
print(df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand'))  # expand list results into columns
As a side note, the Series map function should not be confused with Python's built-in map function.
The first is applied on a Series to map its values, and the second to every item of an iterable.
Lastly don't confuse the dataframe apply method with groupby apply method.
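A minimal sketch of that distinction, on invented data:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 30]})

# DataFrame.apply: the function receives each whole column of the frame
print(df[['val']].apply(lambda col: col.max() - col.min()))         # val -> 29

# GroupBy.apply: the function receives one chunk per group
print(df.groupby('key')['val'].apply(lambda g: g.max() - g.min()))  # a -> 1, b -> 0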

Iterate over first N rows in pandas

What is the suggested way to iterate over the rows in pandas like you would in a file? For example:
LIMIT = 100
for row_num, row in enumerate(open('file', 'r')):
    print(row)
    if row_num == LIMIT:
        break
I was thinking to do something like:
for n in range(LIMIT):
    print(df.loc[n].tolist())
Is there a built-in way to do this though in pandas?
Hasn't anyone suggested the simple solution?
for row in df.head(5).itertuples():
    # do something
I know others have suggested iterrows but no-one has yet suggested using iloc combined with iterrows. This will allow you to select whichever rows you want by row number:
for i, row in df.iloc[:101].iterrows():
    print(row)
Though as others have noted if speed is essential an apply function or a vectorized function would probably be better.
>>> df
a b
0 1.0 5.0
1 2.0 4.0
2 3.0 3.0
3 4.0 2.0
4 5.0 1.0
5 6.0 NaN
>>> for i, row in df.iloc[:3].iterrows():
... print(row)
...
a 1.0
b 5.0
Name: 0, dtype: float64
a 2.0
b 4.0
Name: 1, dtype: float64
a 3.0
b 3.0
Name: 2, dtype: float64
>>>
You have values, itertuples and iterrows, of which itertuples performs best, as benchmarked by fast-pandas.
You can use itertools.islice to take the first n items from iterrows:
import itertools

limit = 5
for index, row in itertools.islice(df.iterrows(), limit):
    ...
Since you said that you want to use something like an if, I would do the following:
limit = 2
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})
df[:limit].loc[df["col3"] == 7]
This selects the first two rows of the data frame, then returns those of the first two rows whose col3 value equals 7. The point is that you want to use iterrows only in very specific situations; otherwise, the solution can be vectorised.
I don't know what exactly you are trying to achieve, so I just threw together a random example.
If you must iterate over the dataframe, you should use the iterrows() method:
for index, row in df.iterrows():
    ...

Slice column in pandas DataFrame and average the results

If I have a pandas DataFrame such as:
timestamp  label  value  new
etc.       a      1      3.5
           b      2      5
           a      5      ...
           b      6      ...
           a      2      ...
           b      4      ...
I want the new column to be the average of the last two a's and the last two b's, so for the first row it would be the average of 5 and 2, giving 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm kind of new to python and coding, so this might not be possible.
Edit: I should also mention this is not for a class or anything; it is something I'm doing on my own, and it will run on a very large dataset; I'm just using this as an example. Also, I would want each a and each b to have its own value for the last-two average, so the new column will have the same dimension as the others. So for the third line it would be the average of 2 and whatever the next a is in the data set.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
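If, as the edit asks, every row should carry its own group's last-two average (so the new column has the same dimension as the others), groupby(...).transform broadcasts the per-group scalar back onto each row; a sketch on the question's data:

import pandas as pd

df = pd.DataFrame({'label': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [1, 2, 5, 6, 2, 4]})

# transform repeats each group's result for every row of that group
df['new'] = df.groupby('label')['value'].transform(lambda g: g.tail(2).mean())
print(df)  # 'a' rows get (5 + 2) / 2 = 3.5, 'b' rows get (6 + 4) / 2 = 5.0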
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd

data = {'label': ['a', 'b', 'a', 'b', 'a', 'b'], 'value': [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)

grouped = df.groupby('label')
results = {'label': [], 'tail_mean': []}
for item, grp in grouped:
    subset_mean = grp['value'].tail(2).mean()
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)

res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of just the results, if you need them, plus a column with those results merged back into the main dataframe. Someone else posted a more succinct way to get the results dataframe; there is probably no reason to do it the longer way shown here unless you also need to perform more operations like this inside the same loop.

Pandas groupby function

Suppose I have the data set below in a dataframe, df:
import pandas as pd
df = pd.DataFrame({'ID' : ['A','A','A','B','B','B'], 'Date' : ['1-Jan','2-Jan','3-Jan','1-Jan','2-Jan','3-Jan'],'VAL' : [45,23,54,65,76,23]})
I am trying to insert a column, say 'new_col', that calculates the percent change in VAL that is grouped by ID. So, for example, I would want the percent change from 45 to 23, 23 to 54, and then restart for ID 'B'. The below code works but it calculates the percent change regardless of ID.
df['new_col'] = (df['VAL'] - df['VAL'].shift(1)) / df['VAL'].shift(1)
I tried adding the groupby function in front of it, but I am still getting an error:
df['new_col'] = df.groupby('ID')[(df['VAL'] - df['VAL'].shift(1)) / df['VAL'].shift(1)]
You can't just stick your expression in brackets onto the groupby like that. What you need to do is use apply to apply a function that calculates what you want. What you want can be calculated more simply using the diff method:
>>> df.groupby('ID')['VAL'].apply(lambda g: g.diff()/g.shift())
0 NaN
1 -0.488889
2 1.347826
3 NaN
4 0.169231
5 -0.697368
dtype: float64
As DSM notes in a comment, in this case you can do it directly with the pct_change method:
>>> df.groupby('ID')['VAL'].pct_change()
0 NaN
1 -0.488889
2 1.347826
3 NaN
4 0.169231
5 -0.697368
dtype: float64
However, it is good to be aware of how to do it with apply because you'll need to do things that way if you want to do a more complex operation on the groups (i.e., an operation for which there is no predefined one-shot method).
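For instance, a sketch of a group-wise operation with no predefined one-shot method (scaling each ID's values by that group's maximum; the data is copied from the question):

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'VAL': [45, 23, 54, 65, 76, 23]})

# Each group is handed to the lambda as a Series; every VAL is divided by its group's max
print(df.groupby('ID')['VAL'].apply(lambda g: g / g.max()))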
