Replace unwanted strings in pandas dataframe element-wise and efficiently - python

I have a very large dataframe (thousands x thousands); only 5 x 3 is shown here, with time as the index:
col1 col2 col3
time
05/04/2018 05:14:52 AM +unend +unend 0
05/04/2018 05:14:57 AM 0 0 0
05/04/2018 05:15:02 AM 30.691 0.000 0.121
05/04/2018 05:15:07 AM 30.691 n. def. 0.108
05/04/2018 05:15:12 AM 30.715 0.000 0.105
As these come from some other device (the df is produced by pd.read_csv(filename)), the dataframe, instead of being entirely of float type, ends up containing unwanted strings like +unend and n. def.. These are not the classical +infinity or NaN that df.fillna() could take care of. I would like to replace the strings with 0.0. I saw these answers: Pandas replace type issue and replace string in pandas dataframe, which, although they try to do the same thing, work column- or row-wise, not elementwise. However, the comments there contained some good hints for the general case as well.
If I try to do
mask = df.apply(lambda x: x.str.contains(r'+unend|n. def.'))
df[mask] = 0.0
I get the error: nothing to repeat
If I do
mask = df.apply(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask] = 0.0
I get a Series object with True or False for every column, rather than an elementwise mask, and therefore the error
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.
The below
mask = df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask.values] = 0.0
does give me the intended result, replacing all the unwanted strings with 0.0. However, it is slow (unpythonic?), and I am also not sure whether I can use a regex for the check rather than in, especially if I know there are mixed datatypes. Is there an efficient, fast, robust but also elementwise general way to do this?

These are not the classical +infinity or NaN that df.fillna() could take care of
You can specify a list of strings to consider as NA when reading the csv file.
df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])
And then fill the NA values with fillna
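For example, combining both steps (a minimal sketch; the na_values list is taken from the question, and filename is a placeholder):
import pandas as pd

# the device's placeholder strings are parsed straight to NaN...
df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])
# ...and the NaNs are then filled with 0.0
df = df.fillna(0.0)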

As pointed out by Edchum, if you need to replace all non-numeric values with 0: first, to_numeric with errors='coerce' creates NaNs for unparseable values, and then fillna converts them to 0:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)
If the values are exact matches rather than substrings, use DataFrame.isin, or see the very nice answer by Haleemur Ali:
df = df.mask(df.isin(['+unend','n. def.']), 0).astype(float)
For substrings with defined values:
+ and . are special regex characters, so they need to be escaped with \:
df = df.mask(df.astype(str).apply(lambda x: x.str.contains(r'(\+unend|n\. def\.)')), 0).astype(float)
Or use applymap for an elementwise check:
df = df.mask(df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) ), 0).astype(float)
print (df)
col1 col2 col3
time
05/04/2018 05:14:52 AM 0.000 0.0 0.000
05/04/2018 05:14:57 AM 0.000 0.0 0.000
05/04/2018 05:15:02 AM 30.691 0.0 0.121
05/04/2018 05:15:07 AM 30.691 0.0 0.108
05/04/2018 05:15:12 AM 30.715 0.0 0.105

Do not use pd.Series.str.contains or pd.Series.isin
A more efficient solution to this problem is to use pd.to_numeric to try and convert all data to numeric.
Use errors='coerce' to default to NaN, which you can then use with pd.Series.fillna.
cols = ['col1', 'col2', 'col3']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce').fillna(0)
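If you want to inspect which raw strings were coerced before filling them with 0, here is a small diagnostic sketch (assuming the same cols list as above):
converted = df[cols].apply(pd.to_numeric, errors='coerce')
# where() keeps the original value only where conversion failed; stack() drops the NaNs
offenders = df[cols].where(converted.isna()).stack()
print(offenders.unique())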

Related

Adding values in a column by "formula" in a pandas dataframe

I am trying to add values to a column using a formula, using the information from this question: Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
I already have the first value of column B, and I want to compute the rest of column B with a formula.
The dataframe looks something like this:
A B C
0.16 0.001433 25.775485
0.28 0 25.784443
0.28 0 25.792396
...
And the method I tried was:
for i in range(1, len(df)):
    df.loc[i, "B"] = df.loc[i-1, "B"] + df.loc[i, "A"] * ((df.loc[i, "C"]) - (df.loc[i-1, "C"]))
But this code produces an infinite loop; can someone help me with this?
You can use shift and a simple assignment.
The general rule in pandas: if you use loops, you're doing something wrong; loops are considered an anti-pattern.
df['B_new'] = df['B'].shift(-1) - df['A'] * ((df['C'] - df['C'].shift(-1)))
A B C B_new
0 0.16 0.001433 25.775485 0.001433
1 0.28 0.000000 25.784443 0.002227
2 0.28 0.000000 25.792396 NaN
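Note that the recurrence in the question (B[i] = B[i-1] + A[i]*(C[i] - C[i-1])) is cumulative, so it can also be vectorised with cumsum instead of a loop; a minimal sketch under that reading (B_cum is a hypothetical column name):
import pandas as pd

df = pd.DataFrame({'A': [0.16, 0.28, 0.28],
                   'B': [0.001433, 0.0, 0.0],
                   'C': [25.775485, 25.784443, 25.792396]})

# B[i] = B[i-1] + A[i]*(C[i] - C[i-1])  =>  B = B[0] + cumsum(A * diff(C))
increments = (df['A'] * df['C'].diff()).fillna(0)
df['B_cum'] = df['B'].iloc[0] + increments.cumsum()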

Pandas iterate over values of single column in data frame

I am a beginner to python and pandas.
I have a 5000-row data frame that looks something like this:
INDEX COL1 COL2 COL3
0 10.0 12.0 15.0
1 14.0 16.0 153.8
2 18.0 20.0 16.3
3 22.0 24.0 101.7
I wish to iterate over the values in COL3 and carry out calculations, such that:
For each row in the data frame, if the value in COL3 is <= 100.0, multiply that value by 10 and assign to variable "New_Value";
Else, multiply the value by 5 and assign to variable "New_Value"
I understand that an if statement cannot be directly applied to a dataframe series, as it will lead to an ambiguous truth-value error. However, I am stuck trying to find the right tool for this task, and would appreciate some guidance.
Cheers
Using np.where:
df['New_Value'] = np.where(df['COL3']<=100,df['COL3']*10,df['COL3']*5)
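Equivalently, since only the multiplier differs between the two branches, you can select just the factor with np.where and keep the arithmetic vectorised (a minor variant, not part of the original answer):
df['New_Value'] = df['COL3'] * np.where(df['COL3'] <= 100, 10, 5)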
One liner
df.COL3.apply(lambda x: x*10 if x<=100 else 5*x)
For this example, you can use apply, which applies a function to each element of the column.
lambda is a quick anonymous function that you can define inline; it differs slightly from a normal function.
The condition is x*10 if x<=100, so each x less than or equal to 100 is multiplied by 10, ELSE it is multiplied by 5.
Try this:
df['New_Value']=df.COL3.apply(lambda x: 10*x if x<=100 else 5*x)

Difference between Pandas' apply() and Python's map() when creating new column? [duplicate]

Can you tell me when to use these vectorization methods with basic examples?
I see that map is a Series method whereas the rest are DataFrame methods. I got confused about the apply and applymap methods though. Why do we have two methods for applying a function to a DataFrame? Again, simple examples which illustrate the usage would be great!
apply works on a row / column basis of a DataFrame
applymap works element-wise on a DataFrame
map works element-wise on a Series
Straight from Wes McKinney's Python for Data Analysis book, pg. 132 (I highly recommend this book):
Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:
In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [117]: frame
Out[117]:
b d e
Utah -0.029638 1.081563 1.280300
Ohio 0.647747 0.831136 -1.549481
Texas 0.513416 -0.884417 0.195343
Oregon -0.485454 -0.477388 -0.309548
In [118]: f = lambda x: x.max() - x.min()
In [119]: frame.apply(f)
Out[119]:
b 1.133201
d 1.965980
e 2.829781
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:
In [120]: format = lambda x: '%.2f' % x
In [121]: frame.applymap(format)
Out[121]:
b d e
Utah -0.03 1.08 1.28
Ohio 0.65 0.83 -1.55
Texas 0.51 -0.88 0.20
Oregon -0.49 -0.48 -0.31
The reason for the name applymap is that Series has a map method for applying an element-wise function:
In [122]: frame['e'].map(format)
Out[122]:
Utah 1.28
Ohio -1.55
Texas 0.20
Oregon -0.31
Name: e, dtype: object
Comparing map, applymap and apply: Context Matters
First major difference: DEFINITION
map is defined on Series ONLY
applymap is defined on DataFrames ONLY
apply is defined on BOTH
Second major difference: INPUT ARGUMENT
map accepts dicts, Series, or callable
applymap and apply accept callables only
Third major difference: BEHAVIOR
map is elementwise for Series
applymap is elementwise for DataFrames
apply also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depends on the function.
Fourth major difference (the most important one): USE CASE
map is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'}))
applymap is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))
apply is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize)).
Also see When should I (not) want to use pandas apply() in my code? for a writeup I made a while back on the most appropriate scenarios for using apply (note that there aren't many, but there are a few; apply is generally slow).
Summarising

            map                     applymap                apply
Defined on  Series only             DataFrames only         both
Accepts     dict, Series, callable  callable                callable
Behaviour   elementwise             elementwise             elementwise or row/column-wise
Use case    value mapping           elementwise transforms  complex, non-vectorisable operations
Footnotes
map when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
applymap in more recent versions has been optimised for some operations. You will find applymap slightly faster than apply in some cases. My suggestion is to test them both and use whatever works better.
map is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.
Series.apply returns a scalar for aggregating operations, and a Series otherwise. Similarly for DataFrame.apply. Note that apply also has fastpaths when called with certain NumPy functions such as mean, sum, etc.
Quick Summary
DataFrame.apply operates on entire rows or columns at a time.
DataFrame.applymap, Series.apply, and Series.map operate on one element at a time.
Series.apply and Series.map are similar and often interchangeable. Some of their slight differences are discussed in osa's answer below.
Adding to the other answers, a Series also has map and apply.
apply can make a DataFrame out of a Series; however, map will just put a Series in every cell of another Series, which is probably not what you want.
In [40]: p=pd.Series([1,2,3])
In [41]: p
Out[41]:
0 1
1 2
2 3
dtype: int64
In [42]: p.apply(lambda x: pd.Series([x, x]))
Out[42]:
0 1
0 1 1
1 2 2
2 3 3
In [43]: p.map(lambda x: pd.Series([x, x]))
Out[43]:
0 0 1
1 1
dtype: int64
1 0 2
1 2
dtype: int64
2 0 3
1 3
dtype: int64
dtype: object
Also if I had a function with side effects, such as "connect to a web server", I'd probably use apply just for the sake of clarity.
series.apply(download_file_for_every_element)
Map can use not only a function, but also a dictionary or another series. Let's say you want to manipulate permutations.
Take
1 2 3 4 5
2 1 4 5 3
The square of this permutation is
1 2 3 4 5
1 2 5 3 4
You can compute it using map. Not sure if self-application is documented, but it works in 0.15.1.
In [39]: p=pd.Series([1,0,3,4,2])
In [40]: p.map(p)
Out[40]:
0 0
1 1
2 4
3 2
4 3
dtype: int64
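Self-application still works in current pandas versions; here is a quick equivalence check (a small sketch, not from the original answer):
import pandas as pd

p = pd.Series([1, 0, 3, 4, 2])
# map composes p with itself; fancy-indexing the values array does the same thing
print(p.map(p).equals(pd.Series(p.values[p.values])))  # True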
#jeremiahbuddha mentioned that apply works on rows/columns, while applymap works element-wise. But it seems you can still use apply for element-wise computation...
frame.apply(np.sqrt)
Out[102]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
frame.applymap(np.sqrt)
Out[103]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
Probably the simplest explanation of the difference between apply and applymap:
apply takes the whole column as a parameter and then assigns the result to this column
applymap takes the separate cell value as a parameter and assigns the result back to this cell.
NB: if apply returns a single value, you will have this value instead of the column after assigning, and will eventually end up with just a row instead of a matrix.
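A minimal illustration of that NB (a small sketch with a hypothetical two-column frame):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.apply(np.sum))              # one value per column -> a Series, not a matrix
print(df.applymap(lambda v: v + 1))  # same shape as df, every cell transformed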
Just wanted to point out, as I struggled with this for a bit:
def f(x):
    if x < 0:
        x = 0
    elif x > 100000:
        x = 100000
    return x
df.applymap(f)
df.describe()
This does not modify the dataframe itself; it has to be reassigned:
df = df.applymap(f)
df.describe()
Based on the answer of cs95
map is defined on Series ONLY
applymap is defined on DataFrames ONLY
apply is defined on BOTH
here are some examples:
In [3]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [4]: frame
Out[4]:
b d e
Utah 0.129885 -0.475957 -0.207679
Ohio -2.978331 -1.015918 0.784675
Texas -0.256689 -0.226366 2.262588
Oregon 2.605526 1.139105 -0.927518
In [5]: myformat=lambda x: f'{x:.2f}'
In [6]: frame.d.map(myformat)
Out[6]:
Utah -0.48
Ohio -1.02
Texas -0.23
Oregon 1.14
Name: d, dtype: object
In [7]: frame.d.apply(myformat)
Out[7]:
Utah -0.48
Ohio -1.02
Texas -0.23
Oregon 1.14
Name: d, dtype: object
In [8]: frame.applymap(myformat)
Out[8]:
b d e
Utah 0.13 -0.48 -0.21
Ohio -2.98 -1.02 0.78
Texas -0.26 -0.23 2.26
Oregon 2.61 1.14 -0.93
In [9]: frame.apply(lambda x: x.apply(myformat))
Out[9]:
b d e
Utah 0.13 -0.48 -0.21
Ohio -2.98 -1.02 0.78
Texas -0.26 -0.23 2.26
Oregon 2.61 1.14 -0.93
In [10]: myfunc=lambda x: x**2
In [11]: frame.applymap(myfunc)
Out[11]:
b d e
Utah 0.016870 0.226535 0.043131
Ohio 8.870453 1.032089 0.615714
Texas 0.065889 0.051242 5.119305
Oregon 6.788766 1.297560 0.860289
In [12]: frame.apply(myfunc)
Out[12]:
b d e
Utah 0.016870 0.226535 0.043131
Ohio 8.870453 1.032089 0.615714
Texas 0.065889 0.051242 5.119305
Oregon 6.788766 1.297560 0.860289
Just for additional context and intuition, here's an explicit and concrete example of the differences.
Assume you have the function seen below. (This label function will arbitrarily split the values into 'High' and 'Low', based upon the threshold you provide as the parameter x.)
def label(element, x):
    if element > x:
        return 'High'
    else:
        return 'Low'
In this example, let's assume our dataframe has one column with random numbers.
If you tried mapping the label function with map:
df['ColumnName'].map(label, x = 0.8)
You will get the following error:
TypeError: map() got an unexpected keyword argument 'x'
Now take the same function and use apply, and you'll see that it works:
df['ColumnName'].apply(label, x=0.8)
Series.apply() can take additional arguments element-wise, while the Series.map() method will return an error.
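The same extra argument can also be passed positionally through the args tuple of Series.apply (same hypothetical label function and column as above):
df['ColumnName'].apply(label, args=(0.8,))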
Now, if you're trying to apply the same function to several columns in your dataframe simultaneously, DataFrame.applymap() is used.
df[['ColumnName','ColumnName2','ColumnName3','ColumnName4']].applymap(label)
Lastly, you can also use the apply() method on a dataframe, but the DataFrame.apply() method has different capabilities. Instead of applying functions element-wise, the df.apply() method applies functions along an axis, either column-wise or row-wise. When we create a function to use with df.apply(), we set it up to accept a series, most commonly a column.
Here is an example:
df.apply(pd.value_counts)
When we applied the pd.value_counts function to the dataframe, it calculated the value counts for all the columns.
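(A version note: the top-level pd.value_counts function is deprecated in newer pandas releases; the equivalent call is df.apply(lambda s: s.value_counts()).)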
Notice, and this is very important, that we used the df.apply() method to transform multiple columns. This is only possible because the pd.value_counts function operates on a series. If we tried to use the df.apply() method to apply a function that works element-wise to multiple columns, we'd get an error:
For example:
def label(element):
    if element > 1:
        return 'High'
    else:
        return 'Low'
df[['ColumnName','ColumnName2','ColumnName3','ColumnName4']].apply(label)
This will result in the following error:
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index Economy')
In general, we should only use the apply() method when a vectorized function does not exist. Recall that pandas uses vectorization, the process of applying operations to whole series at once, to optimize performance. When we use the apply() method, we're actually looping through rows, so a vectorized method can perform an equivalent task faster than the apply() method.
Here are some examples of vectorized functions that already exist that you do NOT want to recreate using any type of apply/map methods:
Series.str.split() Splits each element in the Series.
Series.str.strip() Strips whitespace from each string in the Series.
Series.str.lower() Converts strings in the Series to lowercase.
Series.str.upper() Converts strings in the Series to uppercase.
Series.str.get() Retrieves the ith element of each element in the Series.
Series.str.replace() Replaces a regex or string in the Series with another string.
Series.str.cat() Concatenates strings in a Series.
Series.str.extract() Extracts substrings from the Series matching a regex pattern.
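To make the point concrete, here is one of those built-ins next to its slower apply equivalent (a small sketch; s is a hypothetical Series of strings):
import pandas as pd

s = pd.Series([' alpha ', ' beta '])
print(s.str.strip())                 # vectorised string method
print(s.apply(lambda v: v.strip()))  # same result via a Python-level loop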
My understanding:
From the function point of view:
If the function has variables that need to be compared within a column/row, use apply,
e.g.: lambda x: x.max() - x.mean().
If the function is to be applied to each element:
1> If a column/row is selected, use apply
2> If applying to the entire dataframe, use applymap
majority = lambda x : x > 17
df2['legal_drinker'] = df2['age'].apply(majority)
def times10(x):
    if type(x) is int:
        x *= 10
    return x
df2.applymap(times10)
FOMO:
The following example shows apply and applymap applied to a DataFrame.
The map function is something you apply on a Series only; you cannot apply map on a DataFrame.
The thing to remember is that apply can do anything applymap can, but apply has eXtra options.
The X factor options are: axis and result_type where result_type only works when axis=1 (for columns).
import numpy as np
import pandas as pd

df = pd.DataFrame(1, columns=list('abc'), index=list('1234'))
print(df)
f = lambda x: np.log(x)
print(df.applymap(f)) # apply to the whole dataframe
print(np.log(df)) # applied to the whole dataframe
print(df.applymap(np.sum)) # np.sum is applied to each scalar element; applymap cannot reduce
# apply can take different options (vs. applymap cannot)
print(df.apply(f)) # same as applymap
print(df.apply(sum, axis=1)) # reducing example
print(df.apply(np.log, axis=1)) # cannot reduce
print(df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')) # expand result
As a sidenote, the Series map function should not be confused with the Python map function.
The first one is applied on a Series to map its values, and the second one to every item of an iterable.
Lastly, don't confuse the DataFrame apply method with the groupby apply method.

iterating re.split() on a dataframe

I am trying to use re.split() to split a single variable in a pandas dataframe into two other variables.
My data looks like:
xg
0.05+0.43
0.93+0.05
0.00
0.11+0.11
0.00
3.94-2.06
I want to create
e a
0.05 0.43
0.93 0.05
0.00
0.11 0.11
0.00
3.94 2.06
I can do this using a for loop and indexing.
for i in range(len(df)):
    if df['xg'].str.len()[i] < 5:
        df['e'][i] = df['xg'][i]
    else:
        df['e'][i], df['a'][i] = re.split("[\+ \-]", df['xg'][i])
However, this is slow, and I do not believe it is a good way of doing this; I am trying to improve my code/Python understanding.
I have made various attempts using np.where, a list comprehension, or apply with a lambda, but I can't get them to run. I think all the issues I have are because I am trying to apply the functions to the whole series rather than the positional value.
If anyone has an idea of a better method than my ugly for loop, I would be very interested.
Borrowed from this answer using the str.split method with the expand argument:
https://stackoverflow.com/a/14745484/3084939
df = pd.DataFrame({'col': ['1+2','3+4','20','0.6-1.6']})
df[['left','right']] = df['col'].str.split('[+|-]', expand=True)
df.head()
col left right
0 1+2 1 2
1 3+4 3 4
2 20 20 None
3 0.6-1.6 0.6 1.6
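One small note on the pattern: inside a character class the | is a literal, so '[+|-]' would also split on a | character; '[+-]' is enough (a minor variant of the same answer):
df[['left','right']] = df['col'].str.split(r'[+-]', expand=True)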
This may be what you want. Not sure it's elegant, but it should be faster than a Python loop.
import pandas as pd
import numpy as np
data = ['0.05+0.43','0.93+0.05','0.00','0.11+0.11','0.00','3.94-2.06']
df = pd.DataFrame(data, columns=['xg'])
# Solution
tmp = df['xg'].str.split(r'[ \-+]')
df['e'] = tmp.apply(lambda x: x[0])
df['a'] = tmp.apply(lambda x: x[1] if len(x) > 1 else np.nan)
del(tmp)
Regex to retain the -ve sign:
import pandas as pd
import re
df1 = pd.DataFrame({'col': ['1+2','3+4','20','0.6-1.6']})
data = [[i] + re.findall('-*[0-9.]+', i) for i in df1['col']]
df = pd.DataFrame(data, columns=["col", "left", "right"])
print(df.head())
col left right
0 1+2 1 2
1 3+4 3 4
2 20 20 None
3 0.6-1.6 0.6 -1.6
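The same extraction can also be done without an explicit Python loop via Series.str.findall (a variant sketch, not from the original answer; rows with a single match are padded with None):
import pandas as pd

df1 = pd.DataFrame({'col': ['1+2', '3+4', '20', '0.6-1.6']})
parts = df1['col'].str.findall(r'-*[0-9.]+')           # list of matches per row
df1[['left', 'right']] = pd.DataFrame(parts.tolist())  # ragged rows padded with None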

how to extract numeric information from a string in Pandas?

I have a column in my dataframe that contains string rows such as :
'(0.0,0.8638888888888889,3.7091666666666665,12.023333333333333,306.84694444444443)'
This output (produced by another program) corresponds to the min, 25th, median, 75th and max for a given variable.
I would like to extract that information and put it into separate numeric columns, such as
min p25 p50
0.0 0.864 3.70
The data I have is really large. How can I do that in Pandas?
Many thanks!
IIUC then the following should work:
In [280]:
df = pd.DataFrame({'col':['(0.0,0.8638888888888889,3.7091666666666665,12.023333333333333,306.84694444444443)']})
df
Out[280]:
col
0 (0.0,0.8638888888888889,3.7091666666666665,12....
In [297]:
df[['min','p25','p50']] = df['col'].str.replace('\'|\(|\)','').str.split(',', expand=True).astype(np.float64)[[0,1,2]]
df
Out[297]:
col min p25 p50
0 (0.0,0.8638888888888889,3.7091666666666665,12.... 0.0 0.863889 3.709167
So this replaces the ', ( and ) characters with the empty string using str.replace; then we split on the comma using str.split with expand=True, cast the type to float, and index the columns of interest.
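A version note: in recent pandas releases str.replace treats the pattern as a literal string by default, so the regex flag must be passed explicitly (the same chain, hedged for newer versions; assumes numpy imported as np):
df[['min','p25','p50']] = (df['col'].str.replace(r"'|\(|\)", '', regex=True)
                                    .str.split(',', expand=True)
                                    .astype(np.float64)[[0, 1, 2]])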
