replace string in pandas dataframe - python

I have a dataframe with multiple columns. I want to look at one column and if any of the strings in the column contain #, I want to replace them with another string. How would I go about doing this?

A dataframe in pandas is composed of columns which are Series - pandas docs link
I'm going to use regex, because it's useful and everyone needs practice, myself included! pandas docs for text manipulation
Note the str.replace. The regex you want is '.*#+.*' (it worked for me), which says "any character (.) zero or more times (*), followed by a # one or more times (+), followed by any character (.) zero or more times (*)".
df['column'] = df['column'].str.replace('.*#+.*', 'replacement', regex=True)
Should work, where 'replacement' is whatever string you want to put in. (In pandas 1.4 and later, str.replace treats the pattern as a literal string by default, so pass regex=True explicitly.)
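A minimal sketch of the idea on a throwaway dataframe (the column name and replacement text are made up; regex=True makes the pattern a regex on modern pandas):

```python
import pandas as pd

df = pd.DataFrame({'column': ['foo', 'has#tag', 'a##b']})
# Replace the whole cell whenever it contains at least one '#'
df['column'] = df['column'].str.replace('.*#+.*', 'replacement', regex=True)
print(df['column'].tolist())  # ['foo', 'replacement', 'replacement']
```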

My suggestion:
df['col'] = ['new string' if '#' in x else x for x in df['col']]
not sure which is faster.

Assuming you called your dataframe df, you can do:
pd.DataFrame({col: ['anotherString' if '#' in x else x for x in df[col]] for col in df.columns})
(In Python 3, the nested map/transpose version returns lazy iterators that pandas can't unpack into rows, so a dict comprehension over the columns is more reliable.)

Related

Replacing a Character in .csv file only for specific strings

I am trying to clean a file and have removed the majority of unnecessary data, except for this one issue. The file I am cleaning is made up of rows containing numbers; see the example of a few rows below.
Example of data: https://i.stack.imgur.com/0bADX.png
You can see that I have cleaned the data so that there is a space between each character aside from the four characters that start each row. There are some character groupings that I have not yet added a space between each character because I need to replace the "1"s with a space rather than keeping the "1"s.
Strings I still need to clean: https://i.stack.imgur.com/gmeUs.png
I have tried the following two methods in order to replace the 1's in these specific strings, but both produce results that I do not want.
Method 1 - Replacing 1's before splitting characters into their own columns
Data2 = pd.read_csv('filename.csv')
Data2['Column']=Data2['Column'].apply(lambda x: x.replace('1',' ') if len(x)>4 else x)
This method results in the replacement of every 1 in the entire file, not just the 1's in strings like those pictured above (formatted like "8181818"). I would think that the if statement would exclude the removal of the 1's where fewer than 4 characters are grouped together.
Method 2 - Replacing 1's after splitting characters into their own columns
Since Method 1 was resulting in the removal of each 1 in the file, I figured I could split each string into its own column (essentially using the spaces as a delimiter) and then try a similar method to clean these unnecessary 1's by focusing on the specific columns where these strings are located (columns 89, 951, and 961).
for col in [89, 951, 961]:
    Data2[col] = Data2[col].apply(lambda x: x.replace('1', ' ') if len(x) != 1 else x)
    Data2[col] = pd.DataFrame(Data2[col].str.split(' ').tolist())
This method successfully removed only the 1's in these strings, but when I then split the numbers I am keeping into their own columns, they overwrite the existing values in those columns rather than pushing the existing values into columns further down the line.
Any assistance on either of these methods or advice on if there is a different approach I should be taking would be much appreciated.

Splitting the column values based on a delimiter (Pandas)

I have a pandas dataframe with a column named AA_IDs. The values in this column contain a special character "-#" in a few rows. I need to determine three things:
The position of these special characters or delimiters
The string before the special character
The string after the special character
E.g. AFB001 9183Daily-#789876A
Answer would be before the delimiter - AFB001 9183Daily and after the delimiter - 789876A
Just use the apply function with split -
df['AA_IDs'].apply(lambda x: x.split('-#'))
This should give you a series with a list for each row, as ['AFB001 9183Daily', '789876A']
This would be significantly faster than using regex, and not to mention the readability.
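All three pieces the question asks for can be sketched with standard pandas string methods on a made-up frame (the second row is invented to show the no-delimiter case):

```python
import pandas as pd

df = pd.DataFrame({'AA_IDs': ['AFB001 9183Daily-#789876A', 'NOMARKER123']})

df['pos'] = df['AA_IDs'].str.find('-#')    # position of the delimiter; -1 where absent
parts = df['AA_IDs'].str.partition('-#')   # three columns: before, separator, after
df['before'] = parts[0]
df['after'] = parts[2]
print(df.loc[0, ['pos', 'before', 'after']].tolist())
# [16, 'AFB001 9183Daily', '789876A']
```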
So lets say the dataframe is called df and the column with the text is A.
You can use
import re # Import regex
pattern = r'<your regex>'
df['one'] = df.A.str.extract(pattern)
This creates a new column containing the extracted text. You just need to create a regex to extract what you want from your string(s). I highly recommend regex101 to help you construct your regex.
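For instance, a hypothetical pattern with two capture groups pulls the parts around a '-#' delimiter into separate columns (the column name and data here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': ['AFB001 9183Daily-#789876A']})
# One named capture group per output column
extracted = df['A'].str.extract(r'(?P<before>.*)-#(?P<after>.*)')
print(extracted.iloc[0].tolist())  # ['AFB001 9183Daily', '789876A']
```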
Hope this helps!

How to Check and Filter out Row in Dataframe if "." exists in df Cell

I tried to reference this SO answer: How to check if character exists in DataFrame cell
It gave a seemingly good solution, but it doesn't appear to work for the period character ".", which of course is the character I'm trying to filter on.
df_intials = df['Name'].str.contains('.')
Is there something specific about filtering a dataframe when every value in the column has a "."?
When I convert to a list and write a simple function that appends strings containing "." to a new list, it works correctly.
pd.Series.str.contains treats the pattern as a regular expression by default, so you can either escape the dot with a backslash or pass the parameter regex=False:
Try
df_intials = df['Name'].str.contains(r'\.')
or
df_intials = df['Name'].str.contains('.', regex=False)
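A quick sketch of both options on a throwaway frame (the names are made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['J.R.', 'Ada', 'A.B.']})
mask = df['Name'].str.contains('.', regex=False)  # literal dot, not "any character"
print(mask.tolist())                # [True, False, True]
print(df[~mask]['Name'].tolist())   # ['Ada'] -- rows without a '.'
```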

How to perform str.strip in dataframe and save it with inplace=true?

I have dataframe with n columns. And I want to perform a strip to strings in one of the columns in the dataframe. I was able to do it, but I want this change to reflect in the original dataframe.
Dataframe: data
Name
0 210123278414410005
1 101232784144610006
2 210123278414410007
3 21012-27841-410008
4 210123278414410009
After stripping:
Name
0 10005
1 10006
2 10007
3 10008
4 10009
I tried the below code and strip was successful
data['Name'].str.strip().str[13:]
However if I check dataframe, the strip is not reflected.
I am looking for something like inplace parameter.
String methods (the attributes of the .str attribute on a Series) will only ever return a new Series; you can't use these for in-place changes. Your only option is to assign it back to the same column:
data['Name'] = data['Name'].str.strip().str[13:]
You could instead use the Series.replace() method with a regular expression, and inplace=True:
data['Name'].replace(r'(?s)\A\s*(.{,13}).*(?<!\s)\s*\Z', r'\1', regex=True, inplace=True)
The regular expression above matches up to 13 characters after leading whitespace, and ignores trailing whitespace and any other characters beyond the first 13 after whitespace is removed. It produces the same output as .str.strip().str[:13], but makes the changes in place.
The pattern is using a negative look-behind to make sure that the final \s* pattern matches all whitespace elements at the end before selecting between 0 and 13 characters of what remains. The \A and \Z anchors make it so the whole string is matched, and the (?s) at the start switches the . pattern (dot, any character except newlines) to include newlines when matching; this way an input value like ' foo\nbar ' is handled correctly.
Put differently, the \A\s* and (?<!\s)\s*\Z patterns act like str.strip() does, matching all whitespace at the start and end, respectively, and no more. The (.{,13}).* pattern matches everything in between, with the first 13 characters of those (or fewer, if there are not enough characters to match after stripping) captured as a group. That one group is then used as the replacement value.
And because . doesn't normally match \n characters, the (?s) flag at the start tells the regex engine to match newline characters anyway. We want all characters to be included after stripping, not just all except one.
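A quick check of the regex route on made-up data, using a plain assignment rather than inplace=True so it also behaves under pandas copy-on-write:

```python
import pandas as pd

data = pd.DataFrame({'Name': ['  210123278414410005  ', 'foo\nbar']})
pattern = r'(?s)\A\s*(.{,13}).*(?<!\s)\s*\Z'
# Keeps the first 13 characters after stripping surrounding whitespace
data['Name'] = data['Name'].replace(pattern, r'\1', regex=True)
print(data['Name'].tolist())  # ['2101232784144', 'foo\nbar']
```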
data['Name'].str.strip().str[13:] returns the new transformed column, but it does not change the data in place (inside the dataframe). You should write:
data['Name'] = data['Name'].str.strip().str[13:]
to write the transformed data to the Name column.
I agree with the other answers that there's no inplace parameter for the strip function, as seen in the documentation for str.strip.
To add to that: I've found the str functions for pandas Series are usually used when selecting specific rows, like df[df['Name'].str.contains('69')]. I'd say this is a possible reason that it doesn't have an inplace parameter -- it's not meant to be completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters; we consistently get the last 5 characters regardless of the string length!
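For example, on made-up values of differing lengths with surrounding whitespace, the negative slice keeps the last five characters either way:

```python
import pandas as pd

s = pd.Series(['  210123278414410005', '9999910010  '])
# strip whitespace, then take the final five characters of each string
print(s.str.strip().str[-5:].tolist())  # ['10005', '10010']
```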
As per yatu's comment: you should reassign the Series with stripped values to the original column.
data['Name'] = data['Name'].str.strip().str[13:]
It is interesting to note that pandas DataFrames are built on numpy underneath.
There is also the option of doing elementwise operations directly in numpy.
Here is the example I had in mind:
import numpy as np
import pandas as pd
df = pd.DataFrame([['210123278414410005', '101232784144610006']])
arr = df.to_numpy(copy=False)  # numpy array
df = pd.DataFrame(np.frompyfunc(lambda s: s[13:], 1, 1)(arr))
print(df)  # 10005 10006
This doesn't answer your question directly, but it is just another option (it creates a new dataframe from the numpy array, though).

How To Remove \x95 chars from text - Pandas?

I am having trouble removing a space at the beginning of strings in pandas dataframe cells. Looking at the dataframe cells, it seems like there is a space at the start of the string; however, printing one of the cells that has this set of chars at the beginning outputs "\x95 12345", so as you can see it is not a normal space char but rather "\x95".
I already tried to use strip(), but it didn't help.
The dataframe was produced after using the str.split(pat=',').tolist() expression, which split the strings into different cells by ',', so now my strings have this char added.
Assuming col1 is your first column name:
import re
df.col1 = df.col1.apply(lambda x: re.sub(r'\x95',"",x))
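An alternative sketch using the vectorized string accessor instead of apply (col1 and the sample values are assumptions), which also strips the space that follows the \x95 byte:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['\x95 12345', 'clean']})
# Drop the \x95 byte literally (no regex), then trim leftover whitespace
df['col1'] = df['col1'].str.replace('\x95', '', regex=False).str.strip()
print(df['col1'].tolist())  # ['12345', 'clean']
```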
