How to remove \x95 chars from text in a pandas DataFrame? - python

I am having trouble removing what looks like a leading space from strings in my pandas DataFrame cells. The cells appear to start with a space, but printing one of them shows "\x95 12345", so it is not a normal space character but rather "\x95".
I already tried strip(), but it didn't help.
The DataFrame was produced using a str.split(pat=',').tolist() expression, which split the strings into separate cells on ',', and that is when this character appeared.

Assuming col1 is your first column name:
import re
df['col1'] = df['col1'].apply(lambda x: re.sub(r'\x95', '', x))
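A vectorized alternative (a sketch with made-up data, assuming the column holds strings) avoids the per-row apply by using pandas' own string methods:

```python
import pandas as pd

# Hypothetical data reproducing the issue: a \x95 bullet character
# left over from splitting the original strings on ','.
df = pd.DataFrame({"col1": ["\x95 12345", "\x95 67890"]})

# str.replace with regex=False removes the literal character in one
# vectorized pass; str.strip() then drops the leftover leading space.
df["col1"] = df["col1"].str.replace("\x95", "", regex=False).str.strip()

print(df["col1"].tolist())  # → ['12345', '67890']
```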

Related

Replacing a Character in .csv file only for specific strings

I am trying to clean a file and have removed the majority of unnecessary data, except for this one issue. The file I am cleaning is made up of rows containing numbers; see the example of a few rows below.
Example of data: https://i.stack.imgur.com/0bADX.png
You can see that I have cleaned the data so that there is a space between each character, aside from the four characters that start each row. Some character groupings still have no space between each character, because in those I need to replace the "1"s with a space rather than keep them.
Strings I still need to clean: https://i.stack.imgur.com/gmeUs.png
I have tried the following two methods to replace the 1's in these specific strings, but both produce results that I do not want.
Method 1 - Replacing 1's before splitting characters into their own columns
import pandas as pd

Data2 = pd.read_csv('filename.csv')
Data2['Column'] = Data2['Column'].apply(lambda x: x.replace('1', ' ') if len(x) > 4 else x)
This method results in the replacement of every 1 in the entire file, not just the 1's in strings like those pictured above (formatted like "8181818"). I would have thought the if statement would exclude the removal of 1's where fewer than 4 characters are grouped together.
Method 2 - Replacing 1's after splitting characters into their own columns
Since Method 1 resulted in the removal of every 1 in the file, I figured I could split each string into its own column (essentially using the spaces as a delimiter) and then apply a similar method to clean these unnecessary 1's, focusing on the specific columns where these strings are located (columns 89, 951, and 961).
for col in [89, 951, 961]:
    Data2[col] = Data2[col].apply(lambda x: x.replace('1', ' ') if len(x) != 1 else x)
    Data2[col] = pd.DataFrame(Data2[col].str.split(' ').tolist())
This method successfully removed only the 1's in these strings, but when I then split the numbers I am keeping into their own columns, they overwrite the existing values in those columns rather than pushing those values into columns further down the line.
Any assistance with either of these methods, or advice on whether there is a different approach I should be taking, would be much appreciated.
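One way to rescue Method 1 (a sketch with made-up rows, not the asker's actual file) is to replace 1's only inside long character groupings: `str.replace` accepts a callable replacement when `regex=True`, so a pattern matching long tokens can strip their 1's while short tokens pass through untouched.

```python
import pandas as pd

# Made-up column imitating the described rows: short codes that must
# keep their 1s, plus long "8181818"-style groupings whose 1s are
# really separators.
data = pd.DataFrame({"Column": ["A1B2 8181818", "C1D4 9191919"]})

# For every token of 5+ non-space characters, replace its 1s with
# spaces; tokens of 4 or fewer characters are left alone.
data["Column"] = data["Column"].str.replace(
    r"\S{5,}", lambda m: m.group().replace("1", " "), regex=True
)

print(data["Column"].tolist())  # → ['A1B2 8 8 8 8', 'C1D4 9 9 9 9']
```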

Splitting the column values based on a delimiter (Pandas)

I have a pandas DataFrame with a column named AA_IDs. The column's values contain a special character "-#" in a few rows. I need to determine three things:
the position of these special characters (delimiters)
the string before the special character
the string after the special character
E.g. AFB001 9183Daily-#789876A
The answer would be AFB001 9183Daily before the delimiter and 789876A after the delimiter.
Just use the apply function with split:
df['AA_IDs'].apply(lambda x: x.split('-#'))
This should give you a Series with a list for each row, like ['AFB001 9183Daily', '789876A'].
This would be significantly faster than using regex, not to mention more readable.
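If you want the two halves in separate columns rather than a list, `str.split` with `expand=True` does that directly, and `str.find` answers the position part of the question (a sketch using the example value above; the second row is a made-up value without the delimiter):

```python
import pandas as pd

df = pd.DataFrame({"AA_IDs": ["AFB001 9183Daily-#789876A", "plainvalue"]})

# Position of the delimiter (-1 when absent).
df["pos"] = df["AA_IDs"].str.find("-#")

# expand=True returns a DataFrame: column 0 is the text before '-#',
# column 1 the text after (None/NaN for rows without the delimiter).
parts = df["AA_IDs"].str.split("-#", expand=True)
df["before"] = parts[0]
df["after"] = parts[1]

print(df[["before", "after"]].iloc[0].tolist())  # → ['AFB001 9183Daily', '789876A']
```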
So let's say the dataframe is called df and the column with the text is A.
You can use
import re # Import regex
pattern = r'<your regex>'
df['one'] = df.A.str.extract(pattern)
This creates a new column containing the extracted text. You just need a regex (with at least one capture group, which is what str.extract returns) that matches what you want to extract from your string(s). I highly recommend regex101 to help you construct your regex.
Hope this helps!
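For this particular question, the pattern could capture both sides of the "-#" delimiter; with named groups, each group becomes a column in the extracted DataFrame. A sketch (the regex here is my assumption, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({"A": ["AFB001 9183Daily-#789876A"]})

# Each named capture group becomes a column of the result.
out = df["A"].str.extract(r"(?P<before>.*)-#(?P<after>.*)")

print(out.loc[0, "before"], out.loc[0, "after"])  # → AFB001 9183Daily 789876A
```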

Replace \n with space

I have a pandas DataFrame with text in a column called Text.
I want to replace each newline with a space, so I tried:
for index, row in df.iterrows():
df['Text']=df['Text'].str.replace('\n',"")
The problem is: if the original text is written like of\nthe, after applying my method I get ofthe.
Any solutions?
You can just add a space as the replacement character:
df['Text']=df['Text'].str.replace('\n'," ")
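A minimal check of the fix (toy data; note that the iterrows loop is unnecessary, since str.replace is already vectorized over all rows):

```python
import pandas as pd

df = pd.DataFrame({"Text": ["of\nthe", "one\nline"]})

# Replacing the newline with a space keeps the word boundary intact.
df["Text"] = df["Text"].str.replace("\n", " ", regex=False)

print(df["Text"].tolist())  # → ['of the', 'one line']
```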

replacing multiple strings in a column with a list

I have a column named "New_category" and I have to replace multiple strings in the column with the word "brandies". I have created the following list:
brandies = ["apricot brandies", "cherry brandies", "peach brandies"]
I want to write a formula that will replace "apricot brandies", "cherry brandies", and "peach brandies" in the column with the word "brandies".
This is what I have tried, but it does not work:
iowa['New_category'] = iowa['New_Category'].str.replace('brandies','Brandies')
When using str.replace, use a pattern that captures the whole string:
iowa['New_category'] = iowa['New_category'].str.replace('^.*brandies.*$', 'Brandies', regex=True)
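Since the exact strings are already collected in the brandies list, `Series.replace` with a list avoids the regex entirely. A sketch with made-up rows (the "whiskey" value is mine, just to show non-matching rows are untouched):

```python
import pandas as pd

iowa = pd.DataFrame(
    {"New_category": ["apricot brandies", "whiskey", "peach brandies"]}
)
brandies = ["apricot brandies", "cherry brandies", "peach brandies"]

# Series.replace maps every exact match in the list to the one new value.
iowa["New_category"] = iowa["New_category"].replace(brandies, "brandies")

print(iowa["New_category"].tolist())  # → ['brandies', 'whiskey', 'brandies']
```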

replace string in pandas dataframe

I have a dataframe with multiple columns. I want to look at one column and if any of the strings in the column contain #, I want to replace them with another string. How would I go about doing this?
A DataFrame in pandas is composed of columns, which are Series (see the pandas docs).
I'm going to use regex, because it's useful and everyone needs practice, myself included! See the pandas docs on text manipulation.
Note the str.replace. The regex you want is '.*#+.*' (it worked for me), which says: any character (.) zero or more times (*), followed by # one or more times (+), followed by any character (.) zero or more times (*).
df['column'] = df['column'].str.replace('.*#+.*', 'replacement', regex=True)
Should work, where 'replacement' is whatever string you want to put in.
My suggestion:
df['col'] = ['new string' if '#' in x else x for x in df['col']]
I'm not sure which is faster.
Assuming you called your dataframe df and every cell is a string, you can apply the same test to every column:
df = df.apply(lambda col: col.map(lambda x: 'anotherString' if '#' in x else x))
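Another vectorized option for a single column (a sketch with toy data, assuming string values) is a boolean mask from `str.contains`, which avoids regex and reads clearly:

```python
import pandas as pd

df = pd.DataFrame({"column": ["abc#def", "no hash", "x#"]})

# str.contains('#', regex=False) flags rows holding a literal '#';
# .loc then overwrites just those rows.
mask = df["column"].str.contains("#", regex=False)
df.loc[mask, "column"] = "replacement"

print(df["column"].tolist())  # → ['replacement', 'no hash', 'replacement']
```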
