I have a pandas DataFrame with text in a column called Text.
I want to replace each newline with a space, so I tried:
for index, row in df.iterrows():
    df['Text'] = df['Text'].str.replace('\n', '')
The problem is: if the original text is written like of\nthe, after applying my method I get ofthe.
Any solutions?
You can just use a space as the replacement string (the iterrows loop isn't needed, since str.replace is vectorized over the whole column):
df['Text'] = df['Text'].str.replace('\n', ' ')
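As a minimal runnable sketch of the fix (the sample strings are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Text": ["of\nthe", "hello\nworld"]})

# Replace every newline with a space, vectorized over the whole column
df["Text"] = df["Text"].str.replace("\n", " ", regex=False)
print(df["Text"].tolist())  # ['of the', 'hello world']
```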
I have a .csv file that has a column containing text. Each item in this column has gene names and a date (for example 'CYP2C19, CYP2D6 07/17/2020'). I want to remove the dates from all of the values in this column so that only the genes are visible (output: 'CYP2C19, CYP2D6'). Secondly, some cells contain both a gene name and a note saying there is no recommendation ('CYP2C9 08/19/2020 (no recommendation'). In these cases, I would like to remove both the date and the no-recommendation note (output: 'CYP2C9').
I have tried using the code below to remove any text that contains slashes from a single string (I have not yet tried anything on the entire .csv file). However, it left the 07 from the date:
import re
pattern = re.compile('/(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
s = 'CYP2C19, CYP2D6 07/17/2020'
pattern.sub('', s)
Output: 'CYP2C19, CYP2D6 07'
One method is to just take the date out of the string and then split it as you please. Note this works for any number of dates:
import re
x = 'CYP2C19, CYP2D6 07/17/2020'
x = re.sub(r'\s*\d{2}/\d{2}/\d{4}', "", x)
You could replace \s* with \s if you know there will always be exactly one space separating the term you want from the date, but I don't see a reason to do that.
Note that you can now split this by the delimiter, which in the case of your question is actually a comma followed by a space:
result = x.split(", ")
# ['CYP2C19', 'CYP2D6']
Although in your CSV you may find that it is just a comma (as CSVs normally are).
Combining the steps above:
import re
x = 'CYP2C19 08/15/1972, CYP2D6 07/17/2020'
x = re.sub(r'\s*\d{2}/\d{2}/\d{4}', "", x).split(", ")
# ['CYP2C19', 'CYP2D6']
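Applied to a whole DataFrame column, the same substitution might look like this. A second replacement handles the "(no recommendation" note from the question; the column name genes is just an assumption:

```python
import pandas as pd

df = pd.DataFrame({"genes": [
    "CYP2C19, CYP2D6 07/17/2020",
    "CYP2C9 08/19/2020 (no recommendation",
]})

# Strip the dates, then drop any trailing "(no recommendation" note
df["genes"] = (
    df["genes"]
    .str.replace(r"\s*\d{2}/\d{2}/\d{4}", "", regex=True)
    .str.replace(r"\s*\(no recommendation\)?", "", regex=True)
)
print(df["genes"].tolist())  # ['CYP2C19, CYP2D6', 'CYP2C9']
```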
I think you could take each value and then split it.
For example, take the string column = ' CYP2D6 07/17/2020'.
You could do m = column.split(), which gives a list like m = ['CYP2D6', '07/17/2020'].
After that you can simply take gene = m[0].
I have a pandas DataFrame with a column named AA_IDs. The column's values contain the special character "-#" in a few rows. I need to determine three things:
Position of these special characters or delimiters
Find the string before the special character
Find the string after the special character
E.g. for AFB001 9183Daily-#789876A, the answer would be AFB001 9183Daily before the delimiter and 789876A after it.
Just use apply with split -
df['AA_IDs'].apply(lambda x: x.split('-#'))
This should give you a Series with a list for each row, like ['AFB001 9183Daily', '789876A'].
This will be significantly faster than using regex, not to mention more readable.
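If you want the two parts in separate columns, str.split with expand=True is a common vectorized pattern, and str.find answers the delimiter-position part of the question. The column names before and after are just illustrative:

```python
import pandas as pd

df = pd.DataFrame({"AA_IDs": ["AFB001 9183Daily-#789876A", "XYZ123-#456B"]})

# expand=True returns a DataFrame with one column per split part
df[["before", "after"]] = df["AA_IDs"].str.split("-#", expand=True)

# Position of the delimiter in each string (-1 if it is absent)
df["pos"] = df["AA_IDs"].str.find("-#")
print(df)
```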
So let's say the dataframe is called df and the column with the text is A.
You can use
import re # Import regex
pattern = r'<your regex>'
df['one'] = df.A.str.extract(pattern)
This creates a new column containing the extracted text. Note that str.extract requires the pattern to contain at least one capturing group (a part wrapped in parentheses); that group is what gets extracted. I highly recommend regex101 to help you construct your regex.
Hope this helps!
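As a small sketch of the pattern above, extracting a date with a capturing group (the sample data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"A": ["gene CYP2D6 07/17/2020", "gene CYP2C9 01/02/2021"]})

# str.extract needs at least one capturing group; the group's match is returned
df["one"] = df.A.str.extract(r"(\d{2}/\d{2}/\d{4})")
print(df["one"].tolist())  # ['07/17/2020', '01/02/2021']
```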
I'm struggling to remove the first part of the URLs in the myID column of my csv file.
my.csv
myID
https://mybrand.com/trigger:open?Myservice=Email&recipient=brn:zib:b1234567-9ee6-11b7-b4a2-7b8c2344daa8d
desired output for myID
b1234567-9ee6-11b7-b4a2-7b8c2344daa8d
my code:
df['myID'] = df['myID'].map(lambda x: x.lstrip('https://mybrand.com/trigger:open?Myservice=Email&recipient=brn:zib:'))
output in myID (the first letter 'b' is missing from the front of the string):
1234567-9ee6-11b7-b4a2-7b8c2344daa8d
The above code removes https://mybrand.com/trigger:open?Myservice=Email&recipient=brn:zib:. However, it also removes the first letter of the ID if it starts with a letter; if it starts with a number, it remains unchanged.
Could someone help with this? thanks!
You could try a regex replacement here:
df['myID'] = df['myID'].str.replace('^.*:', '', regex=True)
This approach simply removes all content from the start of myID up to, and including, the final colon, leaving behind just the UUID you want to keep.
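A minimal demonstration on the URL from the question:

```python
import pandas as pd

df = pd.DataFrame({"myID": [
    "https://mybrand.com/trigger:open?Myservice=Email&recipient=brn:zib:b1234567-9ee6-11b7-b4a2-7b8c2344daa8d"
]})

# ^.*: matches greedily, so everything up to the LAST colon is removed
df["myID"] = df["myID"].str.replace("^.*:", "", regex=True)
print(df["myID"].iloc[0])  # b1234567-9ee6-11b7-b4a2-7b8c2344daa8d
```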
With lstrip you remove all leading characters of a string that appear in the set of characters you pass as an argument (it strips a set of characters, not a literal prefix). So:
string = 'abcd'
test = string.lstrip('ad')
print(test)  # 'bcd': the leading 'a' is in the set and is stripped; 'b' stops the stripping
If you want to keep the last x characters of the string, you can just slice it like a list. For you, that would be something like:
df['myID'] = df['myID'].map(lambda x: x[-37:])
However, for this to work, the part you want to keep must have a constant size (here the ID is always 37 characters).
You can use re (if the part before what you want to extract is always the same):
import re
match = re.search(r':zib:', myID)
myNewID = myID[match.end():]
Then you will have :
myNewID
'b1234567-9ee6-11b7-b4a2-7b8c2344daa8d'
I tried to reference this SO answer: How to check if character exists in DataFrame cell
It gave a seemingly good solution, but it doesn't appear to work for the period character ".", which of course is the character I'm trying to filter on.
df_intials = df['Name'].str.contains('.')
Is there something specific about filtering through a dataframe where every value in the column has a "."?
When I convert to a list and write a simple function to append strings containing "." to it, it works correctly.
pd.Series.str.contains treats its pattern as a regular expression by default (where . matches any character), so you can either escape the dot with a backslash or pass the parameter regex=False:
Try
df_intials = df['Name'].str.contains(r'\.')
or
df_intials = df['Name'].str.contains('.', regex=False)
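A quick check of the difference, on hypothetical data:

```python
import pandas as pd

s = pd.Series(["J.R.", "John"])

# Unescaped '.' is a regex wildcard: it matches ANY character, so every
# non-empty string matches
print(s.str.contains(".").tolist())               # [True, True]
# Escaped dot (or regex=False) matches a literal period only
print(s.str.contains(r"\.").tolist())             # [True, False]
print(s.str.contains(".", regex=False).tolist())  # [True, False]
```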
I am having trouble removing a space at the beginning of strings in pandas DataFrame cells. Looking at the cells, it seems like there is a space at the start of each string; however, printing one of those cells shows "\x95 12345", so it is not a normal space character but rather "\x95".
I already tried strip(), but it didn't help.
The DataFrame was produced using a str.split(pat=',').tolist() expression, which split the strings into different cells by ',', so now my strings have this character at the start.
Assuming col1 is your first column name:
import re
df.col1 = df.col1.apply(lambda x: re.sub(r'\x95',"",x))
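A vectorized alternative with str.replace, followed by a strip to catch any remaining whitespace (the column name col1 and sample values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["\x9512345", "\x95 678"]})

# Remove the literal \x95 byte, then strip leftover leading/trailing whitespace
df["col1"] = df["col1"].str.replace("\x95", "", regex=False).str.strip()
print(df["col1"].tolist())  # ['12345', '678']
```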