I want to replace the string "Private room in house" with "Private" in a column in a dataframe
I have tried
df['room'] = df['room'].str.replace("Private[]","Private")
putting all the various regular expression characters in the [], but nothing works. All I have succeeded in doing is removing the space after Private.
I have looked at re.sub but haven't managed to get anything to work for me. I'm pretty new to Python, so this is probably a simple problem, but I can't find the answer anywhere.
You can use:
df['room'] = df['room'].str.replace('Private.*','Private', regex=True)
Or with a lookbehind:
df['room'] = df['room'].str.replace('(?<=Private).*', '', regex=True)
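For example, a quick sketch on a throwaway frame (the column name comes from the question; the sample values are made up):

import pandas as pd

df = pd.DataFrame({'room': ['Private room in house', 'Shared room']})
df['room'] = df['room'].str.replace('Private.*', 'Private', regex=True)
print(df['room'].tolist())  # ['Private', 'Shared room']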
I'm really sorry for asking, because there are similar questions around, but I can't adapt their answers to fix my problem.
These are the input lines (e.g. from a config file):
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign with a Python 3.9 regex.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected result (what Python's re.findall() should return) is
['share2', 'share8', 'shareSSH', 'share9']
Try the pattern profile\d+\.name=(.*); see the Regex 101 example.
import re

# txt holds the config lines shown above
re.findall(r'profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex; split should work absolutely fine, as sketched below.
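A minimal sketch of the split approach, assuming the lines are already read into a list of strings:

lines = [
    "profile2.name=share2",
    "profile8.name=share8",
    "profile4.name=shareSSH",
    "profile9.name=share9",
]
# split on the first '=' and keep the right-hand side
values = [line.split("=", 1)[1] for line in lines]
print(values)  # ['share2', 'share8', 'shareSSH', 'share9']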
Try removing the ? quantifier. As written, the lazy .*? at the very end of the pattern matches as little as possible, so your capture group matches an empty string (see regex101).
I feel like I have to apologize in advance for this one, but I've searched for answers and they seem to tell me what I'm doing is correct.
I'm trying to set a DataFrame column to True if another column has instances of a lowercase letter immediately followed by an uppercase letter.
What I tried was this:
cities['multiteam'] = cities['team'].apply(lambda x: pd.notna(re.search(r'[A][a]',x)))
That's setting all the results to False, so I figured maybe I was doing something wrong with my lambda function, and I made the following to debug just the re.search() part:
cities['multiteam'] = pd.notna(re.search(r'[a][A]','OneTwo'))
That's also setting all the results to False. And there I'm stuck.
The following code only looks for the letter 'A' immediately followed by the lowercase letter 'a'.
cities['multiteam'] = cities['team'].apply(lambda t: pd.notna(re.search(r'[A][a]',t)))
You need to change it if you want to check for all letters. For your stated goal (a lowercase letter immediately followed by an uppercase one), replace that line with something like this:
cities['multiteam'] = cities['team'].apply(lambda t: pd.notna(re.search(r'[a-z][A-Z]', t)))
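A quick sanity check of that version (team names made up for illustration):

import re
import pandas as pd

cities = pd.DataFrame({'team': ['GreenBay', 'New York']})
cities['multiteam'] = cities['team'].apply(lambda t: pd.notna(re.search(r'[a-z][A-Z]', t)))
print(cities['multiteam'].tolist())  # [True, False]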
You should never apologise for asking questions. Using apply is quite slow; try str.contains, which can accept a regex pattern.
cities.assign(multiteam=cities.team.str.contains('[a-z][A-Z]'))
The assign above is pandas' recommended way of adding columns.
str.contains works with both regex patterns and fixed strings, and it is much faster than apply.
The regex pattern above matches a character in the range a-z immediately followed by one in the range A-Z.
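For example, a quick sketch with made-up team names:

import pandas as pd

cities = pd.DataFrame({'team': ['GreenBay', 'New York', 'RedSox']})
out = cities.assign(multiteam=cities.team.str.contains('[a-z][A-Z]'))
print(out['multiteam'].tolist())  # [True, False, True]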
I'm working with Python 3.5 on Windows. I have a dataframe where a 'titles' column of str type contains headline titles, some of which have special characters such as â, €, ˜.
I am trying to replace these with an empty string '' using pandas.replace. I have tried various iterations and nothing works. I am able to replace regular characters, but these special characters just don't seem to work.
The code runs without error, but the replacement simply does not occur, and instead the original title is returned. Below is what I have tried already. Any advice would be much appreciated.
df['clean_title'] = df['titles'].replace('€','',regex=True)
df['clean_titles'] = df['titles'].replace('€','')
df['clean_titles'] = df['titles'].str.replace('€','')
def clean_text(row):
    return re.sub('€', '', str(row))
    # also tried:
    # return str(row).replace('€', '')
df['clean_title'] = df['titles'].apply(clean_text)
We can only assume that by 'special' characters you mean non-ASCII characters.
To remove all non-ASCII characters in a pandas dataframe column, do the following:
df['clean_titles'] = df['titles'].str.replace(r'[^\x00-\x7f]', '', regex=True)
Note that this is a scalable solution, as it works for any non-ASCII character.
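A quick sketch to verify (the sample titles here are made-up mojibake for illustration):

import pandas as pd

df = pd.DataFrame({'titles': ['Markets â€˜rallyâ€™ on news', 'Plain headline']})
df['clean_titles'] = df['titles'].str.replace(r'[^\x00-\x7f]', '', regex=True)
print(df['clean_titles'].tolist())  # ['Markets rally on news', 'Plain headline']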
How to remove escape sequence characters in a dataframe
Data:
product,rating
pest,<br> test
mouse,/ mousetest
Solution (Scala code). Note that .show() returns Unit, so assign the DataFrame first and show it separately:
val finaldf = df.withColumn("rating", regexp_replace(col("rating"), "\\\\", "/"))
finaldf.show()
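For comparison, a rough PySpark sketch of the same idea (it assumes an existing SparkSession and a DataFrame df with a rating column):

from pyspark.sql.functions import col, regexp_replace

# same replacement as the Scala line: swap literal backslashes for '/'
finaldf = df.withColumn("rating", regexp_replace(col("rating"), r"\\", "/"))
finaldf.show()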
I have a long string like this:
'[("He tended to be helpful, enthusiastic, and encouraging, even to studentsthat didn\'t have very much innate talent.\\n",), (\'Great instructor\\n\',), (\'He could always say something nice and was always helpful.\\n\',), (\'He knew what he was doing.\\n\',), (\'Likes art\\n\',), (\'He enjoys the classwork.\\n\',), (\'Good discussion of ideas\\n\',), (\'Open-minded\\n\',), (\'We learned stuff without having to take notes, we just applied it to what we were doing; made it an interesting and fun class.\\n\',), (\'Very kind, gave good insight on assignments\\n\',), (\' Really pushed me in what I can do; expanded how I thought about art, the materials used, and how it was visually.\\n\',)
and I want to remove all [, (, ", \, \n from this string at once. I can remove them one by one, but I always fail with '\n'. Is there any efficient way to remove or translate all of these characters or blank-line symbols at once?
Since my sentences are not long, I do not want to use dictionary methods like in earlier questions.
Maybe you could use a regex to find all the characters that you want to replace:
import re

s = s.strip()
# raw string needed: a non-raw "\\" collapses to one backslash and escapes the |
r = re.compile(r"\[|\(|\)|\]|\\|\"|'|,")
s = re.sub(r, '', s)
print(s.replace("\\n", ""))
I had some problems with the "\n", but it is easy to remove with a plain replace after the regex.
If the string is a valid Python expression, then you can use literal_eval from the ast module to transform the string into tuples, and after that you can process every tuple.
from ast import literal_eval
' '.join(el[0].strip() for el in literal_eval(your_string))
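For instance, with a short, well-formed string (the string in the question is truncated, so this one is made up):

from ast import literal_eval

your_string = "[('Great instructor\\n',), ('Likes art\\n',)]"
print(' '.join(el[0].strip() for el in literal_eval(your_string)))
# Great instructor Likes art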
If not, then you can use this:
import re

def get_part_string(your_string):
    for part in re.findall(r'\((.+?)\)', your_string):
        # drop literal \n sequences, quotes and backslashes
        yield re.sub(r'\\n|[\'"\\]', '', part).strip(', ')

''.join(get_part_string(your_string))
I was wondering if either of the following exists in Python:
A: non-regex equivalent of "re.findall()".
B: a way of neutralizing regex special characters in a variable before passing to findall().
I am passing a variable to re.findall, which runs into problems when the variable contains a period, a slash, a caret, etc., because I would like these characters to be interpreted literally. I realize it is not necessary to use regex to do this job, but I like the behavior of re.findall() because it returns a list of every match it finds. This allows me to easily count how many times the substring occurs by using len().
Here's an example of my code:
>>> substring_matches = re.findall(randomVariableOfCharacters, document_to_be_searched)
>>>
>>> # ^^ will return something like ['you', 'you', 'you']
>>> # but could also return something like ['end.', 'end.', 'ends']
>>> # if my variable is 'end.' because "." is a wildcard.
>>> # I would rather it return ['end.', 'end.']
>>>
>>> occurrences_of_substring = len(substring_matches)
I'm hoping to not have to use string.find(), if possible. Any help and/or advice is greatly appreciated!
You can use str.count() if you only want the number of occurrences, but it's not equivalent to re.findall(); it only gets the count.
document_to_be_searched = "blabla bla bla."
numOfOcur = document_to_be_searched.count("bl")
Sure: looking at your code, I think what you're looking for is str.count.
>>> 'abcdabc'.count('abc')
2
Note, however, that this is not an equivalent of re.findall, although it looks more appropriate in your case.
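For part B specifically, the standard library's re.escape neutralizes regex metacharacters in a variable before it reaches findall. A short sketch using the 'end.' example from the question:

import re

document_to_be_searched = "The end. The end. It ends."
pattern = re.escape('end.')  # '.' is escaped, so it matches a literal dot
substring_matches = re.findall(pattern, document_to_be_searched)
print(substring_matches)       # ['end.', 'end.']
print(len(substring_matches))  # 2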