How to find/replace string field in dataframe? - python

I want to replace string that ends in ".F" with ".DE"
I tried :
data["field"] = data.field.str.replace(".F",".DE")
but strangely it replaces "BFb.N" into ".DEb.N"
Any suggestions?

Pandas String Replace
You probably think ".F" means to look for .F and replace it, but that's not what's happening. It's not treating ".F" as a string, it's treating it as a regular expression, which is an entirely different thing. In a regular expression ".F" means any character plus F, which is why you are seeing this behavior. Staying w/the regular expression you can add a \ to your . to make it a literal period, such as "\.F" and it should solve your problem.
Alternatively you can tell pandas not to treat it as a regular expression and it should also solve your problem
data.field.str.replace(".F",".DE", regex=False)

Related

Remove specific combination of characters in dataframe colum?

I have the following issue where I have some data that has a specific combination of characters that I need to remove, example:
data_col
*.test1.934n
test1.tedsdh
*.test1.test.sdfsdf
jhsdakn
*.test2.test
What I need to remove is all the instances that exist for the "*." character combination in the dataframe. So far I've tried:
df['data_col'].str.replace('^*.','')
However when I run the code it gives me this error:
re.error: nothing to repeat at position 1
Any advise on how to fix this? Thanks in advance.
The default behaviour of .str.replace in pandas version 1.4.2 or earlier is to treat the replacememnt pattern as a regular expression. If you are using regular expressions to match characters with special meaning like * and . you have to escape them with backslashes:
df['data_col'].str.replace(r'^\*\.', '', regex=True)
Note that I used raw string literals to make sure that backslashes are treated as is. I also added regex=True, because otherwise pandas complains that in future it will not treat patterns as regex. Due to ^ at the beginning, this regex will only match the beginning of each string.
However, it is also possible that you don't need regular expressions in this particular case at all.
If you want to remove any instance of *. in your strings (not only the beginning ones), you can just do it with
df['data_col'].str.replace('*.', '', regex=False)
If you want to remove instance of *. only at the beginning of the string, you can use .removeprefix instead:
df['data_col'].str.removeprefix('*.')

How to replace '..' and '?.' with single periods and question marks in pandas? df['column'].str.replace not working

This is a follow up to this SO post which gives a solution to replace text in a string column
How to replace text in a column of a Pandas dataframe?
df['range'] = df['range'].str.replace(',','-')
However, this doesn't seem to work with double periods or a question mark followed by a period
testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()
results in
0 ...........e
1 .............
Name: strings, dtype: object
and
testDf['strings'].str.replace('?.', '?').head()
results in
error: nothing to repeat at position 0
Add regex=False parameter, because as you can see in the docs, regex it's by default True:
-regex bool, default True
Determines if assumes the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
And ? . are special characters in regular expressions.
So, one way to do it without regex will be this double replacing:
testDf['strings'].str.replace('..', '.',regex=False).str.replace('?.', '?',regex=False)
Output:
strings
0 this is a. test stence
1 for which is ? was a time
Replace using regular expression. In this case, replace any sepcial character '.' followed immediately by white space. This is abit curly, I advice you go with #Mark Reed answer.
testDf.replace(regex=r'([.](?=\s))', value=r'')
strings
0 this is a. test stence
1 for which is ? was a time
str.replace() works with a Regex where . is a special character which denotes "any" character. If you want a literal dot, you need to escape it: "\.". Same for other special Regex characters like ?.
First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.
A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.
The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:
testDF.replace(regex=r'([.?])\.', value=r'\1')
The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].
In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.
Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.
To replace both the ? and . at the same time you can separate by | (the regex OR operator).
testDf['strings'].str.replace('\?.|\..', '.')
Prefix the .. with a \, because you need to escape as . is a regex character:
testDf['strings'].str.replace('\..', '.')
You can do the same with the ?, which is another regex character.
testDf['strings'].str.replace('\?.', '.')

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?

parsing string with specific name in python

i have string like this
<name:john student male age=23 subject=\computer\sience_{20092973}>
i am confused ":","="
i want to parsing this string!
so i want to split to list like this
name:john
job:student
sex:male
age:23
subject:{20092973}
parsing string with specific name(name, job, sex.. etc) in python
i already searching... but i can't find.. sorry..
how can i this?
thank you.
It's generally a good idea to give more than one example of the strings you're trying to parse. But I'll take a guess. It looks like your format is pretty simple, and primarily whitespace-separated. It's simple enough that using regular expressions should work, like this, where line_to_parse is the string you want to parse:
import re
matchval = re.match("<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})", line_to_parse)
matchgroups = matchval.groups()
Now matchgroups will be a tuple of the values you want. It should be trivial for you to take those and get them into the desired format.
If you want to do many of these, it may be worth compiling the regular expression; take a look at the re documentation for more on this.
As for the way the expression works: I won't go into regular expressions in general (that's what the re docs are for) but in this case, we want to get a bunch of strings that don't have any whitespace in them, and have whitespace between them, and we want to do something odd with the subject, ignoring all the text except the part between { and }.
Each "(...)" in the expression saves whatever is inside it as a group. Each "\S+" stands for one or more ("+") characters that aren't whitespace ("\S"), so "(\S+)" will match and save a string of length at least one that has no whitespace in it. Each "\s+" does the opposite: it has not parentheses around it, so it doesn't save what it matches, and it matches at one or more ("+") whitespace characters ("\s"). This suffices for most of what we want. At the end, though, we need to deal with the subject. "[...]" allows us to list multiple types of characters. "[^...]" is special, and matches anything that isn't in there. {, like [, (, and so on, needs to be escaped to be normal in the string, so we escape it with \, and in the end, that means "[^{]*" matches zero or more ("*") characters that aren't "{" ("[^{]"). Since "*" and "+" are "greedy", and will try to match as much as they can and still have the expression match, we now only need to deal with the last part. From what I've talked about before, it should be pretty clear what "({\S+})" does.

Double letter matching using python regex

I am trying to find a way using regex to match words that have 3 unique sets of double letters. so far i have this:
r".*([a-z])\1.*([a-z])\2.*([a-z])\3.*"
But that doesn't account for unique sets for double letters. Thanks in advance =)
Maybe like this? Seems to work for me.
r".*([a-z])\1.*((?=(?!\1))[a-z])\2.*((?=(?!\1))(?=(?!\2))[a-z])\3.*"
(?=expr) is a non-consuming regular expression, and (?!expr) is regex NOT operator.
(?=expr) is a non-consuming regular expression
however
(?!expr) is also a non-consumer expression. This time a not equals in place of an equals.
So enclosing the 'not' in the 'equals' adds nothing. It works without that as well. However stacking non-consuming expressions does not always work, and a single non-consuming will do the job anyway by using an 'or' ('|' character).
so
r".([a-z])\1.(?!\1)([a-z])\2.(?!\1|\2)([a-z])\3."
Tidied some braces also. I think this is cleaner and will be more reliable between versions.

Categories