Unable to remove blanks/tabs from string - python

I am using Python to build the logic. Facing the challenge in removing the whitespaces in the string.
Actual code as below:
searchExprLineWithoutSpace.replace(" ", "").strip())
Even i have triedwith lstrip, rstrip but no luck. Actual String in this is as below:
"// SetSearchExpr(searchExpr);
ExecuteQuery();
var FR = FirstRecord();"
Not matter what I do, I am unable to remove the space from the first line:
If someone can steer as to what else I can do, it would help. Thanks.

If you want to remove all whitespace, you can use the generic paramter for str.split, then join it.
One thing I will note is that this string that you provided is incorrect, since it needs to be triple quote for multilines.
Anyways, if you want to remove the whitespace, you can do
white_space_to_remove = "hello world this is uneven whitespace"
white_space_to_remove = ''.join(white_space_to_remove.split())
This will give the string with no whitespace whatsoever, since the delimiter for str.join is just an empty string.
However, I'm unsure as to what you're trying to do with removing whitespace since your question was quite generic.
Note:
str.split without any parameters will by default split by any whitespace.

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regex: (?:...), (941){1} and the make regex literal character \ like this \(941\) with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to be done using regex. Though it might be useful for others too or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.

Remove extra spaces from a python string

I have a string which contains the following information.
mystring = "'$1$Not Running', ''"
I want to be able to remove the extra space and , '' after the Running. I tried to use strip() but it does not seem to work.
My desired output is mystring = "'$2$Not Running'"
I am not sure what I am missing here? Any help is appreciated.
One of the easier solutions would be to partition your string based on the comma:
mystring, comma, rest = mystring.partition(",")
This solution depends on there not being any commas in the string other than that one.
The better solution would be to figure out why the extra characters are in your string and what you can do to avoid it.
If that isn't possible, it looks like the string is valid Python, so you could parse it as a tuple and always pick the first element:
import ast
mystring, _ = ast.literal_eval(mystring)
Although in this case you would get what's inside the single quotes, not the single quotes as characters themselves.
i assume you want to remove the final 4 char's in your string. To do this you can simply
mystring = mystring[:-4]
if this is not right tell me and ill try to find a solution
strip() only removes spaces as the beginning and end of a string. Since what you want to remove is in the middle, it won't work for you.
You can use regular expressions to search and replace for specific strings:
import re
mystring = "'$1$Not Running', ''"
mynewstring = re.sub(", ''", "", mystring)
print(mynewstring)
# '$1$Not Running'
I'm not sure what extra space you're talking about, but you can use similar logic to replace it.
If this is literally the only thing you need it for, then some of the other answers might be simpler. If you need it for several different cases of input, this might be a better option. We'd need to see more examples of input to figure that out though.
Maybe there is something better but you can try to use split()
mynewstring = mystring.split()[0] + mystring.split()[1]
If the 4 characters you want to replace are ', ' then you can just use the string.replace() function to replace them with an empty string '':
mystring = mystring.replace( "', '", '')

Regex End of Line and Specific Chracters

So I'm writing a Python program that reads lines of serial data, and compares them to a dictionary of line codes to figure out which specific lines are being transmitted. I am attempting to use a Regular Expression in order to filter out the extra garbage line serial read string has on it, but I'm having a bit of an issue.
Every single code in my dictionary looks like this: T12F8B0A22**F8. The asterisks are the two alpha numeric pieces that differentiate each string code.
This is what I have so far as my regex: '/^T12F8B0A22[A-Z0-9]{2}F8$/'
I am getting a few errors with this however. My first error, is that there are some characters are the end of the string I still need to get rid of, which is odd because I thought $/ denoted the end of the line in regex. However when I run my code through the debugger I notice that after running through the following code:
#regexString contains the serial read line data
regexString = re.sub('/^T12F8B0A22[A-Z0-9]{2}F8$/', '', regexString)
My string looks something like this: 'T12F8B0A2200F8\\r'
I need to get rid of the \\r.
If for some reason I can't get rid of this with regex, how in python do you send specific string character through an argument? In this case I suppose it would be length - 3?
Your problem is threefold:
1) your string contains extra \r (Carriage Return character) before \n (New Line character); this is common in Windows and in network communication protocols; it is probably best to remove any trailing whitespace from your string:
regexString = regexString.rstrip()
2) as mentioned by Wiktor Stribiżew, your regexp is unnecessarily surrounded with / characters - some languages, like Perl, define regexp as a string delimited by / characters, but Python is not one of them;
3) your instruction using re.sub is actually replacing the matching part of regexString with an empty string - I believe this is the exact opposite of what you want (you want to keep the match and remove everything else, right?); that's why fixing the regexp makes things "even worse".
To summarize, I think you should use this instead of your current code:
m = re.match('T12F8B0A22[A-Z0-9]{2}F8', regexString)
regexString = m.group(0)
There are several ways to get rid of the "\r", but first a little analysis of your code :
1. the special charakter for the end is just '$' not '$\' in python.
2. re.sub will substitute the matched pattern with a string ( '' in your case) wich would substitute the string you want to get with an empty string and you are left with the //r
possible solutions:
use simple replace:
regexString.replace('\\r','')
if you want to stick to regex the approach is the same
pattern = '\\\\r'
match = re.sub(pattern, '',regexString)
2.2 if you want the acces the different groubs use re.search
match = re.search('(^T12F8B0A22[A-Z0-9]{2}F8)(.*)',regexString)
match.group(1) # will give you the T12...
match.groupe(2) # gives you the \\r
Just match what you want to find. Couple of examples:
import re
data = '''lots of
otherT12F8B0A2212F8garbage
T12F8B0A2234F8around
T12F8B0A22ABF8the
stringsT12F8B0A22CDF8
'''
print(re.findall('T12F8B0A22..F8',data))
['T12F8B0A2212F8', 'T12F8B0A2234F8', 'T12F8B0A22ABF8', 'T12F8B0A22CDF8']
m = re.search('T12F8B0A22..F8',data)
if m:
print(m.group(0))
T12F8B0A2212F8

parsing string with specific name in python

i have string like this
<name:john student male age=23 subject=\computer\sience_{20092973}>
i am confused ":","="
i want to parsing this string!
so i want to split to list like this
name:john
job:student
sex:male
age:23
subject:{20092973}
parsing string with specific name(name, job, sex.. etc) in python
i already searching... but i can't find.. sorry..
how can i this?
thank you.
It's generally a good idea to give more than one example of the strings you're trying to parse. But I'll take a guess. It looks like your format is pretty simple, and primarily whitespace-separated. It's simple enough that using regular expressions should work, like this, where line_to_parse is the string you want to parse:
import re
matchval = re.match("<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})", line_to_parse)
matchgroups = matchval.groups()
Now matchgroups will be a tuple of the values you want. It should be trivial for you to take those and get them into the desired format.
If you want to do many of these, it may be worth compiling the regular expression; take a look at the re documentation for more on this.
As for the way the expression works: I won't go into regular expressions in general (that's what the re docs are for) but in this case, we want to get a bunch of strings that don't have any whitespace in them, and have whitespace between them, and we want to do something odd with the subject, ignoring all the text except the part between { and }.
Each "(...)" in the expression saves whatever is inside it as a group. Each "\S+" stands for one or more ("+") characters that aren't whitespace ("\S"), so "(\S+)" will match and save a string of length at least one that has no whitespace in it. Each "\s+" does the opposite: it has not parentheses around it, so it doesn't save what it matches, and it matches at one or more ("+") whitespace characters ("\s"). This suffices for most of what we want. At the end, though, we need to deal with the subject. "[...]" allows us to list multiple types of characters. "[^...]" is special, and matches anything that isn't in there. {, like [, (, and so on, needs to be escaped to be normal in the string, so we escape it with \, and in the end, that means "[^{]*" matches zero or more ("*") characters that aren't "{" ("[^{]"). Since "*" and "+" are "greedy", and will try to match as much as they can and still have the expression match, we now only need to deal with the last part. From what I've talked about before, it should be pretty clear what "({\S+})" does.

Extracting parenthesis with a specific format with Python

I am fairly new to python so I apologies if this is quite a novice question, but I am trying to extract text from parentheses that has specific format from a raw text file.
I have tried this with regular expressions, but please let me know if their is a better method.
To show what I want to do by example:
s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"
From this string I want a result something like:
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
The regular expression I have tried so far is
"(\(.+[,] [0-9]{4}\))"
in conjunction with re.findall(), however this only gives me the result:
['(Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)']
So, as you may have guessed, I am trying to extract the bibliographic references from a .txt file. But I don't want to extract anything that happens to be in parentheses that is not a bibliographic reference.
Again, I apologies if this is novice, and again if there is a question like this out there already. I have searched, but no luck as yet.
Using [^()] instead of .. This will make sure there is no nested ().
>>> re.findall("(\([^()]+[,] [0-9]{4}\))", s)
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
Assuming that you will have no nested brackets, you could use something like so: (\([^()]+?, [0-9]{4}\)). This will match any non bracket character which is within a set of parenthesis which is followed by a comma, a white space four digits and a closing parenthesis.
I would suggest something like \(\w+,\s+[0-9]{4}\). A couple changes from your original:
Match word characters (letters/numbers/underscores) instead of any character in the source name.
Match one or more space characters after the comma, instead of limiting yourself to a single literal space.

Categories