regex to remove hyphens and spaces - python

I've got the string:
<u>40 -04-11</u>
How do I remove the spaces and hyphens so it returns 400411?
Currently I've got this:
(<u[^>]*>)(\-\s)(<\/u>)
But I can't figure out why it isn't working. Any insight would be appreciated.
Thanks

(<u[^>]*>)(\-\s)(<\/u>)
Your pattern above doesn't tell your regex where to expect numbers.
(<u[^>]*>)(?:-|\s|(\d+))*(<\/u>)
That should get you started, but not being a python guy, I can't give you the exact replacement syntax. Just be aware that the digits are in a repeating capture group.
Edit: This is an edit in response to your comment. Like I said, not a python guy, but this will probably do what you need if you hold your tongue just right.
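import re

def repl(matchobj):
    # Hyphens and whitespace leave group 1 unset: drop them
    if matchobj.group(1) is None:
        return ''
    else:
        # Digits were captured in group 1: keep them
        return matchobj.group(1)

source = '<u>40 -04-11</u>40 -04-11<u>40 -04-11</u>40 -04-11'
print(re.sub(r'(?:\-|\s|(\d+))(?=[^><]*?<\/u>)', repl, source))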
Results in:
>>>'<u>400411</u>40 -04-11<u>400411</u>40 -04-11'
If the above offends the Python deities, I promise to sacrifice the next PHP developer I come across. :)

You don't really need a regex, you could use :
>>> '<u>40 -04-11</u>'.replace('-','').replace(' ','')
'<u>400411</u>'

Using Perl syntax:
s{
    (<u[^>]*>) (.*?) (</u>)
}{
    my ($start, $body, $end) = ($1, $2, $3);
    $body =~ s/[-\s]//g;
    $start . $body . $end
}xesg;
Or if Python doesn't have an equivalent to /e,
my $out = '';
while (
    $in =~ m{
        \G (.*?)
        (?: (<u[^>]*>) (.*?) (</u>) | \z )
    }xsg
) {
    my ($pre, $start, $body, $end) = ($1, $2, $3, $4);
    $out .= $pre;
    if (defined($start)) {
        $body =~ s/[-\s]//g;
        $out .= $start . $body . $end;
    }
}
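For what it's worth, Python's counterpart to Perl's /e is passing a function as the replacement argument of re.sub. A minimal sketch of the same idea follows; the names U_TAG and strip_inside_u are made up for this example, and it assumes the <u> elements are not nested.

```python
import re

# Opening <u ...> tag, its body (non-greedy), and the closing tag.
U_TAG = re.compile(r"(<u[^>]*>)(.*?)(</u>)", re.DOTALL)

def strip_inside_u(text):
    """Remove hyphens and whitespace, but only between <u> and </u>."""
    def fix(match):
        start, body, end = match.groups()
        # Clean the body and leave both tags untouched.
        return start + re.sub(r"[-\s]", "", body) + end
    return U_TAG.sub(fix, text)

print(strip_inside_u("a - b <u>40 -04-11</u> c - d"))
# prints: a - b <u>400411</u> c - d
```

Text outside the tags is left alone because re.sub only rewrites the matched spans.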

I'm admittedly not very good at regexes, but the way I would do this is by:
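doing a match on a <u>...</u> pair, then
doing a re.sub on the text captured between the tags, using group().
That looks like this:
import re
example_str = "<u> 76-6-76s</u> 34243vvfv"
# Step 1: capture the text between the <u> and </u> tags
tmp = re.search(r"(<u[^>]*>)(.*?)(</u>)", example_str).group(2)
# Step 2: strip every non-digit character from that text
clean_str = re.sub(r"\D", "", tmp)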
>>>'76676'

You should state your problem more clearly; I didn't quite understand it at first.
Having read your comment (only between the <u> and </u> tags), I can now propose:
import re
ss = '87- 453- kol<u>40 -04-11</u> maa78-55 98 12'
print(re.sub('(?<=<u>).+?(?=</u>)',
             lambda mat: ''.join(c for c in mat.group() if c not in ' -'),
             ss))
Result:
87- 453- kol<u>400411</u> maa78-55 98 12

Related

How to delete alphanumeric words out of a Unicode file

I need to use a dictionary database, but most of it is useless alphanumeric stuff, and the interesting fields are either non-alphanumeric (such as Chinese characters) or inside brackets. I searched a lot and learned about tools like sed, awk, grep, etc. I even thought about writing a Python script to sort it out, but I never managed to find a solution.
A line of the database looks like this:
助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
I need it to be like this :
助 ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
How can I do this using any of the tools mentioned above?
Here is a Python solution if you would still like one:
import re
alpha_brack = re.compile(r"([a-zA-Z0-9.\-]+)|({.*?})")
my_string = """
助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367
DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4
Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"""
match = alpha_brack.findall(my_string)
new_string = my_string
for g0, _ in match:  # only care about first group!
    new_string = new_string.replace(g0, '', 1)  # replace only first occurrence!
final = re.sub(r'\s{2,}', ' ', new_string)  # finally, clean up whitespace
print(final)
My results:
'助ジョ たすける たすかる すける すけ {help} {rescue} {assist}'
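Note that the result above also loses the dots inside the Japanese readings, which the question wanted to keep. One possible way around that is to drop whole whitespace-separated tokens instead of individual characters. This is only a sketch: CODE_TOKEN and keep_interesting are names invented here, and it assumes everything from the first { onward should be kept verbatim.

```python
import re

# A token to drop: ASCII letters, digits, dots and hyphens only, with at
# least one letter or digit (so a lone "." would be kept).
CODE_TOKEN = re.compile(r"[A-Za-z0-9.\-]*[A-Za-z0-9][A-Za-z0-9.\-]*")

def keep_interesting(line):
    # Everything from the first "{" onward is kept untouched.
    head, brace, tail = line.partition("{")
    kept = [tok for tok in head.split() if not CODE_TOKEN.fullmatch(tok)]
    if brace:
        kept.append(brace + tail)
    return " ".join(kept)

sample = "助 L1782 DN1921 P1-5-2 I2g5.1 Yzhu4 Wjo ジョ たす.ける すけ {help} {rescue}"
print(keep_interesting(sample))
# prints: 助 ジョ たす.ける すけ {help} {rescue}
```

Because the bracketed part is never tokenized, spaces inside a bracket such as {test results found} survive as well.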
Personally, given your example line, I'd use sed to remove every run of alphanumeric characters that starts and ends with a space:
sed -E -i 's/ [a-zA-Z0-9 .-]+ / /g' should be close to what you need (-E enables extended regular expressions, where + is a quantifier). You may have to add more special characters if the text you're wiping out contains other things. This replaces each match in place with a single space (essentially deleting it).
No Linux box handy to verify this one... it may require a little massaging.
Also worth mentioning: this will not work if the brackets can contain two spaces, e.g. {test results found}, as it'll blow away the results.
Using perl:
perl -ne '
m/(.*?)({.*)/; # Split based on '{'
my $a=$1; my $b=$2;
$a =~ s/[[:alnum:]-.]//g; #Remove alphabets, numbers, '.', '-' (add more characters as you need.)
$a =~ s/ +/ /g; # Compress spaces.
print "$a $b\n"; #Print 2 parts and a newline
' dbfile.txt
Explanation in the inline comments.
Similar logic with sed:
sed '
h; #Save line in hold space.
s/{.*//; # Remove 2nd part
s/[a-zA-Z0-9.-]//g; # Remove all alphabets, numbers, . & -
s/  */ /g; # Compress spaces (one space followed by zero or more spaces)
x; #Save updated 1st part in hold space, take back the complete line in pattern space
s/[^{]*{/{/; #Remove first part
x; #Swap hold & pattern space again.
G; # Append 2nd part to first part separated by newline
s/\n//; # Remove newline.
' dbfile.txt
Using shell script (Bash):
#!/bin/bash
string="助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"
echo "" > tmpfield
for field in $string
do
    if [ "${field:0:1}" != "{" ]; then
        echo $field|sed "s/[a-zA-Z0-9 .-]/ /g" >> tmpfield
    else
        echo $field >> tmpfield
    fi
done
# join the lines of tmpfield into a single row
cat tmpfield | awk 'NF'|awk 'BEGIN { ORS = " " } { print }'
My output:
nampt#nampt-desktop:/mnt$ bash 1.bash
助 ジョ たす ける たす かる す ける すけ {help} {rescue} {assist}

Easy/Simple way to write switch-like regular expressions

I'm new to Python and wondering what the best way is to translate the Perl code below into Python:
if ($line =~ /(\d)/) {
    $a = $1
}
elsif ($line =~ /(\d\d)/) {
    $b = $1
}
elsif ($line =~ /(\d\d\d)/) {
    $c = $1
}
What I want to do is retrieve a specific part of each line within a large set of lines. In Python, all I can come up with is the following, which is very ugly:
res = re.search(r'(\d)', line)
if res:
    a = res.group(1)
else:
    res = re.search(r'(\d\d)', line)
    if res:
        b = res.group(1)
    else:
        res = re.search(r'(\d\d\d)', line)
        if res:
            c = res.group(1)
Does anyone know a better way to write the same thing without using a non-built-in module?
EDIT:
How would you write this if you needed to parse lines using very different regexes?
My point is that it should be simple enough that anyone can understand what the code is doing.
In perl, we can write:
if ($line =~ /^this is a sample line (.+) and contain single value$/) {
    $name = $1
}
elsif ($line =~ /^this is another sample: (.+):(.+) two values here$/) {
    ($address, $call) = ($1, $2)
}
elsif ($line =~ /^ahhhh thiiiss isiss (\d+) last sample line$/) {
    $description = $1
}
In my view, this kind of Perl code is very simple and easy to understand.
EDIT2:
I found the same discussion here:
http://bytes.com/topic/python/answers/750203-checking-string-against-multiple-patterns
So there's no way to write this in Python as simply as in Perl...
You could write yourself a helper function that stores the result of the match in an outer scope, so that you don't need to rerun the regex inside the if statement:
import re

def search(patt, string):
    # Stash the match object on the function itself so callers can read it
    search.result = re.search(patt, string)
    return search.result

if search(r'(\d)', line):
    a = search.result.group(1)
elif search(r'(\d\d)', line):
    b = search.result.group(1)
elif search(r'(\d\d\d)', line):
    c = search.result.group(1)
In Python 3.8 and later, you can use an assignment expression (the := operator):
if res := re.search(r'(\d)', line):
    a = res.group(1)
elif res := re.search(r'(\d\d)', line):
    b = res.group(1)
elif res := re.search(r'(\d\d\d)', line):
    c = res.group(1)
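For the "very different regexes" case from the question's edit, another option is a small rule table that is tried in order. The sketch below is one possible shape, not an established idiom: RULES and parse_line are hypothetical names, and the patterns are copied from the Perl example in the question.

```python
import re

# Hypothetical rule table: (compiled pattern, names for its captured groups).
RULES = [
    (re.compile(r"^this is a sample line (.+) and contain single value$"),
     ("name",)),
    (re.compile(r"^this is another sample: (.+):(.+) two values here$"),
     ("address", "call")),
    (re.compile(r"^ahhhh thiiiss isiss (\d+) last sample line$"),
     ("description",)),
]

def parse_line(line):
    """Return the fields captured by the first rule that matches, else {}."""
    for pattern, names in RULES:
        match = pattern.search(line)
        if match:
            return dict(zip(names, match.groups()))
    return {}

print(parse_line("this is another sample: Main St:555 two values here"))
# prints: {'address': 'Main St', 'call': '555'}
```

Adding a new line format then means adding one entry to the table rather than another elif branch.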
The order of the alternatives is very important. If you use the pattern (\d)|(\d\d)|(\d\d\d), the first group alone will match any digit, so the regex never gets to try the other two alternatives. Put the longest alternative first:
res = re.search(r'(\d\d\d)|(\d\d)|(\d)', line)
if res:
    a, b, c = res.group(3), res.group(2), res.group(1)
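A quick way to see the effect of the ordering is to compare both patterns on the same input. This is just an illustration; the sample line is made up.

```python
import re

line = "abc 123 def"

# Leftmost alternative wins: (\d) matches the first digit and the search stops.
short_first = re.search(r"(\d)|(\d\d)|(\d\d\d)", line)
print(short_first.groups())   # ('1', None, None)

# Longest alternative first: the three-digit branch gets its chance.
long_first = re.search(r"(\d\d\d)|(\d\d)|(\d)", line)
print(long_first.groups())    # ('123', None, None)

# lastindex reports which group (and therefore which branch) matched.
print(re.search(r"(\d\d\d)|(\d\d)|(\d)", "ab 12").lastindex)   # 2
```

Groups that did not take part in the match come back as None, so only one of a, b and c is ever set for a given line.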
Similar to Perl, except that it uses 'elif' instead of 'elsif', a ':' after the test, indentation instead of curly braces, and optional parentheses. There are many resources on the web that describe Python statements, and they are easy to find with a Google search.
if re.search(r'(\d)', line):
    a = re.search(r'(\d)', line).group(1)
elif re.search(r'(\d\d)', line):
    b = re.search(r'(\d\d)', line).group(1)
elif re.search(r'(\d\d\d)', line):
    c = re.search(r'(\d\d\d)', line).group(1)
Of course, the logic of the code is flawed, since 'b' and 'c' never get set, but I think this is the syntax you were looking for.

How do I do a range regex in Ruby like awk /start/,/stop/

I want to do an AWK-style range regex like this:
awk ' /hoststatus/,/\}/' file
In AWK this would print all the lines between the two patterns in a file:
hoststatus {
host_name=myhost
modified_attributes=0
check_command=check-host-alive
check_period=24x7
notification_period=workhours
check_interval=5.000000
retry_interval=1.000000
event_handler=
}
How do I do that in Ruby?
Bonus: How would you do it in Python?
This is really powerful in AWK, but I'm new to Ruby, and not sure how you'd do it. In Python I was also unable to find a solution.
Ruby:
str =
"drdxrdx
hoststatus {
host_name=myhost
modified_attributes=0
check_command=check-host-alive
check_period=24x7
notification_period=workhours
check_interval=5.000000
retry_interval=1.000000
event_handler=
}"
str.each_line do |line|
  print line if line =~ /hoststatus/..line =~ /\}/
end
This is the infamous flip-flop.
In Python, pass the MULTILINE and DOTALL flags to re. The ? following the * makes it non-greedy:
>>> import re
>>> with open('test.x') as f:
...     print(re.findall(r'^hoststatus.*?\n\}$', f.read(), re.DOTALL | re.MULTILINE))
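If you would rather process the file line by line, as awk and the Ruby flip-flop do, a small generator can track whether you are inside a range. This is a minimal sketch: awk_range is a made-up helper name, and it assumes the same semantics as awk, where the stop pattern is also tested on the line that opened the range.

```python
import re

def awk_range(lines, start, stop):
    """Yield lines from one matching `start` through the next matching `stop`."""
    inside = False
    for line in lines:
        if not inside and re.search(start, line):
            inside = True
        if inside:
            yield line
            # The closing line is part of the range, as in awk.
            if re.search(stop, line):
                inside = False

lines = ["junk\n", "hoststatus {\n", "host_name=myhost\n", "}\n", "after\n"]
print("".join(awk_range(lines, r"hoststatus", r"\}")), end="")
```

A file object can be passed in place of the list, since iterating over it yields lines, and the file never has to be read into memory at once.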

Python: Regex question / CSV parsing / Psycopg nested arrays

I'm having trouble parsing nested arrays returned by Psycopg2. The DB I'm working on returns records that can have nested arrays as values. Psycopg only parses the outer array of such values.
My first approach was splitting the string on commas, but then I ran into the problem that sometimes a string within the result also contains commas, which renders the entire approach unusable.
My next attempt was using a regex to find the "components" within the string, but then I noticed I wasn't able to detect numbers (since numbers can also occur within strings).
Currently, this is my code:
import re
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile('\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?$')
result = r.search(text)
if result:
    result = result.groups()
The result of this should be:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']
Since I would like this functionality to be generic, I cannot be certain of the order of the arguments. I only know that the supported types are strings, UUIDs, (signed) integers and (signed) decimals.
Am I using a wrong approach? Or can anyone point me in the right direction?
Thanks in advance!
Python's standard csv module should do a good job here. Have you tried it already?
http://docs.python.org/library/csv.html
From your sample, it looks something like ^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,|}))+(?<=})|})$ to me. That's not perfect since it would allow "{foo,bar}baz}", but it could be fixed if that matters to you.
If you can do ASSERTIONS, this will get you on the right track.
This problem is too extensive to be done in a single regex. You are trying to validate and parse at the same time in a global match, but your intended result requires sub-processing after the match. For that reason, it's better to write a simpler global parser, then iterate over the results for validation and fixup (yes, you have fixup stipulated in your example).
The two main parsing regexes are these:
This one strips the delimiter quotes too, and only $2 contains data; use it in a while loop, in global context:
/(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/
my preferred one, does not strip quotes, only captures $1, can use to capture in an array or in a while loop, global context
/(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/
This is an example of post processing (in Perl) with a documented regex: (edit: fix append trailing ,)
use strict; use warnings;
my $str = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}';
my $rx = qr/ (?!}$) (?:^{?|,) \s* ( ".*?" | .*?) \s* (?=,|}$) /x;
my $rxExpanded = qr/
    (?!}$)            # ASSERT ahead: NOT a } plus end
    (?:^{?|,)         # Boundary: Start of string plus { OR comma
    \s*               # 0 or more whitespace
    ( ".*?" | .*?)    # Capture "Quoted" or non quoted data
    \s*               # 0 or more whitespace
    (?=,|}$)          # Boundary ASSERT ahead: Comma OR } plus end
/x;
my ($newstring, $sucess) = ('[', 0);
for my $field ($str =~ /$rx/g)
{
    my $tmp = $field;
    $sucess = 1;
    if ( $tmp =~ s/^"|"$//g || $tmp =~ /(?:[a-f0-9]+-){3,}/ ) {
        $tmp = "'$tmp'";
    }
    $newstring .= "$tmp,";
}
if ( $sucess ) {
    $newstring =~ s/,$//;
    $newstring .= ']';
    print $newstring,"\n";
}
else {
    print "Invalid string!\n";
}
Output:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e','Marc, Dirk en Koen',398547,85.5,-9.2,'62fe6393-00f7-418d-b0b3-7116f6d5cf10']
It seemed that the CSV approach was the easiest to implement:
import csv
import io

def parsePsycopgSQLArray(value):
    # Drop the enclosing braces, then let csv deal with the quoted commas
    value = value.strip("{")
    value = value.strip("}")
    buffer = io.StringIO(value)
    reader = csv.reader(buffer, delimiter=',', quotechar='"')
    return next(reader)  # There can only be one row

if __name__ == "__main__":
    text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
    result = parsePsycopgSQLArray(text)
    print(result)
Thanks for the responses, they were most helpful!
Improved upon Dirk's answer. This handles escape characters better as well as the empty array case. One less strip call as well:
import csv
from io import StringIO

def restore_str_array(val):
    """
    Converts a postgres formatted string array (as a string) to python
    :param val: postgres string array
    :return: python array with values as strings
    """
    val = val.strip("{}")
    if not val:
        return []
    reader = csv.reader(StringIO(val), delimiter=',', quotechar='"', escapechar='\\')
    return next(reader)
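Both csv-based versions return every value as a string, while the question asked for integers and decimals as numbers. One possible follow-up step is to try numeric conversions on each field. This is a sketch under assumptions: convert and parse_pg_array are names invented here, and any field that looks numeric is converted, even if it was quoted in the source.

```python
import csv
from io import StringIO

def convert(field):
    """Try int, then float; otherwise return the stripped string."""
    field = field.strip()
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field

def parse_pg_array(text):
    inner = text.strip("{}")
    if not inner:
        return []
    # csv keeps the quoted "Marc, Dirk en Koen" together as one field.
    row = next(csv.reader(StringIO(inner), delimiter=",", quotechar='"'))
    return [convert(f) for f in row]

text = ('{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",'
        '398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}')
print(parse_pg_array(text))
```

Be aware that float() also accepts strings such as "nan" or "1e5", so a stricter check would be needed if such values can appear as plain text.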

Regular expression to match C's multiline preprocessor statements

What I need is to match multiline preprocessor statements such as:
#define max(a,b) \
({ typeof (a) _a = (a); \
typeof (b) _b = (b); \
_a > _b ? _a : _b; })
The point is to match everything between #define and last }), but I still can't figure out how to write the regexp. I need it to make it work in Python, using "re" module.
Could somebody help me please?
Thanks
This should do it:
r'(?m)^#define (?:.*\\\r?\n)*.*$'
(?:.*\\\r?\n)* matches zero or more lines ending with backslashes, then .*$ matches the final line.
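Here is a small check of that pattern against the macro from the question. It's only a sketch: the surrounding int x; and int y; lines are made up to show where the match stops.

```python
import re

# (?m) makes ^ and $ match at line boundaries.
DEFINE = re.compile(r"(?m)^#define (?:.*\\\r?\n)*.*$")

source = (
    "int x;\n"
    "#define max(a,b) \\\n"
    "  ({ typeof (a) _a = (a); \\\n"
    "     typeof (b) _b = (b); \\\n"
    "     _a > _b ? _a : _b; })\n"
    "int y;\n"
)

macros = DEFINE.findall(source)
for macro in macros:
    print(macro)
```

The match covers the four lines of the macro and stops before int y;, because the last macro line does not end with a backslash.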
I think something like this will work:
m = re.compile(r"^#define[\s\S]+?}\)*$", re.MULTILINE)
matches = m.findall(your_string_here)
This assumes that your macros all end with '}', with an optional ')' at the end.
I think the solution above might not work for:
#define MACRO_ABC(abc, djhg) \
do { \
int i; \
/*
* multi line comment
*/ \
(int)i; \
} while(0);
