How to delete alphanumeric words out of a Unicode file - python

I need to use a dictionary database, but most of it is alphanumeric junk I don't need; the interesting fields are either non-alphanumeric (such as Chinese characters) or inside brackets. I searched a lot and learned about tools like sed, awk, grep, etc. I even thought about writing a Python script to sort it out, but I never managed to find a solution.
A line of the database looks like this:
助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
I need it to be like this:
助 ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
How can I do this using any of the tools mentioned above?

Here is a Python solution if you would still like one:
import re
alpha_brack = re.compile(r"([a-zA-Z0-9.\-]+)|({.*?})")
my_string = """
助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367
DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4
Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"""
match = alpha_brack.findall(my_string)
new_string = my_string
for g0, _ in match:  # only care about the first group!
    new_string = new_string.replace(g0, '', 1)  # replace only the first occurrence!
final = re.sub(r'\s{2,}', ' ', new_string)  # finally, clean up whitespace
print(final)
My results:
'助ジョ たすける たすかる すける すけ {help} {rescue} {assist}'

Personally, given your example line, I'd sed out every run of alphanumeric characters that starts and ends with a space:
sed -i -E 's/ [a-zA-Z0-9 .-]+ / /g' dbfile.txt
should be close to what you need (the -E is needed so that + is treated as a quantifier). You may have to add more special characters if the text you're wiping out contains other things. This is an in-place substitution of each run with a single space (essentially deleting it).
No Linux box handy to verify this one... it may require a little massaging.
Also worth mentioning, this will not work if the brackets can contain two spaces: {test results found}, as it'll blow away the "results".
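The same filter is a few lines of Python, for what it's worth. A minimal sketch of the idea, assuming (per the caveat above) that brace groups never contain spaces and that every field worth keeping contains at least one non-ASCII character (clean_line is a hypothetical name):
def clean_line(line):
    # Keep a field only if it is a {...} group or contains non-ASCII text.
    kept = [field for field in line.split()
            if field.startswith('{') or any(ord(ch) > 127 for ch in field)]
    return ' '.join(kept)
On the sample line this yields 助 ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}, with the dots in the readings intact.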

Using perl:
perl -ne '
    m/(.*?)({.*)/;            # Split the line at the first {
    my $a=$1; my $b=$2;
    $a =~ s/[[:alnum:].-]//g; # Remove letters, digits, dots and hyphens (add more characters as you need)
    $a =~ s/ +/ /g;           # Compress spaces
    print "$a $b\n";          # Print the 2 parts and a newline
' dbfile.txt
Explanation in the inline comments.
Similar logic with sed:
sed '
    h;                  # Save the line in hold space
    s/{.*//;            # Remove the 2nd part
    s/[a-zA-Z0-9.-]//g; # Remove all letters, digits, dots and hyphens
    s/  */ /g;          # Compress spaces
    x;                  # Save the updated 1st part in hold space, take back the complete line in pattern space
    s/[^{]*{/{/;        # Remove the first part
    x;                  # Swap hold and pattern space again
    G;                  # Append the 2nd part to the first, separated by a newline
    s/\n//;             # Remove the newline
' dbfile.txt
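If you end up in Python anyway, the same split-at-the-first-{ logic is a short function. A rough sketch (note it drops the dots inside the readings, just like the perl and sed versions above):
import re

def clean(line):
    head, sep, tail = line.partition('{')
    head = re.sub(r'[a-zA-Z0-9.-]+', '', head)  # remove letters, digits, dots and hyphens
    head = re.sub(r'\s+', ' ', head)            # compress spaces
    return head + sep + tail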

Using shell script (Bash):
#!/bin/bash
string="助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"
echo "" > tmpfield
for field in $string
do
    if [ "${field:0:1}" != "{" ]; then
        echo $field | sed "s/[a-zA-Z0-9 .-]/ /g" >> tmpfield
    else
        echo $field >> tmpfield
    fi
done
#convert rows to one column
cat tmpfield | awk 'NF'|awk 'BEGIN { ORS = " " } { print }'
My output:
nampt#nampt-desktop:/mnt$ bash 1.bash
助 ジョ たす ける たす かる す ける すけ {help} {rescue} {assist}

Related

Remove "function calls" with alphanumeric names and backslashes in a string with Python

I'm reading some strings from a file such as this one:
s = "Ab [word] 123 \test[abc] hi \abc [] a \command123[there\hello[www]]!"
which should be transformed into
"Ab [word] 123 abc hi \abc [] a therewww!"
Another example is
s = "\ human[[[rr] \[A] r \B[] r p\[]q \A[x\B[C]!"
which should be transformed into
"\ human[[[rr] A r r pq \A[xC!"
How can you generalize this to all similar "functions" with alphanumeric names? By "function" I mean a pattern such as \name[arg] where name is a (possibly empty) alphanumeric string and arg is a (possibly empty) arbitrary string.
Update: After reading kcsquared's comments, I looked through the input files and found stray brackets and backslashes, so I've updated my examples accordingly. The previous regex solution (see below) breaks completely for these special cases:
s = re.sub(r'\\command123\[([^}]*)\]', ' \\1', s)
s = re.sub(r'\test\[([^}]*)\]', ' \\1', s) # Fails if this substitution is executed first
s = " ".join(s.split())
Use an array and push and pop the strings onto it, as if it were a stack.
Scan the string character by character and interpret it as you go; don't use regex.
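A minimal sketch of that stack-based scan (strip_functions is a hypothetical name; it replaces each \name[arg] with arg, handles nesting, and restores any unclosed call literally):
import re

def strip_functions(s):
    stack = []  # saved (text_so_far, matched_header) frames
    buf = []    # characters of the current output buffer
    i = 0
    while i < len(s):
        m = re.match(r'\\[a-zA-Z0-9]*\[', s[i:]) if s[i] == '\\' else None
        if m:                        # \name[ opens a "function call"
            stack.append((''.join(buf), m.group(0)))
            buf = []
            i += m.end()
        elif s[i] == ']' and stack:  # ] closes the innermost call: keep only its arg
            prefix, _header = stack.pop()
            buf = list(prefix) + buf
            i += 1
        else:                        # everything else is literal
            buf.append(s[i])
            i += 1
    while stack:                     # unclosed calls are restored verbatim
        prefix, header = stack.pop()
        buf = list(prefix) + list(header) + buf
    return ' '.join(''.join(buf).split())  # normalize whitespace, as in the question's last step

print(strip_functions("Ab [word] 123 \\test[abc] hi \\abc [] a \\command123[there\\hello[www]]!"))
# Ab [word] 123 abc hi \abc [] a therewww!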

sed to python replace extra delimiters in a

sed 's/\t/_tab_/3g'
I have a sed command that basically replaces all excess tab delimiters in my text document.
My documents are supposed to be 3 columns, but occasionally there's an extra delimiter. I don't have control over the files.
I use the above command to clean up the document. However all my other operations on these files are in python. Is there a way to do the above sed command in python?
sample input:
Column1 Column2 Column3
James 1,203.33 comment1
Mike -3,434.09 testing testing 123
Sarah 1,343,342.23 there here
sample output:
Column1 Column2 Column3
James 1,203.33 comment1
Mike -3,434.09 testing_tab_testing_tab_123
Sarah 1,343,342.23 there_tab_here
You may read the file line by line, split on tabs, and if there are more than 3 items, keep the first two and join everything from the third item on with _tab_:
lines = []
with open('inputfile.txt', 'r') as fr:
    for line in fr:
        split = line.split('\t')
        if len(split) > 3:
            tmp = split[:2]                      # Slice off the first two items
            tmp.append("_tab_".join(split[2:]))  # Append the rest joined with _tab_
            lines.append("\t".join(tmp))         # Use the updated line
        else:
            lines.append(line)                   # Else, keep the line as is
The lines variable will contain something like
Mike -3,434.09 testing_tab_testing_tab_123
Mike -3,434.09 testing_tab_256
No operation here
import os
os.system("sed -i 's/\t/_tab_/3g' " + file_path)
Does this work? Note the -i argument to the sed command above, which makes it modify the input file in place.
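If you prefer not to build a shell string, the same call can go through subprocess (a sketch; assumes GNU sed, which understands both -i and \t, and the same file_path variable):
import subprocess

# Run the same in-place substitution without involving a shell.
subprocess.call(['sed', '-i', r's/\t/_tab_/3g', file_path])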
You can mimic the sed behavior in python:
import re
pattern = re.compile(r'\t')
string = 'Mike\t3,434.09\ttesting\ttesting\t123'
replacement = '_tab_'
count = -1
spans = []
start = 2 # Starting index of matches to replace (0 based)
for match in re.finditer(pattern, string):
    count += 1
    if count >= start:
        spans.append(match.span())
spans.reverse()
new_str = string
for sp in spans:
    new_str = new_str[0:sp[0]] + replacement + new_str[sp[1]:]
And now new_str is 'Mike\t3,434.09\ttesting_tab_testing_tab_123'.
You can wrap it in a function and repeat for every line.
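For instance, a wrapper along these lines (replace_from_nth is a hypothetical name; it mimics GNU sed's s///Ng by replacing the N-th and every later match):
import re

def replace_from_nth(string, pattern, replacement, n):
    matches = list(re.finditer(pattern, string))
    for m in reversed(matches[n - 1:]):  # right-to-left so earlier spans stay valid
        string = string[:m.start()] + replacement + string[m.end():]
    return string

with open('inputfile.txt') as fr:
    lines = [replace_from_nth(line, r'\t', '_tab_', 3) for line in fr]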
However, note that this GNU sed behavior isn't standard:
'NUMBER'
     Only replace the NUMBERth match of the REGEXP.

     Note: the POSIX standard does not specify what should happen when
     you mix the 'g' and NUMBER modifiers, and currently there is no
     widely agreed upon meaning across 'sed' implementations. For GNU
     'sed', the interaction is defined to be: ignore matches before the
     NUMBERth, and then match and replace all matches from the NUMBERth on.

Python : splitting string with multiple characters

I have the following input:
"auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]"
I need to store them in a list with $, <, and > serving as delimiters.
Expected output:
['auth-server', '$na me$', '$1n ame$', '[position', '[$pr io$]]', 'xxxx', '[match-fqdn', '[[$fq dn$]', '[all]]]']
How can I do this?
What you could do is split it on the spaces, then go through each substring and check whether it starts with one of the special delimiters. If it does, start a new string and append the subsequent substrings until you reach the end delimiter, then replace those substrings with the combined one.
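A sketch of that idea, using $ as both the opening and closing delimiter (join_delimited is a hypothetical name; a piece containing an odd number of $ is taken to open a span):
def join_delimited(text, delim='$'):
    out, pending = [], None
    for piece in text.split():
        if pending is not None:            # inside an open $... span
            pending += ' ' + piece
            if delim in piece:             # this piece closes the span
                out.append(pending)
                pending = None
        elif piece.count(delim) % 2 == 1:  # this piece opens a span
            pending = piece
        else:
            out.append(piece)
    if pending is not None:                # unterminated span: keep as-is
        out.append(pending)
    return out

print(join_delimited("auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]"))
# ['auth-server', '$na me$', '$1n ame$', '[position', '[$pr io$]]', 'xxxx', '[match-fqdn', '[[$fq dn$]', '[all]]]']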
I think what you want is
import re
re.split(r"(?<=\]) | (?=\$|\[)", "auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]")
This yields
['auth-server', '$na me$', '$1n ame$', '[position', '[$pr io$]]', 'xxxx', '[match-fqdn', '[[$fq dn$]', '[all]]]']
Note however that this is not exactly what you described, but what matches your example. It seems that you want to split on spaces when they are preceded by ] or followed by $ or [.
Try re.split with a regex that will make someone cry tears of blood:
import re
print re.split(r'(\$[^\$]+\$|\[\S+([^\]]+\]\])?|[-0-9a-zA-Z]+)',"auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]")
consider using pyparsing:
from pyparsing import *

data = "auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]"
enclosed = Forward()
nestedBrackets = nestedExpr('[', ']')
enclosed << ( Combine(Group(Optional('$') + Word(alphas) + Optional('$'))) | nestedBrackets )
print enclosed.parseString(data).asList()
output:
[['auth-server', '$na', 'me$', '$1n', 'ame$', ['position', ['$pr', 'io$']], 'xxxx',
['match-fqdn', [['$fq', 'dn$'], ['all']]]]]
Not quite a full answer, but I used regexp search...
a = "auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]"
m = re.search('\$.*\$', a)
combine this with a.split() and we can do the math...

regex to remove hyphens and spaces

I've got the string:
<u>40 -04-11</u>
How do I remove the spaces and hyphens so it returns 400411?
Currently I've got this:
(<u[^>]*>)(\-\s)(<\/u>)
But I can't figure out why it isn't working. Any insight would be appreciated.
Thanks
(<u[^>]*>)(\-\s)(<\/u>)
Your pattern above doesn't tell your regex where to expect numbers.
(<u[^>]*>)(?:-|\s|(\d+))*(<\/u>)
That should get you started, but not being a python guy, I can't give you the exact replacement syntax. Just be aware that the digits are in a repeating capture group.
Edit: This is an edit in response to your comment. Like I said, not a python guy, but this will probably do what you need if you hold your tongue just right.
def repl(matchobj):
    if matchobj.group(1) is None:
        return ''
    else:
        return matchobj.group(1)

source = '<u>40 -04-11</u>40 -04-11<u>40 -04-11</u>40 -04-11'
print re.sub(r'(?:\-|\s|(\d+))(?=[^><]*?<\/u>)', repl, source)
Results in:
>>>'<u>400411</u>40 -04-11<u>400411</u>40 -04-11'
If the above offends the Python deities, I promise to sacrifice the next PHP developer I come across. :)
You don't really need a regex, you could use :
>>> '<u>40 -04-11</u>'.replace('-','').replace(' ','')
'<u>400411</u>'
Using Perl syntax:
s{
    (<u[^>]*>) (.*?) (</u>)
}{
    my ($start, $body, $end) = ($1, $2, $3);
    $body =~ s/[-\s]//g;
    $start . $body . $end
}xesg;
Or if Python doesn't have an equivalent to /e,
my $out = '';
while (
    $in =~ m{
        \G (.*?)
        (?: (<u[^>]*>) (.*?) (</u>) | \z )
    }sg
) {
    my ($pre, $start, $body, $end) = ($1, $2, $3, $4);
    $out .= $pre;
    if (defined($start)) {
        $body =~ s/[-\s]//g;
        $out .= $start . $body . $end;
    }
}
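For the record, Python does have an equivalent of /e: re.sub accepts a function as the replacement. A minimal sketch of the same strip-inside-the-tags idea (squash is a hypothetical name):
import re

def squash(m):
    # m.group(2) is the text between the tags; drop hyphens and whitespace there.
    return m.group(1) + re.sub(r'[-\s]', '', m.group(2)) + m.group(3)

s = '<u>40 -04-11</u>40 -04-11'
print(re.sub(r'(<u[^>]*>)(.*?)(</u>)', squash, s, flags=re.S))
# <u>400411</u>40 -04-11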
I'm admittedly not very good at regexes, but the way I would do this is by:
Doing a match on a <u>...</u> pair
doing a re.sub on the bit between the match using group().
That looks like this:
import re

example_str = "<u> 76-6-76s</u> 34243vvfv"
tmp = re.search(r"(<u[^>]*>)(.*?)(</u>)", example_str).group(2)
clean_str = re.sub(r"\D", "", tmp)
>>>'76676'
You should state your problem correctly; at first I didn't exactly understand it.
Having read your comment (you want this only between the <u> and </u> tags), I can now propose:
import re
ss = '87- 453- kol<u>40 -04-11</u> maa78-55 98 12'
print re.sub('(?<=<u>).+?(?=</u>)',
             lambda mat: ''.join(c for c in mat.group() if c not in ' -'),
             ss)
result
87- 453- kol<u>400411</u> maa78-55 98 12

Python: Regex question / CSV parsing / Psycopg nested arrays

I'm having trouble parsing nested arrays returned by Psycopg2. The DB I'm working on returns records that can have nested arrays as values. Psycopg only parses the outer array of such values.
My first approach was splitting the string on commas, but then I ran into the problem that sometimes a string within the result also contains commas, which renders the entire approach unusable.
My next attempt was using regex to find the "components" within the string, but then I noticed I wasn't able to detect numbers (since numbers can also occur within strings).
Currently, this is my code:
import re
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile('\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?$')
result = r.search(text)
if result:
    result = result.groups()
The result of this should be:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']
Since I would like to have this functionality generic, I cannot be certain of the order of arguments. I only know that the types that are supported are strings, uuid's, (signed) integers and (signed) decimals.
Am I using a wrong approach? Or can anyone point me in the right direction?
Thanks in advance!
Python's native csv lib should do a good job. Have you tried it already?
http://docs.python.org/library/csv.html
From your sample, it looks something like ^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,|}))+(?<=})|})$ to me. That's not perfect since it would allow "{foo,bar}baz}", but it could be fixed if that matters to you.
If you can do ASSERTIONS, this will get you on the right track.
This problem is too extensive to be done in a single regex. You are trying to validate and parse at the same time in a global match. But your intended result requires sub-processing after the match. For that reason, it's better to write a simpler global parser, then iterate over the results for validation and fixup (yes, you have fixup stipulated in your example).
The two main parsing regexes are these:
This one strips the delimiting quotes too; only $2 contains data. Use in a while loop, in global context:
/(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/
My preferred one; it does not strip the quotes and only captures $1. It can be used to capture into an array or in a while loop, in global context:
/(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/
This is an example of post processing (in Perl) with a documented regex: (edit: fix append trailing ,)
use strict; use warnings;
my $str = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}';
my $rx = qr/ (?!}$) (?:^{?|,) \s* ( ".*?" | .*?) \s* (?=,|}$) /x;
my $rxExpanded = qr/
(?!}$) # ASSERT ahead: NOT a } plus end
(?:^{?|,) # Boundry: Start of string plus { OR comma
\s* # 0 or more whitespace
( ".*?" | .*?) # Capture "Quoted" or non quoted data
\s* # 0 or more whitespace
(?=,|}$) # Boundry ASSERT ahead: Comma OR } plus end
/x;
my ($newstring, $success) = ('[', 0);
for my $field ($str =~ /$rx/g)
{
    my $tmp = $field;
    $success = 1;
    if ( $tmp =~ s/^"|"$//g || $tmp =~ /(?:[a-f0-9]+-){3,}/ ) {
        $tmp = "'$tmp'";
    }
    $newstring .= "$tmp,";
}
if ( $success ) {
    $newstring =~ s/,$//;
    $newstring .= ']';
    print $newstring,"\n";
}
else {
    print "Invalid string!\n";
}
Output:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e','Marc, Dirk en Koen',398547,85.5,-9.2,'62fe6393-00f7-418d-b0b3-7116f6d5cf10']
It seemed that the CSV approach was the easiest to implement:
def parsePsycopgSQLArray(input):
    import csv
    import cStringIO

    input = input.strip("{")
    input = input.strip("}")
    buffer = cStringIO.StringIO(input)
    reader = csv.reader(buffer, delimiter=',', quotechar='"')
    return reader.next()  # There can only be one row

if __name__ == "__main__":
    text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
    result = parsePsycopgSQLArray(text)
    print result
Thanks for the responses, they were most helpful!
Improved upon Dirk's answer. This handles escape characters better, as well as the empty-array case, and uses one less strip call:
import csv
from StringIO import StringIO

def restore_str_array(val):
    """
    Converts a postgres formatted string array (as a string) to python
    :param val: postgres string array
    :return: python array with values as strings
    """
    val = val.strip("{}")
    if not val:
        return []
    reader = csv.reader(StringIO(val), delimiter=',', quotechar='"', escapechar='\\')
    return reader.next()
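A quick check of the helper above (Python 2, to match the reader.next() call; illustrative values only):
print restore_str_array('{"Marc, Dirk en Koen",398547,85.5,-9.2}')
# ['Marc, Dirk en Koen', '398547', '85.5', '-9.2']
print restore_str_array('{}')
# []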
