I am trying to remove non-printable characters from some string variables as I read in a text file. If I use the re.sub call below, it doesn't work: the \x.. characters are not removed.
test1 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
test2 = re.sub('\\\\x(?:\d\d|\w\w|\d\w|\w\d)', '', test1)
But if I take the value from test1 and pass it to re.sub as a "raw" string, it works perfectly:
test2 = re.sub('\\\\x(?:\d\d|\w\w|\d\w|\w\d)', '', r'ing record \xac\xd0\x81\xb4\x02\n2018 Apr')
test2 is then 'ing record \n2018 Apr'
I was hoping to easily convert test1 in the first example into a raw string, but from my searching this doesn't seem easy or even possible. I am looking for a solution that lets me use re.sub to remove these characters from a str variable. Or is there a way to convert my str variable into a raw string first?
UPDATE FIX:
I ended up having to do a lot of conversions to remove the unwanted hex codes but keep my newlines. This works, but I'm not sure if there is a cleaner method out there.
test33 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
test44 = re.sub('\\\\x(?:\d\d|\w\w|\d\w|\w\d)', '', test33.encode('unicode-escape').decode("utf-8"))
test66 = test44.encode().decode('unicode-escape')
print(test66)
ing record
2018 Apr
If your string is purely ASCII you could try:
import re
import string
test33 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
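# re.escape stops the backslash in string.printable from being read as an escape inside the class
print(re.sub('[^{0}]'.format(re.escape(string.printable)), '', test33))
Or use the Unicode solution provided in "Stripping non printable characters from a string in python".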
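It may help to spell out why the first attempt fails: in a normal (non-raw) literal, '\xac' is a single character, so the string never contains a backslash for the pattern to find. The following is a minimal sketch, assuming only printable ASCII plus newlines should survive (this also drops legitimate accented characters), that removes everything else in one pass:

```python
import re

test1 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'

# In a normal literal the escape is already resolved to one character;
# in a raw literal it stays as four characters: \ x a c
print(len('\xac'))   # 1
print(len(r'\xac'))  # 4

# Keep only the printable ASCII range (space through ~) and the newline.
# Here the raw string passes \x20, \x7e and \n on to the regex engine,
# which interprets them itself.
cleaned = re.sub(r'[^\x20-\x7e\n]', '', test1)
print(repr(cleaned))  # 'ing record \n2018 Apr'
```

No unicode-escape round trip is needed with this approach, because the pattern matches the characters themselves rather than their escaped spelling.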
Related
I'm trying to replace special characters in a data frame with unaccented or different ones.
I can replace one with
df['col_name'] = df.col_name.str.replace('?','j')
This turned the '?' into 'j', but I can't seem to figure out how to change more than one.
I have a list of special characters that I want to change. I've tried using a dictionary, but it doesn't seem to work:
the_reps = {'?','j'}
df1 = df.replace(the_reps, regex = True)
this gave me the error nothing to replace at position 0
EDIT:
this is what worked - although it is probably not that pretty:
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')
df[col]=df.col.str.replace('old char','new char')...
for each one ..
import re
s=re.sub("[_list of special characters_]","",_your string goes here_)
print(s)
An example for this..
import re
text = "Hello$#& Python3$"  # avoid naming a variable str, which shadows the built-in
s = re.sub("[$#&]", "", text)
print(s)
# Output: Hello Python3
Explanation:
s = re.sub("[$#&]", "", text)
Pattern to be replaced → “[$#&]”
[] used to indicate a set of characters
[$#&] → will match either $ or # or &
The replacement string is given as an empty string
If these characters are found in the string, they’ll be replaced with an empty string
You can use Series.replace with a dictionary:
#d = { 'actual character ':'replacement ',...}
df.columns = df.columns.to_series().replace(d, regex=True)
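Note that {'?','j'} in the question is a set, not a dictionary; a mapping needs a colon between key and value. As a rough sketch of the one-pass idea outside pandas, single-character replacements can be collected in a dict and applied with str.translate. The mapping and sample strings below are made up for illustration; in pandas the same table should be usable through Series.str.translate.

```python
# Hypothetical mapping of unwanted characters to their replacements.
# '\u00e9' is e-acute and '\u00f1' is n-tilde.
the_reps = {'?': 'j', '\u00e9': 'e', '\u00f1': 'n'}

# str.maketrans accepts a dict whose keys are single characters.
table = str.maketrans(the_reps)

# Hypothetical sample values standing in for a DataFrame column.
names = ['?o?o', 'se\u00f1or', 'caf\u00e9']
cleaned = [name.translate(table) for name in names]
print(cleaned)  # ['jojo', 'senor', 'cafe']
```

This avoids regular expressions altogether, so characters such as '?' need no escaping.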
Try This:
import re
my_str = "hello Fayzan-Bhatti Ho~!w"
my_new_string = re.sub(r'[^a-zA-Z0-9 \n.]', '', my_str)
print(my_new_string)
Output: hello FayzanBhatti How
I have a bunch of regular expressions that I am using to scrape a lot of specific fields from a text document. They all work fine when used directly inside the Python script.
But I thought of putting them in a YAML file and reading from there. Here's how it looks:
# Document file for Regular expression patterns for a company invoice
---
issuer: ABCCorp
fields:
  invoice_number: INVOICE\s*(\S+)
  invoice_date: INVOICE DATE\s*(\S+)
  cusotmer_id: CUSTOMER ID\s*(\S+)
  origin: ORIGIN\s*(.*)ETD
  destination: DESTINATION\s*(.*)ETA
  sub_total: SUBTOTAL\s*(\S+)
  add_gst: SUBTOTAL\s*(\S+)
  total_cost: TOTAL USD\s*(\S+)
  description_breakdown: (?s)(DESCRIPTION\s*GST IN USD\s*.+?TOTAL CHARGES)
  package_details_fields: (?s)(WEIGHT\s*VOLUME\s*.+?FLIGHT|ROAD REFERENCE)
  mawb_hawb: (?s)((FLIGHT|ROAD REFERENCE).*(MAWB|MASTER BILL)\s*.+?GOODS COLLECTED FROM)
When I load it with PyYAML in Python, each value comes back with string quotes around it (which is OK, as I can add r'' later), but I also see extra \ characters inside the regex. That would break the regex when it is used in code.
import os
import yaml
with open(os.path.join(TEMPLATES_DIR, "regex_template.yml")) as f:
    my_dict = yaml.safe_load(f)
print(my_dict)
{'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)', 'cusotmer_id': 'CUSTOMER ID\\s*(\\S+)', 'origin': 'ORIGIN\\s*(.*)ETD', 'destination': 'DESTINATION\\s*(.*)ETA', 'sub_total': 'SUBTOTAL\\s*(\\S+)', 'add_gst': 'SUBTOTAL\\s*(\\S+)', 'total_cost': 'TOTAL USD\\s*(\\S+)', 'description_breakdown': '(?s)(DESCRIPTION\\s*GST IN USD\\s*.+?TOTAL CHARGES)', 'package_details_fields': '(?s)(WEIGHT\\s*VOLUME\\s*.+?FLIGHT|ROAD REFERENCE)', 'mawb_hawb'
How do I read the regex exactly as I have it in the YAML file? Does every string written in a YAML file get quotation marks around it when read in Python, because it is a string?
EDIT:
The main regex in yaml file is:
INVOICE\s*(\S+)
Output in dict is:
'INVOICE\\s*(\\S+)'
This is too long to do as a comment.
The backslash character is used to escape special characters. For example:
'\n': newline
'\a': alarm
When you use it before a letter that has no special meaning it is just taken to be a backslash character:
'\s': backslash followed by 's'
But to be sure, whenever you want to enter a backslash character in a string and not have it interpreted as the start of an escape sequence, you double it up:
'\\s': also a backslash followed by a 's'
'\\a': a backslash followed by a 'a'
If you use a r'' type literal, then a backslash is never interpreted as the start of an escape sequence:
r'\a': a backslash followed by 'a' (not an alarm character)
r'\n': a backslash followed by 'n' (not a newline; however, when used in a regex, it will match a newline)
Now here is the punchline:
When you print out these Python objects, such as:
d = {'x': 'ab\sd'}
print(d)
Python prints the representation of the dictionary, so the string inside it is shown as:
'ab\\sd'
If you just did:
print('ab\sd')
you would see ab\sd. Quite a difference.
Why the difference? See if this makes sense:
d = {'x': 'ab\ncd'}
print(d)
print('ab\ncd')
Results:
{'x': 'ab\ncd'}
ab
cd
The bottom line is that when you print a Python object other than a string, it prints a representation of the object showing how you would have created it. And if the object contains a string and that string contains a backslash, you would have doubled up on that backslash when entering it.
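The point above can be checked directly. This is a small sketch, using a made-up string rather than the real my_dict, showing that the doubled backslash exists only in the printed representation and not in the string itself:

```python
s = 'ab\\sd'          # written with a doubled backslash in the source

print(len(s))         # 5 characters: a, b, \, s, d
print(s)              # ab\sd
print(repr(s))        # 'ab\\sd'

d = {'x': s}
print(d)              # {'x': 'ab\\sd'}  (the dict shows the repr of its values)
print(d['x'])         # ab\sd

# A raw literal spells the same five characters with a single backslash.
same = (r'ab\sd' == s)
print(same)           # True
```

So a value that prints as 'INVOICE\\s*(\\S+)' inside a dictionary holds single backslashes and can be handed to re as it is.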
Update
To process your my_dict: Since you did not provide the complete value of my_dict, I can only use a truncated version for demo purposes. But this will demonstrate that my_dict has perfectly good regular expressions:
import re
my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}
fields = my_dict['fields']
invoice_number_re = fields['invoice_number']
m = re.search(invoice_number_re, 'blah-blah INVOICE 12345 blah-blah')
print(m[1])
Prints:
12345
If you are going to be using the same regular expressions over and over again, then it is best to compile them:
import re
my_dict = {'issuer': 'ABCCorp', 'fields': {'invoice_number': 'INVOICE\\s*(\\S+)', 'invoice_date': 'INVOICE DATE\\s*(\\S+)'}}
#compile the strings to regular expressions
fields = my_dict['fields']
for k, v in fields.items():
    fields[k] = re.compile(v)
invoice_number_re = fields['invoice_number']
m = invoice_number_re.search('blah-blah INVOICE 12345 blah-blah')
print(m[1])
I would like to use regular expressions on bytestrings in Python whose encoding I know (utf-8). I am having difficulty with character classes that involve characters encoded as more than one byte. They appear to become two or more 'characters' that are matched separately in the character class.
Performing the search on (unicode) strings instead is possible, but I would like to know if there is a solution to defining character classes for the case of bytestrings as well. Maybe it's just not possible!?
Below is a python 3 example that shows what happens when I try to replace different line breaks with '\n':
import re
def show_pattern(pattern):
    print(f"\nPattern repr:\t{repr(pattern)}")
def test_sub(pattern, replacement, text):
    print(f"Before repr:\t{repr(text)}")
    result = re.sub(pattern, replacement, text)
    print(f"After repr:\t{repr(result)}")
# Pattern for line breaks
PATTERN = '[' + "\u000A\u000B\u000C\u000D\u0085\u2028\u2029" + ']'
REPLACEMENT = '\n'
TEXT = "How should I replace my unicode string\u2028using utf-8-encoded bytes?"
show_pattern(PATTERN)
test_sub(PATTERN, REPLACEMENT, TEXT)
# expected output:
# Pattern repr: '[\n\x0b\x0c\r\x85\u2028\u2029]'
# Before repr: 'How should I replace my unicode string\u2028using utf-8-encoded bytes?'
# After repr: 'How should I replace my unicode string\nusing utf-8-encoded bytes?'
ENCODED_PATTERN = PATTERN.encode('utf-8')
ENCODED_REPLACEMENT = REPLACEMENT.encode('utf-8')
ENCODED_TEXT = TEXT.encode('utf-8')
show_pattern(ENCODED_PATTERN)
test_sub(ENCODED_PATTERN, ENCODED_REPLACEMENT, ENCODED_TEXT)
# expected output:
# Pattern repr: b'[\n\x0b\x0c\r\xc2\x85\xe2\x80\xa8\xe2\x80\xa9]'
# Before repr: b'How should I replace my unicode string\xe2\x80\xa8using utf-8-encoded bytes?'
# After repr: b'How should I replace my unicode string\n\n\nusing utf-8-encoded bytes?'
In the encoded version, I end up with three '\n' characters instead of one, because each of the three bytes of the encoded U+2028 is matched separately. Similar things happen for a more complicated document where it's not obvious what the correct output should be.
You may use an alternation based pattern rather than a character class, as you will want to match sequences of bytes:
PATTERN = "|".join(['\u000A','\u000B','\u000C','\u000D','\u0085','\u2028','\u2029'])
See the online demo.
If you prefer to initialize the pattern from a string use
CHARS = "\u000A\u000B\u000C\u000D\u0085\u2028\u2029"
PATTERN = "|".join(CHARS)
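Applied to the bytes case from the question, the alternation idea might look like the sketch below. It assumes each character is encoded to utf-8 first and then passed through re.escape, which is a precaution added here in case the list ever contains regex metacharacters; it is not part of the answer above.

```python
import re

CHARS = "\u000A\u000B\u000C\u000D\u0085\u2028\u2029"

# Encode each character separately, then join the byte sequences with |
# so that a multi-byte character is matched as one unit.
ENCODED_PATTERN = b"|".join(re.escape(c.encode("utf-8")) for c in CHARS)

TEXT = "How should I replace my unicode string\u2028using utf-8-encoded bytes?"
ENCODED_TEXT = TEXT.encode("utf-8")

result = re.sub(ENCODED_PATTERN, b"\n", ENCODED_TEXT)
print(repr(result))
# b'How should I replace my unicode string\nusing utf-8-encoded bytes?'
```

The three bytes of U+2028 are now consumed by a single alternative, so only one b'\n' is inserted.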
I have a number of strings from which I am trying to remove characters using replace. However, this doesn't seem to work. To give a simplified example, this code:
row = "b'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38'"
row = row.replace("b'", "").replace("'", "").replace('b"', '').replace('"', '')
print(row.encode('ascii', errors='ignore'))
still outputs b'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38', whereas I would like it to output James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38. How can I do this?
Edit: Updated the code with a better example.
You seem to have the b and the single quote in the wrong order. Simply replace 'b:
>>> row = "xyz'b"
>>> row.replace("'b", "")
'xyz'
As an alternative to str.replace, you can simply slice the string to remove the unwanted leading and trailing characters:
>>> row[2:-1]
'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38'
In your first .replace, change b' to 'b. Hence your code should be:
>>> row = "xyz'b"
>>> row = row.replace("'b", "").replace("'", "").replace('b"', '').replace('"', '')
# ^ changed here
>>> print(row.encode('ascii', errors='ignore'))
xyz
I am assuming the rest of the conditions you have are part of other tasks/matches that you didn't mention here.
If all you want is to take the string before the first ', then you may just do:
row.split("'")[0]
You haven't listed this to remove 'b:
.replace("'b", '')
import ast
row = "b'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38'"
b_string = ast.literal_eval(row)
print(b_string)
u_string = b_string.decode('utf-8')
print(u_string)
out:
b_string:b'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38'
u_string: James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38
The real question is how to convert a string to a Python object.
You have a string that contains the literal form of a bytes object. To convert it to an actual bytes object you could use eval(), but ast.literal_eval() is a safer way to do it.
Now that you have a bytes object, you can convert it to a str, which does not start with "b", by using decode().
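One more observation on the code in the question, offered as a sketch rather than a definitive diagnosis: under Python 3 the replace chain there does strip the b' prefix, but the final encode() call turns the result back into a bytes object, and printing bytes shows the b'...' form again. Printing the str itself avoids that:

```python
row = "b'James Bray,/citations?user=8IqSrdIAAAAJ&hl=en&oe=ASCII,1985,6020,188.12,42,1.31,76,2.38'"

# The replace calls work on the str and remove the prefix and the quotes.
cleaned = row.replace("b'", "").replace("'", "")
print(cleaned)       # no b'...' wrapper here

# encode() returns bytes, and printing bytes shows its b'...' representation.
encoded = cleaned.encode('ascii', errors='ignore')
print(encoded)       # starts with b' again

# Decoding gives back a plain str.
print(encoded.decode('ascii'))
```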
I have the following code:
import re
key = re.escape('#one #two #some #tests #are #done')
print(key)
key = key.split()
print(key)
and the following output:
\#one\ \#two\ \#some\ \#tests\ \#are\ \#done
['\\#one\\', '\\#two\\', '\\#some\\', '\\#tests\\', '\\#are\\', '\\#done']
How come the backslashes are duplicated? I just want them once in my list, because I would like to use this list in a regular expression.
Thanks in advance! John
There is only one backslash each, but when printing the repr of the strings, they are duplicated (escaped) - just as you would need to duplicate them when using a string to build a regex. So everything is fine.
For example:
>>> len("\\")
1
>>> len("\\n")
2
>>> len("\n")
1
>>> print("\\n")
\n
>>> print("\n")


>>>
The \ character is an escape character, that is a character that changes the meaning of the subsequent character[s]. For example the "n" character is simply an "n". But if you escape it like "\n" it becomes the "newline" character. So, if you need to use a \ literal, you need to escape it with... itself: \\
The backslashes are not duplicated. To realize this, try to do:
for element in key:
    print(element)
And you will see this output:
\#one\
\#two\
\#some\
\#tests\
\#are\
\#done
When you print the whole list, Python uses the representation in which strings are shown not as they are but as Python expressions (notice the quotes; they are not part of the strings).
To write a string literal containing a backslash, you need to double that backslash. That is all.
When you convert a list to a string (e.g. to print it), it calls repr on each object contained in the list. That's why you get the quotes and extra backslashes in your second line of output. Try this:
s = "\\a string with an escaped backslash"
print(s) # prints: \a string with an escaped backslash
print(repr(s)) # prints: '\\a string with an escaped backslash'
The repr call puts quotes around the string, and shows the backslash escapes.
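A separate point that may matter when the list is later used in a regular expression: escaping first and splitting afterwards leaves a trailing backslash on most pieces (the one that escaped the space), which is not a valid pattern on its own. The following sketch, assuming the goal is to match any one of the tags, splits first and escapes each word afterwards:

```python
import re

text = '#one #two #some #tests #are #done'

# Escape, then split: the backslash that escaped each space stays attached.
escaped_then_split = re.escape(text).split()
print(escaped_then_split[0])        # \#one\
print(len(escaped_then_split[0]))   # 6, so each backslash is a single character

# Split, then escape: every piece is a valid pattern by itself.
split_then_escaped = [re.escape(word) for word in text.split()]
pattern = re.compile('|'.join(split_then_escaped))

matches = pattern.findall('some #tests are #done')
print(matches)                      # ['#tests', '#done']
```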