Deleting Non-ASCII *lines* from a file? - python

Is there a way I can remove non-ascii lines (not characters) from a file? So given something like this:
Line 1 (full ASCII character set)
Line 2 (contains unicode characters)
Line 3 (full ASCII)
Line 4 (contains unicode characters)
I want:
Line 1
Line 3
I know I can use iconv to remove non-ASCII characters, but I want to delete any line that contains non-ASCII characters. Is there a utility/pythonic way to do this?

If you want to eliminate lines that contain any non-ascii characters:
def ascii_lines(iterable):
    for line in iterable:
        if all(ord(ch) < 128 for ch in line):
            yield line

f = open('somefile.txt')
for line in ascii_lines(f):
    print line
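On Python 3.7+, the same filter can be written with str.isascii(); a minimal sketch:
def ascii_lines(iterable):
    for line in iterable:
        if line.isascii():  # True when every character is below U+0080
            yield line

with open('somefile.txt') as f:
    for line in ascii_lines(f):
        print(line)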

Given a string like the following:
>>> s = "asd\n\xaa\xfa\xaf\nqwe"
>>> print s
asd
╙З╞
qwe
You may simply filter it by your criteria:
>>> s = filter(lambda x: ord(x) < 128, s)
>>> s
'asd\n\nqwe'
>>> print s
asd
qwe
You can also achieve the same result by converting to unicode:
>>> str(s.decode('ascii', 'ignore'))
'asd\n\nqwe'
To remove empty lines I'd use re.sub('\n+', '\n', s).
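For example:
>>> import re
>>> re.sub('\n+', '\n', 'asd\n\nqwe')
'asd\nqwe'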

In practice you'll want to do something with the data, and need to parse it further. If your file test looks like
http://example.com dog
http://example.com/å%20ä%20ö/ foo
http://google.com bar
A pyparsing script would remove the bad lines like so
from pyparsing import *
ParserElement.setDefaultWhitespaceChars(" \t")
EOL = LineEnd()
ascii = u''.join(unichr(x) for x in xrange(33,127))
words = Word(ascii)
good_line = Group(ZeroOrMore(words) + EOL)
bad_line = SkipTo(EOL,include=True)
blocks = good_line | bad_line.suppress()
grammar = ZeroOrMore(blocks) + StringEnd()
P = grammar.parseFile("test")
print P
Which would give as output:
[['http://example.com', 'dog', '\n'], ['http://google.com', 'bar']]
The advantage over the other methods (which work fine, and answer the question) is that you now have a nice parse tree to further manipulate the data. The idea is to write a grammar, not a parser, for any task that has the potential to become more complicated than it first appears.
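For instance, with the parse results above you can pull fields out of each surviving line directly (assuming, as in the sample file, that every good line holds a URL followed by a label):
for tokens in P:
    url, label = tokens[0], tokens[1]
    print url, label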

# fin and fout are assumed to be already-open input and output file objects
for line in fin:
    try:
        fout.write(line.encode('ASCII'))
    except UnicodeDecodeError:  # Python 2: encode() implicitly decodes the str first
        pass
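On Python 3, str.encode never implicitly decodes, so the exception to catch would be UnicodeEncodeError; a minimal sketch of the same idea (file names are illustrative):
with open('input.txt') as fin, open('ascii_only.txt', 'w') as fout:
    for line in fin:
        try:
            line.encode('ascii')  # raises UnicodeEncodeError if any char is non-ASCII
        except UnicodeEncodeError:
            continue  # skip the line
        fout.write(line)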

LC_ALL=C grep -v $'[^\t\r -~]'
grep -v prints all lines that don't match the pattern. LC_ALL=C sets the locale to "C". $'[^\t\r -~]' is a pattern that, in the C locale, means "contains a character that is not a horizontal tab, a carriage return, a space, or an ASCII glyphic character". ($'...' is a Bash notation: it's equivalent to '...', except that it processes backslash-escapes like \t and \r. [^...] is a "negative character class", meaning "any character that isn't listed in ...". Inside a character class, - matches a range: in this case, the range from space to tilde. The C locale is necessary to make sense of this "range".)
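For example, to keep only the ASCII lines of a file (file names are illustrative):
LC_ALL=C grep -v $'[^\t\r -~]' input.txt > ascii_only.txt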

Related

Split string if separator is not in-between two characters

I want to write a script that reads from a csv file and splits each line by comma except any commas in-between two specific characters.
In the code snippet below I would like to split line by commas, except for the commas that fall inside a $...$ field.
line = "$abc,def$,$ghi$,$jkl,mno$"
output = line.split(',')
for o in output:
print(o)
How do I write output = line.split(',') so that I get the following terminal output?
~$ python script.py
$abc,def$
$ghi$
$jkl,mno$
You can do this with a regular expression:
In re, (?<=\$) is a lookbehind: it matches at a position immediately preceded by a $.
Similarly, (?=\$) is a lookahead: it matches at a position immediately followed by a $.
The field-separating commas are exactly the ones with a $ on both sides, so to split only on those you can use:
expression = r"(?<=\$),(?=\$)"
Full program:
import re
expression = r"(?<=\$),(?=\$)"
print(re.split(expression, "$abc,def$,$ghi$,$jkl,mno$"))
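This prints:
['$abc,def$', '$ghi$', '$jkl,mno$']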
One solution (maybe not the most elegant but it will work) is to replace the string $,$ with something like $,,$ and then split ,,. So something like this
output = line.replace('$,$','$,,$').split(',,')
Using regex like mousetail suggested is the more elegant and robust solution but requires knowing regex (not that anyone KNOWS regex)
Try regular expressions:
import re
line = "$abc,def$,$ghi$,$jkl,mno$"
output = re.findall(r"\$(.*?)\$", line)
for o in output:
print('$'+o+'$')
$abc,def$
$ghi$
$jkl,mno$
First, you can identify a character that is not used in that line:
c = chr(max(map(ord, line)) + 1)
Then, you can proceed as follows:
line.replace('$,$', f'${c}$').split(c)
Here is your example:
>>> line = '$abc,def$,$ghi$,$jkl,mno$'
>>> c = chr(max(map(ord, line)) + 1)
>>> result = line.replace('$,$', f'${c}$').split(c)
>>> print(*result, sep='\n')
$abc,def$
$ghi$
$jkl,mno$

Python prevent decoding HEX to ASCII while removing backslashes from my Var

I want to strip some unwanted symbols from my variable. In this case the symbols are backslashes. I am using a HEX number, and as an example I will show some short simple code below. But I don't want Python to convert my HEX to ASCII; how would I prevent this from happening? I have some long shell codes for asm to work with later, and removing \ by hand is a slow process. I know there are different ways, like using echo -e "x\x\x\x" > output etc., but my whole script will be written in python.
Thanks
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> b = a.strip("\\")
>>> print b
1�Phtv
>>> a = "\x31\x32\x33\x34\x35\x36"
>>> b = a.strip("\\")
>>> print b
123456
At the end I would like it to print my var:
>>> print b
x31x32x33x34x35x36
There are no backslashes in your variable:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(a)
1ÀPhtv
Take newline for example: writing "\n" in Python gives you a string with one character -- a newline -- and no backslashes. See the string literals docs for the full syntax of these.
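For example:
>>> len("\n"), len(r"\n")
(1, 2)
>>> "\x31" == "1"
True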
Now, if you really want to write string with such backslashes, you can do it with r modifier:
>>> a = r"\x31\xC0\x50\x68\x74\x76"
>>> print(a)
\x31\xC0\x50\x68\x74\x76
>>> print(a.replace('\\', ''))
x31xC0x50x68x74x76
But if you want to convert a regular string to hex-coded symbols, you can do it character by character: convert each character to a number ("\x31" == "1" --> 49), then to hex ("0x31"), and finally strip the first character:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(''.join([hex(ord(x))[1:] for x in a]))
x31xc0x50x68x74x76
There are two problems in your code.
First the simple one:
strip() only removes characters from the beginning and end of the string. Use replace("\\", "") instead: it replaces every backslash with "", which is the same as removing it.
The second problem is Python's behavior with backslashes:
To get your example working you need to prepend an r to your string literal to indicate that it is a raw string: a = r"\x31\xC0\x50\x68\x74\x76". In a raw string, a backslash doesn't escape the next character but just stays a backslash.
>>> r"\x31\xC0\x50\x68\x74\x76"
'\\x31\\xC0\\x50\\x68\\x74\\x76'

How do I efficiently replace the last line in a string?

I have a multi-line string, and I would like to replace the last line of the string with a different line. How do I most efficiently do this?
Split on the last linebreak and attach a new line:
new = old.rstrip('\n').rsplit('\n', 1)[0] + '\nNew line to be added with line break included.'
This first removes any trailing newline after the last line, splits once on the last newline in the text, takes everything before that last newline, and concatenates the result with a new newline and text.
Demo:
>>> old = '''The quick
... brown fox jumps
... over the lazy
... dog
... '''
>>> old.rstrip('\n').rsplit('\n', 1)[0] + '\nhorse and rider'
'The quick\nbrown fox jumps\nover the lazy\nhorse and rider'
This presumes that your lines are separated by \n characters; reading text files in text mode gives you such data on any platform.
If you are dealing with data with different line endings, adjust accordingly. In such cases os.linesep can come in useful.
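For instance, an illustrative sketch of the same splice with an explicit Windows-style separator (assuming the data was read without newline translation, so the '\r\n' endings are intact):
sep = '\r\n'
new = old.rstrip(sep).rsplit(sep, 1)[0] + sep + 'New last line'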
I would suggest this approach:
>>> x = """
... test1
... test2
... test3"""
>>> print "\n".join(x.splitlines()[:-1]+["something else"])
test1
test2
something else
>>>
You can accomplish this using a simple regular expression.
import re
new_string = re.sub(r'[^\n]*\n?$', "replacement", "existing\nstring", count=1)
It matches the last line of the string (everything after the final \n) and replaces it with the replacement string. The count=1 keeps re.sub from also substituting the zero-width match left at the very end of the string, which Python 3.7+ would otherwise replace a second time.
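For example:
>>> import re
>>> re.sub(r'[^\n]*\n?$', 'replacement', 'existing\nstring', count=1)
'existing\nreplacement'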

python regular expression to match strings

I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droid1lwp
string2: 1.1
Because there are multiple uses-permission, I want to get permission as a list, contains:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the python regular expression to get the strings I want?
Thanks.
Assuming the code block you provided is one long string, here stored in a variable called input_string:
import re
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): a lookbehind that only allows matches immediately preceded by name='. The \ in front of = and ' escapes them, so the regex treats them as literal characters rather than regex syntax. name=' is not returned as part of the result; we just know that every result we get was preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character or underscore. \. is an escaped period, so the regex knows we mean a literal . and not the "any character" command an unescaped period represents. Putting these in [] means we'll accept any alphanumeric character, _, or .. The + afterwards means at least one (but possibly more) of [\w\.]. Finally, the ? means don't be greedy: take the smallest possible match that meets these specifications, since + could otherwise keep consuming characters matched by [\w\.].
(?=\'): a lookahead that only allows matches immediately followed by '. The \ is again an escape, since otherwise regex or Python's string parsing might misinterpret '. This final ' is not returned with the result; we just know that in the original string it followed whatever we matched.
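With the sample text stored in input_string, these produce:
>>> name
'jp.tjkapp.droid1lwp'
>>> versionName
'1.1'
>>> permissions
['WRITE_APN_SETTINGS', 'RECEIVE_BOOT_COMPLETED', 'ACCESS_NETWORK_STATE']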
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
...     if s.startswith('package'):
...         return [i.split('=')[1] for i in s.split() if "=" in i]
...     elif s.startswith('uses-permission'):
...         return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
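If you'd rather not have the stray quotes in the output, one option (an addition, not part of the original answer) is to strip them from each piece, e.g. return s.split('.')[-1].strip("'") for the uses-permission branch.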
Here is one example code
#!/usr/bin/env python
inputFile = open("test.txt", "r").readlines()
for line in inputFile:
    if line.startswith("package"):
        words = line.split()
        string1 = words[1].split("=")[1].replace("'","")
        string2 = words[3].split("=")[1].replace("'","")
The test.txt file contains the input data you mentioned earlier.

Removing control characters from a string in python

I currently have the following code
def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
        i += 1
    return line
This just does not work if there is more than one character to be deleted.
There are hundreds of control characters in unicode. If you are sanitizing data from the web or some other source that might contain non-ascii characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the unicode category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C".
This snippet removes all control characters from a string.
import unicodedata
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
Examples of unicode categories:
>>> from unicodedata import category
>>> category('\r') # carriage return --> Cc : control character
'Cc'
>>> category('\0') # null character ---> Cc : control character
'Cc'
>>> category('\t') # tab --------------> Cc : control character
'Cc'
>>> category(' ') # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A') # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',') # comma -----------> Po : punctuation
'Po'
>>>
You could use str.translate with the appropriate map, for example like this:
>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'
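Note that range(32) only covers the C0 controls (it misses DEL and the C1 range). A sketch of the same trick extended to every Unicode control character, with the table built via unicodedata:
>>> import sys, unicodedata
>>> mpa = dict.fromkeys(i for i in range(sys.maxunicode + 1)
...                     if unicodedata.category(chr(i)).startswith("C"))
>>> 'abc\x02de\u200b'.translate(mpa)
'abcde'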
Anyone interested in a regex character class that matches any Unicode control character may use [\x00-\x1f\x7f-\x9f].
You may test it like this:
>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode + 1)]
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True
So to remove the control characters using re just use the following:
>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'
This is the easiest, most complete, and most robust way I am aware of. It does require an external dependency, however. I consider it to be worth it for most projects.
pip install regex
import regex as rx
def remove_control_characters(s):
    return rx.sub(r'\p{C}', '', s)
\p{C} is the unicode character property for control characters, so you can leave it up to the unicode consortium which ones of the millions of unicode characters available should be considered control. There are also other extremely useful character properties I frequently use, for example \p{Z} for any kind of whitespace.
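For instance, a one-liner using the same module to collapse any kind of Unicode whitespace to a plain space (the sample string is illustrative):
print(rx.sub(r'\p{Z}+', ' ', 'a\u00a0\u2009b'))  # prints 'a b'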
Your implementation is wrong because the value of i is incorrect: every deletion shifts the rest of the string left, so i stops pointing at the current character (and line[:i - 1] is off by one to begin with). However, that's not the only problem: it also repeatedly uses slow string operations, meaning that it runs in O(n²) instead of O(n). Try this instead:
return ''.join(c for c in line if ord(c) >= 32)
And for Python 2, with the builtin translate:
import string
all_bytes = string.maketrans('', '') # String of 256 characters with (byte) value 0 to 255
line.translate(all_bytes, all_bytes[:32]) # All bytes < 32 are deleted (the second argument lists the bytes to delete)
You modify the line while iterating over it. Try something like ''.join([x for x in line if ord(x) >= 32])
filter(string.printable[:-5].__contains__, line)
(string.printable ends with the whitespace characters '\t\n\r\x0b\x0c'; slicing those five off keeps letters, digits, punctuation and the space, so in Python 2 this returns the line with everything else removed.)
I've tried all the above and it didn't help. In my case, I had to remove Unicode 'LRM' chars:
Finally I found this solution that did the job:
df["AMOUNT"] = df["AMOUNT"].str.encode("ascii", "ignore")
df["AMOUNT"] = df["AMOUNT"].str.decode('UTF-8')
