Remove weird characters using Python

I have this large SQL file with about 1 million inserts in it. Some of the inserts (about 6000) are corrupted with weird characters that I need to remove so I can insert them into my DB.
Ex:
INSERT INTO BX-Books VALUES ('2268032019','Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');
I want to remove only the weird characters and leave all of the normal ones.
I tried using the following code to do so:
import fileinput
import string

fileOld = open('text1.txt', 'r+')
file = open("newfile.txt", "w")
for line in fileOld:  # in fileinput.input(['C:\Users\Vashista\Desktop\BX-SQL-Dump\test1.txt']):
    print(line)
    s = line
    printable = set(string.printable)
    filter(lambda x: x in printable, s)
    print(s)
    file.write(s)
but it doesn't seem to be working: when I print s it is the same as what is printed for line, and, stranger still, nothing gets written to the file.
Any advice or tips on how to solve this would be useful.

import string

strg = "'2268032019', Petite histoire de la d�©sinformation','Vladimir Volkoff',1999,'Editions du Rocher','http://images.amazon.com/images/P/2268032019.01.THUMBZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.MZZZZZZZ.jpg','http://images.amazon.com/images/P/2268032019.01.LZZZZZZZ.jpg');"
newstrg = ""
acc = """ '",{}[].`;: """
for x in strg:
    if x in string.ascii_letters or x in string.digits or x in acc:
        newstrg += x
print(newstrg)
Output:
'2268032019', Petite histoire de la dsinformation','Vladimir Volkoff',1999,'Editions du Rocher','http:images.amazon.comimagesP2268032019.01.THUMBZZZ.jpg','http:images.amazon.comimagesP2268032019.01.MZZZZZZZ.jpg','http:images.amazon.comimagesP2268032019.01.LZZZZZZZ.jpg';
You can check whether each element of the string is in the ASCII letters and then create a new string without the non-ASCII letters.
Also it depends on your variable type. If you work with lists, you don't have to define a new variable; just del mylist[x] will work.
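As for why the original loop appeared to do nothing: filter() returns a new object (a lazy iterator in Python 3, a new list in Python 2) rather than modifying s in place, so the result has to be captured, and the output file is never closed, so its buffer may never be flushed. A minimal Python 3 sketch of the original approach with both issues addressed:
import string

printable = set(string.printable)

# 'with' closes (and flushes) both files automatically
with open('text1.txt', 'r') as old, open('newfile.txt', 'w') as new:
    for line in old:
        # keep only printable ASCII; the result must be captured,
        # since no string operation mutates the string in place
        cleaned = ''.join(ch for ch in line if ch in printable)
        new.write(cleaned)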

You can use the regular expression sub() function to do simple string replacements.
https://docs.python.org/2/library/re.html#re.sub
# -*- coding: utf-8 -*-
import re
dirty_string = u'©sinformation'
# In the first param, put a regex to screen for; in this case I negated the desired characters.
clean_string = re.sub(r'[^a-zA-Z0-9./]', r'', dirty_string)
print clean_string
# Output:
# sinformation
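Applied line by line to the dump from the question, the same idea looks something like the sketch below. This is only an illustration: the character class here is my own assumption, widened to keep the quotes, parentheses, and other punctuation that valid SQL needs, and the file names are the ones from the question.
import re

# Keep the printable ASCII the INSERT statements actually use (plus the
# newline); strip everything else, i.e. the mojibake bytes.
pattern = re.compile(r"[^a-zA-Z0-9 '\",();:/.\-_?=&\n]")

with open('text1.txt') as old, open('newfile.txt', 'w') as new:
    for line in old:
        new.write(pattern.sub('', line))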

Related

String from file to UTF-8 string in Python

So I am reading and manipulating a file with:
base_file = open(path+'/'+base_name, "r")
lines = base_file.readlines()
After this I search for and find the line starting with "raw_data":
if re.match("\s{0,100}raw_data: ", line):
    split_line = line.split("raw_data:")
    print(split_line)
    raw_string = split_line[1]
One example of raw_data is:
raw_data: "&\276!\300\307 =\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\}\277\210\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
And raw_string will be:
print(raw_string)
"&\276!\300\307
=\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\}\277\210\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
If I try to read this file I obtain one character per character, even for the escape sequences.
So, my question is how to transform this plain text to a UTF-8 string so that I get one character when reading \300 and not 4 characters.
I tried to pass encoding='utf-8' to the open file method but it does not work.
I have made the same example passing raw_data as a variable and it works properly.
RAW_DATA = "&\276!\300\307 =\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\\}\277\210\\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300<I>>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
print(f"Qnt -> {len(RAW_DATA)}") # Qnt -> 256
print(type(RAW_DATA))
at = 0
total = 0
while at < len(RAW_DATA):
fin = at+4
substrin = RAW_DATA[at:fin]
resu = FourString_float(substrin)
at = fin
For this example \300 is only one char.
Hope someone can help me.
The problem is that in the file being read, the \ symbols come in as literal backslash characters, but in the example you've provided they are evaluated together with the numerics that follow them; i.e., \276 is read as a single character.
If you run:
RAW_DATA = r"&\276!\300\307 =\277\"O\271\277vH9?j?\345?#\243\264=\350\034\345\277\260\345\033\300\023\017(#z|\273\277L\\}\277\210\\\031\300\213\263z\277\302\241\033\300\000\207\323\277\247Oh>j\354\215#\364\305\201\276\361+\202#t:\304\277\344\231\243#\225k\002\300vw\262\277\362\220j\300\"(\337\276\354b8\300\230\347H\300\201\320\204\300S;N\300Z0G\300<I>>j\210\000#\034\014\220#\231\330J#\223\025\236#\006\332\230\276\227\273\n\277\353#,#\202\205\215\277\340\356\022\300/\223\035\277\331\277\362\276a\350\013#)\353\276\277v6\316\277K\326\207\300`2)\300\004\014Q\300\340\267\271\300MV\305\300\327\010\207\300j\346o\300\377\260\216\300[\332g\300\336\266\003\300\320S\272?6\300Y#\356\250\034\300\367\277&\300\335Uq>o\010&\300r\277\252\300U\314\243\300\253d\377\300"
print(f"Qnt -> {len(RAW_DATA)}") # Qnt -> 256
print(type(RAW_DATA))
at = 0
total = 0
while at < len(RAW_DATA):
fin = at+4
substrin = RAW_DATA[at:fin]
resu = FourString_float(substrin)
at = fin
You should be getting the same error that you were getting originally. Notice that we are using a raw-string literal instead of a regular string literal; this ensures that the \ sequences don't get interpreted.
You would need to evaluate RAW_DATA to force Python to evaluate the \ escapes.
You can do something like RAW_DATA = eval(f'"{RAW_DATA}"') or
import ast
RAW_DATA = ast.literal_eval(f'"{RAW_DATA}"')
Note, the second option is a bit more secure than doing a straight eval, as you are limiting the scope of what can be executed.
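As a quick sanity check of the difference, a minimal sketch on a short prefix of the data:
import ast

raw = r"&\276!\300\307"            # literal backslashes, as read from the file
print(len(raw))                    # 14: each backslash and digit is its own character

decoded = ast.literal_eval(f'"{raw}"')
print(len(decoded))                # 5: \276, \300 and \307 are now single characters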

Find, decode and replace all base64 values in text file

I have a SQL dump file that contains text with HTML links like:
<a href="http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=">attached file</a>
I'd like to find, decode and replace the base64 part of the text in each of these links.
I've been trying to use Python w/ regular expressions and base64 to do the job. However, my regex skills are not up to the task.
I need to select any string that starts with
'getattachment.php?data='
and ends with
'"'
I then need to decode the part between 'data=' and '&quot' using base64.b64decode()
results should look something like:
<a href="http://blahblah.org/kb/4/Topcon_data-download_howto.pdf">attached file</a>
I think the solution will look something like:
import re
import base64

with open('phpkb_articles.sql') as f:
    for line in f:
        re.sub(some_regex_expression_here, some_function_here_to_decode_base64)
Any ideas?
EDIT: Answer for anyone who's interested.
import re
import base64
import sys

def decode_base64(s):
    """
    Method to decode base64 into ascii
    """
    # fix escaped equal signs in some base64 strings
    base64_string = re.sub('%3D', '=', s.group(1))
    decodedString = base64.b64decode(base64_string)
    # substitute '|' for '/'
    decodedString = re.sub('\|', '/', decodedString)
    # escape the spaces in file names
    decodedString = re.sub(' ', '%20', decodedString)
    # print 'assets/' + decodedString + '&quot'  # Print for debug
    return 'assets/' + decodedString + '&quot'

count = 0
pattern = r'getattachment.php\?data=([^&]+?)&quot'

# Open the file and read line by line
with open('phpkb_articles.sql') as f:
    for line in f:
        try:
            # globally substitute in new file path
            edited_line = re.sub(pattern, decode_base64, line)
            # output the edited line to standard out
            sys.stdout.write(edited_line)
        except TypeError:
            # output unedited line if decoding fails to prevent corruption
            sys.stdout.write(line)
            # print line
        count += 1
You already have it; you just need the small pieces:
pattern: r'data=([^&]+?)&quot' will match anything after data= and before &quot
>>> pat = r'data=([^&]+?)&quot'
>>> line = '<a href="http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=">attached file</a>'
>>> decodeString = re.search(pat, line).group(1)  # because the b64 string is captured by grouping, we only want group(1)
>>> decodeString
'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY='
You can then use the str.replace() method as well as the base64.b64decode() method to finish the rest. I don't want to just write your code for you, but this should give you a good idea of where to go.
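To finish it off, here is a Python 3 sketch of the whole substitution, run on the example line from the question. Note this example line delimits the value with a plain ", while the asker's real dump apparently uses the &quot; entity, so the pattern would need adjusting there:
import re
import base64

line = '<a href="http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=">attached file</a>'
pat = r'getattachment\.php\?data=([^&"]+)'

def replace(match):
    # decode the captured base64 payload and swap '|' for '/'
    decoded = base64.b64decode(match.group(1)).decode('ascii')
    return decoded.replace('|', '/')

print(re.sub(pat, replace, line))
# <a href="http://blahblah.org/kb/4/Topcon_data-download_howto.pdf">attached file</a>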

Prevent the newline character from being read literally into a Python script

I have a string that I want to pass to a Python script, e.g.
$ printf "tas\nty\n"
yields
tas
ty
however when I pipe it (e.g. printf "tas\nty\n" | ./pumpkin.py), where pumpkin.py is:
#!/usr/bin/python
import sys
data = sys.stdin.readlines()
print data
I get the output
['tas\n', 'ty\n']
How do I prevent the newline character from being read by python?
You can strip all whitespace (at the beginning and at the end) using strip():
data = [s.strip() for s in sys.stdin.readlines()]
If you need to strip only the \n at the end, you can do:
data = [s.rstrip('\n') for s in sys.stdin.readlines()]
Or use the splitlines() method:
data = sys.stdin.read().splitlines()
http://www.tutorialspoint.com/python/string_splitlines.htm
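For example, a quick demo of the splitlines() variant as pumpkin.py (Python 2, matching the script in the question):
#!/usr/bin/python
import sys

# run as: printf "tas\nty\n" | ./pumpkin.py
data = sys.stdin.read().splitlines()
print data  # ['tas', 'ty']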

Converting a UTF-16 CSV to an array

I've tried to convert a CSV file encoded in UTF-16 (exported by another program) to a simple array in Python 2.7, with very little luck.
Here's the nearest solution I've found:
import csv
from io import BytesIO

with open('c:\\pfm\\bdh.txt', 'rb') as f:
    x = f.read().decode('UTF-16').encode('UTF-8')
    for line in csv.reader(BytesIO(x)):
        print line
This code returns:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
['1\tnom1\tetq1\text1 ...
What I'm trying to get it's something like this:
[['','Nombre','Etiqueta','Extensión de archivo','Tamaño lógico','Categoría']
['1','nom1','etq1','ext1','123','cat1']
['2','nom2','etq2','ext2','456','cat2']]
So, I'd need to convert those hexadecimal escapes into Latin characters (such as á, é, í, ó, ú, or ñ), and split those tab-separated strings into array fields.
Do I really need to use dictionaries for the first part? I think there should be an easier solution, as I can see and type all these characters from the keyboard.
For the second part, I think the csv library won't help in this case, as I read it can't manage UTF-16 yet.
Could you give me a hand? Thank you!
ITEM #1: The hexadecimal characters
You are getting the:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
output because you are printing a list. The behaviour of the list is to print the representation of each item. That is, it is the equivalent of:
print('[{0}]'.format(','.join([repr(item) for item in lst])))
If you use print(line[0]) you will get the output of the line.
ITEM #2: The output
The problem here is that the csv parser is not parsing the content as a tab-separated file, but as a comma-separated file. You can fix this by using:
for line in csv.reader(BytesIO(x), delimiter='\t'):
    print(line)
instead.
This will give you the desired result.
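Putting both items together, a Python 2 sketch of the corrected read, using the same file path as in the question:
import csv
from io import BytesIO

with open('c:\\pfm\\bdh.txt', 'rb') as f:
    x = f.read().decode('UTF-16').encode('UTF-8')

# parse as tab-separated and collect the rows into a list of lists
rows = list(csv.reader(BytesIO(x), delimiter='\t'))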
Processing a UTF-16 file with the csv module in Python 2 can indeed be a pain. Re-encoding to UTF-8 works, but you then still need to decode the resulting columns to produce unicode values.
Note also that your data appears to be tab delimited; the csv.reader() by default uses commas, not tabs, to separate columns. You'll need to configure it to use tabs instead by setting delimiter='\t' when constructing the reader.
Use io.open() to read UTF-16 and produce unicode lines. You can then use codecs.iterencode() to translate the decoded unicode values from the UTF-16 file to UTF-8.
To decode the rows back to unicode values, you could use an extra generator to do so as you iterate:
import csv
import codecs
import io
def row_decode(reader, encoding='utf8'):
    for row in reader:
        yield [col.decode('utf8') for col in row]

with io.open('c:\\pfm\\bdh.txt', encoding='utf16') as f:
    wrapped = codecs.iterencode(f, 'utf8')
    reader = csv.reader(wrapped, delimiter='\t')
    for row in row_decode(reader):
        print row
Printing each row will still use repr() on each contained value, which means that you'll see Python string-literal syntax for the strings. Any non-printable or non-ASCII codepoint will be represented by an escape code:
>>> [u'', u'Nombre', u'Etiqueta', u'Extensión de archivo', u'Tamaño lógico', u'Categoría']
[u'', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
This is normal; the output is meant to be useful as a debugging aid and can be pasted back into any Python session to reproduce the original value, without worrying about terminal encodings.
For example, ó is represented as \xf3, representing the Unicode codepoint U+00F3 LATIN SMALL LETTER O WITH ACUTE. If you were to print this one column, Python will encode the Unicode string to bytes matching your terminal encoding, resulting in your terminal showing you the correct string again:
>>> u'Extensi\xf3n de archivo'
u'Extensi\xf3n de archivo'
>>> print u'Extensi\xf3n de archivo'
Extensión de archivo
Demo:
>>> import csv, codecs, io
>>> io.open('/tmp/demo.csv', 'w', encoding='utf16').write(u'''\
... \tNombre\tEtiqueta\tExtensi\xf3n de archivo\tTama\xf1o l\xf3gico\tCategor\xeda
... ''')
63L
>>> def row_decode(reader, encoding='utf8'):
...     for row in reader:
...         yield [col.decode('utf8') for col in row]
...
>>> with io.open('/tmp/demo.csv', encoding='utf16') as f:
...     wrapped = codecs.iterencode(f, 'utf8')
...     reader = csv.reader(wrapped, delimiter='\t')
...     for row in row_decode(reader):
...         print row
...
[u' ', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
>>> # the row is displayed using repr() for each column; the values are correct:
...
>>> print row[3], row[4], row[5]
Extensión de archivo Tamaño lógico Categoría

Find a word in the lines of a file and split the line into two lines

My input file (i.txt) is given below:
പ്രധാനമന്ത്രി മന്‍മോഹന്‍സിംഗ് നാട്ടില്‍ എത്തി .
അദ്ദേഹം മലയാളി അല്ല കാരണം അദ്ദേഹത്തെ പറ്റി പറയാന്‍ വാക്കുകല്ളില്ല .
and my connectives are in the list:
connectives=['കാരണം','അതുകൊണ്ട്‌ ','പക്ഷേ','അതിനാല്‍','എങ്കിലും','എന്നാലും','എങ്കില്‍','എങ്കില്‍പോലും',
'എന്നതുകൊണ്ട്‌ ','എന്ന']
My desired output is(outputfile.txt):
പ്രധാനമന്ത്രി മന്‍മോഹന്‍സിംഗ് നാട്ടില്‍ എത്തി .
അദ്ദേഹം മലയാളി അല്ല .
അദ്ദേഹത്തെ പറ്റി പറയാന്‍ വാക്കുകല്ളില്ല .
If there are 2 connectives, split according to that. My code is:
import codecs

fr = codecs.open('i.txt', encoding='utf-8')
fw = codecs.open('outputfile.txt', 'w')
for line in fr:
    line_data = line.split()
    for x, e in list(enumerate(line_data)):
        if e in connectives:
            line_data[x] = '.'
The code is not completed.
I think you just have some indentation problems. I also added u'' prefixes to the connectives to mark them as unicode, since I am using Python 2.7.
You also need to add a newline along with the . if you want it to split an existing line into two lines...
Here is a start (but not final):
import codecs

connectives = [u'കാരണം', u'അതുകൊണ്ട്‌ ', u'പക്ഷേ', u'അതിനാല്‍', u'എങ്കിലും', u'എന്നാലും',
               u'എങ്കില്‍', u'എങ്കില്‍പോലും', u'എന്നതുകൊണ്ട്‌ ', u'എന്ന']

fr = codecs.open('i.txt', encoding='utf-8')
# fw = codecs.open('outputfile.txt', 'w')
for line in fr:
    line_data = line.split()
    for x, e in list(enumerate(line_data)):
        if e in connectives:
            line_data[x] = '.\n'
    print " ".join(line_data).lstrip()
Generates this output (extra space because the split comes in the middle of a line).
പ്രധാനമന്ത്രി മന്‍മോഹന്‍സിംഗ് നാട്ടില്‍ എത്തി .
അദ്ദേഹം മലയാളി അല്ല .
അദ്ദേഹത്തെ പറ്റി പറയാന്‍ വാക്കുകല്ളില്ല .
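If you also want the result written to outputfile.txt rather than printed, a sketch with the same logic (and the same extra-space caveat), assuming the question's file names:
import codecs

connectives = [u'കാരണം', u'അതുകൊണ്ട്‌ ', u'പക്ഷേ', u'അതിനാല്‍', u'എങ്കിലും', u'എന്നാലും',
               u'എങ്കില്‍', u'എങ്കില്‍പോലും', u'എന്നതുകൊണ്ട്‌ ', u'എന്ന']

with codecs.open('i.txt', encoding='utf-8') as fr, \
     codecs.open('outputfile.txt', 'w', encoding='utf-8') as fw:
    for line in fr:
        line_data = line.split()
        for x, e in enumerate(line_data):
            if e in connectives:
                line_data[x] = '.\n'
        # split() consumed the original newline, so add one back
        fw.write(" ".join(line_data).lstrip() + '\n')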
Here's one way you could do it, building up a string word by word and adding .\n where appropriate:
#!/usr/bin/python
# -*- coding: utf-8 -*-

connectives = set(['കാരണം', 'അതുകൊണ്ട്‌ ', 'പക്ഷേ', 'അതിനാല്‍', 'എങ്കിലും', 'എന്നാലും',
                   'എങ്കില്‍', 'എങ്കില്‍പോലും', 'എന്നതുകൊണ്ട്‌ ', 'എന്ന', '.'])

s = ""
with open('i.txt') as file:
    for line in file:
        for word in line.split():
            if word in connectives:
                s += '.\n'
            else:
                s += '{} '.format(word)
print s
Note that I added '.' to the end of the connectives list and made it into a set. Sets are a type of collection that is useful for fast membership testing, such as if word in connectives: in the code. I also decided to use str.format to put the word into the string; this could be changed to word + ' ' if preferred.
Output:
പ്രധാനമന്ത്രി മന്‍മോഹന്‍സിംഗ് നാട്ടില്‍ എത്തി .
അദ്ദേഹം മലയാളി അല്ല .
അദ്ദേഹത്തെ പറ്റി പറയാന്‍ വാക്കുകല്ളില്ല .
Unlike the other answer, there's no problem with the leading whitespace at the start of each line after the first one.
By the way, if you are comfortable using list comprehensions, you could condense the code down to this:
#!/usr/bin/python
# -*- coding: utf-8 -*-

connectives = set(['കാരണം', 'അതുകൊണ്ട്‌ ', 'പക്ഷേ', 'അതിനാല്‍', 'എങ്കിലും', 'എന്നാലും',
                   'എങ്കില്‍', 'എങ്കില്‍പോലും', 'എന്നതുകൊണ്ട്‌ ', 'എന്ന', '.'])

with open('i.txt') as file:
    s = ''.join(['.\n' if word in connectives else '{} '.format(word)
                 for line in file
                 for word in line.split()])
print s
