Trouble scanning list for duplicates - python

Hey, so I want to scan this text file of emails and print any email that appears more than once; if an email appears only once, I don't want it printed.
It worked for a different text file, but now it's raising a traceback error.
# note: make sure found.txt and list.txt are in the 'include' for PyCharm
from collections import Counter

print("Welcome DADDY")
with open('myheritage-1-million.txt') as f:
    c = Counter(c.strip().lower() for c in f if c.strip())  # for case-insensitive search
for line in c:
    if c[line] > 1:
        print(line)
ERROR:
rs/dcaputo/PycharmProjects/searchtoolforrhys/venv/include/search.py
Welcome DADDY
Traceback (most recent call last):
File "/Users/dcaputo/PycharmProjects/searchtoolforrhys/venv/include/search.py", line 5, in <module>
c = Counter(c.strip().lower() for c in f if c.strip()) #for case-insensitive search
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/collections/__init__.py", line 566, in __init__
self.update(*args, **kwds)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/collections/__init__.py", line 653, in update
_count_elements(self, iterable)
File "/Users/dcaputo/PycharmProjects/searchtoolforrhys/venv/include/search.py", line 5, in <genexpr>
c = Counter(c.strip().lower() for c in f if c.strip()) #for case-insensitive search
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 2668: invalid continuation byte
Process finished with exit code 1
What I want is a list of all emails that show up two or more times in that whole text file.

The key is the error message at the end:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 2668: invalid continuation byte
This error can occur when trying to read a non-text file, or a file in a different encoding, as UTF-8 text. Your file could be corrupted, or it may simply contain some data (at position 2668) that is not valid UTF-8.
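If the bad bytes only need to be tolerated rather than interpreted, one workaround (a sketch, assuming the duplicate emails themselves are ordinary ASCII) is to open the file with an error handler so undecodable bytes no longer raise. The tiny sample file here stands in for the real one:

```python
from collections import Counter

# Tiny sample file with one byte (0xc5) that is not valid UTF-8 on its own.
with open('emails.txt', 'wb') as fh:
    fh.write(b'a@b.com\nB\xc5d@x.com\na@b.com\n')

# errors="replace" substitutes U+FFFD for undecodable bytes, so reading
# never raises UnicodeDecodeError (latin-1 is another option if the raw
# byte values should be preserved instead).
with open('emails.txt', encoding='utf-8', errors='replace') as f:
    c = Counter(line.strip().lower() for line in f if line.strip())

duplicates = [email for email, count in c.items() if count > 1]
print(duplicates)  # → ['a@b.com']
```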


Program crashes during reading text file

def process_file(self):
    error_flag = 0
    line_count = 0
    log_file = self.file_name
    pure_name = log_file.strip()
    # print('Before opening file ', pure_name)
    logfile_in = open(pure_name, 'r')  # Read file
    lines = logfile_in.readlines()
    # print('After reading file enteries ', pure_name)
Error Message
Traceback (most recent call last):
File "C:\Users\admin\PycharmProjects\BackupLogCheck\main.py", line 49, in <module>
backupLogs.process_file()
File "C:\Users\admin\PycharmProjects\BackupLogCheck\main.py", line 20, in process_file
lines = logfile_in.readlines()
File "C:\Users\admin\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 350: character maps to <undefined>
Process finished with exit code 1
Line 49 is where I call the above method, but I have traced the crash to reading the file. I have checked the file; it contains only text. I don't know if there are some characters in it that it doesn't like when reading entries. I am running on Windows 10.
I am new to Python; any suggestions on how to find/correct the issue?
Try the file name in string format
logfile_in = open('pure_name', 'r') # Read file
lines = logfile_in.readlines()
print(lines)
output
['test line one\n', 'test line two']
or
logfile_in = open('pure_name', 'r') # Read file
lines = logfile_in.readlines()
for line in lines:
print(line)
output
test line one
test line two
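The traceback itself points at a different cause, though: Python fell back to Windows' default cp1252 codec, and byte 0x90 has no character in cp1252. A sketch of a workaround (the file name here is illustrative, not the asker's real log): pass an explicit encoding, or an error handler when the real encoding is unknown.

```python
# Sample log file containing a 0x90 byte, which is undefined in cp1252
# (Windows' default text encoding) and is what made readlines() raise.
with open('backup.log', 'wb') as fh:
    fh.write(b'first line\nbad \x90 byte\n')

# Reading with an explicit encoding plus an error handler avoids the crash;
# undecodable bytes become U+FFFD instead of raising.
with open('backup.log', 'r', encoding='utf-8', errors='replace') as logfile_in:
    lines = logfile_in.readlines()
print(len(lines))  # → 2
```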

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 2: invalid start byte, tried all encoding styles

data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 826, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 841, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 920, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1052, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1220, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1238, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1429, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 2: invalid start byte
I am getting the above error while reading my CSV.
To rectify this, I used unicode_escape:
csv_df=pd.read_csv(file_path,header=0,squeeze=True,dtype=str,keep_default_na=False,encoding='unicode_escape')
However,
now I am getting \xa0 for the space between two words:
'ObjectStatus': 'IN\xa0SERVICE'
My CSV has:
Key Values
RequestID
ObjectType CONTAINER
ObjectName INMUNVMBMHPBNB6001ENBCMW005
ObjectStatus IN SERVICE
ObjectType CONTAINER
The unicode_escape codec is for literal escape codes (length-4 \\xa0 vs. length-1 \xa0). What you see displayed is just Python's debug representation of the string; it prints \xa0 to show that it isn't a regular space. Your file is probably encoded in cp1252 or latin-1, as \xa0 is the NO-BREAK SPACE in those encodings.
Example:
>>> d = {'ObjectStatus': 'IN\xa0SERVICE'}
>>> d
{'ObjectStatus': 'IN\xa0SERVICE'}
>>> print(d['ObjectStatus'])
IN SERVICE
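The point can be checked with a couple of bytes (a sketch, not tied to the original CSV): byte 0xa0 decodes cleanly under cp1252/latin-1 as NO-BREAK SPACE, and str.replace() then turns it into an ordinary space if plain ASCII is needed downstream.

```python
raw = b'IN\xa0SERVICE'             # the problematic byte from the traceback

# 0xa0 is an invalid start byte in UTF-8, but in cp1252/latin-1 it is the
# NO-BREAK SPACE character, so decoding with the right codec succeeds.
text = raw.decode('cp1252')        # 'IN\xa0SERVICE'
clean = text.replace('\xa0', ' ')  # 'IN SERVICE'
print(clean)  # → IN SERVICE
```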
The following worked for me.
Using str.replace(), I replaced all values in the column containing '\xa0' with ' ':
csv_df = pd.read_csv(file_path, header=0, squeeze=True, dtype=str, keep_default_na=False)
csv_df['Values'] = csv_df['Values'].astype(str).str.replace(u'\xa0', ' ')
I had to pass these values into another function which created an XML; I had tried all encodings, and none worked.

Encoding UTF-8 issue when trying to read a JSON file

I got the error shown below when trying to read a JSON file whose encoding is supposed to be UTF-8. Does anyone know how I can resolve this issue?
import json
import pandas as pd

reviews = pd.read_csv('reviews.csv', nrows=1000)
businesses = pd.read_csv('businesses.csv', nrows=1000)
checkins = []
with open('checkins.json', encoding='utf-8') as f:
    for row in f.readlines()[:1000]:
        checkins.append(json.loads(row))
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-10-4f54896faeca> in <module>
3 checkins = []
4 with open('checkins.json', encoding='utf-8') as f:
----> 5 for row in f.readlines()[:1000]:
6 checkins.append(json.loads(row))
~\Anaconda3\lib\codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 37: invalid continuation byte
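Since the traceback shows the file is not actually valid UTF-8 at position 37, one hedged workaround (a sketch with a made-up two-line file, not a confirmed fix for this data) is to read line by line in binary and fall back to latin-1, which accepts any byte, for the lines that fail:

```python
import json

# Sample JSON-lines file in which one line is latin-1, not UTF-8
# (0xda is 'Ú' in latin-1 but starts an invalid sequence in UTF-8).
with open('checkins.json', 'wb') as fh:
    fh.write(b'{"name": "cafe"}\n')
    fh.write(b'{"name": "caf\xda"}\n')

checkins = []
with open('checkins.json', 'rb') as f:
    for i, raw in enumerate(f):
        if i >= 1000:
            break
        try:
            line = raw.decode('utf-8')
        except UnicodeDecodeError:
            line = raw.decode('latin-1')   # fallback: decodes any byte
        checkins.append(json.loads(line))
print(len(checkins))  # → 2
```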

Python Unicode decode error- Not able to run script even after suggested correction

I am using Python 2.7.9 to create an Excel sheet from tab-delimited text files; however, I am getting a problem while running this Python script.
#!/usr/bin/env python
# encoding=utf8
import xlwt
import os
import sys

reload(sys)
sys.setdefaultencoding('utf8')

wb = xlwt.Workbook()
path = "/home/Final_analysis/"
# print(os.listdir())
lis = os.listdir(path)
sheetnumber = 1
for x in lis:
    if os.path.isfile(x) == True:
        extension = os.path.splitext(x)
        print(extension[1])
        if extension[1] == '.txt':
            # print("Yes")
            ws = wb.add_sheet(extension[0])
            row = 0
            column = 0
            a = open(x)
            while True:
                a1 = a.readline()
                if len(a1) == 0:
                    break
                data = a1.split("\t")
                for z in data:
                    ws.write(row, column, z)
                    column += 1
                column = 0
                row += 1
            sheetnumber += 1
        else:
            pass
wb.save("Ronic.xls")
I am getting the following error:
Traceback (most recent call last):
File "home/Final_analysis/combine_excel_v2.py", line 39, in <module>
wb.save("Ronic.xls")
File "/usr/local/lib/python2.7/site-packages/xlwt/Workbook.py", line 710, in save
doc.save(filename_or_stream, self.get_biff_data())
File "/usr/local/lib/python2.7/site-packages/xlwt/Workbook.py", line 674, in get_biff_data
shared_str_table = self.__sst_rec()
File "/usr/local/lib/python2.7/site-packages/xlwt/Workbook.py", line 636, in __sst_rec
return self.__sst.get_biff_record()
File "/usr/local/lib/python2.7/site-packages/xlwt/BIFFRecords.py", line 77, in get_biff_record
self._add_to_sst(s)
File "/usr/local/lib/python2.7/site-packages/xlwt/BIFFRecords.py", line 92, in _add_to_sst
u_str = upack2(s, self.encoding)
File "/usr/local/lib/python2.7/site-packages/xlwt/UnicodeUtils.py", line 50, in upack2
us = unicode(s, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 83: ordinal not in range(128)
I have used the answer given in the thread "How to fix: UnicodeDecodeError: 'ascii' codec can't decode byte",
but it didn't work.
The problem is at the wb.save() command.
Setting the encoding at the top of your program is for handling non-ASCII characters in your source code, not your data. sys.setdefaultencoding('utf8') is not intended to be used in ordinary programs and does more harm than good.
To fix the problem, tell xlwt about the encoding to use.
Change this line:
wb = xlwt.Workbook()
to this:
wb = xlwt.Workbook(encoding="UTF-8")

Utf-8 decode error in pyramid/WebOb request

I found an error in the logs of one of my websites. The log contained the body of the request, so I tried to reproduce it.
This is what I got.
>>> from mondishop.models import *
>>> from pyramid.request import *
>>> req = Request.blank('/')
>>> b = DBSession.query(Log).filter(Log.id == 503).one().payload.encode('utf-8')
>>> req.method = 'POST'
>>> req.body = b
>>> req.params
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/home/phas/virtualenv/mondishop/local/lib/python2.7/site-packages/webob/request.py", line 856, in params
params = NestedMultiDict(self.GET, self.POST)
File "/home/phas/virtualenv/mondishop/local/lib/python2.7/site-packages/webob/request.py", line 807, in POST
vars = MultiDict.from_fieldstorage(fs)
File "/home/phas/virtualenv/mondishop/local/lib/python2.7/site-packages/webob/multidict.py", line 92, in from_fieldstorage
obj.add(field.name, decode(value))
File "/home/phas/virtualenv/mondishop/local/lib/python2.7/site-packages/webob/multidict.py", line 78, in <lambda>
decode = lambda b: b.decode(charset)
File "/home/phas/virtualenv/mondishop/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 52: invalid start byte
>>> req.POST
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/home/phas/virtualenv/mondishop/local/lib/python2.7/site-packages/webob/request.py", line 807, in POST
vars = MultiDict.from_fieldstorage(fs)
File "/home/phas/virtualenv/mondishop/local/lib/python2.7/site-packages/webob/multidict.py", line 92, in from_fieldstorage
obj.add(field.name, decode(value))
File "/home/phas/virtualenv/mondishop/local/lib/python2.7/site-packages/webob/multidict.py", line 78, in <lambda>
decode = lambda b: b.decode(charset)
File "/home/phas/virtualenv/mondishop/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 52: invalid start byte
>>>
The error is the same as the one I got in my log, so apparently something goes wrong while decoding the original POST.
What is weird is that I get an error trying to UTF-8-decode something that I just UTF-8-encoded.
I cannot provide the content of the original request body because it contains some sensitive data (it's a PayPal IPN), and I don't really have any idea how to start addressing this issue.
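One plausible mechanism worth checking (a sketch, not a confirmed diagnosis of this request; Python 3 shown for brevity): re-encoding the stored body as UTF-8 does not touch the percent-encoded form values inside it. WebOb percent-decodes each field value and then decodes those raw bytes with the request charset, and 0xb0 is '°' in latin-1/cp1252 but an invalid start byte in UTF-8, matching the traceback.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical form body: a latin-1 degree sign, percent-encoded by the
# client before it reached the server.
body = 'temperature=25%B0'
raw = unquote_to_bytes(body.split('=', 1)[1])   # b'25\xb0'

# This mirrors the decode step WebOb performs with the request charset:
try:
    decoded = raw.decode('utf-8')      # fails: 0xb0 is an invalid start byte
except UnicodeDecodeError:
    decoded = raw.decode('latin-1')    # succeeds: 0xb0 is '°' in latin-1
print(decoded)  # → 25°
```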
