Errors when trying to read csv file into pandas data frame - python

I am fairly new to working with Python, so forgive me if this is an obvious issue, but when attempting to read my csv file into a data frame I get the following errors:
Traceback (most recent call last):
File "pandas\_libs\parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1244, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1259, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1450, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 11: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\keegan.olson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\keegan.olson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 460, in _read
data = parser.read(nrows)
File "C:\Users\keegan.olson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 1198, in read
ret = self._engine.read(nrows)
File "C:\Users\keegan.olson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 2157, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1126, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1244, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1259, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1450, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 11: invalid start byte
Any help would be greatly appreciated!
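For what it's worth, byte 0x94 is the closing curly quote in Windows-1252, so the file was most likely saved with a Windows code page rather than UTF-8. As the related questions below show, passing an explicit encoding to read_csv usually resolves this; a minimal sketch (the file name is a placeholder):
import pandas as pd

# 0x94 decodes as a right double quotation mark under cp1252, which UTF-8
# rejects as an invalid start byte; latin1/ISO-8859-1 would also read the
# file without errors, but maps 0x80-0x9F to control characters.
df = pd.read_csv("your_file.csv", encoding="cp1252")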

Related

Pandas ignores set encoding in read_csv?

Using Linux, Pandas 1.0.1 and Python 3.6, I get a strange error in production:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
new_deps = self._run_get_new_deps()
File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
task_gen = self.task.run()
File "/opt/app-root/src/import_validation/validate_csv.py", line 275, in run
validate(temp_csv, self.query_id)
File "/opt/app-root/src/import_validation/validate_csv.py", line 263, in validate
pandas.read_csv(path, encoding='latin1', sep=sep)
File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1136, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1253, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1268, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1458, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 12: invalid continuation byte
As you can see in the traceback, I am already setting the encoding to latin1:
pandas.read_csv(path, encoding='latin1', sep=sep)
Why does pandas try to decode UTF-8 when I have specified latin1 as the encoding? I have tried other aliases for latin1; it gives the same result.
Any idea why pandas seems to be ignoring my encoding setting?
Edit: Removed the comment about this not happening on Windows. The same error did occur there; I had just passed the file differently, so it was not a fair comparison.
The problem was one too many layers of abstraction. I had a wrapper around this that tried to decompress the file if its name ended with 'gz', and then handed pandas not a path but an already-open temporary file. That file of course already has its encoding set, so the encoding argument is ignored by pandas. The solution is/was to pass the encoding when opening the temporary file, or, as I did, simply to pass the original path to pandas, since it handles decompression automatically.
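To make the fix concrete, here is a minimal sketch (the file name and separator are made up): given the original path, pandas decompresses the gzip file itself, so the encoding argument stays in effect.
import pandas as pd

# compression='infer' is the default: the '.gz' suffix triggers gzip
# decompression inside pandas, and encoding='latin1' is then applied to the
# decompressed bytes.
df = pd.read_csv("export.csv.gz", encoding="latin1", sep=";", compression="infer")
If an already-open text-mode file object is passed instead, its encoding was fixed when the file was opened, which is why the encoding argument appeared to be ignored.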

Merging CSV files in python [duplicate]

This question already has answers here:
UnicodeDecodeError when reading CSV file in Pandas with Python
(25 answers)
Closed 3 years ago.
I have been trying to merge several csv files into one, but it's showing me an error. I am new to Python; your help will be highly appreciated.
Following is my code:
import pandas as pd
import numpy as np
import glob

all_data_csv = pd.read_csv("C:/Users/Am/Documents/A.csv", encoding='utf-8')
for f in glob.glob('*.csv'):
    df = pd.read_csv(f, encoding='utf-8')
    all_data_csv = pd.merge(all_data_csv, df, how='outer')
print(all_data_csv)
and the error shown:
Traceback (most recent call last):
File "pandas\_libs\parsers.pyx", line 1169, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 1: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/internship/j.py", line 8, in <module>
df = pd.read_csv(f, encoding='utf-8')
File "C:\Users\Amreeta Koner\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\Amreeta Koner\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 435, in _read
data = parser.read(nrows)
File "C:\Users\Amreeta Koner\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "C:\Users\Amreeta Koner\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1176, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 1: invalid start byte
It seems like you have a non-ASCII character in your csv file. I would check out the answer here. Hope it helps.
# run the same code with a small addition
pd.read_csv("C:/Users/Am/Documents/A.csv", header=0, encoding="ISO-8859-1")

exception reading in large tab separated file chunked

I have a 350 MB tab-separated text file. If I try to read it all into memory I get an out-of-memory exception, so I am trying something along these lines (i.e. only reading in a few columns):
import pandas as pd

input_file_and_path = r'C:\Christian\ModellingData\X.txt'
column_names = [
    'X1'
    # , 'X2'
]
raw_data = pd.DataFrame()
for chunk in pd.read_csv(input_file_and_path, names=column_names, chunksize=1000, sep='\t'):
    raw_data = pd.concat([raw_data, chunk], ignore_index=True)
print(raw_data.head())
Unfortunately, I get this:
Traceback (most recent call last):
File "pandas\_libs\parsers.pyx", line 1134, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/xxxx/EdaDataPrepRange1.py", line 17, in <module>
for chunk in pd.read_csv(input_file_and_path, header=None, names=column_names, chunksize=1000, sep='\t'):
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
return self.get_chunk()
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
return self.read(nrows=size)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte
Any ideas? By the way, how can I generally deal with large files and impute missing variables, for example? Ultimately, I need to read everything in to determine, say, the median to be imputed.
use encoding="utf-8" while using pd.read_csv
Here they have used this encoding. see if this works. open(file path, encoding='windows-1252'):
Reference: 'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
Working Solution
to use encoding encoding="ISO-8859-1"
Regarding your large file problem, just use a file handler and context manager:
with open("your_file.txt") as fileObject:
for line in fileObject:
do_something_with(line)
## No need to close file as 'with' automatically does that
This won't load the whole file into memory. Instead, it'll load a line at a time, and will 'forget' previous lines unless you store a reference.
Also, regarding your encoding problem, just use encoding="utf-8" while using pd.read_csv.
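Combining the chunked approach with an explicit encoding, one way to compute, say, the median of a single column for imputation without holding the whole file in memory is sketched below; the column position (0) and the ISO-8859-1 encoding are assumptions.
import pandas as pd

input_file_and_path = r'C:\Christian\ModellingData\X.txt'

# Read only the column of interest, in chunks, with an 8-bit encoding so that
# stray bytes such as 0xae cannot raise UnicodeDecodeError.
chunks = pd.read_csv(
    input_file_and_path,
    sep='\t',
    header=None,
    usecols=[0],          # position of the column of interest
    encoding='ISO-8859-1',
    chunksize=100000,
)

# Keep just that one column per chunk, then compute the median to impute with.
x1 = pd.concat(chunk[0] for chunk in chunks).rename('X1')
median_x1 = pd.to_numeric(x1, errors='coerce').median()
print(median_x1)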

How to download all transcripts from seeking alpha

Is there some way to automatically download all the transcripts from the SA website?
http://seekingalpha.com/earnings/earnings-call-transcripts
I tried using the newspaper Python library (http://newspaper.readthedocs.io/en/latest/) but I get the following error:
earnings_call_transcripts_2 = newspaper.build('http://seekingalpha.com/earnings/earnings-call-transcripts', memoize_articles=False)
Traceback (most recent call last):
File "/Users/name/anaconda/lib/python3.5/site-packages/newspaper/parsers.py", line 67, in fromstring
cls.doc = lxml.html.fromstring(html)
File "/Users/name/anaconda/lib/python3.5/site-packages/lxml/html/__init__.py", line 867, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/Users/name/anaconda/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77697)
File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115040)
File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109165)
File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
File "src/lxml/parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104143)
File "<string>", line None
lxml.etree.XMLSyntaxError: line 295: b"htmlParseEntityRef: expecting ';'"
[Source parse ERR] http://seekingalpha.com/earnings/earnings-call-transcripts

Unicode decode error when trying to install Django

I am trying to install Django 1.6.11 but getting this error message:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 7: ordinal not in range(128)
Full traceback:
(lwc_env) C:\Users\lenovo\PycharmProjects\lwc_env>pip install django==1.6.11
Collecting django==1.6.11
Exception:
Traceback (most recent call last):
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\basecommand.py", line 223, in main
status = self.run(options, args)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\commands\install.py", line 293, in run
wb.build(autobuilding=True)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\wheel.py", line 705, in build
self.requirement_set.prepare_files(self.finder)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\req\req_set.py", line 334, in prepare_files
functools.partial(self._prepare_file, finder))
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\req\req_set.py", line 321, in _walk_req_to_install
more_reqs = handler(req_to_install)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\req\req_set.py", line 461, in _prepare_file
req_to_install.populate_link(finder, self.upgrade)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\req\req_install.py", line 249, in populate_link
self.link = finder.find_requirement(self, upgrade)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\index.py", line 486, in find_requirement
all_versions = self._find_all_versions(req.name)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\index.py", line 404, in _find_all_versions
index_locations = self._get_index_urls_locations(project_name)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\index.py", line 378, in _get_index_urls_locations
page = self._get_page(main_index_url)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\index.py", line 818, in _get_page
return HTMLPage.get_page(link, session=self.session)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\index.py", line 928, in get_page
"Cache-Control": "max-age=600",
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\requests\sessions.py", line 477, in get
return self.request('GET', url, **kwargs)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\download.py", line 373, in request
return super(PipSession, self).request(method, url, *args, **kwargs)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\requests\sessions.py", line 605, in send
r.content
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\requests\models.py", line 750, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\requests\models.py", line 673, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\requests\packages\urllib3\response.py", line 307, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\requests\packages\urllib3\response.py", line 243, in read
data = self._fp.read(amt)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\cachecontrol\filewrapper.py", line 54, in read
self.__callback(self.__buf.getvalue())
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\cachecontrol\controller.py", line 244, in cache_response
self.serializer.dumps(request, response, body=body),
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\download.py", line 276, in set
return super(SafeFileCache, self).set(*args, **kwargs)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\cachecontrol\caches\file_cache.py", line 99, in set
with self.lock_class(name) as lock:
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\lockfile\mkdirlockfile.py", line 18, in __init__
LockBase.__init__(self, path, threaded, timeout)
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\site-packages\pip\_vendor\lockfile\__init__.py", line 189, in __init__
hash(self.path)))
File "C:\Users\lenovo\PycharmProjects\lwc_env\lib\ntpath.py", line 85, in join
result_path = result_path + p_path
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 7: ordinal not in range(128)
I am using PyCharm on Windows 7, if that matters. Is there a problem with my computer, or am I just doing something wrong?
Thank you for your help!
