I am writing code that uses the pandas.read_csv function, and I like to prototype in a test Python file before moving code into my main Python file. The part of the CSV file I am trying to read just has data laid out in a seemingly random fashion, with no real columns or headers. I want to read the information into a DataFrame so I can parse the data, since it sits in exactly the same positions in every file. The test code and the main code use the same list of CSV files; the only differences are that the test code runs in a different folder and does not sit inside a function. In my test code I have no issue extracting data from the CSV file using read_csv, but in my main program the same call raises errors. In my test code I can easily use pd.read_csv this way:
for x in range(len(filelist)):
    df = pd.read_csv(filelist[x], index_col=False, nrows=15, header=None, usecols=(0, 1, 2, 3),
                     dtype={0: "string", 1: "string", 2: "string", 3: "string"})
    print(df)
The output is shown below:
Output from test code execution
However, when I port this over into my main code it doesn't behave the same way. If I copy the code exactly, it says there is no column 1, 2, or 3. My next step was to remove the usecols and dtype arguments, and then it gave me this error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 2
I tried adding an explicit comma delimiter and I tried changing the engine to python; neither worked. Eventually I gathered from the tokenizing error that the parser was expecting fewer fields on certain lines, so I split the read into two DataFrames, each holding its own number of columns. That finally worked. The DataFrames I created are structured as shown below:
df1 = pd.read_csv(filelist[x], skiprows=range(1), index_col=False, nrows=11, header=None)
df2 = pd.read_csv(filelist[x], skiprows=range(0, 13), index_col=False, nrows=2,
                  usecols=(0, 1, 2), header=None)
print(df1)
print(df2)
The output for this is shown below:
Output from main code execution
This gives me something I can work with, but it was extremely frustrating to work through and I have no idea why it was even necessary. I still have to go back and make some final adjustments, including all the calls to the variables I need from these DataFrames, so if I can figure out why the main code behaves differently it would make my life a little easier. Does anyone have any clue why I had to make these adjustments? It seems the main program either does not read empty cells, or takes the number of fields in the first row it sees and assumes the rest of the file matches. Any information would be greatly appreciated. Thank you.
I am adding the full error messages below. For debugging purposes, I changed the code to call only the first file in the list. This first traceback is from copying the read_csv command over exactly:
Traceback (most recent call last):
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 484, in <module>
checkfilevariables(filelist)
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 221, in checkfilevariables
df = pd.read_csv(filelist[0], index_col=False, nrows=15, header=None, usecols=(0,1,2,3),
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 933, in __init__
self._engine = self._make_engine(f, self.engine)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1231, in _make_engine
return mapping[engine](f, **self.options)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 146, in __init__
self._validate_usecols_names(
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\base_parser.py", line 913, in _validate_usecols_names
raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: [1, 2, 3]
This next error occurs after I remove usecols and dtype from the parameters.
Traceback (most recent call last):
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 483, in <module>
checkfilevariables(filelist)
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 221, in checkfilevariables
df = pd.read_csv(filelist[0], index_col=False, nrows=15, header=None)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 817, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 2
This final set of errors appears after I add the delimiter=',' and engine='python' parameters, with usecols and dtype still removed.
Traceback (most recent call last):
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 483, in <module>
checkfilevariables(filelist)
File "c:\Users\jacob.hollidge\Desktop\DCPR Threshold\DCPRthresholdV2.0.py", line 221, in checkfilevariables
df = pd.read_csv(filelist[0], index_col=False, nrows=15, header=None, delimiter=',', engine='python')
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\python_parser.py", line 270, in read
alldata = self._rows_to_cols(content)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\python_parser.py", line 1013, in _rows_to_cols
self._alert_malformed(msg, row_num + 1)
File "C:\Users\jacob.hollidge\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\python_parser.py", line 739, in _alert_malformed
raise ParserError(msg)
pandas.errors.ParserError: Expected 2 fields in line 13, saw 4
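For what it's worth, the tokenizing behavior in the question above can be reproduced with made-up data: the parser sizes its rows from the first line unless told otherwise. Passing explicit labels through `names` is one workaround that avoids splitting the file into two reads (a sketch with invented data, not the asker's actual files):

```python
import io

import pandas as pd

# Made-up stand-in for a "ragged" file: line 1 has one field, later lines more.
raw = "header line\na,b\n1,2,3,4\n"

# With explicit labels the parser expects 4 columns everywhere, so shorter
# rows are padded with NaN instead of raising
# "ParserError: Expected 1 fields in line 2, saw 2".
df = pd.read_csv(io.StringIO(raw), header=None, names=range(4), dtype="string")
print(df.shape)  # (3, 4)
```

Whether padding with NaN is acceptable depends on the file layout, but it keeps everything in one DataFrame.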
I am working with Streamlit to create a tool that takes user input (a CSV file name) and cleans the data, producing the output as a DataFrame. I continuously get OSError: [Errno 22] Invalid argument: 'M:/Desktop/AutomationProject/'
I am aware of the past solutions to this error; they all say to change backslashes to forward slashes on Windows as a quick fix, but after doing this I still have the same issue.
Note that my tool still works when a file name is entered; it just consistently shows the error below.
Thanks in advance for your help!
Code:
st.header('1 - Express Autocalls')
autocall_gbp_file = str(st.text_input("Please type in your Autocall File Name (GBP)"))
express_gbp = pd.read_csv("M:/Desktop/AutomationProject/" + autocall_gbp_file)
OSError: [Errno 22] Invalid argument: 'M:/Desktop/AutomationProject/'
Traceback:
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\streamlit\scriptrunner\script_runner.py", line 475, in _run_script
exec(code, module.__dict__)
File "M:\Desktop\AutomationProject\AutocallApp.py", line 176, in <module>
express_gbp = pd.read_csv("M:/Desktop/AutomationProject/" + autocall_gbp_file)
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\parsers\readers.py", line 933, in __init__
self._engine = self._make_engine(f, self.engine)
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\parsers\readers.py", line 1217, in _make_engine
self.handles = get_handle(  # type: ignore[call-overload]
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\common.py", line 789, in get_handle
handle = open(
The usual best practice to keep OS paths consistent across platforms in Python is to use the os module:
import os

# here, the script is not consistent across OSes,
# and can be difficult to format correctly for Windows
path1 = "Desktop/" + "folder1/" + "data.csv"
with open(path1, "r") as file:
    pass

# instead, do:
path2 = os.path.join("Desktop", "folder1", "data.csv")
with open(path2, "r") as file:
    pass
# now your script can find your Windows files,
# and the same script works on macOS and Linux
This keeps paths consistent across platforms, so you can avoid meticulous string formatting.
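The newer pathlib module gives the same portability with operator syntax; the folder and file names below are purely illustrative:

```python
from pathlib import Path

# Build the path with the / operator; separators are handled per-OS.
csv_path = Path("Desktop") / "folder1" / "data.csv"

# Components stay inspectable, and reading is a one-liner:
#     text = csv_path.read_text()
print(csv_path.name)        # data.csv
print(csv_path.as_posix())  # Desktop/folder1/data.csv
```

A Path object can be passed directly to open() or pd.read_csv, so nothing else in the script has to change.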
I am working with Streamlit in Python to produce a tool that takes a user's input of a CSV filename and then cleans and tabulates the data within the file.
I have encountered an issue where, before the user has entered their filename, my Streamlit site shows a "FileNotFoundError: [Errno 2] No such file or directory:"
This is expected, because the user has not entered their filename yet; once a filename is entered the code runs smoothly. I am hoping to overcome this issue, but as a relative newcomer to Python I am quite unsure how!
Please see code snippet below
autocall_gbp_file = str(st.text_input("Please type in your Autocall File Name (GBP)"))
filepath = "M:/Desktop/AutomationProject/"
express_gbp = pd.read_csv(filepath + autocall_gbp_file + ".csv")
st.write('Saved!')
The exact error I get before any user input has been taken is:
FileNotFoundError: [Errno 2] No such file or directory:
'M:/Desktop/AutomationProject/.csv'
Traceback:
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\streamlit\scriptrunner\script_runner.py", line 475, in _run_script
exec(code, module.__dict__)
File "M:\Desktop\AutomationProject\AutocallApp.py", line 179, in <module>
express_gbp = pd.read_csv(filepath+autocall_gbp_file+".csv")
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\parsers\readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\parsers\readers.py", line 933, in __init__
self._engine = self._make_engine(f, self.engine)
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\parsers\readers.py", line 1217, in _make_engine
self.handles = get_handle(  # type: ignore[call-overload]
File "C:\Users\adavie18\.conda\envs\projectenv\lib\site-packages\pandas\io\common.py", line 789, in get_handle
handle = open(
Thanks in advance to anyone who can offer a suggestion!
The general pattern for both Streamlit and Python in general is to test for the value existing:
if autocall_gbp_file:
    express_gbp = pd.read_csv(filepath + autocall_gbp_file + ".csv")
When the Streamlit app runs before a user has typed anything, autocall_gbp_file is an empty string, which is falsy. By writing if autocall_gbp_file:, you only run the pandas read_csv after someone has entered a value.
Separately, you're better off developing this with st.file_uploader than with text_input, since the Streamlit app doesn't necessarily have access to the user's filesystem or the same drive mappings as the machine you are developing on. With st.file_uploader, the user provides the actual file, not a reference to where it might be located.
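The guard is easy to unit-test if you pull it into a small helper; build_csv_path below is a made-up name, not part of Streamlit, and just isolates the truthiness check:

```python
def build_csv_path(folder: str, filename: str):
    """Return the full path once the user has typed something, else None.

    st.text_input returns an empty string before any input, and "" is
    falsy, so the guard skips the read on the first script run.
    """
    if not filename:
        return None
    return folder + filename + ".csv"


# Before any input: nothing to read yet.
print(build_csv_path("M:/Desktop/AutomationProject/", ""))     # None
# After input: a complete path to hand to pd.read_csv.
print(build_csv_path("M:/Desktop/AutomationProject/", "gbp"))  # M:/Desktop/AutomationProject/gbp.csv
```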
I am using PyCharm as my IDE, and the CSV file I'm using came from MS Excel. I've encoded the CSV as UTF-8 and am trying to read the file using pandas. I want to be able to distinguish between objects and ints when I call df.info(), which is also why I didn't change the encoding to 'latin-1' or 'ISO...'. My code looks like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
cols = ['sentiment','id','date','query_string','user','text']
df = pd.read_csv("trainingandtestdata\\training.1600000.processed.noemoticon.csv", header=None,
                 names=cols, encoding='utf-8')  # also tried: low_memory=False, dtype='unicode', encoding='latin1'
df.head()
df.info()
df.sentiment.value_counts()
My error looks like this. How do I fix the "can't decode bytes in position xxxx to xxxx" error?
"C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\Scripts\python.exe" "C:/Users/dashg/PycharmProjects/Twitter Sentiment/Reviewer.py"
Traceback (most recent call last):
File "C:/Users/dashg/PycharmProjects/Twitter Sentiment/Reviewer.py", line 6, in <module>
df = pd.read_csv("trainingandtestdata\\training.1600000.processed.noemoticon.csv", header=None,
names=cols, encoding='utf-8')
File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 454, in _read
data = parser.read(nrows)
File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 929, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 2063, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 51845-51846: invalid continuation byte

Process finished with exit code 1
Your file is not UTF-8 encoded, yet you are passing encoding='utf-8' to read_csv. Use a different encoding to solve the problem, such as 'latin' or 'ISO-8859-1'. I refer you to this link for help.
Worst case, if none of that works, you can open the file in binary mode (open(file, 'rb')) and parse it yourself, splitting each line on the CSV delimiter!
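If you don't know the encoding up front, you can probe the raw bytes before handing the file to pandas; sniff_encoding is a hypothetical helper, and the candidate list is just an example:

```python
def sniff_encoding(raw: bytes, candidates=("utf-8", "latin-1")):
    """Return the first candidate encoding that decodes the bytes cleanly."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


# b"\xe9" is "é" in latin-1 but an invalid byte sequence in UTF-8,
# the same failure mode as in the traceback above.
print(sniff_encoding("plain ascii".encode("utf-8")))  # utf-8
print(sniff_encoding(b"caf\xe9"))                     # latin-1
```

Once you know the encoding, pass it straight through: pd.read_csv(path, encoding=enc).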
I was having the same problem, but in my case the solution was really easy. My IDE is PyCharm 2020.1 and the .csv has ISO-8859-1 encoding. I tried everything without luck, so I decided to check my IDE configuration. I went to:
File
Settings
Left column: Editor
In Editor: File encoding
Then I added my .csv file with the + button on the right side, and finally changed the IDE's configuration. I changed it all to ISO, because by default it was UTF-8, and used the exact character set to work with the file, in my case: ?.
Hope this works.
It's better to save that CSV as an .xlsx file and read it with pd.read_excel.
I've got a problem with pandas read_csv. I have many txt files of stock-market data. They look like this:
SecCode,SecName,Tdate,Ttime,LastClose,OP,CP,Tq,Tm,Tt,Cq,Cm,Ct,HiP,LoP,SYL1,SYL2,Rf1,Rf2,bs,s5,s4,s3,s2,s1,b1,b2,b3,b4,b5,sv5,sv4,sv3,sv2,sv1,bv1,bv2,bv3,bv4,bv5,bsratio,spd,rpd,depth1,depth2
600000,浦发银行,20120104,091501,8.490,.000,.000,0,.000,0,0,.000,0,.000,.000,.000,.000,.000,.000, ,.000,.000,.000,.000,8.600,8.600,.000,.000,.000,.000,0,0,0,0,1100,1100,38900,0,0,0,.00,.000,.00,.00,.00
600000,浦发银行,20120104,091506,8.490,.000,.000,0,.000,0,0,.000,0,.000,.000,.000,.000,.000,.000, ,.000,.000,.000,.000,8.520,8.520,.000,.000,.000,.000,0,0,0,0,56795,56795,33605,0,0,0,.00,.000,.00,.00,.00
600000,浦发银行,20120104,091511,8.490,.000,.000,0,.000,0,0,.000,0,.000,.000,.000,.000,.000,.000, ,.000,.000,.000,.000,8.520,8.520,.000,.000,.000,.000,0,0,0,0,56795,56795,34605,0,0,0,.00,.000,.00,.00,.00
600000,浦发银行,20120104,091551,8.490,.000,.000,0,.000,0,0,.000,0,.000,.000,.000,.000,.000,.000, ,.000,.000,.000,.000,8.520,8.520,.000,.000,.000,.000,0,0,0,0,56795,56795,35205,0,0,0,.00,.000,.00,.00,.00
600000,浦发银行,20120104,091621,8.490,.000,.000,0,.000,0,0,.000,0,.000,.000,.000,.000,.000,.000, ,.000,.000,.000,.000,8.520,8.520,.000,.000,.000,.000,0,0,0,0,57795,57795,34205,0,0,0,.00,.000,.00,.00,.00
I use this code to read it:
fields = ['SecCode', 'Tdate','Ttime','LastClose','OP','CP','Rf1','Rf2']
df = pd.read_csv('SHL1_TAQ_600000_201201.txt',usecols=fields)
But I got a problem:
Traceback (most recent call last):
File "E:/workspace/Senti/highlevel/highlevel.py", line 8, in <module>
df = pd.read_csv('SHL1_TAQ_600000_201201.txt',usecols=fields,header=1)
File "D:\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "D:\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 315, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "D:\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 645, in __init__
self._make_engine(self.engine)
File "D:\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 799, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "D:\Anaconda2\lib\site-packages\pandas\io\parsers.py", line 1257, in __init__
raise ValueError("Usecols do not match names.")
ValueError: Usecols do not match names.
I can't find any problem similar to mine. It's also weird that when I copy the txt file's contents into another file, the code runs well, yet the original one causes the above problem. How can I solve it?
In your message, you said that you are running:
df = pd.read_csv('SHL1_TAQ_600000_201201.txt', usecols=fields)
which did not throw an error for me or for @Anil_M. But from your traceback you can see that the command actually used is another one:
df = pd.read_csv('SHL1_TAQ_600000_201201.txt', usecols=fields, header=1)
which includes header=1, and that throws the error mentioned.
So I would guess that the error comes from some confusion in your code.
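The interaction is easy to reproduce with a few made-up rows: header=1 promotes the second line to the header, so the names taken from the first line no longer exist for usecols to match:

```python
import io

import pandas as pd

# Invented rows mimicking the question's layout (not the real file).
raw = "SecCode,SecName,Tdate\n600000,bank,20120104\n600000,bank,20120105\n"
fields = ["SecCode", "Tdate"]

# Default header=0: the first line supplies the column names, so usecols matches.
df = pd.read_csv(io.StringIO(raw), usecols=fields)
print(df.columns.tolist())  # ['SecCode', 'Tdate']

# header=1 would instead use the second line ('600000,bank,20120104') as the
# header, so usecols=fields no longer matches and pandas raises a
# "Usecols do not match..." ValueError.
```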
Use names instead of usecols when specifying the parameter.