Error merging multiple CSV files - Python

I'm trying to merge several CSV files into one.
After looking at several methods, I found this one:
files = glob.glob("D:\\green_lake\\Projects\\covid_19\\tabelas_relacao\\acre\\*.csv")
files_merged = pd.concat([pd.read_csv(df) for df in files], ignore_index=True)
When I run it, this error is returned:
>>> files_merged = pd.concat([pd.read_csv(df) for df in files], ignore_index=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 678, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\readers.py", line 1253, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\Leonardo\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in
read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 243, saw 4
I'm just starting to study Python, so if this is a silly mistake, I apologize ;)
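Since the error says pandas expected 1 field in line 243 but saw 4, one of the files in that folder most likely has a different delimiter or a malformed row. A minimal diagnostic sketch (not a guaranteed fix) that reads each file separately so the failing one identifies itself:
import glob
import pandas as pd

files = glob.glob("D:\\green_lake\\Projects\\covid_19\\tabelas_relacao\\acre\\*.csv")
frames = []
for path in files:
    try:
        # reading one file at a time makes the offending file obvious
        frames.append(pd.read_csv(path))
    except pd.errors.ParserError as exc:
        print(f"could not parse {path}: {exc}")
files_merged = pd.concat(frames, ignore_index=True)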

Related

pandas how to read this row?

Data sample: the program goes wrong on the second row, because it has 7 commas while the normal rows only have 6.
7558,1488,1738539,,,,1
7559,1489,1702292,,"(segment \"Pesnya, ili Kak velikij Luarsab khor organizovyval\")",8,1
7560,1489,2146930,1975,,21,1
It is from the IMDB dataset's cast_info table. ([IMDB][2] is used in a database task called cardinality estimation.) Its separator is ",", but when the separator also appears inside a quoted string, pandas doesn't recognize it correctly.
The error log:
File "\pytorch\lib\site-packages\pandas\io\parsers\readers.py", line 488, in _read
return parser.read(nrows)
File "\pytorch\lib\site-packages\pandas\io\parsers\readers.py", line 1047, in read
index, columns, col_dict = self._engine.read(nrows)
File "\pytorch\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 223, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 801, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 7559, saw 8
How can I solve it?
[2]: https://www.imdb.com/interfaces/
Try this; I think it should work:
import pandas as pd
df = pd.read_csv(data_path, sep=",")
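If the plain sep="," read still trips over the quoted field, note that the sample rows wrap the comma-containing value in double quotes and escape the inner quotes with a backslash, so one option is to tell the parser about the quote and escape characters. A hedged sketch (the file name is hypothetical, and header=None assumes the file has no header row):
import pandas as pd

# "cast_info.csv" is a hypothetical file name; quotechar/escapechar keep the
# commas inside "(segment \"Pesnya, ...\")" together in a single field
df = pd.read_csv("cast_info.csv", sep=",", header=None,
                 quotechar='"', escapechar="\\")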

How to set many columns on Pandas Python?

I want to insert more than a hundred columns into a CSV file, but it seems the pandas library has a limit on the number of columns.
Here is the error message:
Traceback (most recent call last):
File "metric.py", line 91, in <module>
finalFile(sys.argv[1])
File "metric.py", line 80, in finalFile
data = pd.read_csv(f, header=None, dtype=str)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 454, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 948, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1180, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 2010, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 540, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
And below is my function:
def finalFile(fname):
    output = pd.DataFrame()
    for file_name in os.listdir('test/sciprt-temp/'):
        if file_name.startswith(fname):
            with open(os.path.join('test/sciprt-temp/', file_name)) as f:
                data = pd.read_csv(f, header=None, dtype=str)
            output[file_name.rsplit('.', 4)[2]] = data[1]
    output.insert(0, 'timestamp', dt.datetime.now().timestamp())
    output.insert(0, 'hostname', fname.rsplit('-', 3)[0])
    output.set_index(output.columns[0], inplace=True)
    output.to_csv(fname.rsplit('.', 2)[2] + ".csv")

finalFile(sys.argv[1])
It seems to work fine when inserting a few columns, but it doesn't work with more columns. Here is a sample of the data:
hostname,timestamp,-diskstats_latency-sda-avgrdwait-g,-diskstats_latency-sda-avgwait-g,-diskstats_latency-sda-avgwrwait-g,-diskstats_latency-sda-svctm-g,-diskstats_latency-sda_avgwait-g
test.test.com,1617779170.62498,2.7979746835e-03,6.6681051841e-03,7.1533659185e-03,2.5977601795e-04,6.6681051841e-03
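EmptyDataError: No columns to parse from file is raised when read_csv receives a file with no content; pandas has no column limit anywhere near a hundred, so the failure is most likely one empty file among the many being read. A minimal sketch, reusing the directory and loop from the question, that skips zero-byte files before parsing:
import os
import sys
import pandas as pd

fname = sys.argv[1]              # same argument the question passes to finalFile()
directory = 'test/sciprt-temp/'
for file_name in os.listdir(directory):
    if file_name.startswith(fname):
        path = os.path.join(directory, file_name)
        if os.path.getsize(path) == 0:
            # a zero-byte file is what triggers EmptyDataError in read_csv
            print("skipping empty file:", path)
            continue
        data = pd.read_csv(path, header=None, dtype=str)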

Trying to split csv file and getting Error tokenizing data

I am trying to split a CSV file into multiple CSVs, but keep the CSV header.
The code I am trying is:
import pandas as pd
chunk_size = 500000
batch_no = 1
for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv', chunksize=chunk_size):
    chunk.to_csv(file_path + str(batch_no) + '.csv', index=False)
    batch_no += 1
And the error I get is this one:
Traceback (most recent call last):
File "splitcsv.py", line 5, in <module>
for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv', chunksize=chunk_size, encoding='utf-8'):
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1128, in __next__
return self.get_chunk()
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1188, in get_chunk
return self.read(nrows=size)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 908, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 274, saw 2
You can try to skip the lines producing errors by adding the error_bad_lines=False argument to the pd.read_csv call. Then your code would look like this:
import pandas as pd
chunk_size = 500000
batch_no = 1
for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv', chunksize=chunk_size, error_bad_lines=False):
    chunk.to_csv(file_path + str(batch_no) + '.csv', index=False)
    batch_no += 1
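Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions the equivalent option is on_bad_lines='skip'. A sketch assuming the same input path, with file_path as a hypothetical output prefix:
import pandas as pd

file_path = '/Users/illys/Desktop/part_'   # hypothetical output prefix
chunk_size = 500000
batch_no = 1
# on_bad_lines='skip' drops malformed rows instead of raising ParserError
for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv',
                         chunksize=chunk_size, on_bad_lines='skip'):
    chunk.to_csv(file_path + str(batch_no) + '.csv', index=False)
    batch_no += 1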

MemoryError, pandas read_csv, 32 bit, don't want to use chunksize

I am currently upgrading from Python 2 to Python 3 (Anaconda, 32-bit).
Upgrading to 64-bit is not an option at the moment.
Upon calling my function, I am getting a MemoryError:
def M2():
    print('Loading datasets...')
    e1 = pd.read_csv(working_dir + "E1.txt", sep=',')
E1.txt is 300,000 KB.
Is there a better way of reading in this data?
Update
I do not want to use chunksize, as it will not read my data into a single DataFrame.
I have narrowed my .txt file down from 300,000 KB to 50,000 KB and still get the memory error.
Traceback:
Traceback (most recent call last):
File "<ipython-input-99-99e71d524b4b>", line 1, in <module>
runfile('C:/AppData/FinRecon/py_code/python3/DataJoin.py', wdir='C:/AppData/FinRecon/py_code/python3')
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 474, in <module>
M2()
File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 31, in M2
e1 = pd.read_csv(working_dir+"E1.txt",sep=',')
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\io\parsers.py", line 435, in _read
data = parser.read(nrows)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\io\parsers.py", line 1154, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\frame.py", line 392, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\construction.py", line 212, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\construction.py", line 61, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\managers.py", line 1666, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\managers.py", line 1734, in form_blocks
int_blocks = _multi_blockify(items_dict['IntBlock'])
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\managers.py", line 1819, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\managers.py", line 1861, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
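If chunksize is off the table, two things that usually cut memory on a 32-bit interpreter are reading only the columns you need (usecols) and passing explicit, smaller dtypes so read_csv does not default to 64-bit integers and floats. A sketch with hypothetical column names and dtypes; working_dir is assumed to be defined as in the question's script:
import pandas as pd

# hypothetical column names and dtypes; replace with the real schema of E1.txt
dtypes = {"id": "int32", "amount": "float32", "flag": "int8"}
e1 = pd.read_csv(working_dir + "E1.txt", sep=',',
                 usecols=list(dtypes), dtype=dtypes)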

Replicate a dataset with dask to all workers

I am using dask with the distributed scheduler. I am trying to replicate a dataset read from a CSV on S3 to all worker nodes. Example:
from distributed import Executor
import dask.dataframe as dd
e = Executor('127.0.0.1:8786', set_as_default=True)
df = dd.read_csv('s3://bucket/file.csv', blocksize=None)
df = e.persist(df)
e.replicate(df)
distributed.utils - ERROR - unhashable type: 'list'
Traceback (most recent call last):
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/distributed/utils.py", line 102, in f
result[0] = yield gen.maybe_future(func(*args, **kwargs))
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/distributed/executor.py", line 1347, in _replicate
branching_factor=branching_factor)
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/distributed/core.py", line 444, in send_recv_from_rpc
result = yield send_recv(stream=stream, op=key, **kwargs)
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/gen.py", line 1024, in run
yielded = self.gen.send(value)
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/distributed/core.py", line 345, in send_recv
six.reraise(*clean_exception(**response))
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/distributed/core.py", line 211, in handle_stream
result = yield gen.maybe_future(handler(stream, **msg))
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/tornado/gen.py", line 285, in wrapper
yielded = next(result)
File "/root/.miniconda/envs/dask_env/lib/python3.5/site-packages/distributed/scheduler.py", line 1324, in replicate
keys = set(keys)
TypeError: unhashable type: 'list'
Is this the correct way to replicate a dataframe? It appears that the object returned by e.persist(df) does not work with e.replicate for some reason.
This was a bug and has been resolved in https://github.com/dask/distributed/pull/473
