length of the first column in csv - python

I am new to Python. What is the best way to get the length of the first column of a CSV file? I tried this:
import csv
with open('export_U43V_WIRKNETZ.csv', 'rb') as f:
    reader = csv.reader(f)
    first_col_len = len(next(zip(*reader)))
    print(first_col_len)
but got this error:
File "C:/Projekty/valispace_api/CreatingCvs.py", line 6, in <module>
    first_col_len = len(next(zip(*reader)))
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Thanks
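The error comes from opening the file in binary mode ('rb'): in Python 3, csv.reader needs an iterator of strings, i.e. a file opened in text mode. A minimal sketch of the fix, assuming the file is a plain text CSV (the newline='' option follows the csv module's documentation):
import csv
with open('export_U43V_WIRKNETZ.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    # zip(*reader) transposes the rows, so the first column arrives as one tuple;
    # its length is the number of rows. Note this reads the whole file into memory.
    first_col_len = len(next(zip(*reader)))
    print(first_col_len)
If the file is large, counting rows with sum(1 for _ in reader) avoids building the transposed tuples.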

Related

Pyarrow/Parquet - Cast all null columns to string during batch processing

There is a problem with my code that I have not been able to solve for a while now.
I'm trying to convert a tar.gz-compressed CSV file to Parquet. The file itself, when uncompressed, is about 700 MB. The processing is done on a memory-restricted system, so I have to process the file in batches.
I figured out how to read the tar.gz as a stream, extract the file I need and use pyarrow's open_csv() to read batches. From here, I want to save the data to a Parquet file by writing in batches.
This is where the problem appears. The file has lots of columns that don't have any values, but once in a while a single value appears around line 500,000 or so, so pyarrow does not infer the dtype properly. Most of the columns are therefore of dtype null. My idea is to modify the schema and cast these columns to string, so any value is valid. Modifying the schema works fine, but when I run the code, I get this error.
Traceback (most recent call last):
File "b:\snippets\tar_converter.py", line 38, in <module>
batch = reader.read_next_batch()
File "pyarrow\ipc.pxi", line 682, in pyarrow.lib.RecordBatchReader.read_next_batch
File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: In CSV column #49: CSV conversion error to null: invalid value '0.0000'
Line 38 is this one:
batch = reader.read_next_batch()
Does anyone have any idea how to enforce the schema on the batches so the conversion succeeds?
Here is my code.
import io
import os
import tarfile
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv
import logging
srcs = list()
path = "C:\\data"
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith("tar.gz"):
            srcs.append(os.path.join(root, name))
for source_file_name in srcs:
    file_name: str = source_file_name.replace(".tar.gz", "")
    target_file_name: str = source_file_name.replace(".tar.gz", ".parquet")
    clean_file_name: str = os.path.basename(source_file_name.replace(".tar.gz", ""))
    # download CSV file, preserving folder structure
    logging.info(f"Processing '{source_file_name}'.")
    with io.open(source_file_name, "rb") as file_obj_in:
        # unpack all files to temp_path
        file_obj_in.seek(0)
        with tarfile.open(fileobj=file_obj_in, mode="r") as tf:
            file_obj = tf.extractfile(f"{clean_file_name}.csv")
            file_obj.seek(0)
            reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=25*1024*1024))
            schema = reader.schema
            null_cols = list()
            for index, entry in enumerate(schema.types):
                if entry.equals(pa.null()):
                    schema = schema.set(index, schema.field(index).with_type(pa.string()))
                    null_cols.append(index)
            with pq.ParquetWriter(target_file_name, schema) as writer:
                while True:
                    try:
                        batch = reader.read_next_batch()
                        table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
                        batch = table.to_batches()[0]
                        writer.write_batch(batch)
                    except StopIteration:
                        break
Also, I could leave out this part:
batch = reader.read_next_batch()
table = pa.Table.from_batches(batches=[batch]).cast(target_schema=schema)
batch = table.to_batches()[0]
But then the error is like this (shortened), showing that the schema change works at least.
Traceback (most recent call last):
File "b:\snippets\tar_converter.py", line 39, in <module>
writer.write_batch(batch)
File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 981, in write_batch
self.write_table(table, row_group_size)
File "C:\Users\me\AppData\Roaming\Python\Python39\site-packages\pyarrow\parquet\__init__.py", line 1004, in write_table
raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: null
VAT_RECEIVABLE_ID: null
MONTHLY_AMOUNT_EFFECTIVE_DATE: null
vs.
file:
ACCOUNT_NAME: string
BOOK_VALUE: double
ESTIMATED_TO_REALISE: double
VAT_PAYABLE_ID: string
VAT_RECEIVABLE_ID: string
MONTHLY_AMOUNT_EFFECTIVE_DATE: string
Thank you!
So I think I figured it out; I wanted to post it for those who have similar issues.
Also, thanks to all who had a look and helped!
I worked around this by reading the file twice.
In the first pass I only open the CSV stream to read the schema, convert the null columns to string and close the stream (this is important if you reuse the same variable name). Then I read the file again, this time passing the modified schema to the reader via ConvertOptions(column_types=...). Thanks to #0x26res, whose comment gave me the idea.
# get initial schema by reading one batch
initial_reader = csv.open_csv(file_obj, read_options=csv.ReadOptions(block_size=16*1024*1024))
schema = initial_reader.schema
for index, entry in enumerate(schema.types):
    if entry.equals(pa.null()):
        schema = schema.set(index, schema.field(index).with_type(pa.string()))
# now use the modified schema for the reader
# must close the old reader first, otherwise wrong data is loaded
file_obj.close()
file_obj = tf.extractfile(f"{file_name}.csv")
file_obj.seek(0)
reader = csv.open_csv(file_obj,
                      read_options=csv.ReadOptions(block_size=16*1024*1024),
                      convert_options=csv.ConvertOptions(column_types=schema))
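Putting the two passes together with the writer from the question, the batch loop no longer needs the cast, because the reader already delivers the columns with the modified types. A sketch, reusing the variable names from the snippets above:
with pq.ParquetWriter(target_file_name, schema) as writer:
    while True:
        try:
            # batches now arrive already typed according to the modified schema
            batch = reader.read_next_batch()
            writer.write_batch(batch)
        except StopIteration:
            break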

Delete a Line from BIG CSV file Python

I have an 11 GB CSV file which has some corrupt lines I have to delete. I have identified the corrupted line numbers from an ETL interface.
My program works with small datasets; however, when I run it on the main file I get a MemoryError. Below is the code I'm using. Do you have any suggestions to make it work?
row_to_delete = 101068
filename = "EKBE_0_20180907_065907 - Copy.csv"
with open(filename, 'r', encoding='utf8', errors='ignore') as file:
    data = file.readlines()
    print(data[row_to_delete - 1])
    data[row_to_delete - 1] = ''
with open(filename, 'wb', encoding="utf8", errors='ignore') as file:
    file.writelines(data)
Error:
Traceback (most recent call last):
File "/.PyCharmCE2018.2/config/scratches/scratch_7.py", line 7, in <module>
data = file.readlines()
MemoryError
Rather than read the whole file into memory, loop over the input file and write all lines except the line you need to delete to a new file. Use enumerate() to keep a counter if you need to delete by index:
row_to_delete = 101068
filename = "EKBE_0_20180907_065907 - Copy.csv"
with open(filename, 'r', encoding='utf8', errors='ignore') as inputfile, \
     open(filename + '.fixed', 'w', encoding='utf8') as outputfile:
    for index, line in enumerate(inputfile):
        if index == row_to_delete - 1:  # enumerate() is 0-based, the line number is 1-based
            continue  # don't write the line that matches
        outputfile.write(line)
Rather than use an index, you could even detect a bad line directly in code this way.
Note that this writes to a new file, with the same name but with .fixed added.
You can move that file back to replace the old file if you want to, with os.rename(), once you are done copying all but the bad line:
os.rename(filename + '.fixed', filename)
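For the "detect a bad line directly" variant, here is a sketch of one possible check, assuming a corrupt line is one whose field count does not match the header's; what counts as "corrupt" depends on your data, so the condition is a placeholder:
import csv

filename = "EKBE_0_20180907_065907 - Copy.csv"
with open(filename, 'r', encoding='utf8', errors='ignore', newline='') as inputfile, \
     open(filename + '.fixed', 'w', encoding='utf8', newline='') as outputfile:
    reader = csv.reader(inputfile)
    writer = csv.writer(outputfile)
    header = next(reader)
    writer.writerow(header)
    for row in reader:
        if len(row) != len(header):
            continue  # skip rows whose field count doesn't match the header
        writer.writerow(row)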

Python json to CSV

I am trying to convert a JSON data set file into CSV. I am really new to Python and have been looking on the forums but cannot seem to resolve my issue. I have attached the JSON data URL below along with my code. Thanks in advance!
https://data.ny.gov/api/views/nqur-w4p7/rows.json?accessType=DOWNLOAD
import json
import csv

inputFile = ("rows.json?accessType=DOWNLOAD", "r")
data = json.load(inputFile)

with open("Data.csv", "wb") as csvfile:
    csv_writer = csv.DictWriter(csvfile, delimiter=",",
                                fieldnames=["data", "new_york_state_average_gal", "albany_average_gal",
                                            "binghamton_average_gal", "buffalo_average_gal",
                                            "nassau_average_gal", "new_york_city_average_gal",
                                            "rochester_average_gal", "utica_average_gal"])
    csv_writer.writerheader()
    csv_writer.writerows(data)
Here is the error I am getting:
File "ChangeDataType.py", line 5, in <module>
data = json.load(inputFile)
File "/usr/lib64/python3.4/json/__init__.py", line 265, in load
return loads(fp.read(),
AttributeError: 'tuple' object has no attribute 'read'
Your error happens because you made a tuple:
inputFile = ("rows.json?accessType=DOWNLOAD", "r")
and you're trying to use json.load on that tuple. Since json.load works on file objects, you need to call the open function:
inputFile = open("rows.json?accessType=DOWNLOAD", "r")
The "r" part indicates you're opening the file for reading.

How to write the contents of one CSV file to another

I have a csv file and I want to transfer the raw data without the headers to a new csv file and have the rows and columns the same as the original.
IRIS_data = "IRIS_data.csv"

with open(IRIS_data, 'wb') as data:
    wr = csv.writer(data, quoting=csv.QUOTE_ALL)
    with open(IRIS) as f:
        next(f)
        for line in f:
            wr.writerow(line)
The code above is my most recent attempt, when I try run it I get the following error:
a bytes-like object is required, not 'str'
It's because you opened the output file with open(IRIS_data, 'wb'), which opens it in binary mode, and the input file with just open(IRIS), which opens it in text mode.
In Python 3, you should open both files in text mode and specify the newline='' option; see the examples in the csv module's documentation.
To fix it, change them as follows:
with open(IRIS_data, 'w', newline='') as data:
and
with open(IRIS, newline='') as f:
However, there are other issues with your code. Here's how to use those statements to get what I think you want:
import csv

IRIS = "IRIS.csv"
IRIS_data = "IRIS_data.csv"

with open(IRIS, 'r', newline='') as f, open(IRIS_data, 'w', newline='') as data:
    next(f)  # Skip over header in input file.
    writer = csv.writer(data, quoting=csv.QUOTE_ALL)
    writer.writerows(line.split() for line in f)
Contents of IRIS_data.csv file after running the script with your sample input data:
"6.4","2.8","5.6","2.2","2"
"5","2.3","3.3","1","1"
"4.9","2.5","4.5","1.7","2"
"4.9","3.1","1.5","0.1","0"
"5.7","3.8","1.7","0.3","0"
"4.4","3.2","1.3","0.2","0"
"5.4","3.4","1.5","0.4","0"
"6.9","3.1","5.1","2.3","2"
"6.7","3.1","4.4","1.4","1"
"5.1","3.7","1.5","0.4","0"
You have to encode the line you are writing, like this:
wr.writerow(line.encode("utf8"))
Also open your file using open(..., 'wb'). This opens the file in binary mode, so you are certain the file actually is in binary mode. Indeed, it is better to state the encoding explicitly than to assume it; enforcing the encoding for both reading and writing will save you lots of trouble.

Reading file error in Python

I am brand new to Python and am having a terrible time trying to read in a .csv file to work with. The code I am using is the following:
>>> dat = open('blue.csv','r')
>>> print dat()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'file' object is not callable
Could anyone help me diagnose this error or lend any suggestions on how to read the file in? Sorry if there is an answer to this question already, but I couldn't seem to find it.
You need to use read() in order to read the file:
dat = open('blue.csv','r')
print dat.read()
Alternatively, you can use a with statement so the file is closed automatically:
with open('blue.csv', 'r') as o:
    data = o.read()
You can read the file:
dat = open('blue.csv', 'r').read()
Or you can open the file as a csv and read it row by row:
import csv
infile = open('blue.csv', 'r')
csvfile = csv.reader(infile)
for row in csvfile:
    print row
    column1 = row[0]
    print column1
Check out the csv docs for more options for working with csv files.
