Error when reading csv with merged cells - python

I have a txt file that I open in Excel that has merged cells (see image).
These cause an error message when reading the file:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 1883, saw 2
At the moment I'm manually taking them out in Excel. I'm sure there must be a way to take these out when reading the file, but I can't find anything on SO. I'm not sure if I'm using the right terminology, though.
Using Excel may also be an option. I just wanted to see if there was a method using Python.

If you just want to skip the headers, you might look at this SO answer which suggests the following:
data = pd.read_csv('file1.csv', error_bad_lines=False)
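Note that error_bad_lines was deprecated in pandas 1.3; on newer versions the equivalent is the on_bad_lines argument. A minimal sketch, assuming the extra-field rows produced by the merged cells can simply be dropped:

import pandas as pd

# pandas >= 1.3: skip any row that has more fields than the header
# instead of raising the tokenizing error shown above.
data = pd.read_csv('file1.csv', on_bad_lines='skip')

# pandas < 1.3 equivalent:
# data = pd.read_csv('file1.csv', error_bad_lines=False)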

Related

How to write a large .txt file to a csv for Big Query dump?

I have a dataset that is 86 million rows x 20 columns with a header, and I need to convert it to a csv in order to dump it into Big Query (adding multiple tags from that). The logical solution is reading the .txt file with pd.read_csv, but I don't have enough memory on my device to hold 86 million rows and it will crash Jupyter.
I'm aware of other threads such as (How to convert a tab delimited text file to a csv file in Python) but my issue seems rather niche.
Is there a way I could go about this? I thought about Vaex, but I'm completely unfamiliar with the toolkit, and it doesn't seem to have a CSV writer built in.
Current thoughts would be:
import csv

csv_path = r'csv_test.csv'
txt_path = r'txt_test.txt'

with open(txt_path, "r") as in_text:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    with open(csv_path, "w") as out_csv:
        out_writer = csv.writer(out_csv, delimiter=',')
        for row in in_reader:
            out_writer.writerow(row)
Currently, I am receiving an error stating:
Error: field larger than field limit (131072)
It seems it's the maximum row count in a single column, so I'm quite a bit off.
I've gotten a csv of smaller files to generate (only using 3 of the 35 total .txt files), but when I attempt to use all of them, it fails with the code above.
Update: I have raised the csv field size limit to sys.maxsize and am still receiving this same error.
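(For context: raising the csv field-size limit is normally done with csv.field_size_limit, so the update above presumably refers to something like this sketch:)

import csv
import sys

# Lift the default 131072-character limit on a single field.
csv.field_size_limit(sys.maxsize)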
I have no way to verify if this works due to the sheer size of the dataset, but it seems like it /should/ work. Trying to read it with Vaex would work if I wasn't getting parsing errors due to there being commas within the data.
So I have 3 questions:
Is there a way I can write a larger sized csv?
Is there a way to dump in the large pipe delimited .txt file to Big Query in chunks as different csv's?
Can I dump 35 csv's into Big Query in one upload?
Edit:
here is a short dataframe sample:
|CMTE_ID| AMNDT_IND| RPT_TP| TRANSACTION_PGI| IMAGE_NUM| TRANSACTION_TP| ENTITY_TP| NAME| CITY| STATE| ZIP_CODE| EMPLOYER| OCCUPATION| TRANSACTION_DT| TRANSACTION_AMT| OTHER_ID| TRAN_ID| FILE_NUM| MEMO_CD| MEMO_TEXT| SUB_ID
0|C00632562|N|M4|P|202204139496092475|15E|IND|NAME, NAME|PALO ALTO|CA|943012820.0|NOT EMPLOYED|RETIRED|3272022|5|C00401224|VTEKDYJ78M3|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955005
1|C00632562|N|M4|P|202204139496092487|15E|IND|NAME, NAME|DALLAS|TX|752054324.0|SELF EMPLOYED|PHOTOGRAPHER|3272022|500|C00401224|VTEKDYJ7BD4|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955041
I think there is a bit of a red herring going on here:
Is there a way I can write a larger sized csv?
Yes, the reader and writer iterator style should be able to handle any length of file: they step through incrementally, and at no stage do they attempt to read the whole file into memory. Something else is going wrong in your example.
Is there a way to dump in the large tab-delimited .txt file to Big Query in chunks as different csv's?
You shouldn't need to.
Can I dump 35 csv's into Big Query in one upload?
That's more a Big Query API question, so I won't attempt to answer that here.
In your code, your text delimiter is set to a pipe, but in your question number 2, you describe it as being tab delimited. If you're giving the wrong delimiter to the code, it might try to read more content into a field than it's expecting, and fail when it hits some field-size limit. This sounds like it might be what's going on in your case.
Also, watch out when piping your file out and changing delimiters - in the data sample you posted, there are some commas embedded in the text, which might result in a corrupted file when it comes to reading it in again on the other side. Take some time to think about your target CSV dialect, in terms of text quoting, chosen delimiters etc.
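For instance, a quoting setting that wraps any field containing the output delimiter keeps those embedded commas from splitting fields on the way back in. A minimal sketch (the file name and sample row are only illustrative):

import csv

row = ["C00632562", "* EARMARKED CONTRIBUTION: SEE BELOW", "PALO ALTO, CA"]
with open("quoting_demo.csv", "w", newline="") as out_csv:
    # QUOTE_MINIMAL (the default) quotes a field only when it contains the
    # delimiter, the quote character, or a newline, so "PALO ALTO, CA"
    # survives a round trip through a comma-delimited file.
    writer = csv.writer(out_csv, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    writer.writerow(row)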
Try replacing the | with \t and see if that helps.
If you're only changing the delimiter from one thing to another, is that a useful process at all? Maybe forget the whole CSV nature of the file, read lines iteratively, and write them out without modifying them (readline/writelines style, see the sketch below), probably speeding things up in the process. Again, because this is iterative, you won't have to worry about loading the whole file into RAM; you just stream from the source to your target. Beware how long it might take, and if you've a patchy network, it can all go horribly wrong. But at least it's a different error!
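A minimal sketch of that streaming idea, using the paths from the question (iterating over the file object is the idiomatic equivalent of repeated readline calls):

txt_path = r'txt_test.txt'
csv_path = r'csv_test.csv'

# Stream line by line from source to target without parsing anything,
# so the 86 million rows never have to fit in memory at once.
with open(txt_path, "r") as in_text, open(csv_path, "w") as out_csv:
    for line in in_text:
        out_csv.write(line)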

openpyxl error raise ValueError('Min value is {0}'.format(self.min)) in opening heavy file with formatting

I'm trying to use openpyxl for the first time on a very heavy file that happens to be over 20,500 KB, has a lot of formatting and contains a VBA macro.
My code keeps returning the following error:
File " \Anaconda3\lib\site-packages\openpyxl\styles\alignment.py", line 52, in __init__
self.relativeIndent = relativeIndent
File " \Anaconda3\lib\site-packages\openpyxl\descriptors\base.py", line 107, in __set__
raise ValueError('Min value is {0}'.format(self.min))
ValueError: Min value is 0
Would anyone know what the problem is / how to access the file despite it? I'm trying to post data into an existing Excel file to simplify processes and replace heavy VBA code. So I can't just post it into a different xlsx file and call it using VBA code (that would defeat the purpose).
Thanks a lot!
Here is my code:
wb = load_workbook(filename='C:/dev/CodeRep/ProjectName/MainFile 2021_01.xlsm', read_only = False, keep_vba = True)
The traceback says that there is a problem with the Alignment definition in the workbook's stylesheet. openpyxl follows the OOXML specification very closely to minimise unpleasant surprises later; this is why it tends to raise exceptions or give warnings rather than let things pass.
For more details we'll need to see the XML source for the stylesheet, or the Alignments part at least. You can find this by unzipping the XLSM file and looking for the styles.xml file. That will give you more information and also allow you to submit a bug report to openpyxl.
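A quick way to pull that part out for inspection, using the path from the question (a sketch; adjust as needed):

import zipfile

# An .xlsm file is just a ZIP archive; read the stylesheet so the
# offending Alignment definitions can be inspected directly.
path = 'C:/dev/CodeRep/ProjectName/MainFile 2021_01.xlsm'
with zipfile.ZipFile(path) as archive:
    with archive.open('xl/styles.xml') as styles:
        print(styles.read().decode('utf-8')[:2000])  # first chunk of the XML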
Preprocess the file
I solved this issue by preprocessing the excel file.
Found that my problem was at "*/myfile.xlsx/xl/styles.xml", where several xf tags had an attribute indent="-1"; openpyxl only supports non-negative values and raises that exception when a negative value is found.
After some time spent trying to override the entire openpyxl hierarchy in order to catch the exception, I decided to process the XLSX itself.
Here is my code:
import os
import zipfile

def fix_xlsx(file_name):
    with zipfile.ZipFile(file_name) as input_file, \
         zipfile.ZipFile(file_name + ".out", "w") as output_file:
        # Iterate over every member of the archive
        for inzipinfo in input_file.infolist():
            with input_file.open(inzipinfo) as infile:
                if "xl/styles.xml" in inzipinfo.filename:
                    # Read, process & write: replace the negative indents
                    data = infile.read().replace(b'indent="-1"', b'indent="0"')
                    output_file.writestr(inzipinfo.filename, data)
                else:
                    # Copy every other member through unchanged
                    output_file.writestr(inzipinfo.filename, infile.read())
    # Replace the original file with the fixed copy
    os.replace(file_name + ".out", file_name)
Disclaimer:
I must say this is not a very elegant solution as the entire file is processed, and an auxiliary file is used.
Also, I am not enough of an Excel expert to tell whether changing that indent="-1" to indent="0" for those tags might cause formatting problems in the file. This is my working solution, but I can't really tell the effect of those tags.
I had the same issue: the file wasn't accepted by openpyxl.
I just opened the file in MS Excel and saved it to a new file. And it worked after that.
I got the same error and wasn't able to figure out the exact cause, but noticed when I ran my python script in a different environment it worked without issue.
I realized it may have had something to do with the versions of the openpyxl and xlrd packages I was using so I downgraded them to openpyxl==3.0.4 and xlrd==1.2.0 (previously using openpyxl==3.0.7 and xlrd==2.0.1) and that solved my issue.
I ran into this issue; my solution was to pinpoint what was causing the error in the spreadsheet (it had something to do with a recently modified table) and reconstruct that table in the worksheet. Much easier for me than debugging openpyxl or XML.

I'm trying to load a file into Python using pd.read_csv(), but I cannot understand the file's format

This is my very first question on stackoverflow, so I must beg your patience.
I believe there is something wrong with the format of a csv file I need to load into Python. I'm using a Jupyter Notebook. The link to the file is here.
It is from the World Inequality Database data portal.
I'm pretty sure the delimiter is a semi-colon ( sep=";" ) because the bottom half of the data renders neatly when I specify this argument. However the first half of the text in the file seems to make no sense. I have no idea how to tell the pd.read_csv() function how to read it. I suspect the first half of the data simply has terrible formatting. I've also tried header=None and sep="|" to no avail.
Any ideas or suggestions would be very helpful. Thank you very much!
This is common with spreadsheets. You may have some commentary, and tables may be inserted all over the place. It looks great to the content creator, but the CSV is a mess. You need to preprocess the CSV to create clean content for your analysis. In this case, it's easy: the content starts at a canned header line and you can split the file there. If that header ever changes, you'll get an error, and it's just one more sleepless night figuring out what they've done.
import itertools
import os

canned_header_line = "Variable Code;country;year;perc;agdpro999i;"\
    "npopul999i;mgdpro999i;inyixx999i;xlceux999i;xlcusx999i;xlcyux999i"

def scrub_WID_file(in_csv_filename, out_csv_filename):
    with open(in_csv_filename) as in_file, \
         open(out_csv_filename, 'w') as out_file:
        # Drop everything up to (but not including) the canned header line,
        # then write the rest through unchanged.
        out_file.writelines(itertools.dropwhile(
            lambda line: line.strip() != canned_header_line,
            in_file))
    # If the header was never found, nothing was written at all.
    if os.stat(out_csv_filename).st_size == 0:
        raise ValueError("No recognized header in " + in_csv_filename)
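Usage would then look something like this (the file names are placeholders):

import pandas as pd

scrub_WID_file('WID_raw.csv', 'WID_clean.csv')
df = pd.read_csv('WID_clean.csv', sep=';')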

Unable to read modified csv file with pandas

I have successfully exported an Excel file using the pandas .to_csv method on a 9-column DataFrame, and I can likewise read the created file back with .read_csv, with no errors whatsoever, using the following code:
dfBase = pd.read_csv('C:/Users/MyUser/Documents/Scripts/Base.csv',
                     sep=';', decimal=',', index_col=0, parse_dates=True,
                     encoding='utf-8', engine='python')
However, upon modifying the same CSV file manually using Notepad (which also extends to simply opening the file and saving it without making any actual alterations), pandas won't read it anymore, giving the following error message:
ParserError: Expected 2 fields in line 2, saw 9
In the case of the modified CSV, if the index_col=0 parameter is removed from the code, pandas is able to read the DataFrame again; however, the first 8 columns become the index (as a tuple) and only the last column is read as a regular field.
Could anyone point me out as to why I am unable to access the DataFrame after modifying it? Also, why does the removal of index_col enables its reading again with nearly all the columns as the index?
Have you tried opening and saving the file with some other text editor? Notepad really isn't that great: it is probably adding some special characters when it saves the file, or the file already contains characters that Notepad doesn't let you see, which is why pandas can't parse it correctly.
Try Notepad++ or a more advanced editor/IDE like Atom, VSCode or PyCharm.
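One quick check is to look at the raw start of the file, since Notepad is known to add a byte-order mark or re-save a file in a different encoding. A small sketch using the path from the question:

# b'\xef\xbb\xbf'            -> a UTF-8 BOM was added
# b'\xff\xfe' or b'\xfe\xff' -> the file was re-saved as UTF-16
with open('C:/Users/MyUser/Documents/Scripts/Base.csv', 'rb') as f:
    print(f.read(4))

If a BOM shows up, passing encoding='utf-8-sig' (or 'utf-16') to pd.read_csv usually gets the file parsing again.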

Pandas.read_excel: Unsupported format, or corrupt file: Expected BOF record

I'm trying to use pandas.read_excel to read in .xls files. It succeeds on most of my .xls files, but then for some it errors out with the following error message:
Unsupported format, or corrupt file: Expected BOF record; found '\x00\x05\x16\x07\x00\x02\x00\x00'
I've been trying to research why this is happening to some, but not all files. The xlrd version is 1.0.0. I tried to manually read in with xlrd.open_workbook and I get the same result.
Does anyone know what file type this BOF record is referring to?
There are various reasons why that error message can appear, but the main one is usually the Excel file itself. Sometimes, especially if you're pulling an Excel file from some reporting portal, the file can be corrupt, so the best thing is to open the Excel file, save it as a new .xls file, and then retry pandas.read_excel.
Lemme know if it works.
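One way to see what the file actually is, rather than what its extension claims, is to inspect its first few bytes (the file name below is a placeholder): genuine legacy .xls files are OLE2 compound documents, while .xlsx files are ZIP archives.

with open('suspect_file.xls', 'rb') as f:
    header = f.read(8)

# b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1' -> real legacy .xls (OLE2/BIFF)
# b'PK\x03\x04'                       -> actually an .xlsx / ZIP archive
# anything else                       -> not an Excel file at all
print(header)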
I solved this problem by loading the file with pd.read_table (it loads everything into one column):
df = pd.read_table('path/to/xls_file/' + 'my_file.xls')
then I split this column with
df = df['column_name'].str.split("your_separator", expand=True)
Please check that you have given the file the right extension, either .xlsx or .csv; specifying the wrong extension may cause this issue.
