Pandas to understand inconsistent separator for CSV files - python

Depending on the regional settings, the CSV file exported from Excel may use a different separator (comma or semicolon), which breaks my script. I am therefore wondering what the best way is to handle this.
Does anyone know how to achieve that?

I ended up using the following snippet:
import csv

sniffer = csv.Sniffer()
sample_bytes = 32
with open("semicolons.csv") as f:
    dialect = sniffer.sniff(f.read(sample_bytes))
print(dialect.delimiter)
source: https://www.kite.com/python/examples/3323/csv-sniff-a-sample-of-a-csv-file-to-determine-its-dialect
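If the goal is simply to load the file with pandas, there is also a built-in route: passing sep=None together with engine='python' makes read_csv detect the delimiter itself using csv.Sniffer. A minimal sketch, assuming the same semicolons.csv file as above:
import pandas as pd

# sep=None lets pandas sniff the delimiter from the file; this requires
# the slower pure-Python parsing engine rather than the default C engine.
df = pd.read_csv("semicolons.csv", sep=None, engine="python")
print(df.head())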

Related

Need a push to start with a function about text files, I can't figure this out on my own

I don't need the entire code, but I would like a push to help me on my way. I've been searching the internet for clues on how to start writing a function like this, but I haven't gotten any further than the name of the function.
So I haven't got the slightest clue how to start; I don't know how to work with text files. Any tips?
These text files are CSV (Comma Separated Values). It is a simple file format used to store tabular data.
You may explore Python's built-in module called csv.
The following code snippet is an example of loading a .csv file in Python:
import csv

filename = 'us_population.csv'
with open(filename, 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print(row)

openpyxl error raise ValueError('Min value is {0}'.format(self.min)) in opening heavy file with formatting

I'm trying to use openpyxl for the first time, on a very heavy file that happens to be over 20,500 KB, has a lot of formatting, and contains a VBA macro.
My code keeps returning the following error:
File " \Anaconda3\lib\site-packages\openpyxl\styles\alignment.py", line 52, in __init__
self.relativeIndent = relativeIndent
File " \Anaconda3\lib\site-packages\openpyxl\descriptors\base.py", line 107, in __set__
raise ValueError('Min value is {0}'.format(self.min))
ValueError: Min value is 0
Would anyone know what the problem is / how to access the file despite it? I'm trying to post data into an existing Excel file to simplify processes and replace heavy VBA code. So I can't just post it into a different xlsx file and call it using VBA code (that would defeat the purpose).
Thanks a lot!
Here is my code:
wb = load_workbook(filename='C:/dev/CodeRep/ProjectName/MainFile 2021_01.xlsm', read_only = False, keep_vba = True)
The traceback says that there is a problem with the Alignment definition in the workbook's stylesheet. openpyxl follows the OOXML specification very closely to minimise unpleasant surprises later, this is why it tends to raise exceptions or give warnings rather than let things pass.
For more details we'll need to see the XML source for the stylesheet, or the Alignments part at least. You can find this by unzipping the XLSM file and looking for the styles.xml file. That will give you more information and also allow you to submit a bug report to openpyxl.
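For instance, a quick way to pull the stylesheet out for inspection, sketched here with the standard zipfile module (substitute your workbook's path):
import zipfile

# An .xlsm workbook is a ZIP archive; the stylesheet lives at xl/styles.xml
with zipfile.ZipFile('MainFile 2021_01.xlsm') as z:
    styles = z.read('xl/styles.xml').decode('utf-8')

# Print the start of the document and search it for <alignment .../> entries
print(styles[:2000])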
Preprocess the file
I solved this issue by preprocessing the Excel file.
I found that my problem was in "*/myfile.xlsx/xl/styles.xml", where several xf tags had the attribute indent="-1". openpyxl only supports non-negative values and raises that exception when a negative value is found.
After some time spent trying to override the entire openpyxl hierarchy in order to catch the exception, I decided to preprocess the XLSX instead.
Here is my code:
import os
import zipfile

def fix_xlsx(file_name):
    with zipfile.ZipFile(file_name) as input_file, zipfile.ZipFile(file_name + ".out", "w") as output_file:
        # Iterate over the members of the archive
        for inzipinfo in input_file.infolist():
            with input_file.open(inzipinfo) as infile:
                if "xl/styles.xml" in inzipinfo.filename:
                    # Read, patch the negative indents & write
                    content = infile.read().replace(b'indent="-1"', b'indent="0"')
                    output_file.writestr(inzipinfo.filename, content)
                else:
                    # Copy every other member through unchanged
                    output_file.writestr(inzipinfo.filename, infile.read())
    # Replace the original file with the fixed copy
    os.replace(file_name + ".out", file_name)
Disclaimer:
I must say this is not a very elegant solution, as the entire file is processed and an auxiliary file is used.
Also, I am not enough of an Excel expert to tell whether changing that indent="-1" to indent="0" for those tags might cause formatting problems in the file. This is my working solution, and I can't really tell the effect of those tags.
I had the same issue: the file wasn't accepted by openpyxl.
I just opened the file in MS Excel and saved it to a new file. And it worked after that.
I got the same error and wasn't able to figure out the exact cause, but noticed that when I ran my Python script in a different environment, it worked without issue.
I realized it may have had something to do with the versions of the openpyxl and xlrd packages I was using so I downgraded them to openpyxl==3.0.4 and xlrd==1.2.0 (previously using openpyxl==3.0.7 and xlrd==2.0.1) and that solved my issue.
I ran into this issue as well; my solution was to pinpoint what was causing the error in the spreadsheet (it had something to do with a recently modified table) and reconstruct that table in the worksheet. Much easier for me than debugging openpyxl or XML.

Pandas Read CSV for file address with \t in it

This may be a redundant question because I know that I can rename the file and solve the issue, but I'm still pretty new at this and it would be really useful information for the future. Thanks in advance to respondents!
So, I have a CSV file which is a table exported from SQL with the filename "t_SQLtable" located in a sub-folder of my working directory.
In order to open the file in Pandas I use the following command:
SQLfile= pd.read_csv('SUBFOLDER\t_SQLtable.csv', sep=',')
This is the error I receive:
FileNotFoundError: [Errno 2] File SUBFOLDER _SQLtable.csv does not exist: 'SUBFOLDER\t_SQLtable.csv'
My understanding is that Pandas is reading the <\t> as a tab and thus is not able to find the file, because that's not the file name it is looking for. But I don't know how to format the text in order to tell Pandas how to recognize the <t> as part of the filename. Would anyone know how to resolve this?
Thank you!
Folders are navigated using /, which won't be interpreted as an escape character:
SQLfile= pd.read_csv('SUBFOLDER/t_SQLtable.csv', sep=',')
In the future, if you want to keep \t in a string without it being treated as a tab character, use a raw string:
print('SUBFOLDER\t_SQLtable.csv')
print(r'SUBFOLDER\t_SQLtable.csv')
SUBFOLDER _SQLtable.csv
SUBFOLDER\t_SQLtable.csv
Try this:
SQLfile= pd.read_csv('SUBFOLDER\\t_SQLtable.csv', sep=',')
SQLfile= pd.read_csv('SUBFOLDER/t_SQLtable.csv', sep=',')
If that doesn't work, then try this:
import os
file_path = os.path.join(os.getcwd(), "SUBFOLDER", "t_SQLtable.csv")
SQLfile= pd.read_csv(file_path, sep=',')
Simply do what you did before, except add an r right before the string:
SQLfile = pd.read_csv(r'SUBFOLDER\t_SQLtable.csv', sep=',')
Adding r before a string literal makes Python treat it as a raw string, meaning escape sequences are not evaluated.
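Another option, not in the answers above but worth sketching: build the path with pathlib, which joins components with the right separator for the OS and avoids escape sequences altogether.
from pathlib import Path
import pandas as pd

# Joining components avoids writing backslashes in string literals entirely
file_path = Path("SUBFOLDER") / "t_SQLtable.csv"
SQLfile = pd.read_csv(file_path, sep=',')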

Error with urlopen: new-line character seen in unquoted field

I am using urllib.urlopen with Python 2.7 to read csv files located on an external webserver:
# Try & Except statements removed for clarity
import urllib
import csv
url = ...
csv_file = urllib.urlopen(url)
for row in csv.reader(csv_file):
    do_something()
All 100+ files can be read fine, except one that has been updated recently and that returns:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
The file is accessible here. According to my text editor, its mode is Mac (CR), as opposed to Windows (CRLF) for the other files.
Based on this thread, I found that Python's urlopen handles all newline formats correctly. The problem is therefore likely to come from somewhere else, though I have no clue where. The file opens fine in all my text editors and spreadsheet editors.
Does anyone have an idea of how to diagnose the problem?
* EDIT *
The creator of the file informed me by email that I was not the only one to experience such issues. Therefore, he decided to regenerate it. The code above now works fine again. Unfortunately, using a new file also means that the issue can no longer be reproduced, nor the proposed solutions properly tested.
Before closing the question, I want to thank all the stackers who dedicated some of their time to figure out a solution and post it here.
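One way to diagnose this kind of issue, sketched here for the same Python 2.7 setting (not from the original thread): count which line terminators actually occur in the downloaded bytes before handing them to csv.reader. A Mac (CR) file will show CR-only terminators and no LF.
import urllib

url = 'http://www.football-data.co.uk/mmz4281/1213/I1.csv'
data = urllib.urlopen(url).read()
# The CR-only and LF-only counts exclude the CRLF pairs
crlf = data.count('\r\n')
cr_only = data.count('\r') - crlf
lf_only = data.count('\n') - crlf
print 'CRLF:', crlf, 'CR only:', cr_only, 'LF only:', lf_only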
It might be a corrupt .csv file? Otherwise, this code runs perfectly.
#!/usr/bin/python
import urllib
import csv
url = "http://www.football-data.co.uk/mmz4281/1213/I1.csv"
csv_file = urllib.urlopen(url)
for row in csv.reader(csv_file):
    print row
Credits to J.F. Sebastian for the .csv file.
Although, you might want to consider sharing the specific .csv file with us, so we can try to re-create the error.
The following code runs without any error:
#!/usr/bin/env python
import csv
import urllib2
r = urllib2.urlopen('http://www.football-data.co.uk/mmz4281/1213/I1.csv')
for row in csv.reader(r):
    print row
I was having the same problem with a downloaded csv.
I know the fix would be to use open with 'rU', but I would rather not have to save the file to disk just to open it back up into a variable. That seems unnecessary.
file = open(filepath,'rU')
mydata = csv.reader(file)
So if someone has a better solution, that would be nice. Stack Overflow links that got me this far:
CSV new-line character seen in unquoted field error
Open the file in universal-newline mode using the CSV Django module
I found what I actually wanted with StringIO, cStringIO, or io:
Using Python, how do I to read/write data in memory like I would with a file?
I ended up getting io working:
import csv
import urllib2
import io

# warning: it's a 20MB csv
url = 'http://poweredgec.com/latest_poweredge-11g.csv'
urlRead = urllib2.urlopen(url).read()
# Normalize Mac (CR) and Windows (CRLF) line endings, then wrap the
# bytes in an in-memory file object instead of saving them to disk
ramFile = io.BytesIO(urlRead.replace('\r\n', '\n').replace('\r', '\n'))
csvCurrent = csv.reader(ramFile)
csvTuple = map(tuple, csvCurrent)
print csvTuple

Python - Convert CSV to DBF

I would like to convert a CSV file to DBF using Python (for use in geocoding, which is why I need the DBF file). I can easily do this in Stat/Transfer or other similar programs, but I would like to do it as part of my script rather than having to go to an outside program. There appear to be plenty of questions and answers about converting DBF to CSV, but I am not having any luck the other way around.
An answer using dbfpy is fine, I just haven't had luck figuring out exactly how to do it.
As an example of what I am looking for, here is some code I found online for converting dbf to csv:
import csv, arcgisscripting
from dbfpy import dbf

gp = arcgisscripting.create()
try:
    inFile = gp.GetParameterAsText(0)   # Input
    outFile = gp.GetParameterAsText(1)  # Output
    dbfFile = dbf.Dbf(open(inFile, 'r'))
    csvFile = csv.writer(open(outFile, 'wb'))
    headers = range(len(dbfFile.fieldNames))
    allRows = []
    for row in dbfFile:
        rows = []
        for num in headers:
            rows.append(row[num])
        allRows.append(rows)
    csvFile.writerow(dbfFile.fieldNames)
    for row in allRows:
        print row
        csvFile.writerow(row)
except:
    print gp.getmessage()
It would be great to get something similar for going the other way around.
Thank you!
Duplicate question at: Convert .csv file into .dbf using Python?
A promising answer there (among others) is:
Use the csv library to read your data from the csv file. The third-party dbf library can write a dbf file for you.
For example, you could try:
http://packages.python.org/dbf/
http://code.activestate.com/recipes/362715-dbf-reader-and-writer/
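A minimal sketch of that approach, using the third-party dbf package linked above (the file names, field names, and field spec here are hypothetical; adjust them to your data):
import csv
import dbf  # third-party package: pip install dbf

# Hypothetical input with a header row like: name,city,population
with open('us_population.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    # Field spec: C(n) is a character field, N(width, decimals) is numeric
    table = dbf.Table('us_population.dbf', 'name C(50); city C(50); population N(10,0)')
    table.open(mode=dbf.READ_WRITE)
    for name, city, population in reader:
        table.append((name, city, int(population)))
    table.close()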
You could also just open the CSV file in OpenOffice or Excel and save it in dBase format.
I assume you want to create attribute files for the Esri Shapefile format or something like that. Keep in mind that DBF files usually use ancient character encodings like CP 850. This may be a problem if your geo data contains names in foreign languages. However, Esri may have specified a different encoding.
EDIT: just noticed that you do not want to use external tools.
