I'm just trying out csvkit for converting Excel to csv. However, it's not taking into account formatting on dates and times, and producing different results from Excel's own save-as-csv. For example, this is a row of a spreadsheet:
And this is what Excel's save-as produces:
22/04/1959,Bar,F,01:32.00,01:23.00,00:59.00,00:47.23
The date has no special formatting, and the time is formatted as [mm]:ss.00. However, this is in2csv's version of the csv:
1959-04-22,Bar,F,0.00106481481481,0.000960648148148,0.00068287037037,0.000546643518519
which is of course of no use at all. Any ideas? There don't seem to be any command-line options for this: --no-inference doesn't help. Thanks.
EDIT
Both csvkit and xlrd do seem to take into account formatting, but they're not smart about it. A date of 21/02/1066 is passed through as the text string '21/02/1066' in both cases, but a date '22/04/1959' is turned into '21662.0' by xlrd, and 1959-04-22 by csvkit. Both of them just give up on small elapsed times and pass through the float representation. This is OK if you know that the cell should contain an elapsed time, because you can just multiply by 24*60*60 to get the right answer.
I don't think xlrd would be much help here since its date tuple functions only handle seconds, and not centiseconds.
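For what it's worth, recovering the times from those floats is mechanical once you know they are elapsed times. A minimal sketch (my own, not part of csvkit or xlrd) that turns the fractional-day values above back into [mm]:ss.00 strings, centiseconds included:
def frac_day_to_mmss(frac):
    # Excel stores elapsed times as fractions of a day: days -> seconds
    total_seconds = frac * 24 * 60 * 60
    minutes, seconds = divmod(total_seconds, 60)
    return '%02d:%05.2f' % (minutes, seconds)

for value in (0.00106481481481, 0.000546643518519):
    print(frac_day_to_mmss(value))   # 01:32.00 and 00:47.23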
EDIT 2
Found out something interesting. I started with a base spreadsheet containing times. In one of them I formatted the times as [m:]ss.00, and in the other I formatted them as [mm:]ss.00. I then saved each as a .xls and a .xlsx, giving a total of 4 spreadsheets. Excel could convert all 4 to csv, and all the time text in the CSVs appeared as originally written (i.e. 0:21.0, for example, for 0m 21.0s).
in2csv can't handle the two .xls versions at all; that time appears as 00:00:21. It also can't handle the [m:]ss.00 version of the .xlsx: conversion gives the catch-all 'index out of range' error. The only one of the 4 spreadsheets that in2csv can handle is the .xlsx one with [mm:]ss.00 formatting.
The optional -I argument should avoid this issue. Testing with your sample data, I get exactly what Excel's save-as produces.
Command:
in2csv sample.csv -I > sample-output-i.csv
Output:
22/04/1959,Bar,F,01:32.00,01:23.00,00:59.00,00:47.23
-I, --no-inference Disable type inference when parsing CSV input.
https://csvkit.readthedocs.io/en/latest/scripts/in2csv.html
Related
I am very new to Python and would like to use it for my mass spectrometry data analysis. I have a tab-separated txt file. I can import it into Excel with the import assistant.
I have also managed to import it into Spyder with the import assistant, but I would like to automate the process.
Is there a way to "record" the import settings I use while manually loading the data? That way I would generate a code that I could use in the future for the other txt files.
I've tried using NumPy and pandas to import my data but my txt file contains strings and numbers (floats) and I have not managed to tell Python to distinguish between the two.
When I import the file manually I get the exact DataFrame I want, with the first row as a header, and the strings and numbers correctly formatted.
Here is a sample of my txt file:
Protein.IDs Majority.protein.IDs Peptide.counts..all.
0 LmxM.01.0330.1-p1 LmxM.01.0330.1-p1 5
1 LmxM.01.0410.1-p1 LmxM.01.0410.1-p1 15
2 LmxM.01.0480.1-p1 LmxM.01.0480.1-p1 14
3 LmxM.01.0490.1-p1 LmxM.01.0490.1-p1 27
4 LmxM.01.0520.1-p1 LmxM.01.0520.1-p1 27
Using numpy or pandas is the best way to automate the process, so good job using the right tools.
I suggest that you look at all the options that the pandas read_csv function has to offer. There is most likely a single line of code that can import the data properly by using the right options.
In particular, look at the decimal option if the floats are not parsed correctly.
Other solutions, which you may still want to use even if you get pandas working properly, are:
Formatting the input data to make your life easier: either when it is generated, or using some editor with good macros (Notepad++ can replace expressions or perform tedious repeated keystrokes for you).
Formatting the output of the pandas import. If you still have strings that should be interpreted as numeric values, you can run a loop to check that all values are converted to the format they should be in (a sketch of this follows below).
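A sketch of that second idea, assuming df is the imported DataFrame; it coerces a column to numeric only when every entry parses:
import pandas as pd

def coerce_numeric_columns(df):
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        # only replace the column if every entry parsed as a number
        if converted.notna().all():
            df[col] = converted
    return df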
Finally, when you ask technical questions, it helps to provide examples: showing sample data, the code you're using, and the output of your code would make answering your question easier :)
Edit:
From the data example that you posted, it seems to me that pandas should separate the data just fine and detect strings and numerical values without trouble.
Look at the sep option of read_csv. The default is ',', but you probably want to switch it to a tab: '\t'
Try this:
pandas.read_csv(my_filename, sep='\t')
You may run into some header issue, which you can solve using the header and names options.
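Putting that together, a minimal sketch ('my_data.txt' is a placeholder for your actual file name):
import pandas as pd

# header=0 uses the first line of the file as column names
df = pd.read_csv('my_data.txt', sep='\t', header=0)

# string columns stay as object, numeric columns become int/float automatically
print(df.dtypes)
print(df.head())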
I want to know if it's possible to change the Spyder DataFrame float format in the variable explorer to include the thousands separator. I am not talking about when you use print() with .format, which is for when you output your results; I am talking about the Format option when you click on a DataFrame in the variable explorer, as shown here:
DataFrame Spyder
The format uses the % convention and not {}, so to have 2 decimals I need to write %.2f instead of {:.2f}. I am currently able to change it to either a full integer with %i or a 2-decimal float with %.2f, but I want to know if it's possible to include the thousands separator, essentially reproducing something like {:,.2f} but with %. I tried %:,.2f, %0:,.2f, and many other combinations with a comma, but with no success. I've been looking but can't find anything addressing this; other solutions use the pd.set_option method, but that only works when you actually print out the DataFrame.
Thank you!
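For reference, Python's printf-style % mini-language simply has no grouping flag, so if Spyder's format box really feeds the string through %, there is no combination left to try; the comma only exists in the newer format-spec mini-language:
x = 1234567.891

print('%.2f' % x)            # 1234567.89  -> %-style has no thousands separator
print('{:,.2f}'.format(x))   # 1,234,567.89 -> only new-style formatting supports ','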
I am converting an XML file to a CSV file; this is done by some code and the specifications that I have added.
As a result I get a csv file; once I open it I see some weird numbers that look something like this:
1,25151E+21
Is there any way to eliminate this and show the whole numbers? The code itself that parses the XML to CSV is working fine, so I'm assuming it is an Excel thing.
I don't want to go and do something manually every time I generate a new csv file.
Additional
The entire code can be found HERE and I have only long numbers in Quality
for qu in sn.findall('.//Qualify'):
    repeated_values['qualify'] = qu.text
CSV doesn't pass any cell formatting rules to Excel. Hence if you open a CSV that has very large numbers in it, the default cell formatting will likely be Scientific. You can try changing the cell formatting to Number, and if that changes the view to the entire number like you want, consider using XlsxWriter to apply cell formatting to the document while writing to XLSX instead of CSV.
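A minimal sketch of that XlsxWriter idea (the file name and cell are placeholders; note that Excel itself only keeps about 15 significant digits for numeric cells):
import xlsxwriter

workbook = xlsxwriter.Workbook('output.xlsx')
worksheet = workbook.add_worksheet()

# num_format '0' displays a plain integer instead of Scientific notation
plain_int = workbook.add_format({'num_format': '0'})
worksheet.write_number(0, 0, 1.25151e21, plain_int)

workbook.close()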
I often end up running a lambda on dataframes with this issue when I bring in csv, fwf, etc, for ETL and back out to XLSX. In my case they are all account numbers, so it's pretty bad when Excel helpfully overrides it to scientific notation.
If you don't mind the long number being a string, you can do this:
import numpy as np

# First I force it to be an int column as I import everything as objects for unrelated reasons
df.thatlongnumber = df.thatlongnumber.astype(np.int64)
# Then I convert that to a string
df.thatlongnumber = df.thatlongnumber.apply(lambda x: '{:d}'.format(x))
Let me know if this is useful at all.
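One caveat to add (mine, not the answer's): np.int64 tops out at 9223372036854775807, so a 21-digit value like the one in the question would overflow the astype step; arbitrary-precision Python ints avoid that:
# Python ints have no fixed width, so this works where np.int64 overflows
df.thatlongnumber = df.thatlongnumber.apply(lambda x: '{:d}'.format(int(x)))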
Scientific notation is a pain; what I've used before to handle situations like this is to cast the value to a float and then use a format specifier. Something like this should work:
a = "1,25151E+21"
print(f"{float(a.replace(',', '.')):.0f}")
>>> 1251510000000000065536
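If those trailing float artifacts matter, the decimal module keeps the value exact; the same idea with Decimal instead of float:
from decimal import Decimal

a = "1,25151E+21"
print(f"{Decimal(a.replace(',', '.')):f}")
# 1251510000000000000000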
I'm importing data coming from excel files that come from another office.
In one of the columns, for each cell, I have lists of numbers used as tags. These were manually inserted, by different people and (my guess) using computers with different thousands settings, so the result is very heterogeneous.
As an example I have:
tags= ['205', '306.3', '3,206,302','7.205.206']
If this was a CSV file (I tried converting one single file to check), using
pd.read_csv(my_file,sep=';')
would give me exactly the above mentioned list.
Unfortunately as said, we're talking about excel files (plural) and I have to deal with it, and using
pd.read_excel(my_file, sheetname=my_sheet, encoding='utf-16', converters={'my_column': str})
what I get instead is:
tags= ['205', '306.3', '3,206,302','7205206']
As you see, whenever the number can be expressed logically in thousands (so, not the second number in my list) the dot is recognised as a thousands separator and I get a single number, instead of three.
I tried reading documentation, and searching on stackoverflow and google, but the keywords to describe this problem are too vague and I didn't find a viable solution, yet.
How can I get the right list using excel files?
Thanks.
This problem is likely happening because pandas runs its number parser before its date parser.
One possible fix is to add a thousands separator. For example, if you are actually using ',' as your thousands separator, you could add thousands=',' in your excel reader:
pd.read_excel(my_file, sheetname=my_sheet, encoding='utf-16', thousands=',', converters={'my_column': str})
You could also pick an arbitrary thousands separator that doesn't exist in your data to keep the output unchanged, if thousands=None (the documented default) doesn't already deal with your problem. You should also make sure that you are converting the fields to str (in which case using thousands is somewhat redundant, as it is not applied to strings anyway).
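As a side note, and assuming a reasonably recent pandas (where sheetname became sheet_name), dtype=str forces every column to text in one go; whether that preserves '7.205.206' still depends on whether Excel stored the cell as text or as a number:
import pandas as pd

# dtype=str returns every cell as text instead of letting pandas parse numbers
df = pd.read_excel(my_file, sheet_name=my_sheet, dtype=str)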
EDIT:
I tried using the following dummy data ('test.xlsx'):
a b c d
205 306.3 3,206,302 7.205.206
and with
dataf = pandas.read_excel('test.xlsx', header=0, converters={'a':str, 'b':str,'c':str,'d':str})
print(dataf.to_string())
I got the following output:
Columns: [205, 306.3, 3,206,302, 7.205.206]
Which is exactly what you were looking for. Are you sure you have the latest version of pandas and that you are in fact not using converters = {'col':int} or float in your converters keyword?
As it stands, it sounds like you are either converting your fields to numeric (int or float), or there is a problem elsewhere in your code. The pandas read_excel seems to work as described, and I can get the results you specified with the code above. In other words: your code should work; if it doesn't, it might be due to an outdated pandas version, other parts of your code, or even problems with the source data. As it stands, it's not possible to answer your question further with the information you have provided.
I have been able to read an Excel cell value with xlrd using column and row numbers as inputs. Now I need to access the same cell values in some spreadsheets that were saved in .ods format.
So for example, how would I read with Python the value stored in cell E10 in an .ods file?
Hacking your way through the XML shouldn't be too hard ... but there are complications. Just one example: OOo in their wisdom decided not to write the cell address explicitly. There is no cell attribute like address="E10" or column="E"; you need to count rows and columns.
Five consecutive empty cells are represented by
<table:table-cell table:number-columns-repeated="5" />
The number-columns-repeated attribute defaults to "1" and also applies to non-empty cells.
It gets worse when you have merged cells; you get a covered-table-cell tag which is 90% the same as the table-cell tag, and attributes number-columns-spanned and number-rows-spanned need to be figured into column and row counting.
A table:table-row tag may have a number-rows-repeated attribute. This can be used to repeat the contents of a whole non-empty row, but is most often seen when there is more than one consecutive empty row.
So, even if you would be satisfied with a "works on my data" approach, it's not trivial.
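To make that counting concrete, here is a "works on my data" sketch; it honours number-columns-repeated and number-rows-repeated but deliberately ignores merged cells (covered-table-cell spans):
import zipfile
import xml.etree.ElementTree as ET

TABLE = '{urn:oasis:names:tc:opendocument:xmlns:table:1.0}'
TEXT = '{urn:oasis:names:tc:opendocument:xmlns:text:1.0}'

def read_cell(ods_path, target_row, target_col):  # 0-based: E10 -> (9, 4)
    # an .ods file is a zip archive; the sheet data lives in content.xml
    with zipfile.ZipFile(ods_path) as z:
        root = ET.fromstring(z.read('content.xml'))
    row_idx = 0
    for row in root.iter(TABLE + 'table-row'):
        row_repeat = int(row.get(TABLE + 'number-rows-repeated', '1'))
        if row_idx <= target_row < row_idx + row_repeat:
            col_idx = 0
            for cell in row:
                col_repeat = int(cell.get(TABLE + 'number-columns-repeated', '1'))
                if col_idx <= target_col < col_idx + col_repeat:
                    # concatenate the cell's text paragraphs
                    return ''.join(p.text or '' for p in cell.iter(TEXT + 'p'))
                col_idx += col_repeat
        row_idx += row_repeat
    return None

print(read_cell('test.ods', 9, 4))  # the value in cell E10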
You may like to look at ODFpy. Note the second sentence: """Unlike other more convenient APIs, this one is essentially an abstraction layer just above the XML format.""" There is an ODF-to-HTML script which (if it is written for ODS as well as for ODT) may be hackable to get what you want.
If you prefer a "works on almost everybody's data and is supported and has an interface that you're familiar with" approach, you may need to wait until the functionality is put into xlrd ... but this isn't going to happen soon.
From libraries that I tried ezodf was the one that worked.
from ezodf import opendoc

doc = opendoc('test.ods')
for sheet in doc.sheets:
    print(sheet.name)
    cell = sheet['E10']
    print(cell.value)
    print(cell.value_type)
pyexcel-ods crashed, odfpy crashed, and in addition odfpy's documentation is either missing or horrible.
Given that supposedly working libraries died on the first file that I tested, I would prefer to avoid writing my own processing, as sooner or later it would either crash or, what's worse, fail silently in some weirder situation.
EDIT: It gets worse. ezodf may silently return bogus data.