Long numbers conversion format - python

I'm converting XML to a CSV file using some code plus specifications that I have added.
As a result I get a CSV file, but when I open it I see some weird numbers that look something like this:
1,25151E+21
Is there any way to eliminate this and show the whole number? The code that parses the XML to CSV is working fine, so I'm assuming it is an Excel thing.
I don't want to fix something manually every time I generate a new CSV file.
Additional
The entire code can be found HERE; the only long numbers are in Qualify:
for qu in sn.findall('.//Qualify'):
    repeated_values['qualify'] = qu.text

CSV doesn't pass any cell formatting rules to Excel. Hence, if you open a CSV that has very large numbers in it, the default cell formatting will likely be Scientific. Try changing the cell formatting to Number; if that shows the entire number the way you want, consider using XlsxWriter to apply cell formatting to the document while writing to XLSX instead of CSV.

I often end up running a lambda on dataframes with this issue when I bring in csv, fwf, etc, for ETL and back out to XLSX. In my case they are all account numbers, so it's pretty bad when Excel helpfully overrides it to scientific notation.
If you don't mind the long number being a string, you can do this:
import numpy as np
# First I force it to be an int column, as I import everything as objects for unrelated reasons
df.thatlongnumber = df.thatlongnumber.astype(np.int64)
# Then I convert that to a string and assign the result back to the column
df.thatlongnumber = df.thatlongnumber.apply(lambda x: '{:d}'.format(x))
Let me know if this is useful at all.
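One caveat with the approach above: np.int64 tops out at 9223372036854775807 (19 digits), so a value the size of 1,25151E+21 would not fit. A minimal sketch of an alternative, assuming the long numbers are identifiers rather than quantities, is to read the column as text from the start. The CSV content below is made up for illustration; the column name is borrowed from the snippet above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "thatlongnumber\n1251510000000000000000\n42\n"

# dtype=str keeps every digit exactly as it appears in the file,
# so no integer or float conversion ever happens.
df = pd.read_csv(io.StringIO(csv_text), dtype={"thatlongnumber": str})

values = df["thatlongnumber"].tolist()
print(values)
```

Because the values never pass through a numeric type, there is nothing to overflow or round.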

Scientific notation is a pain, what I've used before to handle situations like this is to cast it into a float and then use a format specifier, something like this should work:
a = "1,25151E+21"
print(f"{float(a.replace(',', '.')):.0f}")
>>> 1251510000000000065536
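The trailing 065536 in that output comes from binary floating-point rounding, not from the original data. If that matters, a small sketch using the standard library's decimal module expands the notation without going through a float. The helper name expand_scientific is made up for this example, and digits that were already dropped when the value was written in scientific notation cannot be recovered this way.

```python
from decimal import Decimal


def expand_scientific(text: str) -> str:
    """Expand a scientific-notation string to plain digits.

    Accepts either a comma or a dot as the decimal mark.
    """
    value = Decimal(text.replace(",", "."))
    # The "f" format spec renders a Decimal in fixed-point notation
    # without introducing binary rounding error.
    return format(value, "f")


print(expand_scientific("1,25151E+21"))
```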

Related

export a comma-separated string as a text file without auto-formatting it as a CSV

I'm developing an API which should, ideally, export a comma-separated list as a .txt file that looks like
alphanumeric1, alphanumeric2, alphanumeric3
The data to be exported comes from a column of a pandas dataframe, which I guess explains it, but none of my attempts to get it as a single-line string literal have worked. Instead, the text file I receive is
,ColumnHeader
0,alphanumeric1
0,alphanumeric2
0,alphanumeric3
I've tried using string literals with the backticks, writing to multiple lines, and appending commas to each value in the list, but it all comes out in the form of a CSV, which won't work for my purposes.
How would you achieve this effect?
I am not sure if what you need is:
csvList = ','.join(df.ColumnHeader)
where, df is of course your pandas dataframe
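A slightly fuller sketch of the same idea, assuming the goal is a one-line text file: build the string with join and write it with plain file I/O instead of a DataFrame export method. The frame contents and the export.txt file name are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame standing in for the one described in the question.
df = pd.DataFrame(
    {"ColumnHeader": ["alphanumeric1", "alphanumeric2", "alphanumeric3"]}
)

# Build a single comma-separated line; astype(str) guards against
# non-string values in the column.
line = ", ".join(df["ColumnHeader"].astype(str))

# Write the plain string directly, so no index or header is added.
out_path = Path(tempfile.mkdtemp()) / "export.txt"
out_path.write_text(line + "\n", encoding="utf-8")

print(line)
```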

Can I save the import settings while opening a txt file?

I am very new to Python and would like to use it for my mass spectrometry data analysis. I have a txt file that is separated by tabulator. I can import it into Excel with the import assistant.
I have also managed to import it into spyder with the import assistant, but I would like to automate the process.
Is there a way to "record" the import settings I use while manually loading the data? That way I would generate a code that I could use in the future for the other txt files.
I've tried using NumPy and pandas to import my data but my txt file contains strings and numbers (floats) and I have not managed to tell Python to distinguish between the two.
When I import the file manually I get the exact DataFrame I want, with the first row as a header and the strings and numbers correctly formatted.
here is a sample of my txt file:
Protein.IDs Majority.protein.IDs Peptide.counts..all.
0 LmxM.01.0330.1-p1 LmxM.01.0330.1-p1 5
1 LmxM.01.0410.1-p1 LmxM.01.0410.1-p1 15
2 LmxM.01.0480.1-p1 LmxM.01.0480.1-p1 14
3 LmxM.01.0490.1-p1 LmxM.01.0490.1-p1 27
4 LmxM.01.0520.1-p1 LmxM.01.0520.1-p1 27
Using numpy or pandas is the best way to automate the process, so good job using the right tools.
I suggest that you look at all the options that the pandas read_csv function has to offer. There is most likely a single line of code that can import the data properly by using the right options.
In particular, look at the decimal option if the floats are not parsed correctly.
Other solutions, which you may still want to use even if you use pandas properly are:
Formatting the input data to make your life easier: either when it is generated, or with a text editor that has good macros (Notepad++ can replace expressions or replay tedious repeated keystrokes for you).
Formatting the output of the pandas import. If you still have strings that should be interpreted as numeric values, maybe you can run a loop to check that all values are converted in the format that they should be in.
Finally, you may want to provide some examples when you ask technical questions: show an example of data, the code that you're using, and the output of your code would make answering your question easier :)
Edit:
From the data example that you posted, it seems to me that pandas should separate the data just fine and detect strings and numerical values without trouble.
Look at the options sep of read_csv. The default is ',', you probably want to switch it to a tabulation: '\t'
Try this:
pandas.read_csv(my_filename, sep='\t')
You may run into some header issue, which you can solve using the header and names options.
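To check how the columns are interpreted, a short sketch along these lines may help. The in-memory text is a hypothetical tab-separated version of the sample from the question; with a real file you would pass its path instead of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical tab-separated content modelled on the sample in the question.
txt = (
    "Protein.IDs\tMajority.protein.IDs\tPeptide.counts..all.\n"
    "LmxM.01.0330.1-p1\tLmxM.01.0330.1-p1\t5\n"
    "LmxM.01.0410.1-p1\tLmxM.01.0410.1-p1\t15\n"
)

# sep="\t" splits on tabs; the first row is used as the header by default.
df = pd.read_csv(io.StringIO(txt), sep="\t")

# Inspect the inferred column types: text columns and numeric columns
# are detected separately.
print(df.dtypes)
```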

pandas.read_csv writes out to file

I'm taking in shipment data from a csv file, I've edited data for privacy purposes, but the thing to look at is when using pandas.read_csv on my csv file the original as shown below is normal in this sense: the ZIP code (01234) has a leading 0, and the order number (22276) is an integer.
After using pandas.read_csv and printing out my data (and viewing my data in a text editor) I see that the leading 0 was taken out from the ZIP code (it is now 1234), and the order number is now a floating number (22276.0)
Original:
GROUND,THIRD PARTY,Company Name,1 Road
Ave,Town,State,01234,,22276,22276,22276,,Customer Name,Street
Name,00000 00th Ave
Z.Z.,,Town,State,00001,V476V6,18001112222,,,,Package,1
After using pandas.read_csv:
GROUND,THIRD PARTY,Dreams,100 Higginson
Ave,LINCOLN,RI,1234,,22276.0,22276.0,22276.0,,Customer Name,Street
Name,00000 00th Ave
Z.Z.,,Town,State,00001,V476V6,18001112222,,,,Package,1
I've seen others have these issues as well, and in those questions you will see well-written answers about HOW to fix the problem. What I want to know is WHY the problem exists in the first place. Why does a reading function write out original data back to the file?
EDIT
Here's the code I'm currently working with, reference is the name of the column with the order number.
import pandas
grid = pandas.read_csv("thirdparty.csv", dtype={'ZIP': int, 'REFERENCE': int})
with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(grid)
How
You'll want to use the dtype argument of pd.read_csv. One solution would be to read in all the columns as string type. This will preserve the values exactly as they were in your csv file.
import pandas as pd
data = pd.read_csv("thirdparty.csv", dtype=str)
Though a better solution would be to specify your desired dtype of each column:
data = pd.read_csv("thirdparty.csv", dtype={'ZIP': str, 'REFERENCE': int})
When writing the csv file back out again you should also use the float_format argument to ensure any floats are written as you desire.
Why
You also asked why the "problem" exists.
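Essentially, when you use pd.read_csv without specifying a dtype, anything which looks like a number is parsed as a number. So 01234 is converted to the integer 1234 on read, and a numeric column that contains empty cells is read as float, because the missing values are stored as NaN; that is why 22276 becomes 22276.0.
When you write back out to your file, these values are written in their new numeric form. The pd.read_csv function itself is not writing out data to the original file.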
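The difference can be seen in a small sketch. The two-column CSV below is invented for illustration, with one empty REFERENCE cell to show how a missing value turns an integer column into floats. The nullable "Int64" dtype is one option, not something the question or answer used, for keeping integers alongside missing values.

```python
import io

import pandas as pd

# Hypothetical data: a ZIP with leading zeros and a REFERENCE column
# with one empty cell.
csv_text = "ZIP,REFERENCE\n01234,22276\n00001,\n"

# Default inference: ZIP becomes an integer, REFERENCE becomes float
# because the empty cell is stored as NaN.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: ZIP stays text, REFERENCE uses the nullable integer type.
explicit = pd.read_csv(
    io.StringIO(csv_text), dtype={"ZIP": str, "REFERENCE": "Int64"}
)

print(inferred.dtypes)
print(explicit)
```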

How to correctly parse as text numbers separated by mixed commas and dots in excel file using Python?

I'm importing data coming from excel files that come from another office.
In one of the columns, for each cell, I have lists of numbers used as tags. These were manually inserted, by different people and (my guess) using computers with different thousands settings, so the result is very heterogeneous.
As an example I have:
tags= ['205', '306.3', '3,206,302','7.205.206']
If this was a CSV file (I tried converting one single file to check), using
pd.read_csv(my_file,sep=';')
would give me exactly the above mentioned list.
Unfortunately as said, we're talking about excel files (plural) and I have to deal with it, and using
pd.read_excel(my_file,sheetname=my_sheet,encoding='utf-16',converters={'my_column':str})
what I get instead is:
tags= ['205', '306.3', '3,206,302','7205206']
As you see, whenever the number can be expressed logically in thousands (so, not the second number in my list) the dot is recognised as a thousands separator and I get a single number, instead of three.
I tried reading documentation, and searching on stackoverflow and google, but the keywords to describe this problem are too vague and I didn't find a viable solution, yet.
How can I get the right list using excel files?
Thanks.
This problem is likely happening because pandas is running their number parser before their date parser.
One possible fix is to add a thousands separator. For example, if you are actually using ',' as your thousands separator, you could add thousands=',' in your excel reader:
pd.read_excel(my_file,sheetname=my_sheet,encoding='utf-16',thousands=',',converters={'my_column':str})
If thousands=None (which should be the default according to the documentation) doesn't already deal with your problem, you could also pick an arbitrary thousands separator that doesn't exist in your data, so that the output stays the same. You should also make sure that you are converting the fields to str (in which case using thousands is somewhat redundant, as it is not applied to strings either way).
EDIT:
I tried using the following dummy data ('test.xlsx'):
a b c d
205 306.3 3,206,302 7.205.206
and with
dataf = pandas.read_excel('test.xlsx', header=0, converters={'a':str, 'b':str,'c':str,'d':str})
print(dataf.to_string)
I got the following output:
Columns: [205, 306.3, 3,206,302, 7.205.206]
Which is exactly what you were looking for. Are you sure you have the latest version of pandas and that you are in fact not using converters = {'col':int} or float in your converters keyword?
As it stands, it sounds like you are either converting your fields to numeric (int or float), or there is a problem elsewhere in your code. The pandas read_excel seems to work as described, and I can get the results you specified with the code above. In other words: your code should work; if it doesn't, it might be due to an outdated pandas version, other parts of your code, or even problems with the source data. It's not possible to answer your question further with the information you have provided.

How can I read every field as string in xlwings?

I have an Excel file that I want to convert, but the default type for numbers is float. How can I change it so xlwings explicitly uses strings and not numbers?
This is how I read the value of a field:
xw.Range(sheet, fieldname ).value
The problem is that numbers like 40 get converted to 40.0 if I create a string from that. I strip it with: str(xw.Range(sheetFronius, fieldname ).value).rstrip('0').rstrip('.') but that is not very helpful and leads to errors because sometimes the same field can contain both a number and a string. (Not at the same time, the value is chosen from a list)
With xlwings if no options are set during reading/writing operations single cells are read in as 'floats'. Also, by default cells with numbers are read as 'floats'. I scoured the docs, but don't think you can convert a cell that has numbers to a 'string' via xlwings outright. Fortunately all is not lost...
You could read in the cells as 'int' with xlwings and then convert the 'int' to 'string' in Python. The way to do that is as follows:
xw.Range(sheet, fieldname).options(numbers=int).value
Then you would just convert that to string in Python in the usual way.
Alternatively, you can pack the string conversion into the options upfront and read in your data this way:
xw.Range(sheet, fieldname).options(numbers=lambda x: str(int(x))).value
Good luck!
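Since the same field can hold either a number or a string, the rstrip approach from the question can also damage text values (for example, it would strip the trailing zero from a value like 'A100'). A minimal pure-Python sketch of a safer conversion is shown below; the helper name cell_to_text is made up, and treating None as an empty cell is an assumption about how blank cells arrive.

```python
def cell_to_text(value):
    """Render a cell value as text, dropping '.0' from whole-number floats."""
    if value is None:
        # Assumption: an empty cell is delivered as None.
        return ""
    if isinstance(value, float) and value.is_integer():
        # 40.0 -> "40", without touching genuine decimals such as 2.5.
        return str(int(value))
    # Strings and non-integer numbers pass through unchanged.
    return str(value)


for sample in (40.0, 2.5, "A100", None):
    print(repr(cell_to_text(sample)))
```

This could be applied to whatever xw.Range(...).value returns, regardless of which type the cell happens to contain.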
In my case the fix was simply to add one extra row below the last row of the raw data.
Write any text in the column you want to read as str, save, load the file, and then delete that last row.
