fixing improper ID formatting

Background: The following code works to export a pandas df as an excel file:
import pandas as pd
import xlsxwriter
writer = pd.ExcelWriter('Excel_File.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
Problem:
My ID column in the Excel file shows up as
8.96013E+17 instead of 896013350764773376
I tried to change it in Excel using Format and the Zip Code type, but it still gives the wrong ID: 896013350764773000
Question: Using Excel or Python code, how do I keep my original ID, 896013350764773376, intact?

Excel stores numbers as IEEE 754 doubles and keeps only 15 significant digits of precision. So you are not going to be able to represent an 18-digit ID as a number in Excel. You will need to convert it to a string to keep all the digits.
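
A minimal sketch of that conversion, assuming the DataFrame has an integer column named ID. The column name, the sample row, and the helper names ids_as_text and write_with_text_ids are illustrative and not taken from the question. Casting the column to str before calling to_excel should make the writer emit text cells instead of numbers:

```python
import pandas as pd


def ids_as_text(df, column):
    """Return a copy of df with the given column converted to strings."""
    out = df.copy()
    out[column] = out[column].astype(str)
    return out


def write_with_text_ids(df, path, column):
    """Write df to an .xlsx file with the ID column stored as text (needs xlsxwriter)."""
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        ids_as_text(df, column).to_excel(writer, sheet_name="Sheet1", index=False)


# Hypothetical sample data using the ID from the question.
df = pd.DataFrame({"ID": [896013350764773376], "name": ["example"]})
text_df = ids_as_text(df, "ID")
first_id = text_df["ID"].iloc[0]
print(repr(first_id))
```

Calling write_with_text_ids(df, 'Excel_File.xlsx', 'ID') would then produce the workbook. Excel may flag the cells as "number stored as text", which is expected here.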

Related

Treat everything as raw string (even formulas) when reading into pandas from excel

I am handling text responses from surveys, and it is common to have responses that start with -, for example: -I am sad today.
Excel interprets such a cell as a formula and shows #NAME?
So when I import the Excel file into pandas using read_excel, it shows NaN.
Is there any method to force Excel to keep these cells as raw strings instead of interpreting them as formulas?
I wrote a VBA macro that sets the entire column to text and clicks through all the cells in the column, which is slow when there are more than ten thousand rows.
I was hoping this could be done at the Python level instead. Any ideas?

I hope this works for you: use openpyxl to extract the Excel data and then convert it into a pandas dataframe
from openpyxl import load_workbook
import pandas as pd
# .active returns the active worksheet, not the workbook
ws = load_workbook(filename='./formula_contains_raw.xlsx').active
rows = list(ws.values)  # ws.values is a generator, so materialize it once
# The first row holds the column names, the rest is data
df = pd.DataFrame(rows[1:], columns=rows[0])
df.head()

It works for me using a CSV file instead of an Excel file.
With the CSV file opened in Excel, I need to select the option Formulas/Show Formulas, then save the file.
pd.read_csv('draft.csv')
Output:
Col1
0 hello
1 =-hello

Pandas dataframe integer column is in scientific notation in csv file when writing csv using df.to_csv

I have a dataframe that is written to a csv file with a column of integer values like 1618240891297, but the table is displaying it in scientific notation 1.61824E+12. I can correct it by changing the Number Format in Excel from General to Number. Is there a way to make this change when writing the csv file using DataFrame.to_csv?

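You can use the float_format parameter of DataFrame.to_csv() to control how floating-point values are written. Note that it only applies to float columns; an integer column is always written with all of its digits. For example:
df.to_csv('myfile.csv', sep=',', float_format='%.6f') # write floats with 6 decimal places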
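
A quick way to see what actually lands in the file is to write to an in-memory buffer and inspect the text. This is a sketch with made-up column names (as_int, as_float) that reuses the value from the question. It suggests that the integer column is written in full with or without float_format, and that the scientific notation is added later by the program displaying the file:

```python
import io

import pandas as pd

# Hypothetical frame: the same value once as an integer and once as a float.
df = pd.DataFrame({"as_int": [1618240891297], "as_float": [1618240891297.0]})

plain = io.StringIO()
df.to_csv(plain, index=False)

formatted = io.StringIO()
df.to_csv(formatted, index=False, float_format="%.0f")  # no decimals for floats

plain_row = plain.getvalue().splitlines()[1]
formatted_row = formatted.getvalue().splitlines()[1]
print(plain_row)
print(formatted_row)
```

If the column must also display in full when the CSV is opened in Excel, the number format still has to be set on the Excel side, since a CSV file carries no formatting information.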

reading a large number from excel with pandas

I am reading an xlsx file with pandas, and one column contains 18-digit numbers, for example 360000036011012000.
After reading, the number is converted to 360000036011011968.
My code:
import pandas as pd
df = pd.read_excel("Book1.xlsx")
I also tried converting the column to string, but the result is the same:
df = pd.read_excel("Book1.xlsx",dtype = {"column_name":"str" })
I also tried engine = 'openpyxl'.
If the same number is in a CSV file, there is no problem and reading works fine, but I have to read it from Excel.

That is an Excel problem, not a pandas problem. See here:
The yellow-marked entries are actually the number below times 10 plus 1, so they should not end in 0.
Under the hood, Excel keeps only 15 significant digits of a number and replaces the remaining digits with zeros. Since this is an Excel problem and not a CSV problem, a CSV will work just fine.
Solution:
Format the numbers in Excel as text, as shown in the first picture, with: =TEXT(CELL,0).
Pandas can then import the column as strings, but the last digits are already lost at that point. Therefore Excel should not be used to store numbers with more than 15 digits as numbers. Use a different file format, such as CSV, or enter the numbers into Excel directly as strings by typing a leading ' symbol.
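
The value seen in the question can be reproduced without Excel at all. The sketch below assumes the workbook stores the cell as a 15-significant-digit number (the literal string used for stored_cell is an assumption about the file contents, and text_cell is a made-up ID). Parsing that number as a double and converting it to an integer gives the nearest representable value, which is why asking for str after the fact cannot bring the digits back:

```python
# Assumed numeric cell contents: only 15 significant digits survive in Excel.
stored_cell = "3.60000036011012E+17"

as_float = float(stored_cell)  # numeric cells are read as doubles
as_int = int(as_float)         # nearest double, shown as an exact integer
print(as_int)

# A cell stored as text (hypothetical 18-digit ID) keeps every digit.
text_cell = "360000036011012345"
print(int(text_cell))
```

The first print shows the same 360000036011011968 reported in the question, while the text cell converts back to an integer without any change.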

Python transfer excel formatting between two Excel documents

I'd like to copy the formatting between two Excel sheets in python.
Here is the situation:
I have a script that effectively "alters" (ie overwrites) an excel file by opening it using pd.ExcelWriter, then updates values in the rows. Finally, file is overwritten using ExcelWriter.
The Excel file is printed/shared/read by humans between updates done by the code. Humans will do things like change number formatting, turn on/off word wrap, and alter column widths.
My goal is the code updates should only alter the content of the file, not the formatting of the columns.
Is there a way I can read/store/write the sheet format within python so the output file has the same column formatting as the input file?
Here's the basic idea of what I am doing right now:
df_in= pd.read_excel("myfile.xlsx")
# Here is where I'd like to read in format of the first sheet of this file
xlwriter = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
df_out = do_update(df_in)
df_out.to_excel(xlwriter,'sheet1')
# Here is where I'd like to apply the format I read earlier to the sheet
xlwriter.save()
Note: I have played with xlsxwriter.set_column and add_format. As far as I can tell, these don't help me read the format from the current file

Pandas uses xlrd package for parsing Excel documents to DataFrames.
Interoperability between other xlsx packages and xlrd could be problematic when it comes to the data structure used to represent formatting information.
I suggest using openpyxl as your engine when instantiating pandas.ExcelWriter. It comes with reader and writer classes that are interoperable.
import pandas as pd
from openpyxl.styles.stylesheet import apply_stylesheet
from openpyxl.reader.excel import ExcelReader
xlreader = ExcelReader('myfile.xlsx', read_only=True)
xlwriter = pd.ExcelWriter('myfile.xlsx', engine='openpyxl')
df_in = pd.read_excel("myfile.xlsx")
df_out = do_update(df_in)
df_out.to_excel(xlwriter,'sheet1')
apply_stylesheet(xlreader.archive, xlwriter.book)
xlwriter.save()
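
An alternative sketch, not the approach above: open the existing workbook with openpyxl's load_workbook, assign only the cell values, and save it back. Because the cells' styles and the column dimensions are never touched, they should survive the update. The helper name overwrite_values, the sample data, and the temporary file are all illustrative assumptions; note also that openpyxl does not round-trip every workbook feature, so it is worth testing on a copy first.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def overwrite_values(path, rows, first_row=2, first_col=1):
    """Write rows into the active sheet of an existing workbook.

    Only cell values are assigned, so existing number formats and
    column widths are left as they are.
    """
    wb = load_workbook(path)
    ws = wb.active
    for r, row in enumerate(rows, start=first_row):
        for c, value in enumerate(row, start=first_col):
            ws.cell(row=r, column=c).value = value
    wb.save(path)


# Build a small formatted workbook to stand in for the human-edited file.
path = os.path.join(tempfile.mkdtemp(), "myfile.xlsx")
wb = Workbook()
ws = wb.active
ws.append(["name", "amount"])
ws.append(["old", 1.5])
ws["B2"].number_format = "0.000"
ws.column_dimensions["A"].width = 30
wb.save(path)

# Update the data row only, then reload to inspect the result.
overwrite_values(path, [("new", 20.25)])
check = load_workbook(path).active
print(check["A2"].value, check["B2"].value, check["B2"].number_format)
```

With a DataFrame, the rows could come from df_out.itertuples(index=False, name=None), with first_row=2 leaving the existing header row in place.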

why pandas change (large)numbers when it exports data to csv and excel

I have a dataframe with one column number:
df = pd.DataFrame([34032872653290886,57875847776839336],['A','B'],columns=['numbers'])
When I save the dataframe to Excel and to CSV, the saved data are shown in scientific notation and become 34032872653290900 and 57875847776839300.
To save df I use the following code:
df.to_excel('a1.xlsx')
df.to_csv('a1.csv')
Is it a bug? Or should I change a setting? I checked my code on two systems (Mac and Windows), and my pandas version is '0.20.2'.

It turns out Excel has a limitation on displaying large numbers; there is nothing wrong with the CSV writer module.
I got the answer in another post: Python CSV writer truncates long numbers
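
One way to confirm this is to write the numbers with Python's csv module into an in-memory buffer and read them back. This is a small sketch using the two values from the question; the dictionary layout and variable names are illustrative:

```python
import csv
import io

# The two values from the question, keyed by their index labels.
numbers = {"A": 34032872653290886, "B": 57875847776839336}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["", "numbers"])  # header row, first column is the index
for label, value in numbers.items():
    writer.writerow([label, value])

csv_text = buffer.getvalue()
print(csv_text)

# Read the text back and convert the second column to int again.
rows = list(csv.reader(io.StringIO(csv_text)))
read_back = {row[0]: int(row[1]) for row in rows[1:]}
print(read_back == numbers)
```

The text in the buffer contains every digit, so any rounding seen afterwards comes from the program that opens the file, not from the writer.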
