reading a large number from excel with pandas - python

I am reading an xlsx file with pandas, and one column contains 18-digit numbers, for example 360000036011012000.
After reading, the number is converted to 360000036011011968.
My code:
import pandas as pd
df = pd.read_excel("Book1.xlsx")
I also tried converting the column to string, but the result is the same:
df = pd.read_excel("Book1.xlsx", dtype={"column_name": "str"})
I also tried engine='openpyxl'.
If the same number is in a CSV file, it is read correctly, but I have to read it from Excel.

That is an Excel problem, not a pandas problem. See here:
In the screenshot, the yellow-marked entries are the number below multiplied by 10, plus 1, so they should not end in 0.
Under the hood, Excel stores numbers as floating-point values and keeps only 15 significant digits, so the trailing digits of an 18-digit number cannot be stored exactly. Since this is an Excel limitation and not a CSV one, a CSV file works just fine.
Solution:
Format the numbers in Excel as text, as shown in the first picture, with =TEXT(CELL,0).
Pandas can then import the column as strings, but the information in the last digits is already lost. Excel should therefore not be used for numbers with more than 15 digits. Use a different file format, such as CSV, or enter the numbers directly as strings in Excel by typing a leading ' (apostrophe).
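A minimal sketch of where 360000036011011968 comes from, assuming the cell value reaches pandas as a 64-bit float. The column name id and the in-memory CSV text are made up for illustration:

```python
import io
import pandas as pd

n = 360000036011012000

# A float64 has a 53-bit mantissa; near 3.6e17 adjacent floats are 64 apart,
# so n is rounded to the nearest representable value.
as_float = float(n)
print(int(as_float))  # 360000036011011968

# The same digits read from CSV text are parsed as integers,
# so no float conversion takes place. "id" is a hypothetical column name.
csv_text = "id\n360000036011012000\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df["id"].iloc[0], df["id"].dtype)
```

This matches the behaviour described in the question: the value is already a float by the time dtype is applied, so converting to string afterwards only renders the rounded number.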

Related

Saving columns from csv

I am trying to write code that reads a CSV file and saves each column as a specific variable. I am having difficulty because the header is 7 lines long (something I can control, but I would rather just skip it in code), and my data is full of important decimal places, so it cannot be converted to int (or maybe string?). I've also tried saving each column by its position in the file, but I am struggling to get that to run. Any ideas?
The image shows my current code, slimmed down to the important parts, with the data that prints in my console circled.
save each columns as a specific variable
import pandas as pd
df = pd.read_csv('file.csv')
x_col = df['X']
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
If what you are looking for is how to iterate through the columns, no matter how many there are (which is what I think you are asking), then this code should do the trick:
import pandas as pd

data = pd.read_csv('optitest.csv', skiprows=6)
for column in data.columns:
    # You will need to define what this save() function is.
    # It is only placed here as an example.
    save(data[column])
The line about formatting your data as a number or a string was a little vague. But if it's decimal data, then you need to use float. See #9637665.
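A self-contained sketch of the same idea, assuming the file has six preamble lines followed by a line of column names. The preamble text, the column names X and Y, and the values are invented for illustration, and a dict stands in for the undefined save() step:

```python
import io
import pandas as pd

# Six hypothetical preamble lines, then the column names, then the data.
raw = "\n".join(
    [f"preamble line {i}" for i in range(1, 7)]
    + ["X,Y", "0.125,1.5", "0.250,2.75"]
) + "\n"

# skiprows=6 drops the preamble so the 7th line is used as the header.
data = pd.read_csv(io.StringIO(raw), skiprows=6)

# Keep every column under its own name instead of one variable per column.
columns = {name: data[name].to_numpy() for name in data.columns}

print(data.dtypes)
print(columns["Y"])
```

Decimal columns are parsed as float64 by default here, so no explicit conversion is needed.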

When importing float data from Excel using Pandas read_csv, numbers are rounded up to nearest integer

I'm not sure how to read in csv data using Pandas read_csv without having the data rounded up. Here's the first few rows of the csv (opened in Excel, where the Excel column type is Number)
When I read this CSV into my Jupyter Notebook using Pandas, both columns are being rounded up. I tried using the option float_precision = 'round_trip' and also first reading the columns in as str then converting to float, but once converted to float, the numbers are rounded up again. I made sure that my Jupyter Notebook display precision is greater than two.
How do I read this data in while preserving the precision?
This is what the CSV file looks like in Xcode:
I first read in the CSV with the following code:
CPI = pd.read_csv('CPI.csv', dtype={'Year': str, 'Annual_Avg': np.float64, 'Annual_Percent_Change': np.float64})
Afterwards, the dataframe looks like this:
After shutting down my kernel and restarting Jupyter, the decimals are displayed again. I don't know why they were rounded initially; it now looks OK. Should I delete this post? I'm not sure what the Stack Overflow protocol is if you've 'solved' your own question.
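One way to check whether rounding like this is only a display setting is to compare the rendered frame with the stored values. A minimal sketch; the two data rows are invented placeholders, not the real CPI figures:

```python
import io
import pandas as pd

# Hypothetical sample rows, for illustration only.
raw = "Year,Annual_Avg\n1913,9.883\n1914,10.017\n"
cpi = pd.read_csv(io.StringIO(raw), dtype={"Year": str, "Annual_Avg": "float64"})

# Render the frame with a deliberately low display precision.
with pd.option_context("display.precision", 1):
    shown = cpi.to_string()
print(shown)

# The stored values are unaffected by the display option.
stored = cpi["Annual_Avg"].tolist()
print(stored)
```

If the stored values still carry the decimals, the rounding happened in the notebook's display settings and not in read_csv.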

Fixed Width File manipulation in Pandas

I have a fixed-width file with the following format:
5678223313570888271712000000024XAXX0101010006461801325345088800.0784001501.25abc#yahoo.com
5678223324686600271712000000070XAXX0101010006461801325390998280.0784001501.25abcde.12345#gmail.com 5678123422992299
Here's what I tried:
import pandas as pd
ColSpecs = [(0,16),(16,31),(31,44),(44,62),(62,70),(70,73),(73,77),(77,127),(127,143)]
df = pd.read_fwf("~/filename.txt", colspecs=ColSpecs, header=None)
This converts the file cleanly into a pandas DataFrame. However, the blanks (the fixed-width padding) get trimmed off. For example, the email field (#8) is fixed at 50 characters, but its padding is stripped as soon as it is imported into the DataFrame.
For the data manipulation, I am creating 3 new fields that are extracted from the values of the previously imported fields.
Final Output file structure:
[(0,16),(16,31),(31,44),(44,62),(62,70),(70,73),(73,77),(77,127),(127,143),(143,153),(153,163),(164,165)]
Since I haven't found a to_fwf method on DataFrames or any other way to go from pandas to a flat file while keeping the original lengths intact, I would really appreciate it if anyone has a better solution.
P.S.: I read that awk/sed on Unix works better, but I would still like to know how to do it in Python.
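A minimal sketch of one way to write a DataFrame back out with fixed widths, assuming every field is left-aligned and space-padded. The helper to_fixed_width, the two-column layout, and the widths 16 and 50 are hypothetical and only loosely modelled on the account and email fields above:

```python
import pandas as pd

# Hypothetical two-field frame; values are kept as strings.
df = pd.DataFrame({
    "account": ["5678223313570888", "5678223324686600"],
    "email": ["abc#yahoo.com", "abcde.12345#gmail.com"],
})
widths = [16, 50]

def to_fixed_width(frame, widths):
    """Return the frame as text, one line per row, each field padded to its width."""
    lines = []
    for row in frame.itertuples(index=False):
        cells = ["" if pd.isna(value) else str(value) for value in row]
        # ljust pads short values; the slice truncates values that are too long.
        lines.append("".join(cell.ljust(w)[:w] for cell, w in zip(cells, widths)))
    return "\n".join(lines) + "\n"

out = to_fixed_width(df, widths)
print(out)
```

The widths can be derived from the existing colspecs as end minus start, so the output layout stays in step with the input layout.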

Why does pandas change (large) numbers when it exports data to CSV and Excel?

I have a DataFrame with one column of numbers:
df = pd.DataFrame([34032872653290886,57875847776839336],['A','B'],columns=['numbers'])
When I save the DataFrame to Excel and to CSV, the saved values are shown in scientific notation and become 34032872653290900 and 57875847776839300.
To export df I use the following code:
df.to_excel('a1.xlsx')
df.to_csv('a1.csv')
Is it a bug, or should I change a setting? I checked my code on two systems (Mac and Windows), and my pandas version is '0.20.2'.
It turns out Excel has a limitation on displaying large numbers; there is nothing wrong with the CSV writer module.
I got the reply in another post: Python CSV writer truncates long numbers
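A quick way to confirm that the CSV text itself is intact is to look at it without opening it in Excel. A minimal sketch using the DataFrame from the question:

```python
import pandas as pd

df = pd.DataFrame([34032872653290886, 57875847776839336], ['A', 'B'], columns=['numbers'])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv()
print(csv_text)
```

All 17 digits are present in the text; they only change when Excel parses the file and converts the values to its own number type.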

Load .xlsx into pandas as string

I am loading an xlsx file into pandas. One of the rows contains numbers, but some have a preceding 0, such as 0734. Pandas converts them automatically into integers and the preceding 0 is lost. How can I force pandas to import the whole xlsx file as strings?
The following doesn't work:
lookup = xls_file.parse('lookup',dtype=str)
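If the cells are stored as numbers in the workbook, the leading zero may exist only in the cell's display format, in which case reading as strings cannot bring it back. A sketch of restoring it afterwards, assuming (hypothetically) that the codes are always 4 characters wide; the Series stands in for a column already loaded from the sheet:

```python
import pandas as pd

# Stand-in for a column that pandas has already read as integers.
codes = pd.Series([734, 56, 1234], name="code")

# Convert to text and left-pad with zeros to the assumed width of 4.
padded = codes.astype(str).str.zfill(4)
print(padded.tolist())  # ['0734', '0056', '1234']
```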
