Converting pandas column to string w/o scientific notation - python

Lots of questions address this, but none of the solutions seem to work exactly as I need.
I have a dataframe with two columns of numbers with 10-20 digits each. These are actually ID #s, and I'd like to concatenate them. It looks like that's best done by first converting the values to strings.
However, when converting with .astype(str), pandas keeps the scientific notation, which won't fly.
Things I've tried:
tried: dtype arg ('str') or converters (using str()) in read_csv()
outcome: df.dtypes still lists 'objects,' and values still display in sci. notation
tried: pd.set_option('display.float_format', lambda x: '%.0f' % x)
outcome: displays good in df.head(), but reverts to scientific notation upon coercion to string & concatenation using + operator
tried: coercing to int, str, or str(int(x)).
outcome: int works when i coerce one value with int(), but not when I use astype(int). using .apply() with int() throws an 'invalid literal long() with base 10' error.
This feels like it should be pretty straightforward, anxious to figure out what I'm missing.

What you tried sets the display format. You could just format the float as a string in the dataframe.
import numpy as np
import pandas as pd
df=pd.DataFrame(data={'a':np.random.randint(low=1,high=100,size=10)*1e20,'b':np.random.randint(low=1,high=100,size=10)*1e20})
df.apply(lambda x: '{0:20.0f}|{1:20.0f}'.format(x.a,x.b),axis=1)
Out[34]:
0 9699999999999998951424|4600000000000000000000
1 300000000000000000000|2800000000000000000000
2 9400000000000000000000|9000000000000000000000
3 2100000000000000000000|4500000000000000000000
4 5900000000000000000000|4800000000000000000000
5 7700000000000000000000|6200000000000000000000
6 1600000000000000000000|8000000000000000000000
7 100000000000000000000|400000000000000000000
8 9699999999999998951424|8000000000000000000000
9 4500000000000000000000|3500000000000000000000
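Note the 9699999999999998951424 values above: a float64 cannot hold every 20-digit integer exactly, so IDs that long can lose digits once they are parsed as floats. A minimal sketch of one alternative, assuming the source file stores the full digits (not scientific notation) and using hypothetical column names id_a and id_b, is to read the columns as strings so they never become floats:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file; the column
# names id_a and id_b are assumptions made for this illustration.
csv_text = (
    "id_a,id_b\n"
    "12345678901234567890,98765432109876543210\n"
    "10000000000000000001,20000000000000000002\n"
)

# dtype=str keeps every digit, because the values are never parsed as numbers.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# With string columns, + concatenates instead of adding.
df["combined"] = df["id_a"] + df["id_b"]
print(df["combined"].tolist())
```

If the file itself already contains values such as 1.2E+19, the missing digits are gone before pandas sees them, and no dtype setting can bring them back.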

Related

Remove scientific notation floats in a dataframe

I am receiving different series from a source. Some of those series have the values in big numbers (X billions). I then combine all the series to a dataframe with individual columns for each series.
Now, when I print the dataframe, the big numbers in the series are shown in scientific notation. Even printing the series individually shows the numbers in scientific notation.
Dataframe df (multiindex) output is:
Values
Item Sub
A 1 1.396567e+12
B 1 2.868929e+12
I have tried this:
pd.set_option('display.float_format', lambda x: '%,.2f' % x)
This doesn't work as:
it converts everywhere. I only need the conversion in that specific dataframe.
it tries to convert all kinds of floats, and not just those in scientific. So, even if the float is 89.142, it will try to convert the format and as there's no digit to put ',' it shows an error.
Then I tried these:
df.round(2)
This only converted numeric floats to 2 decimals from existing 3 decimals. Didn't do anything to scientific values.
Then I tried:
df.astype(float)
Doesn't do anything visible. Output stayed the same.
How else can we change the scientific notation to normal float digits inside the dataframe. I do not want to create a new list with the converted values. The dataframe itself should show the values in normal terms.
Can you guys please help me find a solution for this?
Thank you.
Try df['column'] = df['column'].astype(str). If that does not work, convert the numbers to strings before creating the pandas dataframe from your data.
I would suggest keeping everything in a float type and adjust the display setting.
For example, I have generated a df with some random numbers.
df = pd.DataFrame({"Item": ["A", "B"], "Sub": [1,1],
"Value": [float(31132314122123.1), float(324231235232315.1)]})
# Item Sub Value
#0 A 1 3.113231e+13
#1 B 1 3.242312e+14
If we print(df.dtypes), we can see that the Sub values are ints and the Value values are floats.
Item object
Sub int64
Value float64
dtype: object
You can then call pd.options.display.float_format = '{:.1f}'.format to suppress the scientific notation of the floats, while retaining the float format.
# Item Sub Value
#0 A 1 31132314122123.1
#1 B 1 324231235232315.1
Item object
Sub int64
Value float64
dtype: object
If you want the scientific notation back, you can call pd.reset_option('display.float_format')
Okay. I found something called option_context in pandas that lets you change the display options for just one particular case or action, using a with statement.
with pd.option_context('display.float_format', '{:.2f}'.format):
    print(df)
This way we do not have to reset the options afterwards, and the options stay at their defaults for all other data in the file.
Sadly though, I could find no way to store different columns in different float formats (for example, one column as currency, comma-separated with 2 decimals, and the next column as a percentage, with no commas and 2 decimals).
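For display only (the format is not stored with the column), one possible approach is the formatters argument of DataFrame.to_string, which takes a mapping from column name to a formatting function. A minimal sketch, with made-up column names price and share:

```python
import pandas as pd

# Illustrative data; the column names are assumptions, not from the question.
df = pd.DataFrame({"price": [1234.5, 1000000.0], "share": [12.34, 0.5]})

# Each column gets its own formatter; the underlying values stay float64.
text = df.to_string(
    formatters={
        "price": "{:,.2f}".format,   # comma-separated, 2 decimals
        "share": "{:.2f}%".format,   # no commas, 2 decimals, percent sign
    }
)
print(text)
```

The dataframe itself is unchanged, so later calculations still work on the numeric values.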

How do i convert one column from an imported csv using numpy from string to float?

I have two csv files which i have imported to python using numpy.
the data has 2 columns:
[['month' 'total_rainfall']
['1982-01' '107.1']
['1982-02' '27.8']
['1982-03' '160.8']
['1982-04' '157']
['1982-05' '102.2']
I need to create a 2D array and calculate statistics with the 'total_rainfall' column. (Mean,Std Dev, Min and Max)
So i have this:
import numpy as np
datafile=np.genfromtxt(r"C:\rainfall-monthly-total.csv",delimiter=",",dtype=None,encoding=None)
print(datafile)
rainfall=np.asarray(datafile).astype(np.float32)
print (np.mean(datafile,axis=1))
ValueError: could not convert string to float: '2019-04'
Converting str to float is like below:
>>> a = "545.2222"
>>> float(a)
545.22220000000004
>>> int(float(a))
545
but the error message says the problem is converting 2019-04 to float.
Converting 2019-04 to float fails because a float cannot have a - between its digits. That is why you got the error.
You can convert the rainfall values to float or int, but the date cannot be converted directly. To turn the date into an int, split the string, build a date from the parts, and then convert it to milliseconds:
from datetime import datetime
month1 = '1982-01'
# datetime() needs integers, so convert the split parts first
year, month = (int(part) for part in month1.split('-'))
date = datetime(year, month, 1)
milliseconds = int(round(date.timestamp() * 1000))
This assumes the first day of the month.
Your error message reads could not convert string to float,
but actually your problem is a bit different.
Your array contains string columns, which should be converted:
month - to Period (month),
total_rainfall - to float.
Unfortunately, Numpy was designed to process arrays where all
cells are of the same type, so Pandas, where each column can have
its own type, is a much more convenient tool.
First, convert your Numpy array (I assume arr) to a pandasonic
DataFrame:
import pandas as pd
df = pd.DataFrame(arr[1:], columns=arr[0])
I took column names from the initial row and data from
following rows. Print df to see the result.
So far both columns are still of object type (actually string),
so the only thing to do is to convert both columns,
each to its desired type:
df.month = pd.PeriodIndex(df.month, freq='M')
df.total_rainfall = df.total_rainfall.astype(float)
Now, when you run df.info(), you will see that both
columns are of proper types.
To process your data, use Pandas as well. It is a more convenient tool.
E.g. to get quarterly sums, you can run:
df.set_index('month').resample('Q').sum()
getting (for your data sample):
total_rainfall
month
1982Q1 295.7
1982Q2 259.2
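For the statistics the question actually asks for (mean, standard deviation, min and max), a minimal NumPy-only sketch is to slice out the rainfall column, skipping the header row, before converting to float. The array below reuses the five sample rows from the question; the name arr follows the answer above.

```python
import numpy as np

# The sample rows from the question, including the header row.
arr = np.array([
    ["month", "total_rainfall"],
    ["1982-01", "107.1"],
    ["1982-02", "27.8"],
    ["1982-03", "160.8"],
    ["1982-04", "157"],
    ["1982-05", "102.2"],
])

# Skip the header (row 0) and take only column 1, so the month strings
# are never passed to the float conversion.
rainfall = arr[1:, 1].astype(float)

stats = {
    "mean": rainfall.mean(),
    "std": rainfall.std(),   # population standard deviation (ddof=0)
    "min": rainfall.min(),
    "max": rainfall.max(),
}
print(stats)
```

Pass ddof=1 to std() if the sample standard deviation is wanted instead.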

Two variable types in one column, one in suppressed scientific notation: pandas

I have a column that can hold two data types, varchar and float. My float is a long number such as 10154346483399200, which later shows up as 1.01543E+16, while the other entries can look like UPDATE-c1038-6036456184586739712. How can I bring the 'E' entries back to their original form?
so
In[3] a= ['1.01543E+16', 'UPDATE-c1038-6036456184586739712']
answer should be something like
['10154346483399200', 'UPDATE-c1038-6036456184586739712']
Applying just to solve suppressing Scientific Notation gives me an error
df.apply(lambda x: '%.3f' % x)
TypeError: float argument required, not str
Any help would be appreciated.
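A minimal sketch of one way to handle the mixed column, assuming the values are strings as in the example list: try to parse each entry as a number, expand it to plain notation if that works, and leave everything else untouched. The helper name expand_scientific is made up for this illustration. Note that digits already dropped by the scientific notation cannot be recovered, so '1.01543E+16' expands to 10154300000000000, not the original 10154346483399200.

```python
from decimal import Decimal, InvalidOperation


def expand_scientific(value):
    """Return value in plain decimal notation if it parses as a finite
    number; otherwise return it unchanged."""
    try:
        number = Decimal(value)
    except InvalidOperation:
        # Not a number (e.g. 'UPDATE-c1038-...'), keep the original text.
        return value
    if not number.is_finite():
        # Leave NaN / Infinity strings as they are.
        return value
    return format(number, "f")


a = ['1.01543E+16', 'UPDATE-c1038-6036456184586739712']
result = [expand_scientific(v) for v in a]
print(result)
```

On a dataframe column the same helper could be applied with something like df['col'].map(expand_scientific), where col is a placeholder column name.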

Py Pandas .format(dataframe)

As Python newbie I recently discovered that with Py 2.7 I can do something like:
print '{:20,.2f}'.format(123456789)
which will give the resulting output:
123,456,789.00
I'm now looking to have a similar outcome for a pandas df so my code was like:
import pandas as pd
import random
data = [[random.random()*10000 for i in range(1,4)] for j in range (1,8)]
df = pd.DataFrame (data)
print '{:20,.2f}'.format(df)
In this case I have the error:
Unknown format code 'f' for object of type 'str'
Any suggestions to perform something like '{:20,.2f}'.format(df) ?
For now my idea is to index the dataframe (it's a small one), then format each individual float within it, maybe assign astype(str), and rebuild the DF ... but that looks so ugly :-( and I'm not even sure it'll work ..
What do you think? I'm stuck ... and would like a better format for my dataframes when they are converted to reportlab grids.
import pandas as pd
import numpy as np
data = np.random.random((8,3))*10000
df = pd.DataFrame (data)
pd.options.display.float_format = '{:20,.2f}'.format
print(df)
yields (random output similar to)
0 1 2
0 4,839.01 6,170.02 301.63
1 4,411.23 8,374.36 7,336.41
2 4,193.40 2,741.63 7,834.42
3 3,888.27 3,441.57 9,288.64
4 220.13 6,646.20 3,274.39
5 3,885.71 9,942.91 2,265.95
6 3,448.75 3,900.28 6,053.93
The docstring for pd.set_option or pd.describe_option explains:
display.float_format: [default: None] [currently: None] : callable
The callable should accept a floating point number and return
a string with the desired format of the number. This is used
in some places like SeriesFormatter.
See core.format.EngFormatter for an example.
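The display option only affects how pandas prints the frame. If the formatted text itself is needed, for example as cell values for a grid in another library, a minimal sketch is to map the format function over every column to get a dataframe of strings. The data and column names here are illustrative:

```python
import pandas as pd

# Small illustrative frame; the values are made up.
df = pd.DataFrame({"a": [1234.5, 220.126], "b": [9942.91, 0.0]})

# Apply the same format to every value, column by column.
# The result holds strings, so keep the original df for calculations.
formatted = df.apply(lambda col: col.map("{:,.2f}".format))

# Nested list of strings, one inner list per row.
rows = formatted.to_numpy().tolist()
print(rows)
```

The nested list in rows is plain Python, so it can be handed to whatever consumes the table without relying on pandas display settings.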

unwanted type conversion in pandas.DataFrame.update

Is there any reason why pandas changes the type of columns from int to float in update, and can I prevent it from doing it? Here is some example code of the problem
import pandas as pd
import numpy as np
df = pd.DataFrame({'int': [1, 2], 'float': [np.nan, np.nan]})
print('Integer column:')
print(df['int'])
for _, df_sub in df.groupby('int'):
    df_sub['float'] = float(df_sub['int'])
    df.update(df_sub)
print('NO integer column:')
print(df['int'])
Here's the reason for this: since you are effectively masking certain values in a column and replacing them (with your updates), some values could become NaN.
In an integer array this is impossible, so numeric dtypes are converted to float a priori (for efficiency), as checking first is more expensive than doing this.
A change of dtype back is possible, just not in the code right now, therefore this is a bug (a bit non-trivial to fix though): github.com/pydata/pandas/issues/4094
This causes loss of precision if you have big values in your int64 column when update converts them to float. So going back, as Jeff suggests, with df['int'].astype(int) is not always possible.
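A small sketch of the precision loss described here: a float64 has 53 bits of mantissa, so an int64 above 2**53 may not survive a round trip through float. The value below is chosen only to illustrate this.

```python
import pandas as pd

# 2**53 + 1 is the smallest positive integer a float64 cannot represent.
big_id = 2**53 + 1
ints = pd.Series([big_id], dtype="int64")

# int64 -> float64 -> int64, which is effectively what happens when a
# column is converted to float and then cast back with astype(int).
round_tripped = ints.astype("float64").astype("int64")

print(ints.iloc[0], round_tripped.iloc[0])
```

The two printed numbers differ in the last digit, which is why casting back to int afterwards does not restore the original values.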
My workaround for cases like this is:
df_sub['int'] = df_sub['int'].astype('Int64') # Int64 with capital I, supports NA values
df.update(df_sub)
df_sub['int'] = df_sub['int'].astype('int')
The above avoids the conversion to float type. The reason I am converting back to int type (instead of leaving it as Int64) is that pandas seems to lack support for that type in several operations (e.g. concat gives an error about missing .view).
Maybe they could incorporate the above fix in issue 4094
