Remove scientific notation from floats in a dataframe - python

I am receiving different series from a source. Some of those series have values in the billions. I then combine all the series into a dataframe, with an individual column for each series.
Now, when I print the dataframe, the big numbers in the series are shown in scientific notation. Even printing a series individually shows the numbers in scientific notation.
The dataframe df (multiindex) output is:
                 Values
Item Sub
A    1     1.396567e+12
B    1     2.868929e+12
I have tried this:
pd.set_option('display.float_format', lambda x: '%,.2f' % x)
This doesn't work because:
it converts the display everywhere, while I only need the conversion in this one dataframe.
it tries to convert all kinds of floats, not just the ones shown in scientific notation. So even for a float like 89.142 it applies the format and shows an error (see the note below).
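For what it's worth, the error most likely comes from the format string itself rather than the value: Python's printf-style % formatting has no ',' thousands-separator flag, while str.format does. A quick check, reusing the 89.142 example from above:
x = 89.142
# printf-style formatting has no ',' flag; this line raises
# "ValueError: unsupported format character ','"
#   '%,.2f' % x
# str.format (and f-strings) do support the ',' separator:
print('{:,.2f}'.format(x))             # 89.14
print('{:,.2f}'.format(1.396567e12))   # 1,396,567,000,000.00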
Then I tried these:
df.round(2)
This only rounded ordinary floats from 3 decimals down to 2. It did nothing to the values displayed in scientific notation.
Then I tried:
df.astype(float)
This doesn't do anything visible; the output stays the same.
How else can I change the scientific notation to plain float digits inside the dataframe? I do not want to create a new list with the converted values; the dataframe itself should show the values in plain notation.
Can you please help me find a solution for this?
Thank you.

Try df['column'] = df['column'].astype(str). If that does not work, you should change the numbers to strings before creating the pandas dataframe from your data.
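A minimal sketch of that idea, assuming the column holding the big numbers is called Values (as in the question): format only that column as strings, so global display options stay untouched.
import pandas as pd

df = pd.DataFrame({"Values": [1.396567e+12, 2.868929e+12]})

# Render just this column as comma-separated strings.
# Note the column becomes object dtype, so no more arithmetic on it.
df["Values"] = df["Values"].map("{:,.2f}".format)
print(df)
#                  Values
# 0  1,396,567,000,000.00
# 1  2,868,929,000,000.00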

I would suggest keeping everything as floats and adjusting the display setting.
For example, I have generated a df with some random numbers.
df = pd.DataFrame({"Item": ["A", "B"], "Sub": [1,1],
"Value": [float(31132314122123.1), float(324231235232315.1)]})
# Item Sub Value
#0 A 1 3.113231e+13
#1 B 1 3.242312e+14
If we look at df.dtypes, we can see that the Sub values are ints and the Value values are floats.
Item      object
Sub        int64
Value    float64
dtype: object
You can then call pd.options.display.float_format = '{:.1f}'.format to suppress the scientific notation of the floats, while retaining the float format.
#   Item  Sub              Value
# 0    A    1   31132314122123.1
# 1    B    1  324231235232315.1
Item      object
Sub        int64
Value    float64
dtype: object
If you want the scientific notation back, you can call pd.reset_option('display.float_format')

Okay. I found something called option_context in pandas that allows changing the display options just for a particular case/action, using a with statement.
with pd.option_context('display.float_format', '{:.2f}'.format):
    print(df)
So we do not have to reset the options afterwards, and the defaults stay in place for all other data in the file.
Sadly, though, I could find no way to show different columns in different float formats (for example, one column as currency, comma-separated with 2 decimals, while the next column is a percentage, non-comma with 2 decimals); but see the sketch below.
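For the per-column case, one workaround (a rendering call rather than a display option) is the formatters argument of DataFrame.to_string, which takes a dict mapping column names to format functions. The column names below are made up for illustration:
import pandas as pd

df = pd.DataFrame({"price": [1234567.891, 2345.5],
                   "growth": [0.1234, 0.9876]})

# One format function per column: currency with commas, percentage with 2 decimals.
print(df.to_string(formatters={
    "price": "{:,.2f}".format,
    "growth": "{:.2%}".format,
}))
#           price growth
# 0  1,234,567.89 12.34%
# 1      2,345.50 98.76%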

Related

how to deal with strings on a numeric column in pandas?

I have a big dataset and I cannot convert the dtype from object to int because of the error "invalid literal for int() with base 10:". I did some research, and it is because there are some strings within the column.
How can I find those strings and replace them with numeric values?
You might be looking for .str.isnumeric(), which will let you filter the data for these numbers-in-strings and act on them independently, but you'll need to decide what those other values should be:
converted (maybe they're money and you want to strip a €, or a date format that's not a UNIX epoch, or any number of possibilities...)
dropped (just throw them away)
something else
>>> df = pd.DataFrame({"a":["1", "2", "x"]})
>>> df
a
0 1
1 2
2 x
>>> df[df["a"].str.isnumeric()]
a
0 1
1 2
>>> df[~df["a"].str.isnumeric()]
a
2 x
Assuming 'col' is the column name, just force-convert to numeric, with NaN upon error:
df['col_num'] = pd.to_numeric(df['col'], errors='coerce')
If needed, you can check which original values gave NaNs using:
df.loc[df['col'].notna() & df['col_num'].isna(), 'col']
"Base 10" in that error just means int() expected a plain decimal integer, so a float-like string such as "3.7" fails. In Python you would do
int(float(____))
Since you used int(), I'm guessing you needed an integer value.
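A quick illustration of why the two-step cast matters; int() rejects a float-looking string, while float() accepts it:
>>> int("3.7")
ValueError: invalid literal for int() with base 10: '3.7'
>>> int(float("3.7"))
3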

Python TypeError: cannot convert the series to <class 'int'> when using math.floor() for iloc index lookup value

I'm having an issue where I need a function to find a corresponding value from within a dataframe with multiple rows, looking similar to:
Value
0 1.2332165631653
1 6.5651324661235
2 2.3651432415454
3 1.6566584651432
4 9.5168743514354
5 ...
My function looks like this:
import math
import pandas as pd

df1 = pd.read_csv('Data1.csv')
df2 = pd.read_csv('Data2.csv')

def dfFunction(A, B):
    Step = 10
    AB = A * B
    ABInt = math.floor(AB / Step)
    dfValue = df1.iloc[ABInt]
    return AB / dfValue
When I pass A and B as an int or float, the function works, but when I try to apply the function to df2 (similar to df1 in layout, just with additional float columns), I get this error.
I've tried df2.apply(dfFunction(df2.ColumnA, df2.ColumnB), axis=1) and simply dfFunction(df2.ColumnA, df2.ColumnB).
I fundamentally understand the error, since it's pointing at the math.floor() line: I can't floor a whole column to get a single row index for df1. Is there another way to write the function or look up the value? I'd just use iloc() directly if the values weren't floats with long decimals, but they are means from another portion of the code.
Please let me know if further clarification is needed; I'm only a beginner with Python and Stack :)
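math.floor expects a single number, which is why it chokes on a whole Series. A hedged sketch of the usual vectorised fix, assuming df1's lookup column is the Value column shown above and the ColumnA/ColumnB names from the question:
import numpy as np
import pandas as pd

df1 = pd.read_csv('Data1.csv')
df2 = pd.read_csv('Data2.csv')

def dfFunction(A, B, step=10):
    AB = A * B
    # np.floor works elementwise; cast to int so the result can index rows
    idx = np.floor(AB / step).astype(int)
    # positional lookup of one df1 value per row of df2
    dfValue = df1['Value'].to_numpy()[idx]
    return AB / dfValue

result = dfFunction(df2.ColumnA, df2.ColumnB)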

Why does np.mean applied to a pandas string column not yield an error?

How does the logic of calculating a mean on a string column work (the result is 246.8)? Is there any specific use case for it?
import pandas as pd
import numpy as np
s = np.array(["0", "1", "2", "3", "4"])
pd.DataFrame(s).mean()
Out[1]:
0 246.8
dtype: float64
Just to be clear, I am aware that to calculate the mean of numbers, I should do something along these lines.
pd.DataFrame(s.astype(int)).mean()
Out[2]:
0 2.0
dtype: float64
What is happening is that the strings are getting concatenated (string "addition"), forming the string "01234", which gets cast to the number 1234 and divided by the count: 1234 / 5 = 246.8. This will only happen if the strings are numerical, i.e. they represent numbers in string format; add a non-numeric string (e.g. "x" or "hello") to the array and you will see that it no longer works.
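The two steps (concatenation, then division by the count) are easy to reproduce by hand:
>>> "0" + "1" + "2" + "3" + "4"
'01234'
>>> float("01234") / 5
246.8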

Why is the dataframe altered when using isin?

I don't understand what is happening. It was working fine, and suddenly, it's not.
I have a dataframe looking like this:
And when I try to filter the indicators, the data is altered to look like this:
This is the code I use to filter the indicators and I expect to keep the same data
dfCountry = data.copy()
goodIndicators = ['GDP per capita (current US$)', 'Internet users (per 100 people)', 'other indicators']
dfCountry = dfCountry[dfCountry["Indicator Name"].isin(goodIndicators)]
This is something that pandas does with large or small numbers: it uses scientific notation. You could fix it by simply rounding your numbers. You can pass a number to the round method to round to a specific number of decimals.
n_decimals = 0
df = df.round(n_decimals) # or df = df.round() since zero is default
You could also change the pandas configuration and set the number of decimals pandas should show yourself.
pd.set_option('display.float_format', lambda x: '%.5f' % x)
You could also just convert the floats to integers when you don't care about decimals.
for column in [c for c in df.columns if c.startswith('2')]:
    df.loc[:, column] = df.loc[:, column].apply(int)

Python dataframe: how to group by one column and get the sum of another column

I want to create a new dataframe with 2 columns, grouped by Striker_Id, where the other column holds the sum of Batsman_Scored for each Striker_Id.
E.g.:
Striker_Id  Batsman_Scored
1           0
2           8
...
I tried ball.groupby(['Striker_Id'])['Batsman_Scored'].sum(), but this is what I get:
Striker_Id
1 0000040141000010111000001000020000004001010001...
2 0000000446404106064011111011100012106110621402...
3 0000121111114060001000101001011010010001041011...
4 0114110102100100011010000000006010011001111101...
5 0140016010010040000101111100101000111410011000...
6 1100100000104141011141001004001211200001110111...
It doesn't sum; it just concatenates all the numbers. What's the alternative?
For some reason, your columns were loaded as strings. While loading them from a CSV, try applying a converter -
df = pd.read_csv('file.csv', converters={'Batsman_Scored' : int})
Or,
df = pd.read_csv('file.csv', converters={'Batsman_Scored' : pd.to_numeric})
If that doesn't work, then convert to integer after loading -
df['Batsman_Scored'] = df['Batsman_Scored'].astype(int)
Or,
df['Batsman_Scored'] = pd.to_numeric(df['Batsman_Scored'], errors='coerce')
Now, performing the groupby should work -
r = df.groupby('Striker_Id')['Batsman_Scored'].sum()
Without access to your data, I can only speculate. But it seems like, at some point, your data contains non-numeric entries that prevent pandas from performing conversions, so those columns are retained as strings. It's a little difficult to pinpoint the problematic data until you actually load it in and do something like
(~df.col.str.isdigit()).any()
That'll tell you whether there are any non-numeric items. Note that this only works for integers; float columns cannot be debugged like this (see the sketch below).
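For float columns, a pattern that works regardless of the values is to coerce with pd.to_numeric and inspect what comes back as NaN; a small sketch:
import pandas as pd

s = pd.Series(["1.5", "2", "oops", "3.25"])

# Rows that fail numeric conversion but were not missing to begin with
bad = s[pd.to_numeric(s, errors='coerce').isna() & s.notna()]
print(bad)
# 2    oops
# dtype: object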
Also, another way of seeing what columns have corrupt data would be to query dtypes -
df.dtypes
Which will give you a listing of all columns and their datatypes. Use this to figure out what columns need parsing -
for c in df.columns[df.dtypes == object]:
    print(c)
You can then apply the methods outlined above to fix them.
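Putting it together on a toy frame, so both the string-concatenation symptom and the fix are visible:
import pandas as pd

ball = pd.DataFrame({'Striker_Id': [1, 1, 2, 2],
                     'Batsman_Scored': ['0', '4', '1', '6']})  # strings!

# With strings, sum() concatenates instead of adding
print(ball.groupby('Striker_Id')['Batsman_Scored'].sum())
# 1    04
# 2    16

# After conversion, it adds as expected
ball['Batsman_Scored'] = pd.to_numeric(ball['Batsman_Scored'], errors='coerce')
print(ball.groupby('Striker_Id')['Batsman_Scored'].sum())
# 1    4
# 2    7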
