How to convert Decimal128 to decimal in pandas dataframe - python

I have a dataframe with many (but not all) Decimal128 columns (taken from a MongoDB collection). I can't perform any math or comparisons on them (e.g. '<' not supported between instances of 'Decimal128' and 'float').
What is the quickest/easiest way to convert all of these to float or some simpler built-in type that I can work with?
There is the Decimal128 to_decimal() method, and pandas astype(), but how can I do it for all the Decimal128 columns in one step or helper method?
Edit: I've tried:
testdf = my_df.apply(lambda x: x.astype(str).astype(float) if isinstance(x, Decimal128) else x)
testdf[testdf["MyCol"] > 80].head()
but I get:
TypeError: '>' not supported between instances of 'Decimal128' and 'int'
Converting a single column using .astype(str).astype(float) works.

To cast the full DataFrame (this only works if every column holds numeric values):
df = df.astype(str).astype(float)
For a single column (IDs is the name of the column):
df["IDs"] = df.IDs.astype(str).astype(float)
Test implementation:
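from pprint import pprint
import bson
import pandas as pd

df = pd.DataFrame()
y = []
for i in range(1, 6):
    i = i * 2 / 3.5
    # Build each Decimal128 from the string form of the float
    y.append(bson.decimal128.Decimal128(str(i)))
pprint(y)
df["D128"] = y
# Convert via str first, then to float
df["D128"] = df.D128.astype(str).astype(float)
print("\n", df)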
Output:
[Decimal128('0.5714285714285714'),
Decimal128('1.1428571428571428'),
Decimal128('1.7142857142857142'),
Decimal128('2.2857142857142856'),
Decimal128('2.857142857142857')]
D128
0 0.571429
1 1.142857
2 1.714286
3 2.285714
4 2.857143

Just use:
df = df.astype(float)
You can also use apply or applymap(applying element wise operations), although these are inefficient compared to previous method.
df = df.applymap(float)
I can't reproduce a Decimal128 number in my system. Can you please check if the next line works for you?
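df = df.apply(lambda x: x.astype(str).astype(float) if isinstance(x.iloc[0], bson.decimal128.Decimal128) else x)
It checks whether the first value of each column is a Decimal128 and, if so, converts that column to float (this assumes the columns are non-empty and homogeneous). Note that apply passes whole columns to the lambda, so the isinstance test has to look at an element, not at the column itself.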
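A minimal sketch of a one-step helper along these lines. It assumes that the Decimal128 values expose to_decimal() (as the question states) and detects them by that method rather than by type. StandInDecimal128 and decimal128_columns_to_float are hypothetical names introduced here; the stand-in class only exists so the sketch runs without pymongo installed.

```python
from decimal import Decimal

import pandas as pd


class StandInDecimal128:
    """Hypothetical stand-in for bson.decimal128.Decimal128 (only to_decimal is mimicked)."""

    def __init__(self, value):
        self._value = Decimal(value)

    def to_decimal(self):
        return self._value


def decimal128_columns_to_float(df):
    """Return a copy of df in which columns made of to_decimal()-capable objects are float."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype != object:
            continue  # numeric, datetime, etc. columns are left alone
        non_null = out[col].dropna()
        if non_null.empty or not non_null.map(lambda v: hasattr(v, "to_decimal")).all():
            continue  # ordinary object column (e.g. strings)
        out[col] = out[col].map(
            lambda v: float(v.to_decimal()) if hasattr(v, "to_decimal") else v
        ).astype(float)
    return out


demo = pd.DataFrame(
    {
        "MyCol": [StandInDecimal128("79.5"), StandInDecimal128("81.25")],
        "name": ["a", "b"],
        "n": [1, 2],
    }
)
converted = decimal128_columns_to_float(demo)
print(converted[converted["MyCol"] > 80])
```

With real data the stand-in class is not needed; the helper would be called directly on the dataframe loaded from MongoDB. Converting to float loses the exact decimal precision, so keeping the Decimal objects from to_decimal() may be preferable where that matters.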

Related

Pandas apply multiple function with list

I have a df with a 'File_name' column which contains strings of a file name, which I would like to parse:
data = [['f1h3_13oct2021_gt1.csv', 2], ['p8-gfr-20dec2021-81.csv', 0.5]]
df= pd.DataFrame(data, columns = ['File_name', 'Result'])
df.head()
Now I would like to create a new column by splitting the file name on the '_' and '-' delimiters and then searching the resulting list for the chunk that can be converted to a datetime object. The naming convention is not always the same (the order varies, so I cannot rely on character positions), and the code should wrap the datetime conversion in a "try", because the chunk that should hold the date is often in the wrong format or missing.
I came up with the following, but it does not really look pythonic to me
# Solution #1
for i, value in df['File_name'].items():  # .iteritems() in old pandas
    chunks = value.split('-') + value.split('_')
    for chunk in chunks:
        try:
            df.loc[i, 'Date_Sol#1'] = dt.datetime.strptime(chunk, '%d%b%Y')
        except ValueError:
            pass
df.head()
Alternatively, I tried to use the apply method with two chained functions, but I could not work out how to combine the chaining with the try/pass statement, and I did not manage to get it working:
# Solution #2
import re
splitme = lambda x: re.split('_|-', x)
calcdate = lambda x : dt.datetime.strptime(x, '%d%b%Y')
df['t1'] = df['File_name'].apply(splitme)
df['Date_Sol#2'] =df['t1'].apply(lambda x: calcdate(x) for x in df['t1'] if isinstance(calcdate(x),dt.datetime) else Pass)
df.head()
I thought a list comprehension might help?
Any help how Solution #2 might look like?
Thanks in advance
Assuming you want to extract and convert the possible chunks as date, you could split the string on delimiters, explode to multiple rows and attempt to convert to date with pandas.to_datetime:
df.join(
    pd.to_datetime(df['File_name'].str.split(r'[_-]').explode(), errors='coerce')
      .dropna()
      .rename('Date')
)
output:
File_name Result Date
0 f1h3_13oct2021_gt1.csv 2.0 2021-10-13
1 p8-gfr-20dec2021-81.csv 0.5 2021-12-20
NB. if you have potentially many dates per string, you need to add a further step to select the one you want. Please give more details if this is the case.
Python version for older pandas (e.g. without Series.explode):
import re
s = pd.Series([next(iter(pd.to_datetime(re.split(r'[._-]', s), errors='coerce')
.dropna()), float('nan'))
for s in df['File_name']], index=df.index, name='date')
df.join(s)
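For the apply-based "Solution #2" the question asks about, one possible sketch is to move the try/except into a small named function and apply that. The helper name first_date and the fmt parameter are assumptions introduced here; the date format is the one from the question.

```python
import datetime as dt
import re

import pandas as pd


def first_date(file_name, fmt="%d%b%Y"):
    """Return the first chunk of file_name that parses with fmt, or NaT if none does."""
    for chunk in re.split(r"[._-]", file_name):
        try:
            return dt.datetime.strptime(chunk, fmt)
        except ValueError:
            continue  # not a date in this format, try the next chunk
    return pd.NaT


data = [['f1h3_13oct2021_gt1.csv', 2], ['p8-gfr-20dec2021-81.csv', 0.5]]
df = pd.DataFrame(data, columns=['File_name', 'Result'])
df['Date'] = df['File_name'].apply(first_date)
print(df)
```

A lambda cannot contain a try statement, which is why a named function is used here instead of chaining two lambdas.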

Convert selected elements in data frame from float to integer unsuccessful

I'm trying to convert a list of elements in the dataframe called "GDP" from floats to integers. The cells that I want to convert are specified by GDP.iloc[4,-10:]. I have tried the following methods:
for x in GDP.iloc[4,-10:]:
    pd.to_numeric(x, downcast='signed')
GDP.iloc[4,-10:]=GDP.iloc[4,-10:].astype(int)
GDP.iloc[4,-10:]=int(GDP.iloc[4,-10:])
However, none of them seem to be working in converting the float to integers. No errors appear for methods 1 and 2 but for option 3, the following error appears:
TypeError: cannot convert the series to
The data can be found here: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
GDP = pd.read_csv('world_bank.csv',header=None)
Method 1
for x in GDP.iloc[4,-10:]:
    pd.to_numeric(x, downcast='signed')
Method 2:
GDP.iloc[4,-10:]=GDP.iloc[4,-10:].astype(int)
Method 3:
GDP.iloc[4,-10:]=int(GDP.iloc[4,-10:])
Can someone help me out? Much appreciated.
You can use astype(np.int64) to convert to int
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# df.head()
df = df.fillna('custom_none_values')
# df.head()
df = df[df['1960'] != 'custom_none_values']
df['1960'] = df['1960'].astype(np.int64)
df.head()
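A likely reason why methods 1 and 2 appear to do nothing: method 1 discards the value returned by pd.to_numeric, and method 2 writes integers back into columns whose dtype is not integer, so they are stored in the column's existing dtype again. The sketch below illustrates this on a small made-up frame (the values and column names are assumptions, not the World Bank data) and shows two ways around it.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few GDP columns
gdp = pd.DataFrame({"2018": [1.5e3, 2.0e3, np.nan], "2019": [1.6e3, 2.1e3, 2.2e3]})

# Writing ints back into one row of float columns: the columns stay float
gdp.iloc[0, :] = gdp.iloc[0, :].astype(int)
dtypes_after_row_assign = gdp.dtypes

# Option A: pull the row out as its own integer Series
row_as_int = gdp.iloc[0, :].astype(int)

# Option B: convert whole columns; nullable Int64 keeps missing values as <NA>
as_int = gdp.round().astype("Int64")

print(dtypes_after_row_assign)
print(row_as_int)
print(as_int)
```

A dtype belongs to a whole column, so a single row can only hold integers if every column it crosses is an integer column.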

How to check if float pandas column contains only integer numbers?

I have a dataframe
df = pd.DataFrame(data=np.arange(10),columns=['v']).astype(float)
How to make sure that the numbers in v are whole numbers?
I am very concerned about rounding/truncation/floating point representation errors
Comparison with astype(int)
Tentatively convert your column to int and test with np.array_equal:
np.array_equal(df.v, df.v.astype(int))
True
float.is_integer
You can use this python function in conjunction with an apply:
df.v.apply(float.is_integer).all()
True
Or, using python's all in a generator comprehension, for space efficiency:
all(x.is_integer() for x in df.v)
True
Here's a simpler, and probably faster, approach:
(df[col] % 1 == 0).all()
To ignore nulls:
(df[col].fillna(-9999) % 1 == 0).all()
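As a small variation on the modulo check, missing values can be dropped instead of being replaced with a sentinel. This is only a sketch; is_whole is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def is_whole(series):
    """True if every non-missing value in the float Series is a whole number."""
    values = series.dropna()
    return bool((values % 1 == 0).all())


whole = pd.Series([0.0, 1.0, 2.0, np.nan])
not_whole = pd.Series([0.0, 1.5, 2.0])

print(is_whole(whole), is_whole(not_whole))
```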
If you want to check multiple float columns in your dataframe, you can do the following:
col_should_be_int = df.select_dtypes(include=['float']).applymap(float.is_integer).all()
float_to_int_cols = col_should_be_int[col_should_be_int].index
df[float_to_int_cols] = df[float_to_int_cols].astype(int)  # df.loc[:, cols] = ... keeps the float dtype in pandas 2.x
Keep in mind that a float column containing only whole numbers will not get selected if it also has np.nan values. To cast float columns with missing values to integer, you need to fill or remove the missing values first, for example with median imputation:
float_cols = df.select_dtypes(include=['float'])
float_cols = float_cols.fillna(float_cols.median().round()) # median imputation
col_should_be_int = float_cols.applymap(float.is_integer).all()
float_to_int_cols = col_should_be_int[col_should_be_int].index
df[float_to_int_cols] = float_cols[float_to_int_cols].astype(int)
For completeness, pandas v1.0+ offers the convert_dtypes() utility, which (among 3 other conversions) performs the requested operation for all dataframe columns (or series) containing only integer numbers.
If you wanted to limit the conversion to a single column only, you could do the following:
>>> df.dtypes # inspect previous dtypes
v float64
>>> df["v"] = df["v"].convert_dtypes()
>>> df.dtypes # inspect converted dtypes
v Int64
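A short sketch of convert_dtypes() applied to a whole frame, using made-up columns: whole-number floats (with or without missing values) should come back as nullable Int64, while a column with fractional values stays a float type.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "v": np.arange(5, dtype=float),          # whole numbers only
        "u": [10.0, np.nan, 20.0, 30.0, 40.0],   # whole numbers plus a missing value
        "w": [0.5, 1.0, 1.5, 2.0, 2.5],          # genuinely fractional
    }
)
converted = df.convert_dtypes()
print(converted.dtypes)
```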
On 27,331,625 rows, this comparison works well (time: 1.3 s):
df['is_float'] = df[field_fact_qty]!=df[field_fact_qty].astype(int)
This approach took 4.9 s:
df[field_fact_qty].apply(lambda x : (x.is_integer()))

Numbers with hyphens or strings of numbers with hyphens

I need to make a pandas DataFrame that has a column filled with hyphenated numbers. The only way I could think of to do this was to use strings. This all worked fine, until I needed to sort them to get them back into order after a regrouping. The problem is that strings sort like this:
['100-200','1000-1100','1100-1200','200-300']
This is clearly not how I want it sorted. I want it sorted numerically. How would I get this to work? I am willing to change anything. Keeping the hyphenated string as an integer or float would be the best, but I am unsure how to do that.
You could try something like this:
>>> t = ['100-200','1000-1100','1100-1200','200-300']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100-200', '200-300', '1000-1100', '1100-1200']
This would allow you to sort on integers, and if a hyphen exists, it will sort first by the first integer in the key list and then by the second. If no hyphen exists, you will sort just on the integer equivalent of the string:
>>> t = ['100-200','1000-1100','1100-1200','200-300', '100']
>>> t.sort(key=lambda x: [int(y) for y in x.split('-')])
>>> t
['100', '100-200', '200-300', '1000-1100', '1100-1200']
If you have any float equivalents in any strings, simply change int to float like this:
>>> t = ['100-200.3','1000.5-1100','1100.76-1200','200-300.75', '100.35']
>>> t.sort(key=lambda x: [float(y) for y in x.split('-')])
>>> t
['100-200.3', '100.35', '200-300.75', '1000.5-1100', '1100.76-1200']
You could use sorted to construct a new ordering for the index, and then perform the sort (reordering) using df.take:
import pandas as pd
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']})
order = sorted(range(len(df)),
               key=lambda idx: [int(n) for n in df.loc[idx, 'foo'].split('-')])
df = df.take(order)
print(df)
yields
foo
0 100-200
3 200-300
1 1000-1100
2 1100-1200
This is similar to #275365's solution, but note that the sorting is done on range(len(df)), not on the strings. The strings are only used in the key parameter to determine the order in which range(len(df)) should be rearranged.
Using sorted works fine if the DataFrame is small. You can get better performance when the DataFrame is of moderate size (for example, a few hundred rows on my machine), by using numpy.argsort instead:
import pandas as pd
import numpy as np
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']*100})
arr = df['foo'].map(lambda item: [int(n) for n in item.split('-')]).values
order = np.argsort(arr)
df = df.take(order)
Alternatively, you could split your string column into two integer-valued columns, and then use df.sort_values (df.sort in old pandas versions):
import pandas as pd
df = pd.DataFrame({'foo':['100-200','1000-1100','1100-1200','200-300']})
df[['start', 'end']] = df['foo'].apply(lambda val: pd.Series([int(n) for n in val.split('-')]))
df.sort_values(['start', 'end'], inplace=True)
print(df)
yields
foo start end
0 100-200 100 200
3 200-300 200 300
1 1000-1100 1000 1100
2 1100-1200 1100 1200
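If the helper columns are not wanted in the result, a sketch of the same idea is to build the integer bounds in a temporary frame and reuse only its sorted index. This assumes every value contains exactly one hyphen; a bare '100' would leave a missing value in the second column and make astype(int) fail.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['100-200', '1000-1100', '1100-1200', '200-300']})

# Temporary frame with integer columns 0 (start) and 1 (end)
bounds = df['foo'].str.split('-', expand=True).astype(int)

# Sort the bounds, then reorder the original rows by the resulting index
order = bounds.sort_values([0, 1]).index
sorted_df = df.loc[order]
print(sorted_df)
```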

Py Pandas .format(dataframe)

As a Python newbie, I recently discovered that with Python 2.7 I can do something like:
print '{:20,.2f}'.format(123456789)
which will give the resulting output:
123,456,789.00
I'm now looking to have a similar outcome for a pandas df so my code was like:
import pandas as pd
import random
data = [[random.random()*10000 for i in range(1,4)] for j in range (1,8)]
df = pd.DataFrame (data)
print '{:20,.2f}'.format(df)
In this case I have the error:
Unknown format code 'f' for object of type 'str'
Any suggestions to perform something like '{:20,.2f}'.format(df) ?
For now, my idea is to index the dataframe (it's a small one), format each individual float within it, maybe convert with astype(str), and rebuild the DF ... but that looks ugly :-( and I'm not even sure it'll work.
What do you think? I'm stuck, and I would like a better format for my dataframes when they are converted to ReportLab grids.
import pandas as pd
import numpy as np
data = np.random.random((8,3))*10000
df = pd.DataFrame (data)
pd.options.display.float_format = '{:20,.2f}'.format
print(df)
yields (random output similar to)
0 1 2
0 4,839.01 6,170.02 301.63
1 4,411.23 8,374.36 7,336.41
2 4,193.40 2,741.63 7,834.42
3 3,888.27 3,441.57 9,288.64
4 220.13 6,646.20 3,274.39
5 3,885.71 9,942.91 2,265.95
6 3,448.75 3,900.28 6,053.93
The docstring for pd.set_option or pd.describe_option explains:
display.float_format: [default: None] [currently: None] : callable
The callable should accept a floating point number and return
a string with the desired format of the number. This is used
in some places like SeriesFormatter.
See core.format.EngFormatter for an example.
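The display option only affects how pandas prints floats. If the formatted values themselves are needed as strings, for example to hand over to a report grid, one possible sketch is to map the same format callable over every column. The data below is made up for illustration, and the width of 20 is dropped so the strings are not padded.

```python
import pandas as pd

df = pd.DataFrame([[1234567.891, 0.126], [42.0, 9999.5]])

# Apply the format callable element by element, one column at a time
formatted = df.apply(lambda col: col.map("{:,.2f}".format))
print(formatted)
```

The original frame keeps its float values, so calculations can continue on df while formatted is used only for output.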
