Pandas how to truncate small float values

I'm using pandas.DataFrame.round to truncate the columns of a DataFrame, but I have a column of p-values containing very small values, which end up rounded to zero. For example, all the values below are being rounded to 0.
p-value
2.298564e-17
6.848231e-91
1.089847e-10
9.390048e-04
5.628517e-35
4.621786e-19
4.601818e-54
9.639073e-19
I want something like
p-value
2.29e-17
6.84e-91
1.08e-10
9.39e-04
5.62e-35
4.62e-19
4.60e-54
9.63e-19

NumPy has a function for this, numpy.format_float_scientific:
import numpy as np
import pandas as pd
data = """p-value
2.298564e-17
6.848231e-91
1.089847e-10
9.390048e-04
5.628517e-35
4.621786e-19
4.601818e-54
9.639073e-19"""
a = data.split("\n")
df = pd.DataFrame({"p-value": a[1:]})
df["p-value"] = df["p-value"].astype(float)
df["p-value"].apply(lambda x: np.format_float_scientific(x, precision=2))
Output:
0 2.3e-17
1 6.85e-91
2 1.09e-10
3 9.39e-04
4 5.63e-35
5 4.62e-19
6 4.60e-54
7 9.64e-19
Name: p-value, dtype: object

Not quite truncating, but rather rounding:
df['p-value'].apply(lambda x: f'{x:.2e}')
Output:
0 2.30e-17
1 6.85e-91
2 1.09e-10
3 9.39e-04
4 5.63e-35
5 4.62e-19
6 4.60e-54
7 9.64e-19
Name: p-value, dtype: object
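Note that both answers round the last kept digit rather than truncate it. If a genuine truncation of the mantissa is needed (2.298564e-17 -> 2.29e-17 rather than 2.30e-17), one possibility is to format with a few extra digits and cut the string. A rough sketch, reusing the df built above (truncate_sci is a made-up helper name):
def truncate_sci(x, digits=2):
    # format with extra digits so the cut is normally unaffected by rounding
    mantissa, exp = f"{x:.{digits + 4}e}".split("e")
    sign = "-" if mantissa.startswith("-") else ""
    whole, frac = mantissa.lstrip("-").split(".")
    return f"{sign}{whole}.{frac[:digits]}e{exp}"
df["p-value"].apply(truncate_sci)
# 0    2.29e-17
# 1    6.84e-91
# ...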

Add commas to decimal column without rounding off

I have a pandas column named Price_col, which looks like this.
Price_col
1. 1000000.000
2. 234556.678900
3. 2345.00
4.
5. 23.56
I am trying to add commas to my Price_col to look like this.
Price_col
1. 1,000,000.000
2. 234,556.678900
3. 2,345.00
4.
5. 23.56
When I try to convert the values, they always get rounded off. Is there a way to keep the original value without rounding?
I tried the code below; this is what I got for the value 234556.678900.
n = "{:,}".format(234556.678900)
print(n)
>>> 234,556.6789
Add f for fixed-point
>>> "{:,}".format(234556.678900)
'234,556.6789'
>>> "{:,f}".format(234556.678900)
'234,556.678900'
You can also control the precision with .p, where p is the number of digits (and you should probably do so). Beware that, as you're dealing with floats, you'll have some IEEE 754 aliasing, though the representation produced by format should be quite nice regardless of the backing data.
>>> "{:,.5f}".format(234556.678900)
'234,556.67890'
>>> "{:,.20f}".format(234556.678900)
'234,556.67889999999897554517'
The full Format Specification Mini-Language can be found here:
https://docs.python.org/3/library/string.html#format-specification-mini-language
From your comment, I realized you may really want something else, as described in How to display pandas DataFrame of floats using a format string for columns?, and only change the view of the data.
Creating a new column formatted as a string
>>> df = pd.DataFrame({"Price_col": [1000000.000, 234556.678900, 2345.00, None, 23.56]})
>>> df["price2"] = df["Price_col"].apply(lambda x: f"{x:,f}")
>>> df
Price_col price2
0 1000000.0000 1,000,000.000000
1 234556.6789 234,556.678900
2 2345.0000 2,345.000000
3 NaN nan
4 23.5600 23.560000
>>> df.dtypes
Price_col float64
price2 object
dtype: object
Temporarily changing how data is displayed
>>> df = pd.DataFrame({"Price_col": [1000000.000, 234556.678900, 2345.00, None, 23.56]})
>>> print(df)
Price_col
0 1000000.0000
1 234556.6789
2 2345.0000
3 NaN
4 23.5600
>>> with pd.option_context('display.float_format', '€{:>18,.6f}'.format):
... print(df)
...
Price_col
0 € 1,000,000.000000
1 € 234,556.678900
2 € 2,345.000000
3 NaN
4 € 23.560000
>>> print(df)
Price_col
0 1000000.0000
1 234556.6789
2 2345.0000
3 NaN
4 23.5600
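If the formatting should stick for the whole session rather than a single print, pd.set_option can be used instead of the option_context block above; a small sketch (pd.reset_option restores the default afterwards):
>>> pd.set_option('display.float_format', '{:,.6f}'.format)
>>> print(df)          # every float column now prints with thousands separators
>>> pd.reset_option('display.float_format')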

Comparing elements between two dataframes and adding columns in case of equality

Considering two dataframes as follows:
import pandas as pd
df_rp = pd.DataFrame({'id':[1,2,3,4,5,6,7,8], 'res': ['a','b','c','d','e','f','g','h']})
df_cdr = pd.DataFrame({'id':[1,2,5,6,7,1,2,3,8,9,3,4,8],
'LATITUDE':[-22.98, -22.97, -22.92, -22.87, -22.89, -22.84, -22.98,
-22.14, -22.28, -22.42, -22.56, -22.70, -22.13],
'LONGITUDE':[-43.19, -43.39, -43.24, -43.28, -43.67, -43.11, -43.22,
-43.33, -43.44, -43.55, -43.66, -43.77, -43.88]})
What I have to do:
Compare each df_rp['id'] element with each df_cdr['id'] element;
If they are the same, I need to add the latitudes and longitudes that are on the same row as that id to a data structure (list, Series, etc.), without repeating the id.
Below is an example of how I need the data to be grouped:
1:[-22.98,-43.19],[-22.84,-43.11]
2:[-22.97,-43.39],[-22.98,-43.22]
3:[-22.14,-43.33],[-22.56,-43.66]
4:[-22.70,-43.77]
5:[-22.92,-43.24]
6:[-22.87,-43.28]
7:[-22.89,-43.67]
8:[-22.28,-43.44],[-22.13,-43.88]
I'm having a hard time choosing which data structure is best for the situation (the example above looks like a dictionary, but there would be several dictionaries) and working out how to add the latitude/longitude pairs without repeating the id. I appreciate any help.
We need to aggregate the second DataFrame, then reindex and assign it back:
df_rp['L$L'] = df_cdr.drop(columns='id').apply(tuple, axis=1).groupby(df_cdr.id).agg(list).reindex(df_rp.id).to_numpy()
df_rp
Out[59]:
id res L$L
0 1 a [(-22.98, -43.19), (-22.84, -43.11)]
1 2 b [(-22.97, -43.39), (-22.98, -43.22)]
2 3 c [(-22.14, -43.33), (-22.56, -43.66)]
3 4 d [(-22.7, -43.77)]
4 5 e [(-22.92, -43.24)]
5 6 f [(-22.87, -43.28)]
6 7 g [(-22.89, -43.67)]
7 8 h [(-22.28, -43.44), (-22.13, -43.88)]
df_cdr['lat_long'] = df_cdr.apply(lambda x: [x['LATITUDE'], x['LONGITUDE']], axis=1)
df_cdr = df_cdr.drop(columns=['LATITUDE', 'LONGITUDE'])
df_cdr = df_cdr.groupby('id').agg(lambda x: x.tolist())
Output
lat_long
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
9 [[-22.42, -43.55]]
Assuming df_rp.id is unique and sorted as in your sample, I came up with a solution using set_index and loc to drop the ids that appear in df_cdr but not in df_rp. Next, groupby with a lambda returns the arrays:
s = (df_cdr.set_index('id').loc[df_rp.id]
     .groupby(level=0)
     .apply(lambda x: x.to_numpy()))
Out[709]:
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
dtype: object
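If you really want the dictionary-like grouping shown in the question (each id mapping to its list of [latitude, longitude] pairs), any of the grouped results above can be converted with to_dict(). A minimal sketch, assuming the df_rp and df_cdr from the question (the name coords is made up):
coords = (df_cdr.set_index('id')[['LATITUDE', 'LONGITUDE']]
          .apply(list, axis=1)       # one [lat, lon] pair per row
          .groupby(level=0)
          .agg(list)                 # collect all pairs per id
          .reindex(df_rp['id'])      # keep only the ids present in df_rp
          .to_dict())
# coords[1] -> [[-22.98, -43.19], [-22.84, -43.11]]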

Converting exponential notation numbers to strings - explanation

I have a DataFrame from this question:
import io
import pandas as pd
temp = u"""Total,Price,test_num
0,71.7,2.04256e+14
1,39.5,2.04254e+14
2,82.2,2.04188e+14
3,42.9,2.04171e+14"""
df = pd.read_csv(io.StringIO(temp))
print (df)
Total Price test_num
0 0 71.7 2.042560e+14
1 1 39.5 2.042540e+14
2 2 82.2 2.041880e+14
3 3 42.9 2.041710e+14
If I convert the floats to strings, I get a trailing 0:
print (df['test_num'].astype('str'))
0 204256000000000.0
1 204254000000000.0
2 204188000000000.0
3 204171000000000.0
Name: test_num, dtype: object
The solution is to convert the floats to int64:
print (df['test_num'].astype('int64'))
0 204256000000000
1 204254000000000
2 204188000000000
3 204171000000000
Name: test_num, dtype: int64
print (df['test_num'].astype('int64').astype(str))
0 204256000000000
1 204254000000000
2 204188000000000
3 204171000000000
Name: test_num, dtype: object
The question is why it converts this way.
I added this poor explanation, but I feel it should be better:
Poor explanation:
You can check the dtype of the parsed column - it returns float64.
print (df['test_num'].dtype)
float64
After converting to string, the float is written out in full decimal form without the exponential notation, so a trailing .0 is added:
print (df['test_num'].astype('str'))
0 204256000000000.0
1 204254000000000.0
2 204188000000000.0
3 204171000000000.0
Name: test_num, dtype: object
When you use pd.read_csv to import data and do not define datatypes, pandas makes an educated guess and in this case decides that column values like "2.04256e+14" are best represented by a float value. This, converted back to a string, adds a ".0". As you correctly write, converting to int64 fixes this.
If you know that the column holds int64 values only (and no empty values, which np.int64 cannot handle), you can force this type on import to avoid the unneeded conversions (a sketch using a nullable dtype for the missing-value case follows the example below).
import io
import numpy as np
import pandas as pd
temp=u"""Total,Price,test_num
0,71.7,2.04256e+14
1,39.5,2.04254e+14
2,82.2,2.04188e+14
3,42.9,2.04171e+14"""
df = pd.read_csv(io.StringIO(temp), dtype={2: np.int64})
print(df)
returns
Total Price test_num
0 0 71.7 204256000000000
1 1 39.5 204254000000000
2 2 82.2 204188000000000
3 3 42.9 204171000000000
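Side note, not part of the original answer: if the column may contain missing values, which np.int64 cannot hold, pandas' nullable Int64 extension dtype is one way to keep integers without the trailing .0. A minimal sketch with a made-up missing value:
import pandas as pd
s = pd.Series([2.04256e+14, None, 2.04188e+14])
s.astype("Int64")
# 0    204256000000000
# 1               <NA>
# 2    204188000000000
# dtype: Int64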

Selection of a Series from pandas dataframe by interpolating column labels

I have a pandas DataFrame that contains, for multiple positions (defined by a coordinate x), a value at different timesteps. I want to create a pandas.Series object that contains the value at a given position x for all timesteps (i.e. for all index values of the DataFrame). If x is not one of the column labels, I want to interpolate between the two nearest x values.
An excerpt from the dataframe object (min(x)=0 and max(x)=0.28):
0.000000 0.007962 0.018313 0.031770 0.049263 0.072004
time (s)
15760800 0.500481 0.500481 0.500481 0.500481 0.500481 0.500481
15761400 1.396126 0.487198 0.498765 0.501326 0.500234 0.500544
15762000 1.455313 0.542441 0.489421 0.502851 0.499945 0.500597
15762600 1.492908 0.592022 0.487835 0.502233 0.500139 0.500527
15763200 1.521089 0.636743 0.490874 0.500704 0.500485 0.500423
15763800 1.542632 0.675589 0.496401 0.499065 0.500788 0.500335
I can find ways to slice the dataframe by available column labels. But is there an elegant way to do the interpolation?
In the end I want a function that looks something like this: result = sliceDataframe(dataframe=dfin, x=0.01), with result a pandas.Series object, so I can call it in one line (or maybe two) from another postprocessing script.
I think you would be best off writing a simple function yourself. Something like:
import numpy as np
def sliceDataframe(df, x):
    # supposing the column labels are sorted:
    pos = np.searchsorted(df.columns.values, x)
    # select the two neighbouring column labels:
    left = df.columns[pos - 1]
    right = df.columns[pos]
    # simple linear interpolation between the two columns:
    interpolated = df[left] + (df[right] - df[left]) / (right - left) * (x - left)
    interpolated.name = x
    return interpolated
Another option is to use the interpolate method, but for that you have to add a NaN column with the label you want.
With the function from above:
In [105]: df = pd.DataFrame(np.random.randn(8,4))
In [106]: df.columns = df.columns.astype(float)
In [107]: df
Out[107]:
0 1 2 3
0 -0.336453 1.219877 -0.912452 -1.047431
1 0.842774 -0.361236 -0.245771 0.014917
2 -0.974621 1.050503 0.367389 0.789570
3 1.091484 1.352065 1.215290 0.393900
4 -0.100972 -0.250026 -1.135837 -0.339204
5 0.503436 -0.764224 -1.099864 0.962370
6 -0.599090 0.908235 -0.581446 0.662604
7 -2.234131 0.512995 -0.591829 -0.046959
In [108]: sliceDataframe(df, 0.5)
Out[108]:
0 0.441712
1 0.240769
2 0.037941
3 1.221775
4 -0.175499
5 -0.130394
6 0.154572
7 -0.860568
Name: 0.5, dtype: float64
With the interpolate method:
In [109]: df[0.5] = np.nan
In [110]: df.sort_index(axis=1).interpolate(axis=1)
Out[110]:
0.0 0.5 1.0 2.0 3.0
0 -0.336453 0.441712 1.219877 -0.912452 -1.047431
1 0.842774 0.240769 -0.361236 -0.245771 0.014917
2 -0.974621 0.037941 1.050503 0.367389 0.789570
3 1.091484 1.221775 1.352065 1.215290 0.393900
4 -0.100972 -0.175499 -0.250026 -1.135837 -0.339204
5 0.503436 -0.130394 -0.764224 -1.099864 0.962370
6 -0.599090 0.154572 0.908235 -0.581446 0.662604
7 -2.234131 -0.860568 0.512995 -0.591829 -0.046959
In [111]: df.sort_index(axis=1).interpolate(axis=1)[0.5]
Out[111]:
0 0.441712
1 0.240769
2 0.037941
3 1.221775
4 -0.175499
5 -0.130394
6 0.154572
7 -0.860568
Name: 0.5, dtype: float64
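One caveat: the default method='linear' of interpolate treats the columns as equally spaced and ignores the actual label values; for the midpoint 0.5 above this makes no difference, but for unevenly spaced labels such as 0.000000, 0.007962, ... in the question, method='index' uses the numeric column labels themselves. A sketch wrapping this into the sliceDataframe signature the question asked for (not from the original answer):
import numpy as np
def sliceDataframe(dataframe, x):
    # add the requested label as an empty column, sort the column labels,
    # interpolate along the rows using the label values, return the new column
    df = dataframe.copy()
    df[x] = np.nan
    return df.sort_index(axis=1).interpolate(axis=1, method='index')[x]
# result = sliceDataframe(dataframe=dfin, x=0.01)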

Finding the percent change of values in a Series

I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the percent change of the 'questions' column and then to sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
questions seconds change
0 9005591 751479 NaN
1 9207129 751539 0.022379
2 9208994 751599 0.000203
6 9214916 751839 0.000177
4 9211944 751719 0.000164
3 9210429 751659 0.000156
5 9213287 751779 0.000146
7 9215924 751899 0.000109
9 9217533 752019 0.000093
8 9216676 751959 0.000082
pct_change can be performed on Series as well as DataFrames and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first...
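As a small illustration of the periods argument (made-up numbers, not the question's data):
>>> s = pd.Series([100, 110, 121])
>>> s.pct_change()             # default periods=1 -> NaN, ~0.1, ~0.1
>>> s.pct_change(periods=2)    # change measured two rows back -> NaN, NaN, ~0.21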
