I need to cut a column of floats down to 2 decimal places, but without rounding the data to the nearest value (i.e. truncate, not round).
My data:
df = pd.DataFrame({'numbers': [1.233,1.238,5.059,5.068, 8.556]})
df.head()
numbers
0 1.233
1 1.238
2 5.059
3 5.068
4 8.556
Expected output:
numbers
0 1.23
1 1.23
2 5.05
3 5.06
4 8.55
The problem
Everything I've tried rounds the numbers to the nearest value: a third decimal of 0-4 is dropped, while 5-9 adds 1 to the second decimal place.
Examples of what didn't work
df[['numbers']].round(2)
#or df['numbers'].apply(lambda x: "%.2f" % x)
#output
numbers
0 1.23
1 1.24
2 5.06
3 5.07
4 8.56
This is more like a round down:
df.numbers*100//1/100
Out[186]:
0 1.23
1 1.23
2 5.05
3 5.06
4 8.55
Name: numbers, dtype: float64
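One caveat worth noting (my addition, not part of the original answer): floor division rounds toward negative infinity, so for negative values it gives a different result than true truncation. A minimal sketch of the difference, using np.trunc as the toward-zero alternative:
import numpy as np
import pandas as pd
s = pd.Series([1.238, -1.238])
print(s * 100 // 1 / 100)       # 1.23, -1.24  (floored toward -inf)
print(np.trunc(s * 100) / 100)  # 1.23, -1.23  (truncated toward zero)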
Try this; it works well too:
import pandas as pd
do = lambda x: float(str(x).split('.')[0] +'.' + str(x).split('.')[1][0:2])
df = pd.DataFrame({'numbers': list(map(do, [1.233,1.238,5.059,5.068, 8.556]))})
print(df.head())
output
numbers
0 1.23
1 1.23
2 5.05
3 5.06
4 8.55
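If the column already lives in a DataFrame rather than a raw list, a minimal sketch of applying the same string-based truncation in place (assuming the do lambda above; note that str(x) can produce scientific notation for very large or very small floats, which would break the split):
import pandas as pd
do = lambda x: float(str(x).split('.')[0] + '.' + str(x).split('.')[1][:2])
df = pd.DataFrame({'numbers': [1.233, 1.238, 5.059, 5.068, 8.556]})
df['numbers'] = df['numbers'].map(do)  # Series.map applies the lambda element-wise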
Related
I have a Pandas Dataframe with a float column. The values in that column have many decimal points but I only need 2 decimal points. I don't want to round, but truncate the value after the second digit.
This is what I have so far; however, with this operation I always get NaNs:
t = pd.DataFrame()
t['latitude'] = [18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
sub = "."
t['latitude'].astype(str).str.slice(start=t['latitude'].astype(str).str.find(sub), stop=t['latitude'].astype(str).str.find(sub)+2)
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
Name: latitude, dtype: float64
The simplest way to truncate:
t = pd.DataFrame()
t['latitude']=[18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
t['latitude'] = (t['latitude'] * 100).astype(int) / 100
print(t)
>>
latitude
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
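A caveat on this approach (my addition, not part of the original answer): floats are binary approximations, so multiplying by 100 can land just below the intended integer, and astype(int) then truncates one unit too far; astype(int) also raises on NaN. A small sketch of the failure mode:
import pandas as pd
s = pd.Series([0.29, 1.13])
print((s * 100).astype(int) / 100)
# 0    0.28   <- 0.29 * 100 == 28.999999999999996
# 1    1.12   <- 1.13 * 100 == 112.99999999999999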
Use np.round:
import numpy as np
import pandas as pd
s = pd.Series([18.3988, 18.4439, 18.3467, 37.5079, 38.1102, 38.2927])
s_rounded = np.round(s, 2)
Output
0 18.40
1 18.44
2 18.35
3 37.51
4 38.11
5 38.29
dtype: float64
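Note (my addition): np.round rounds exact halves to the nearest even value and is also subject to binary float representation, so ties do not always round up:
import numpy as np
print(np.round(0.5))       # 0.0  (half rounds to even)
print(np.round(1.5))       # 2.0
print(np.round(2.675, 2))  # 2.67, because 2.675 is stored as 2.67499999...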
If you don't want to round, but just truncate -
s.astype(str).str.split('.').apply(lambda x: str(x[0]) + '.' + str(x[1])[:2])
Output
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
dtype: object
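Since the result above is dtype object, a small follow-up sketch (my addition) casts it back to float when numeric operations are needed afterwards:
s_trunc = (s.astype(str).str.split('.')
            .apply(lambda x: x[0] + '.' + x[1][:2])  # split parts are already strings
            .astype(float))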
Use numpy.trunc for a vectorized operation:
n = 2 # number of decimals to keep
np.trunc(df['latitude'].mul(10**n)).div(10**n)
# to assign
# df['latitude'] = np.trunc(df['latitude'].mul(10**n)).div(10**n)
output:
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
Name: latitude, dtype: float64
x = 12.3614
y = round(x, 2)
print(y)  # 12.36
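One caveat (my addition): the built-in round uses banker's rounding, so exact halves go to the nearest even digit rather than always up:
print(round(2.5))       # 2  (ties go to the even neighbor)
print(round(3.5))       # 4
print(round(0.125, 2))  # 0.12, since 0.125 is exactly representable and the tie goes to even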
Easiest is Series.round, but you can also try .str.extract:
t['latitude'] = (t['latitude'].astype(str)
                 .str.extract(r'(.*\.\d{0,2})')
                 .astype(float))
print(t)
latitude
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
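A caveat (my addition, using a hypothetical mixed input): the pattern requires a literal decimal point, so any value whose string form lacks one comes back as NaN:
import pandas as pd
t2 = pd.DataFrame({'latitude': ['18.398', '37']})  # hypothetical: '37' has no decimal point
print(t2['latitude'].str.extract(r'(.*\.\d{0,2})').astype(float))
#        0
# 0  18.39
# 1    NaN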
import re
t = [18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
truncated_lat = []
for lat in t:
    truncated_lat.append(float(re.findall(r'[0-9]+\.[0-9]{2}', str(lat))[0]))
print(truncated_lat)
Output:
[18.39, 18.44, 18.34, 37.5, 38.11, 38.29]
Try math.trunc, scaling by 100 first so two decimals survive the truncation:
import math
t['latitude'] = [math.trunc(i * 100) / 100 for i in t['latitude']]
Related
I have the following dataframe:
df = pd.DataFrame({'A': ['2.5cm','2.5cm','2.56”','1.38”','2.2”','0.8 in','$18.00','4','2"']})
which looks like:
A
2.5cm
2.5cm
2.56”
1.38”
2.2”
0.8 in
$18.00
4
2"
I want to remove all characters except for the decimal points.
The output should be:
A
2.5
2.5
2.56
1.38
2.2
0.8
18.00
4
2
Here is what I've tried:
df['A'] = df.A.str.replace(r"[a-zA-Z]", '')
df['A'] = df.A.str.replace('\W', '')
but this is stripping out everything including the decimal point.
Any suggestions would be greatly appreciated.
Thank you in advance
You can use str.extract to pull out only the floating-point numbers:
df['A'] = df['A'].astype(str).str.extract(r'(\d+.\d+|\d)').astype('float')
However, the unescaped '.' here matches any character, not just the period, so something like 18,00 would match as well, and the pattern also fails to extract multi-digit whole numbers. Use the code below instead (thanks @DYZ):
df['A'] = df['A'].astype(str).str.extract(r'(\d+\.\d+|\d+)').astype('float')
Output:
A
0 2.50
1 2.50
2 2.56
3 1.38
4 2.20
5 0.80
6 18.00
7 4.00
8 2.00
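Alternatively, closer in spirit to the original str.replace attempt, a sketch (my addition) that strips everything except digits and periods via a negated character class; note it would also keep any stray extra periods inside a string:
df['A'] = df['A'].str.replace(r'[^\d.]', '', regex=True)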
Try with str.extract:
df['new'] = df.A.str.extract(r'(\d*\.\d+|\d+)').astype(float).iloc[:, 0]
Out[31]:
       0
0   2.50
1   2.50
2   2.56
3   1.38
4   2.20
5   0.80
6  18.00
7   4.00
8   2.00
I have a df like the one below, and I want to create a dayshigh column showing, for each row, how many consecutive previous rows have a lower high.
date high
05-06-20 1.85
08-06-20 1.88
09-06-20 2
10-06-20 2.11
11-06-20 2.21
12-06-20 2.17
15-06-20 1.99
16-06-20 2.15
17-06-20 16
18-06-20 9
19-06-20 14.67
should be like:
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16 8
18-06-20 9 0
19-06-20 14.67 1
I am using the code below, but it somehow shows an error:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
for j in range(df["DaysHigh"][i].index, len(df)):
if df["high"][i] > df["high"][i-1]:
df["DaysHigh"][i] = df["DaysHigh"][i-1] + 1
else:
df["DaysHigh"][i] = 0
Where am I going wrong? Thank you.
Is the dayshigh number for 17-06-20 supposed to be 2 instead of 8? If so, you can basically use the code you had already written here. There are three changes I'm making below:
- starting i from 1 instead of 0, to avoid trying to access the -1th element
- removing the loop over j (it doesn't seem to be necessary)
- using loc to set the values instead of df["high"][i] -- you'll see this should resolve the warnings about copies and slices
Keeping the first line the same as before:
for i in range(1, len(df)):
    if df["high"][i] > df["high"][i-1]:
        df.loc[i, "DaysHigh"] = df["DaysHigh"][i-1] + 1
    else:
        df.loc[i, "DaysHigh"] = 0
Procedure:
1. Use pandas.shift() to build a helper column marking rows whose high is at least the previous row's high.
2. Take the cumulative sum of that column, restarting the count after every NaN.
3. Drop the helper column once it is no longer needed.
df['tmp'] = np.where(df['high'] >= df['high'].shift(), 1, np.nan)
df['dayshigh'] = df['tmp'].groupby(df['tmp'].isna().cumsum()).cumsum()
df.drop('tmp', axis=1, inplace=True)
df
date high dayshigh
0 05-06-20 1.85 NaN
1 08-06-20 1.88 1.0
2 09-06-20 2.00 2.0
3 10-06-20 2.11 3.0
4 11-06-20 2.21 4.0
5 12-06-20 2.17 NaN
6 15-06-20 1.99 NaN
7 16-06-20 2.15 1.0
8 17-06-20 16.00 2.0
9 18-06-20 9.00 NaN
10 19-06-20 14.67 1.0
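If the NaN gaps should read as 0, to match the integer counts in the expected output, a small follow-up (my addition) fills and casts the result:
df['dayshigh'] = df['dayshigh'].fillna(0).astype(int)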
Well, I think I figured it out; here is my solution:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
#for i in range(len(df)-1000, len(df)):
for j in reversed(range(i)):
if df["high"][i] > df["high"][j]:
df["DaysHigh"][i] = df["DaysHigh"][i] + 1
else:
break
print(df)
date high DaysHigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2.00 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16.00 8
18-06-20 9.00 0
19-06-20 14.67 1
I'm trying to parse through all of the cells in a CSV file that represent heights and round whatever is after the decimal point to match a number in a list (to round down to the nearest inch). After a few days of banging my head against the wall, this is the code I've been able to get working:
import math
import pandas as pd
inch = [.0, .08, .16, .25, .33, .41, .50, .58, .66, .75, .83, .91, 1]
df = pd.read_csv("sample_csv.csv")
def to_number(s):
    for index, row in df.iterrows():
        try:
            num = float(s)
            num = math.modf(num)
            num = list(num)
            for i, j in enumerate(inch):
                if num[0] < j:
                    num[0] = inch[i-1]
                    break
                elif num[0] == j:
                    num[0] = inch[i]
                    break
            newnum = num[0] + num[1]
            return newnum
        except ValueError:
            return s
df = df.apply(lambda f: to_number(f[0]), axis=1).fillna('')
with open('new.csv', 'a') as f:
    df.to_csv(f, index=False)
Ideally I'd like to have it parse over an entire CSV with n headers, ignoring all strings and rounding the floats to match the list. Is there a simple(r) way to achieve this with Pandas? And would it be possible (or a good idea?) to have it edit the existing Excel workbook instead of creating a new CSV I'd have to copy/paste over?
Any help or suggestions would be greatly appreciated as I'm very new to Pandas and it's pretty god damn intimidating!
Helping would be a lot easier if you included a sample mock of the data you're trying to parse. To clarify the points you don't specify, as I understand it:
- By "an entire CSV with n headers, ignoring all strings and round the floats to match the list" you mean some n-column dataframe with k numeric columns, each of which describes someone's height.
- The entries in the numeric columns are measured in units of feet.
- You want to ignore the non-numeric columns and transform the data as 6.14 -> 6 feet, 1 inch (I'm implicitly assuming that by "round down" you want an integer floor; i.e. 6.14 feet is 6 feet, 0.14*12 = 1.68 inches; it's up to you whether this is floored or rounded to the nearest integer).
Now, for a set of random heights in feet sampled uniformly between 5.1 and 6.9 feet, we could do the following:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))
In [4]: df
Out[4]:
0 1 2
0 6.020613 6.315707 5.413499
1 5.942232 6.834540 6.761765
2 5.715405 6.162719 6.363224
3 6.416955 6.511843 5.512515
4 6.472462 5.789654 5.270047
5 6.370964 5.509568 6.113121
6 6.353790 6.466489 5.460961
7 6.526039 5.999284 6.617608
8 6.897215 6.016648 5.681619
9 6.886359 5.988068 5.575993
In [5]: np.fix(df) + np.floor(12*(df - np.fix(df)))/12
Out[5]:
0 1 2
0 6.000000 6.250000 5.333333
1 5.916667 6.833333 6.750000
2 5.666667 6.083333 6.333333
3 6.416667 6.500000 5.500000
4 6.416667 5.750000 5.250000
5 6.333333 5.500000 6.083333
6 6.333333 6.416667 5.416667
7 6.500000 5.916667 6.583333
8 6.833333 6.000000 5.666667
9 6.833333 5.916667 5.500000
We're using np.fix to extract the integral part of the height value. Likewise, df - np.fix(df) represents the fractional remainder in feet, or in inches when multiplied by 12. np.floor just truncates this to the nearest inch below, and the final division by 12 converts the units back from inches to feet.
You can change np.floor to np.round to get an answer rounded to the nearest inch rather than truncated to the previous whole inch. Finally, you can specify the precision of the output to insist that the decimal portion is selected from your list.
In [6]: (np.fix(df) + np.round(12*(df - np.fix(df)))/12).round(2)
Out[6]:
0 1 2
0 6.58 5.25 6.33
1 5.17 6.42 5.67
2 6.42 5.83 6.33
3 5.92 5.67 6.33
4 6.83 5.25 6.58
5 5.83 5.50 6.92
6 6.83 6.58 6.25
7 5.83 5.33 6.50
8 5.25 6.00 6.83
9 6.42 5.33 5.08
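As a side note (my addition): np.fix and np.floor only differ for negative values, which is why np.fix is the right tool for extracting the integral part here:
import numpy as np
print(np.fix(-6.5), np.floor(-6.5))  # -6.0 -7.0  (toward zero vs. toward -inf)
print(np.fix(6.5), np.floor(6.5))    # 6.0 6.0    (identical for positive values)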
Adding onto the other answer to address your problem with strings:
# Break the dataframe with a string
df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))
df.iloc[0, 0] = 'str'
# Find out which entries can be cast to numerics and put NaNs everywhere else
df_safe = df.apply(pd.to_numeric, axis=0, errors="coerce")
df_safe = (np.fix(df_safe) + np.round(12*(df_safe - np.fix(df_safe)))/12).round(2)
# Replace all the NaNs with the original data
df_safe[df_safe.isnull()] = df[df_safe.isnull()]
df_safe should be what you want. Despite the name, this isn't particularly safe and there are probably edge conditions that will be a problem.
print(dataframe)
Total Price test_num
0 71.7 2.04256e+14
1 39.5 2.04254e+14
2 82.2 2.04188e+14
3 42.9 2.04171e+14
I get an error when uploading to MongoDB after converting it to str.
print(data_frame.astype(str))
Total Price test_num
0 71.7 204255705072224.0
1 39.5 204253951078915.0
2 82.2 204188075120577.0
3 42.9 204171098699772.0
When converting the number to str, .0 is added at the end.
How can I effectively eliminate the .0?
Thank you
Use astype with int64:
df['test_num'] = df['test_num'].astype('int64')
#alternative
#df['test_num'] = df['test_num'].astype(np.int64)
print (df)
Total Price test_num
0 0 71.7 204256000000000
1 1 39.5 204254000000000
2 2 82.2 204188000000000
3 3 42.9 204171000000000
Explanation:
You can check the dtype of the column - it is float64:
print (df['test_num'].dtype)
float64
After converting to string, the exponential notation is removed and the full float is rendered, so a trailing .0 is added:
print (df['test_num'].astype('str'))
0 204256000000000.0
1 204254000000000.0
2 204188000000000.0
3 204171000000000.0
Name: test_num, dtype: object
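One caveat (my addition): astype('int64') raises on missing values. If the column can contain NaN, pandas' nullable integer dtype (capital I) is a possible alternative:
import pandas as pd
s = pd.Series([2.04256e14, None])
print(s.astype('Int64'))
# 0    204256000000000
# 1               <NA>
# dtype: Int64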