Rounding up a column - python

I am new to pandas in Python and I am having difficulty trying to round up all the values in a column. For example:
Example
88.9
88.1
90.2
45.1
I tried the code below, but it gave me:
df.Example = df.Example.round()
AttributeError: 'str' object has no attribute 'rint'

You can use numpy.ceil:
In [80]: import numpy as np
In [81]: np.ceil(df.Example)
Out[81]:
0 89.0
1 89.0
2 91.0
3 46.0
Name: Example, dtype: float64
depending on what you like, you could also change the type:
In [82]: np.ceil(df.Example).astype(int)
Out[82]:
0 89
1 89
2 91
3 46
Name: Example, dtype: int64
Edit
Your error message indicates you're trying just to round (not necessarily up), but are having a type problem. You can solve it like so:
In [84]: df.Example.astype(float).round()
Out[84]:
0 89.0
1 88.0
2 90.0
3 45.0
Name: Example, dtype: float64
Here, too, you can cast at the end to an integer type:
In [85]: df.Example.astype(float).round().astype(int)
Out[85]:
0 89
1 88
2 90
3 45
Name: Example, dtype: int64
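Putting both fixes together: if the column came in as strings (the source of the AttributeError), cast it to float first and then take the ceiling. A minimal sketch, using the question's sample values:

```python
import numpy as np
import pandas as pd

# Column of strings, as implied by the question's AttributeError
df = pd.DataFrame({'Example': ['88.9', '88.1', '90.2', '45.1']})

# Cast to float first, then round up with np.ceil, then cast to int
result = np.ceil(df['Example'].astype(float)).astype(int)
print(result.tolist())  # [89, 89, 91, 46]
```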

I don't have the privilege to comment. Mine is not a new answer; it is a comparison of two answers, only one of which worked, as below.
First I tried this https://datatofish.com/round-values-pandas-dataframe/
df['DataFrame column'].apply(np.ceil)
It did not work for me.
Then I tried the above answer
np.ceil(df.Example).astype(int)
It worked.
I hope this will help someone.
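For what it's worth, both forms behave the same on a genuinely numeric column; when one of them fails, it is usually because the column is still object/string dtype. A quick check (sample data assumed from the question above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Example': [88.9, 88.1, 90.2, 45.1]})

# On a numeric column, apply(np.ceil) and np.ceil(...) give the same result
a = df['Example'].apply(np.ceil).astype(int)
b = np.ceil(df['Example']).astype(int)
print(a.tolist())  # [89, 89, 91, 46]
print(a.equals(b))  # True
```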

Related

Type casting a part of a Pandas dataframe (multiple columns) and assigning back does not preserve the dtype

I'm relatively new to Pandas so this may be trivial. As this should be a common problem I already searched for similar questions but couldn't find anything (there are some resembling this, but they pertain to columns with mixed dtypes). Sorry if this is a duplicate; kind TIA for pointers.
Problem: a part of a dataframe (a subset of columns and/or rows) needs to be converted from "object" to, e.g., a numerical type (say, float). While casting those rows/columns using astype() works, assigning the result back to the original (sub)dataframe using iloc indexing reverts to the original dtype.
Concrete example (simplified):
>>> import pandas as pd
>>> import numpy as np
# Numbers are deliberately added as strings -- to be converted later
>>> df = pd.DataFrame({'name': ['Olivia','Dean','Alex','Jon','Tom','Jane','Kate'],
... 'age': ['32','23','45','35','20','28','55'],
... 'height':['1.65', '1.75','1.85','1.91','1.75','1.7','1.65']})
>>> df
name age height
0 Olivia 32 1.65
1 Dean 23 1.75
2 Alex 45 1.85
3 Jon 35 1.91
4 Tom 20 1.75
5 Jane 28 1.7
6 Kate 55 1.65
>>> df.dtypes
name object
age object
height object
dtype: object
>>> # Convert second and third column to `float`, assign it back to the original dataframe
>>> df.iloc[:,1:] = df.iloc[:,1:].astype(float)
>>> df
name age height
0 Olivia 32.0 1.65
1 Dean 23.0 1.75
2 Alex 45.0 1.85
3 Jon 35.0 1.91
4 Tom 20.0 1.75
5 Jane 28.0 1.7
6 Kate 55.0 1.65
>>> df.dtypes
name object
age object
height object
dtype: object
# However, the conversion by itself has the expected result
>>> df_sub = df.iloc[:,1:].astype(float)
>>> df_sub
age height
0 32.0 1.65
1 23.0 1.75
2 45.0 1.85
3 35.0 1.91
4 20.0 1.75
5 28.0 1.70
6 55.0 1.65
>>> df_sub.dtypes
age float64
height float64
# Strangely, if instead of `iloc` I use an index for a subset of column, I get the expected result
>>> cols_idx = df.columns.drop('name')
>>> df[cols_idx] = df[cols_idx].astype(float)
>>> df.dtypes
name object
age float64
height float64
dtype: object
Can someone please explain the difference / why using iloc does NOT give me the result I want ? Also, what is the recommended way to do this for larger dataframes (several hundred rows / columns) ?
Kind TIA,
/Florian
To my understanding, by performing the iloc[:,1:] reassignment you are essentially performing:
df.iloc.__setitem__((slice(None), slice(1, None)), value)
In this case you are setting the new values at the corresponding index locations of the dataframe you are overwriting, but not modifying the pre-existing properties of the dataframe or of the columns you are affecting. Nowhere are you actually overwriting the dtypes of the existing columns of the dataframe you are working on.
In the other two examples:
df_sub = df.iloc[:,1:].astype(float)
You are creating a new variable df_sub equal to the output of df.iloc[:,1:].astype(float), which of course has dtype float.
In this other example:
df[cols_idx] = df[cols_idx].astype(float)
You are indeed reassigning the columns together with their new dtypes, not only the values (the column properties also get reassigned).
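The contrast can be demonstrated on a small version of the question's dataframe: iloc-based assignment writes values into the existing object columns, while label-based assignment swaps the columns out wholesale.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Olivia', 'Dean'],
                   'age': ['32', '23'],
                   'height': ['1.65', '1.75']})

# iloc assignment sets values into the existing object columns,
# so the column dtypes do not change
df.iloc[:, 1:] = df.iloc[:, 1:].astype(float)
dtypes_after_iloc = df.dtypes.astype(str).tolist()
print(dtypes_after_iloc)    # ['object', 'object', 'object']

# label-based assignment replaces the columns, so the float dtypes stick
cols = df.columns.drop('name')
df[cols] = df[cols].astype(float)
dtypes_after_labels = df.dtypes.astype(str).tolist()
print(dtypes_after_labels)  # ['object', 'float64', 'float64']
```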

Converting exponential notation numbers to strings - explanation

I have DataFrame from this question:
temp=u"""Total,Price,test_num
0,71.7,2.04256e+14
1,39.5,2.04254e+14
2,82.2,2.04188e+14
3,42.9,2.04171e+14"""
df = pd.read_csv(pd.compat.StringIO(temp))  # use io.StringIO in newer pandas
print (df)
Total Price test_num
0 0 71.7 2.042560e+14
1 1 39.5 2.042540e+14
2 2 82.2 2.041880e+14
3 3 42.9 2.041710e+14
If you convert the floats to strings, you get a trailing .0:
print (df['test_num'].astype('str'))
0 204256000000000.0
1 204254000000000.0
2 204188000000000.0
3 204171000000000.0
Name: test_num, dtype: object
The solution is to convert the floats to int64 first:
print (df['test_num'].astype('int64'))
0 204256000000000
1 204254000000000
2 204188000000000
3 204171000000000
Name: test_num, dtype: int64
print (df['test_num'].astype('int64').astype(str))
0 204256000000000
1 204254000000000
2 204188000000000
3 204171000000000
Name: test_num, dtype: object
The question is: why does it convert this way?
I added the explanation below, but I feel it should be better:
Poor explanation:
You can check dtype of converted column - it return float64.
print (df['test_num'].dtype)
float64
After converting to string, the exponential notation is removed but the float representation is kept, so a trailing .0 is added:
print (df['test_num'].astype('str'))
0 204256000000000.0
1 204254000000000.0
2 204188000000000.0
3 204171000000000.0
Name: test_num, dtype: object
When you use pd.read_csv to import data and do not define datatypes,
pandas makes an educated guess and in this case decides, that column
values like "2.04256e+14" are best represented by a float value.
This, converted back to string, adds a ".0". As you correctly write,
converting to int64 fixes this.
If you know that the column has int64 values only before input (and
no empty values, which np.int64 cannot handle), you can force this type on import to avoid the unneeded conversions.
import io
import numpy as np
temp=u"""Total,Price,test_num
0,71.7,2.04256e+14
1,39.5,2.04254e+14
2,82.2,2.04188e+14
3,42.9,2.04171e+14"""
df = pd.read_csv(io.StringIO(temp), dtype={2: np.int64})
print(df)
returns
Total Price test_num
0 0 71.7 204256000000000
1 1 39.5 204254000000000
2 2 82.2 204188000000000
3 3 42.9 204171000000000
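A runnable variant of the same idea under current pandas, where pd.compat.StringIO no longer exists and io.StringIO is the standard replacement. Scientific-notation text may not parse directly as int64 in newer versions, so this sketch goes through float explicitly with a converter:

```python
import io
import pandas as pd

temp = """Total,Price,test_num
0,71.7,2.04256e+14
1,39.5,2.04254e+14
2,82.2,2.04188e+14
3,42.9,2.04171e+14"""

# A converter parses the scientific-notation text via float and
# returns ints, so no trailing .0 ever appears
df = pd.read_csv(io.StringIO(temp),
                 converters={'test_num': lambda s: int(float(s))})
print(df['test_num'].tolist())
```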

Pandas convert data type from object to float

I read some weather data from a .csv file as a dataframe named "weather". The problem is that the data type of one of the columns is object. This is weird, as it indicates temperature. How do I change it to having a float data type? I tried to_numeric, but it can't parse it.
weather.info()
weather.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 304 entries, 2017-01-01 to 2017-10-31
Data columns (total 2 columns):
Temp 304 non-null object
Rain 304 non-null float64
dtypes: float64(1), object(1)
memory usage: 17.1+ KB
Temp Rain
Date
2017-01-01 12.4 0.0
2017-02-01 11 0.6
2017-03-01 10.4 0.6
2017-04-01 10.9 0.2
2017-05-01 13.2 0.0
You can use pandas.Series.astype
You can do something like this :
weather["Temp"] = weather.Temp.astype(float)
You can also use pd.to_numeric that will convert the column from object to float
For details on how to use it checkout this link :http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_numeric.html
Example :
s = pd.Series(['apple', '1.0', '2', -3])
print(pd.to_numeric(s, errors='ignore'))
print("=========================")
print(pd.to_numeric(s, errors='coerce'))
Output:
0 apple
1 1.0
2 2
3 -3
dtype: object
=========================
0 NaN
1 1.0
2 2.0
3 -3.0
dtype: float64
In your case you can do something like this:
weather["Temp"] = pd.to_numeric(weather.Temp, errors='coerce')
Another option is to use convert_objects (deprecated in newer pandas; see the warning further down).
Example is as follows
>> pd.Series([1,2,3,4,'.']).convert_objects(convert_numeric=True)
0 1
1 2
2 3
3 4
4 NaN
dtype: float64
You can use this as follows:
weather["Temp"] = weather.Temp.convert_objects(convert_numeric=True)
I have shown you examples because if any of your columns contain values that are not numbers, they will be converted to NaN... so be careful while using it.
I tried all the methods suggested here but sadly none worked. Instead, I found this to work:
df['column'] = pd.to_numeric(df['column'],errors = 'coerce')
And then check it using:
print(df.info())
I eventually used:
weather["Temp"] = weather["Temp"].convert_objects(convert_numeric=True)
It worked just fine, except that I got the following message.
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: FutureWarning:
convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
You can try the following:
df['column'] = df['column'].map(lambda x: float(x))
First check your data, because you may get an error if you have ',' instead of '.'
if so, you need to transform every ',' into '.' with a function :
def replacee(s):
    i = str(s).find(',')
    if i > 0:
        return s[:i] + '.' + s[i+1:]
    else:
        return s
then you need to apply this function on every row in your column :
dfOPA['Montant']=dfOPA['Montant'].apply(replacee)
then the convert function will work fine :
dfOPA['Montant'] = pd.to_numeric(dfOPA['Montant'],errors = 'coerce')
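The per-row function above can be replaced by a single vectorized str.replace; a sketch with a hypothetical Montant column (the 'bad' value shows what errors='coerce' does):

```python
import pandas as pd

# Hypothetical column using ',' as the decimal separator
dfOPA = pd.DataFrame({'Montant': ['12,5', '3,75', 'bad']})

# Vectorized: swap the decimal comma for a dot, then coerce to numeric
dfOPA['Montant'] = pd.to_numeric(
    dfOPA['Montant'].str.replace(',', '.', regex=False), errors='coerce')
print(dfOPA['Montant'].tolist())  # [12.5, 3.75, nan]
```

If the data comes from a CSV file in the first place, `pd.read_csv(..., decimal=',')` avoids the cleanup entirely.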
E.g., for converting a $40,000.00 object to a 40000 int or float32.
Follow this step by step:
$40,000.00 ---(1. remove $)---> 40,000.00 ---(2. remove the comma)---> 40000.00 ---(3. remove the dot)---> 4000000 ---(4. remove empty space)---> 4000000 ---(5. drop NA values)---> 4000000 ---(6. now this is object type, so convert to int using .astype(int))---> 4000000 ---(7. divide by 100)---> 40000
Implementing code In Pandas
table1["Price"] = table1["Price"].str.replace('$', '', regex=False)
table1["Price"] = table1["Price"].str.replace(',', '', regex=False)
table1["Price"] = table1["Price"].str.replace('.', '', regex=False)
table1["Price"] = table1["Price"].str.replace(' ', '', regex=False)
table1 = table1.dropna()
table1["Price"] = table1["Price"].astype(int)
table1["Price"] = table1["Price"] / 100
Finally it's done
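The same multi-step cleanup can be collapsed into a single regex replacement; a sketch with a hypothetical table1:

```python
import pandas as pd

# Hypothetical price column; one regex strips everything except digits
# and the decimal point, then a float cast finishes the job
table1 = pd.DataFrame({'Price': ['$40,000.00', '$1,250.50']})
table1['Price'] = (table1['Price']
                   .str.replace(r'[^0-9.]', '', regex=True)
                   .astype(float))
print(table1['Price'].tolist())  # [40000.0, 1250.5]
```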

Replacing a pandas DataFrame value with np.nan when the values ends with '_h'

Below is the code that I have been working with to replace some values with np.NaN. My issue is how to replace '47614750001_h' at index 111 with np.NaN. I can do this directly with drop_list; however, I need to iterate this with different values ending in '_h' over many files and would like to do it automatically.
I have tried some searches on regex as it seems the way to go, but could not find what i needed.
drop_list = ['dash_code', 'SONIC WELD']
df_clean.replace(drop_list, np.NaN).tail(10)
DASH_CODE Name Quantity
107 1011567 .156 MALE BULLET TERM INSUL 1.0
108 102066901 .032 X .187 FEMALE Q.D. TERM. 1.0
109 105137901 TERM,RING,10-12AWG,INSULATED 1.0
110 101919701 1/4 RING TERM INSUL 2.0
111 47614750001_h HARNESS, MAIN, AC, LIO 1.0
112 NaN NaN 19.0
113 7685 5/16 RING TERM INSUL. 1.0
114 102521601 CLIP,HARNESS 2.0
115 47614808001 CAP, RESISTOR, TERMINATION 1.0
116 103749801 RECPT, DEUTSCH, DTM04-4P 1.0
You can use pd.Series.apply for this with a lambda:
df['DASH_CODE'] = df['DASH_CODE'].apply(lambda x: np.NaN if isinstance(x, str) and x.endswith('_h') else x)
From the documentation:
Invoke function on values of Series. Can be ufunc (a NumPy function
that applies to the entire Series) or a Python function that only
works on single values
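The same replacement can also be vectorized with the str accessor, whose na parameter safely skips missing values such as the NaN already present at index 112. A sketch on a trimmed-down version of the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'DASH_CODE': ['1011567', '47614750001_h', np.nan]})

# str.endswith returns NaN for missing values; na=False keeps them out
# of the mask, so only the '_h' codes are replaced
mask = df['DASH_CODE'].str.endswith('_h', na=False)
df.loc[mask, 'DASH_CODE'] = np.nan
print(df['DASH_CODE'].tolist())
```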
It may be faster to try to convert all the rows to float using pd.to_numeric:
In [11]: pd.to_numeric(df.DASH_CODE, errors='coerce')
Out[11]:
0 1.011567e+06
1 1.020669e+08
2 1.051379e+08
3 1.019197e+08
4 NaN
5 NaN
6 7.685000e+03
7 1.025216e+08
8 4.761481e+10
9 1.037498e+08
Name: DASH_CODE, dtype: float64
In [12]: df["DASH_CODE"] = pd.to_numeric(df["DASH_CODE"], errors='coerce')

Converting string objects to int/float using pandas

import pandas as pd
path1 = "/home/supertramp/Desktop/100&life_180_data.csv"
mydf = pd.read_csv(path1)
numcigar = {"Never":0 ,"1-5 Cigarettes/day" :1,"10-20 Cigarettes/day":4}
print(mydf['Cigarettes'])
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
print(mydf['CigarNum'])
mydf.to_csv('/home/supertramp/Desktop/powerRangers.csv')
The csv file "100&life_180_data.csv" contains columns like age, bmi,Cigarettes,Alocohol etc.
No int64
Age int64
BMI float64
Alcohol object
Cigarettes object
dtype: object
Cigarettes column contains "Never" "1-5 Cigarettes/day","10-20 Cigarettes/day".
I want to assign weights to these object (Never,1-5 Cigarettes/day ,....)
The expected output is new column CigarNum appended which consists only numbers 0,1,2
CigarNum is as expected for the first several rows and then shows NaN through to the last row of the CigarNum column
0 Never
1 Never
2 1-5 Cigarettes/day
3 Never
4 Never
5 Never
6 Never
7 Never
8 Never
9 Never
10 Never
11 Never
12 10-20 Cigarettes/day
13 1-5 Cigarettes/day
14 Never
...
167 Never
168 Never
169 10-20 Cigarettes/day
170 Never
171 Never
172 Never
173 Never
174 Never
175 Never
176 Never
177 Never
178 Never
179 Never
180 Never
181 Never
Name: Cigarettes, Length: 182, dtype: object
The output I get shoudln't give NaN after few first rows.
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 NaN
11 NaN
12 NaN
13 NaN
14 0
...
167 NaN
168 NaN
169 NaN
170 NaN
171 NaN
172 NaN
173 NaN
174 NaN
175 NaN
176 NaN
177 NaN
178 NaN
179 NaN
180 NaN
181 NaN
Name: CigarNum, Length: 182, dtype: float64
OK, the first problem is that you have embedded spaces, causing the dict lookup in apply to miss:
Fix this using the vectorised str methods:
mydf['Cigarettes'] = mydf['Cigarettes'].str.replace(' ', '')
Now creating your new column should just work:
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
UPDATE
Thanks to @Jeff as always for pointing out superior ways to do things:
So you can call replace instead of calling apply:
mydf['CigarNum'] = mydf['Cigarettes'].replace(numcigar)
# now convert the types
mydf['CigarNum'] = mydf['CigarNum'].convert_objects(convert_numeric=True)
You can also use the factorize method.
Thinking about it why not just set the dict values to be floats anyway and then you avoid the type conversion?
So:
numcigar = {"Never":0.0 ,"1-5 Cigarettes/day" :1.0,"10-20 Cigarettes/day":4.0}
Version 0.17.0 or newer
convert_objects is deprecated since 0.17.0, this has been replaced with to_numeric
mydf['CigarNum'] = pd.to_numeric(mydf['CigarNum'], errors='coerce')
Here errors='coerce' will return NaN where the values cannot be converted to a numeric value, without this it will raise an exception
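Putting this answer's pieces together under current pandas (strip the stray whitespace, map via replace, coerce with to_numeric), a sketch on a small sample that reproduces the trailing-space problem:

```python
import pandas as pd

numcigar = {"Never": 0.0, "1-5 Cigarettes/day": 1.0,
            "10-20 Cigarettes/day": 4.0}

# Sample with the trailing space that caused the NaNs in the question
s = pd.Series(['Never', '1-5 Cigarettes/day ', '10-20 Cigarettes/day'])

# Strip whitespace first, then map via replace and coerce to numeric
cigarnum = pd.to_numeric(s.str.strip().replace(numcigar), errors='coerce')
print(cigarnum.tolist())  # [0.0, 1.0, 4.0]
```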
Try using this function for all problems of this kind:
import numpy as np

def get_series_ids(x):
    '''Return a pandas Series of ids corresponding to the
    objects in the input pandas Series x.
    Example:
    get_series_ids(pd.Series(['a','a','b','b','c']))
    returns Series([0,0,1,1,2], dtype=int)'''
    values = np.unique(x)
    values2nums = dict(zip(values, range(len(values))))
    return x.replace(values2nums)
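The built-in pd.factorize does the same job, assigning ids by order of first appearance rather than by sorted order (the two coincide for this example). A sketch:

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'b', 'c'])

# codes holds the integer ids, uniques the distinct values
codes, uniques = pd.factorize(s)
print(list(codes))    # [0, 0, 1, 1, 2]
print(list(uniques))  # ['a', 'b', 'c']
```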
