shift converts my column from integer to float. It turns out that np.nan is float only. Is there any way to keep the shifted column as integer?
df = pd.DataFrame({"a":range(5)})
df['b'] = df['a'].shift(1)
df['a']
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# Name: a, dtype: int64
df['b']
# 0 NaN
# 1 0
# 2 1
# 3 2
# 4 3
# Name: b, dtype: float64
Solution for pandas under 0.24:
The problem is that you get a NaN value, which is float, so the int column is converted to float - see NA type promotions.
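The promotion is easy to demonstrate (a minimal sketch, only showing the dtype change):
s = pd.Series([0, 1, 2])
print(s.dtype)           # int64
print(s.shift(1).dtype)  # float64 - the introduced NaN forces the upcast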
One possible solution is to convert the NaN values to some value like 0 and then convert to int:
df = pd.DataFrame({"a":range(5)})
df['b'] = df['a'].shift(1).fillna(0).astype(int)
print (df)
a b
0 0 0
1 1 0
2 2 1
3 3 2
4 4 3
Solution for pandas 0.24+ - check Series.shift:
fill_value object, optional
The scalar value to use for newly introduced missing values. the default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc. NaT is used. For extension dtypes, self.dtype.na_value is used.
Changed in version 0.24.0.
df['b'] = df['a'].shift(fill_value=0)
Another solution, starting from pandas version 0.24.0, is to simply provide a value for the parameter fill_value:
df['b'] = df['a'].shift(1, fill_value=0)
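A quick check (a minimal sketch) that the column really stays integer:
print(df['b'].dtype)  # int64 - no NaN is introduced, so no upcast to float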
You can construct a numpy array by prepending a 0 to all but the last element of column a:
df.assign(b=np.append(0, df.a.values[:-1]))
a b
0 0 0
1 1 0
2 2 1
3 3 2
4 4 3
As of pandas 1.0.0 I believe you have another option, which is to first use convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the issues with NaN.
df = pd.DataFrame({"a":range(5)})
df = df.convert_dtypes()
df['b'] = df['a'].shift(1)
print(df['a'])
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# Name: a, dtype: Int64
print(df['b'])
# 0 <NA>
# 1 0
# 2 1
# 3 2
# 4 3
# Name: b, dtype: Int64
Another solution is to use the replace() function and a type cast:
df['b'] = df['a'].shift(1).replace(np.nan, 0).astype(int)
I don't like the other answers, which may change the original dtypes; what if you have float and str columns in the data?
Since we don't need the first NaN row, why not skip it?
I would keep all dtypes and cast back:
dt = df.dtypes
df = df.shift(1).iloc[1:].astype(dt)
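A minimal sketch of that idea on a small mixed-dtype frame (the column names here are only for illustration):
import pandas as pd

df = pd.DataFrame({"i": [1, 2, 3], "f": [1.5, 2.5, 3.5], "s": ["x", "y", "z"]})
dt = df.dtypes                        # remember the original dtypes
df = df.shift(1).iloc[1:].astype(dt)  # shift, drop the NaN row, cast back
print(df.dtypes)
# i      int64
# f    float64
# s     object
# dtype: object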
Related
I have a series of type object. When using Series.str.strip(), the cells that contain only an int are turned into NaN.
How do I avoid this?
example
sr = pd.Series([1,2,3,'foo '])
sr.str.strip()
0 NaN
1 NaN
2 NaN
3 foo
dtype: object
desired outcome
0 1
1 2
2 3
3 foo
dtype: object
The simplest is to replace the missing values with the original values using Series.fillna:
sr = pd.Series([1,2,3,'foo '])
sr.str.strip().fillna(sr)
Or strip only the strings, tested by isinstance:
print (sr.apply(lambda x: x.strip() if isinstance(x, str) else x))
0 1
1 2
2 3
3 foo
dtype: object
You can cast the series to str altogether and then strip:
>>> sr.astype(str).str.strip()
0 1
1 2
2 3
3 foo
dtype: object
This way 1 becomes "1" and is unchanged by the stripping. But the values will remain strings at the end, not integers; not sure if that's the desired output.
Suppose I have a dataframe df as shown below
qty
0 1.300
1 1.909
Now I want to extract only the integer portion of the qty column and the df should look like
qty
0 1
1 1
I tried using df['qty'].round(0) but didn't get the desired result, as it rounds off the number to the nearest integer.
Java has a function intValue() which does the desired operation. Is there a similar function in pandas?
Convert the values to integers with Series.astype:
df['qty'] = df['qty'].astype(int)
print (df)
qty
0 1
1 1
If the above does not work, it is possible to use numpy.modf to extract the values before the .:
a, b = np.modf(df['qty'])  # modf returns the fractional and the integer parts
df['qty'] = b.astype(int)
print (df)
qty
0 1
1 1
Or by splitting on the ., though it should be slow for a large DataFrame:
df['qty'] = df['qty'].astype(str).str.split('.').str[0].astype(int)
Or use numpy.floor:
df['qty'] = np.floor(df['qty']).astype(int)
You can use the method floordiv:
df['col'].floordiv(1).astype(int)
For example:
col
0 9.748333
1 6.612708
2 2.888753
3 8.913470
4 2.354213
Output:
0 9
1 6
2 2
3 8
4 2
Name: col, dtype: int64
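A self-contained sketch of the same call (the values are just the example data above):
import pandas as pd

df = pd.DataFrame({'col': [9.748333, 6.612708, 2.888753, 8.913470, 2.354213]})
print(df['col'].floordiv(1).astype(int))
# 0    9
# 1    6
# 2    2
# 3    8
# 4    2
# Name: col, dtype: int64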
I want to check, for every column in a dataframe, whether it contains only numeric data. Specifically, my query is not about the datatype; instead, I want to check whether every value in each column of the dataframe is a numeric value.
How can I find this out?
You can check that using to_numeric and coercing errors:
pd.to_numeric(df['column'], errors='coerce').notnull().all()
For all columns, you can iterate through the columns or just use apply:
df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
E.g.
df = pd.DataFrame({'col' : [1, 2, 10, np.nan, 'a'],
                   'col2': ['a', 10, 30, 40, 50],
                   'col3': [1, 2, 3, 4, 5.0]})
Outputs
col False
col2 False
col3 True
dtype: bool
You can draw a True / False comparison using isnumeric()
Example:
>>> df
A B
0 1 1
1 NaN 6
2 NaN NaN
3 2 2
4 NaN NaN
5 4 4
6 some some
7 value other
Results:
>>> df.A.str.isnumeric()
0 True
1 NaN
2 NaN
3 True
4 NaN
5 True
6 False
7 False
Name: A, dtype: object
# df.B.str.isnumeric()
With the apply() method, which seems more robust in case you need a cell-by-cell comparison:
A DataFrame with two different columns, one with mixed types and another with numbers only, for testing:
>>> df
A B
0 1 1
1 NaN 6
2 NaN 33
3 2 2
4 NaN 22
5 4 4
6 some 66
7 value 11
Result:
>>> df.apply(lambda x: x.str.isnumeric())
A B
0 True True
1 NaN True
2 NaN True
3 True True
4 NaN True
5 True True
6 False True
7 False True
Another example:
Let's consider the below dataframe with different data types:
>>> df
num rating name age
0 0 80.0 shakir 33
1 1 -22.0 rafiq 37
2 2 -10.0 dev 36
3 num 1.0 suraj 30
Based on the comment from the OP on this answer, the data has negative values and 0's in it.
1- This is a pseudo-internal method that returns only the numeric-type data.
>>> df._get_numeric_data()
rating age
0 80.0 33
1 -22.0 37
2 -10.0 36
3 1.0 30
OR
2- There is an option to use the method select_dtypes (in module pandas.core.frame), which returns a subset of the DataFrame's columns based on the column dtypes. One can use the include and exclude parameters.
>>> df.select_dtypes(include=['int64','float64']) # choosing int & float
rating age
0 80.0 33
1 -22.0 37
2 -10.0 36
3 1.0 30
>>> df.select_dtypes(include=['int64']) # choose int
age
0 33
1 37
2 36
3 30
This will return True if all columns are numeric, False otherwise.
df.shape[1] == df.select_dtypes(include=np.number).shape[1]
To select numeric columns:
new_df = df.select_dtypes(include=np.number)
Let's say you have a dataframe called df. If you do:
df.select_dtypes(include=["float", 'int'])
this will return all the numeric columns; you can check if this is the same as the original df.
Otherwise, you can also use the exclude parameter:
df.select_dtypes(exclude=["float", 'int'])
and check if this gives you an empty dataframe.
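A minimal sketch of that emptiness check (assuming df is your DataFrame):
all_numeric = df.select_dtypes(exclude=["float", 'int']).empty
print(all_numeric)  # True when every column is a float or int column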
The accepted answers seem a bit overkill, as they sub-select the entire dataframe.
To check the types, only metadata should be used, which can be done with pd.api.types.is_numeric_dtype.
import pandas as pd
df = pd.DataFrame(data=[[1, 'a']], columns=['numeric_col', 'string_col'])
print(df.columns[list(map(pd.api.types.is_numeric_dtype,df.dtypes))]) # one way
print(df.dtypes.map(pd.api.types.is_numeric_dtype)) # another way
To check for numeric columns, you could use df[c].dtype.kind in 'iufcb', where c is any given column name. The comparison will yield a True or False boolean output.
You can iterate through all the column names with a list comprehension:
>>> [(c, df[c].dtype.kind in 'iufcb') for c in df.columns]
[('col', False), ('col2', False), ('col3', True)]
The numpy.dtype.kind 'iufcb' notation is a representation of whether it is a signed integer (i), unsigned integer (u), float (f), complex number (c), or boolean (b). The string can be modified to exclude any of the above (e.g., 'iufc' to exclude boolean).
This solves the original question in relation to checking column data types. It also provides the benefits of (1) a shorter line of code which (2) remains sufficiently intuitive to the user.
I have a df with values:
A B C D
0 1 2 3 2
1 2 3 3 9
2 5 3 6 6
3 3 6 7
4 6 7
5 2
df.shape is 6x4, say
df.iloc[:,1] pulls out the B column, but len(df.iloc[:,1]) is also = 6
How do I "reshape" df.iloc[:,1]? Which function can I use so that the output is the length of the actual values in the column?
My expected output in this case is 3
You can use last_valid_index. Just note that since your series originally contained NaN values and these are considered float, even after filtering your series will be float. You may wish to convert to int as a separate step.
# first convert dataframe to numeric
df = df.apply(pd.to_numeric, errors='coerce')
# extract column
B = df.iloc[:, 1]
# filter to the last valid value
B_filtered = B[:B.last_valid_index()]
print(B_filtered)
0 2.0
1 3.0
2 3.0
3 6.0
Name: B, dtype: float64
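That separate int cast mentioned above, as a small sketch continuing from the code:
print(B_filtered.astype(int))
# 0    2
# 1    3
# 2    3
# 3    6
# Name: B, dtype: int64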
You can use a list comprehension like this:
len([x for x in df.iloc[:,1] if x != ''])
I'm reading in a large flat file which has timestamped data with multiple columns. The data has a boolean column which can be True/False or can have no entry (which evaluates to NaN).
When reading the csv, the bool column gets typecast as object, which prevents saving the data in an HDFStore because of a serialization error.
example data:
A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4
I use the following command to read it:
import pandas as pd
pd.read_csv('data.csv', parse_dates=True)
One solution is to specify the dtype while reading in the csv, but I was hoping for a more succinct solution like convert_objects, where I can specify parse_numeric or parse_dates.
As you have a missing value in your csv, the dtype of the column is shown to be object because you have mixed dtypes: the first 3 row values are boolean and the last will be a float.
To convert the NaN value, use fillna; it accepts a dict to map desired fill values to columns and produce a homogeneous dtype:
>>> t = """
A B C D
a 1 NaN true
b 5 7 false
c 3 2 true
d 9 4 """
>>> df = pd.read_csv(io.StringIO(t),sep='\s+')
>>> df
A B C D
0 a 1 NaN True
1 b 5 7 False
2 c 3 2 True
3 d 9 4 NaN
>>> df.fillna({'C':0, 'D':False})
A B C D
0 a 1 0 True
1 b 5 7 False
2 c 3 2 True
3 d 9 4 False
You can use dtype, it accepts a dictionary for mapping columns:
dtype : Type name or dict of column -> type
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
import pandas as pd
import numpy as np
import io
# using your sample
csv_file = io.StringIO('''
A B C D
a 1 2 true
b 5 7 false
c 3 2 true
d 9 4''')
df = pd.read_csv(csv_file, sep=r'\s+', dtype={'D': bool})
# then fillna to convert NaN to False
df = df.fillna(value=False)
df
A B C D
0 a 1 2 True
1 b 5 7 False
2 c 3 2 True
3 d 9 4 False
df.D.dtypes
dtype('bool')
From this very similar question, I would suggest using the converters kwarg:
import pandas as pd
pd.read_csv('data.csv',
converters={'D': lambda x: True if x == 'true' else False})
This is as per your comment stating that the NaN value should be replaced by False.
The converters keyword argument can take a dictionary with keys being column names and values being functions to apply to those columns.
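A self-contained sketch of that approach (the data is inlined with io.StringIO here just to stand in for 'data.csv'):
import io
import pandas as pd

csv_file = io.StringIO('''A,B,C,D
a,1,2,true
b,5,7,false
c,3,2,true
d,9,4,''')

# 'true' -> True; everything else, including the empty cell, -> False
df = pd.read_csv(csv_file, converters={'D': lambda x: x == 'true'})
print(df)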