Converting pandas DataFrame types - python

I have a pandas DataFrame created through a MySQL call which returns the data as object dtype.
The data is mostly numeric, with some 'na' values.
How can I cast the types of the DataFrame so the numeric values are appropriately typed (floats) and the 'na' values are represented as numpy NaN values?

Use the replace method on DataFrames:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'k1': ['na'] * 3 + ['two'] * 4,
    'k2': [1, 'na', 2, 'na', 3, 4, 4]})
print(df)
df = df.replace('na', np.nan)
print(df)
Note that df.replace('na', np.nan) by itself won't work; you must assign the result back to the existing DataFrame (or pass inplace=True).

df = df.convert_objects(convert_numeric=True) will work in most cases.
Note that this copies the data. It would be preferable to get it into a numeric type on the initial read. If you post your code and a small example, someone might be able to help you with that.
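convert_objects was deprecated in pandas 0.17 and has since been removed; a minimal sketch of the modern equivalent using pd.to_numeric, coercing unparseable strings such as 'na' to NaN:
import pandas as pd

# convert each column to a numeric dtype; strings that can't be parsed
# become NaN instead of raising (fully non-numeric columns become all-NaN)
df = df.apply(pd.to_numeric, errors='coerce')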

This is what Tom suggested, and it is correct:
In [134]: s = pd.Series(['1', '2.', 'na'])
In [135]: s.convert_objects(convert_numeric=True)
Out[135]:
0      1
1      2
2    NaN
dtype: float64
As Andy points out, this doesn't work directly (I think that's a bug), so convert all elements to strings first, then convert:
In [136]: s2 = pd.Series(['1', '2.', 'na', 5])
In [138]: s2.astype(str).convert_objects(convert_numeric=True)
Out[138]:
0      1
1      2
2    NaN
3      5
dtype: float64
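In modern pandas, pd.to_numeric handles the mixed-type Series directly, without the astype(str) workaround (a sketch):
pd.to_numeric(s2, errors='coerce')
0    1.0
1    2.0
2    NaN
3    5.0
dtype: float64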


Why is DataFrame int column value sometimes returned as float?

I add a calculated column c to a DataFrame that only contains integers.
df = pd.DataFrame(data=list(zip(*[np.random.randint(1,3,5), np.random.random(5)])), columns=['a', 'b'])
df['c'] = np.ceil(df.a/df.b).astype(int)
df.dtypes
The DataFrame reports that the column type of c is indeed int:
a int64
b float64
c int32
dtype: object
If I access a value from c like this then I get an int:
df.c.values[0] # Returns "3"
type(df.c.values[0]) # Returns "numpy.int32"
But if I access the same value using iloc I get a float:
df.iloc[0].c # Returns "3.0"
type(df.iloc[0].c) # Returns "numpy.float64"
Why is this?
I would like to be able to access the value using indexes without having to cast it (again) to an int.
Looks like what's happening is that when you access df.iloc[0].c, you first access df.iloc[0], which includes all three columns. df.iloc[0] is then cast to the single type that can represent all three columns, which is numpy.float64.
Interestingly enough, I can avoid this by adding a string column.
df = pd.DataFrame(data=list(zip(*[np.random.randint(1,3,5), np.random.random(5)])), columns=['a', 'b'])
df['c'] = np.ceil(df.a/df.b).astype(int)
df['d'] = ['hi', 'bye', 'hello', 'cya', 'sup']
print(df.iloc[0].c)
print(type(df.iloc[0].c))
print(df.dtypes)
To your end question, you can avoid this whole mess by using df.loc[0, 'c'] instead of iloc.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=list(zip(*[np.random.randint(1,3,5), np.random.random(5)])), columns=['a', 'b'])
df['c'] = np.ceil(df.a/df.b).astype(int)
print(df.loc[0, 'c'])
print(df.loc[0, 'c'].dtype)
15
int32
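df.at[0, 'c'] is another dtype-preserving option, doing fast scalar access by label (a sketch using the same df; the exact integer width is platform-dependent):
print(df.at[0, 'c'])        # e.g. 15, still an integer
print(type(df.at[0, 'c']))  # <class 'numpy.int32'> on Windows, numpy.int64 elsewhere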
When I execute your code, the result is this dataframe:
df
   a         b   c
0  1  0.315388   4
1  1  0.111275   9
2  1  0.251253   4
3  2  0.043162  47
4  1  0.047985  21
When I type df['c'].values in the interpreter, I get this:
array([ 4,  9,  4, 47, 21]). That is to say, all the c-column values.
When I type df.iloc[0] in the interpreter, I get the dataframe's first row values:
a    1.000000
b    0.315388
c    4.000000
Name: 0, dtype: float64
What we can notice
All c-column values are integers, while the first row's values are not all of the same type: we have two integers and a float value.
This fact is very important.
Indeed, by definition an array is a collection of elements of the same type.
So to represent a float in a collection of values that are otherwise integers, all elements must be converted to float, because an integer can be represented exactly as a float but the reverse is not true.
Conclusion
The dtype of a collection of integers is int...
The dtype of a collection of floats is float...
A collection of integers containing at least one float is converted to float...
Quote
"An array is a concept that stores different items of the same type together as one and makes calculating the stance of each element easier by adding an offset to the base number." (codeinstitute.net)
To check this and go further
# case A : value 2 is an integer
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},]
df = pd.DataFrame(mydict)
df.iloc[0]
a 1
b 2
c 3
d 4
Name: 0, dtype: int64
# case B : value '2' is a string
mydict = [{'a': 1, 'b': '2', 'c': 3, 'd': 4},]
df = pd.DataFrame(mydict)
df.iloc[0]
a 1
b 2
c 3
d 4
Name: 0, dtype: object
In case A all elements are integers, so the dtype remains int64.
In case B the collection contains a string, which can't be converted to a float, so all elements are converted to the object type.
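The same promotion rules can be checked at the NumPy level (a quick sketch; the exact integer width is platform-dependent):
import numpy as np

np.array([1, 2, 3]).dtype      # int64: all integers
np.array([1, 2, 3.0]).dtype    # float64: one float promotes everything
np.array([1, '2', 3]).dtype    # <U21: a string forces a string array
                               # (pandas falls back to object dtype instead)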

Why does pandas .sum(axis=1) return 0 when one row has numpy datetime64 values?

I have a pandas dataframe with floating-point numbers in one row and numpy datetime64 values in another.
df2 = pd.DataFrame(
    [[np.datetime64('2021-01-01'), np.datetime64('2021-01-01')], [2, 3]],
    columns=['A', 'B'])
When I sum each row (axis=1), I get 0 for all rows:
df2.sum(axis=1)
0    0.0
1    0.0
dtype: float64
Why does this happen? I have tried the numeric_only=True option, with the same result.
I would expect each row to be handled individually, and get 5 as result for the second row, as happens if I replace the datetime64 objects with strings:
df = pd.DataFrame(
    {'A': ['2021-01-01', 2],
     'B': ['2021-01-01', 3]})
print(df.sum(axis=1))
0    2021-01-012021-01-01
1                       5
dtype: object
Thanks!
You can get something like what you're after if you make your rows columns:
df2.transpose().sum(axis=0)
Your rows won't be coerced to a numerical dtype, i.e.
df2.loc[1]
results in
A    2
B    3
Name: 1, dtype: object
Rows and columns are not treated equally in pandas, rightly or wrongly.
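If you just need the numeric row summed, a minimal workaround (a sketch) is to coerce that row to numeric explicitly. Here df2.loc[1] is an object-dtype Series holding 2 and 3, and pd.to_numeric turns it into an int64 Series:
pd.to_numeric(df2.loc[1], errors='coerce').sum()   # 5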

Passing row and column name to get value [duplicate]

I have constructed a condition that extracts exactly one row from my data frame:
d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]
Now I would like to take a value from a particular column:
val = d2['col_name']
But as a result, I get a data frame that contains one row and one column (i.e., one cell). It is not what I need. I need one value (one float number). How can I do it in pandas?
If you have a DataFrame with only one row, then access the first (only) row as a Series using iloc, and then the value using the column name:
In [3]: sub_df
Out[3]:
          A         B
2 -0.133653 -0.030854
In [4]: sub_df.iloc[0]
Out[4]:
A   -0.133653
B   -0.030854
Name: 2, dtype: float64
In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493
These are fast access methods for scalars:
In [15]: df = pandas.DataFrame(numpy.random.randn(5, 3), columns=list('ABC'))
In [16]: df
Out[16]:
          A         B         C
0 -0.074172 -0.090626  0.038272
1 -0.128545  0.762088 -0.714816
2  0.201498 -0.734963  0.558397
3  1.563307 -1.186415  0.848246
4  0.205171  0.962514  0.037709
In [17]: df.iat[0, 0]
Out[17]: -0.074171888537611502
In [18]: df.at[0, 'A']
Out[18]: -0.074171888537611502
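Note that sub_df.iloc[0]['A'] above is chained indexing (two lookups). Single-step equivalents (a sketch) avoid that:
sub_df.at[sub_df.index[0], 'A']               # by label
sub_df.iat[0, sub_df.columns.get_loc('A')]    # by position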
You can turn your 1x1 dataframe into a NumPy array, then access the first and only value of that array:
val = d2['col_name'].values[0]
Most answers are using iloc, which is good for selection by position.
If you need selection by label, loc is more convenient.
For getting a value explicitly (equivalent to the deprecated
df.get_value('a', 'A')):
# This is also equivalent to df1.at['a','A']
In [55]: df1.loc['a', 'A']
Out[55]: 0.13200317033032932
It doesn't need to be complicated:
val = df.loc[df.wd==1, 'col_name'].values[0]
I needed the value of one cell, selected by column and index names.
This solution worked for me:
original_conversion_frequency.loc[1,:].values[0]
It looks like things changed after pandas 0.10.1 or 0.13.1.
I upgraded from 0.10.1 to 0.13.1. Before, iloc was not available.
Now with 0.13.1, iloc[0]['label'] gets a single-value Series rather than a scalar.
Like this:
lastprice = stock.iloc[-1]['Close']
Output:
date
2014-02-26    118.2
Name: Close, dtype: float64
The quickest and easiest options I have found are the following. 501 represents the row index.
df.at[501, 'column_name']
df.get_value(501, 'column_name')
(Note that get_value was deprecated in pandas 0.21 and removed in 1.0; df.at is the current equivalent.)
In later versions, you can fix it by simply doing:
val = float(d2['col_name'].iloc[0])
For example, with a DataFrame of GDP by country and year:
df_gdp.columns
Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code',
u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967',
u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975',
u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983',
u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991',
u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999',
u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007',
u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015',
u'2016'],
dtype='object')
df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]
8100000000000.0
I am not sure if this is good practice, but I noticed I can also get just the value by casting a single-element Series to float.
E.g.,
rate
3 0.042679
Name: Unemployment_rate, dtype: float64
float(rate)
0.0426789
I've run across this when using dataframes with MultiIndexes and found squeeze useful.
From the documentation:
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
# Example for a dataframe with MultiIndex
> import pandas as pd
> df = pd.DataFrame(
      [
          [1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]
      ],
      index=pd.MultiIndex.from_tuples([('i', 1), ('ii', 2), ('iii', 3)]),
      columns=pd.MultiIndex.from_tuples([('A', 'a'), ('B', 'b'), ('C', 'c')])
  )
> df
        A  B  C
        a  b  c
i   1   1  2  3
ii  2   4  5  6
iii 3   7  8  9
> df.loc['ii', 'B']
   b
2  5
> df.loc['ii', 'B'].squeeze()
5
Note that while df.at[] also works (if you don't need conditionals), you still, as far as I know, need to specify all levels of the MultiIndex.
Example:
> df.at[('ii', 2), ('B', 'b')]
5
I have a dataframe with a six-level index and two-level columns, so only having to specify the outer level is quite helpful.
For pandas 0.10, where iloc is unavailable, filter a DF and get the first row's data for the column VALUE (the parentheses are required because & binds more tightly than ==):
df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0], 'VALUE')
If more than one row is filtered, this obtains the first row's value. It will raise an exception if the filter results in an empty data frame.
Converting it to integer worked for me:
int(sub_df.iloc[0])
Using .item() returns a scalar (not a Series); it only works if there is a single element selected. It's much safer than .values[0], which returns the first element regardless of how many are selected.
>>> df = pd.DataFrame({'a': [1,2,2], 'b': [4,5,6]})
>>> df[df['a'] == 1]['a'] # Returns a Series
0 1
Name: a, dtype: int64
>>> df[df['a'] == 1]['a'].item()
1
>>> df2 = df[df['a'] == 2]
>>> df2['b']
1 5
2 6
Name: b, dtype: int64
>>> df2['b'].values[0]
5
>>> df2['b'].item()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/base.py", line 331, in item
raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar
To get the full row's value as JSON (instead of a Series):
row = df.iloc[0]
Use the to_json method like below:
row.to_json()
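For example, a minimal sketch with hypothetical columns a and b:
row = pd.DataFrame({'a': [1], 'b': [4]}).iloc[0]
row.to_json()   # '{"a":1,"b":4}'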

Convert float to int and leave nulls

I have the following dataframe, and I want to convert the values in column 'b' to integers:
   a       b   c
0  1     NaN   3
1  5  7200.0  20
2  5   580.0  20
The following code throws the exception "ValueError: Cannot convert NA to integer":
df['b'] = df['b'].astype(int)
How do I convert only the floats to int and leave the nulls as is?
When your series contains floats and NaNs and you want to convert it to integers, you will get an error, because NaN cannot be represented as a numpy integer.
DON'T DO:
df['b'] = df['b'].astype(int)
From pandas >= 0.24 there is a built-in nullable pandas integer, which does allow integer NaNs. Notice the capital I in 'Int64': this is the pandas integer, not the numpy integer.
SO, DO THIS:
df['b'] = df['b'].astype('Int64')
More info on pandas integer na values:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
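A minimal sketch of what this produces on the example data (the missing value prints as <NA>):
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 5, 5], 'b': [np.nan, 7200.0, 580.0], 'c': [3, 20, 20]})
df['b'] = df['b'].astype('Int64')
print(df)
#    a     b   c
# 0  1  <NA>   3
# 1  5  7200  20
# 2  5   580  20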
np.nan is a floating-point-only kind of thing, so it has to be removed in order to create an integer pd.Series. Jeon's suggestion works great if 0 isn't a valid value in df['b']. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 5, 5], 'b': [np.nan, 7200.0, 580.0], 'c': [3, 20, 20]})
print(df, '\n\n')
df['b'] = np.nan_to_num(df['b']).astype(int)
print(df)
If there are valid 0's, then you could first replace them all with some unique value (e.g., -999999999), then do the conversion above, and then replace those unique values with 0's.
Either way, you have to remember that you have 0's where there once were NaNs. You will need to be careful to filter these out when doing various numerical analyses (e.g., mean, etc.).
Similar answer to TSeymour's, but now using pandas' fillna:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 5, 5], 'b': [np.nan, 7200.0, 580.0], 'c': [3, 20, 20]})
print(df, '\n\n')
df['b'] = df['b'].fillna(0).astype(int)
print(df)
Which gives:
   a       b   c
0  1     NaN   3
1  5  7200.0  20
2  5   580.0  20

   a     b   c
0  1     0   3
1  5  7200  20
2  5   580  20
Select the values that are not NaN using the pandas notnull function, then assign those values to type int using the astype function:
df[df[0].notnull()] = df[df[0].notnull()].astype(int)
I used the index number to make this solution more general. Of course, you can always specify the column by name instead: df['name_of_column'].

Is there a way to copy only the structure (not the data) of a Pandas DataFrame?

I received a DataFrame from somewhere and want to create another DataFrame with the same number and names of columns and rows (indexes). For example, suppose that the original data frame was created as
import pandas as pd
df1 = pd.DataFrame([[11,12],[21,22]], columns=['c1','c2'], index=['i1','i2'])
I copied the structure by explicitly defining the columns and names:
df2 = pd.DataFrame(columns=df1.columns, index=df1.index)
I don't want to copy the data; otherwise I could just write df2 = df1.copy(). In other words, after df2 is created it must contain only NaN elements:
In [1]: df1
Out[1]:
    c1  c2
i1  11  12
i2  21  22
In [2]: df2
Out[2]:
     c1   c2
i1  NaN  NaN
i2  NaN  NaN
Is there a more idiomatic way of doing it?
That's a job for reindex_like. Start with the original:
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
Construct an empty DataFrame and reindex it like df1:
pd.DataFrame().reindex_like(df1)
Out:
     c1   c2
i1  NaN  NaN
i2  NaN  NaN
As of version 0.18 of pandas, the DataFrame constructor has no option for creating a dataframe like another dataframe with NaN instead of the values.
The code you use, df2 = pd.DataFrame(columns=df1.columns, index=df1.index), is the most logical way. The only way to improve on it is to spell out even more what you are doing, by adding data=None, so that other coders directly see that you intentionally leave the data out of this new DataFrame.
TLDR: So my suggestion is:
Explicit is better than implicit
df2 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
Very much like yours, but more spelled out.
Not exactly answering this question, but a similar one, for people coming here via a search engine.
My case was creating a copy of the data frame without data and without an index. One can achieve this by doing the following. It will maintain the dtypes of the columns.
empty_copy = df.drop(df.index)
Let's start with some sample data
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']],
...: columns=['num', 'char'])
In [3]: df
Out[3]:
num char
0 1 a
1 2 b
2 3 c
In [4]: df.dtypes
Out[4]:
num int64
char object
dtype: object
Now let's use a simple DataFrame initialization using the columns of the original DataFrame but providing no data:
In [5]: empty_copy_1 = pd.DataFrame(data=None, columns=df.columns)
In [6]: empty_copy_1
Out[6]:
Empty DataFrame
Columns: [num, char]
Index: []
In [7]: empty_copy_1.dtypes
Out[7]:
num object
char object
dtype: object
As you can see, the column data types are not the same as in our original DataFrame.
So, if you want to preserve the column dtype...
If you want to preserve the column data types you need to construct the DataFrame one Series at a time
In [8]: empty_copy_2 = pd.DataFrame.from_items([
...: (name, pd.Series(data=None, dtype=series.dtype))
...: for name, series in df.iteritems()])
In [9]: empty_copy_2
Out[9]:
Empty DataFrame
Columns: [num, char]
Index: []
In [10]: empty_copy_2.dtypes
Out[10]:
num int64
char object
dtype: object
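Note that DataFrame.from_items and iteritems have since been removed from pandas; a sketch of the same idea with the current API:
empty_copy_2 = pd.DataFrame(
    {name: pd.Series(dtype=series.dtype) for name, series in df.items()}
)
This preserves the per-column dtypes just like the from_items version.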
A simple alternative: first copy the basic structure (indexes and columns, with dtypes) from the original dataframe (df1) into df2:
df2 = df1.iloc[0:0]
Then fill your dataframe with empty rows; pseudocode that will need to be adapted to better match your actual structure:
s = pd.Series([np.nan, np.nan, np.nan], index=['Col1', 'Col2', 'Col3'])
# loop through the rows in df1
df2 = df2.append(s)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
To preserve the column types you can use the astype method, like pd.DataFrame(columns=df1.columns).astype(df1.dtypes):
import pandas as pd
df1 = pd.DataFrame(
    [
        [11, 12, 'Alice'],
        [21, 22, 'Bob']
    ],
    columns=['c1', 'c2', 'c3'],
    index=['i1', 'i2']
)
df2 = pd.DataFrame(columns=df1.columns).astype(df1.dtypes)
print(df2.shape)
print(df2.dtypes)
output:
(0, 3)
c1 int64
c2 int64
c3 object
dtype: object
Working example
You can simply mask by notna(), i.e.:
df1 = pd.DataFrame([[11, 12], [21, 22]], columns=['c1', 'c2'], index=['i1', 'i2'])
df2 = df1.mask(df1.notna())
     c1   c2
i1  NaN  NaN
i2  NaN  NaN
A simple way to copy df structure into df2 is:
df2 = pd.DataFrame(columns=df.columns)
This has worked for me in pandas 0.22:
df2 = pd.DataFrame(index=df.index.delete(slice(None)), columns=df.columns)
Then convert the types:
df2 = df2.astype(df.dtypes)
Use delete(slice(None)) in case you do not want to keep the values of the indexes.
I know this is an old question, but I thought I would add my two cents.
def df_cols_like(df):
    """
    Returns an empty data frame with the same column names and types as df
    """
    # .items() replaces the removed .iteritems()
    df2 = pd.DataFrame({name: pd.Series(dtype=dtype)
                        for name, dtype in df.dtypes.items()},
                       columns=df.dtypes.index)
    return df2
This approach centers around the df.dtypes attribute of the input data frame, df, which is a pd.Series. A pd.DataFrame is constructed from a dictionary of empty pd.Series objects named using the input column names with the column order being taken from the input df.
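A quick usage sketch:
df = pd.DataFrame({'num': [1, 2, 3], 'char': ['a', 'b', 'c']})
empty = df_cols_like(df)
print(len(empty))      # 0
print(empty.dtypes)    # num int64, char object, matching df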
