Pandas Dataframe Issue Converting Column dtype - python

I have a simple pandas dataframe with a column:
import pandas as pd

col = ['A']
data = [[1.0], [2.3], [3.4]]
df = pd.DataFrame.from_records(data, columns=col)
This creates a dataframe with one column of type np.float64, which is what I want.
Later in the process, I want to add another column of type string.
df['SOMETEXT'] = "SOME TEXT FOR ANALYSIS"
The dtype of this column is coming through as object, but I need it to be a string type. So I do the following:
df['SOMETEXT'] = df['SOMETEXT'].astype(str)
If I look at the dtype again, I get the same dtype for that column: object.
I have a process further down my workflow that is dtype sensitive, and I need the column to be a string.
Any ideas?
array = df.to_records(index=False) # convert to numpy array
The dtypes on the record array still carry object for that column, but I need it to be a string.

In pandas, string columns are stored with object dtype by default. It confused me too when I first started.
Once in NumPy, you can cast the string:
In [24]: array['SOMETEXT'].astype(str)
Out[24]:
array(['SOME TEXT FOR ANALYSIS', 'SOME TEXT FOR ANALYSIS',
'SOME TEXT FOR ANALYSIS'],
dtype='<U22')
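If you are on a newer pandas, two options are worth knowing about (a sketch, assuming pandas >= 1.0 for the dedicated "string" dtype and >= 0.24 for column_dtypes):
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.3, 3.4]})
df['SOMETEXT'] = "SOME TEXT FOR ANALYSIS"

# Opt-in string dtype: reported as 'string' rather than 'object'
df['SOMETEXT'] = df['SOMETEXT'].astype('string')
print(df.dtypes)

# to_records can also cast on the way out via column_dtypes
array = df.to_records(index=False, column_dtypes={'SOMETEXT': 'U22'})
print(array.dtype)  # SOMETEXT carries a fixed-width '<U22' dtype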

Related

Pandas - Convert object to string and then to int

I have a data frame whose column values are numbers stored as strings with thousands separators, e.g. "1,000", so the dtype is object.
What I am trying to do is convert this object column to a string and then to a numeric type.
I have looked at using astype, something like:
df['a'] = df['a'].astype(str).astype(int)
I have tried other variations as well. What I am after is for the column values to become numbers (obviously without the commas); it is just that the dtype is object initially.
Thanks so much!
You need to remove all the commas first:
df['a'] = df['a'].str.replace(',', '').astype(int)
With both columns, you can do:
df[['a','b']] = df[['a','b']].replace(',', '', regex=True).astype('int')
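A minimal runnable sketch (the sample frame here is hypothetical, since the question did not include its data):
import pandas as pd

# Hypothetical data: numbers stored as strings with thousands separators
df = pd.DataFrame({'a': ['1,000', '2,500', '10,000'],
                   'b': ['3,000', '4,500', '20,000']})

df[['a', 'b']] = df[['a', 'b']].replace(',', '', regex=True).astype('int')
print(df.dtypes)  # a: int64, b: int64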

Check if ENTIRE pandas object column is a string

How can I check if a column is a string, or another type (e.g. int or float), even though the dtype is object?
(Ideally I want this operation vectorised, and not applymap checking every row...)
import io
import pandas as pd
# American post code
df1_str = """id,postal
1,12345
2,90210
3,"""
df1 = pd.read_csv(io.StringIO(df1_str))
df1["postal"] = df1["postal"].astype("O") # is an object (of type float due to the null row 3)
# British post codes
df2_str = """id,postal
1,EC1
2,SE1
3,W2"""
df2 = pd.read_csv(io.StringIO(df2_str))
df2["postal"] = df2["postal"].astype("O") # is an object (of type string)
Both df1 and df2 return object when doing df["postal"].dtype
However, df2 has .str methods, e.g. df2["postal"].str.lower(), but df1 doesn't.
Similarly, df1 can have mathematical operations done to it, e.g. df1 * 2
This is different from other SO questions, which ask whether there are strings inside a column (and not whether the WHOLE column is strings), e.g.:
Python: Check if dataframe column contain string type
Check if string is in a pandas dataframe
Check pandas dataframe column for string type
You can use pandas.api.types.infer_dtype:
>>> pd.api.types.infer_dtype(df2["postal"])
'string'
>>> pd.api.types.infer_dtype(df1["postal"])
'floating'
From the docs:
Efficiently infer the type of a passed val, or list-like array of values. Return a string describing the type.
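If you need a boolean test for a pipeline, a small wrapper over infer_dtype does it (a sketch; is_string_series is just an illustrative helper name):
import pandas as pd
from pandas.api.types import infer_dtype

def is_string_series(s):
    # True only when pandas infers every (non-null) value to be a string
    return infer_dtype(s) == 'string'

print(is_string_series(pd.Series(['EC1', 'SE1', 'W2'])))  # True
print(is_string_series(pd.Series(['EC1', 1])))            # False (mixed values)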

How to convert numeric values with quotes in CSV to int type

Data is below
no,store_id,revenue,profit,state,country
'0','101','779183','281257','WD','India'
'1','101','144829','838451','WD','India'
'2','101','766465','757565','AL','Japan'
'3','102','766465','757565','AL','Japan'
Code is below
import pandas as pd
data = pd.read_csv("1.csv")
dummies = pd.get_dummies(data)
dummies.head(10)
data.info() shows object for every column.
How can the object columns automatically be converted to dummies? For example, here state and country are object columns that need get_dummies; if someone adds a names column tomorrow, it should also be converted to dummies.
How can numeric columns automatically be assigned int, and non-numeric columns object?
Tomorrow someone might add a new column, which may be numeric or non-numeric.
How do I apply get_dummies after that?
While reading the CSV file with pd.read_csv, set the quotechar parameter to ' (the default is ").
From the pd.read_csv docs, under quotechar:
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
import pandas as pd
from io import StringIO
text = """no,store_id,revenue,profit,state,country
'0','101','779183','281257','WD','India'
'1','101','144829','838451','WD','India'
'2','101','766465','757565','AL','Japan'
'3','102','766465','757565','AL','Japan'"""
df = pd.read_csv(StringIO(text),quotechar='\'') # or quotechar = "'"
print(df.dtypes)
no int64
store_id int64
revenue int64
profit int64
state object
country object
dtype: object
@Ch3steR's solution is perfect.
Just to extend it, you can use converters in conjunction with it to handle per-column conversions efficiently, should you want to:
import io
import decimal as D

# quotechar is still needed so the converters see unquoted values
df = pd.read_csv(io.StringIO(text), quotechar="'",
                 converters={'no': D.Decimal, 'store_id': D.Decimal})
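To the get_dummies part of the question: once the numeric columns actually parse as int64, pd.get_dummies by default one-hot encodes only the object (and category) columns, so a column added tomorrow is handled according to its inferred dtype. A short sketch, reusing the df from the quotechar example above (where no, store_id, revenue, and profit parse as int64):
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())
# ['no', 'store_id', 'revenue', 'profit',
#  'state_AL', 'state_WD', 'country_India', 'country_Japan']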

Apply converters to all the columns while reading excel file, Python 3.6

I am importing an Excel file with 30 columns into a dataframe and want to change the column type of all the columns to string. How can I do this?
data = pd.read_excel(excelPath, sheetname='Donor', converters={'Code':str})
For pandas 0.20.0+ you can use the dtype='object' parameter:
data = pd.read_excel(excelPath, sheet_name='Donor', dtype='object')
from docs:
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}
Use object to preserve data as stored in Excel and not
interpret dtype. If converters are specified, they will be applied
INSTEAD of dtype conversion.
New in version 0.20.0.
converters = {col: str for col in column_list}  # Convert all fields to string
data = pd.read_excel(excelPath, sheet_name='Donor', converters=converters)
In addition to the solution from @Plinus, the following code reads all the headers (assuming they are at row 0) while reading 0 rows of data.
Using the headers (column names), it creates a dictionary of "column name"-"data conversion function" pairs, converters.
It then re-reads the whole Excel file using those converters.
columns = pd.read_excel(
    '/pathname/to/excel/file.xlsx',
    sheet_name='Sheet 1',
    nrows=0,  # Read 0 rows of data, assuming headers are at row 0
).columns
converters = {col: str for col in columns}  # Convert all fields to strings
data = pd.read_excel(
    '/pathname/to/excel/file.xlsx',
    sheet_name='Sheet 1',
    converters=converters,
)
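On a reasonably recent pandas you can also skip building the converters dict entirely, since read_excel accepts dtype=str to read every column as text (a sketch; the path and sheet name are placeholders):
import pandas as pd

# dtype=str casts every column to string on read; note that converters,
# if given, are applied INSTEAD of dtype (per the docs quoted above)
data = pd.read_excel('/pathname/to/excel/file.xlsx',
                     sheet_name='Sheet 1',
                     dtype=str)
print(data.dtypes)  # object for every column, holding Python str values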

assign dtype with from_dict

I have data in a python dictionary like:
data = {u'01-01-2017 22:34:43:871': [u'88.49197', u'valid'],
u'01-01-2017 11:23:43:803': [u'88.49486', u'valid'],
u'02-01-2017 03:11:43:898': [u'88.49773', u'valid'],
u'01-01-2017 13:54:43:819': [u'88.50205', u'valid']}
I can convert it to a pandas Dataframe with:
data = pandas.DataFrame.from_dict(data, orient='index')
but I am not able to use the dtype parameter of from_dict to set per-column types. I would like to convert the index to datetime, and similarly the first column to float and the second to string.
I have tried:
pandas.DataFrame.from_dict((data.values()[0]), orient='index',
dtype={0: 'float', 1:'str'})
but it doesn't work.
This appears to be an ongoing issue with some of the pandas constructor methods: How to set dtypes by column in pandas DataFrame
Instead of using the dtype argument, chaining .astype may do the trick:
pandas.DataFrame.from_dict(data, orient='index').astype({0: float, 1:str})
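The dtype/astype trick does not cover the index; for that, pandas.to_datetime on the index works. A sketch, assuming the day-first 'DD-MM-YYYY HH:MM:SS:milliseconds' format seen in the sample data:
import pandas

data = {u'01-01-2017 22:34:43:871': [u'88.49197', u'valid'],
        u'01-01-2017 11:23:43:803': [u'88.49486', u'valid']}

df = pandas.DataFrame.from_dict(data, orient='index').astype({0: float, 1: str})
# %f picks up the trailing milliseconds after the last colon
df.index = pandas.to_datetime(df.index, format='%d-%m-%Y %H:%M:%S:%f')
print(df.index.dtype)  # datetime64[ns]
print(df.dtypes)       # 0: float64, 1: object (str values)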
