dataframe values.tolist() datatype - python

I have a dataframe with several columns. Two are of type float, price and change, while volume and amount are of type int.
I use the method df.values.tolist() to convert df to a list and get the data:
datatmp = df.values.tolist()
print(datatmp[0])
[20160108150023.0, 11.12, -0.01, 4268.0, 4746460.0, 2.0]
The int columns in df have all changed to float.
My question is why do int types change to the float types? How can I get the int data I want?

You can convert column-by-column:
by_column = [df[x].values.tolist() for x in df.columns]
This will preserve the data type of each column.
Then convert to the structure you want:
list(list(x) for x in zip(*by_column))
You can do it in one line:
list(list(x) for x in zip(*(df[x].values.tolist() for x in df.columns)))
You can check what datatypes your columns have with:
df.info()
Very likely your column amount is of type float. Do you have any NaN in this column? These are always of type float and would make the whole column float.
You can cast to int with:
df.values.astype(int).tolist()
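A minimal sketch with made-up numbers (the column names just follow the question) showing how the whole-frame conversion upcasts everything, while the column-by-column route keeps the ints:
import pandas as pd

# Made-up numbers mirroring the question's columns
df = pd.DataFrame({
    "price": [11.12, 11.13],
    "change": [-0.01, 0.02],
    "volume": [4268, 5120],
    "amount": [4746460, 5694720],
})

print(df.values.tolist()[0])
# [11.12, -0.01, 4268.0, 4746460.0] -- everything upcast to float

by_column = [df[c].values.tolist() for c in df.columns]
rows = [list(r) for r in zip(*by_column)]
print(rows[0])
# [11.12, -0.01, 4268, 4746460] -- the int columns keep their ints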

I think the pandas documentation helps:
DataFrame.values
Numpy representation of NDFrame
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
So here apparently float is chosen to accommodate all component types. A simple method would be (though there are probably more elegant solutions around; I'm not too familiar with pandas):
datatmp = list(map(lambda row: list(row[1:]), df.itertuples()))
Here itertuples() gives an iterator whose elements have the form (rownumber, column1_entry, column2_entry, ...). The map takes each such tuple and applies the lambda function, which removes the first component (rownumber) and returns a list containing the components of a single row. You can also drop the inner list() call if it's OK for you to work with a list of tuples.
DataFrame.values documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html#pandas.DataFrame.values

Related

Groupby year dropping some variables

This is the original data, and I need the mean of every variable for each year.
But when I use the groupby('year') command, it drops all variables except 'lnmcap' and 'epu'.
Why is this happening, and what needs to be done?
Probably the other columns hold object (string) data instead of numeric data, which is why only 'lnmcap' and 'epu' got averaged. Use ds.dtypes or simply ds.info() to check the data types of the columns.
If they come out as object/string type, then use:
ds = ds.drop('company', axis=1)
column_names = ds.columns
for i in column_names:
    ds[i] = ds[i].astype(str).astype(float)
This could work
You might want to convert all numerical columns to float before taking their mean, for example:
cols = list(ds.columns)
# remove irrelevant columns
cols.pop(cols.index('company'))
cols.pop(cols.index('year'))
# convert the remaining relevant columns to float
for col in cols:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')
# after that you can apply the aggregation
ds.groupby('year').mean()
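For illustration, a self-contained sketch with invented values (the column names are just borrowed from the question) showing the coercion and the aggregation end to end:
import pandas as pd

# Hypothetical frame in which the numeric columns arrived as strings
ds = pd.DataFrame({
    "company": ["A", "B", "A", "B"],
    "year": [2020, 2020, 2021, 2021],
    "lnmcap": ["1.6", "2.1", "1.7", "2.3"],
    "epu": ["100", "110", "105", "n/a"],   # one dirty value
})

for col in ["lnmcap", "epu"]:
    ds[col] = pd.to_numeric(ds[col], errors="coerce")  # "n/a" becomes NaN

# Selecting the numeric columns explicitly avoids errors from the string 'company' column
print(ds.groupby("year")[["lnmcap", "epu"]].mean())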
You will need to convert the numeric columns to float types. Use ds.info() to check the various data types.
for col in ds.select_dtypes(['object']).columns:
    try:
        ds[col] = ds[col].astype('float')
    except (ValueError, TypeError):
        continue
After this, use ds.info() to check again. Columns with object values like '1.604809' will be converted to the float 1.604809.
Sometimes a column may contain "dirty" data that cannot be converted to float. In this case you could use the code below; errors='coerce' means non-numeric data becomes NaN:
column_names = list(ds.columns)
column_names.remove('company')
column_names.remove('year')
for col in column_names:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')  # converts to numeric; non-numeric values become NaN

Why does pandas DataFrame convert integer to object like this?

I am using value_counts() to get the frequency for sec_id. The output of value_counts() should be integers.
When I build a DataFrame from these integers, I find the columns are of object dtype. Does anyone know the reason?
They are the object dtype because your sec_id column contains string values (e.g. "94114G"). When you call .values on the dataframe created by .reset_index(), you get two arrays which both contain string objects.
More importantly, I think you are doing some unnecessary work. Try this:
>>> sec_count_df = df['sec_id'].value_counts().rename_axis("sec_id").rename("count").reset_index()
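For example, with a couple of invented ids (a sketch, not the asker's data):
import pandas as pd

df = pd.DataFrame({"sec_id": ["94114G", "94114G", "02376R"]})  # string ids

sec_count_df = (
    df["sec_id"].value_counts()
      .rename_axis("sec_id")
      .rename("count")
      .reset_index()
)
print(sec_count_df.dtypes)  # sec_id is object, count is int64 -- no upcast to object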

How to assign integer to a cell in Pandas?

I would like to iterate through a dataframe and create a new column with the index returned by enumerate(), but I'm unable to assign the value as an integer, and I need to convert it later. Is there a way to do this in one shot?
As you can see, direct assignment of an integer fails.
df.loc[indexes, ('route', 'id')] = int(i)
print(df.loc[indexes, ('route', 'id')].dtypes) # float64
Conversion with a second line of code is necessary:
df.loc[indexes, ('route', 'id')] = df.loc[indexes, ('route', 'id')].astype(int)
print(df.loc[indexes, ('route', 'id')].dtypes) # int64
This link shows you how to assign the value of a column to an int type in pandas.
Basically you can do this by using:
to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).
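A small sketch of two ways around the float upcast (column names and data are made up; the nullable Int64 dtype is one possibility, not necessarily what the asker's code used):
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"]})  # toy data

# Option 1: create the column with a nullable integer dtype up front, so row-by-row
# .loc assignments of ints are not upcast to float while other rows are still missing
df["route_id"] = pd.array([pd.NA] * len(df), dtype="Int64")
for i, idx in enumerate(df.index):
    df.loc[idx, "route_id"] = i
print(df["route_id"].dtype)  # Int64

# Option 2: assign whatever is convenient, then convert once at the end
df["route_id2"] = [float(i) for i in range(len(df))]
df["route_id2"] = df["route_id2"].astype("int64")
print(df["route_id2"].dtype)  # int64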

Pandas adding NA changes column from float to object

I have a dataframe with a column that is of type float64. Then when I run the following, it converts to object, which ends up messing up a downstream calculation where this column is the denominator and there are some 0s:
df.loc[some criteria,'my_column'] = pd.NA
After this statement, my column is now of type object. Can someone explain why that is, and how to correct this? I'd rather do it right the first time rather than adding this to the end:
df['my_column'] = df['my_column'].astype(float)
Thanks.
In pandas, you can use pd.NA or np.nan to represent missing values. np.nan is itself a float, so assigning it keeps the column float64, which is what you want and what the documentation describes. pd.NA is the missing-value marker for pandas' nullable extension dtypes; assigning it to a plain float64 column upcasts the column to object (as you saw), so it is best reserved for columns with dtypes such as string or Int64.
df.loc[some criteria,'my_column'] = np.nan
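A sketch reproducing the situation with toy data (exact behaviour can vary across pandas versions, but for the asker the pd.NA assignment produced an object column):
import numpy as np
import pandas as pd

df = pd.DataFrame({"my_column": [1.0, 2.0, 3.0]})   # toy float64 column

df.loc[df["my_column"] > 2, "my_column"] = np.nan
print(df["my_column"].dtype)   # float64 -- stays numeric

df2 = pd.DataFrame({"my_column": [1.0, 2.0, 3.0]})
df2.loc[df2["my_column"] > 2, "my_column"] = pd.NA
print(df2["my_column"].dtype)  # object for the asker -- pd.NA is not a float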

Is there a way to check if an object is actually a string, so I can use the .str accessor without running into AttributeError?

I'm converting pyspark data frames to pandas data frames using toPandas(). However, because some data types don't line up, pandas casts certain columns in the data frame, such as decimal fields, to object.
I'd like to run .str on my columns with actual strings, but can't seem to get it to work (without explicitly finding which columns to convert first).
I run into:
AttributeError: Can only use .str accessor with string values!
I've tried df.fillna(0) and df.infer_objects(), to no avail. I can't seem to get the objects to register as int64 or float64, so I can't do:
for col in df.columns:
    if df[col].dtype == object:
        ...  # insert logic here
beforehand.
I also cannot use .str.contains, because even though the columns with numeric values are dtype objects, upon using .str it will error out. (For reference, what I'm trying to do is if the column in the data frame actually has string values, do a str.split().)
Any ideas?
Note: I am curious for an answer on the pandas side, without having to explicitly identify which columns actually have strings beforehand. One possible solution is to get the list of columns of strings on the pyspark side, and pass those as the columns to run .str methods on.
I also tried astype(str) but it won't work because some objects are arrays. I.e. if I wanted to split on _, and I had an array like ['Red_Apple', 'Orange'] in a column, doing astype(str).str.split on this column would return ['Red', 'Apple', 'Orange'], which doesn't make sense. I only want to split string columns, not turn arrays into strings and split them too.
You can use isinstance():
var = 'hello world'
if isinstance(var, str):
    ...  # do something
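To apply that per column without knowing the string columns in advance, one option (a sketch with invented data) is to test every non-null value in each object column:
import pandas as pd

# Toy frame: one real string column, one object column that actually holds lists
df = pd.DataFrame({
    "fruit": ["Red_Apple", "Orange_Peel"],
    "tags": [["Red_Apple", "Orange"], ["Pear"]],
})

for col in df.select_dtypes(include="object").columns:
    # Only treat the column as text if every non-null entry really is a str
    if df[col].dropna().map(lambda v: isinstance(v, str)).all():
        df[col] = df[col].str.split("_")

print(df["fruit"].tolist())  # [['Red', 'Apple'], ['Orange', 'Peel']]
print(df["tags"].tolist())   # untouched: [['Red_Apple', 'Orange'], ['Pear']]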
A couple of ideas here:
Convert the column to string anyways using astype: df[col_name].astype(str).str.split().
Check the column types with df.dtypes, and only run the str.split() on columns that are already of type object.
This is really up to you for how you want to implement it, but if you want to treat the column as a str anyways, I would go with option 1.
Hope I got you right. You can use .select_dtypes:
import pandas as pd

df = pd.DataFrame({'A': ['9', '3', '7'], 'b': ['11.0', '8.0', '9'], 'c': [2, 5, 9]})
print(df.dtypes)  # check df dtypes
# A    object
# b    object
# c     int64
# dtype: object

df2 = df.select_dtypes(include='object')  # isolate object dtype columns
df3 = df.select_dtypes(exclude='object')  # isolate non-object dtype columns
df2 = df2.astype('float')                 # convert object columns to float
res = df3.join(df2)                       # rejoin the dataframes
print(res.dtypes)                         # recheck the dtypes
# c      int64
# A    float64
# b    float64
# dtype: object
