Why does pandas DataFrame convert integer to object like this?

I am using value_counts() to get the frequency for sec_id. The output of value_counts() should be integers.
When I build a DataFrame from these integers, the columns end up with object dtype. Does anyone know the reason?

They have object dtype because your sec_id column contains string values (e.g. "94114G"). When you call .values on the dataframe created by .reset_index(), you get an object-dtype array: since one column holds strings, the common dtype falls back to object, so both columns rebuilt from it contain Python objects.
More importantly, I think you are doing some unnecessary work. Try this:
>>> sec_count_df = df['sec_id'].value_counts().rename_axis("sec_id").rename("count").reset_index()
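For example, with a toy column (the values here are made up), the count column comes out as a proper integer dtype:
>>> import pandas as pd
>>> df = pd.DataFrame({'sec_id': ['94114G', '94114G', '02376R']})
>>> sec_count_df = df['sec_id'].value_counts().rename_axis("sec_id").rename("count").reset_index()
>>> sec_count_df.dtypes
sec_id    object
count      int64
dtype: object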

Related

Do not convert numerical column names to float in pandas read_excel

I have an Excel file where a column name might be a number, e.g. 2839238. I am reading it using pd.read_excel(bytes(filedata), engine='openpyxl') and, for some reason, this column name gets converted to the float 2839238.0. How can I disable this conversion?
This is an issue for me because I then operate on column names using string-only methods like df = df.loc[:, ~df.columns.str.contains('^Unnamed')], and it gives me the following error:
TypeError: bad operand type for unary ~: 'float'
Column names are arbitrary.
Try changing the type of the columns:
df['col'] = df['col'].astype(int)
The number in your example suggests your values may be large; very large numbers may not fit a fixed-width integer dtype but can still be represented as a float or double. Check the ranges of the available dtypes and compare them to your data to see which one you can use.
Verify that you don't have any duplicate column names. Pandas will append .1, .2, etc. if there is another instance of 2839238 as a header name.
See the description of the mangle_dupe_cols parameter, which says:
Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
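If the headers really are being read back as floats (rather than mangled duplicates), a workaround is to normalize the column names after reading. A minimal sketch, assuming the headers are either strings or whole-number floats (the filename is hypothetical):
import pandas as pd

df = pd.read_excel('data.xlsx', engine='openpyxl')  # hypothetical file
# Turn a float header like 2839238.0 back into the string '2839238'
df.columns = [str(int(c)) if isinstance(c, float) and c.is_integer() else str(c)
              for c in df.columns]
# String-only methods on the columns are now safe
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]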

pandas dataframe extracting values

I have a dataframe called "nums" and am trying to find the value of the column "angle" by specifying the values of other columns like this:
nums[(nums['frame']==300)&(nums['tad']==6)]['angl']
When I do so, I do not get a single number and cannot do calculations on it. What am I doing wrong?
First of all, you should generally use .loc rather than chained indexing like that:
>>> s = nums.loc[(nums['frame']==300)&(nums['tad']==6), 'angl']
Now, to get the float, use .item(), which returns the single element (and raises a ValueError if the selection does not contain exactly one value).
>>> s.item()
-0.466331
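A self-contained version of the same pattern, with made-up data:
>>> import pandas as pd
>>> nums = pd.DataFrame({'frame': [299, 300], 'tad': [6, 6], 'angl': [0.1, -0.466331]})
>>> nums.loc[(nums['frame']==300)&(nums['tad']==6), 'angl'].item()
-0.466331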

Is there a way to check if an object is actually a string, to use the .str accessor without running into AttributeError?

I'm converting pyspark data frames to pandas data frames using toPandas(). However, because some data types don't line up, pandas casts certain columns in the data frame, such as decimal fields, to object.
I'd like to run .str on the columns that actually contain strings, but can't seem to get it to work (without explicitly finding which columns to convert first).
I run into:
AttributeError: Can only use .str accessor with string values!
I've tried df.fillna(0) and df.infer_objects(), to no avail. I can't seem to get the objects to register as int64 or float64, so I can't do:
for col in df.columns:
    if df[col].dtype == object:  # np.object was removed in newer NumPy releases
        # insert logic here
beforehand.
I also cannot use .str.contains, because even though the columns with numeric values have object dtype, using .str on them errors out. (For reference, what I'm trying to do is: if a column in the data frame actually has string values, do a str.split().)
Any ideas?
Note: I am curious for an answer on the pandas side, without having to explicitly identify which columns actually have strings beforehand. One possible solution is to get the list of columns of strings on the pyspark side, and pass those as the columns to run .str methods on.
I also tried astype(str), but it won't work because some objects are arrays. I.e. if I wanted to split on _ and a column contained an array like ['Red_Apple', 'Orange'], doing astype(str).str.split on that column would return ['Red', 'Apple', 'Orange'], which doesn't make sense. I only want to split string columns, not turn arrays into strings and split them too.
You can use isinstance():
var = 'hello world'
if isinstance(var, str):
    # Do something
A couple of ideas here:
Convert the column to string anyway using astype: df[col_name].astype(str).str.split().
Check the column types with df.dtypes, and only run str.split() on columns that are already object dtype.
This is really up to you for how you want to implement it, but if you want to treat the column as a str anyway, I would go with option 1.
Hope I got you right. You can use .select_dtypes:
df = pd.DataFrame({'A': ['9', '3', '7'], 'b': ['11.0', '8.0', '9'], 'c': [2, 5, 9]})
print(df.dtypes)  # check the df dtypes
A    object
b    object
c     int64
dtype: object
df2 = df.select_dtypes(include='object')  # isolate object dtype columns
df3 = df.select_dtypes(exclude='object')  # isolate non-object dtype columns
df2 = df2.astype('float')  # convert the object columns to float
res = df3.join(df2)  # rejoin the dataframes
print(res.dtypes)  # recheck the dtypes
c      int64
A    float64
b    float64
dtype: object
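Another option, which also handles the arrays-stored-in-object-columns caveat from the question, is pd.api.types.infer_dtype, which inspects the actual values rather than the declared dtype. A minimal sketch (the column names and data are hypothetical):
import pandas as pd

df = pd.DataFrame({
    'words': ['Red_Apple', 'Orange_Peel'],            # real strings
    'nums': pd.Series([1, 2], dtype=object),          # numbers stored as object
    'arrays': [['Red_Apple', 'Orange'], ['Pear']],    # lists stored as object
})

for col in df.columns:
    # infer_dtype returns 'string' only when the values are actually strings
    if pd.api.types.infer_dtype(df[col]) == 'string':
        df[col] = df[col].str.split('_')
Only the 'words' column gets split; the object columns holding numbers and lists are left alone.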

How to drop DataFrame columns with dtype ndarray and dataframe

As it says in the title, I have a pandas DataFrame with different data types (strings, floats, integers, ndarrays, dataframes, etc.). I would like to drop all the columns of type ndarray or dataframe (object dtype in pandas nomenclature), but I can't find a proper way.
I found in the pandas documentation that 'To select strings you must use the object dtype, but note that this will return all object dtype columns'. And this is exactly my problem. I want to exclude the columns whose values are ndarrays, but not those that are strings or e.g. pathlib.Path's.
There is a similar question here, where they just want to keep the numerical dtype columns, but again this is not exactly what I want and I can't find a way to properly do it.
I've tried
dfSimple = df.select_dtypes(exclude=[np.ndarray])
but that leaves me with only np.number and bool types, excluding all others.
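Since the dtype alone can't distinguish these cases, one approach is to check the type of the values themselves, in the spirit of the isinstance() suggestion above. A minimal sketch with a hypothetical frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b'],                  # strings: keep
    'x': [1.0, 2.0],                     # floats: keep
    'arr': [np.zeros(3), np.ones(3)],    # ndarrays: drop
})

# Drop any column that contains an ndarray or a DataFrame
drop = [c for c in df.columns
        if df[c].map(lambda v: isinstance(v, (np.ndarray, pd.DataFrame))).any()]
dfSimple = df.drop(columns=drop)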

Convert Pandas Series to Categorical

I have a pandas Series 'ids' of only unique ids, which has dtype object.
data_df.id.dtype
returns dtype('O')
I'm trying to follow the example here to create a sparse matrix from my df: Efficiently create sparse pivot tables in pandas?
id_u = list(data_df.id.unique())
row = data_df.id.astype('category', categories=id_u).cat.codes
and I get:
TypeError: data type "category" not understood
I'm not sure what this error means and I haven't been able to find much on it.
Newer versions of pandas no longer accept the categories keyword in astype, hence the error. Try instead:
row = pd.Categorical(data_df['id'], categories=id_u)
You can get the codes using:
row.codes
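Alternatively, recent pandas versions still support the astype route if you pass a CategoricalDtype object instead of the string 'category'; a sketch assuming id_u holds your categories:
row = data_df['id'].astype(pd.CategoricalDtype(categories=id_u)).cat.codes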
