Groupby year dropping some variables - python

This is the original data, and I need the mean of each year for all of the variables. But when I use the groupby('year') command, it drops all variables except 'lnmcap' and 'epu'.
Why is this happening, and what needs to be done?

Probably the other columns hold object (string) data instead of numeric data, which is why only 'lnmcap' and 'epu' get an averaged column. Use ds.dtypes or simply ds.info() to check the data types of the columns.
If they come out as object/string type, then use:
ds = ds.drop('company', axis=1)
for col in ds.columns:
    ds[col] = ds[col].astype(str).astype(float)
This could work.
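As a quick sanity check of why the columns disappear, here is a minimal, hypothetical reproduction (the extra column 'roa' is invented for illustration): object columns are excluded from the mean, and converting them to float brings them back.
import pandas as pd

ds = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'lnmcap': [10.2, 10.4, 10.1, 10.6],
    'roa': ['0.05', '0.07', '0.04', '0.06'],  # stored as strings (object dtype)
})
print(ds.dtypes)                                   # 'roa' shows up as object
print(ds.groupby('year').mean(numeric_only=True))  # 'roa' is dropped

ds['roa'] = ds['roa'].astype(float)
print(ds.groupby('year').mean(numeric_only=True))  # 'roa' is now included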

You might want to convert all numerical columns to float before taking their mean, for example:
cols = list(ds.columns)
# remove irrelevant columns
cols.pop(cols.index('company'))
cols.pop(cols.index('year'))
# convert remaining relevant columns to float
for col in cols:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')
# after that you can apply the aggregation
ds.groupby('year').mean()
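As a small aside, the two pop calls can be written as a single list comprehension, which some find more readable:
cols = [c for c in ds.columns if c not in ('company', 'year')]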

You will need to convert the numeric columns to float types. Use ds.info() to check the various data types.
for col in ds.select_dtypes(['object']).columns:
    try:
        ds[col] = ds[col].astype('float')
    except ValueError:
        continue
After this, use ds.info() to check again: columns holding strings like '1.604809' will have been converted to float 1.604809.
Sometimes a column may contain "dirty" data that cannot be converted to float. In that case, you can use the code below; errors='coerce' means non-numeric data becomes NaN:
column_names = list(ds.columns)
column_names.remove('company')
column_names.remove('year')
for col in column_names:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')  # converts to numeric; non-numeric becomes NaN
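One version-dependent caveat (worth checking against your pandas release): in pandas 2.0 and later, mean() on a groupby that still contains object columns raises a TypeError instead of silently dropping them, so you can also sidestep the problem explicitly:
ds.groupby('year').mean(numeric_only=True)  # average only the numeric columns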

Related

.tolist() converting pandas dataframe values to bytes

I read a csv file in as a pandas dataframe but now need to use the data to do some calculations.
import pandas as pd
### LOAD DAQ FILE
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(cal_file,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
### DEFINE SIGNAL CHANNELS
ch104 = df.ch104.dropna() #gets rid of None value at the end of the column
print(ch104)
When I print ch104 I get the following.
But I cannot do math on it yet, because it is a pandas.Series of strings; the datatype is not correct.
The error I get if I do calculations is this:
can't multiply sequence by non-int of type 'float'
So I tried using .tolist() (or even list()) on the data, but then ch104 looks like this.
I believe the values are being written as bytes and then stored as a list of strings.
Does anyone know how I can get around or fix this issue? It may be because the original file is UTF-16 LE encoded, but I cannot change this and need to use the files as is.
I need the values for simple calculations later on, and they would need to be floats or doubles for that. What can I do?
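(Background on that error: multiplying a str by an int is sequence repetition, but multiplying it by a float is undefined, which is exactly what happens when the Series holds strings. A tiny illustration:)
s = '1.23'
print(s * 2)  # string repetition: '1.231.23'
s * 2.0       # TypeError: can't multiply sequence by non-int of type 'float'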
You probably get this error because you're trying to do calculations on columns that pandas considers non-numeric. The values (numbers in your sense) are for some reason interpreted as strings (in the pandas sense).
To fix that, you can change the type of those columns using pandas.to_numeric:
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())  # get rid of the extra whitespace

import re
cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

ch104 = df.ch104.dropna()
# do your calculations on the ch104 series here
Note that errors='coerce' will put NaN in place of every bad value in your columns. From the pandas documentation:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    If 'coerce', then invalid parsing will be set as NaN.
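To see what errors='coerce' does in isolation, here is a small sketch with made-up values:
import pandas as pd

s = pd.Series(['1.5', ' 2.0', "b'3.1'"])  # strings, one carrying byte-repr debris
print(pd.to_numeric(s, errors='coerce'))
# 0    1.5
# 1    2.0
# 2    NaN   <- the unparseable value becomes NaN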

Dataframe.max() giving all NaNs

I have a pandas DataFrame with data that looks like this:
The data extends beyond what you can see here. I can't tell whether the blue cells hold numeric or string data, but they should be numeric, since I produced those values by multiplication. But I don't know pandas well enough to be sure.
Anyway, I call .max(axis=1) on this dataframe, and it gives me this:
As far as I know, there are no empty cells or cells with weird data, so why am I getting all NaN?
First convert all values to numeric with DataFrame.astype:
df = df.astype(float)
If that doesn't work, use to_numeric with errors='coerce' to get NaN for the non-numeric values:
df = df.apply(pd.to_numeric, errors='coerce')
And then compute the max:
print(df.max(axis=1))
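A minimal sketch of the coerce-then-max pattern, with made-up values including one dirty cell:
import pandas as pd

df = pd.DataFrame({'a': ['1.2', '3.4'], 'b': ['5.6', 'oops']})
df = df.apply(pd.to_numeric, errors='coerce')  # 'oops' becomes NaN
print(df.max(axis=1))                          # NaN cells are skipped by default (skipna=True)
# 0    5.6
# 1    3.4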

How to extract non-numeric special character values in a large Series object?

This is a follow-up to this post.
I'm working on a large dataset which is very messy, containing random strings here and there with no order. I would like to check these values column by column without printing the numeric values of those columns. For example, if df.Col should be float but holds strings such as #$, ^%#, etc., is there a way to view/print only those?
EDIT:
For example I have a column Client_Income.
print(str(df["Client_Income"].name).upper())
print("No. of unique values:", df["Client_Income"].nunique())
print(df["Client_Income"].unique())
CLIENT_INCOME
No. of unique values: 1217
['6750' '20250' '18000' ... '13140' '9764.1' '12840.75']
While this suggests that the column could be purely float data deliberately stored as object, when I do .astype("float16"), for example, I get the error Cannot convert to float: '$'. Other columns contain values such as 'XNA', '##', etc.
Use to_numeric with errors='coerce' to convert the values to numeric; where conversion fails, NaN is created. You can then filter by NaN in the converted Series and by non-NaN in the original column:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Client_Income': ['6750', '20.250', '18000', 'XNA', '##45', '#4.5#', np.nan],
})

s = pd.to_numeric(df['Client_Income'], errors='coerce')
out1 = df.loc[s.isna() & df['Client_Income'].notna(), 'Client_Income']
print(out1)

3       XNA
4      ##45
5    #4.5#
Name: Client_Income, dtype: object
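An alternative sketch, if you would rather match the offending characters directly instead of round-tripping through to_numeric (the regex below is an assumption; extend it if your data can legitimately contain signs or thousands separators):
mask = df['Client_Income'].str.contains(r'[^0-9.]', na=False)
print(df.loc[mask, 'Client_Income'])  # same three rows: XNA, ##45, #4.5#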

How to convert varchar to int/float in pandas

My data is coming from a MySQL table.
id, revenue, cost, and state are varchar columns.
I need to do get_dummies (one-hot encoding) for my categorical variable, which is state only.
If I read directly from csv (pd.read_csv), I get the dtypes of id, revenue, cost as int/float and state as object.
My question is how to convert object columns to int64/float where they are numeric, while keeping the category variable as object.
There is a chance that strange characters like ? or - might appear in revenue; I still want this column to be numeric.
What I have done:
To fix this right now I changed the varchar to int in the database directly, and the issue got fixed.
But I need to do it in pandas.
df.apply(pd.to_numeric, errors='coerce').fillna(df) still does not change the dtype of my int/float columns such as id, revenue, cost.
I think it is first necessary to test the dtypes after pd.read_csv:
print(df.dtypes)
Then convert the columns to numeric. Note that you cannot replace the missing values from the original column afterwards, because you would get mixed values (numeric with strings):
cols = ['id', 'revenue', 'cost']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
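For the one-hot encoding part of the question, once the numeric columns are converted you can leave state as object and dummy-encode it, for example:
df = pd.get_dummies(df, columns=['state'])  # one indicator column per state value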

Removing Decimal from a column extracted from a dataframe using pandas

I have a dataset in an Excel file. I read the data into a dataframe "df" using read_excel.
During this process, I observed that col1 from df contains decimals, when it should only have whole numbers with 4 digits.
So I have two questions here:
Why is it returning decimals when the source data does not have any?
How can I remove the decimals in the result column?
I have tried astype(int) and astype(float). I assumed the decimals could be caused by a few empty values, so I used fillna(0).
df_A = pd.read_excel("path\filename.xls")
Data = {
    "A": df_A['col1'].fillna(0)
    # also tried: "A": df_A['col1'].astype(int)
}
df_B = pd.DataFrame(Data)
Expected column values: "5124, 5487, 9487, 3598"
Actual column values: "5124.0, 5487.0, 9487.0, 3598.0"
Since df_A is a dataframe, you can fillna and then convert the column to int.
df_A['col1'] = df_A['col1'].fillna(0).astype(int)
Since you are getting the error invalid literal for int() with base 10: with the above code, it means there are some non-numeric values in your data that cannot be converted to int. Use pd.to_numeric to coerce those values to NaN, and then use the above code:
df_A['col1'] = pd.to_numeric(df_A['col1'], errors = 'coerce')
df_A['col1'] = df_A['col1'].fillna(0).astype(int)
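If filling the missing entries with 0 is not acceptable, one possible alternative (pandas 0.24+) is the nullable integer dtype, which keeps missing values as <NA> while dropping the decimal point:
df_A['col1'] = pd.to_numeric(df_A['col1'], errors='coerce').astype('Int64')  # note the capital I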
