I have an Excel file where a column name might be a number, e.g. 2839238. I am reading it using pd.read_excel(bytes(filedata), engine='openpyxl') and, for some reason, this column name gets converted to the float 2839238.0. How can I disable this conversion?
This is an issue for me because I then operate on column names using string-only methods like df = df.loc[:, ~df.columns.str.contains('^Unnamed')], and it gives me the following error:
TypeError: bad operand type for unary ~: 'float'
Column names are arbitrary.
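A minimal sketch of one possible workaround, using a made-up frame in place of the real file: normalize every column name to a string after reading, so the string-only methods work no matter how the header was parsed.

import pandas as pd

# Hypothetical stand-in for the frame read from Excel; note the float column name.
df = pd.DataFrame({2839238.0: [1, 2], 'Unnamed: 1': [3, 4]})

def normalize_name(col):
    # Turn integer-valued floats back into plain integer strings.
    if isinstance(col, float) and col.is_integer():
        return str(int(col))
    return str(col)

df.columns = [normalize_name(c) for c in df.columns]
print(df.columns.tolist())   # ['2839238', 'Unnamed: 1']

# String-only accessors now work on every name:
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]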
Try changing the type of the columns:
df['col'] = df['col'].astype(int)
The number in your example suggests you may be working with large values. If a number is too big to be handled as an int, it can only be represented as a float or double. Check the ranges of the numeric data types, compare them to your data, and see which one you can use.
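For reference, a quick sketch for checking those ranges (note that 2839238 fits comfortably even in a 32-bit integer, so overflow is unlikely to be the cause here):

import numpy as np

# Bounds of the fixed-width integer dtypes pandas uses under the hood:
print(np.iinfo(np.int32).min, np.iinfo(np.int32).max)   # -2147483648 2147483647
print(np.iinfo(np.int64).min, np.iinfo(np.int64).max)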
Verify that you don't have any duplicate column names. Pandas will append .1, .2, and so on if there is another instance of 2839238 as a header name.
See the description of the mangle_dupe_cols parameter (bool), which says:
Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
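A tiny illustration of that mangling (a sketch using read_csv, where the same renaming applies):

import io
import pandas as pd

# Two columns share the header 2839238; pandas de-duplicates the names.
csv = "2839238,2839238\n1,2\n"
df = pd.read_csv(io.StringIO(csv))
print(df.columns.tolist())   # ['2839238', '2839238.1']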
I read a csv file in as a pandas dataframe but now need to use the data to do some calculations.
import pandas as pd
### LOAD DAQ FILE
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(cal_file,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
### DEFINE SIGNAL CHANNELS
ch104 = df.ch104.dropna() #gets rid of None value at the end of the column
print(ch104)
When I print ch104 the values look as expected, but I cannot do math on it currently, as it is a pandas.Series of strings; the datatype is not correct yet.
The error if I do calculations is this:
can't multiply sequence by non-int of type 'float'
So what I tried was to use .tolist() or even list() on the data, but then ch104 looks wrong as well; I believe the values are being written as bytes and then stored as a list of strings.
Does anyone know how I can get around this or fix this issue? It may be because the original file is UTF-16 LE encoded, but I cannot change this and need to use the files as is.
I need the values for simple calculations later on but they would need to be a float or a double for that. What can I do?
You probably get this error because you're trying to do calculations on columns that pandas considers non-numeric. The values (numbers in your sense) are for some reason interpreted as strings (in the pandas sense).
To fix that, you can change the type of those columns using pandas.to_numeric:
import re

df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())  # get rid of the extra whitespace

cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

ch104 = df.ch104.dropna()
# do your calculations on the ch104 series here
Note that the 'coerce' argument will put NaN in place of every bad value in your columns. From the pandas.to_numeric documentation:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'. If 'coerce', then invalid parsing will be set as NaN.
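A quick sketch of that behaviour with made-up values:

import pandas as pd

s = pd.Series(['1.5', '2.7', 'oops'])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.5
# 1    2.7
# 2    NaN
# dtype: float64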
I have a dataframe with a column named 'x'.
This column describes the size and type of houses (e.g. 35A, 9B, 50C, ...), is of dtype 'object', and contains missing values.
I want to extract only the numbers from this column and convert them to a numeric type.
What should I do in this case?
I tried the following, but it didn't work:
df['x'] = df['x'].str[0:2]
df['x'] = pd.to_numeric(df['x'])
Output
ValueError: Unable to parse string "9A" at position 3766
I would use str.extract here:
df['x'] = pd.to_numeric(df['x'].str.extract(r'^(\d+)', expand=False))
The challenge with trying to use a pure substring approach is that we don't necessarily know how many characters to take. Regex gets around this problem.
You were making the assumption that, for the strings in the x column, the first two characters will always be digits. Unfortunately, there is a row where x is 9A, which doesn't convert to a numeric value.
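An illustration with made-up values of why the substring approach breaks while the regex works:

import pandas as pd

s = pd.Series(['35A', '9B', '50C', None])

# s.str[0:2] yields '35', '9B', '50'; '9B' cannot be parsed as a number.
# str.extract grabs only the leading digits, however many there are:
print(pd.to_numeric(s.str.extract(r'^(\d+)', expand=False)))
# 0    35.0
# 1     9.0
# 2    50.0
# 3     NaN
# dtype: float64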
I am trying to convert a pandas data frame (read in from a .csv file) from string to float. Columns 1 through 20 are recognized as "strings" by the system; however, they are float values in the format "10,93847722". Therefore, I tried the following code:
new_df = df[df.columns[1:20]].transform(lambda col: col.str.replace(',','.').astype(float))
The last line causes the Error:
AttributeError: 'DataFrame' object has no attribute 'transform'
Maybe important to know: I can only use pandas version 0.16.2.
Thank you very much for your help!
A short extract from one of the columns:
23,13854599
23,24945831
23,16853714
23,0876255
23,05908775
Use DataFrame.apply:
df[df.columns[1:20]].apply(lambda col: col.str.replace(',','.').astype(float))
EDIT: If non-numeric values are possible, use to_numeric with errors='coerce' to replace those values with NaNs:
df[df.columns[1:20]].apply(lambda col: pd.to_numeric(col.str.replace(',','.'),errors='coerce'))
You should load them directly as numbers:
pd.read_csv(..., decimal=',')
This will recognize ',' as the decimal separator for every column.
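A small sketch with inline data standing in for the real file (assuming the fields themselves are separated by something other than a comma, here ';'):

import io
import pandas as pd

csv = "id;value\n1;23,13854599\n2;23,24945831\n"
df = pd.read_csv(io.StringIO(csv), sep=';', decimal=',')
print(df['value'].dtype)   # float64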
I have a Dataframe where a column is full of observations which could be numbers or strings. Currently, the data type of this column (called Code) is an object. How can I change the data type of this column to integers just for those observations that are numbers? I tried the following code:
df['Code'].astype(str).astype(int)
But of course it doesn't work, because it tries to convert the strings into numbers.
Thank you!
How about using str.isnumeric() to create a mask of True or False? You won't be able to change the data type of only part of the records, because pandas (and numpy) require that everything in a column be the same data type. You could save only the numbers as a new dataframe and then convert the column type.
https://www.tutorialspoint.com/python/string_isnumeric.htm
# create a Boolean mask to filter numbers only
mask_numeric = df['Code'].str.isnumeric()
# filter only the True records in the mask
df_numbers = df.loc[mask_numeric]
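A self-contained sketch of the full round trip, with made-up data for the 'Code' column:

import pandas as pd

df = pd.DataFrame({'Code': ['123', 'A7', '456', 'xyz']})

mask_numeric = df['Code'].str.isnumeric()
df_numbers = df.loc[mask_numeric].copy()

# The surviving rows contain only digits, so the cast is now safe:
df_numbers['Code'] = df_numbers['Code'].astype(int)
print(df_numbers.dtypes)   # Code    int64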
Recently I have been developing some code to read a csv file and store key data columns in a dataframe. Afterwards I plan to have some mathematical functions performed on certain columns in the dataframe.
I've been fairly successful in storing the correct columns in the dataframe. I have been able to have it do whatever maths is necessary such as summations, additions of dataframe columns, averaging etc.
My problem lies in accessing specific columns once they are stored in the dataframe. I was working with a test file to get everything working and managed this no problem. The problems arise when I open a different csv file: it will store the data in the dataframe, but accessing the column I want no longer works, and it stops at the calculation part.
From what I can tell the problem lies with how it reads the column name. The column names are all numbers. For example, df['300'], df['301'] etc. When accessing the column df['300'] works fine in the testfile, while the next file requires df['300.0']. If I switch to a different file it may require df['300'] again. All the data was obtained in the same way so I am not certain why some are read as 300 and the others 300.0.
Short of constantly changing the column labels each time I open a different file, is there any way to have it automatically distinguish between '300' and '300.0' when opening the file, or force '300.0' = '300'?
Thanks
In your dataframe df, one way to keep things consistent may be to convert all the column names to the same type. You can update every column name from the string of a float to the string of an integer, i.e. '300.0' to '300', using .columns as below. Then the integer-style string should work for every column, i.e. df['300'].
df.columns = [str(int(float(column))) for column in df.columns]
Or, if the integer value is not required, the extra int conversion can be removed and the float string value used:
df.columns = [str(float(column)) for column in df.columns]
Then, df['300.0'] can be used instead of df['300'].
If string type is not required, then converting the names to float would work as well:
df.columns = [float(column) for column in df.columns]
Then, df[300.0] would work as well.
Another alternative for changing the column names is to use map.
Changing to float values for all columns, then, as mentioned above, using df[300.0]:
df.columns = map(float, df.columns)
Changing to string value of float, then df['300.0']:
df.columns = map(str, map(float, df.columns))
Changing to string value of int, then df['300']:
df.columns = map(str, map(int, map(float, df.columns)))
Some solutions:
Go through all the files, change the columns names, then save the result in a new folder. Now when you read a file, you can go to the new folder and read it from there.
Wrap the normal file read function in another function that automatically changes the column names, and call that new function when you read a file.
Wrap column selection in a function. Use a try/except block to have the function try to access the given column and, if that fails, use the other form (see the sketch after this list).
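A sketch of that third option (the function name and fallback rule are illustrative):

import pandas as pd

def get_column(df, name):
    # Try the plain name first, then fall back to the '.0' spelling.
    try:
        return df[name]
    except KeyError:
        return df[name + '.0']

df = pd.DataFrame({'300.0': [1, 2, 3]})
print(get_column(df, '300'))   # falls back to df['300.0']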
This answer assumes you want only the integer part to remain in the column name. It takes the column names and does a float->int->string conversion to strip the decimal places.
Be careful, if you have numbers like '300.5' as a column name, this will turn them into '300'.
cols = df.columns.tolist()
new_columns = dict([(c,str(int(float(c)))) for c in cols])
df = df.rename(columns = new_columns)
For clarity, most of the 'magic' is happening on the middle line. I iterate over the currently existing columns, and turn them into tuples of the form (old_name, new_name). df.rename takes that dictionary and then does the renaming for you.
My thanks to user Nipun Batra for this answer that explained df.rename.