I have a dataframe called "modified_df". I have a column, 'Age', that I am trying to aggregate (calculating things like the mean). Currently, its datatype shows as "object", which is why I am not able to aggregate it. I have cleaned through it, and everything seems to be an integer, but there is a chance I missed something.
I tried running this code
modified_df['Age'] = modified_df['Age'].astype('int')
I have attached the error message along with what "Age" looks like.
You can try two different things.
Option 1: (converts to float instead. This might not work either, but it will rule out the case where some ages have values that can't be an int but can be a float.)
modified_df['Age'] = modified_df['Age'].astype('float')
Option 2: (ignores whatever is causing the error and returns the original values)
modified_df['Age'] = modified_df['Age'].astype('int', errors='ignore')
As the error suggests, there are some values in the "Age" column that cannot be converted to int. Try using value_counts() to explore the column and find them, or drop the rows that hold non-integer values. Try doing:
modified_df['Age'] = modified_df['Age'].astype('int', errors='ignore')
See the astype() documentation here.
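A minimal sketch combining both suggestions, assuming the column is named 'Age' as in the question (the sample values here are made up):

import pandas as pd

# Made-up data standing in for the real modified_df.
modified_df = pd.DataFrame({'Age': ['25', '31', 'unknown', '40']})

# Inspect the distinct values to spot anything that is not a whole number.
print(modified_df['Age'].value_counts())

# Option 1: float conversion (still fails on text like 'unknown').
# modified_df['Age'] = modified_df['Age'].astype('float')

# Option 2: keep the original values wherever the cast fails.
modified_df['Age'] = modified_df['Age'].astype('int', errors='ignore')
print(modified_df['Age'].dtype)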
I have a very large dataframe where only the first two columns are not booleans. However, everything is brought in as a string because of the source. The True/False fields also contain actual blanks (not NaN), and the values are spelled out as 'True' and 'False'.
I'm trying to come up with a dynamic-ish way to do this without typing out or listing every column.
ndf.iloc[:,2:].astype(bool)
That at least runs and changes the columns to bool, but when I add 'inplace=True' it has no effect on the stored dtypes. I've also tried the code below with no luck. It runs but doesn't actually do anything that I can tell.
ndf.iloc[:,2:] = ndf.iloc[:,2:].astype(bool)
Ultimately, I need to write this table back into a database as 0s and 1s. I'm not the most versed in booleans and am hoping there is an easy one-liner for this that I don't know yet.
Actually
ndf.iloc[:,2:] = ndf.iloc[:,2:].astype(bool)
should work and change your data from str/object to bool. It's just that the string 'True' and the boolean True print the same, so the output looks unchanged. Check with ndf.dtypes to see the change after that command.
If you want the booleans as 0 and 1, try:
ndf.iloc[:,2:] = ndf.iloc[:,2:].astype(bool).astype(int)
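One caveat worth checking, assuming the blanks really are empty strings and the values are the spelled-out words 'True' and 'False' as described: astype(bool) treats every non-empty string as True, including the literal string 'False', so an explicit mapping may be safer. A minimal sketch with made-up column names:

import pandas as pd

# Two non-bool columns followed by spelled-out True/False columns with blanks.
ndf = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['a', 'b', 'c'],
    'flag1': ['True', 'False', ''],
    'flag2': ['False', 'True', 'True'],
})

# Map the strings (and blanks) straight to 0/1 for the database write.
mapping = {'True': 1, 'False': 0, '': 0}
converted = ndf.iloc[:, 2:].apply(lambda col: col.map(mapping)).astype(int)
ndf = pd.concat([ndf.iloc[:, :2], converted], axis=1)

print(ndf.dtypes)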
I have a dataframe with a column test that contains test names, which I am using to extract information about the grade a test was written for. Because I know that the string used for the test name always has the grade as the next digit after the date, I have been extracting that data using this line of code:
df['Grade'] = df['test'].apply(lambda x: str(list(filter(str.isdigit, x[10:]))[0]))
This line, however, gives a TypeError: 'float' object is not subscriptable. Now, I should note that before I ran this, I did a check with df.dtypes and the column test was listed as object. That makes sense, as the strings for the test names are something like 2015-2016_math_grade_7, so there is no way they could be seen as floats by Pandas.
I have checked, and test names are the only data in that column, so I have no idea why I am getting this type error. No matter what I change the code to, I get this error, because I need to perform a string operation after slicing x. (I have also used df['Grade'] = df['test'].apply(lambda x: str(re.sub(r"\D", "", str(x[:10]))))).
I should also note, that I have used this code before and it worked perfectly, but for some reason on this data set it seems to fail, if that helps.
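For what it's worth, a common cause of this exact error is missing values: pandas stores NaN as a float, so .apply hands the lambda a float and slicing it fails. That's only a guess about this dataset, but it's easy to check and to guard against:

import pandas as pd

# Made-up data: one valid test name and one missing entry.
df = pd.DataFrame({'test': ['2015-2016_math_grade_7', None]})

# Missing entries arrive in the lambda as float NaN.
print(df['test'].isna().sum())

# Only slice real strings; leave everything else as missing.
df['Grade'] = df['test'].apply(
    lambda x: next((c for c in x[10:] if c.isdigit()), None) if isinstance(x, str) else None
)
print(df)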
I am quite new to Python coding, and I am dealing with a big dataframe for my internship.
I ran into an issue because sometimes there are wrong values in my dataframe. For example, I find string values such as "broken leaf" instead of expected values such as "120 cm" or NaN.
I know there is the df.replace() function, but to use it you need to know which values are wrong. So how do I find out whether there are any wrong values in my dataframe?
Thank you in advance
"120 cm" is a string, not an integer, so that's a confusing example. Some ways to find "unexpected" values include:
Use "describe" to examine the range of numerical values, to see if there are any far outside of your expected range.
Use "unique" to see the set of all values for cases where you expect a small number of permitted values, like a gender field.
Look at the datatypes of columns to see whether there are strings creeping into fields that are supposed to be numerical.
Use regexps if valid values for a particular column follow a predictable pattern.
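A minimal sketch of those checks, with a made-up column name and values:

import pandas as pd

df = pd.DataFrame({'height': ['120 cm', '118 cm', 'broken leaf', '130 cm']})

# Data types: an object column where you expected numbers is a red flag.
print(df.dtypes)

# Distinct values: useful when a column has a small set of permitted values.
print(df['height'].unique())

# Range check on whatever does parse as a number (anything else becomes NaN).
numeric_height = pd.to_numeric(df['height'].str.replace(' cm', '', regex=False), errors='coerce')
print(numeric_height.describe())

# Pattern check: flag rows that don't match the expected "<number> cm" format.
print(df[~df['height'].str.match(r'^\d+ cm$')])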
I have various csv files and I import them as a DataFrame. The problem is that many files use different symbols for missing values. Some use nan, others NaN, ND, None, missing, etc., or just leave the entry empty. Is there a way to replace all these values with np.nan? In other words, any non-numeric value in the dataframe becomes np.nan. Thank you for the help.
I found what I think is a relatively elegant but also robust method:
def isnumber(x):
    try:
        float(x)
        return True
    except:
        return False

# Keep only the cells that can be converted to float; everything else becomes NaN.
df[df.applymap(isnumber)]
In case it's not clear: You define a function that returns True only if whatever input you have can be converted to a float. You then filter df with that boolean dataframe, which automatically assigns NaN to the cells you didn't filter for.
Another solution I tried was to define isnumber as
import numbers

def isnumber(x):
    return isinstance(x, numbers.Number)
but what I liked less about that approach is that you can accidentally have a number as a string, so you would mistakenly filter those out. This is also a sneaky error, seeing that the dataframe displays the string "99" the same as the number 99.
EDIT:
In your case you probably still need to do df = df.applymap(float) after filtering, because float works on all the different capitalizations of 'nan', but until you explicitly convert them they will still be considered strings in the dataframe.
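Putting the pieces together, a minimal end-to-end sketch (column names and values made up; note that very recent pandas versions prefer DataFrame.map over applymap):

import pandas as pd

def isnumber(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

# Made-up data mixing numbers, blanks, and several missing-value spellings.
df = pd.DataFrame({'a': ['1.5', 'ND', 'nan', '3'], 'b': ['2', 'missing', '', '4.0']})

# Keep only the cells that parse as floats; everything else becomes NaN.
cleaned = df[df.applymap(isnumber)]

# Cast the surviving strings (including the 'nan' spellings) to real floats.
cleaned = cleaned.applymap(float)
print(cleaned.dtypes)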
Replacing non-numeric entries on read, the easier (and safer) way
TL;DR: Set a datatype for the column(s) that aren't casting properly, and supply a list of na_values
import numpy as np
import pandas as pd

# Create a custom list of values I want to cast to NaN, and explicitly
# define the data types of columns:
na_values = ['None', '(S)', 'S']
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctapi': np.float64}, na_values=na_values)
Longer Explanation
I believe the best practice when working with messy data is to:
Provide datatypes to pandas for columns whose datatypes are not inferred properly.
Explicitly define a list of values that should be cast to NaN.
This is quite easy to do.
Pandas read_csv has a list of values that it looks for and automatically casts to NaN when parsing the data (see the documentation of read_csv for the list). You can extend this list using the na_values parameter, and you can tell pandas how to cast particular columns using the dtype parameter.
In the example above, pctapi is the name of a column that was read as object instead of float64, because some of its values could not be parsed as numbers. So, I force pandas to cast it to float64 and provide the read_csv function with a list of values to cast to NaN.
Process I follow
Since data science is often largely about process, I thought I'd describe the steps I use to build an na_values list and debug this issue with a dataset.
Step 1: Try to import the data and let pandas infer data types. Check whether the data types are as expected. If they are, move on.
In the example above, Pandas was right on about half the columns. However, I expected all columns listed below the 'count' field to be of type float64. We'll need to fix this.
Step 2: If the data types are not as expected, explicitly set them on read using the dtype parameter. By default, this will throw errors on values that cannot be cast.
# note: the dtype dictionary specifies the column types. pandas will attempt to infer
# the type of any column that's not listed
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64})
Here's the error message I receive when running the code above:
Step 3: Create an explicit list of values pandas cannot convert and cast them to NaN on read.
From the error message, I can see that pandas was unable to cast the value '(S)'. I add this to my list of na_values:
# note the new na_values argument provided to read_csv
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64}, na_values=['(S)'])
Finally, I repeat steps 2 & 3 until I have a comprehensive list of dtype mappings and na_values.
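The end state of that loop might look something like this (the dtype entries and na_values are the ones from the examples above; a real dataset may need more of both):

import numpy as np
import pandas as pd

# Accumulated mappings after a few rounds of steps 2 and 3.
dtypes = {'pctwhite': np.float64, 'pctapi': np.float64}
na_values = ['None', '(S)', 'S']

last_names = pd.read_csv('names_2010_census.csv', dtype=dtypes, na_values=na_values)
print(last_names.dtypes)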
If you're working on a hobbyist project, this method may be more than you need, and you may want to use u/instant's answer instead. However, if you're working on production systems or on a team, it's well worth the 10 minutes it takes to correctly cast your columns.
My PANDAS data has columns that were read as objects. I want to change these into floats. Following the post linked below (1), I tried:
pdos[cols] = pdos[cols].astype(float)
But PANDAS gives me an error saying that an object can't be recast as float.
ValueError: invalid literal for float(): 17_d
But when I search for 17_d in my data set, it tells me it's not there.
>>> '17_d' in pdos
False
I can look at the raw data to see what's happening outside of Python, but I feel that if I'm going to take Python seriously, I should know how to deal with this sort of issue. Why doesn't this search work? How can I search object columns for strings in pandas? Any advice?
Pandas: change data type of columns
Of course it says it's not there, because you're only looking in the column list!
'17_d' in pdos
checks to see if '17_d' is in pdos.columns
So what you want to do is pdos[cols] == '17_d', which will give you a truth table. If you want to find which row it is, you can do (pdos[cols] == '17_d').any(axis=1).
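A small sketch of the difference, with made-up data that includes the stray '17_d' value:

import pandas as pd

pdos = pd.DataFrame({'a': ['1.0', '17_d', '3.2'], 'b': ['2.5', '4.0', '5.1']})
cols = ['a', 'b']

# `in` on a DataFrame only checks the column labels, so this prints False.
print('17_d' in pdos)

# Element-wise comparison gives a truth table over the values instead.
mask = pdos[cols] == '17_d'
print(mask)

# Rows that contain the offending string anywhere in the selected columns.
print(pdos[mask.any(axis=1)])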