How to find if there are wrong values in a pandas dataframe? - python

I am quite new to Python coding, and I am dealing with a big dataframe for my internship.
I have an issue: sometimes there are wrong values in my dataframe. For example, I find string values ("broken leaf") where I expect integer-type values such as ("120 cm") or (NaN).
I know there is the df.replace() function, but to use it you need to know that there are wrong values. So how do I find out whether there are any wrong values in my dataframe?
Thank you in advance

"120 cm" is a string, not an integer, so that's a confusing example. Some ways to find "unexpected" values include:
Use "describe" to examine the range of numerical values, to see if there are any far outside of your expected range.
Use "unique" to see the set of all values for cases where you expect a small number of permitted values, like a gender field.
Look at the datatypes of columns to see whether there are strings creeping in to fields that are supposed to be numerical.
Use regexps if valid values for a particular column follow a predictable pattern.
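A minimal sketch of these four checks, assuming a made-up dataframe with a "height" column that should look like "<digits> cm" and a "sex" column with a small set of permitted values (the column names and data are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical sample data: a measurement column polluted with free text.
df = pd.DataFrame({
    "height": ["120 cm", "95 cm", "broken leaf", None, "4000 cm"],
    "sex": ["F", "M", "f", "M", "unknown"],
})

# Datatypes: a text dtype on a column that should be numeric is a warning sign.
print(df.dtypes)

# unique: list the distinct values of a column with few permitted values.
print(df["sex"].unique())

# Regexp: flag non-missing entries that do not match "<digits> cm".
non_null = df["height"].dropna()
bad = non_null[~non_null.str.fullmatch(r"\d+ cm")]
print(bad)

# describe: extract the numeric part, then inspect the range for outliers.
heights = df["height"].str.extract(r"^(\d+) cm$")[0].astype(float)
print(heights.describe())
```

Here `bad` holds the entries that break the pattern, and the `max` row of `describe()` makes the suspicious 4000 stand out.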

Related

showing cells with a particular symbols in pandas dataframe

I have not seen a question like this, so if you happen to know the answer or have seen the same question, please let me know.
I have a dataframe in pandas with 4 columns and 5k rows. One of the columns is "price" and I need to do some manipulations with it. But the data was parsed from a web page and is not clean, so I cannot convert this column to integer type after getting rid of the dollar sign and commas. I found out that it also contains data in the format 3500/mo, so I need to filter the cells with /mo and decide whether I can drop them, based on how many of them I have and what the prices are.
So far I have managed to count those cells using
df["price"].str.contains("/").sum()
But when I want to see those cells, I cannot, because when I create another variable to extract the slash-containing cells and use "contains" or something similar, I get a series of True/False values showing whether each cell contains the slash, while I actually need to see the cells themselves. Any ideas?
You need to use the boolean mask returned by df["price"].str.contains("/") as index to get the respective rows, i.e., df[df["price"].str.contains("/")] (cf. the pandas docs on indexing).
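As a small illustration of that indexing, here is a sketch with invented prices (the data is hypothetical). It passes regex=False so the slash is matched literally, which is an addition to the expression in the question:

```python
import pandas as pd

# Hypothetical scraped prices, some of them monthly rates.
df = pd.DataFrame({"price": ["$1,200", "3500/mo", "$950", "$2,100/mo"]})

# Boolean mask: True where the cell contains a slash.
mask = df["price"].str.contains("/", regex=False)

# Count the matches, then use the same mask to see the rows themselves.
print(mask.sum())
monthly = df[mask]
print(monthly)
```

If the column may contain missing values, passing na=False to str.contains keeps the mask purely boolean.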

How does the isnull() method work to return all rows that are missing in my data frame?

I'm new to Python and just trying to figure out how this small bit of code works. Hoping this'll be easy to explain without an example data frame.
My data frame, called df_train, contains a column called Age. This column is NaN for 177 records.
I submit the following code...
df_train[df_train['Age'].isnull()]
... and it returns all records that are missing.
Now if I submit df_train['Age'].isnull(), all I get is a Boolean List of values. How does the data frame object then work to convert this Boolean List to the rows we actually want?
I don't understand how passing the boolean list to the data frame again results in just the 177 records that we need - could someone please ELI5 for a newbie?
You need to create a subset of the dataframe. Suppose you want only the rows where df_train['Age'] is not null. In that case, select
df_train_to_use = df_train[df_train['Age'].notnull()]
You can then check whether any other column you want to use contains nulls:
df_train['Column_name'].isnull().any()
If this returns True, you can replace the nulls with default values, the average, zeros, or whatever method you prefer, as is commonly done when preparing data for machine learning.
Example (each call returns a new Series, so assign the result back to keep it):
df_train['Column_name'].dropna()     # drop the missing values
df_train['Column_name'].fillna('')   # for strings
df_train['Column_name'].fillna(0)    # for int
df_train['Column_name'].fillna(0.0)  # for float
Etc.
I hope this helps.
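To make the boolean-mask step itself concrete, here is a small sketch on a made-up four-row frame (the names and ages are hypothetical). Indexing a dataframe with a boolean Series keeps exactly the rows where the Series is True:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train, with two missing ages.
df_train = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee"],
    "Age": [22.0, np.nan, 35.0, np.nan],
})

# One True/False per row, aligned with the dataframe's index.
mask = df_train["Age"].isnull()

missing = df_train[mask]    # rows where the mask is True
present = df_train[~mask]   # the complement, same as using notnull()

# fillna returns a new Series; assign it back to keep the result.
df_train["Age"] = df_train["Age"].fillna(df_train["Age"].mean())
```

Filling with the mean is only one of the options mentioned above; it is used here just to show the assignment.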

Pandas Python: Delete Rows of DF That Have ASCII Letters

I have a column in my dataframe that I would like to convert to datatype int. However it is throwing an error because some of the rows have letters in their entries. I would like to create a new dataframe that only has entries in this column with pure numeric entries (or at least no letters).
So my question is: Is there a way to do something like the following,
df=df[df['addzip'].str.contains("a")==False]
But with a list where the "a" is? See the example below,
df=df[df['addzip'].str.contains(list(str(string.ascii_lowercase)+str(string.ascii_uppercase)))==False]
I know that this is very possible to do with an apply command, but I would like to keep this as vectorized as possible, so that is not what I am looking for. So far I haven't found any solutions anywhere else on Stack Overflow.
Just use a regular expression:
df = df[~df['addzip'].str.contains("[a-zA-Z]", na=False)]
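A short sketch of the regular-expression filter on invented zip codes (the data is hypothetical). It uses the na=False argument of str.contains so that missing entries count as "no letters":

```python
import pandas as pd

# Hypothetical zip codes, including a missing entry and two with letters.
df = pd.DataFrame({"addzip": ["12345", "1234a", None, "98765-4321", "ABCDE"]})

# True where the entry contains at least one ASCII letter.
has_letter = df["addzip"].str.contains("[a-zA-Z]", na=False)

# Keep only the rows without letters (missing entries are kept too).
clean = df[~has_letter]
print(clean)
```

Note that the missing entry and "98765-4321" survive the filter, so a later conversion to int would still need to deal with them.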

changing column types of a pandas data frame -- finding offending rows that prevent casting

My PANDAS data has columns that were read as objects. I want to change these into floats. Following the post linked below (1), I tried:
pdos[cols] = pdos[cols].astype(float)
But PANDAS gives me an error saying that an object can't be recast as float.
ValueError: invalid literal for float(): 17_d
But when I search for 17_d in my data set, it tells me it's not there.
>>> '17_d' in pdos
False
I can look at the raw data to see what's happening outside of python, but feel if I'm going to take python seriously, I should know how to deal with this sort of issue. Why doesn't this search work? How could I do a search over objects for strings in PANDAS? Any advice?
(1) Pandas: change data type of columns
Of course it returns False, because you're only looking in the column labels!
'17_d' in pdos
checks whether '17_d' is in pdos.columns.
So what you want is pdos[cols] == '17_d', which gives you a boolean table. To find which rows contain the value, use (pdos[cols] == '17_d').any(axis=1).
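Here is a sketch of that search on a made-up frame (the column names and values are hypothetical). The second half is an alternative that is not part of the answer above: it uses pd.to_numeric with errors="coerce" to find every value that cannot be cast, without knowing the offending string in advance:

```python
import pandas as pd

# Hypothetical stand-in for pdos: numeric data read as strings.
pdos = pd.DataFrame({
    "a": ["1.5", "17_d", "3.0"],
    "b": ["2.0", "4.1", "5.5"],
})
cols = ["a", "b"]

# `in` on a dataframe looks at the column labels, not the cell values.
found_in_labels = "17_d" in pdos

# Elementwise comparison gives a boolean table; any(axis=1) reduces it per row.
hits = pdos[cols] == "17_d"
rows = pdos[hits.any(axis=1)]

# Alternative: coerce to numbers and see which cells turned into NaN.
converted = pdos[cols].apply(pd.to_numeric, errors="coerce")
offending = pdos[converted.isna().any(axis=1)]
print(offending)
```

The coercion approach assumes the original data has no genuine missing values in those columns; otherwise they would be flagged as well.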

Python categorize datatypes

I plan to make a 'table' class that I can use throughout my data-analysis program to store gathered data. The objective is to make simple tables like this:
ID  Mean size  Stdv   Date measured  Relative flatness
----------------------------------------------------------------
1   133.4242   34.43  Oct 20, 2013   32093
2   239.244    34.43  Oct 21, 2012   3434
I will follow the sqlite3 suggestion from this post: python-data-structure-for-maintaing-tabular-data-in-memory, but I will still need to save it as a csv file (not as a dbase) and I want it to eat my data as we go: add columns on the fly whenever new measures become available and are deemed to be interesting. For that the class will need to be able to determine the data type of the data thrown at it.
Sqlite3 has limited datatypes: float, int, date and string. Python and numpy together have many types. Is there an easy way to quickly decide what the datatype of a variable is? Then my table class can automatically add a column when new data containing new fields is entered.
I am not too concerned about performance, the table should be fairly small.
I want to use my class like so:
dt = Table()
dt.add_record({'ID':5, 'Mean size':39.4334})
dt.add_record({'ID':5, 'Goodness of fit': 12})
In the last line, there is new data. The Table class needs to figure out what kind of data that is and then add a column to the sqlite3 table. Making it all strings seems a bit too floppy; I still want to keep my high-precision floats correct.
Also: If something like this already exists, I'd like to know about it.
It seems that your question is: "Is there an easy way to quickly decide what the datatype of a variable is?" This is a simple question, and the answer is:
type(variable).
But the context you provide requires a more careful answer.
Since SQLite3 provides only a few data types (slightly different ones than what you said), you need to map your input variables to the types provided by SQLite3.
But you may encounter further problems: You may need to change the types of columns as you receive new records, if you do not want to require that the column type be fixed in advance.
For example, for the Goodness of fit column in your example, you get an int (12) first. But you may get a float (e.g. 10.1) the second time, which shows that both values must be interpreted as floats. And if next time you receive a string, then all of them must be strings, right? But then the exact formatting of the numbers also counts: whereas 12 and 12.0 are the same when you interpret them as floats, they are not when you interpret them as strings; and the first value may become "12.0" when you convert all of them to strings.
So either you throw an exception when the types of consecutive values for the same column do not match, or you try to convert the previous values according to the new ones; but occasionally you may need to re-read the input.
Nevertheless, once you make those decisions regarding the expected behavior, it should not be a very difficult problem to implement.
Regarding your last question: I personally do not know of an existing implementation to this problem.
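To illustrate the "convert according to the new values" option, here is a rough sketch that only infers a column type per field and widens it when a more general value arrives. The helper names (sqlite_type_for, widen, infer_column_types) are hypothetical, dates are not handled, and everything that is not a number simply falls back to TEXT:

```python
import numbers

# Column types ordered from most specific to most general.
_RANK = {"INTEGER": 0, "REAL": 1, "TEXT": 2}


def sqlite_type_for(value):
    """Map a Python value to a SQLite column type name (illustrative only)."""
    if isinstance(value, numbers.Integral):   # int, bool
        return "INTEGER"
    if isinstance(value, numbers.Real):       # float
        return "REAL"
    return "TEXT"                             # strings and anything else


def widen(current, new):
    """Return the more general of two column types."""
    return current if _RANK[current] >= _RANK[new] else new


def infer_column_types(records):
    """Scan records (dicts) and return the widest type seen for each field."""
    types = {}
    for record in records:
        for name, value in record.items():
            new = sqlite_type_for(value)
            types[name] = widen(types.get(name, new), new)
    return types


records = [
    {"ID": 5, "Mean size": 39.4334},
    {"ID": 5, "Goodness of fit": 12},
    {"ID": 6, "Goodness of fit": 10.1},   # widens the column to REAL
]
column_types = infer_column_types(records)
print(column_types)
```

A real Table class would still have to decide what to do with the values already stored when a column is widened, which is the formatting issue described above.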
