I am using pandas to read a csv file. The data are numbers but stored in the csv file as text. Some of the values are non-numeric when they are bad or missing. How do I filter out these values and convert the remaining data to integers?
I assume there is a better/faster way than looping over all the values and using isdigit() to test for them being numeric.
Does pandas or numpy have a way of just recognizing bad values in the reader? If not, what is the easiest way to do it? Do I have to specify the dtypes to make this work?
pandas.read_csv has the parameter na_values:
na_values : list-like, default None
List of additional strings to recognize as NA/NaN
where you can define these bad values.
You can pass a custom list of values to be treated as missing using pandas.read_csv. Alternatively, you can pass functions to the converters argument.
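As a minimal sketch of the na_values approach (the inline CSV text, the column names, and the "bad" marker are made up for illustration), the bad entries become NaN on read, after which the affected rows can be dropped and the column cast to int. pd.to_numeric with errors="coerce" is shown as an alternative for when the bad markers are not known in advance:

```python
import io
import pandas as pd

# Hypothetical data: "bad" and the empty field stand in for bad/missing values.
csv_text = "id,count\n1,10\n2,bad\n3,\n4,40\n"

# Treat "bad" as NaN in addition to pandas' default NA strings.
df = pd.read_csv(io.StringIO(csv_text), na_values=["bad"])

# Drop rows with a missing count, then cast the remaining values to int.
clean = df.dropna(subset=["count"]).astype({"count": int})

# Alternative when the markers are unknown: read as text, coerce failures to NaN.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)
coerced = pd.to_numeric(raw["count"], errors="coerce")
```

Neither variant needs an explicit Python loop over the values.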
NumPy provides the function genfromtxt() specifically for this purpose. The first sentence from the linked documentation:
Load data from a text file, with missing values handled as specified.
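A small sketch of the genfromtxt() route, assuming a two-column, comma-delimited layout invented for illustration. With the default float dtype, fields that cannot be parsed come back as nan, so they can be masked out before the integer cast:

```python
import io
import numpy as np

# Hypothetical data: the second column holds one bad and one empty value.
text = "1,10\n2,bad\n3,\n4,40\n"

# Unparseable and empty fields are returned as nan for the default float dtype.
arr = np.genfromtxt(io.StringIO(text), delimiter=",")

# Keep only the rows whose second column is a real number, then cast to int.
valid = arr[~np.isnan(arr[:, 1])]
counts = valid[:, 1].astype(int)
```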
Is there a way to read each data type as it is by a PCollection from a CSV file?
By default, all the values in a row read into a PCollection are converted into a list of strings. Is there a way for an integer to be read as an integer, a float as a float, a double as a double, a string as a string, and so on,
so that PTransforms can easily be applied to each value of the row separately?
Or does it have to be done externally using a ParDo function?
The root of your issue is that a CSV file only contains strings, so it is necessary to parse the strings as whatever type you know the column contains.
A convenient way to do this is to use Beam's pandas-compatible dataframe API to read your CSV files, as in:
pipeline | read_csv(...)
This will use pandas to sample the CSV file and guess what the types may be for each column.
You can see more examples and explanation at https://beam.apache.org/documentation/dsls/dataframes/overview/#using-dataframes
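If the dataframe API is not an option, the manual parsing described above could be sketched as a plain function that casts each field by column position. The schema in COLUMN_TYPES and the name parse_row are hypothetical. A function like this could then be applied per line from a Map or ParDo step:

```python
import csv

# Hypothetical schema: the column order and types are assumptions for illustration.
COLUMN_TYPES = [int, float, str]

def parse_row(line):
    """Split one CSV line and cast each field with the type for its column."""
    fields = next(csv.reader([line]))
    return [cast(value) for cast, value in zip(COLUMN_TYPES, fields)]

row = parse_row("7,2.5,hello")
```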
The default behavior for pandas.read_csv() (as far as I can tell) is to first try to interpret column data as int, and only if that fails does it try to interpret as float.
I want to modify this behavior to prefer float or simply exclude int, but it doesn't seem like I can get there with read_csv's options. It appears the only way to modify type handling is through the dtype parameter, but that is only useful for specifying either (1) a single type for all columns or (2) a mapping of column names to dtypes.
Is there a way to get read_csv to behave in either of the two ways described? To handle this currently, I am iterating over all numeric columns after calling read_csv and then coercing them explicitly with .astype(np.float), an unnecessary extra step (in theory!).
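This is not a read_csv option, but one way to keep the post-read step compact is a single astype call over the integer columns. The sample data and column names below are invented. The builtin float64 name is used because the np.float alias was removed in NumPy 1.24:

```python
import io
import pandas as pd

# Hypothetical data: "a" is inferred as int64, "b" as float64, "c" as text.
csv_text = "a,b,c\n1,2.5,x\n3,4.5,y\n"
df = pd.read_csv(io.StringIO(csv_text))

# Collect the columns that were inferred as integers and cast them in one call.
int_cols = [col for col in df.columns if pd.api.types.is_integer_dtype(df[col])]
df = df.astype({col: "float64" for col in int_cols})
```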
I'm trying to set up a Python script that will be able to read in many fixed width data files and then convert them to csv. To do this I'm using pandas like this:
pandas.read_fwf('source.txt', colspecs=column_position_length).\
to_csv('output.csv', header=column_name, index=False, encoding='utf-8')
Where column_position_length and column_name are lists containing the information needed to read and write the data.
Within these files I have long strings of numbers representing test answers. For instance: 333133322122222223133313222222221222111133313333 represents the correct answers on a multiple choice test. So this is more of a code than a numeric value. The problem that I am having is pandas interpreting these values as floats and then writing these values in scientific notation into the csv (3.331333221222221e+47).
I found a lot of questions regarding this issue, but they didn't quite resolve my issue.
Solution 1 - I believe at this point the values have already been converted to floats so this wouldn't help.
Solution 2 - according to the pandas documentation, dtype is not supported as an argument for read_fwf in Python.
Solution 3 Use converters - the issue with using converters is that you need to specify the column name or index to convert to a data type, but I would like to read all of the columns as strings.
The second option seems to be the go-to answer for reading every column in as a string, but unfortunately it just isn't supported for read_fwf. Any suggestions?
So I think I figured out a solution, but I don't know why it works. Pandas was interpreting these values as floats because there were NaN values (blank lines) in the columns. By adding keep_default_na=False to the read_fwf() parameters, it resolved this issue. According to the documentation:
keep_default_na : bool, default True If na_values are specified and
keep_default_na is False the default NaN values are overridden,
otherwise they’re appended to.
I guess I'm not quite understanding how this is fixing my issue. Could anyone add any clarity on this?
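One plausible reading of this behavior: by default blank fields are parsed as NaN, which is a float, so a column of long digit strings with some blanks ends up as float64. With keep_default_na=False the blanks stay as empty strings, so the column is no longer treated as numeric. The sketch below uses made-up fixed-width data and assumes a reasonably recent pandas, where read_fwf also accepts dtype. Passing dtype=str as well keeps leading zeros in the id column:

```python
import io
import pandas as pd

# Hypothetical fixed-width data: a 3-character id followed by a 48-character answer key.
answers = ["333133322122222223133313222222221222111133313333", "", "1" * 48]
lines = [f"{i:03d}{a:<48}" for i, a in enumerate(answers, start=1)]
text = "\n".join(lines) + "\n"

df = pd.read_fwf(
    io.StringIO(text),
    colspecs=[(0, 3), (3, 51)],   # half-open [start, end) character ranges
    names=["id", "answers"],
    dtype=str,                    # read every column as text
    keep_default_na=False,        # keep blank fields as "" instead of NaN
)
```

Writing df out with to_csv should then preserve the digit strings exactly as read.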
I have various csv files and I import them as a DataFrame. The problem is that many files use different symbols for missing values. Some use nan, others NaN, ND, None, missing etc. or just leave the entry empty. Is there a way to replace all these values with a np.nan? In other words, any non-numeric value in the dataframe becomes np.nan. Thank you for the help.
I found what I think is a relatively elegant but also robust method:
def isnumber(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

df[df.applymap(isnumber)]
In case it's not clear: You define a function that returns True only if whatever input you have can be converted to a float. You then filter df with that boolean dataframe, which automatically assigns NaN to the cells you didn't filter for.
Another solution I tried was to define isnumber as
import numbers

def isnumber(x):
    return isinstance(x, numbers.Number)
but what I liked less about that approach is that you can accidentally have a number as a string, so you would mistakenly filter those out. This is also a sneaky error, seeing that the dataframe displays the string "99" the same as the number 99.
EDIT:
In your case you will probably still need to run df = df.applymap(float) after filtering, because float works on all the different capitalizations of 'nan', but until you explicitly convert them they will still be considered strings in the dataframe.
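A different technique from the one above, offered as a sketch: pd.to_numeric with errors="coerce" does the test and the conversion in one step, column by column. The sample frame is invented for illustration. Note also that applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so the filter above may emit a warning on newer versions:

```python
import pandas as pd

# Hypothetical frame with several different missing-value markers.
df = pd.DataFrame({"a": ["1", "ND", "3.5"], "b": ["missing", "2", "NaN"]})

# Anything that cannot be parsed as a number becomes NaN; the rest become floats.
numeric = df.apply(pd.to_numeric, errors="coerce")
```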
Replacing non-numeric entries on read, the easier (safer) way
TL;DR: Set a datatype for the column(s) that aren't casting properly, and supply a list of na_values
# Create a custom list of values I want to cast to NaN, and explicitly
# define the data types of columns:
na_values = ['None', '(S)', 'S']
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctapi': np.float64}, na_values=na_values)
Longer Explanation
I believe best practice when working with messy data is to:
Provide datatypes to pandas for columns whose datatypes are not inferred properly.
Explicitly define a list of values that should be cast to NaN.
This is quite easy to do.
Pandas read_csv has a list of values that it looks for and automatically casts to NaN when parsing the data (see the documentation of read_csv for the list). You can extend this list using the na_values parameter, and you can tell pandas how to cast particular columns using the dtype parameter.
In the example above, pctapi is the name of a column that was casting to object type instead of float64, due to NaN values. So, I force pandas to cast to float64 and provide the read_csv function with a list of values to cast to NaN.
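One caveat with a flat na_values list is that it applies to every column, so a marker like 'S' would also blank out a legitimate text value. A sketch of the per-column form, where na_values is a dict keyed by column name (the two-row sample is invented for illustration):

```python
import io
import pandas as pd

# Hypothetical data: "S" is a valid surname but a bad value in pctapi.
csv_text = "name,pctapi\nSMITH,0.5\nS,(S)\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"pctapi": "float64"},
    # Only the pctapi column treats these markers as NaN.
    na_values={"pctapi": ["(S)", "S"]},
)
```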
Process I follow
Since data science is often completely about process, I thought I'd describe the steps I use to create an na_values list and debug this issue with a dataset.
Step 1: Try to import the data and let pandas infer data types. Check if the data types are as expected. If they are, move on.
In the example above, Pandas was right on about half the columns. However, I expected all columns listed below the 'count' field to be of type float64. We'll need to fix this.
Step 2: If data types are not as expected, explicitly set the data types on read using the dtype parameter. This will throw errors by default on values that cannot be cast.
# note: the dtype dictionary specifying types. pandas will attempt to infer
# the type of any column name that's not listed
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64})
Here's the error message I receive when running the code above:
Step 3: Create an explicit list of values pandas cannot convert and cast them to NaN on read.
From the error message, I can see that pandas was unable to cast the value of (S). I add this to my list of na_values:
# note the new na_values argument provided to read_csv
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64}, na_values=['(S)'])
Finally, I repeat steps 2 & 3 until I have a comprehensive list of dtype mappings and na_values.
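The loop over steps 2 and 3 can be sketched on a tiny inline sample (the data here is invented, not the census file): the first read fails on the value that cannot be cast, the error message is captured to identify it, and the second read succeeds once that value is listed in na_values.

```python
import io
import pandas as pd

# Hypothetical two-row sample with one value that cannot be cast to float.
csv_text = "name,pctwhite\nSMITH,70.9\nLEE,(S)\n"
dtypes = {"pctwhite": "float64"}

# Step 2: forcing the dtype raises a ValueError on the offending value.
try:
    pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Step 3: add the offending value to na_values and read again.
last_names = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, na_values=["(S)"])
```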
If you're working on a hobbyist project, this method may be more than you need, and you may want to use u/instant's answer instead. However, if you're working in production systems or on a team, it's well worth the 10 minutes it takes to correctly cast your columns.
I'm tearing my hair out a bit with this one. I've imported two CSVs into pandas dataframes. Both have a column called SiteReference, and I want to use pd.merge to join the dataframes using SiteReference as a key.
The initial merge failed because pd.read_csv interpreted the SiteReference values differently: in one instance as 380500145.0, in the other as 380500145, both stored as objects. I ran a regex to clean the columns and then pd.to_numeric, which resulted in one value of 380500145.0 and another of 3.805001e+10. They should both be 380500145. I then attempted:
df['SiteReference'] = df['SiteReference'].astype(int).astype('str')
But got back:
ValueError: cannot convert float NaN to integer
How can I control how pandas is dealing with these, preferably on import?
Perhaps the best solution is to prevent pd.read_csv from changing the type of this field:
df=pd.read_csv('data.csv',sep=',',dtype={'SiteReference':str})
Following the discussion in the comments, if you want to format floats as integer strings, you can use this:
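df['SiteReference'] = df['SiteReference'].map('{:.0f}'.format)
This should handle null values gracefully: NaN is formatted as the string 'nan' rather than raising an error. (A ',' in the format spec would insert thousands separators, which is not wanted for a merge key.)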
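Another option, not taken from the answers above, is pandas' nullable integer dtype "Int64", which can hold missing values and so avoids the "cannot convert float NaN to integer" error. A minimal sketch with invented two-column files:

```python
import io
import pandas as pd

# Hypothetical inputs: one file stores the key as a float with a missing row,
# the other stores it as a plain integer.
left = pd.read_csv(io.StringIO("SiteReference,a\n380500145.0,x\n,y\n"))
right = pd.read_csv(io.StringIO("SiteReference,b\n380500145,z\n"))

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>.
left["SiteReference"] = left["SiteReference"].astype("Int64")
right["SiteReference"] = right["SiteReference"].astype("Int64")

merged = left.merge(right, on="SiteReference", how="inner")
```

With both keys on the same integer dtype, the merge matches 380500145 without any string formatting.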