How to replace all non-numbers while importing a CSV into Python?

I want to import a dirty CSV file into a NumPy object. A small number of values are apparently not integers or floats, because the output does not have the correct dtype.
The code I use:
d.data = np.genfromtxt(inputtable, delimiter=";", skip_header=2, comments="#", dtype=np.float)
I would like to know if there is an easy way to simply replace all non-floats with -1, so that I do not need to find these values by hand in the 10,000+ rows.

You just have to provide a set of callbacks as the converters argument, as documented here:
converters : variable, optional
The set of functions that convert the data of a column to a value. The converters can also be used to provide a default value for missing
data: converters = {3: lambda s: float(s or 0)}.
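For the use case in the question, a minimal sketch (the file name and column count are assumptions for illustration; depending on the NumPy version, genfromtxt may hand the converter bytes or str, and float() accepts both):

import numpy as np

def safe_float(s):
    # Parse the field as a float; anything unparseable (or empty) becomes -1.
    try:
        return float(s)
    except ValueError:
        return -1.0

n_cols = 5  # assumed column count of the file
d = np.genfromtxt("dirty.csv", delimiter=";", skip_header=2, comments="#",
                  converters={i: safe_float for i in range(n_cols)})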

Related

Python: How does converters work in genfromtxt() function?

I am new to Python. I have the following example that I don't understand.
The following is a csv file with some data
%%writefile wood.csv
item,material,number
100,oak,33
110,maple,14
120,oak,7
145,birch,3
Then, the example defines a function to convert the tree names above to integers.
tree_to_int = dict(oak=1,
                   maple=2,
                   birch=3)

def convert(s):
    return tree_to_int.get(s, 0)
The first question is: why is there a "0" after "s"? I removed that "0" and got the same result.
The last step is to read the data with numpy.genfromtxt:
data = np.genfromtxt('wood.csv',
                     delimiter=',',
                     dtype=np.int,
                     names=True,
                     converters={1: convert})
I was wondering, for the converters argument, what does {1: convert} exactly mean? Especially, what does the number 1 mean in this case?
For the second question, according to the documentation (https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html), {1:convert} is a dictionary whose keys are column numbers (where the first column is column 0) and whose values are functions that convert the entries in that column.
So in this code, the 1 indicates column one of the csv file, the one with the names of the trees. Including this argument causes numpy to use the convert function to replace the tree names with their corresponding numbers in data.
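As for the first question: the 0 in tree_to_int.get(s, 0) is the default value that dict.get returns when s is not a key in the dictionary. With the clean file above every name is found, so removing the default appears to make no difference; it only matters for unexpected values:

tree_to_int = {"oak": 1, "maple": 2, "birch": 3}
tree_to_int.get("oak", 0)   # 1    -> key exists, default unused
tree_to_int.get("pine", 0)  # 0    -> key missing, default returned
tree_to_int.get("pine")     # None -> key missing, no default given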

Weird lines in python file, can't extract column

0 0: 1
0 1: 1
0 1: 0
1 0: 0
I have a file that looks like the above.
I am trying to extract it by columns into arrays using numpy.loadtxt. Ideally, I want many arrays, or at least a data structure in which the arrays are [0, 0, 0, 1], [0, 1, 1, 0]. To my utter discomfort, because of that colon after the second number, I'm unable to use numpy.loadtxt as-is. Does anyone have a solution for either working around the colon, or simply removing it, without having to pre-process the file?
np.loadtxt(file, converters={1: lambda s: int(s.strip(":"))})
From numpy.loadtxt:
converters : dict, optional
A dictionary mapping column number to a function that will convert that column to a float. E.g., if column 0 is a date string: converters = {0: datestr2num}. Converters can also be used to provide a default value for missing data (but see also genfromtxt): converters = {3: lambda s: float(s.strip() or 0)}. Default: None.

Data.Frames in Python Numpy

How do I build data.frames containing multiple types of data (strings, int, logical), with both continuous variables and factors, in Python/NumPy?
The following code makes my headers NaNs, and all but my float values NaNs:
from numpy import genfromtxt
my_data = genfromtxt('FlightDataTraining.csv', delimiter=',')
This puts a "b'data'" on all of my data, such that year becomes "b'year'"
import numpy as np
d = np.loadtxt('FlightDataTraining.csv',delimiter=',',dtype=str)
Try genfromtxt('FlightDataTraining.csv', delimiter=',', dtype=None). This tells genfromtxt to intelligently guess the dtype of each column. If that does not work, please post a sample of your CSV and what the desired output should look like.
The b in b'data' is Python's way of representing bytes as opposed to str objects. So the b'data' is okay. If you want strs, you would need to decode the bytes.
NumPy does not have a dtype for representing factors, though Pandas does have a pd.Categorical type.
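A minimal sketch of that suggestion (the file contents here are invented, since the real CSV was not posted; passing encoding="utf-8" makes genfromtxt return strs instead of bytes):

import io
import numpy as np

# Hypothetical stand-in for FlightDataTraining.csv
raw = io.StringIO("year,carrier,delay\n2013,AA,12.5\n2014,UA,3.0\n")

d = np.genfromtxt(raw, delimiter=",", dtype=None, names=True, encoding="utf-8")

d["year"]     # integer column
d["carrier"]  # string column, no b'...' prefix
d["delay"]    # float column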

Python, Pandas DataFrames: Why can't I assign data directly with df.values = df.values/100?

I want to
1) Read in data from the FamaFrench website
2) Convert the date (Month,Year) into a DateTime object
3) Convert all the returns data into percentage returns (returns/100)
My code below reads in data from the famafrench website.
industry30Web = web.DataReader("30_Industry_Portfolios", "famafrench")
industry30_monthlyVal = industry30Web[4]
dateInt = industry30_monthlyVal.index
conv = lambda x: datetime.datetime.strptime(str(x),'%Y%m')
dates = [conv(x) for x in dateInt]
industry30_monthlyVal.index = dates
industry30_monthlyVal.values = industry30_monthlyVal.values/100
The last line is showing an AttributeError
Please help and let me know what I'm doing wrong.
The documentation specifically states that you cannot assign to the values attribute.
However, you can achieve what you want by doing industry30_monthlyVal[:] = industry30_monthlyVal[:]/100.
When I had a look at the source of pd.DataFrame (under generic.py), I found:
@property
def values(self):
    """Numpy representation of NDFrame

    Notes
    -----
    The dtype will be a lower-common-denominator dtype (implicit
    upcasting); that is to say if the dtypes (even of numeric types)
    are mixed, the one that accommodates all will be chosen. Use this
    with care if you are not dealing with the blocks.

    e.g. If the dtypes are float16 and float32, dtype will be upcast to
    float32. If dtypes are int32 and uint8, dtype will be upcast to
    int32.
    """
    return self.as_matrix()
The property has no setter, i.e. no capability for writing data back to the DataFrame (unlike, say, a plain attribute that you could overwrite).
The apply functionality is probably what you are looking for:
import numpy as np, pandas as pd
s = pd.Series(np.random.randn(10))
s = s.apply(lambda x: x/100)
It works the same way for a DataFrame.
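For completeness, a small sketch of both working alternatives (the column names are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=["Food", "Beer"])

df[:] = df[:] / 100               # in-place slice assignment is allowed
df = df.apply(lambda x: x / 100)  # apply returns a new, scaled frame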

genfromtxt - Force column name generation for unknown number of columns

I have trouble getting numpy to load tabular data and automatically generate column names. It seems pretty simple but I cannot nail it.
If I knew the number of columns, I could easily create the names parameter, but I don't have this knowledge, and I would like to avoid prior introspection of the data file.
How can I force numpy to generate the column names, or use tuple-like dtype automatically, when I have no knowledge how many columns there are in file? I want to manipulate the column names after reading the data.
My approaches so far:
data = np.genfromtxt(tar_member, unpack=True, names='') - I wanted to force automatic generation of column names by passing some "empty" parameter. This fails with ValueError: size of tuple must match number of fields.
data = np.genfromtxt(tar_member, unpack=True, names=True) - "Works", but consumes the 1st row of data.
data = np.genfromtxt(tar_member, unpack=True, dtype=None) - Worked for data with mixed types: automatic type guessing expanded the dtype into a tuple and assigned the names. However, for data where everything was actually float, the dtype was set to plain float64, and I got ValueError: there are no fields defined when I tried to access data.dtype.names.
I know there is a cleaner way to do this, but if you don't mind forcing the issue you can generate your dtype structure and assign it directly to the data array.
import numpy

x = numpy.random.rand(10, 10)
numpy.savetxt('test.out', x, delimiter=',')
dataa = numpy.genfromtxt('test.out', delimiter=",", dtype=None)

if dataa.dtype.names is None:  # then dataa is homogeneous
    # Build a ('f0', dtype), ('f1', dtype), ... field list, one per column,
    # and reinterpret each row as a single structured record.
    l1 = [('f%d' % z, dataa.dtype) for z in range(dataa.shape[1])]
    dataa.dtype = numpy.dtype(l1)

dataa.dtype
dataa.dtype.names
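Note that reassigning the dtype reinterprets each row as a single record: a (10, 10) float64 array becomes a (10, 1) structured array, and dataa.dtype.names is then ('f0', 'f1', ..., 'f9'), the same auto-generated names genfromtxt produces for mixed-type data.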
