How to read only a string column using numpy in Python?
csv file:
1,2,3,"Hello"
3,3,3,"New"
4,5,6,"York"
How do I get an array like:
["Hello","New","York"]
without using pandas and sklearn.
The columns are named a,b,c,d in the CSV header.
import numpy as np
ary = np.genfromtxt(r'yourcsv.csv', delimiter=',', dtype=None)
ary.T[-1]
Out[139]:
array([b'd', b'Hello', b'New', b'York'],
      dtype='|S5')
Note that the header name b'd' shows up as the first element; add skip_header=1 to the genfromtxt call to drop the header row.
import numpy
fname = 'sample.csv'
csv = numpy.genfromtxt(fname, dtype=str, delimiter=",")
names = csv[:,-1]
print(names)
Choosing the data type
The main way to control how the sequences of strings we have read from the file are converted to other types is to set the dtype argument. Acceptable values for this argument are:
a single type, such as dtype=float. The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the names argument (see below). Note that dtype=float is the default for genfromtxt.
a sequence of types, such as dtype=(int, float, float).
a comma-separated string, such as dtype="i4,f8,|U3".
a dictionary with two keys 'names' and 'formats'.
a sequence of tuples (name, type), such as dtype=[('A', int), ('B', float)].
an existing numpy.dtype object.
the special value None. In that case, the type of the columns will be determined from the data itself (see below).
When dtype=None, the type of each column is determined iteratively from its data. We start by checking whether a string can be converted to a boolean (that is, if the string matches true or false in lower cases); then whether it can be converted to an integer, then to a float, then to a complex and eventually to a string. This behavior may be changed by modifying the default mapper of the StringConverter class.
The option dtype=None is provided for convenience. However, it is significantly slower than setting the dtype explicitly.
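As a small illustration of a few of these options (a sketch; the input lines here are made up):
import numpy as np

lines = ["1,2.5,abc", "2,3.5,def"]  # hypothetical CSV content

# a comma-separated dtype string: int32, float64, 3-char unicode
a = np.genfromtxt(lines, delimiter=',', dtype="i4,f8,|U3")

# a sequence of (name, type) tuples
b = np.genfromtxt(lines, delimiter=',', dtype=[('A', int), ('B', float), ('C', 'U3')])

# dtype=None: let genfromtxt infer each column's type (convenient but slower)
c = np.genfromtxt(lines, delimiter=',', dtype=None)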
A quick file substitute:
In [275]: txt = b'''
...: 1,2,3,"Hello"
...: 3,3,3,"New"
...: 4,5,6,"York"'''
In [277]: np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None,usecols=3)
Out[277]:
array([b'"Hello"', b'"New"', b'"York"'],
dtype='|S7')
In Py3 that gives a bytestring array; dtype=str gives the default unicode string dtype instead:
In [278]: np.genfromtxt(txt.splitlines(), delimiter=',',dtype=str,usecols=3)
Out[278]:
array(['"Hello"', '"New"', '"York"'],
dtype='<U7')
Or the whole thing:
In [279]: data=np.genfromtxt(txt.splitlines(), delimiter=',',dtype=None)
In [280]: data
Out[280]:
array([(1, 2, 3, b'"Hello"'), (3, 3, 3, b'"New"'), (4, 5, 6, b'"York"')],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S7')])
select the f3 field:
In [282]: data['f3']
Out[282]:
array([b'"Hello"', b'"New"', b'"York"'],
dtype='|S7')
Speed should be basically the same for either approach.
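Note that the surrounding quote characters are kept, since genfromtxt does not treat quoting specially. A quick sketch for stripping them with np.char.strip:
names = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=str, usecols=3)
names = np.char.strip(names, '"')  # remove the leading/trailing double quotes
print(names)  # ['Hello' 'New' 'York']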
To extract specific values into a numpy array, one approach could be:
import csv
import numpy as np

patient = []
with open('Exercise1.csv', 'r') as file:
    file_content = list(csv.reader(file, delimiter=","))
data = np.array(file_content)
print(file_content[1][1], len(file_content))
for i in range(1, len(file_content)):  # start at 1 to skip the header row
    patient.append(file_content[i][0])
first_column_array = np.array(patient, dtype=str)
Here i iterates through the rows, and the second index selects the position of the value within the row, so 0 is the first value.
Related
I'm pretty illiterate in using Python/numpy.
I have the following piece of code:
data = np.array([])
for i in range(10):
    data = np.append(data, GetData())
return data
GetData() returns a numpy array with a custom dtype. However, when executing the above piece of code, the numbers are converted to float64, which I suspect is the culprit for other issues I'm having. How can I copy/append the output of the function while preserving its dtype?
Given the comments stating that you will only know the type of data once you run GetData(), and that multiple types are expected, you could do it like so:
# [...]
dataByType = {}  # dictionary to store the dtypes encountered and the arrays with given dtype
for i in range(10):
    newData = GetData()
    if newData.dtype not in dataByType:
        # If the dtype has not been encountered yet,
        # create an empty array with that dtype and store it in the dict
        dataByType[newData.dtype] = np.array([], dtype=newData.dtype)
    # Append the new data to the corresponding array in dict, depending on dtype
    dataByType[newData.dtype] = np.append(dataByType[newData.dtype], newData)
Taking into account hpaulj's answer, if you wish to preserve the different types you might encounter without creating a new array at each iteration, you can adapt the above to:
# [...]
dataByType = {}  # dictionary to store the dtypes encountered and the list storing data with given dtype
for i in range(10):
    newData = GetData()
    if newData.dtype not in dataByType:
        # If the dtype has not been encountered yet,
        # create an empty list for it and store it in the dict
        dataByType[newData.dtype] = []
    # Append the new data to the corresponding list in dict, depending on dtype
    dataByType[newData.dtype].append(newData)

# At this point, you have all your data pieces stored according to their original dtype inside the dataByType dictionary.
# Now if you wish you can convert them to numpy arrays as well
# Either by concatenation, updating what is stored in the dict
for dataType in dataByType:
    dataByType[dataType] = np.concatenate(dataByType[dataType])
# No need to specify the dtype in concatenate here, since the previous step ensures all data pieces are the same type

# Or by creating an array directly, to store each data piece at a different index
for dataType in dataByType:
    dataByType[dataType] = np.array(dataByType[dataType])
# As with concatenate, no need to specify the dtype here
A little example:
import numpy as np
# to get something similar to GetData in the example structure:
getData = [
    np.array([1.,2.], dtype=np.float64),
    np.array([1,2], dtype=np.int64),
    np.array([3,4], dtype=np.int64),
    np.array([3.,4.], dtype=np.float64)
]  # dtype specified here for clarity, but not needed

dataByType = {}
for i in range(len(getData)):
    newData = getData[i]
    if newData.dtype not in dataByType:
        dataByType[newData.dtype] = []
    dataByType[newData.dtype].append(newData)

print(dataByType)  # output formatted below for clarity
# {dtype('float64'):
#      [array([1., 2.]), array([3., 4.])],
#  dtype('int64'):
#      [array([1, 2], dtype=int64), array([3, 4], dtype=int64)]}
Now if we use concatenate on that dataset, we get 1D arrays, preserving the original type (dtype=float64 is not shown in the output since it is the default type for floating point values):
for dataType in dataByType:
    dataByType[dataType] = np.concatenate(dataByType[dataType])

print(dataByType)  # once again output formatted for clarity
# {dtype('float64'):
# array([1., 2., 3., 4.]),
# dtype('int64'):
# array([1, 2, 3, 4], dtype=int64)}
And if we use array, we get 2D arrays:
for dataType in dataByType:
    dataByType[dataType] = np.array(dataByType[dataType])

print(dataByType)
# {dtype('float64'):
# array([[1., 2.],
# [3., 4.]]),
# dtype('int64'):
# array([[1, 2],
# [3, 4]], dtype=int64)}
An important thing to note: using array will not work as intended if the arrays to combine don't all have the same shape:
import numpy as np
print(repr(np.array([
    np.array([1,2,3]),
    np.array([4,5])], dtype=object)))  # dtype=object is required for ragged input on recent NumPy versions
# array([array([1, 2, 3]), array([4, 5])], dtype=object)
You get an array of dtype object, whose elements are in this case arrays of different lengths.
Your use of [] and append indicates that you are naively copying that common list idiom:
alist = []
for x in another_list:
    alist.append(x)
Your data is not a clone of the [] list:
In [220]: np.array([])
Out[220]: array([], dtype=float64)
It's an array with shape (0,) and dtype float.
np.append is not a list append clone. I stress that because too many new users make that mistake, and the result is many different errors. It is really just a cover for np.concatenate, one that takes 2 arguments instead of a list of arguments. As the docs stress, it returns a new array, and when used iteratively, that means a lot of copying.
It is best to collect your arrays in a list, and give it to concatenate. List append is in-place, and better when done iteratively. If you give concatenate a list of arrays, the resulting dtype will be the common one (or whatever promotion requires). (Newer versions do let you specify the dtype when calling concatenate.)
Keep the numpy documentation at hand (the python docs too if necessary), and look up functions. Pay attention to how they are called, including the keyword parameters. And practice with small examples. I keep an interactive python session at hand, even when writing answers.
When working with arrays, pay close attention to shape and dtype. Don't make assumptions.
concatenating 2 int arrays:
In [238]: np.concatenate((np.array([1,2]),np.array([4,3])))
Out[238]: array([1, 2, 4, 3])
making one a float array (just by adding a decimal point to one number):
In [239]: np.concatenate((np.array([1,2]),np.array([4,3.])))
Out[239]: array([1., 2., 4., 3.])
It won't let me change the result to int:
In [240]: np.concatenate((np.array([1,2]),np.array([4,3.])), dtype=int)
Traceback (most recent call last):
File "<ipython-input-240-91b4e3fec07a>", line 1, in <module>
np.concatenate((np.array([1,2]),np.array([4,3.])), dtype=int)
File "<__array_function__ internals>", line 180, in concatenate
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'same_kind'
If an element is a string, the result is also a string dtype:
In [241]: np.concatenate((np.array([1,2]),np.array(['4',3.])))
Out[241]: array(['1', '2', '4', '3.0'], dtype='<U32')
Sometimes it is necessary to adjust dtypes after a calculation:
In [243]: np.concatenate((np.array([1,2]),np.array(['4',3.]))).astype(float)
Out[243]: array([1., 2., 4., 3.])
In [244]: np.concatenate((np.array([1,2]),np.array(['4',3.]))).astype(float).astype(int)
Out[244]: array([1, 2, 4, 3])
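As an aside, NumPy 1.20+ lets you override the casting rule directly in concatenate; a small sketch:
import numpy as np

# casting='unsafe' permits the float64 -> int64 conversion
# that the default 'same_kind' rule rejected above
out = np.concatenate((np.array([1, 2]), np.array([4, 3.])), dtype=int, casting='unsafe')
print(out)  # [1 2 4 3]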
I'm trying to use the np.ceil function on a structured numpy array, but all I get is the error message:
TypeError: ufunc 'ceil' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Here's a simple example of what that array would look like:
arr = np.array([(1.4,2.3), (3.2,4.1)], dtype=[("x", "<f8"), ("y", "<f8")])
When I try
np.ceil(arr)
I get the above mentioned error. When I just use one column, it works:
In [77]: np.ceil(arr["x"])
Out[77]: array([ 2., 4.])
But I need to get the entire array. Is there any way other than going column by column, or not using structured arrays altogether?
Here's a dirty solution based on viewing the array without its structure, taking the ceiling, and then converting it back to a structured array.
# sample array
arr = np.array([(1.4,2.3), (3.2,4.1)], dtype = [("x", "<f8"), ("y", "<f8")])
# remove struct and take the ceiling
arr1 = np.ceil(arr.view((float, len(arr.dtype.names))))
# coerce it back into the struct
arr = np.array(list(tuple(t) for t in arr1), dtype = arr.dtype)
# kill the intermediate copy
del arr1
And here it is as an unreadable one-liner, without assigning the intermediate copy arr1:
arr = np.array(
list(tuple(t) for t in np.ceil(arr.view((float, len(arr.dtype.names))))),
dtype = arr.dtype
)
# array([(2., 3.), (4., 5.)], dtype=[('x', '<f8'), ('y', '<f8')])
I don't claim this is a great solution, but it should help you move on with your project until something better is proposed.
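For what it's worth, newer NumPy versions (1.16+) ship helpers in numpy.lib.recfunctions that do this view/repack dance for you; a sketch under that assumption:
import numpy as np
from numpy.lib import recfunctions as rfn

arr = np.array([(1.4, 2.3), (3.2, 4.1)], dtype=[("x", "<f8"), ("y", "<f8")])

# view the homogeneous struct as a plain 2D float array, take the ceiling,
# then pack the result back into the original structured dtype
unstructured = rfn.structured_to_unstructured(arr)
arr2 = rfn.unstructured_to_structured(np.ceil(unstructured), dtype=arr.dtype)
# array([(2., 3.), (4., 5.)], dtype=[('x', '<f8'), ('y', '<f8')])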
I need to generate a numpy array with named columns from a list. I don't know how to do it, so for now I use a temporary txt file and feed it to the genfromtxt numpy function.
my_data = np.genfromtxt('tmp.txt', delimiter='|', dtype=None, names=["Num", "Date", "Desc", "Rgh", "Prc", "Color", "Smb", "MType"])
How can I get rid of genfromtxt, since I need to generate the same structured array from a list of strings instead of a file?
List of strings
For a list of strings, you can use genfromtxt directly. It accepts any iterable that can feed it strings/lines one at a time. I use this approach all the time when answering genfromtxt questions, e.g. in
https://stackoverflow.com/a/35874408/901925
In [186]: txt='1|abc|Red|no'
In [187]: txt=[txt,txt,txt]
In [188]: A=np.genfromtxt(txt, dtype=None, delimiter='|')
In [189]: A
Out[189]:
array([(1, 'abc', 'Red', 'no'), (1, 'abc', 'Red', 'no'),
(1, 'abc', 'Red', 'no')],
dtype=[('f0', '<i4'), ('f1', 'S3'), ('f2', 'S3'), ('f3', 'S2')])
In Python3 there's the added complication of byte strings vs. regular ones.
List of values
In some ways genfromtxt is an easy way of creating a structured array. But with a few key facts, it isn't hard to generate one directly.
First, define the dtype. There are various ways of doing this.
dt = np.dtype([('name1',int),('name2',float),('name3','S10'),...])
I usually test this expression in an interactive shell.
A = np.zeros((n,), dtype=dt)
creates a 'blank' array of the correct type. Try it with a small n, and print the result.
Now try assigning values. The easiest is by field
A['name1'] = [1,2,3]
A['name3'] = ['abc','def',...]
Or by record
A[0] = (1, 1.23, 'str', ...)
Multiple records are assigned values with a list of tuples. That is the key. For a 2d array, a list of lists works; but for a structured 1d array the elements have to be tuples.
A = np.array([(1,1.2,'abc'),(2,342.,'xyz'),(3,0,'')], dtype=dt)
Sometimes it helps to use a list comprehension to turn a list of lists into a list of tuples.
alist = [[1,1.0,'str'],[]...]
A[:] = [tuple(l) for l in alist]
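Putting those pieces together for the original question, a minimal sketch of building the structured array from a list of strings without genfromtxt (the field names, types, and input lines here are placeholders; adapt them to your data):
import numpy as np

rows = ['1|abc|Red|no', '2|def|Blue|yes']  # hypothetical input lines

dt = np.dtype([('Num', int), ('Desc', 'U10'), ('Color', 'U10'), ('Smb', 'U3')])
A = np.zeros(len(rows), dtype=dt)

for i, line in enumerate(rows):
    num, desc, color, smb = line.split('|')
    A[i] = (int(num), desc, color, smb)  # each record is assigned as a tuple

print(A)
# [(1, 'abc', 'Red', 'no') (2, 'def', 'Blue', 'yes')]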
I need to convert the data stored in a pandas.DataFrame into a byte string where each column can have a separate data type (integer or floating point). Here is a simple set of data:
df = pd.DataFrame([ 10, 15, 20], dtype='u1', columns=['a'])
df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')
and df looks something like this:
a b c
0 10 18446744073709551615 1.324000e+10
1 15 230498234019 3.141590e+00
2 20 32094812309 2.341341e+02
The DataFrame knows about the types of each column df.dtypes so I'd like to do something like this:
data_to_pack = [tuple(record) for _, record in df.iterrows()]
data_array = np.array(data_to_pack, dtype=zip(df.columns, df.dtypes))
data_bytes = data_array.tostring()
This typically works fine, but in this case it fails due to the maximum value stored in df['b'][0]. The second line above, converting the array of tuples to an np.array with a given set of types, raises the following error:
OverflowError: Python int too large to convert to C long
The error results (I believe) from the first line, which extracts each record as a Series with a single data type (defaulting to float64); the float64 representation chosen for the maximum uint64 value is not directly convertible back to uint64.
1) Since the DataFrame already knows the types of each column, is there a way to get around creating a row of tuples for input into the typed numpy.array constructor? Or is there a better way than outlined above to preserve the type information in such a conversion?
2) Is there a way to go directly from DataFrame to a byte string representing the data, using the type information for each column?
You can use df.to_records() to convert your dataframe to a numpy recarray, then call .tostring() to convert this to a string of bytes:
rec = df.to_records(index=False)
print(repr(rec))
# rec.array([(10, 18446744073709551615, 13240000000.0), (15, 230498234019, 3.14159),
# (20, 32094812309, 234.1341)],
# dtype=[('a', '|u1'), ('b', '<u8'), ('c', '<f8')])
s = rec.tostring()
rec2 = np.fromstring(s, rec.dtype)
print(np.all(rec2 == rec))
# True
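A side note: in newer NumPy versions tostring and fromstring are deprecated; the byte-identical replacements are tobytes and frombuffer:
s = rec.tobytes()
rec2 = np.frombuffer(s, dtype=rec.dtype)
print(np.all(rec2 == rec))
# True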
import pandas as pd
df = pd.DataFrame([ 10, 15, 20], dtype='u1', columns=['a'])
df_byte = df.to_json().encode()
print(df_byte)
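Note that this produces a JSON text encoding of the frame rather than its raw binary column data; a round-trip sketch (the original dtypes are not preserved, since JSON only knows generic numbers):
import io
import pandas as pd

df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
df_byte = df.to_json().encode()
df_restored = pd.read_json(io.BytesIO(df_byte))  # comes back with default int64, not 'u1'
print(df_restored)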
I'm running genfromtxt like below:
date_conv = lambda x: str(x).replace(":", "/")
time_conv = lambda x: str(x)
a = np.genfromtxt('input.txt', delimiter=',', skip_header=4,
                  usecols=[0, 1] + radii_indices, converters={0: date_conv, 1: time_conv})
Where input.txt is from this gist.
When I look at the results, it is a 1D array, not a 2D array:
>>> np.shape(a)
(918,)
It seems to be an array of tuples instead:
>>> a[0]
('06/03/2006', '08:27:23', 6.4e-05, 0.000336, 0.001168, 0.002716, 0.004274, 0.004658, 0.003756, 0.002697, 0.002257, 0.002566, 0.003522, 0.004471, 0.00492, 0.005602, 0.006956, 0.008442, 0.008784, 0.006976, 0.003917, 0.001494, 0.000379, 6.4e-05)
If I remove the converters specification from the genfromtxt call it works fine and produces a 2D array:
>>> np.shape(a)
(918, 24)
What is returned is called a structured ndarray, see e.g. here: http://docs.scipy.org/doc/numpy/user/basics.rec.html. This is because your data is not homogeneous, i.e. not all elements have the same type: the data contains both strings (the first two columns) and floats. Numpy arrays have to be homogeneous (see here for an explanation).
The structured array 'solves' this constraint of homogeneity by using tuples for each record or row, that's the reason the returned array is 1D: one series of tuples, but each tuple (row) consists of several fields, so you can regard it as rows and columns. The different columns are accessible as a['nameofcolumn'] e.g. a['Julian_Day'].
The reason that it returns a 2D array when removing the converters for the first two columns is that in that case, genfromtxt regards all data of the same type, and a normal ndarray is returned (the default type is float, but you can specify this with the dtype argument).
EDIT: If you want to make use of the column names, you can use the names argument (and set skip_header to only three):
a2 = np.genfromtxt("input.txt", delimiter=',', skip_header=3, names=True, dtype=None,
                   usecols=[0, 1] + radii_indices, converters={0: date_conv, 1: time_conv})
Then you can do, e.g.:
>>> a2['Dateddmmyyyy']
array(['06/03/2006', '06/03/2006', '18/03/2006', '19/03/2006',
'19/03/2006', '19/03/2006', '19/03/2006', '19/03/2006',
'19/03/2006', '19/03/2006'],
dtype='|S10')
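If you also want a plain 2D float array of the numeric columns alongside the names, one option is to stack the radii fields (a sketch; which fields are numeric depends on your usecols):
import numpy as np

# assumes a2 is the structured array from the genfromtxt call above,
# with the first two fields holding the converted date/time strings
numeric_fields = a2.dtype.names[2:]
radii = np.stack([a2[name] for name in numeric_fields], axis=1)
print(radii.shape)  # (918, number_of_radii_columns)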