How do I build data.frames containing multiple types of data (strings, int, logical), both continuous and factor, in Python NumPy?
The following code turns my headers into NaNs and makes everything except my float values NaN:
from numpy import genfromtxt
my_data = genfromtxt('FlightDataTraining.csv', delimiter=',')
The following code puts a b'' prefix on all of my data, so that year becomes b'year':
import numpy as np
d = np.loadtxt('FlightDataTraining.csv',delimiter=',',dtype=str)
Try genfromtxt('FlightDataTraining.csv', delimiter=',', dtype=None). This tells genfromtxt to intelligently guess the dtype of each column. If that does not work, please post a sample of your CSV and what the desired output should look like.
The b in b'data' is Python's way of representing bytes as opposed to str objects. So the b'data' is okay. If you want strs, you would need to decode the bytes.
NumPy does not have a dtype for representing factors, though Pandas does have a pd.Categorical type.
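A minimal sketch of how these pieces can fit together, assuming the same FlightDataTraining.csv and a 'year' column from the question (adjust the names to your file):

import numpy as np
import pandas as pd

# dtype=None lets genfromtxt guess each column's dtype; names=True takes the
# column names from the header row; encoding='utf-8' returns str instead of
# bytes (available in NumPy >= 1.14)
data = np.genfromtxt('FlightDataTraining.csv', delimiter=',',
                     dtype=None, names=True, encoding='utf-8')

# NumPy has no factor dtype, so for categorical work hand the structured
# array to pandas; the 'year' column name is assumed from the question
df = pd.DataFrame(data)
df['year'] = pd.Categorical(df['year'])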
I wish to have an int matrix whose first column is filled and whose remaining elements are Null. Sorry, but I have a background in R, so I know that if I leave some elements as Null they will be easier to manage later, whereas if I fill them with 0 it will cause lots of problems later.
I have the following code:
import numpy as np
import numpy.random as random
import pandas as pa
def getRowData():
    rowDt = np.full((80,20), np.nan)
    rowDt[:,0] = random.choice([1,2,3],80)  # Set the first column
    return rowDt
I wish this function to return ints, but it seems to give me floats.
I have seen this link and tried the code below:
return pa.to_numeric(rowDt)
But it did not help me. Also, the rowDt object does not have .astype(<type>).
How can I get an int array?
You create a full matrix (np.full) of np.nan, which has float dtype. This means you start off with a matrix defined to hold floats, not integers.
To fix this, define the full matrix with the integer 0 as the initial value. That way, the dtype of your array is integer and there is no need for astype or any type casting.
rowDt = np.full((80,20), 0)
If you still wish to hold np.nan in your matrix, then I'm afraid you cannot use numpy arrays for that. You either hold all integers, or all floats.
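As a small sketch of that suggestion, applied to the getRowData function from the question:

import numpy as np
import numpy.random as random

def getRowData():
    rowDt = np.full((80, 20), 0)                 # integer fill value -> integer dtype
    rowDt[:, 0] = random.choice([1, 2, 3], 80)   # set the first column
    return rowDt

print(getRowData().dtype)   # a platform integer dtype, e.g. int64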
You can use numpy.ma.masked_array() to create a numpy masked array.
A numpy masked array "remembers" which elements are "masked". It provides methods and functions similar to those of numpy arrays, but excludes the masked values from computations (such as, e.g., mean()).
Once you have the masked array, you can always mask or unmask specific elements or rows or columns of elements whenever you want.
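A hedged sketch of the masked-array route, reusing the 80x20 shape from the question and treating everything outside the first column as missing:

import numpy as np
import numpy.random as random

data = np.zeros((80, 20), dtype=int)
data[:, 0] = random.choice([1, 2, 3], 80)

mask = np.ones((80, 20), dtype=bool)   # True marks an element as "missing"
mask[:, 0] = False                     # only the first column holds real values

rowDt = np.ma.masked_array(data, mask=mask)
print(rowDt.dtype)    # still an integer dtype
print(rowDt.mean())   # computed over the unmasked (first-column) values only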
I am having the following trouble in Python. Assume a numpy.matrix A with entries of dtype complex128. I want to export A in CSV format so that the entries are separated by commas and each line of the output file corresponds to a row of A. I also need 18 decimal points of precision for both the real and imaginary parts, and no spaces within an entry. For example, I need this
`6.103515626000000000e+09+1.712134684679831166e+05j`
instead of
`6.103515626000000000e+09 + 1.712134684679831166e+05j`
The following command works, but only for a 1-by-1 matrix:
numpy.savetxt('A.out', A, fmt='%.18e%+.18ej', delimiter=',')
If I use:
numpy.savetxt('A.out', A, delimiter=',')
there are two problems. First, I don't know how many decimal points are preserved by default. Second, each complex entry is put in parentheses like
(6.103515626000000000e+09+1.712134684679831166e+05j)
and I cannot read the file in Matlab.
What do you suggest?
This is probably not the most efficient way of converting the data in a large matrix, and I am sure there exists a more efficient one-line solution, but you can try executing the code below and see if it works. Here I use pandas to save the data to a csv file. The first and second columns in the generated csv file will be, respectively, the real and imaginary parts. I also assume that the dimension of the input matrix is Nx1.
import pandas as pd
import numpy as np

def to_csv(t, nr_of_decimal=18):
    # first column: real parts, second column: imaginary parts
    t_flat = np.asarray(t).ravel()
    t_new = np.zeros((t.shape[0], 2))
    t_new[:, 0] = np.round(t_flat.real, decimals=nr_of_decimal)
    t_new[:, 1] = np.round(t_flat.imag, decimals=nr_of_decimal)
    pd.DataFrame(t_new).to_csv('out.csv', index=False, header=False)

# Assume t is your complex matrix
t = np.matrix([[6.103515626000000000e+09+1.712134684679831166e+05j],
               [6.103515626000000000e+09+1.712134684679831166e+05j]])
to_csv(t)
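Alternatively, staying closer to the original savetxt attempt: for complex input, savetxt's documentation allows a list of format specifiers, one per column, each containing separate real and imaginary parts. A minimal sketch (the two-column matrix A here is a made-up example, not from the question) that should produce comma-separated a+bj entries with 18 decimals and no parentheses or spaces:

import numpy as np

# assumed example matrix; the second entry is made up to show a negative
# imaginary part
A = np.matrix([[6.103515626e+09+1.712134684679831166e+05j,
                1.0e+00-2.5e-01j]])

# one '%.18e%+.18ej' specifier per column; savetxt then writes each entry
# as the real part followed by the signed imaginary part and 'j'
np.savetxt('A.out', A, delimiter=',',
           fmt=['%.18e%+.18ej'] * A.shape[1])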
I have a JSON file containing properties of some mathematical objects (Calabi-Yau manifolds). These objects are defined by a matrix and a vector, and two additional properties I am storing are the matrix size (so that it does not need to be computed again) and the Euler number of the manifold (some integer). In total there are roughly 1 million entries; the biggest matrix is 16 x 20.
I would like to convert the matrices and vectors to numpy arrays. Hence I was wondering whether it is possible to do this directly when loading from JSON, or at least how to convert them afterwards. The reason for converting is that I will in any case need some numpy functions later, but I also hope (especially if the conversion is done at load time) that it will speed up my code: at the moment loading the complete dataset takes roughly 90 seconds (a previous version using the json module required only 20 s; I will open another thread on this question if using numpy does not improve things).
Here is a minimal working code:
import pandas as pd
import numpy as np
json = '''
{"1":{"euler":2610,"matrix":[[6]],"size":[1,1],"vec":[5]},
"2":{"euler":2190,"matrix":[[2,5]],"size":[1,2],"vec":[6]},
"4":{"euler":1632,"matrix":[[2,2,4]],"size":[1,3],"vec":[7]},
"6":{"euler":1152,"matrix":[[2,2,2,3]],"size":[1,4],"vec":[8]},
"7":{"euler":960,"matrix":[[2,2,2,2,2]],"size":[1,5],"vec":[9]},
"8":{"euler":2160,"matrix":[[2],[5]],"size":[2,1],"vec":[1,4]},
"9":{"euler":1836,"matrix":[[0,2],[2,4]],"size":[2,2],"vec":[1,5]}}
'''
data = pd.read_json(json, orient="index")
data.sort_index(inplace=True)
My first guess was to use the numpy argument, but it fails with an error:
>>> data = pd.read_json(json, orient="index", numpy=True)
ValueError: cannot reshape array of size 51 into shape (7,4,2,2)
Then I tried passing the dtype argument, but it does not seem to change anything (my hope was that using a numpy type would convert the lists to arrays):
>>> dtype = {"euler": np.int16, "matrix": np.int8, "vector": np.int8,
... "size": np.int8, "number": np.int32}
>>> data = pd.read_json(json, orient="index", dtype=dtype)
>>> type(data["matrix"][1])
<class 'list'>
For the conversion, I was wondering whether there is a more subtle (and perhaps more efficient) way than brute-force conversion:
data["matrix"] = data["matrix"].apply(lambda x: np.array(x, dtype=np.int8))
I have a data file (csv) with Nilsimsa hash values. Some of them are as long as 80 characters. I wish to read them into Python for data analysis tasks. Is there a way to import the data into Python without information loss?
EDIT: I have tried the implementations proposed in the comments, but they do not work for me.
Example data in csv file would be: 77241756221441762028881402092817125017724447303212139981668021711613168152184106
Start with a simple text file to read in, just one variable and one row.
%more foo.txt
x
77241756221441762028881402092817125017724447303212139981668021711613168152184106
In [268]: df=pd.read_csv('foo.txt')
Pandas will read it in as a string because it's too big to store as a core number type like int64 or float64. But the info is there, you didn't lose anything.
In [269]: df.x
Out[269]:
0 7724175622144176202888140209281712501772444730...
Name: x, dtype: object
In [270]: type(df.x[0])
Out[270]: str
And you can use plain Python to treat it as a number. Recall the caveats from the links in the comments: this isn't going to be as fast as operations in numpy and pandas where a whole column is stored as int64. This uses the more flexible, but slower, object mode to handle things.
You can change the column to be stored as long (arbitrary-precision) integers like this. (But note that the dtype is still object, because everything except the core numpy types (int32, int64, float64, etc.) is stored as an object.)
In [271]: df.x = df.x.map(int)
And then you can more or less treat it like a number.
In [272]: df.x * 2
Out[272]:
0 1544835124428835240577628041856342500354488946...
Name: x, dtype: object
You'll have to do some formatting to see the whole number. Or go the numpy route which will default to showing the whole number.
In [273]: df.x.values * 2
Out[273]: array([ 154483512442883524057762804185634250035448894606424279963336043423226336304368212L], dtype=object)
As explained by @JohnE in his answer, we do not lose any information when reading big numbers with Pandas. They are stored as dtype=object; to do numerical computations on them, we need to transform the data into a numeric type.
For series:
We apply map(func) to the series in the dataframe:
df['columnName'].map(int)
Whole dataframe:
If, for some reason, our entire dataframe is composed of columns with dtype=object, we can use applymap(func).
from the documentation of Pandas:
DataFrame.applymap(func): Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame
So, to transform all columns in the dataframe:
df.applymap(int)
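A small self-contained sketch of both routes, with made-up hash-like strings; note that in recent pandas releases applymap has been renamed to DataFrame.map:

import pandas as pd

# made-up hash-like values, read in as strings (dtype=object)
df = pd.DataFrame({
    "x": ["77241756221441762028881402092817125017724447303212139981668021711613168152184106"],
    "y": ["12345678901234567890123456789012345678901234567890"],
})

df["x"] = df["x"].map(int)   # convert a single series
df = df.applymap(int)        # or convert every column at once
print(df.dtypes)             # both columns still report dtype 'object'
print(df["x"][0] * 2)        # but arithmetic is now exact, arbitrary precision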
Is there any reason why pandas changes the type of columns from int to float in update, and can I prevent it from doing so? Here is some example code of the problem:
import pandas as pd
import numpy as np
df = pd.DataFrame({'int': [1, 2], 'float': [np.nan, np.nan]})
print('Integer column:')
print(df['int'])
for _, df_sub in df.groupby('int'):
    df_sub['float'] = float(df_sub['int'])
    df.update(df_sub)
print('NO integer column:')
print(df['int'])
Here's the reason for this: since you are effectively masking certain values in a column and replacing them (with your updates), some values could become NaN.
In an integer array this is impossible, so numeric dtypes are a priori converted to float (for efficiency), as checking first would be more expensive than just doing the conversion.
A change of dtype back is possible, just not in the code right now, therefore this is a bug (a bit non-trivial to fix, though): github.com/pydata/pandas/issues/4094
This causes precision loss if you have big values in your int64 column, since update converts them to float. So converting back as Jeff suggests, with df['int'].astype(int), is not always possible.
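As a tiny illustration of that precision point (an assumed example, not from the original post): float64 has only a 53-bit mantissa, so a large int64 value does not survive a round trip through float:

import numpy as np

big = np.int64(2**62 + 1)           # 4611686018427387905 fits in int64
as_float = np.float64(big)          # rounds to the nearest representable float
print(int(as_float) == int(big))    # False: the low digits are lost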
My workaround for cases like this is:
df_sub['int'] = df_sub['int'].astype('Int64') # Int64 with capital I, supports NA values
df.update(df_sub)
df_sub['int'] = df_sub['int'].astype('int')
The above avoids the conversion to float type. The reason I am converting back to int type (instead of leaving it as Int64) is that pandas seems to lack support for that type in several operations (e.g. concat gives an error about missing .view).
Maybe they could incorporate the above fix into issue 4094.