I have trouble getting numpy to load tabular data and automatically generate column names. It seems pretty simple but I cannot nail it.
If I knew the number of columns I could easily create the names parameter, but I don't have this knowledge, and I would like to avoid prior introspection of the data file.
How can I force numpy to generate the column names, or use a tuple-like dtype, automatically when I don't know how many columns there are in the file? I want to manipulate the column names after reading the data.
My approaches so far:
data = np.genfromtxt(tar_member, unpack=True, names='') - I wanted to force automatic generation of column names by passing some "empty" parameter. This results in the error ValueError: size of tuple must match number of fields.
data = np.genfromtxt(tar_member, unpack=True, names=True) - "Works", but consumes the first row of data as the column names.
data = np.genfromtxt(tar_member, unpack=True, dtype=None) - Worked for data with mixed types. Automatic type guessing expanded the dtype into a tuple and assigned the names. However, for data where everything was actually float, the dtype was set to float64 and I got ValueError: there are no fields defined when I tried to access data.dtype.names.
I know there is a cleaner way to do this, but if you don't mind forcing the issue you can generate your dtype structure and assign it directly to the data array.
import numpy

x = numpy.random.rand(10, 10)
numpy.savetxt('test.out', x, delimiter=',')
dataa = numpy.genfromtxt('test.out',delimiter=",", dtype=None)
if dataa.dtype.names is None:  # then dataa is homogeneous (plain float64, no fields)
    l1 = [('f%d' % i, dataa.dtype) for i in range(dataa.shape[1])]
    dataa.dtype = numpy.dtype(l1)
dataa.dtype
dataa.dtype.names
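Once the structured dtype is assigned, the columns can be accessed through the generated field names, for example:
dataa['f0']          # first column, selected by its auto-generated name
dataa['f3'].mean()   # operate on a single column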
I am reading in data from a CSV file and saving it to a dataframe so I can use the columns. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt

filename = open(r"C:\Users\avalcarcel\Downloads\Data INSTR 9 8_16_2022 11_02_42.csv")
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(filename, sep='[, ]', encoding='UTF-16 LE', names=columns, header=15, on_bad_lines='skip', engine='python')

length_ = len(df.date)
scan = list(range(1, length_ + 1))

plt.plot(scan, df.ch104)
plt.show()
When I try to plot scan vs. df.ch104, I get the following exception thrown:
'value' must be an instance of str or bytes, not a None
So what I thought to do was make each column in my df a list:
ch104 = df.ch104.tolist()
But it is changing my data from this:
[screenshot: before .tolist()]
to this:
[screenshot: after .tolist()]
This also happens when I use df.ch104.values.tolist()
Can anyone help me? I haven't used python/pandas in a while and I am just trying to get the data read in first. Thanks!
So, the df.ch104.values.tolist() code basically turns your column into a 2D, 1×N list, but what you want is a 1D list of size N.
So transpose it before you call .tolist(), and lastly call [0] to convert the 1×N list into an N-element list:
df.ch104.values.tolist()[0]
Might I also suggest you include dropna() to avoid the 'value' must be an instance of str or bytes, not a None error:
df.dropna(subset=['ch104']).ch104.values.tolist()[0]
The error clearly says there are None or NaN values in your dataframe. You need to check for them and deal with them: replace them with a suitable value or delete them.
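A minimal sketch of that idea, reusing the df and column names from the question (the exact handling is up to you):
# Drop rows where ch104 is missing, then rebuild the scan index and plot.
clean = df.dropna(subset=['ch104'])
scan = list(range(1, len(clean) + 1))
plt.plot(scan, clean.ch104)
plt.show()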
I am currently performing multiple analysis steps on all the columns of my Pandas dataframe to get a good sense and overview of the data quality and structure (e.g. number of unique values, # missing values, # values by data type int/float/str ...).
My approach appears rather memory expensive and inefficient, especially with regard to larger data sets. I would really appreciate your thoughts on how to optimize the current process.
I am iterating over all the different columns in my dataset and creating two dictionaries (see below) for every column separately which hold the relevant information. As I am checking/testing each row item anyway, would it be reasonable to somehow combine all the checks in one go? And if so, how would you approach the issue? Thank you very much in advance for your support.
data_column = input_dataframe.loc[:,"column_1"] # as example, first column of my dataframe
dictionary_column = {}
unique_values = data_column.nunique()
dictionary_column["unique_values"] = unique_values
na_values = data_column.isna().sum()
dictionary_column["na_values"] = na_values
zero_values = (data_column == 0).astype(int).sum()
dictionary_column["zero_values"] = zero_values
positive_values = (data_column > 0).astype(int).sum()
dictionary_column["positive_values"] = positive_values
negative_values = (data_column < 0).astype(int).sum()
dictionary_column["negative_values"] = negative_values
data_column.dropna(inplace=True) # drop NaN, otherwise elements will be considered as float
info_dtypes = data_column.apply(lambda x: type(x).__name__).value_counts()
dictionary_data_types = {} # holds the count of the different data types (e.g. int, float, datetime, str, ...)
for index, value in info_dtypes.items():  # .iteritems() is removed in newer pandas
    dictionary_data_types[str(index)] = int(value)
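For reference, here is the same set of checks packaged into one helper and applied to every column in a single loop; the helper name and summary layout are just illustrative, and it does not yet combine the checks into a single pass over the data:
def summarize_column(data_column):
    # same checks as above, gathered into one dictionary per column
    summary = {
        "unique_values": data_column.nunique(),
        "na_values": data_column.isna().sum(),
        "zero_values": (data_column == 0).sum(),
        "positive_values": (data_column > 0).sum(),
        "negative_values": (data_column < 0).sum(),
    }
    # count of Python types per entry; drop NaN first, otherwise they count as float
    type_counts = data_column.dropna().apply(lambda x: type(x).__name__).value_counts()
    summary.update({str(k): int(v) for k, v in type_counts.items()})
    return summary

overview = {name: summarize_column(input_dataframe[name]) for name in input_dataframe.columns}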
I want to read data from a MySQL table and transfer the data to a NumPy array. The data in the MySQL table includes varchar(128), int, bigint and float columns, so I thought I might read all of it as strings at first, trying numpy.fromiter:
select_sql = "select * from fb_web_active_group_members_user_mbkmeansclustering_ng_six_test"
count = cur.execute(select_sql)
if count:
    user_level_cluster_data = cur.fetchall()
    user_level_cluster_data_df = numpy.fromiter(user_level_cluster_data, dtype=numpy.str, count=-1)
but it raises an error:
File "F:/MyDocument/F/My Document/Training/Python/PyCharmProject/FaceBookCrawl/FB_group_user_stability.py", line 21, in get_pre_new_user_level_data
user_level_cluster_data_df = numpy.fromiter(user_level_cluster_data,dtype = numpy.str,count = -1)
ValueError: Must specify length when using variable-size data-type.
Could you please tell me the reason and how to resolve it? And if I want to read all the data from the MySQL table with their own data types (not read them all as strings first), e.g. the varchar(128) data as string, int as int, float as float..., how should I do that?
dtype needs to be the entire, full dtype for a whole record. Your current error occurs because NumPy strings are fixed-capacity, so you'd need to say dtype='S128' for example, to get strings up to 128 characters in capacity. But your actual dtype probably consists of several columns, so you might want something like this:
dtype=[('colA', 'i4'), ('colB', 'f8'), ('colC', 'S128')]
Also note that fromiter() may not be helping you, since you're using fetchall() which I think returns a list anyway. You can simply do:
np.array(user_level_cluster_data, dtype)
Or if you want to use fromiter(), you should pass it the count parameter and use lazy fetching instead of fetchall().
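As a minimal sketch of that suggestion (the field names and column layout here are made up; match them to your actual table schema):
import numpy as np

# hypothetical record layout: one int column, one float column, one varchar(128) column
record_dtype = np.dtype([('colA', 'i4'), ('colB', 'f8'), ('colC', 'S128')])

rows = cur.fetchall()                       # list of tuples from the DB cursor
data = np.array(rows, dtype=record_dtype)   # build the structured array directly

# fromiter() variant: pass an explicit count; note that structured dtypes
# are only accepted by fromiter() in newer NumPy versions.
# data = np.fromiter(iter(rows), dtype=record_dtype, count=len(rows))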
I am new to Python and I have the following example that I don't understand.
The following is a CSV file with some data:
%%writefile wood.csv
item,material,number
100,oak,33
110,maple,14
120,oak,7
145,birch,3
Then the example defines a function to convert the tree names above to integers:
tree_to_int = dict(oak=1,
                   maple=2,
                   birch=3)

def convert(s):
    return tree_to_int.get(s, 0)
The first question is: why is there a "0" after "s"? I removed that "0" and got the same result.
The last step is to read the data with np.genfromtxt:
data = np.genfromtxt('wood.csv',
                     delimiter=',',
                     dtype=np.int,
                     names=True,
                     converters={1: convert})
I was wondering, for the converters argument, what does {1: convert} exactly mean? Especially, what does the number 1 mean in this case?
For the second question, according to the documentation (https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html), {1:convert} is a dictionary whose keys are column numbers (where the first column is column 0) and whose values are functions that convert the entries in that column.
So in this code, the 1 indicates column one of the csv file, the one with the names of the trees. Including this argument causes numpy to use the convert function to replace the tree names with their corresponding numbers in data.
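To illustrate both points in one small sketch (reusing the convert function from the example):
# The "0" is the fallback value of dict.get: it is returned when the key is missing,
# so unknown tree names become 0 instead of None.
print(convert('oak'))    # 1, 'oak' is in tree_to_int
print(convert('pine'))   # 0, 'pine' is not, so the default kicks in

# converters={1: convert} tells genfromtxt to run convert() on every entry of
# column index 1 (the 'material' column) while parsing wood.csv.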
A bit of context: I am writing code to save the data I plot to a text file. This data should be stored in such a way that it can be loaded back by a script so it can be displayed again (but this time without performing any calculation). The initial idea was to store the data in columns with the format x1,y1,x2,y2,x3,y3...
I am using code that can be simplified to something like this (incidentally, I am not sure whether using a list to group my arrays is the most efficient approach):
import numpy as np
MatrixResults = []
x1 = np.array([1,2,3,4,5,6])
y1 = np.array([7,8,9,10,11,12])
x2 = np.array([0,1,2,3])
y2 = np.array([0,1,4,9])
MatrixResults.append(x1)
MatrixResults.append(y1)
MatrixResults.append(x2)
MatrixResults.append(y2)
MatrixResults = np.array(MatrixResults)
TextFile = open('/Users/UserName/Desktop/Datalog.txt',"w")
np.savetxt(TextFile, np.transpose(MatrixResults))
TextFile.close()
However, this code gives an error when any of the data sets have different lengths. I read similar questions:
Can numpy.savetxt be used on N-dimensional ndarrays with N>2?
Table, with the different length of columns
However, these require breaking the format (either by flattening or by padding the shorter columns with filler strings).
My issue summarises as:
1) Is there any method by which, at the same time as the arrays are transposed, they are saved individually as consecutive columns?
2) Or is there any way to append columns to a text file (given a certain number of rows and columns to skip)?
3) Should I try this with another library such as pandas?
Thank you very much for any advice.
Edit 1:
After looking a bit more, it seems that leaving blank spaces is more inefficient than filling the lists.
In the end I wrote my own routine (I am not sure if there is a numpy function for this) in which I pad the shorter arrays with "nan" values so that all lengths match.
To get the data back I use the genfromtxt method and then this line:
x = x[~np.isnan(x)]
to remove those cells from the arrays.
If I find a better solution I will post it :)
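For reference, a small sketch of that padding idea (the helper name is mine, and it assumes the x1/y1/x2/y2 arrays from the snippet above):
def pad_to_length(arr, length):
    # pad a 1D array with NaN so that all columns end up with the same length
    out = np.full(length, np.nan)
    out[:len(arr)] = arr
    return out

columns = [x1, y1, x2, y2]
longest = max(len(c) for c in columns)
padded = np.array([pad_to_length(c, longest) for c in columns])
np.savetxt('/Users/UserName/Desktop/Datalog.txt', padded.T)

# when reading back with genfromtxt, strip the padding per column:
# col = col[~np.isnan(col)]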
To save your arrays you can use np.savez and read them back with np.load:
# Write to file
np.savez(filename, matrixResults)
# Read back
matrixResults = np.load(filename + '.npz')['arr_0']  # unnamed arrays are stored under 'arr_0'
As a side note you should follow naming conventions i.e. only class names start with upper case letters.
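If it helps, np.savez also accepts keyword arguments, so each array can be stored and read back by name (a small sketch using the arrays from the question):
np.savez('Datalog.npz', x1=x1, y1=y1, x2=x2, y2=y2)

loaded = np.load('Datalog.npz')
x2_back = loaded['x2']   # each array comes back with its original length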