Excluding certain rows while importing data with Numpy - python

I am generating data-sets from experiments. I end up with csv data-sets that are typically n x 4 (n rows, with n > 1000, and 4 columns). However, due to an artifact of the data-collection process, the first couple of rows and the last couple of rows typically have only 2 or 3 columns. So a data-set looks like:
8,0,4091
8,0,
8,0,4091,14454
10,0,4099,14454
2,0,4094,14454
8,-3,4104,14455
3,0,4100,14455
....
....
14,-1,4094,14723
0,3,4105,14723
7,0,4123,14723
7,
6,-2,4096,
3,2,
As you can see, the first two rows and the last three don't have the 4 columns that I expect. When I try importing this file using np.loadtxt(filename, delimiter = ','), I get an error. Once I remove the rows which have fewer than 4 columns (first 2 rows, and last 3 rows, in this case), the import works fine.
Two questions:
Why doesn't the usual import work? I am not sure what the exact error is. In other words, why is it a problem that not all rows have the same number of columns?
As a workaround, I know how to ignore the first two rows while importing the files with numpy np.loadtxt(filename, skiprows= 2), but is there a simple way to also select a fixed number of rows at the bottom to ignore?
Note: This is NOT about finding unique rows in a numpy array. It's more about importing csv data that are non-uniform in the number of columns that each row contains.

Your question is similar (duplicate) to Using genfromtxt to import csv data with missing values in numpy
1) I'm not sure about why this is the default behavior.
Could be to warn users that the CSV file might be corrupt.
Could be to optimize the array and make it N x M, instead of having multiple column lengths.
2) Use numpy's genfromtxt. For this you'll need to know the correct number of columns in advance.
data = numpy.genfromtxt('data.csv', delimiter=',', usecols=[0,1,2,3], invalid_raise=False)
Hope this helps!

You can use genfromtxt, which allows you to skip lines at the beginning and at the end:
np.genfromtxt('array.txt', delimiter=',', skip_header=2, skip_footer=3)
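To make both answers concrete, here is a minimal, runnable sketch (the file name array.txt and the truncated sample are assumptions) that writes a few of the rows shown above and reads back only the well-formed middle:

```python
import numpy as np

# A truncated version of the sample data: 2 short rows at the top,
# 3 short rows at the bottom, well-formed 4-column rows in between.
sample = (
    "8,0,4091\n"
    "8,0,\n"
    "8,0,4091,14454\n"
    "10,0,4099,14454\n"
    "2,0,4094,14454\n"
    "7,\n"
    "6,-2,4096,\n"
    "3,2,\n"
)
with open('array.txt', 'w') as f:
    f.write(sample)

# Skip a fixed number of malformed rows at the top and at the bottom
data = np.genfromtxt('array.txt', delimiter=',', skip_header=2, skip_footer=3)
print(data.shape)  # (3, 4)
```

If the number of bad rows varies from file to file, invalid_raise=False with usecols, as in the first answer, is the more robust choice.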

Related

Splitting a datafile into multiple, infinite, columns for use in MatPlotLib WITHOUT Pandas

I'm very new to Python (Python 3.6 to be exact), and I'm having difficulty with str.split().
I have a .dat file with 11 columns of numbers in several rows (in string format for now) but I need to plot a graph with the data.
The first column of numbers is "x", and the 10 other columns are "Y". I've split files into 2 columns before, but not 11. One of the requisites was that it needs to be infinitely expandable, and that's what I can't figure out.
So here is what I have so far:
#Make Columns Data_X and _Y
data_X=[]
data_Y=[]
#Open file.dat in Python and split columns
file = open('file.dat', 'r')
for line in file.readlines():
    data_x, data_y, data_y2, data_y3, ..., data_y10 = line.split()
Then after this:
#Convert string to float
data_X = numpy.array(x, dtype=float)
data_Y = numpy.array(y, dtype=float)
This can make the 11 columns, and I can convert them to floats afterwards for my plot, but I know this isn't infinitely repeatable (a y12 column will break it)... and I'm not so sure about the data_X/data_Y=[] part either.
How do I split the strings into columns with the potential to do it infinitely? A big stipulation is that I can't use pandas either (on top of that, I don't know what it does).
Thank you, and I'm sorry if this has been asked a lot, but the closest solution I found to my problem was this, which only brought up one row:
for line in file.readlines():
    data_X, data_Y = line.split(' ', 1)
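Since no answer follows, here is one possible sketch of the "infinitely expandable" split without pandas: read every row into a list of lists and transpose it with zip(*...), so no column ever has to be named individually (the sample file contents below are made up):

```python
# Write a small made-up whitespace-delimited file: x plus three Y columns
sample = "1 10 20 30\n2 11 21 31\n3 12 22 32\n"
with open('file.dat', 'w') as f:
    f.write(sample)

# Read every row as a list of floats, however many columns there are
rows = []
with open('file.dat') as f:
    for line in f:
        if line.strip():  # skip blank lines
            rows.append([float(v) for v in line.split()])

columns = list(zip(*rows))  # transpose: one tuple per column
data_X = columns[0]         # first column is x
data_Y = columns[1:]        # every remaining column is a Y, however many
print(len(data_Y))  # 3
```

A 12th (or 100th) Y-column simply becomes another entry in data_Y, which is what makes this approach expandable.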

Partition matrix into smaller matrices based on multiple values

So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity of what I'm trying to do, I'll work with this much smaller matrix as an example for what I'm trying to do. The 7 columns consist of (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). So using pandas, I've imported from an excel sheet my matrix called A that looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together -- this would have me wind up with 3 separate 2x7 matrices). The reason for this is because I need to run a bunch of statistics on the dataset of numbers of the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices on the real data set I have, I was trying to figure out some way to make the smaller matrices be labeled something like M1, M2, ..., M500 (or whatever number it ends up being) so that way later, I can use some loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.as_matrix() #Convert excel sheet to matrix
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv(r"C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
df.set_index(["Label 1", "Label 2"], inplace=True)
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2" (they will be useful for iterating later), and in the df.set_index line we moved those columns out of the data and into a MultiIndex - now they act as indices for your other columns. For example, in order to access the slice of your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
Lastly, instead of saving each sliced DataFrame into a list or dictionary, and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create your sliced DataFrame, you perform the calculations, save them to an output/results file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
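As a hedged side note, pandas can also express this slice-then-compute pattern in one step with groupby, which avoids building the index by hand (the column names and values below are made up to mirror the question):

```python
import pandas as pd

# Toy stand-in for the real DataFrame; column names are assumptions
df = pd.DataFrame({
    "Label 1": ["13G", "13G", "14G", "14G", "14G", "14G"],
    "Label 2": ["Aa", "Aa", "Aa", "Aa", "Ab", "Ab"],
    "Marker":  [1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
})

# One statistic per (Label 1, Label 2) pair, with no manual slicing
stats = df.groupby(["Label 1", "Label 2"])["Marker"].mean()
print(stats)
```

Replace .mean() with .agg([...]) if you need several statistics per group.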

error only for one column by using Genfromtxt. All other columns could be read. how can i fix it?

I am new to python and I want to read my data from a .txt file. Apart from the header it contains only floats. I have 6 columns and very many rows.
To read it, I'm using genfromtxt. If I want to read the first two columns it works, but if I want to read the 5th column I get the following error:
Line #1357451 (got 4 columns instead of 4)
here's my code:
import numpy as np
data=np.genfromtxt(dateiname, skip_header=1, usecols=(0,1,2,5))
print(data[0:2, 0:3])
I think there are missing some values in the 5th column, so it doesn't work.
Does anyone have an idea how to fix my problem and read the data in the 5th column?
From the genfromtxt docs:
Notes
-----
* When spaces are used as delimiters, or when no delimiter has been given
as input, there should not be any missing data between two fields.
If all columns, including the missing ones, line up properly, you could use a fixed-column-width version of the delimiter. From the same docs: an integer or sequence of integers can also be provided as the width(s) of each field.
When a line looks like:
one, 2, 3.4, , 5, ,
it can unambiguously identify 7 columns. If instead it is
one 2 3.4 5
it can only identify 4 columns (in general two blanks count as one, etc, and trailing blanks are ignored)
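To illustrate the fixed-width alternative from the docs, here is a sketch with made-up field widths of 4, 6 and 5 characters; because the widths, not the spaces, delimit the fields, the blank middle field on the second line is still recognised as a (missing) column:

```python
import numpy as np

# Two fixed-width lines: fields of 4, 6 and 5 characters.
# The middle field of the second line is entirely blank.
text = ("   1" + "   2.5" + "  3.0" + "\n" +
        "   4" + "      " + "  6.0" + "\n")
with open('fixed.txt', 'w') as f:
    f.write(text)

# delimiter takes a sequence of field widths; autostrip removes the
# padding so the blank field is treated as missing (nan)
data = np.genfromtxt('fixed.txt', delimiter=[4, 6, 5], autostrip=True)
print(data)
```

Here the second row comes back as 4.0, nan, 6.0 instead of collapsing into fewer columns.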
I found another solution. With filling_values=0 I could fill the empty values with zero. Now it is working! :)
import numpy as np
data=np.genfromtxt(dateiname, skip_header=1, usecols=(0,1,2,5), delimiter='\t', invalid_raise=False, filling_values=0)
Furthermore, I didn't leave the delimiter at its default any more but set it to a tab, and with invalid_raise=False lines with the wrong number of columns are skipped instead of raising an error.

Reading multiple CSVs into different arrays

Update. Here is my code. I am importing 400 csv files into one list. Each csv file is 200 rows and 5 columns. My end goal is to sum the values in the 4th column of each csv file. The code below imports all the csv files. However, I am struggling to isolate the 4th column of each csv file from the large list.
data = list()
for i in range(1, 400, 1):
    datafile = 'particle_path_%d' % i
    data.append(np.genfromtxt(datafile, delimiter="", skip_header=2))
    print datafile
I want to read 100 csv files into 100 different arrays in python. For example:
array1 will have csv1
array2 will have csv2 etc etc.
Whats the best way of doing this? I am appending to a list right now but I have one big list which is proving difficult to split into smaller lists. My ultimate goal is to be able to perform different operations of each array (add, subtract numbers etc)
Could you provide more detail on what needs to be done? If you are simply trying to read line by line in the csv files and make that the array then this should work:
I would create a 2 dimensional array for this, something like:
csv_array_container = []
for csv_file in csv_files:
    csv_lines = csv_file.readlines()
    csv_array_container.append(csv_lines)
# Now close your file handlers
Assuming that csv_files is a list of open file_handlers for the csv files. Something more appropriate would likely open the files in the loop and close them after use rather than open 100, gather data, and close 100 due to limits on file handlers.
If you would like more detail on this, please give us more info on what you are exactly trying to do with examples. Hope this helps.
So you have a list of 100 arrays. What can you tell us about their shapes?
If they all have the same shape you could use
arr = np.stack(data)
I expect arr.shape will be (100,200,5)
fthcol = arr[:,:,3] # 4th column
If they aren't all the same, then a simple list comprehension will work
fthcol = [a[:,3] for a in data]
Again, depending on the shapes you could np.stack(fthcol) (choose your axis).
Don't be afraid to iterate over the elements of the data list. With 100 items the cost won't be prohibitive.
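For the stated end goal, summing the 4th column of every file, a sketch building on the list of arrays (the array contents here are made-up stand-ins for the genfromtxt results):

```python
import numpy as np

# Stand-in for the list the import loop builds: three 200x5 arrays
data = [np.full((200, 5), float(i)) for i in range(1, 4)]

# Per-file sums of the 4th column (index 3), then an overall total
per_file = [a[:, 3].sum() for a in data]
total = sum(per_file)
print(per_file)  # [200.0, 400.0, 600.0]
```

The same list comprehension works whether or not the arrays share a shape, which is why it is the safer default here.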

Append multiple columns into two columns python

I have a csv file which contains approximately 100 columns of data. Each column represents temperature values taken every 15 minutes throughout the day for each of the 100 days. The header of each column is the date for that day. I want to convert this into two columns, the first being the date time (I will have to create this somehow), and the second being the temperatures stacked on top of each other for each day.
My attempt:
with open("original_file.csv") as ofile:
    stack_vec = []
    next(ofile)
    for line in ofile:
        columns = line.split(',')  # get all the columns
        for i in range(0, len(columns)):
            stack_vec.append(columns[i])
np.savetxt("converted.csv", stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column and add 15 minutes to the date time for each row. Any help would be greatly appreciated.
If I got this correct, you have a csv with 96 rows and 100 columns and want to stack it into one vector day after day, i.e. a vector with 9600 entries, right?
An easy approach would be to use numpy:
import numpy as np
x = np.genfromtxt('original_file.csv', delimiter=',')
data = x.ravel(order ='F')
Note numpy is a third party library but the go-to library for math.
The first line reads the csv into an ndarray, which is like a matrix (even though it behaves differently for some mathematical operations).
Then with ravel you flatten it into a vector. The order='F' argument makes it read column by column, so it stacks the columns on top of each other, i.e. day after day. (Leave it at the default if you want time point after time point instead.)
For your date problem, see How can I make a python numpy arange of datetime.
Once you have these two arrays, you can ensure the shape with x.reshape(9600, 1) and then stack them with np.concatenate([x, dates], axis=1), with dates being your date vector.
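To sketch the date vector itself (the start date and the 3-day span are assumptions; the question has 100 days), numpy's datetime64 arithmetic broadcasts a per-day start date against 15-minute offsets, and ravel(order='F') flattens the grid day after day to match the temperature vector:

```python
import numpy as np

days = np.arange('2015-07-01', '2015-07-04', dtype='datetime64[D]')  # 3 days
times = np.arange(0, 24 * 60, 15, dtype='timedelta64[m]')            # 96 steps

# (96, 3) grid of timestamps, flattened day after day like the temperatures
dates = (days[np.newaxis, :] + times[:, np.newaxis]).ravel(order='F')
print(dates[0], dates[-1])  # 2015-07-01T00:00 2015-07-03T23:45
```

Swapping the date range for the real first day and 100-day span gives a vector the same length as the stacked temperatures.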
