Deleting rows of data for multiple variables - python

I have over 500 files that I cleaned up using a pandas DataFrame and later read in as a matrix. I now want to delete rows with missing data across multiple variables for all of my files. Each variable is fairly large; for example, tc and wspd have shape (84479, 558) and pressure has shape (558,). The following approach has worked for me in the past for one-dimensional arrays of the same shape, but it no longer works with a two-dimensional array.
bad = []
for i in range(len(p)):
    if p[i] == -9999 or tc[i] == -9999:
        bad.append(i)
p = numpy.delete(p, bad)
tc = numpy.delete(tc, bad)
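For reference, the same filter can be written without the explicit loop; a minimal sketch using a boolean mask (assuming p and tc are aligned one-dimensional arrays):
keep = (p != -9999) & (tc != -9999)  # True where neither variable is missing
p = p[keep]
tc = tc[keep]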
I tried using the following code instead but with no success (unfortunately).
import numpy as n
import pandas as pd

wspd = pd.read_pickle('/home/wspd').as_matrix()
tc = pd.read_pickle('/home/tc').as_matrix()
press = n.load('/home/file1.npz')
p = press['press']
names = press['names']
length = n.arange(0, 84479)

for i in range(len(names[0])):  # using the first one as a trial to run faster
    print i  # used later to see how far we have come in the 558 files
    bad = []
    for j in range(len(length)):
        if (wspd[j,i] == n.nan or tc[j,i] == n.nan):
            bad.append(j)
    print bad
From there I plan on deleting missing data as I had done previously, except indexing which dimension I am deleting from within my first for loop.
new_tc = n.delete(tc[j,:], bad)
Unfortunately, this has not worked. I have also tried masking the array which also has not worked.
The reason I need to delete the data is my next library does not understand nan values, it requires strictly integers, floats, etc.
I am open to new methods for removing rows of data if anyone has any guidance. I greatly appreciate it.

I would load your two-dimensional arrays as pandas DataFrames and then use the dropna function to drop any rows that contain a null value:
wspd = pd.read_pickle('/home/wspd').dropna()
tc = pd.read_pickle('/home/tc').dropna()
The documentation for pandas.DataFrame.dropna is here
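Note also that the loop in the question never finds anything because wspd[j,i] == n.nan is always False; NaN never compares equal to anything, so it has to be detected with numpy.isnan or pandas isnull. And if the two arrays must stay row-aligned, dropping rows from each frame independently can desynchronize them. A minimal sketch that drops a row when either frame has a NaN in it (assuming wspd and tc share the same row index):
import numpy as np
import pandas as pd

wspd = pd.read_pickle('/home/wspd')
tc = pd.read_pickle('/home/tc')

# keep only rows where neither frame contains any NaN
good = ~(wspd.isnull().any(axis=1) | tc.isnull().any(axis=1))
wspd = wspd[good]
tc = tc[good]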

Related

Can't merge dataframes because of index type mismatch

I have loaded some data from CSV files into two dataframes, X and Y that I intend to perform some analysis on. After cleaning them up a bit I can see that the indexes of my dataframes appear to match (they're just sequential numbers), except one has index with type object and the other has index with type int64. Please see attached image for a clearer idea of what I'm talking about.
I have tried manually altering this using X.index.astype('int64') and also X.reindex(Y.index), but neither seems to do anything here. Could anyone suggest anything?
Edit: Adding some additional info in case it is helpful. X was imported as row data from the csv file and transposed, whereas Y was imported directly with the index set from the first column of the csv file.
So I've realised what I've done and it was pretty dumb. I should have written
X.index = X.index.astype('int64')
instead of just
X.index.astype('int64')
Oh well, the more you know.
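For anyone else tripped up by this, a minimal sketch of the difference (hypothetical two-row frame):
import pandas as pd

X = pd.DataFrame({'a': [1, 2]}, index=['0', '1'])  # index dtype: object

X.index.astype('int64')            # returns a new Index; X is unchanged
X.index = X.index.astype('int64')  # reassigning actually converts it
print(X.index.dtype)               # int64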

Python Pandas: Data Slices

I am stuck with an issue when it comes to taking slices of my data in python (I come from using Matlab).
So here is the code I'm using,
import scipy.io as sc
import math as m
import numpy as np
from scipy.linalg import expm, sinm, cosm
import matplotlib.pyplot as plt
import pandas as pd
import sys
data = pd.read_excel('DataDMD.xlsx')
print(data.shape)
print(data)
The output looks like this:
[screenshot: Output]
So I wish to take only certain rows (or, from my understanding, slices in Python) of this data matrix. The other problem I have is that the top row of my matrix becomes almost like the titles of the columns instead of actual data points. So I have a few problems:
1) I don't need the top of the matrix to have any 'titles' or anything of that sort, because it's all numeric and all represents data.
2) I only need to take the 6th row of the whole matrix as a new data matrix.
3) I plan on using matrix multiplication later, so is pandas allowed or do I need NumPy?
So this is what I've tried,
data.iloc[0::6,:]
this gives me something like this,
[screenshot: Output2]
which is wrong because I don't need the value 24.8 to be the 'title' but the first row of the new matrix.
I've also tried using np.array for this, but my problem is that when I try to use iloc, it says (which makes sense)
'numpy.ndarray' object has no attribute 'iloc'
If anyone has any ideas, please let me know! Thanks!
To avoid loading the first record as the header, try using the following:
pd.read_excel('DataDMD.xlsx', header=None)
The read_excel function has a header argument; its value indicates which row of the data should be used as the header, and it defaults to 0. Pass None as the value of the header argument if none of the rows in your data functions as the header.
There are many useful arguments, all described in the documentation of the function.
This should also help with number 2.
Hope this helps.
Good luck!
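Putting that together with your questions 2 and 3, a minimal sketch (the row position is taken from your description; .values works in place of .to_numpy() on older pandas):
import pandas as pd

# header=None keeps the first row as data instead of column titles
data = pd.read_excel('DataDMD.xlsx', header=None)

row6 = data.iloc[5, :]     # the 6th row (0-based position 5)
matrix = data.to_numpy()   # plain NumPy array for matrix multiplication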

Excel worksheet to Numpy array

I'm trying to do an unbelievably simple thing: load parts of an Excel worksheet into a Numpy array. I've found a kludge that works, but it is embarrassingly unpythonic:
say my worksheet was loaded as "ws", the code:
A = np.zeros((37,3))
for i in range(2,39):
    for j in range(1,4):
        A[i-2,j-1] = ws.cell(row=i, column=j).value
loads the contents of "ws" into array A.
There MUST be a more elegant way to do this. For instance, csvread allows this to be done much more naturally, and while I could well convert the .xlsx file into a csv one, the whole purpose of working with openpyxl was to avoid that conversion. So there we are, Collective Wisdom of the Mighty Intertubes: what's a more pythonic way to perform this conceptually trivial operation?
Thank you in advance for your answers.
PS: I operate Python 2.7.5 on a Mac via Spyder, and yes, I did read the openpyxl tutorial, which is the only reason I got this far.
You could do
A = np.array([[i.value for i in j] for j in ws['C1':'E38']])
EDIT - further explanation.
(Firstly, thanks for introducing me to openpyxl; I suspect I will use it quite a bit from time to time.)
- The method of getting multiple cells from the worksheet object produces a generator. This is probably much more efficient if you want to work your way through a large sheet, as you can start straight away without waiting for it all to load into your list.
- To force a generator to make a list you can either use list(ws['C1':'E38']) or a list comprehension as above.
- Each row is a tuple (even if only one column wide) of Cell objects. These have a lot more about them than just a number, but if you want the number for your array you can use the .value attribute. This is really the crux of your question: csv files don't contain the structured info of an Excel spreadsheet.
- There isn't (as far as I can tell) a built-in method for extracting values from a range of cells, so you will have to do something effectively as you have sketched out.
The advantages of doing it my way are: no need to work out the dimensions of the array and make an empty one to start with, no need to work out the corrected index numbers for the np array, and list comprehensions are faster. The disadvantage is that it needs the "corners" defined in "A1" format. If the range isn't known, then you would have to use iter_rows, rows or columns:
A = np.array([[i.value for i in j[2:5]] for j in ws.rows])
if you don't know how many columns then you will have to loop and check values more like your original idea
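For example, a minimal sketch using iter_rows (the values_only flag needs a reasonably recent openpyxl; the filename is hypothetical):
import numpy as np
from openpyxl import load_workbook

ws = load_workbook('workbook.xlsx').active

# restrict iteration to rows 1..38 and columns C..E (3..5);
# values_only=True yields plain values instead of Cell objects
A = np.array(list(ws.iter_rows(min_row=1, max_row=38,
                               min_col=3, max_col=5,
                               values_only=True)),
             dtype=float)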
If you don't need to load data from multiple files in an automated manner, the package tableconvert I recently wrote may help. Just copy and paste the relevant cells from the excel file into a multiline string and use the convert() function.
import numpy as np
from tableconvert.converter import convert
array = convert("""
123 456 3.14159
SOMETEXT 2,71828 0
""")
print(type(array))
print(array)
Output:
<class 'numpy.ndarray'>
[[ 123. 456. 3.14159]
[ nan 2.71828 0. ]]

Creating and Storing Multi-Dimensional Array in a netCDF File

This question has potentially two parts, but maybe only one if the first part can be encapsulated by the second. I am using Python with numpy and netCDF4.
First:
I have four lists of different variable values (hereafter referred to as elevation values), each of which has a length of 28. There is one such set of four lists for each of 5 latitude values, and one set of those for each of 24 time values.
So 24 times...each time with 5 latitudes...each latitude with four lists...each list with 28 values.
I want to create an array with the following dimensions (elevation, latitude, time, variable)
In words, I want to be able to specify which of the four lists I access,which index in the list, and specify a specific time and latitude. So an index into this array would look like this:
array(0,1,2,3), where the 3 selects the 4th list, the 0 specifies the first index of that list, the 1 specifies the 2nd latitude, and the 2 specifies the 3rd time; the output is the value at that point.
I won't include my code for this part, since literally the only things worth mentioning are the lists:
list1=[...]
list2=[...]
list3=[...]
list4=[...]
How can I do this, is there an easier structure for the array, or is there anything else I am missing?
Second:
I have created a netCDF file with variables with these four dimensions. I need to set those variables to the array structure made above. I have no idea how to do this, and the netCDF4 documentation only demonstrates a 1-d array, in a fairly cryptic way. If the arrays can be made directly in the netCDF file, bypassing the need to use numpy first, by all means show me how.
Thanks!
After talking to a few people where I work we came up with this solution:
First we made an array of zeros using the following call:
array1=np.zeros((28,5,24,4))
Then we filled in this array by specifying where in the array we wanted the values to go:
array1[:,0,0,0]=list1
This inserted the values of the list into the first entry in the array.
Next, to write the array to a netCDF file, I created a netCDF file in the same program I made the array in, made a single variable, and gave it values like this:
netcdfvariable[:]=array1
Hope that helps anyone who finds this.
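For completeness, a minimal sketch of the whole round trip (the dimension names, variable name, and filename here are made up; adjust to your file):
import numpy as np
from netCDF4 import Dataset

array1 = np.zeros((28, 5, 24, 4))
# array1[:, 0, 0, 0] = list1  # fill in each list as above

nc = Dataset('output.nc', 'w')
for name, size in [('elevation', 28), ('latitude', 5),
                   ('time', 24), ('variable', 4)]:
    nc.createDimension(name, size)

var = nc.createVariable('data', 'f8',
                        ('elevation', 'latitude', 'time', 'variable'))
var[:] = array1
nc.close()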

Conditionally add 1 to int element of a NumPy record array

I have a large NumPy record array, 250 million rows by 9 columns (MyLargeRec), and I need to add 1 to the 7th column (dtype = "int") if the index of that row is in another list of 300,000 integers (MyList). If this were a normal Python list I would use the following simple code...
for m in MyList:
    MyLargeRec[m][6] += 1
However, I cannot seem to get similar functionality using the NumPy record array. I have tried a few options, such as nditer, but these will not let me select the specific indices I want.
Now you may say that this is not what NumPy was designed for, so let me explain why I am using this format: it only takes 30 mins to build the record array from scratch, whereas it takes over 24 hours using a conventional 2D list format. I spent all of yesterday trying to find a way to do this and could not; I eventually converted it to a list using...
MyLargeList = list(MyLargeRec)
so I could use the simple code above to achieve what I want; however, this took 8.5 hours to perform.
Therefore, can anyone tell me: first, is there a method to achieve what I want within a NumPy record array? And second, if not, any ideas on the best methods within Python 2.7 to create, update and store such a large 2D matrix?
Many thanks
Tom
your_array[index_list, 6] += 1
Numpy allows you to construct some pretty neat slices. This selects column index 6 (the 7th column) of all rows in your list of indices and adds 1 to each. (Note that if an index appears multiple times in your list of indices, this will still only add 1 to the corresponding cell, because the operation is buffered.)
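If duplicate indices ever do need to accumulate, NumPy has an unbuffered variant; a minimal sketch (assuming the array behaves as a 2D integer array, as above):
import numpy as np

MyLargeRec[MyList, 6] += 1             # buffered: at most +1 per listed row
np.add.at(MyLargeRec, (MyList, 6), 1)  # unbuffered: +1 per occurrence in MyList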
This code...
for m in MyList:
    MyLargeRec[m][6] += 1
does actually work, silly question by me.
