Python Pandas: Data Slices

I am stuck on a problem with taking slices of my data in Python (I come from Matlab).
Here is the code I'm using:
import scipy.io as sc
import math as m
import numpy as np
from scipy.linalg import expm, sinm, cosm
import matplotlib.pyplot as plt
import pandas as pd
import sys
data = pd.read_excel('DataDMD.xlsx')
print(data.shape)
print(data)
The output looks like this:
Output
I want to take only certain rows (slices, as I understand they are called in Python) of this data matrix. The other problem is that the top row of my matrix is treated as the column titles instead of as actual data points. So I have three questions:
1) I don't need the top of the matrix to have any 'titles' or anything of that sort, because everything is numeric and represents data.
2) I only need to take the 6th row of the whole matrix as a new data matrix.
3) I plan on using matrix multiplication later, so can I use pandas or do I need numpy?
This is what I've tried:
data.iloc[0::6,:]
which gives me something like this:
Output2
This is wrong because I need the 24.8 values to be the first row of the new matrix, not the 'title'.
I've also tried using np.array for this, but when I try to use iloc on it I get this error (which makes sense):
'numpy.ndarray' object has no attribute 'iloc'
If anyone has any ideas, please let me know! Thanks!

To avoid loading the first record as the header, try the following:
pd.read_excel('DataDMD.xlsx', header=None)
The read_excel function has a header argument, which indicates which row of the data should be used as the header. Its default value is 0. Pass None if none of the rows in your data serves as a header.
There are many other useful arguments, all described in the documentation of the function.
This should also help with number 2.
Hope this helps.
Good luck!
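To tie the three questions together, here is a minimal sketch. It assumes a small made-up numeric table, and it uses pd.read_csv on an in-memory string as a stand-in for the Excel file, since read_csv takes the same header=None argument. The variable names are illustrative only.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the Excel sheet: 12 rows x 3 columns of purely numeric data.
raw = "\n".join(f"{r},{r + 0.5},{r * 2}" for r in range(12))

# header=None keeps the first row as data instead of using it as column titles.
data = pd.read_csv(io.StringIO(raw), header=None)

# Every 6th row starting from the first one (positions 0 and 6).
every_sixth = data.iloc[0::6, :]

# Only the 6th row (position 5), kept two-dimensional.
sixth_row = data.iloc[[5], :]

# Convert to a NumPy array before doing matrix multiplication.
A = every_sixth.to_numpy()
product = A @ A.T
```

Note that iloc[0::6, :] selects every 6th row, while iloc[[5], :] selects the 6th row alone; which one you want depends on how question 2 is meant. Once the data is a NumPy array, the @ operator performs matrix multiplication.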

Related

H2OFrame column to array: quickest way?

Suppose I have an H2OFrame called df. What is the quickest way to get the values of column x from said frame as a numpy array?
One could do
x_array = df['x'].as_data_frame()['x'].values
But that seems unnecessarily verbose. Especially passing via a pandas DataFrame with as_data_frame seems superfluous. I was hoping for something more elegant like, e.g. df['x'].to_array(). But I can't find it.

Here is another way, although I'm not sure it's faster. I use the h2o.as_list() function to convert the column to a list and then the np.array() function to convert that list to an array.
import h2o
import numpy as np
h2o.init()
# Using sample dataset from H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
## Creating np array from h2o frame column
np.array(h2o.as_list(train['x1']))

How to get data from object in Python

I want to get the discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')

You should try this:
import pandas as pd
df = pd.read_json('path_to_your/file.json')
The output will be a DataFrame, which is a matrix in which the JSON attributes are the column names. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized in terms of processing time.
Here is the official documentation, take a look.

Assuming the whole object is called myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id
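If the data is available as a plain dictionary rather than as an object, the same path can be walked key by key. The sketch below is only an illustration: the payload is made up to mirror the attribute path above, and get_path is a hypothetical helper, not part of any library.

```python
import json

# Hypothetical payload mirroring the attribute path mentioned above.
payload = json.loads("""
{"attributes": {"first_name": "Ada",
                "social_connections": {"discord": {"user_id": "1234"}}}}
""")


def get_path(obj, *keys, default=None):
    """Follow keys through nested dicts; return default if a step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


discord_id = get_path(payload, "attributes", "social_connections", "discord", "user_id")
missing = get_path(payload, "attributes", "social_connections", "twitter", "user_id")
```

Returning a default instead of raising is convenient when a user has not connected a given account.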

Converting python Dataframe to Matlab file

I am trying to convert a python Dataframe to a Matlab (.mat) file.
I initially have a txt file (an EEG signal) that I import using pandas.read_csv:
MyDataFrame = pd.read_csv("data.txt",sep=';',decimal='.'), data.txt being a 2D array with labels. This creates a dataframe which looks like this.
In order to convert it to .mat, I tried this solution where the idea is to convert the dataframe into a dictionary of lists but after trying every aspect of this solution it's still unsuccessful.
scipy.io.savemat('EEG_data.mat', {'struct':MyDataFrame.to_dict("list")})
It did create a .mat file but it did not save my dataframe properly. The file I obtain after looks like this, so all the values are basically gone, and the remaining labels you see are empty when you look into them.
I also tried using mat4py which is designed to export python structures into Matlab files, but it did not work either. I don't understand why, because converting my dataframe to a dictionary of lists is exactly what should be done according to the mat4py documentation.

I believe that the reason the previous solutions haven't worked for you is that your DataFrame column names are not valid MATLAB struct field names, because they contain spaces and/or start with digit characters.
When I do:
import pandas as pd
import scipy.io
MyDataFrame = pd.read_csv('eeg.txt',sep=';',decimal='.')
truncDataFrame = MyDataFrame[0:1000] # reduce data size for test purposes
scipy.io.savemat('EEGdata1.mat', {'struct1':truncDataFrame.to_dict("list")})
the result in MATLAB is a struct with the 4 fields reltime, datetime, iSensor and quality. Each of these has 1000 elements, so the data from these columns has been converted, but the rest of your data is missing.
However if I first rename the DataFrame columns:
truncDataFrame.rename(columns=lambda x:'col_' + x.replace(' ', '_'), inplace=True)
scipy.io.savemat('EEGdata2.mat', {'struct2':truncDataFrame.to_dict("list")})
the result in MATLAB is a struct with 36 fields. This is not the same format as your mat4py solution but it does contain (as far as I can see) all the data from the source DataFrame.
(Note that in your question, you are creating a .mat file that contains a variable called struct and when this is loaded into MATLAB it masks the builtin struct datatype - that might also cause issues with subsequent MATLAB code.)
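As a rough sketch of the renaming idea, the snippet below cleans the column names, saves the frame and loads it back to check which fields survived. The frame, the matlab_safe helper and the variable name eeg are all made up for illustration, and simplify_cells assumes SciPy 1.5 or later.

```python
import os
import re
import tempfile

import pandas as pd
import scipy.io

# Hypothetical frame whose column names are not valid MATLAB field names.
df = pd.DataFrame({"rel time": [0.0, 0.1], "1st channel": [5.2, 5.3], "quality": [1, 1]})


def matlab_safe(name):
    """Replace non-word characters with '_' and prefix names not starting with a letter."""
    cleaned = re.sub(r"\W", "_", str(name))
    return cleaned if cleaned[:1].isalpha() else "col_" + cleaned


safe_df = df.rename(columns=matlab_safe)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "eeg_sketch.mat")
    # 'eeg' is an arbitrary variable name that avoids masking MATLAB's struct.
    scipy.io.savemat(path, {"eeg": safe_df.to_dict("list")})
    # Load the file back to check which fields were actually written.
    loaded = scipy.io.loadmat(path, simplify_cells=True)

fields = sorted(loaded["eeg"].keys())
```

Loading the file back in Python is a quick way to check the result before opening it in MATLAB.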

I finally found a solution thanks to this post. There, the poster did not create a dictionary of lists but a dictionary of integers, which worked on my side. It is a small, easily reproducible example. Then I tried to manually add lists by entering values like [1, 2], and it did not work. What did work was manually adding tuples!
MyDataFrame needs to be converted to a dictionary, and if a dictionary of lists doesn't work, try tuples.
For beginners: lists are delimited by [] and tuples by (). Here is an image showing both.
This worked for me:
import mat4py as mp
EEGdata = MyDataFrame.apply(tuple).to_dict()
mp.savemat('EEGdata.mat',{'structs': EEGdata})
EEGdata.mat should now be readable by Matlab, as it is on my side.

Deleting rows of data for multiple variables

I have over 500 files that I cleaned up using a pandas data frame and read in later as a matrix. I now want to delete rows with missing data from multiple variables across all of my files. Each variable is fairly large; for example, tc and wspd have the shape (84479, 558) and pressure has the shape (558,). I have used the following approach before, and it worked for one-dimensional arrays with the same shape, but it no longer works with a two-dimensional array.
bad=[]
for i in range(len(p)):
    if p[i]==-9999 or tc[i]==-9999:
        bad.append(i)
p=numpy.delete(p, bad)
tc=numpy.delete(tc, bad)
I tried using the following code instead but with no success (unfortunately).
import numpy as n
import pandas as pd
wspd=pd.read_pickle('/home/wspd').as_matrix()
tc=pd.read_pickle('/home/tc').as_matrix()
press=n.load('/home/file1.npz')
p=press['press']
names=press['names']
length=n.arange(0,84479)
for i in range(len(names[0])): #using the first one as a trial to run faster
    print i #used later to see how far we have come in the 558 files
    bad=[]
    for j in range(len(length)):
        if (wspd[j,i]==n.nan or tc[j,i]==n.nan):
            bad.append(j)
    print bad
From there I plan on deleting the missing data as I had done previously, except that I index which dimension I am deleting from within my first for loop.
new_tc=n.delete(tc[j,:], bad)
Unfortunately, this has not worked. I have also tried masking the array which also has not worked.
The reason I need to delete the data is my next library does not understand nan values, it requires strictly integers, floats, etc.
I am open to new methods for removing rows of data if anyone has any guidance. I greatly appreciate it.

I would load your two-dimensional arrays as pandas DataFrames and then use the dropna function to drop any rows that contain a null value:
wspd = pd.read_pickle('/home/wspd').dropna()
tc = pd.read_pickle('/home/tc').dropna()
The documentation for pandas.DataFrame.dropna is here
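One thing to watch: a comparison such as x == numpy.nan is always False, because NaN never compares equal to itself, so the loop in the question never finds a bad row. Also, calling dropna on each frame separately can leave wspd and tc with different rows. The sketch below, using small made-up arrays in place of the real data, builds one shared mask so both arrays stay aligned.

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-ins for the wspd and tc arrays (rows x files).
wspd = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0], [7.0, -9999.0]])
tc = np.array([[10.0, 20.0], [30.0, 40.0], [np.nan, 60.0], [70.0, 80.0]])

# Treat the -9999 sentinel as missing as well.
wspd = np.where(wspd == -9999, np.nan, wspd)
tc = np.where(tc == -9999, np.nan, tc)

# Use np.isnan rather than ==, and flag a row if either array has a gap in it.
bad_rows = np.isnan(wspd).any(axis=1) | np.isnan(tc).any(axis=1)

# Keep only rows that are complete in both arrays so they stay aligned.
wspd_clean = wspd[~bad_rows]
tc_clean = tc[~bad_rows]

# The pandas route from the answer, applied to one array on its own.
wspd_dropna = pd.DataFrame(wspd).dropna().to_numpy()
```

Boolean masking also avoids the explicit double loop, which matters for arrays of shape (84479, 558).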

Excel worksheet to Numpy array

I'm trying to do an unbelievably simple thing: load parts of an Excel worksheet into a Numpy array. I've found a kludge that works, but it is embarrassingly unpythonic:
say my worksheet was loaded as "ws", the code:
A = np.zeros((37,3))
for i in range(2,39):
    for j in range(1,4):
        A[i-2,j-1]= ws.cell(row = i, column = j).value
loads the contents of "ws" into array A.
There MUST be a more elegant way to do this. For instance, csvread allows to do this much more naturally, and while I could well convert the .xlsx file into a csv one, the whole purpose of working with openpyxl was to avoid that conversion. So there we are, Collective Wisdom of the Mighty Intertubes: what's a more pythonic way to perform this conceptually trivial operation?
Thank you in advance for your answers.
PS: I operate Python 2.7.5 on a Mac via Spyder, and yes, I did read the openpyxl tutorial, which is the only reason I got this far.

You could do
A = np.array([[i.value for i in j] for j in ws['C1':'E38']])
EDIT - further explanation.
(Firstly, thanks for introducing me to openpyxl; I suspect I will use it quite a bit from time to time.)
Getting multiple cells from the worksheet object produces a generator. This is probably much more efficient if you want to work your way through a large sheet, as you can start straight away without waiting for it all to load into a list.
To force a generator to make a list you can either use list(ws['C1':'E38']) or a list comprehension as above.
Each row is a tuple (even if only one column wide) of Cell objects. These carry a lot more than just a number, but if you want the number for your array you can use the .value attribute. This is really the crux of your question: csv files don't contain the structured info of an Excel spreadsheet.
There isn't (as far as I can tell) a built-in method for extracting values from a range of cells, so you will have to do something much like what you have sketched out.
The advantages of doing it my way are: no need to work out the dimensions of the array and make an empty one to start with, no need to work out the corrected index numbers of the np array, and list comprehensions are faster. The disadvantage is that it needs the "corners" defined in "A1" format. If the range isn't known then you would have to use iter_rows, rows or columns:
A = np.array([[i.value for i in j[2:5]] for j in ws.rows])
If you don't know how many columns there are, then you will have to loop and check values, more like your original idea.
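For what it's worth, here is a small sketch of the iter_rows route. It assumes openpyxl 2.6 or later, where iter_rows accepts values_only, and it builds a made-up worksheet in memory in place of the asker's "ws".

```python
import numpy as np
from openpyxl import Workbook

# Build a small in-memory worksheet as a stand-in for the asker's "ws".
wb = Workbook()
ws = wb.active
ws.append(["a", "b", "c"])  # header row, skipped below
for r in range(1, 5):
    ws.append([r, r * 10, r * 0.5])

# Rows 2-5, columns A-C, returned as plain values rather than Cell objects.
rows = ws.iter_rows(min_row=2, max_row=5, min_col=1, max_col=3, values_only=True)
A = np.array(list(rows), dtype=float)
```

With numeric bounds there is no need to spell out the corners in "A1" format, and ws.max_row and ws.max_column can supply the bounds when the size of the sheet is not known in advance.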

If you don't need to load data from multiple files in an automated manner, the package tableconvert I recently wrote may help. Just copy and paste the relevant cells from the excel file into a multiline string and use the convert() function.
import numpy as np
from tableconvert.converter import convert
array = convert("""
123 456 3.14159
SOMETEXT 2,71828 0
""")
print(type(array))
print(array)
Output:
<class 'numpy.ndarray'>
[[ 123. 456. 3.14159]
[ nan 2.71828 0. ]]
