I'm really new to Python and pandas, so would you please help me answer this seemingly simple question? I already have an Excel file containing my data; now I want to create an array containing those data in Python. For example, I have data in Excel that look like this:
From those data I want to create a matrix like the Python code below:
Actually, my data is much longer, so is there any way I can take advantage of pandas to put the data from my Excel file into a matrix in Python, similar to the simple example above?
Thank you!
You can put all your values into an array with a = np.array([40,56,87,98,58,98,56,63]) and then reshape it with a.reshape(4,2); in your case it would be a.reshape(3,9). Hope you get my point.
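Spelled out as a runnable sketch (using the 8 sample values above; the reshape(3,9) variant assumes the asker's data has 27 values, since reshape only works when the sizes match):

import numpy as np

a = np.array([40, 56, 87, 98, 58, 98, 56, 63])
m = a.reshape(4, 2)  # 8 values -> 4 rows x 2 columns
print(m)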
You can use pandas.read_excel()
The documentation also has some examples, like:
pd.read_excel('tmp.xlsx', index_col=0)
       Name  Value
0   string1      1
1   string2      2
2  #Comment      3
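If the goal is a NumPy matrix rather than a DataFrame, here is a minimal sketch building on read_excel (the file name data.xlsx and the header=None option are assumptions about your sheet):

import pandas as pd

# header=None assumes the sheet holds only raw numbers, with no column-name row
df = pd.read_excel('data.xlsx', header=None)
matrix = df.to_numpy()  # plain 2-D numpy array with the sheet's values
print(matrix.shape)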
I am doing a machine learning project with phone sensor data (accelerometer). I need to preprocess the dataset before I export it to the ML model. I have 25 classes (the alphabets in the dataset) and 20 subjects (the number of times I recorded each alphabet) per class. Since the lengths differ for each class and subject, I have to resample. I want to split a single csv file by class and subject to be able to resample. I have tried things like groupby(), but they did not work. I would be glad if you could share thoughts on what I can do about this problem. This is my first time asking a question on this site; if I made a mistake, I would appreciate it if you pointed it out. Thank you in advance.
I share some code and outputs to help you understand my question better.
What I got when I tried groupby(), but not exactly what I wanted:
This is what my csv file looks like. It contains more than 300,000 rows.
Some code snippet:
import pandas as pd
import numpy as np

def read_data(file_path):
    data = pd.read_csv(file_path)
    return data

# read csv file
dataset = read_data('raw_data.csv')
df1 = pd.DataFrame(dataset.groupby(['alphabet', 'subject'])['x_axis'].count())
df1['x_axis'].head(20)
I also need to do this for the y_axis and z_axis columns, so what can I use other than the groupby() function? I do not want only the lengths but also the values of all three axes, to be able to resample.
First, find the smallest sample count across all (alphabet, subject) groups:
num_sample = df.groupby(['alphabet', 'subject'])['x_axis'].count().min()
Now you can sample every group down to that size:
df.groupby(['alphabet', 'subject']).sample(num_sample)
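If you also need each (alphabet, subject) combination as its own DataFrame with all three axes, here is a minimal sketch (assuming the column names shown in the question):

import pandas as pd

df = pd.read_csv('raw_data.csv')
num_sample = df.groupby(['alphabet', 'subject'])['x_axis'].count().min()

# one DataFrame per (alphabet, subject) pair, truncated to the common length
groups = {
    key: g[['x_axis', 'y_axis', 'z_axis']].head(num_sample)
    for key, g in df.groupby(['alphabet', 'subject'])
}

Note that head(num_sample) keeps the first samples in their original order, which preserves the time structure of the sensor signal, whereas .sample(num_sample) draws rows at random.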
I'm currently working on a project that takes a csv list of student names who attended a meeting and converts it into a list (later to be compared to the full student roster list, but one thing at a time). I've been looking for answers for hours but I still feel stuck. I've tried using both pandas and the csv module. I'd like to stick with pandas, but if it's easier in the csv module that works too. CSV file example and code below.
The file is autogenerated by our video call software, so the formatting is a little weird.
Attendance.csv
see sample as image, I can't insert images yet
Code:
import pandas

data = pandas.read_csv("2A Attendance Report.csv", header=3)
AttendanceList = data['A'].to_list()
print(str(AttendanceList))
However, this is raising KeyError: 'A'
Any help is really appreciated, thank you!!!
As seen in the sample image, you have column headers in the first row itself. Hence you need to remove header=3 from your read_csv call: either replace it with header=0 or don't specify any explicit header value at all.
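A minimal corrected sketch (the column name "Name" is a placeholder, since the real headers aren't visible here):

import pandas as pd

# header=0 is the default: the first row of the file becomes the column names
data = pd.read_csv("2A Attendance Report.csv")
attendance_list = data["Name"].to_list()  # replace "Name" with your actual header
print(attendance_list)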
My background: long-time SAS and R user, trying to figure out how to do some elementary things in Azure Databricks using Python and Spark. Sorry for the lack of a reproducible example below; I'm not sure how to create one like this.
I'm trying to read data from a complicated XML file. I've reached this point, where I have a pyspark.sql.dataframe (call it xml1) with this arrangement:
RESPONSE:array
    element:array
        element:struct
            VALUE:string
            VARNAME:string
The xml1 dataframe looks like this:
[Row(RESPONSE=[[Row(VALUE='No', VARNAME='PROV_U'), Row(VALUE='Included', VARNAME='ADJSAMP'), Row(VALUE='65', VARNAME='AGE'), ...
When I use xml2=xml1.toPandas(), I get this:
RESPONSE
0 [[(No, PROV_U), (Included, ADJSAMP), (65, AGE)...
1 [[(Included, ADJSAMP), (71, AGE), ...
...
At a minimum, I would like to convert this to a Pandas dataframe with two columns VARNAME and VALUE. A better solution would be a dataframe with columns named with VARNAME values (such as PROV_U, ADJSAMP, AGE), with one row per RESPONSE. Helpful hints with names of correct Python terms in intermediate steps are appreciated!
To deal with an array of structs, explode is your answer. Here is a link on how to use explode: https://hadoopist.wordpress.com/2016/05/16/how-to-handle-nested-dataarray-of-structures-or-multiple-explodes-in-sparkscala-and-pyspark/
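As a sketch of how that looks here (assuming xml1 has the schema shown above, i.e. RESPONSE is an array of arrays of structs):

from pyspark.sql.functions import explode, col, first, monotonically_increasing_id

# tag each row so the RESPONSEs can be told apart after exploding
flat = (xml1
    .withColumn("resp_id", monotonically_increasing_id())
    .select("resp_id", explode("RESPONSE").alias("inner"))  # unwrap outer array
    .select("resp_id", explode("inner").alias("item"))      # unwrap inner array
    .select("resp_id",
            col("item.VARNAME").alias("VARNAME"),
            col("item.VALUE").alias("VALUE")))

# minimum: a two-column pandas DataFrame with VARNAME and VALUE
xml2 = flat.select("VARNAME", "VALUE").toPandas()

# better: one row per RESPONSE, with VARNAME values as column names
wide = (flat.groupBy("resp_id")
    .pivot("VARNAME")
    .agg(first("VALUE"))
    .toPandas())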
I am trying to convert a python Dataframe to a Matlab (.mat) file.
I initially have a txt file (an EEG signal) that I import using pandas.read_csv:
MyDataFrame = pd.read_csv("data.txt",sep=';',decimal='.'), data.txt being a 2D array with labels. This creates a dataframe which looks like this.
In order to convert it to .mat, I tried this solution where the idea is to convert the dataframe into a dictionary of lists but after trying every aspect of this solution it's still unsuccessful.
scipy.io.savemat('EEG_data.mat', {'struct':MyDataFrame.to_dict("list")})
It did create a .mat file but it did not save my dataframe properly. The file I obtain looks like this, so all the values are basically gone, and the remaining labels you see are empty when you look into them.
I also tried using mat4py which is designed to export python structures into Matlab files, but it did not work either. I don't understand why, because converting my dataframe to a dictionary of lists is exactly what should be done according to the mat4py documentation.
I believe that the reason the previous solutions haven't worked for you is that your DataFrame column names are not valid MATLAB struct field names, because they contain spaces and/or start with digit characters.
When I do:
import pandas as pd
import scipy.io
MyDataFrame = pd.read_csv('eeg.txt',sep=';',decimal='.')
truncDataFrame = MyDataFrame[0:1000] # reduce data size for test purposes
scipy.io.savemat('EEGdata1.mat', {'struct1':truncDataFrame.to_dict("list")})
the result in MATLAB is a struct with the 4 fields reltime, datetime, iSensor and quality. Each of these has 1000 elements, so the data from these columns has been converted, but the rest of your data is missing.
However if I first rename the DataFrame columns:
truncDataFrame.rename(columns=lambda x:'col_' + x.replace(' ', '_'), inplace=True)
scipy.io.savemat('EEGdata2.mat', {'struct2':truncDataFrame.to_dict("list")})
the result in MATLAB is a struct with 36 fields. This is not the same format as your mat4py solution but it does contain (as far as I can see) all the data from the source DataFrame.
(Note that in your question, you are creating a .mat file that contains a variable called struct and when this is loaded into MATLAB it masks the builtin struct datatype - that might also cause issues with subsequent MATLAB code.)
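As a quick sanity check from Python (a sketch, reusing the EEGdata2.mat file saved above):

import scipy.io

loaded = scipy.io.loadmat('EEGdata2.mat')
print(loaded['struct2'].dtype.names)  # field names of the saved struct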
I finally found a solution thanks to this post. There, the poster did not create a dictionary of lists but a dictionary of integers, which worked on my side. It is a small example, easily reproducible. Then I tried to manually add lists by entering values like [1, 2], and it did not work. But what worked was when I manually added tuples!
MyDataFrame needs to be converted to a dictionary and if a dictionary of lists doesn't work, try with tuples.
For beginners: lists are written with [] and tuples with ().
This worked for me:
import mat4py as mp

# convert each column to a tuple, then build a dict of column name -> tuple
EEGdata = MyDataFrame.apply(tuple).to_dict()
mp.savemat('EEGdata.mat', {'structs': EEGdata})
EEGdata.mat should now be readable by Matlab, as it is on my side.
I'm trying to do an unbelievably simple thing: load parts of an Excel worksheet into a Numpy array. I've found a kludge that works, but it is embarrassingly unpythonic:
say my worksheet was loaded as "ws", the code:
A = np.zeros((37, 3))
for i in range(2, 39):
    for j in range(1, 4):
        A[i-2, j-1] = ws.cell(row=i, column=j).value
loads the contents of "ws" into array A.
There MUST be a more elegant way to do this. For instance, csvread allows one to do this much more naturally, and while I could well convert the .xlsx file into a csv one, the whole purpose of working with openpyxl was to avoid that conversion. So there we are, Collective Wisdom of the Mighty Intertubes: what's a more pythonic way to perform this conceptually trivial operation?
Thank you in advance for your answers.
PS: I operate Python 2.7.5 on a Mac via Spyder, and yes, I did read the openpyxl tutorial, which is the only reason I got this far.
You could do
A = np.array([[i.value for i in j] for j in ws['C1':'E38']])
EDIT - further explanation.
(firstly thanks for introducing me to openpyxl, I suspect I will use it quite a bit from time to time)
The method of getting multiple cells from the worksheet object produces a generator. This is probably much more efficient if you want to work your way through a large sheet, as you can start straight away without waiting for it all to load into your list.
To force a generator to make a list you can either use list(ws['C1':'E38']) or a list comprehension as above.
Each row is a tuple (even if only one column wide) of Cell objects. These have a lot more about them than just a number, but if you want the number for your array you can use the .value attribute. This is really the crux of your question: csv files don't contain the structured info of an excel spreadsheet.
There isn't (as far as I can tell) a built-in method for extracting values from a range of cells, so you will have to do something effectively as you have sketched out.
The advantages of doing it my way are: no need to work out the dimensions of the array and make an empty one to start with, no need to work out the corrected index numbers for the np array, and list comprehensions are faster. The disadvantage is that it needs the "corners" defined in "A1" format. If the range isn't known then you would have to use iter_rows, rows or columns:
A = np.array([[i.value for i in j[2:5]] for j in ws.rows])
If you don't know how many columns there are, then you will have to loop and check values, more like your original idea.
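For completeness, a sketch of the iter_rows variant for the same C1:E38 range, expressed as row/column indices (columns C..E correspond to min_col=3 through max_col=5):

import numpy as np

A = np.array([[cell.value for cell in row]
              for row in ws.iter_rows(min_row=1, max_row=38, min_col=3, max_col=5)])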
If you don't need to load data from multiple files in an automated manner, the package tableconvert I recently wrote may help. Just copy and paste the relevant cells from the excel file into a multiline string and use the convert() function.
import numpy as np
from tableconvert.converter import convert
array = convert("""
123 456 3.14159
SOMETEXT 2,71828 0
""")
print(type(array))
print(array)
Output:
<class 'numpy.ndarray'>
[[ 123. 456. 3.14159]
[ nan 2.71828 0. ]]