I have relatively complex CSV files that contain multiple matrices representing several types of data, and I would like to be able to parse these into multiple dataframes.
The complication is that these files are quite variable in size and content, as seen in this example containing two types of data, in this case a Median and a Count metric for each sample.
There are some commonalities that all of these files share. Each metric will be stored in a matrix structured essentially like the two in the above example. In particular, the DataType field and subsequent entry will always be there, and the feature space (columns) and sample space (rows) will be consistent within a file (the row space may vary between files).
Essentially, the end result should be a dataframe of the data for just one metric, with the feature ids as the column names (Analyte 1, Analyte 2, etc in this example) and the sample ids as the row names (Location column in this case).
So far I've attempted this using the pandas read_csv function without much success.
In theory I could do something like this, but only if I know (1) the size and (2) the location of the particular matrix for the metric that I am after. In this case the headers for my particular metric of interest would be in row 46 and I happen to know that the number of samples is 384.
import pandas as pd
df = pd.read_csv('/path/to/file.csv', sep=",", header=46, nrows=385, index_col='Location')
I am at a complete loss as to how to do this dynamically with files and matrices that change dimensions. Any input on the overall strategy here would be greatly appreciated!
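One possible dynamic strategy, sketched below: scan the raw lines for the marker row that names the metric, treat the next line as the column header, and walk forward until a blank row or the next marker. The 'DataType' marker and the 'Location' index column come from the description above; everything else (exact quoting, blank-line separators, the file path) is an assumption to adjust to the real files.

import io
import pandas as pd

def read_metric(path, metric, index_col="Location"):
    """Return the matrix for one metric (e.g. 'Median' or 'Count') as a DataFrame.

    Assumes each block is announced by a row containing 'DataType' plus the metric
    name, that the column-header row follows immediately, and that the block ends
    at the next blank row or the next 'DataType' row.  Adjust those markers to
    whatever your files actually contain.
    """
    with open(path) as fh:
        lines = fh.read().splitlines()

    # Locate the marker row for the metric we want.
    start = next(i for i, line in enumerate(lines)
                 if "DataType" in line and metric in line)

    header = start + 1                                    # assumed: header row follows the marker
    end = header + 1
    while (end < len(lines)
           and lines[end].replace(",", "").strip()        # stop at a blank/comma-only row
           and "DataType" not in lines[end]):             # ... or at the next block
        end += 1

    # Feed just this block to pandas; its size and location are now found dynamically.
    block = "\n".join(lines[header:end])
    return pd.read_csv(io.StringIO(block), index_col=index_col)

median = read_metric("/path/to/file.csv", "Median")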
I have extracted facial features from several videos in the form of facial action units (AUs) using OpenFace. These features span several seconds and hence take up several rows in a CSV file (each row containing the AU data for one frame of the video). Originally, I had multiple CSV files as input for the CNN but, as advised by others, I have concatenated and condensed the data into a single file. My CSV columns look like this:
Filename | Label | the other columns contain AU-related data
Filename contains an individual "ID" that helps keep track of a single "example". The Label column contains 2 possible values, either "yes" or "no". I'm also considering adding a "Frames" column to keep track of the frame number for a given "example".
The most likely scenario is that I will require some form of 3D CNN but, so far, the only code or help I have found for 3D CNNs is specific to video input, while I need code that works with one or more CSV files. I've been unable to find anything that covers this scenario. Can someone please help me out? I have no idea how or where to move forward.
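For what it's worth, a minimal sketch of how the condensed CSV could be turned into fixed-size arrays for a CNN, assuming the 'Filename' and 'Label' columns from the question, that all remaining columns are AU features, that rows within each Filename group are already in frame order (a 'Frames' column could be used to sort them first), and an arbitrary fixed sequence length for padding/truncation. The network itself is not shown.

import numpy as np
import pandas as pd

N_FRAMES = 100          # assumed fixed sequence length; pad/truncate to this

def csv_to_tensors(path):
    """Turn the condensed CSV into (X, y) arrays for a CNN.

    Assumes columns 'Filename' and 'Label' exist and that every remaining column
    is an AU feature; each Filename group is one example whose rows are ordered
    frames.  Shapes: X -> (n_examples, N_FRAMES, n_features), y -> (n_examples,).
    """
    df = pd.read_csv(path)
    feature_cols = [c for c in df.columns if c not in ("Filename", "Label")]

    examples, labels = [], []
    for _, group in df.groupby("Filename", sort=False):
        frames = group[feature_cols].to_numpy(dtype="float32")
        # Pad with zeros or truncate so every example has the same length.
        if len(frames) < N_FRAMES:
            pad = np.zeros((N_FRAMES - len(frames), len(feature_cols)), dtype="float32")
            frames = np.vstack([frames, pad])
        else:
            frames = frames[:N_FRAMES]
        examples.append(frames)
        labels.append(1 if group["Label"].iloc[0] == "yes" else 0)

    return np.stack(examples), np.array(labels)

# X, y = csv_to_tensors("all_examples.csv")
# Adding a trailing channel axis (X[..., np.newaxis]) gives input usable by
# convolutional layers that expect a channel dimension.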
I have a stata .dta file. If I open it in stata, I can see several columns with value labels. I can go into browse, click on one of them, and see the original code behind the label.
If I read this .dta file into python via pd.read_stata(..., convert_categoricals=True), I can get the data types via df.dtypes.
For some of the columns, categories have been created. However, for one column of interest, a Series with dtype object has been created instead, which contains the labels as strings.
How exactly does the process of category creation in pd.read_stata work?
How can I access the original data codes behind the labels when reading in with convert_categoricals=True?
What do I do in the case where columns are converted to dtype object -- do I have to read in the data frame a second time with convert_categoricals=False and merge? That really sounds non-pythonic.
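A sketch of how the pieces fit together, assuming a placeholder path 'data.dta' and a reasonably recent pandas. The key points: .cat.codes on a Categorical column gives pandas' own 0..n-1 codes (not necessarily the codes Stata used), while the lower-level StataReader can return both the raw codes and the code-to-label dictionaries in a single pass.

import pandas as pd

PATH = "data.dta"   # placeholder path

# Reading with convert_categoricals=True applies the value labels; columns that
# become Categorical expose pandas' integer codes via .cat.codes, but those are
# 0..n-1 in category order and are not guaranteed to equal Stata's codes.
df = pd.read_stata(PATH, convert_categoricals=True)
cat_cols = df.select_dtypes("category").columns
pandas_codes = {col: df[col].cat.codes for col in cat_cols}

# To get the original Stata codes plus the code -> label dictionaries, use the
# lower-level reader once (recent pandas supports the with-block).
with pd.read_stata(PATH, iterator=True) as reader:
    df_codes = reader.read(convert_categoricals=False)   # raw numeric codes
    label_maps = reader.value_labels()                    # {value_label_name: {code: label}}

print(label_maps)

So rather than reading and merging twice, one pass through the reader yields both the raw codes and the label dictionaries, and the labelled frame can be rebuilt from those as needed.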
I have a historic CSV that needs to be updated daily (concatenated) with a freshly pulled CSV. The issue is that the new CSV may have a different number of columns from the historic one. If each of them were light, I could just read in both and concatenate with pandas. If the number of columns were the same, I could use cat and do a command-line call. Unfortunately, neither is true.
So, I am wondering if there is a way to do an out-of-memory concatenation/join with pandas for something like the above, or using one of the command-line tools.
Thanks!
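One low-memory approach, sketched below with placeholder file names: read only the headers to work out the combined column set, then stream both files through in chunks, reindexing each chunk to that column set before appending it to the output.

import pandas as pd

HISTORIC = "historic.csv"    # placeholder paths
NEW = "daily_pull.csv"
OUT = "combined.csv"
CHUNK = 100_000              # rows per chunk; tune to available memory

# Read only the headers (no data rows) to work out the combined column set.
hist_cols = list(pd.read_csv(HISTORIC, nrows=0).columns)
new_cols = list(pd.read_csv(NEW, nrows=0).columns)
all_cols = hist_cols + [c for c in new_cols if c not in set(hist_cols)]

# Stream both files through in chunks, aligning every chunk to the combined
# columns (missing ones become empty), and append to the output as we go.
first = True
for path in (HISTORIC, NEW):
    for chunk in pd.read_csv(path, chunksize=CHUNK):
        chunk = chunk.reindex(columns=all_cols)
        chunk.to_csv(OUT, mode="w" if first else "a", header=first, index=False)
        first = False

A library such as dask.dataframe could do much the same out of core with less code, at the cost of an extra dependency.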
How would I go about creating a MySQL table schema by inspecting an Excel (or CSV) file?
Are there any ready-made Python libraries for the task?
Column headers would be sanitized into column names. Data types would be estimated based on the contents of the spreadsheet column. When done, the data would be loaded into the table.
I have an Excel file of ~200 columns that I want to start normalizing.
Use the xlrd module; start here. [Disclaimer: I'm the author]. xlrd classifies cells into text, number, date, boolean, error, blank, and empty. It distinguishes dates from numbers by inspecting the format associated with the cell (e.g. "dd/mm/yyyy" versus "0.00").
The job of programming some code to wade through user-entered data and decide on a DB datatype for each column is not something that can be easily automated. You should be able to eyeball the data, assign types like integer, money, text, date, datetime, time, etc., and write code to check your guesses. Note that you need to be able to cope with things like numeric or date data entered in text fields (which can look OK in the GUI). You need a strategy for handling cells that don't fit the "estimated" datatype. You need to validate and clean your data. Make sure you normalize text strings (strip leading/trailing whitespace, replace runs of whitespace with a single space). Excel text is (BMP-only) Unicode; don't bash it into ASCII or "ANSI" -- work in Unicode and encode in UTF-8 to put it in your database.
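As a starting point, here is a small sketch of what that inspection looks like with xlrd ("data.xls" is a placeholder; note that xlrd 2.x only reads .xls files): tally the cell types per column as a first pass at deciding which DB type each column should get.

import xlrd

book = xlrd.open_workbook("data.xls")   # placeholder; xlrd 2.x reads .xls only
sheet = book.sheet_by_index(0)

TYPE_NAMES = {
    xlrd.XL_CELL_EMPTY: "empty",
    xlrd.XL_CELL_TEXT: "text",
    xlrd.XL_CELL_NUMBER: "number",
    xlrd.XL_CELL_DATE: "date",
    xlrd.XL_CELL_BOOLEAN: "boolean",
    xlrd.XL_CELL_ERROR: "error",
    xlrd.XL_CELL_BLANK: "blank",
}

# Count the cell types seen in each column (skipping the header row).
for col in range(sheet.ncols):
    counts = {}
    for row in range(1, sheet.nrows):
        ctype = TYPE_NAMES[sheet.cell_type(row, col)]
        counts[ctype] = counts.get(ctype, 0) + 1
    print(sheet.cell_value(0, col), counts)

# Date cells are stored as floats; the workbook's datemode is needed to decode them:
# xlrd.xldate_as_tuple(sheet.cell_value(row, col), book.datemode) -> (y, m, d, h, m, s)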
Quick and dirty workaround with phpmyadmin:
Create a table with the right amount of columns. Make sure the data fits the columns.
Import the CSV into the table.
Use the "Propose table structure" feature.
As far as I know, there is no tool that can automate this process (I would love for someone to prove me wrong as I've had this exact problem before).
When I did this, I came up with two options:
(1) Manually create the columns in the db with the appropriate types and then import, or
(2) Write some kind of filter that could "figure out" what data types the columns should be.
I went with the first option mainly because I didn't think I could actually write a program to do the type inference.
If you do decide to write a type inference tool/conversion (a rough sketch of one is included at the end of this answer), here are a few issues you may have to deal with:
(1) Excel dates are actually stored as the number of days since December 31st, 1899; how does one then infer that a column contains dates rather than some other piece of numerical data (population, for example)?
(2) For text fields, do you just make the columns of type varchar(n) where n is the longest entry in that column, or do you make it an unbounded char field if one of the entries is longer than some upper limit? If so, what's a good upper limit?
(3) How do you automatically convert a float to a decimal with the correct precision and without losing any decimal places?
Obviously, this doesn't mean that you won't be able to (I'm a pretty bad programmer). I hope you do, because it'd be a really useful tool to have.
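For the record, a rough sketch of the kind of filter option (2) describes, working on a CSV export; the type names, the date format, and the VARCHAR cutoff are arbitrary choices, and it deliberately ignores the harder cases listed above.

import csv
from datetime import datetime

def guess_mysql_type(values):
    """Guess a MySQL column type from a column's string values.

    Very naive: tries INT, then DECIMAL, then DATE, and falls back to VARCHAR
    sized to the longest entry (TEXT above an arbitrary cutoff).  Empty strings
    are ignored so a sparse column does not force everything to text.
    """
    values = [v.strip() for v in values if v.strip()]
    if not values:
        return "TEXT"

    def all_match(parser):
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_match(int):
        return "INT"
    if all_match(float):
        # Size the DECIMAL from the widest value seen -- issue (3) above.
        digits = max(len(v.replace("-", "").replace(".", "")) for v in values)
        scale = max(len(v.split(".")[1]) if "." in v else 0 for v in values)
        return f"DECIMAL({digits},{scale})"
    if all_match(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "DATE"

    longest = max(len(v) for v in values)
    return f"VARCHAR({longest})" if longest <= 255 else "TEXT"

with open("data.csv", newline="") as fh:
    rows = list(csv.reader(fh))
header, data = rows[0], rows[1:]
for name, col in zip(header, zip(*data)):   # transpose rows -> columns
    print(name, guess_mysql_type(list(col)))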
Just for (my) reference, I documented below what I did:
xlrd is practical; however, I just saved the Excel data as CSV so I could use LOAD DATA INFILE.
I copied the header row and started writing the import and normalization script.
The script does: CREATE TABLE with all columns as TEXT, except for the primary key.
Query MySQL: LOAD DATA LOCAL INFILE, loading all the CSV data into the TEXT fields.
Based on the output of PROCEDURE ANALYSE, I was able to ALTER TABLE to give the columns the right types and lengths. PROCEDURE ANALYSE returns ENUM for any column with few distinct values, which is not what I needed, but I found that useful later for normalization. Eyeballing 200 columns was a breeze with PROCEDURE ANALYSE; the output from phpMyAdmin's "Propose table structure" was junk.
I wrote some normalization, mostly using SELECT DISTINCT on columns and INSERTing the results into separate tables. First I added an FK column to the old table; just after each INSERT, I took its ID and UPDATEd the FK column. When the loop finished, I dropped the old column, leaving only the FK column. I did the same with multiple dependent columns. It was much faster than I expected.
I ran (Django) python manage.py inspectdb, copied the output to models.py and added all those ForeignKey fields, as FKs do not exist on MyISAM. Wrote a little Python for views.py, urls.py, a few templates... TADA.
Pandas can return a schema:
pandas.read_csv('data.csv').dtypes
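Taking that one step further, the inferred dtypes can be mapped to rough MySQL column types to produce a CREATE TABLE statement; the mapping below is a simplification and the file/table names are placeholders.

import pandas as pd

# Rough dtype -> MySQL type mapping; adjust to taste (sizes are guesses).
MYSQL_TYPES = {
    "int64": "BIGINT",
    "float64": "DOUBLE",
    "bool": "TINYINT(1)",
    "datetime64[ns]": "DATETIME",
    "object": "TEXT",
}

def create_table_sql(csv_path, table_name):
    """Build a CREATE TABLE statement from the dtypes pandas infers for a CSV."""
    df = pd.read_csv(csv_path)
    cols = []
    for name, dtype in df.dtypes.items():
        sql_type = MYSQL_TYPES.get(str(dtype), "TEXT")
        safe_name = "".join(ch if ch.isalnum() else "_" for ch in str(name))
        cols.append(f"  `{safe_name}` {sql_type}")
    return f"CREATE TABLE `{table_name}` (\n" + ",\n".join(cols) + "\n);"

print(create_table_sql("data.csv", "my_table"))

In many cases DataFrame.to_sql with a SQLAlchemy engine will create the table and load the data in one step, which may be all that's needed.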
References:
pandas.read_csv
pandas.DataFrame