Training a Scikit Learn model on multiple dataframes with identical features - python

I have around 300 videos to train my model on. I used an algorithm that extracts around 30 features from each frame of each video, and each video was converted to a CSV file whose rows are the frames and whose columns are the features, so I now have around 300 CSV files. I could simply concatenate them all into one giant dataframe and train on that, except that each video (or rather each CSV) has a single Boolean target value that belongs not to each frame but to the whole video (each video, out of the box, has either a true or false value that I'm trying to predict based on the features inside the video).
My first idea was to concatenate all the CSV files and then groupby() the target and call it a day, but then I have to use this huge dataframe to train a model that predicts whether this value is true or false, using SVM or Random Forest for example. The issue is that I don't know whether this will work or whether it's the right way to go about it. Here's a sample of what a single CSV file looks like:
This entire CSV has a single true or false value I'm trying to predict (currently not present in the features; if added, the entire column would be either all 1s or all 0s).
I'm looking for any ideas or recommendations on how to handle this dataset.
Thank you!
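A minimal sketch of the concatenate-and-aggregate idea described above (not from the original post): each video's frames are collapsed into one row of summary statistics, so the classifier sees one sample per video. The folder path, the lookup_label helper, and the choice of statistics are assumptions.

import glob
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

frames = []
for path in glob.glob("videos_csv/*.csv"):      # hypothetical folder of per-video CSVs
    df = pd.read_csv(path)
    df["video_id"] = path                       # remember which video each frame belongs to
    df["label"] = lookup_label(path)            # hypothetical helper returning the video's True/False target
    frames.append(df)
data = pd.concat(frames, ignore_index=True)

# Collapse each video's frames into a single row of summary statistics.
feature_cols = [c for c in data.columns if c not in ("video_id", "label")]
X = data.groupby("video_id")[feature_cols].agg(["mean", "std", "min", "max"])
X.columns = ["_".join(c) for c in X.columns]    # flatten the MultiIndex column names
y = data.groupby("video_id")["label"].first()

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5))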

Related

How to remove columns index-wise from a 'prefetch type' dataset?

I am editing a Kaggle notebook and implementing one of my own algorithms, and I have started to face a problem with the dataset. Its type is tensorflow.python.data.ops.dataset_ops.PrefetchDataset. I have to split it 70/30. I tried using dataset.take(1000) and dataset.skip(1000), but that did not work out. I am also not able to split it into x_train, y_train, x_test, y_test. Moreover, I want to remove some specific rows from my dataset by index, but that is not possible here. Is there any way to convert it into a dataframe so that I can easily do all of these tasks?
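A minimal sketch of one way to do that conversion, assuming the dataset yields (features, label) pairs and fits in memory once materialised; the toy data below is just an illustration.

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Toy stand-in for a PrefetchDataset of (features, label) pairs.
dataset = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(1000, 4).astype("float32"), np.random.randint(0, 2, size=1000))
).prefetch(tf.data.AUTOTUNE)

features, labels = [], []
for x, y in dataset.as_numpy_iterator():        # materialise the pipeline into plain NumPy
    features.append(x)
    labels.append(y)

df = pd.DataFrame(np.vstack(features))
df["label"] = np.array(labels)

# Ordinary pandas/sklearn operations now work: drop rows by index, 70/30 split, etc.
df = df.drop(index=[0, 5, 42])
x_train, x_test, y_train, y_test = train_test_split(
    df.drop(columns="label"), df["label"], test_size=0.3, random_state=0)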

About Data Cleaning

I am a pretty amateur data science student and I am working on a project where I compare two servers in a team-based game, but my two datasets are formatted differently from one another. Take the first-blood column, for instance: one dataset stores this information as "blue_team_first_blood" with True or False values, whereas the other stores it as just "first blood" with integers (1 for blue team, 2 for red team, 0 for no one, if applicable).
I feel like I can code around these differences, but what's the best practice? Should I take the extra step to make sure both datasets are formatted consistently, or does it matter at all?
Data cleaning is usually the first step in any data science project. It makes sense to transform the data into a consistent format before any further processing steps.
You could consider transforming the "blue_team_first_blood" column to an integer format that is consistent with the other dataset, such as 1 for True and 0 for False. You could also consider renaming the "first blood" column in the second dataset to "blue_team_first_blood" to match the first dataset.
Overall, the best practice is to ensure that both datasets are formatted consistently and in a way that makes sense for your analysis. This will make it easier to compare the two datasets and draw meaningful insights.
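As a concrete illustration of that suggestion (a sketch using small made-up frames; the column names and encodings come from the question), the integer column can be mapped onto the same Boolean convention as the first dataset:

import pandas as pd

# Hypothetical frames standing in for the two servers' datasets.
server_a = pd.DataFrame({"blue_team_first_blood": [True, False, True]})
server_b = pd.DataFrame({"first blood": [1, 2, 0]})

# Convert dataset A's booleans to integers (1 = True, 0 = False) ...
server_a["blue_team_first_blood"] = server_a["blue_team_first_blood"].astype(int)

# ... and derive the same column in dataset B: 1 only when the blue team took first blood.
server_b["blue_team_first_blood"] = (server_b["first blood"] == 1).astype(int)
server_b = server_b.drop(columns=["first blood"])

# Both datasets now share the same column name and encoding.
print(pd.concat([server_a, server_b], ignore_index=True))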

Exporting a csv of pandas dataframe with LARGE np.arrays

I'm building a deep learning model for speech emotion recognition in google colab environment.
Extracting the data and features from the audio files takes 20+ minutes of runtime.
Therefore, I have made a pandas DataFrame containing all of the data, which I want to export to a CSV file so I don't have to wait that long for the data to be extracted every time.
Because audio files have on average 44,100 frames per second (the sample rate in Hz), I get a huge array of values per row, so that df.sample shows, for example:
[screenshot: df.sample output for the variable 'x']
Each 'x' array has about 170K values, but df.sample only shows this truncated representation.
Unfortunately, df.to_csv writes exactly that truncated representation, and NOT the full arrays.
Is there a way to export the full DataFrame as CSV? (Should be miles and miles of data for each row...)
The problem is that a dataframe is not expected to contain np.arrays as values. Even though NumPy is the underlying framework for pandas, a dataframe is intended to be a data-processing tool, not a general-purpose container, so I think you are using the wrong tool here.
If you still want to go that way, it is enough to change the np.arrays into lists:
df['x'] = df['x'].apply(list)
But at load time, you will have to declare a converter (ast.literal_eval from the standard-library ast module) to change the string representations of lists back into plain lists:
import ast
df = pd.read_csv('data.csv', converters={'x': ast.literal_eval, ...})
But again, a CSV file is not intended to have fields containing large lists, and performance may not be what you expect.
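A minimal round-trip sketch of the two lines above; the file name and the tiny example arrays are assumptions (the real ones would hold ~170K values each).

import ast
import numpy as np
import pandas as pd

# Hypothetical dataframe with a column of NumPy arrays.
df = pd.DataFrame({
    "label": ["happy", "sad"],
    "x": [np.random.rand(5), np.random.rand(5)],   # stand-ins for the ~170K-value arrays
})

# Convert the arrays to plain lists so to_csv writes every value, not a truncated repr.
df["x"] = df["x"].apply(list)
df.to_csv("data.csv", index=False)

# At load time, parse each stored list string back into a list, then into an array if needed.
df_loaded = pd.read_csv("data.csv", converters={"x": ast.literal_eval})
df_loaded["x"] = df_loaded["x"].apply(np.array)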

A CNN that takes several rows from a CSV file as a single input

I have extracted facial features from several videos in the form of facial action units (AUs) using OpenFace. These features span several seconds and hence take up several rows in a CSV file (each row containing the AU data for one frame of the video). Originally, I had multiple CSV files as input for the CNN but, as advised by others, I have concatenated and condensed the data into a single file. My CSV columns look like this:
Filename | Label | the other columns contain AU-related data
Filename contains an individual "ID" that helps keep track of a single "example". The Label column contains 2 possible values, either "yes" or "no". I'm also considering adding a "Frames" column to keep track of the frame number within a given "example".
The most likely scenario is that I will require some form of 3D CNN, but so far the only code or help I have found for 3D CNNs is specific to videos, whereas I need code that works with one or several CSV files. I've been unable to find any code that helps in this scenario. Can someone please help me out? I have no idea how/where to move forward.
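One common way to feed a file laid out like this into a sequence model (a sketch, not from the original post) is to group the rows by Filename, pad each group to a fixed number of frames, and stack them into an (examples, frames, features) array. The column names follow the layout described above; the path and the frame cap are assumptions.

import numpy as np
import pandas as pd

df = pd.read_csv("au_features.csv")             # hypothetical path to the concatenated file
feature_cols = [c for c in df.columns if c not in ("Filename", "Label")]

sequences, labels = [], []
for filename, group in df.groupby("Filename", sort=False):
    sequences.append(group[feature_cols].to_numpy(dtype="float32"))
    labels.append(group["Label"].iloc[0])       # one label per example

# Pad/truncate every example to the same number of frames so they can be stacked.
max_frames = 300                                # assumed cap; choose something sensible for your data
X = np.zeros((len(sequences), max_frames, len(feature_cols)), dtype="float32")
for i, seq in enumerate(sequences):
    n = min(len(seq), max_frames)
    X[i, :n] = seq[:n]
y = (np.array(labels) == "yes").astype(int)

print(X.shape)   # (num_examples, max_frames, num_AU_features), ready for a temporal CNN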

Encoding huge categorical features

I am currently working on a big categorical dataset using both pandas and sklearn. As I now want to do feature extraction, or just be able to build a model, I need to encode the categorical features, as they are not handled by sklearn models.
Here is the issue: one of the categorical features has more than 110,000 different values. I cannot possibly encode this, as there would be (and there is) a memory error.
Also, this feature cannot be removed from the dataset, so that option is out of the question.
So I have multiple ways to deal with this:
Use FeatureHasher from sklearn (as mentioned in several topics related to this). But while it's easy to encode that way, there is no mention of how to link the hashed feature back to the DataFrame and thus complete the feature extraction.
Use fuzzy matching to reduce the number of values in the feature and then use one-hot encoding. The issue here is that it will still create many dummy variables, and I will lose much of the usefulness of the feature once it is encoded.
So I have multiple questions: first, do you know how to use FeatureHasher to link a pandas DataFrame to an sklearn model through hashing? If so, how am I supposed to do it?
Second, can you think of any other way to do it that might be easier, or would work better for my problem?
Here are some screenshots of the dataset, so that you can understand the issue more fully.
Here is the output showing the number/percentage of different values per feature.
Also, the biggest feature to encode, called 'commentaire' ('comment' in English), contains strings that are sometimes long sentences and sometimes just short words.
As requested, here is the current code to test Feature Hashing:
from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(input_type='string')
hashedCommentaire = fh.transform(dataFrame['commentaire'])
This raises a MemoryError, so I can reduce the number of features; let's set it to 100:
fh = FeatureHasher(input_type='string', n_features=100)
hashedCommentaire = fh.transform(dataFrame['commentaire'])
print(hashedCommentaire.toarray())
print(hashedCommentaire.shape)
It doesn't raise an error and outputs the following (screenshot: feature hashing output).
Can I then directly use the hashing result in my DataFrame? The issue is that we totally lose track of the values of 'commentaire'; if we then want to predict on new data, will the new data be hashed consistently with the previous data? Also, how do we know the hashing is "sufficient"? Here I used 100 hashed features, but there were more than 110,000 distinct values to begin with.
Thanks to another observation, I began to explore the use of tf-idf. Here we can see a sample of 'commentaire' values (screenshot: 'commentaire' feature sample).
As we can see, the different values are strings, sometimes whole sentences. The thing is, while exploring this data, I noticed that some values are very close to each other; that's why I wanted to explore the idea of fuzzy matching at first, and now tf-idf.
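One way to link hashing (or tf-idf) back to the rest of the DataFrame and to an sklearn model is to put the text transformer inside a ColumnTransformer/Pipeline, so the fitted object maps new data to exactly the same hashed columns at prediction time. This is a sketch of that idea, not from the original post; every column name except 'commentaire', and the choice of classifier, are assumptions.

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hash the free-text column into a fixed number of sparse features, and
# one-hot encode the remaining (smaller) categorical columns.
preprocess = ColumnTransformer([
    ("commentaire", HashingVectorizer(n_features=2**16, alternate_sign=False), "commentaire"),
    ("other_cats", OneHotEncoder(handle_unknown="ignore"), ["note", "auteur"]),  # assumed columns
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical usage: model.fit(train_df, train_df["target"]); model.predict(new_df)
# Hashing is stateless, so unseen text still lands in the same 2**16 columns at predict time.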
