Splitting a single large csv file to resample by two columns - python

I am doing a machine learning project with phone sensor (accelerometer) data. I need to preprocess the dataset before I export it to the ML model. I have 25 classes (the alphabet letters in the dataset) and 20 subjects (recordings of each letter) per class. Since the lengths differ for each class and subject, I have to resample. I want to split a single csv file by class and subject so I can resample each group. I have tried things like groupby(), but they did not work. I would be glad if you could share thoughts on what I can do about this problem. This is my first time asking a question on this site; if I made a mistake, I would appreciate you pointing it out. Thank you in advance.
I share some code and outputs below to help you understand my question better.
This is what I got when I tried groupby(), but it is not exactly what I wanted.
This is how my csv file looks. It contains more than 300,000 rows.
Here is a code snippet:
import pandas as pd
import numpy as np

def read_data(file_path):
    # read the raw sensor data from a csv file
    data = pd.read_csv(file_path)
    return data

# read csv file
dataset = read_data('raw_data.csv')

# count samples per (alphabet, subject) group
df1 = pd.DataFrame(dataset.groupby(['alphabet', 'subject'])['x_axis'].count())
df1['x_axis'].head(20)
I also need to do this for x_axis, y_axis, and z_axis, so what can I use other than groupby()? I do not want to use only the lengths, but also the values of all three axes, to be able to resample.

First, calculate the greatest number of samples common to every (alphabet, subject) group, i.e. the minimum group size:
num_sample = df.groupby(['alphabet', 'subject'])['x_axis'].count().min()
Now you can sample that many rows from each group:
df.groupby(['alphabet', 'subject']).sample(num_sample)
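Because .sample() draws whole rows from each group, the x_axis, y_axis, and z_axis values of a row stay together, so all three axes are resampled at once. A minimal end-to-end sketch, assuming the column names shown in the question (alphabet, subject, x_axis, y_axis, z_axis) and pandas 1.1+ (which introduced DataFrameGroupBy.sample):

import pandas as pd

df = pd.read_csv('raw_data.csv')

# size of the smallest (alphabet, subject) group
num_sample = df.groupby(['alphabet', 'subject'])['x_axis'].count().min()

# draw the same number of rows from every group; x/y/z columns travel together
resampled = (
    df.groupby(['alphabet', 'subject'])
      .sample(n=num_sample, random_state=0)
      .reset_index(drop=True)
)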

Related

why and how to solve data loss when multi-label encoding in python pandas

Download the Data Here
Hi, I have data something like below, and would like to multi-label encode it.
It should end up something like this: target
But the problem is that data is lost when I multi-label it, something like below: issue
Here is the code I am using:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', axis=1).join(df.movieId.str.join('|').str.get_dummies())
Can someone help me? Feel free to download the dataset. Thank you.
That column, when read in with pandas, will be stored as a string, so first we need to convert it to an actual list.
From there, use .explode() to expand that list into a series (where the index will match the index it came from, and the values will be the values in that list).
Then crosstab that series, so each unique value becomes a column and each row marks whether the value was present.
Then join that back up with the dataframe on the index values.
Keep in mind, when you do one-hot encoding with high cardinality, your table will blow up into a huge, wide table. I just did this on the first 20 rows and ended up with 233 columns. With the 225,000+ rows, it'll take a while (maybe a minute or so) to process, and you end up with close to 1300 columns. This may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest trying is to find a way to simplify it a bit to make it less complex. Perhaps find a way to combine movie ids into a set number of genres or something like that (one possible approach is sketched after the code below)? But then test to see if simplifying it improves your model/performance.
import pandas as pd
from ast import literal_eval

df = pd.read_csv('ratings_action.csv')

# the movieId column is read in as a string; turn it back into a real list
df.movieId = df.movieId.apply(literal_eval)

# expand each list into one row per element, keeping the original index
s = df['movieId'].explode()

# one-hot encode via crosstab and join back on the index
df = df[['userId']].join(pd.crosstab(s.index, s))
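As for simplifying, a hypothetical sketch (top_n and the 'other' bucket are illustrative choices, not from the original post) that continues from the exploded series s above and caps the width of the encoded table by keeping only the most frequent ids:

# keep only the top_n most frequent ids; everything else becomes 'other'
top_n = 50
keep = set(s.value_counts().head(top_n).index)
s_reduced = s.where(s.isin(keep), 'other')
df_reduced = df[['userId']].join(pd.crosstab(s_reduced.index, s_reduced))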

How can I iterate over columns of a csv file to split it into several files?

I am completely new to Python (I started last week!), so while I looked at similar questions, I have difficulty understanding what's going on and even more difficulty adapting them to my situation.
I have a csv file where rows are dates and columns are different regions (see image 1). I would like to create a file that has 3 columns: Date, Region, and Indicator where for each date and region name the third column would have the correct indicator (see image 2).
I tried turning the wide data into long data, but I could not quite get it to work; as I said, I am completely new to Python. My second approach was to split it up by columns and then merge it again. I'd be grateful for any suggestions.
This gives you a solution using stack() in pandas:
import pandas as pd
# In your case, use pd.read_csv instead of this:
frame = pd.DataFrame({
'Date': ['3/24/2020', '3/25/2020', '3/26/2020', '3/27/2020'],
'Algoma': [None,0,0,0],
'Brant': [None,1,0,0],
'Chatham': [None,0,0,0],
})
solution = (
    frame.set_index('Date')
         .stack()
         .reset_index(name='Indicator')
         .rename(columns={'level_1': 'Region'})
)
solution.to_csv('solution.csv')
This is the inverse of doing a pivot, as explained here: Doing the opposite of pivot in pandas Python. As you can see there, you could also consider using the melt function as an alternative.
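For example, a minimal melt equivalent of the stack() solution above (var_name and value_name just set the output column names):

# melt keeps 'Date' fixed and unpivots the region columns into rows
solution = frame.melt(id_vars='Date', var_name='Region', value_name='Indicator')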
First, your region columns are currently 'one hot encoded'. What you are trying to do is to "reverse" the one-hot encoding of your region column. Maybe check whether this link answers your question:
Reversing 'one-hot' encoding in Pandas.

Iteratively modify pandas dataframes?

So this is the hideous chunk of code:
import pandas as pd
import numpy as np
from pathlib import Path
import h5py as hdf
def datarray(data):
    '''works well for HDF files'''
    return pd.DataFrame(np.array(data))
print( 'Modules imported')
print( 'Initialized')
pth=Path(r'C:\Users\open.Sourcerer\Desktop\1DTrimmedStruc')
geo=hdf.File(pth/'DB_RAS.g09.hdf','r')
struc=geo.get('Geometry').get('Structures').get('Attributes')
culs=geo.get('Geometry').get('Structures').get('Culvert Groups').get('Attributes')
brls=geo.get('Geometry').get('Structures').get('Culvert Groups').get('Barrels').get('Attributes')
struc=datarray(struc)
culs=datarray(culs)
brls=datarray(brls)
struc['RSReach']=struc['RS']+struc['Reach']
culs['RSReach']=culs['RS']+culs['Reach']
brls['RSReach']=brls['RS']+brls['Reach']
for df in (struc,culs,brls):
    print(df)
I've tried a few ways of converting these 3 datasets to dataframes and adding a column iteratively, with no success. And no, I can't merge them into one dataset: I used to feed them to a separate function from Excel, but now I'm taking the datasets straight from HDF, so the output needs to look like this specifically. How could I boil it down from here? Thanks
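One way to boil it down, a minimal sketch assuming the group paths from the code above (h5py accepts slash-separated paths, so the chained .get() calls can collapse into single lookups):

import pandas as pd
import numpy as np
from pathlib import Path
import h5py as hdf

def datarray(data):
    '''works well for HDF files'''
    return pd.DataFrame(np.array(data))

pth = Path(r'C:\Users\open.Sourcerer\Desktop\1DTrimmedStruc')
base = 'Geometry/Structures'
paths = {
    'struc': f'{base}/Attributes',
    'culs': f'{base}/Culvert Groups/Attributes',
    'brls': f'{base}/Culvert Groups/Barrels/Attributes',
}

# read every dataset into a dataframe, then add the combined column in one loop
with hdf.File(pth / 'DB_RAS.g09.hdf', 'r') as geo:
    frames = {name: datarray(geo[path]) for name, path in paths.items()}
for df in frames.values():
    df['RSReach'] = df['RS'] + df['Reach']

struc, culs, brls = frames['struc'], frames['culs'], frames['brls']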

implement a text classifier with python

I am trying to implement a Persian text classifier with python. I use Excel to read my data and make my dataset.
I would be thankful for any suggestions about a better implementation.
I tried this code to access the body of messages which meet my conditions and store them. I took a screenshot of my Excel file to help more.
For example, I want to store the body of messages whose "foolish" column (I mean the F column) has a value of 1 (true).
https://ibb.co/DzS1RpY "screenshot"
import pandas as pd
file = '1.xlsx'
sorted = pd.read_excel(file, index_col='foolish')
var = sorted[['body']][sorted['foolish'] == '1']
print(var.head())
The expected result is the body of rows 2, 4, 6, and 8.
Try assigning like this:
df_data = df["body"][df["foolish"] == 1.0]
Don't use - (which is a Python operator) in variable names; use _ (underscore) instead.
Also note that this will return a series.
For a dataframe, use:
df_data = pd.DataFrame(df['body'][df["foolish"] == 1.0])
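A minimal end-to-end sketch, assuming the sheet layout from the screenshot (a "body" column and a numeric "foolish" flag). Note that it does not pass index_col, so the "foolish" column stays available for filtering:

import pandas as pd

df = pd.read_excel('1.xlsx')

# keep rows where the numeric flag is 1, then select the message body
df_data = pd.DataFrame(df['body'][df['foolish'] == 1.0])
print(df_data.head())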

Creating two lists from one randomly

I'm using pandas to import a lot of data from a CSV file, and once read I format it to contain only numerical data. This returns a list of lists, where each inner list contains around 140k data points: numericalData[][].
From this list, I wish to create testing and training data. For my testing data, I want to have 30% of my read data numericalData, so I use the following bit of code:
testingAmount = len(numericalData[0]) * trainingDataPercentage / 100
Works a treat. Then, I use numpy to select that amount of data from each column of my imported numericalData:
testingData.append(np.random.choice(numericalData[x], testingAmount))
This then returns a sample with 38 columns (running in a loop), where each column has around 49k elements of data randomly selected from my imported numericalData.
The issue is, my trainingData needs to hold the other 70% of the data, but I'm unsure how to do this. I've tried to compare each element in my testingData, and if both elements aren't equal, then add it to my trainingData. This resulted in an error and didn't work. Next, I tried to delete the selected testingData from my imported data and then save the remaining column to my trainingData; alas, that didn't work either.
I've only been working with python for the past week, so I'm a bit lost on what to try now.
You can use random.shuffle and split the list after that. A toy example:
import random

data = list(range(1, 11))  # shuffle needs a mutable list, not a range
random.shuffle(data)       # shuffles in place
training = data[:5]
testing = data[5:]
To get more information, read the docs.
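For the actual use case, a disjoint 70/30 split can come from shuffling row indices instead of drawing with np.random.choice (which samples with replacement by default and leaves no clean complement). A minimal sketch, assuming numericalData is a list of equal-length columns as described (the toy data below is illustrative):

import numpy as np

# toy stand-in for the imported data: 38 columns of 10 values each
numericalData = [list(range(10)) for _ in range(38)]

data = np.asarray(numericalData)        # shape: (columns, rows)
n_rows = data.shape[1]
idx = np.random.permutation(n_rows)     # shuffled row indices
split = int(n_rows * 0.3)               # 30% of rows for testing

testingData = data[:, idx[:split]]
trainingData = data[:, idx[split:]]     # remaining 70%, guaranteed disjoint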
