Throwing away 3 out of 4 samples from a data frame - python

In Python, I have a DataFrame that looks like the following, all the way down to about 5000 samples:
I was wondering, in pandas, how do I remove 3 out of every 4 data points in my DataFrame?

To obtain a random sample of a quarter of your DataFrame, you could use:
test4.sample(frac=0.25)
or, to specify the exact number of rows:
test4.sample(n=1250)
If your purpose is to build training, validation, and testing data sets, then see this question.

If you want to select every 4th point, then you can do the following. This will select rows 0, 4, 8, ...:
test4.iloc[::4, :]['Accel']
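As a self-contained illustration (the column name 'Accel' and the 5000-row size are taken from the question, but the values below are made up), a minimal sketch of both approaches:
import numpy as np
import pandas as pd

test4 = pd.DataFrame({'Accel': np.random.randn(5000)})   # stand-in for the real data

quarter_random = test4.sample(frac=0.25)    # random 25% of the rows
every_4th = test4.iloc[::4]                 # rows 0, 4, 8, ... (deterministic)

print(len(quarter_random), len(every_4th))  # 1250 rows each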

Related

pandas : sampling avoiding twice same values in different samples

I have this 5000-row dataframe.
I want to take 4 random samples of 300 rows from the dataframe. I want each sample to have no duplicates inside it, but I also want no duplicates among the samples, i.e. I don't want a row to appear in, say, both sample 1 and sample 3.
I have tried df.sample(300, replace=False) but it's not enough.
I have also searched the forum but didn't find what I want.
How can I do this in pandas without doing batch groups?
I don't think there is a pandas function specifically for that, but how about doing this:
df = pd.DataFrame({"col": range(5000)})
sample = df.sample(1200, replace= False)
sample.duplicated().any()
>> False # <-- no duplicates
samples = [sample.iloc[i-300:i] for i in range(300, 1500, 300)] # <-- 4 samples
Considering that .sample will return a random selection without replacement, this would achieve what you want.
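As a sketch of the same idea with the slicing handled by numpy (numpy.array_split is my own addition, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": range(5000)})

# Draw 1200 distinct rows once, then split them into 4 disjoint chunks of 300
pool = df.sample(1200, replace=False)
samples = [pool.iloc[chunk] for chunk in np.array_split(np.arange(len(pool)), 4)]

assert all(len(s) == 300 for s in samples)   # 4 samples, no row shared between them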

Pandas or Numpy: turn a 500-row, 1-column series into a 100-row, 5-column structure, or compare it in sections

I am looking to take my 1-column, 500-row data and turn it into sets of 5, so 100 sets of 5, or else compare a sequence of 5 numbers against the data 5 numbers at a time (chunks of 5).
I want to see how many times, if any, a sequence shows up in the data, or fragments of the sequence, and how often.
The numbers are from 1-69 in sets of 5, but my data is all one long column with no spaces.
I eventually want to make a list of the sets that contain 2 or more matching numbers, but one step at a time, right?
Oh, or would a 100x5 matrix work? I don't know how to do that either.
Thank you for your time.
You can reshape a numpy array and create a dataframe from it. Below are 100x5 and 5x100 examples:
import random
import pandas as pd

df = pd.DataFrame({'data': [random.randrange(1, 70) for _ in range(500)]})
pd.DataFrame(df['data'].to_numpy().reshape((100, 5)))   # 100 rows of 5
pd.DataFrame(df['data'].to_numpy().reshape((5, 100)))   # 5 rows of 100
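For the follow-up goal of counting matches, here is a minimal sketch under my own assumptions: the 100x5 frame from above and a hypothetical 5-number target set (the numbers in target are made up for illustration):
import random
import pandas as pd

sets = pd.DataFrame([[random.randrange(1, 70) for _ in range(5)] for _ in range(100)])
target = {4, 15, 23, 42, 61}   # hypothetical sequence to search for

# How many of the target numbers appear in each row of 5
matches = sets.apply(lambda row: len(target & set(row)), axis=1)

# Rows that share 2 or more numbers with the target
print(sets[matches >= 2])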

Python: Convert a pandas Series into an array and keep the index

I'm running a k-means algorithm (k=5) to cluster my data. To check the stability of my algorithm, I first run the algorithm once on my whole dataset, and afterwards I run it multiple times on 2/3 of my dataset (using different random states for the splits). I use the results to predict the cluster of the remaining 1/3 of my data. Finally I want to compare the predicted clusters with the clusters I get when I run k-means on the whole dataset. This is where I get stuck.
Since k-means always assigns different labels to the (more or less) same clusters, I can't just compare them. I tried using .value_counts() to reassign the labels 0 to 4 based on their frequency, but because I run this check multiple times, I need something that works in a loop.
Basically when I use .value_counts() I get something like this:
PredictedCluster
4 55555
0 44444
2 33333
1 22222
3 11111
I wish I could turn this into an array, where the labels are sorted by size:
a = [[4, 55555],[0,44444],...,[3,11111]]
Can anyone please tell me how to do this, or suggest what other approaches I could use to solve my problem?
Something like the one-liner below could work:
a = list(map(list, df["PredictedCluster"].value_counts().items()))
One option is to use:
(df['PredictedCluster'].value_counts(ascending=False)
.reset_index()
.to_numpy())
This will count the values, sort (descending) by those counts, then convert the results to a numpy.ndarray.
If you'd like the results in a list, simply append .tolist() to the end of the statement.
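As a quick check, a minimal sketch with a toy column (the cluster sizes below are made up just to mirror the shape of the question's output):
import pandas as pd

df = pd.DataFrame({"PredictedCluster": [4]*5 + [0]*4 + [2]*3 + [1]*2 + [3]*1})

a = list(map(list, df["PredictedCluster"].value_counts().items()))
print(a)   # [[4, 5], [0, 4], [2, 3], [1, 2], [3, 1]]

arr = df["PredictedCluster"].value_counts().reset_index().to_numpy()
print(arr) # the same label/count pairs as a 2-D numpy array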

Alternative for drop_duplicates python 3.6

I am working with a huge volume of data, around 50 million rows.
I want to find unique combinations of values from multiple columns. I use the script below:
dataAll[['Frequency', 'Period', 'Date']].drop_duplicates()
But this is taking a long time, more than 40 minutes.
I found an alternative:
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
but this script gives an array, whereas I need a DataFrame like the first script produces.
In general, your new code is impossible to convert to a DataFrame, because
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
creates one big 1d numpy array, so after removing duplicates it is impossible to recreate the rows.
E.g. if there are 2 unique values, 3 and 1, it is impossible to find which dates belong to 3 and which to 1.
But if there is only one unique value of Frequency and a Date can be found for each Period, as in the sample, a solution is possible.
EDIT:
One possible alternative is to use dask.dataframe.DataFrame.drop_duplicates.
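A minimal sketch of that alternative, assuming dataAll is already a pandas DataFrame and that dask is installed (the partition count is an arbitrary choice):
import dask.dataframe as dd

# Deduplicate in parallel, then bring the (much smaller) result back into pandas
ddf = dd.from_pandas(dataAll, npartitions=8)
unique_rows = ddf[['Frequency', 'Period', 'Date']].drop_duplicates().compute()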

Faster way to append ordered frequencies of pandas series

I am trying to make a list of the number of elements in each group in a pandas series. In my dataframe I have a column called ID, and all values occur multiple times. I want to make a list containing the frequency of each element, in the order in which they occur.
So an example of the column ID is [1,2,3,3,3,2,1,5,2,3,1,2,4,3].
This should produce [3,4,5,1,1], since group-ID 1 occurs 3 times, group-ID 2 occurs 4 times, etc. I have written code that does this perfectly:
group_list = df.ID.unique().tolist()
group_size = []
for i in group_list:
    group_size.append(df.ID.value_counts()[i])
The problem is that it takes way too long to finish. I have 5 million rows, and I let it run for 50 minutes and it still didn't finish! I tried running it on the first 30-50 rows and it works as intended.
To me it would be logical to simply use value_counts(sort=False), but that doesn't give me the group-ID frequencies in the order they occur in my series. I also tried implementing extend because I read it should be faster, but I get a "numpy.int64 object is not iterable" error.
Given a Series
ser = pd.Series([1,2,3,3,3,2,1,5,2,3,1,2,4,3])
You can do the following:
ser.value_counts().reindex(ser.unique()).tolist()
Out: [3, 4, 5, 1, 1]
reindex reorders the value_counts result to match the order in which the values first appear in the series.
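As an alternative sketch (my own addition, not part of the original answer), groupby with sort=False also preserves the order of first appearance and avoids the per-element lookups of the slow loop:
import pandas as pd

ser = pd.Series([1, 2, 3, 3, 3, 2, 1, 5, 2, 3, 1, 2, 4, 3])

# Group sizes in order of first appearance: [3, 4, 5, 1, 1]
group_size = ser.groupby(ser, sort=False).size().tolist()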
