I'm running a k-means algorithm (k=5) to cluster my data. To check the stability of my algorithm, I first run it once on my whole dataset, and afterwards I run it multiple times on 2/3 of my dataset (using a different random state for each split). I use the results to predict the cluster of the remaining 1/3 of my data. Finally, I want to compare the predicted clusters with the clusters I get when I run k-means on the whole dataset. This is where I get stuck.
Since k-means always assigns different labels to the (more or less) same clusters I can't just compare them. I tried using .value_counts() to reassign the labels 0 to 4 based on their frequency. But because I run this check multiple times, I need something that works in a loop.
Basically when I use .value_counts() I get something like this:
PredictedCluster
4 55555
0 44444
2 33333
1 22222
3 11111
I wish I could turn this into an array, where the labels are sorted by size:
a = [[4, 55555],[0,44444],...,[3,11111]]
Can anyone please tell me how to do this, or what other approaches I could use to solve my problem?
Something like the one-liner below could work:
a = list(map(list, df["PredictedCluster"].value_counts().items()))
One option is to use:
(df['PredictedCluster'].value_counts(ascending=False)
.reset_index()
.to_numpy())
This will count the values, sort (descending) by those counts, then convert the results to a numpy.ndarray.
If you'd like the results in a list, simply append .tolist() to the end of the statement.
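To use this inside a loop, the sorted counts can be turned into a mapping from the original label to its size rank. The following is a minimal sketch, assuming that matching clusters by size is good enough for your data (it breaks down when two clusters have similar sizes). The helper name relabel_by_size and the sample values are made up for illustration.

```python
import pandas as pd

def relabel_by_size(labels: pd.Series) -> pd.Series:
    """Map each cluster label to its rank by frequency (0 = largest cluster)."""
    counts = labels.value_counts()  # sorted by count, descending
    mapping = {old: new for new, old in enumerate(counts.index)}
    return labels.map(mapping)

# Hypothetical sample data standing in for the real predictions
df = pd.DataFrame({"PredictedCluster": [4, 4, 4, 0, 0, 2]})
df["RankedCluster"] = relabel_by_size(df["PredictedCluster"])

# [label, count] pairs sorted by size, as asked for in the question
pairs = df["PredictedCluster"].value_counts().reset_index().to_numpy().tolist()
print(pairs)
```

Calling relabel_by_size on both the full-data labels and each run's predicted labels puts them on the same 0 to 4 scale before comparing.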
I found a very strange behavior with pyspark when I use randomSplit. I have a column is_clicked that takes values 0 or 1, and there are way more zeros than ones. After a random split I would expect the data to be uniformly distributed. Instead, I see that the first rows in the splits all have is_clicked=1, followed by rows that all have is_clicked=0. You can see that the number of clicks in the original dataframe df is 9 out of 1000 (which is what I expect). But after the random split the number of clicks is 1000 out of 1000. If I take more rows, I see that they all have is_clicked=1 until there are no more rows like this, followed by rows with is_clicked=0.
Does anyone know why the distribution changes after the random split? How can I make is_clicked uniformly distributed after the split?
So pyspark does indeed sort the data when it performs randomSplit. Here is a quote from the code:
It is possible that the underlying dataframe doesn't guarantee the
ordering of rows in its constituent partitions each time a split is
materialized which could result in overlapping splits. To prevent
this, we explicitly sort each input partition to make the ordering
deterministic. Note that MapTypes cannot be sorted and are explicitly
pruned out from the sort order.
The solution is to either reshuffle the data after the split or use filter instead of randomSplit:
Solution 1:
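from pyspark.sql import functions as sf
df = df.withColumn('rand', sf.rand(seed=42)).orderBy('rand')
df_train, df_test = df.randomSplit([0.5, 0.5])
# orderBy returns a new DataFrame, so assign the result
df_train = df_train.orderBy('rand')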
Solution 2:
# Uses the 'rand' column added with withColumn in Solution 1
df_train = df.filter(df.rand < 0.5)
df_test = df.filter(df.rand >= 0.5)
Here is a blog post with more details.
normal = []
nine_plus = []
tw_plus = []
for i in df['SubjectID'].unique():
    x = df.loc[df['SubjectID'] == i]
    if len(x['Year Term ID'].unique()) <= 8:
        normal.append(i)
    elif len(x['Year Term ID'].unique()) >= 9 and len(x['Year Term ID'].unique()) < 13:
        nine_plus.append(i)
    elif len(x['Year Term ID'].unique()) >= 13:
        tw_plus.append(i)
Hello, I am dealing with a dataset that has 10 million rows. The dataset is about student records and I am trying to classify the students into three groups according to how many semesters they have attended. I feel like I am using a very crude method right now, and there could be a more efficient way of categorizing. Any suggestions?
You go through a lot of repeated iterations that are likely to make your data frame slower than a simple Python list. Use the data frame organization in your favor.
Group your rows by Subject_ID, then Year_Term_ID.
Extract the count of rows in each sub-group -- which you currently have as len(x(...
Make a function, lambda, or extra column that represents the classification; call that len expression load:
0 if load <= 8 else 1 if load <= 12 else 3
Use that expression to re-group your students into the three desired classifications.
Do not iterate through the rows of the data frame: this is a "code smell" that you're missing a vectorized capability.
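The steps above could be sketched roughly as follows. This is only one way to do it, and it swaps the conditional expression for pd.cut to do the binning. The sample rows are invented; the column names and list names come from the question.

```python
import pandas as pd

# Hypothetical sample: student 1 has 3 terms, student 2 has 10, student 3 has 14
df = pd.DataFrame({
    "SubjectID": [1] * 3 + [2] * 10 + [3] * 14,
    "Year Term ID": list(range(3)) + list(range(10)) + list(range(14)),
})

# Number of distinct terms per student, computed in one grouped pass
load = df.groupby("SubjectID")["Year Term ID"].nunique()

# Bin the counts into (0, 8], (8, 12] and (12, inf)
groups = pd.cut(
    load,
    bins=[0, 8, 12, float("inf")],
    labels=["normal", "nine_plus", "tw_plus"],
)

normal = groups.index[groups == "normal"].tolist()
nine_plus = groups.index[groups == "nine_plus"].tolist()
tw_plus = groups.index[groups == "tw_plus"].tolist()
print(normal, nine_plus, tw_plus)
```

The groupby replaces the per-student filtering, so the 10 million rows are scanned once instead of once per student.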
Does that get you moving?
So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity, I'll work with a much smaller matrix as an example. The 7 columns consist of (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). Using pandas, I've imported my matrix, called A, from an Excel sheet; it looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together -- this would have me wind up with 3 separate 2x7 matrices). The reason for this is because I need to run a bunch of statistics on the dataset of numbers of the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices on the real data set I have, I was trying to figure out some way to make the smaller matrices be labeled something like M1, M2, ..., M500 (or whatever number it ends up being) so that way later, I can use some loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.to_numpy() #Convert excel sheet to matrix (as_matrix() was removed in pandas 1.0)
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv(r"C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
# Move both label columns into a two-level MultiIndex (one index entry per row)
df.set_index(["Label 1", "Label 2"], inplace=True)
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2" (they are used for looping later), and built a MultiIndex from the two label columns. In the df.set_index line, we extracted those columns from the DataFrame - now they act as indices for your other columns. For example, in order to access the DataFrame slice from your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
Lastly, instead of saving each sliced DataFrame into a list or dictionary, and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create your sliced DataFrame, you perform the calculations, save them to an output/results file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
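As an alternative to slicing by hand, groupby on both label columns yields one sub-DataFrame per label combination that actually occurs in the data. The sketch below is only an illustration: the sample values are invented, and the column names "Label 1", "Label 2" and "Marker" are assumed from the question and the code above.

```python
import pandas as pd

# Hypothetical stand-in for the spreadsheet in the question
df = pd.DataFrame({
    "Label 1": ["13G", "13G", "14G", "14G", "14G", "14G"],
    "Marker": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0],
    "Label 2": ["Aa", "Aa", "Aa", "Aa", "Ab", "Ab"],
})

# Statistics for every (Label 1, Label 2) combination in one call
stats = df.groupby(["Label 1", "Label 2"])["Marker"].agg(["mean", "std", "count"])
print(stats)

# If the individual pieces are needed, iterate over the groups directly
pieces = {key: sub for key, sub in df.groupby(["Label 1", "Label 2"])}
print(pieces[("14G", "Ab")])
```

Combinations that never occur (here 13G with Ab) simply do not appear, so there is no need to name the pieces M1, M2, ... or to guard against missing keys.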
I am working with a huge volume of data, around 50 million rows.
I want to find the unique values from multiple columns. I use the script below.
dataAll[['Frequency', 'Period', 'Date']].drop_duplicates()
But this takes a long time, more than 40 minutes.
I found some alternative:
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
but the above script returns an array, and I need a DataFrame like the one the first script gives, as below:
In general, the output of your new code cannot be converted to a DataFrame, because:
pd.unique(dataAll[['Frequency', 'Period', 'Date']].values.ravel('K'))
creates one big 1-D numpy array, so after the duplicates are removed it is impossible to recreate the rows.
E.g. if there are 2 unique values, 3 and 1, it is impossible to find which datetimes belong to 3 and which to 1.
But if there is only one unique value for Frequency, and each Period has a single Date as in the sample, a solution is possible.
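A small illustration of the difference, using invented values for the three columns: drop_duplicates keeps whole rows, while pd.unique on the flattened values mixes all columns into a single array.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question
dataAll = pd.DataFrame({
    "Frequency": [3, 3, 1, 1],
    "Period": [1, 1, 2, 2],
    "Date": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"],
})
cols = ["Frequency", "Period", "Date"]

# Unique rows: the link between Frequency, Period and Date is preserved
rows = dataAll[cols].drop_duplicates()
print(rows)

# Unique scalars: one flat array, the row structure is gone
flat = pd.unique(dataAll[cols].to_numpy().ravel("K"))
print(flat)
```

Here the value 1 appears in both Frequency and Period, yet it shows up only once in the flat array, which is why the rows cannot be rebuilt from it.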
EDIT:
One possible alternative is to use dask.dataframe.DataFrame.drop_duplicates.
In Python, I have a DataFrame that looks like the following, all the way down to about 5000 samples:
I was wondering, in pandas, how do I remove 3 out of every 4 data points in my DataFrame?
To obtain a random sample of a quarter of your DataFrame, you could use
test4.sample(frac=0.25)
or, to specify the exact number of rows
test4.sample(n=1250)
If your purpose is to build training, validation, and testing data sets, then see this question.
If you want to select every 4th point, then you can do the following. This will select rows 0, 4, 8, ...:
test4.iloc[::4, :]['Accel']
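A quick way to check both approaches is to run them on a small frame. This is a minimal sketch with made-up data; only the names test4 and 'Accel' come from the answer above, and random_state is added just to make the sample reproducible.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame
test4 = pd.DataFrame({"Accel": range(20)})

# Keep every 4th row: positions 0, 4, 8, ...
every_fourth = test4.iloc[::4]
print(every_fourth.index.tolist())

# Keep a random quarter of the rows instead
random_quarter = test4.sample(frac=0.25, random_state=0)
print(len(random_quarter))
```

The slice keeps the original spacing of the samples, which usually matters for time series such as accelerometer data, while sample does not.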