I have a dataframe with 5000 rows.
I want to draw 4 random samples of 300 rows from it. Each sample should have no duplicates inside itself, but I also want no duplicates among samples, i.e. a row should not appear in both sample 1 and sample 3, for example.
I have tried df.sample(300, replace=False), but that alone is not enough.
I have also searched the forum but didn't find what I want.
How can I do this in pandas without building batch groups?
I don't think there is a pandas function specifically for that, but how about doing this:
df = pd.DataFrame({"col": range(5000)})
sample = df.sample(1200, replace= False)
sample.duplicated().any()
>> False # <-- no duplicates
samples = [sample.iloc[i-300:i] for i in range(300, 1500, 300)] # <-- 4 samples
Considering that .sample will return a random selection without replacement, this would achieve what you want.
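As a quick sanity check, here is a rough sketch (not part of the original answer) that verifies the four samples are pairwise disjoint:

import pandas as pd

df = pd.DataFrame({"col": range(5000)})
sample = df.sample(1200, replace=False)
samples = [sample.iloc[i - 300:i] for i in range(300, 1500, 300)]

# every original index appears in exactly one sample, so the samples do not overlap
all_idx = pd.concat(samples).index
assert not all_idx.duplicated().any()
assert len(all_idx) == 1200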
I found a very strange behavior with PySpark when I use randomSplit. I have a column is_clicked that takes values 0 or 1, and there are far more zeros than ones. After a random split I would expect the data to be uniformly distributed. But instead, I see that the first rows in the splits all have is_clicked=1, followed by rows that all have is_clicked=0. You can see that the number of clicks in the original dataframe df is 9 out of 1000 (which is what I expect). But after the random split the number of clicks is 1000 out of 1000. If I take more rows I will see that it's all is_clicked=1 until there are no more rows like this, and only then come the rows with is_clicked=0.
Does anyone know why the distribution changes after the random split? How can I make is_clicked uniformly distributed after the split?
So indeed PySpark does sort the data when it performs randomSplit. Here is a quote from the code:
It is possible that the underlying dataframe doesn't guarantee the
ordering of rows in its constituent partitions each time a split is
materialized which could result in overlapping splits. To prevent
this, we explicitly sort each input partition to make the ordering
deterministic. Note that MapTypes cannot be sorted and are explicitly
pruned out from the sort order.
The solution is to either reshuffle the data after the split or just use filter instead of randomSplit:
Solution 1:
from pyspark.sql import functions as sf

# add a random column and shuffle by it before splitting
df = df.withColumn('rand', sf.rand(seed=42)).orderBy('rand')
df_train, df_test = df.randomSplit([0.5, 0.5])
df_train = df_train.orderBy('rand')  # re-order the split by the random column so rows are no longer sorted
Solution 2:
# assumes the 'rand' column added in Solution 1
df_train = df.filter(df.rand < 0.5)
df_test = df.filter(df.rand >= 0.5)
Here is a blog post with more details.
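For reference, a rough self-contained sketch of the filter-based approach (the toy dataframe below is made up just to mimic the click data):

from pyspark.sql import SparkSession, functions as sf

spark = SparkSession.builder.getOrCreate()

# toy dataframe with roughly 1% clicks, for illustration only
df = spark.range(1000).withColumn('is_clicked', (sf.rand(seed=0) < 0.01).cast('int'))

# tag each row with a random number and split by filtering on it;
# this avoids the per-partition sort that randomSplit performs
df = df.withColumn('rand', sf.rand(seed=42))
df_train = df.filter(sf.col('rand') < 0.5).drop('rand')
df_test = df.filter(sf.col('rand') >= 0.5).drop('rand')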
I am looking to take my 1-column, 500-row data and either make it sets of 5 (so 100 sets of 5), or compare a series of 5 numbers to 5 numbers at a time (chunks of 5) in the data.
I want to see how many times, if any, a sequence shows up in the data, or fragments of the sequence and how often.
The numbers are from 1-69 in sets of 5, but my data is all one long column with no spaces.
I eventually want to make a list of the sets that contain 2 or more matching numbers, but one step at a time, right?
Or would a 100x5 matrix work? I don't know how to do that either.
Thank you for your time.
You can reshape a numpy array and create a dataframe from it. Below are 100x5 and 5x100 examples:
import random
import pandas as pd

# example data: 500 random numbers between 1 and 69 in a single column
df = pd.DataFrame({'data': [random.randrange(1, 70) for _ in range(500)]})

pd.DataFrame(df['data'].to_numpy().reshape((100, 5)))  # 100 rows of 5
pd.DataFrame(df['data'].to_numpy().reshape((5, 100)))  # 5 rows of 100
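If the next step is counting how many numbers of a given set of 5 show up in each row, a rough sketch could look like this (the target set below is just a made-up example):

import random
import pandas as pd

df = pd.DataFrame({'data': [random.randrange(1, 70) for _ in range(500)]})
sets = pd.DataFrame(df['data'].to_numpy().reshape((100, 5)))

target = {3, 17, 25, 48, 60}  # hypothetical set of 5 numbers to look for

# count how many numbers each row of 5 shares with the target set
matches = sets.apply(lambda row: len(target & set(row)), axis=1)
print(sets[matches >= 2])  # rows that share 2 or more numbers with the target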
I have a dataframe with 500K rows. I need to distribute sets of 100 randomly selected rows to volunteers for labeling.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 450, size=(450, 1)), columns=list('a'))
I can remove a random sample of 100 rows and output a file with a timestamp:
import time

df_subset = df.sample(100)
df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
df = df.drop(df_subset.index)
The above works, but if I try to apply it to the entire example dataframe:
while len(df) > 0:
    df_subset = df.sample(100)
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
    df = df.drop(df_subset.index)
it runs continuously. My expected output is 5 timestamped dfsample.csv files, 4 with 100 rows each and the fifth with 50 rows, all randomly selected from df. However, df.drop(df_subset.index) doesn't seem to update df, so the condition stays true and it keeps generating CSV files forever. I'm having trouble solving this problem.
Any guidance would be appreciated.
UPDATE
This gets me almost there:
for i in range(4):
    df_subset = df.sample(100)
    df = df.drop(df_subset.index)
    time.sleep(1)  # added because it runs too fast for unique naming
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
It requires me to specify the number of files. If I say 5 for the example df, I get an error on the 5th iteration. I hoped for 5 files, with the 5th having 50 rows, but I am not sure how to do that.
After running your code, I think the problem is not with df.drop
but with the line containing time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv': pandas creates several CSV files within the same second, so later files overwrite earlier ones.
If you want to label files with a timestamp, going down to the sub-second (%f) level should prevent the possibility of an overwrite. In your case:
from datetime import datetime

while len(df) > 0:
    df_subset = df.sample(min(100, len(df)))  # the last file may have fewer than 100 rows
    df_subset.to_csv(datetime.now().strftime("%Y%m%d_%H%M%S.%f") + 'dfsample.csv')
    df = df.drop(df_subset.index)
Another way is to shuffle your rows and get rid of that awful loop.
df.sample(frac=1)
and save slices of the shuffled dataframe.
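A rough sketch of that approach, reusing the example df from the question (the numbered filename suffix is just one way to keep the files distinct):

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame(np.random.randint(0, 450, size=(450, 1)), columns=list('a'))

shuffled = df.sample(frac=1)  # shuffle all rows once
# slice into chunks of 100; the last chunk keeps whatever is left (50 rows here)
chunks = [shuffled.iloc[i:i + 100] for i in range(0, len(shuffled), 100)]
for n, chunk in enumerate(chunks):
    chunk.to_csv(f"{datetime.now():%Y%m%d_%H%M%S.%f}_dfsample_{n}.csv")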
I am trying to generate a huge dataset in Python 3.6 using pandas for testing some code, but the method I came up with is too slow, and I would like to know if there is a more efficient way of doing it.
I create a smaller dataframe with a few columns and around 3 million rows (for example):
import pandas as pd

# assume we have relevant information in 'a', 'b', 'c'
data = pd.DataFrame(index=range(int(3e6)))
data['a'] = 0
data['b'] = 0
data['c'] = 0
Now, I would like to take random rows of this dataframe and build from them a bigger one with approximately 15 million rows, even if rows repeat.
I tried taking samples and appending them to a new dataframe like this:
data_tot = pd.DataFrame(columns=data.columns)
for i in range(int(15e6)):
    samp = data.sample(1)
    data_tot = data_tot.append(samp)
It looks very inefficient, but I have never had to generate such an amount of data before. I also tried preallocation and iloc, but it is still very slow.
You can use np.random.choice to generate random row positions and then index the dataframe with them:
import numpy as np

# 15 million random positions into `data`, drawn with repetition
idx = np.random.choice(len(data), int(15e6))
data_tot = data.iloc[idx, :]
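Alternatively (a slight variation on the same idea, not from the original answer), pandas can draw the 15 million rows with replacement in a single call:

import pandas as pd

data = pd.DataFrame(0, index=range(int(3e6)), columns=['a', 'b', 'c'])

# sample with replacement straight to the target size
data_tot = data.sample(n=int(15e6), replace=True).reset_index(drop=True)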
In Python, I have a DataFrame that looks like the following, all the way down to about 5000 samples:
I was wondering, in pandas, how do I remove 3 out of every 4 data points in my DataFrame?
To obtain a random sample of a quarter of your DataFrame, you could use
test4.sample(frac=0.25)
or, to specify the exact number of rows
test4.sample(n=1250)
If your purpose is to build training, validation, and testing data sets, then see this question.
If you want to select every 4th point, then you can do the following. This will select rows 0, 4, 8, ...:
test4.iloc[::4, :]['Accel']