Random Sampling of Pandas data frame (both rows and columns) - python

I know how to randomly sample a few rows from a pandas data frame. Let's say I have a data frame df; to get a fraction of rows, I can do:
df_sample = df.sample(frac=0.007)
However, what I need is random rows as above AND also random columns from the above data frame.
df is currently 56K x 8.5K. If I want, say, 500 x 1000, where both 500 and 1000 are randomly sampled, how do I do this?
I think one approach would be to use df.columns to get a list of column names, then randomly sample indices from that list and use those random indices to select the corresponding columns?

Just call sample twice, with corresponding axis parameters:
df.sample(n=500).sample(n=1000, axis=1)
For the first call, axis=0 is the default. The first sampling selects rows, while the second selects columns.
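A minimal sketch of the same call chain, using a scaled-down stand-in frame (the real one in the question is 56K x 8.5K) and a random_state only to make the draw reproducible:

import numpy as np
import pandas as pd

# scaled-down stand-in for the 56K x 8.5K frame in the question
df = pd.DataFrame(np.random.rand(5600, 850))

# sample rows first (axis=0 is the default), then columns of the result
df_sample = df.sample(n=500, random_state=0).sample(n=100, axis=1, random_state=0)
print(df_sample.shape)  # (500, 100)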

Related

How to modify column names when combining multiple rows into a single row in a dataframe based on a categorical value, plus selective sum/mean

I'm using the pandas .groupby function to combine multiple rows into a single row.
Currently I have a dataframe df_clean which has 405 rows and 85 columns. There are up to 3 rows that correspond to a single Batch.
My current code for combining the multiple rows is:
num_V = 84  # number of columns - 1, excluding the "Batch" column they are grouped by
max_row = df_clean.groupby('Batch').Batch.count().max()
df2 = (
    df_clean.groupby('Batch')
    .apply(lambda x: x.values[:, 1:].reshape(1, -1)[0])
    .apply(pd.Series)
)
This code works, creating a dataframe df2 which groups the rows by Batch. However, the columns in the resulting dataframe are simply numbered (0, 1, 2, 3, ..., 249, 250, 251); note that (number of columns - the Batch column) * 3 = 84 * 3 = 252, and Batch becomes the index.
I'm cleaning some data for analysis and I want to combine the data of several (generally 1-3) Sub_Batch values on separate rows into a single row based on their Batch. Ideally I would like to be able to determine which columns are grouped into a row and remain separate columns in that row, as well as which columns have their average or total value reported.
For example, the desired input/output:
Original dataframe
Output dataframe
note: the naming of the columns, and that all columns are copied over and ordered according to which Sub_Batch they belong to, i.e. Weight_2 will always correspond to the second Sub_Batch that is part of that Batch, and Weight_3 will correspond to the third Sub_Batch that is part of the Batch.
Ideal output dataframe
note: the naming of the columns, and that in this dataframe there is only a single column that records the Color, as it is identical for all Sub_Batch values within a Batch. The individual Temperature values are recorded, as well as the average of the Temperature values for a Batch. The individual Weight values are recorded, as well as the sum of the Weight values in the column 'Total_weight'.
I am 100% okay with the Output dataframe scenario, as I will simply add the values that I want afterwards using .mean and .sum for the values that I desire. I am simply asking if it can be done using .groupby, as it is not something that I have worked with before, and I know that it does have some ability to sum or average results.
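A sketch of one possible way to get the wide layout described above: number the rows within each Batch with groupby().cumcount() and pivot, then add the selective aggregates. The column names below (Batch, Sub_Batch, Color, Temperature, Weight) are hypothetical stand-ins for the real 85 columns, which are not shown in the question:

import pandas as pd

# hypothetical data matching the description: up to 3 Sub_Batch rows per Batch
df_clean = pd.DataFrame({
    'Batch': ['A', 'A', 'A', 'B', 'B'],
    'Sub_Batch': [1, 2, 3, 1, 2],
    'Color': ['red', 'red', 'red', 'blue', 'blue'],
    'Temperature': [20.0, 21.0, 22.0, 30.0, 31.0],
    'Weight': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# number the rows within each Batch (1, 2, 3) and spread them into columns
df_clean['n'] = df_clean.groupby('Batch').cumcount() + 1
wide = df_clean.pivot(index='Batch', columns='n',
                      values=['Color', 'Temperature', 'Weight'])

# flatten the resulting MultiIndex into names like Weight_2, Temperature_3
wide.columns = [f'{name}_{num}' for name, num in wide.columns]

# selective aggregates, as described for the ideal output
wide['Color'] = df_clean.groupby('Batch')['Color'].first()
wide['Avg_temperature'] = df_clean.groupby('Batch')['Temperature'].mean()
wide['Total_weight'] = df_clean.groupby('Batch')['Weight'].sum()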

averaging columns of a very big matrix of data (approximately 150,000*80) in a reasonable runtime

I am having problems with analyzing very big matrices.
I have 4 big matrices of numbers (between 0 to 1), each one of the matrices contains over 100,000 rows, with 70-80 columns.
I need to average each column and add them to lists according to the column title.
I tried to use the built in mean() method of pandas (the input is pandas dataframe), but the runtime is crazy (it takes numerous hours to calculate the mean of a single column) and I cannot afford that runtime.
Any suggestions on how I can do this in a reasonable runtime?
I am adding my code here:
def filtering(data):
    healthy = []
    nmibc_hg = []
    nmibc_lg = []
    mibc_hg = []
    for column in data:
        if 'HEALTHY' in column:
            healthy.append(data[column].mean())
        elif 'NMIBC_HG' in column:
            nmibc_hg.append(data[column].mean())
        elif 'NMIBC_LG' in column:
            nmibc_lg.append(data[column].mean())
        elif 'MIBC_HG' in column and not 'NMIBC_HG' in column:
            mibc_hg.append(data[column].mean())
Use df.mean() and it will generate a mean for every numeric column in the dataframe df. The column names and their means go into a pandas Series. I tried it on a dataframe with 190,000 rows and 12 numeric columns and it only took 4 seconds. It should only take a few seconds on your dataset too.
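A minimal sketch of that vectorized approach combined with the column-name grouping from the question; data is assumed to be the pandas dataframe, and the substrings (HEALTHY, NMIBC_HG, NMIBC_LG, MIBC_HG) are the ones from the original code:

import pandas as pd

def filtering(data):
    # one vectorized pass over the whole frame instead of a per-column loop
    means = data.mean()
    healthy = means[means.index.str.contains('HEALTHY')].tolist()
    nmibc_hg = means[means.index.str.contains('NMIBC_HG')].tolist()
    nmibc_lg = means[means.index.str.contains('NMIBC_LG')].tolist()
    mibc_hg = means[means.index.str.contains('MIBC_HG')
                    & ~means.index.str.contains('NMIBC_HG')].tolist()
    return healthy, nmibc_hg, nmibc_lg, mibc_hg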

Problem calculating correlation in Python

I have the following data frame
I need to calculate a correlation matrix across all columns. The problem is: when I calculate two columns separately, I get a different value than when I calculate all pairs together using df.corr().
The way I constructed the data frame was by merging the first column with every other one, and this merging process resulted in pairwise data frames with different numbers of rows.
For example: the first column (btc_logreturns) and the second column (gold_logreturns) had 2000 rows originally, while btc_logreturns and ewz_logreturns had 2100 rows. But all columns together have 2459 rows.
Does the function .corr() account for NaNs when calculating the correlation? Is the length of the data frame a potential problem for the different correlation values I get?
The problem is likely in the fact that the indices of these data frames do not align, meaning that some indices that are present in one data frame are not present in the other and vice versa.
If the indices are indeed meaningful, use the result of the merged data frame. If on the other hand, the indices are not meaningful, merge the original data frames using pd.concat([df_1, df_2], ignore_index=True) which will ignore the indices in the original data frames.
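Regarding the first question: df.corr() does handle NaNs, and it does so pairwise, so each entry of the matrix is computed only on the rows where both columns are present. A small sketch with made-up values showing that the matrix entry matches a two-column calculation restricted to the shared rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'btc_logreturns': [0.011, -0.004, 0.020, np.nan, 0.007],
    'gold_logreturns': [0.002, 0.001, np.nan, 0.005, -0.003],
})

# matrix entry from the full frame
matrix_value = df.corr().loc['btc_logreturns', 'gold_logreturns']

# same pair computed only on rows where both columns have values
shared = df.dropna()
pair_value = shared['btc_logreturns'].corr(shared['gold_logreturns'])

print(matrix_value, pair_value)  # the two values agree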

How can I downsample a dataframe consisting of numerical as well as categorical columns?

I have a data frame consisting of EEG signals from 64 channels, 3 categorical columns, and a timestamp column. I want to downsample the numerical columns and eliminate the corresponding categorical values. I used pandas.resample, but it converts my categorical values to NaN. I also used signal.decimate, but it throws a type error on the categorical values.
Any suggestions on what I can do to achieve the desired result?
The structure of DataFrame is like this:
headers = list(range(64))  # numerical columns to be downsampled
headers.extend(['ActualChar', 'PossibleCharCol', 'ResultLabel', 'TimeSequence'])  # categorical columns just to be eliminated without any change in value
Dataframe:
The complete data frame consists of 371740 rows.
The data can be accessed from http://www.bbci.de/competition/ii/#datasets under dataset IIb.
The easiest way is to generate random integer positions:
n = np.random.randint(0, 300, size=40)
and then select those rows with df.iloc[n, :].
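A sketch of that idea with the surrounding pieces filled in; the frame below is a hypothetical stand-in for the EEG data, and np.random.choice with replace=False is used instead of randint so that no row is picked twice:

import numpy as np
import pandas as pd

# hypothetical stand-in: 64 signal columns plus the categorical/timestamp columns
cols = list(range(64)) + ['ActualChar', 'PossibleCharCol', 'ResultLabel', 'TimeSequence']
df = pd.DataFrame(np.random.rand(1000, len(cols)), columns=cols)

# keep 40 distinct rows chosen at random; categorical values on those rows are kept as-is
n = np.random.choice(len(df), size=40, replace=False)
df_down = df.iloc[n, :]
print(df_down.shape)  # (40, 68)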

resample pandas dataframe to an arbitrary number of rows

I have a loop in which a new data frame is populated with values during each step. The number of rows in the new dataframe is different for each step in the loop. At the end of the loop, I want to compare the dataframes, and in order to do so they all need to be the same length. Is there a way I can resample the dataframe at each step to an arbitrary number of rows (e.g. 5618)?
If your dataframe is too small by N rows, you can randomly sample N rows with replacement and append them to the end of your original dataframe. If your dataframe is too big, sample the desired number of rows from the original dataframe.
if len(df) < 5618:
    df1 = df.sample(n=5618 - len(df), replace=True)
    df = pd.concat([df, df1])
if len(df) > 5618:
    df = df.sample(n=5618)
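A sketch of the same logic wrapped in a small helper that can be called at each loop step; the target of 5618 comes from the question, and ignore_index=True is an added choice here so the padded rows do not repeat index labels:

import pandas as pd

def resize_to(df, target=5618):
    # pad with rows re-drawn at random (with replacement) if the frame is too short
    if len(df) < target:
        extra = df.sample(n=target - len(df), replace=True)
        df = pd.concat([df, extra], ignore_index=True)
    # subsample without replacement if it is too long
    elif len(df) > target:
        df = df.sample(n=target)
    return df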
