How to sample data by keeping at least two non-zero columns - python

I have a pandas data frame of roughly 50K x 9.5K dimensions. My dataset is binary, i.e. it contains only 1s and 0s, and it is mostly zeros.
Think of it as user-item purchase data where a cell is 1 if the user purchased an item and 0 otherwise. Users are rows and items are columns.
353 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
354 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
355 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
356 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
357 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
I want to split this into training, validation and test sets. However, it is not going to be a normal split by rows.
What I want is that for each user, the validation and test sets keep between 2 and 4 columns from the original data which are non-zero.
So basically, if my original data had 9.5K columns per user, I first keep only, let's say, 1500 or so columns. Then I split this sampled data into train and test by keeping roughly 1495-1498 columns in train and 2-5 columns in test/validation. The columns that go into test are ONLY those which are non-zero; training can have both.
I also want to keep the item name/index corresponding to the columns which are retained in test/validation.
I don't want to run a loop that checks each cell value and puts it into the next table.
Any idea?
EDIT 1:
So this is what I am trying to achieve.

So, by non-zero, I am guessing you mean those columns which only have ones in them. That is fairly easy to do. The best approach is probably to use sum, like so:
import numpy as np

sums = df.sum(axis=0)  # column sums; gives a Series with column names as indices and column sums as values
non_zero_cols = sums[sums == len(df)].index  # only the columns whose every record is non-zero (all ones)
# Now split the data into training and testing
test_cols = np.random.choice(non_zero_cols, 2, replace=False)  # or 5; just randomly selecting columns
test_data = df[test_cols]
train_data = df.drop(columns=test_cols)  # keep the remaining columns for training
Is that what you are looking for?

IIUC:
threshold = 6
new_df = df.loc[df.sum(1) >= threshold]
df.sum(1) sums over each row. Since these are 1s and 0s, this is equivalent to counting.
df.sum(1) >= threshold creates a Series of Trues and Falses, also referred to as a boolean mask.
df.loc happens to accept boolean masks as a way to slice.
df.loc[df.sum(1) >= threshold] passes the boolean mask to df.loc and returns only those rows that had a corresponding True in the boolean mask.
Since the boolean mask only had Trues when there existed a count of 1s greater than or equal to threshold, this equates to returning a slice of the dataframe in which each row has at least a threshold number of non-zeroes.
And then refer to this answer on how to split into test, train, validation sets.
Or this answer
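For reference, here is a minimal sketch of the row filter followed by a train/validation/test split with scikit-learn's train_test_split; the 60/20/20 proportions and the random_state are just illustrative assumptions:
from sklearn.model_selection import train_test_split

threshold = 6
filtered = df.loc[df.sum(1) >= threshold]  # keep rows with at least `threshold` ones

# carve off 20% for test, then split the remainder into 75% train / 25% validation (0.25 * 0.8 = 0.2)
train_val, test = train_test_split(filtered, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)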

Related

How to create pandas columns and fill with values according to values in another column

I have this dataframe:
text sentiment
asdasda positive
fsdfsdfs negative
sdfsdfs neutral
dfsdsd mixed
and I want this output:
text positive negative neutral mixed
asdasda 1 0 0 0
fsdfsdfs 0 1 0 0
sdfsdfs 0 0 1 0
dfsdsd 0 0 0 1
How can I do it?
You can use pandas.get_dummies, but before that you need to set the column "text" as the index, and after getting the result you need to rename the columns: sentiment_positive to positive, sentiment_negative to negative, and so on.
import pandas as pd

# df <- your_df
res = (
    pd.get_dummies(df.set_index('text'))
    # rename column sentiment_positive to positive,
    # sentiment_negative to negative, ...
    .rename(columns=lambda x: x.split('_')[1])
)
print(res)
mixed negative neutral positive
text
asdasda 0 0 0 1
fsdfsdfs 0 1 0 0
sdfsdfs 0 0 1 0
dfsdsd 1 0 0 0
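As a side note (my own addition, not part of the original answer), get_dummies also takes prefix and prefix_sep arguments, so the rename step can be skipped entirely:
import pandas as pd

# df <- your_df
res = pd.get_dummies(df.set_index('text'), prefix='', prefix_sep='')
print(res)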

Iterating through a Pandas dataframe to count and adjust a cell

I have a Pandas dataframe like so:
colLabelA colLabelB ... colLabelZ
rowLabelA 10 0 0
specialRow 0 10 0
rowLabelB 20 0 10
...
rowLabelZ 0 0 20
Essentially I only know the row called specialRow. What I need is to find a way to iterate through the entire dataframe and check all the columns for 0 (zero).
If a column has all zeroes except for specialRow, then the cell at that column and specialRow needs to be set to zero as well. Otherwise, move on to the next column and check that one.
So in the above example, only colLabelB has all zeroes except the specialRow so that needs to be updated like so:
colLabelA colLabelB ... colLabelZ
rowLabelA 10 0 0
specialRow 0 0 0
rowLabelB 20 0 10
...
rowLabelZ 0 0 20
Is there a quick and fast way to do this?
The dataframes aren't huge but I don't want it to be super slow either.
Use drop to drop the named row, then check for 0 with eq(0).all(). Then you can update with loc:
df.loc['specialRow', df.drop('specialRow').eq(0).all()] = 0
This works with more than one special row too:
specialRows = ['specialRow']
df.loc[specialRows, df.drop(specialRows).eq(0).all()] = 0
Output:
colLabelA colLabelB colLabelZ
rowLabelA 10 0 0
specialRow 0 0 0
rowLabelB 20 0 10
rowLabelZ 0 0 20
For each column, exclude the particular index, then check whether all the other values in that column are zero; if yes, just assign 0 to that column:
for col in df:
    if df[df.index != 'specialRow'][col].eq(0).all():
        df[col] = 0
OUTPUT:
colLabelA colLabelB colLabelZ
rowLabelA 10 0 0
specialRow 0 0 0
rowLabelB 20 0 10
rowLabelZ 0 0 20
In fact df.index!='specialRow' remains the same for all the columns, so you can just assign it to a variable and use it for each of the columns.
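A minimal sketch of that variant, hoisting the mask out of the loop:
mask = df.index != 'specialRow'  # the same boolean mask for every column
for col in df:
    if df[mask][col].eq(0).all():
        df[col] = 0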
Drop the 'specialRow' row, then check whether the column's remaining values are all zero:
if (df.drop(['specialRow'])['colLabelB'] == 0).all():
    df['colLabelB'] = 0

Converting data to matrix by group in Python

I want to create a matrix for each observation in my dataset.
Each row should correspond to a disease group (i.e. xx, yy, kk). Example data:
id xx_z xx_y xx_a yy_b yy_c kk_t kk_r kk_m kk_y
1 1 1 0 0 1 0 0 1 1
2 0 0 1 0 0 1 1 0 1
Given that there are 3 disease groups and a maximum of 4 diseases in any group, the matrix should be 3 x 4, and the output should look like:
id matrix
xx_z xx_y xx_a null
1 xx [ 1 1 0 0
yy_b yy_c null null
yy 0 1 0 0
kk_t kk_r kk_m kk_y
kk 0 0 1 1]
2 [ 0 0 1 0
0 0 0 0
1 1 0 1]
Please note that I do not know the exact number of diseases per disease group. How could I do it in Python pandas?
P.S. I just need a nested matrix structure for each observation; later I will compare the matrices of different observations, e.g. the Jaccard similarity of the matrices for observation id == 1 and observation id == 2.
Ok, how about something like this:
import pandas as pd

# make a copy just in case (this assumes 'id' is the index, e.g. df = df.set_index('id'))
d = df[:]
# get the groups, in case you don't have them already
groups = list({col.split('_')[0] for col in d.columns})
# define grouping condition (here, groups would be 'xx', 'yy', 'kk')
gb = d.groupby(d.columns.map(lambda x: x.split('_')[0]), axis=1)
# aggregate the values of each group into a list and save them as extra columns
for g in groups:
    d[g] = gb.get_group(g).values.tolist()
# now aggregate to a list of lists
d['matrix'] = d[groups].values.tolist()
# convert the list of lists to a matrix, padding shorter groups with 0
d['matrix'] = d['matrix'].apply(lambda x: pd.DataFrame.from_records(x).fillna(0).astype(int).values)
# for the desired output
d[['matrix']]
Not the most elegant, but I'm hoping it does the job :)
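Since the question mentions comparing observations afterwards, here is a small usage sketch (my own illustration, assuming 'id' is the index and that the two padded matrices come out with the same shape):
import numpy as np

m1 = d.loc[1, 'matrix']
m2 = d.loc[2, 'matrix']
# element-wise Jaccard similarity between the two binary matrices
intersection = np.logical_and(m1, m2).sum()
union = np.logical_or(m1, m2).sum()
jaccard = intersection / union if union else 0.0
print(jaccard)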

Extracting rows in R based on an ID number

I have a data frame in R which, when running data$data['rs146217251',1:10,] looks something like:
g=0 g=1 g=2
1389117_1389117 0 1 NA
2912943_2912943 0 0 1
3094358_3094358 0 0 1
5502557_5502557 0 0 1
2758547_2758547 0 0 1
3527892_3527892 0 1 NA
3490518_3490518 0 0 1
1569224_1569224 0 0 1
4247075_4247075 0 1 NA
4428814_4428814 0 0 1
The leftmost column contains participant identifiers. There are roughly 500,000 participants listed in this data frame, but I have a subset of them (about 5,000) listed by their identifiers. What is the best way to extract only the rows I care about, according to their identifier, in R or Python (or some other way)?
Assuming that the participant identifiers are row names, you can filter by the vector of the identifiers as below:
df <- data$data['rs146217251',1:10,]
#Assuming the vector of identifiers
id <- c("4428814_4428814", "3490518_3490518", "3094358_3094358")
filtered <- df[id,]
Output:
> filtered
g.0 g.1 g.2
4428814_4428814 0 0 1
3490518_3490518 0 0 1
3094358_3094358 0 0 1
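Since the question also mentions Python, here is a minimal pandas sketch of the same filtering, assuming the identifiers are the dataframe's index and ids is the list of identifiers you care about:
# df is the pandas equivalent of the R data frame above
ids = ["4428814_4428814", "3490518_3490518", "3094358_3094358"]
# isin avoids a KeyError if some identifiers are missing from the index
filtered = df.loc[df.index.isin(ids)]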

Pandas - Get dummies for only certain values

I have a Pandas series of 10000 rows, each populated with a single letter from A to Z.
However, I want to create dummy data frames for only A, B, and C, using Pandas get_dummies.
How do I go around doing that?
I don't want to get dummies for all the row values in the column and then select the specific columns, as the column contains other redundant data which eventually causes a Memory Error.
try this:
# create mock dataframe
df = pd.DataFrame( {'alpha':['a','a','b','b','c','e','f','g']})
# use replace with a regex to set any character other than a-c to None
pd.get_dummies(df.replace({'[^a-c]': None}, regex=True))
output:
alpha_a alpha_b alpha_c
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 0
6 0 0 0
7 0 0 0
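Another option (my own suggestion, not from the original answer) is to cast the column to a categorical dtype restricted to the wanted letters; get_dummies then only creates columns for those categories, and values outside them become all-zero rows:
import pandas as pd

df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})
wanted = pd.CategoricalDtype(categories=['a', 'b', 'c'])
# letters outside the declared categories become NaN and produce all-zero dummy rows
dummies = pd.get_dummies(df['alpha'].astype(wanted))
print(dummies)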
