Converting data to matrix by group in Python

I want to create a matrix for each observation in my dataset.
Each row should correspond to a disease group (i.e. xx, yy, kk). Example data:
id  xx_z  xx_y  xx_a  yy_b  yy_c  kk_t  kk_r  kk_m  kk_y
1   1     1     0     0     1     0     0     1     1
2   0     0     1     0     0     1     1     0     1
Given that there are 3 disease groups and a maximum of 4 diseases in any group in the dataset, each matrix should be 3 x 4, and the output should look like:
id  matrix
          xx_z  xx_y  xx_a  null
1   xx  [  1     1     0     0
          yy_b  yy_c  null  null
    yy     0     1     0     0
          kk_t  kk_r  kk_m  kk_y
    kk     0     0     1     1  ]
2       [  0     0     1     0
           0     0     0     0
           1     1     0     1  ]
Please note that I do not know the exact number of diseases per disease group. How could I do this in Python with pandas?
P.S. I just need a nested matrix structure for each observation; later I will compare the matrices of different observations, e.g. the Jaccard similarity of the matrices for observation id == 1 and observation id == 2.

Ok, how about something like this:
import pandas as pd

# make a copy just in case (this assumes 'id' is the index; if it is a regular
# column, set it aside first, e.g. with df.set_index('id'))
d = df.copy()
# get the group names, in case you don't have them already
# (a set has no guaranteed order; list the groups explicitly if the row order matters)
groups = list({col.split('_')[0] for col in d.columns})
# define the grouping condition (here, the groups would be 'xx', 'yy', 'kk')
gb = d.groupby(d.columns.map(lambda x: x.split('_')[0]), axis=1)
# aggregate the values of each group to a list and save them as extra columns
for g in groups:
    d[g] = gb.get_group(g).values.tolist()
# now aggregate to a list of lists
d['matrix'] = d[groups].values.tolist()
# convert each list of lists to a matrix, padding the shorter rows with 0
d['matrix'] = d['matrix'].apply(lambda x: pd.DataFrame.from_records(x).fillna(0).astype(int).values)
# for the desired output
d[['matrix']]
Not the most elegant, but I'm hoping it does the job :)
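Since you mention comparing observations later, here is a minimal Jaccard sketch (my own addition, assuming the matrices built above are 0/1 arrays of equal shape and that 'id' is the index):
import numpy as np

def jaccard(a, b):
    # Jaccard similarity of two binary matrices of the same shape
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# e.g. compare the matrices for id == 1 and id == 2
# sim = jaccard(d.loc[1, 'matrix'], d.loc[2, 'matrix'])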

Build matrix of dummy indicators

I have a pandas dataframe which looks like the following
team_id  skill_id  inventor_id
1        A         Jack
1        B         Jack
1        A         Jill
1        B         Jill
2        A         Jack
2        B         Jack
2        A         Joe
2        B         Joe
So inventors can repeat over teams. I want to turn this data frame into a matrix A of dummy indicators (I have included column names below for clarity; they wouldn't form part of the matrix). For this example, A =
Jack_A  Jack_B  Jill_A  Jill_B  Joe_A  Joe_B
1       0       1       0       0       0
0       1       0       1       0       0
1       0       0       0       1       0
0       1       0       0       0       1
So that each row corresponds to one (team_id x skill_id) combination, and each entry of the matrix is equal to one for that (inventor_id x skill_id) observation.
I tried to create an array of numpy zeros and thought of a double dictionary to map from each (team_id x skill), (inventor_id x skill) combination to an A_ij entry. However, I believe this cannot be the most efficient method.
I need the method to be memory efficient, as I have 220,000 (inventor x team x skill) observations. (So the dimension of the real df is (220,000, 3), not (8, 3) as in the example.)
In addition to @Ben.T's great answer, I figured out another approach which allows me to keep things memory efficient.
# Set the identifier for each row
inventor_data["team_id"] = inventor_data["team_id"].astype(str)
inventor_data["inv_skill_id"] = inventor_data["inventor_id"] + inventor_data["skill_id"]
inventor_data["team_skill_id"] = inventor_data["team_id"] + inventor_data["skill_id"]
# Using DictVectorizer requires a dictionary input
teams = list(inventor_data.groupby('team_skill_id')['inv_skill_id'].agg(dict))
# Change the dict entry from count to 1
for team_id, team in enumerate(teams):
    teams[team_id] = {v: 1 for k, v in team.items()}
from sklearn.feature_extraction import DictVectorizer
vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(teams)
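If you need to know which column of X corresponds to which inventor/skill pair, the vectoriser keeps that mapping (a small usage sketch; get_feature_names_out is the method name in recent scikit-learn, older versions call it get_feature_names):
feature_names = vectoriser.get_feature_names_out()
print(X.shape, feature_names[:5])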
IIUC, you can use crosstab:
print(
    pd.crosstab(
        index=[df['team_id'], df['skill_id']],
        columns=[df['inventor_id'], df['skill_id']]
    )  # .to_numpy()
)
# inventor_id Jack Jill Joe
# skill_id A B A B A B
# team_id skill_id
# 1 A 1 0 1 0 0 0
# B 0 1 0 1 0 0
# 2 A 1 0 0 0 1 0
# B 0 1 0 0 0 1
and if you just want the matrix, then uncomment .to_numpy() in the above code.
Note: if you have some skills that are not shared between teams or inventors, you may need to reindex with all the possibilities, so do:
pd.crosstab(
    index=[df['team_id'], df['skill_id']],
    columns=[df['inventor_id'], df['skill_id']]
).reindex(
    index=pd.MultiIndex.from_product(
        [df['team_id'].unique(), df['skill_id'].unique()]),
    columns=pd.MultiIndex.from_product(
        [df['inventor_id'].unique(), df['skill_id'].unique()]),
    fill_value=0
)  # .to_numpy()
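If the dense result is too large at the real data size (220,000 rows), one memory-friendly option, not part of the answers above but a sketch of the same indicator matrix built sparsely from categorical codes, would be:
import numpy as np
import pandas as pd
from scipy import sparse

# label each row/column of A by its (team, skill) and (inventor, skill) pair
rows = pd.Categorical(df['team_id'].astype(str) + '_' + df['skill_id'])
cols = pd.Categorical(df['inventor_id'] + '_' + df['skill_id'])

A = sparse.coo_matrix(
    (np.ones(len(df), dtype=np.int8), (rows.codes, cols.codes)),
    shape=(len(rows.categories), len(cols.categories)),
).tocsr()
# rows.categories and cols.categories hold the row and column labels of A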

sorting in pandas, while alternating between ascending and descending on the same sorting column

I would like to sort a dataframe by two columns, but one is always ascending, and the other switches between ascending and descending based on the value of the first column. In other words, when the first column increases, the sorting flips from ascending to descending or vice versa.
My motivation for this is that I am trying to sort a set of data that is indexed spatially in a grid into chronological order. The data is measured by snaking upwards, and then back and forth across the grid. I would like to sort ascending by y value, and then go back and forth in the x value whenever the y value increments. I don't know how to do this with df.sort_values() or df.groupby(), as those alter the whole dataframe.
I am trying to sort this
X position | Y position | Data
0 0 '1st'
0 1 '4th'
1 0 '2nd'
1 1 '3rd'
Into this
X position | Y position | Data
0 0 '1st'
1 0 '2nd'
1 1 '3rd'
0 1 '4th'
Not sure if you have found a solution in the meantime...
You could do the following:
df = pd.concat(
    [
        sdf.sort_values("X", ascending=i % 2)
        for i, (_, sdf) in enumerate(df.sort_values("Y").groupby("Y"), start=1)
    ]
)
or
def sort(sdf):
    global i
    i += 1
    return sdf.sort_values("X", ascending=i % 2)

i = 0
df = df.sort_values("Y").groupby("Y", group_keys=False).apply(sort)
Result for
df =
X Y
0 0 0
1 0 1
2 0 2
3 1 0
4 1 1
5 1 2
is
X Y
0 0 0
3 1 0
4 1 1
1 0 1
2 0 2
5 1 2
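A vectorized alternative, a sketch of my own that assumes X is numeric: negate X on every other Y level and sort once on a helper key.
import numpy as np

y_rank = df["Y"].rank(method="dense").astype(int)      # 1, 2, 3, ... per distinct Y value
x_key = np.where(y_rank % 2 == 0, -df["X"], df["X"])   # flip the X order on even-ranked Y levels
df_sorted = (
    df.assign(_x_key=x_key)
      .sort_values(["Y", "_x_key"])
      .drop(columns="_x_key")
)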

Extracting rows in R based on an ID number

I have a data frame in R which, when running data$data['rs146217251',1:10,], looks something like:
g=0 g=1 g=2
1389117_1389117 0 1 NA
2912943_2912943 0 0 1
3094358_3094358 0 0 1
5502557_5502557 0 0 1
2758547_2758547 0 0 1
3527892_3527892 0 1 NA
3490518_3490518 0 0 1
1569224_1569224 0 0 1
4247075_4247075 0 1 NA
4428814_4428814 0 0 1
The leftmost column contains participant identifiers. There are roughly 500,000 participants listed in this data frame, but I have a subset of them (about 5,000) listed by their identifiers. What is the best way to go about extracting only the rows that I care about, according to their identifiers, in R or Python (or some other way)?
Assuming that the participant identifiers are row names, you can filter by the vector of the identifiers as below:
df <- data$data['rs146217251',1:10,]
#Assuming the vector of identifiers
id <- c("4428814_4428814", "3490518_3490518", "3094358_3094358")
filtered <- df[id,]
Output:
> filtered
g.0 g.1 g.2
4428814_4428814 0 0 1
3490518_3490518 0 0 1
3094358_3094358 0 0 1
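Since the question also mentions Python, here is an equivalent pandas sketch (my own; the file name is hypothetical and it assumes the identifiers are the index of the frame):
import pandas as pd

df = pd.read_csv("genotypes.csv", index_col=0)   # hypothetical export of the R object
ids = ["4428814_4428814", "3490518_3490518", "3094358_3094358"]

# a boolean mask keeps only identifiers that actually exist in the frame,
# so missing IDs do not raise a KeyError
filtered = df.loc[df.index.isin(ids)]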

How can I find differences between two dataframe rows?

I have two data frames that I merged together on a common ID. I am trying to uncover when values in each row for a matching ID are different.
I merged the files so that I have the table below. I think I might be able to approach this with a series of if statements, but the actual data file has hundreds of column attributes, which doesn't seem efficient at all. I'm trying to determine if there's an easy way to do this.
x Loan_ID Trade_Quantity_x Principal_x Interest_x Late_Fee_x Trade_Quantity_y Principal_y Interest_y Late_Fee_y
0 1 10 30 0 0 10 30 0 0
1 2 10 0 0 5 10 0 0 0
2 3 10 0 50 0 10 0 0 0
3 4 10 0 0 0 10 0 0 0
4 5 10 100 10 0 10 100 10 0
5 6 9 0 0 0 9 0 0 0
6 7 10 0 0 0 10 0 0 0
Expected output should be:
2. Late_Fee_y
3. Interest_y
I am assuming that what you are after is to compare two data frames of the same structure, i.e. having the same list of columns and the same number of rows, identified by the values of the special Loan_ID column.
The goal is to list all "cells" which are different between the two frames, where a cell's location is given by the id from Loan_ID and the column name.
Can I suggest merging the two frames differently first, to get a list of values, and then finding differences by scanning the melted frames or by applying a filter?
Example data (think of id as Loan_ID)
x = {'id':[1,2],'A':[0,1],'B':[2,3]}
y = {'id':[1,2],'A':[0,2],'B':[2,4]}
df_x = pd.DataFrame(x)
df_y = pd.DataFrame(y)
print(df_x)
print(df_y)
melted
df_xm = pd.melt(df_x, id_vars=['id'])
df_xm['source']='x'
df_ym = pd.melt(df_y, id_vars=['id'])
df_ym['source']='y'
print(df_xm)
print(df_ym)
Assuming that both frames are sorted by id correspondingly
for i in df_xm.index:
    if df_xm['value'][i] != df_ym['value'][i]:
        print(f"{df_xm['id'][i]},{df_xm['variable'][i]}")
Second method :
merged = df_xm.merge(df_ym, left_on= ['id','variable'], right_on=['id','variable'])
print(merged)
filter_diff = merged['value_x'] != merged['value_y']
print('differences:')
print(merged[ filter_diff ])
I'm sure this can be improved for efficiency but this is my general idea how to tackle the "difference between two table snapshots" with general frame/tables and filter operations.
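If both frames share the same index and columns, pandas also has a built-in for this (a sketch of my own using DataFrame.compare, available since pandas 1.1, re-building the example frames with id as the index):
import pandas as pd

df_x = pd.DataFrame({'id': [1, 2], 'A': [0, 1], 'B': [2, 3]}).set_index('id')
df_y = pd.DataFrame({'id': [1, 2], 'A': [0, 2], 'B': [2, 4]}).set_index('id')

# keeps only the differing cells, with a 'self'/'other' column level
diff = df_x.compare(df_y)
print(diff)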

How to sample data by keeping at least two non zero columns

I have a pandas data frame which is basically 50K x 9.5K in dimension. My dataset is binary, i.e. it has only 1s and 0s, and it has a lot of zeros.
Think of it as user-item purchase data, where it is 1 if the user purchased an item and 0 otherwise. Users are rows and items are columns.
353 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
354 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
355 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
356 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
357 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
I want to split it into training, validation and test sets. However, it is not going to be just a normal split by rows.
What I want is that for each validation and test set, I keep between 2-4 columns from the original data which are non-zero.
So basically, if my original data had 9.5K columns for each user, I first keep only, let's say, 1500 or so columns. Then I split this sampled data into train and test by keeping around 1495-1498 columns in train and 2-5 columns in test/validation. The columns which are in test are ONLY those which are non-zero. Training can have both.
I also want to keep the item name/index corresponding to those which are retained in test/validation.
I don't want to run a loop to check each cell value and put it in the next table.
Any idea?
EDIT 1:
So this is what I am trying to achieve.
So, by non-zero, I am guessing you mean those columns which only have ones in them. That is fairly easy to do. The best approach probably is to use sum, like so:
import numpy

sums = df.sum(axis=0)  # sum each column: you get a Series with column names as the index and column sums as values
non_zero_cols = sums[sums == len(df)].index  # only the column names whose records are all non-zero
# Now to split the data into training and testing
test_cols = numpy.random.choice(non_zero_cols, 2, replace=False)  # or 5, just randomly selecting columns
test_data = df[test_cols]
train_data = df.drop(columns=test_cols)
Is that what you are looking for?
IIUC:
threshold = 6
new_df = df.loc[df.sum(1) >= threshold]
df.sum(1) sums over each row. Since these are 1s and 0s, this is equivalent to counting.
df.sum(1) >= threshold creates series of Trues and Falses, also referred to as a boolean mask.
df.loc happens to accept boolean masks as a way to slice.
df.loc[df.sum(1) >= threshold] passes the boolean mask to df.loc and returns only those rows that had a corresponding True in the boolean mask.
Since the boolean mask only had Trues when there existed a count of 1s greater than or equal to threshold, this equates to returning a slice of the dataframe in which each row has at least a threshold number of non-zeroes.
And then refer to one of the many existing answers on how to split into test, train and validation sets, for example along the lines of the sketch below.
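A minimal sketch of such a row-wise split (my own addition, using scikit-learn's train_test_split; the 70/15/15 proportions are illustrative):
from sklearn.model_selection import train_test_split

# first carve out 30% of the rows, then split that half-and-half into validation and test
train, temp = train_test_split(new_df, test_size=0.3, random_state=0)
val, test = train_test_split(temp, test_size=0.5, random_state=0)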
