Extracting rows in R based on an ID number - python

I have a data frame in R which, when I run data$data['rs146217251',1:10,], looks something like this:
                g=0 g=1 g=2
1389117_1389117   0   1  NA
2912943_2912943   0   0   1
3094358_3094358   0   0   1
5502557_5502557   0   0   1
2758547_2758547   0   0   1
3527892_3527892   0   1  NA
3490518_3490518   0   0   1
1569224_1569224   0   0   1
4247075_4247075   0   1  NA
4428814_4428814   0   0   1
The leftmost column contains participant identifiers. There are roughly 500,000 participants listed in this data frame, but I only need a subset of them (about 5,000), given by their identifiers. What is the best way to extract only the rows I care about, based on their identifiers, in R or Python (or some other way)?

Assuming that the participant identifiers are the row names, you can subset by a vector of identifiers as shown below:
df <- data$data['rs146217251',1:10,]
#Assuming the vector of identifiers
id <- c("4428814_4428814", "3490518_3490518", "3094358_3094358")
filtered <- df[id,]
Output:
> filtered
                g.0 g.1 g.2
4428814_4428814   0   0   1
3490518_3490518   0   0   1
3094358_3094358   0   0   1
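If the identifiers live in a separate file, the same subsetting can also be done in Python with pandas. This is only a sketch; the file names and the assumption that the identifiers are the row index are mine, not the question's:

import pandas as pd

# hypothetical file names: the genotype table and the ~5,000 identifiers of interest
df = pd.read_csv("genotypes.csv", index_col=0)
ids = pd.read_csv("ids.txt", header=None)[0].tolist()

# keep only the rows whose identifier appears in the list
subset = df[df.index.isin(ids)]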

Related

Converting data to matrix by group in Python

I want to create a matrix for each observation in my dataset.
Each row of the matrix should correspond to a disease group (i.e. xx, yy, kk). Example data:
id  xx_z  xx_y  xx_a  yy_b  yy_c  kk_t  kk_r  kk_m  kk_y
 1     1     1     0     0     1     0     0     1     1
 2     0     0     1     0     0     1     1     0     1
There are 3 disease groups and at most 4 diseases per group in the dataset, so each matrix should be 3 x 4. The output should look like:
id  matrix
         xx_z xx_y xx_a null
 1   xx  [  1    1    0    0
         yy_b yy_c null null
     yy     0    1    0    0
         kk_t kk_r kk_m kk_y
     kk     0    0    1    1 ]
 2       [  0    0    1    0
            0    0    0    0
            1    1    0    1 ]
Please note that I do not know the exact number of diseases per disease group. How could I do this in Python with pandas?
P.S. I just need a nested matrix structure for each observation; later I will compare the matrices of different observations, e.g. the Jaccard similarity of the matrices for observation id == 1 and observation id == 2.
Ok, how about something like this:
import pandas as pd

# work on a copy just in case (assuming 'id' is the index; otherwise set it
# first with d = df.set_index('id'))
d = df.copy()

# get the group prefixes, in case you don't have them already
groups = list({col.split('_')[0] for col in d.columns})

# define the grouping condition (here, the groups would be 'xx', 'yy', 'kk')
gb = d.groupby(d.columns.map(lambda x: x.split('_')[0]), axis=1)

# aggregate the values of each group to a list and save them as extra columns
for g in groups:
    d[g] = gb.get_group(g).values.tolist()

# now aggregate to a list of lists
d['matrix'] = d[groups].values.tolist()

# convert each list of lists to a matrix, padding shorter groups with zeros
d['matrix'] = d['matrix'].apply(
    lambda x: pd.DataFrame.from_records(x).fillna(0).astype(int).values)

# for the desired output
d[['matrix']]
Not the most elegant, but I'm hoping it does the job :)
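Since the stated end goal is comparing the matrices of different observations, a Jaccard similarity on two equally shaped binary matrices could look roughly like this (just a sketch; the function name is mine):

import numpy as np

def jaccard(m1, m2):
    # flatten both binary matrices and compare element-wise
    a = np.asarray(m1).astype(bool).ravel()
    b = np.asarray(m2).astype(bool).ravel()
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# e.g. compare the matrices of the first two observations
# jaccard(d['matrix'].iloc[0], d['matrix'].iloc[1])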

How to access column values of a TSV based on computationally read out keywords in Python/Pandas/R?

I am going to describe my question with this minimal, reproducible example (reprex). I have two tab-separated value files, df1.map and df2.map. df1.map looks like this:
1 BICF2G630707759 0 3014448
1 BICF2S2358127 0 3068620
1 BICF2P1173580 0 3079928
1 BICF2G630707846 0 3082514
1 BICF2G630707893 0 3176980
1 TIGRP2P175_RS8671583 0 3198886
1 BICF2P1383091 0 3212349
1 TIGRP2P259_RS8993730 0 3249189
1 BICF2P186608 0 3265742
1 BICF2G630707908 0 3273096
df2.map is this:
1 BICF2P413765 0 85620576
1 BICF2P413768 0 85621395
1 BICF2P860004 0 85633349
1 BICF2G630707846 0 85660017
1 BICF2G630707893 0 85684560
1 CHR1_85660017 0 85660017
1 BICF2P91636 0 85684560
1 CHR1_85685260 0 85685260
1 BICF2P172347 0 85700399
1 BICF2P1125163 0 85707031
The second column in each file holds an ID. The task is: for every ID in df1.map, check whether it is present in df2.map. If it is, take the values of columns 1 and 4 of that row in df2.map and overwrite columns 1 and 4 of the corresponding row in df1.map.
I tried one solution in Python (after converting from TSV to CSV):
import pandas as pd

CF2 = pd.read_csv('df1.csv', low_memory=False)
CF3 = pd.read_csv('df2.csv', low_memory=False)

i = 0
while i < CF2.shape[0]:
    if (CF2[CF2.columns[1]].values[i]) == (CF3[CF3.columns[1]]):
        print(CF3[CF3.columns[0]])
    i = i + 1
Yet, that resulted only in errors.
Then I tried to solve it in R:
CFTwo <- read.table("df1.map", sep="\t", header=FALSE)
CFThree <- read.table("df2.map", sep="\t", header=FALSE)
for (i in CFThree[,2]) {
  if (i == CFTwo[,2]) {
    print(CFTwo[,c(1,2,4)])
  }
}
That gave 50 warnings and no result.
Since I was not even able to print the right values, I am still far from overwriting and changing the files.
Besides Python and R I am also open to a solution in bash.
We can solve this problem by merging the second file with the first one and checking for NA values to recode V1 and V4 with if_else(). A dplyr-based solution looks like this:
file1 <- "1 BICF2G630707759 0 3014448
1 BICF2S2358127 0 3068620
1 BICF2P1173580 0 3079928
1 BICF2G630707846 0 3082514
1 BICF2G630707893 0 3176980
1 TIGRP2P175_RS8671583 0 3198886
1 BICF2P1383091 0 3212349
1 TIGRP2P259_RS8993730 0 3249189
1 BICF2P186608 0 3265742
1 BICF2G630707908 0 3273096"
file2 <- "1 BICF2P413765 0 85620576
1 BICF2P413768 0 85621395
1 BICF2P860004 0 85633349
1 BICF2G630707846 0 85660017
1 BICF2G630707893 0 85684560
1 CHR1_85660017 0 85660017
1 BICF2P91636 0 85684560
1 CHR1_85685260 0 85685260
1 BICF2P172347 0 85700399
1 BICF2P1125163 0 85707031"
We read the raw data, and for the first file use the default column names. For the second file, we change the names of every column except the key that joins with the first file. This avoids a duplicate column names problem when we join the files.
df1.map <- read.table(text = file1, header = FALSE)
df2.map <- read.table(text = file2, header = FALSE,
                      col.names = c("df2.V1", "V2", "df2.V3", "df2.V4"))
Next, we join the data and use mutate() to set the values of V1 and V4 if the corresponding values of df2.V1 and df2.V4 are not NA. We use left_join() because the output data should include all rows from df1.map, regardless of whether df2.map contributed data to a particular row.
Once recoded, we drop the columns that begin with df2.* from the result data frame.
library(dplyr)

df1.map %>%
  left_join(., df2.map) %>%
  mutate(V1 = if_else(is.na(df2.V1), V1, df2.V1),
         V4 = if_else(is.na(df2.V4), V4, df2.V4)) %>%
  select(., -c(df2.V1, df2.V3, df2.V4))
...and the output:
Joining, by = "V2"
V1 V2 V3 V4
1 1 BICF2G630707759 0 3014448
2 1 BICF2S2358127 0 3068620
3 1 BICF2P1173580 0 3079928
4 1 BICF2G630707846 0 85660017
5 1 BICF2G630707893 0 85684560
6 1 TIGRP2P175_RS8671583 0 3198886
7 1 BICF2P1383091 0 3212349
8 1 TIGRP2P259_RS8993730 0 3249189
9 1 BICF2P186608 0 3265742
10 1 BICF2G630707908 0 3273096
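Since the question also asks about Python, a rough pandas equivalent of the same merge-and-recode idea could look like this (a sketch assuming the .map files are tab separated, have no header, and the IDs in df2.map are unique):

import pandas as pd

cols = ["V1", "V2", "V3", "V4"]

# the .map files are tab separated and have no header row
df1 = pd.read_csv("df1.map", sep="\t", header=None, names=cols)
df2 = pd.read_csv("df2.map", sep="\t", header=None, names=cols)

# left-join on the ID column (V2) so every row of df1 is kept
merged = df1.merge(df2[["V2", "V1", "V4"]], on="V2",
                   how="left", suffixes=("", "_df2"))

# where df2 contributed values, they overwrite V1 and V4; otherwise df1's values stay
df1["V1"] = merged["V1_df2"].combine_first(merged["V1"]).astype(int)
df1["V4"] = merged["V4_df2"].combine_first(merged["V4"]).astype(int)

# df1 now holds the updated table; write it back out if needed
# df1.to_csv("df1_updated.map", sep="\t", header=False, index=False)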

How to create two columns by default for every feature (One Hot Encoding)?

My feature engineering runs over different documents. For some documents some features do not exist, and consequently the sublist consists of only one value, such as the third sublist [0,0,0,0,0]. One-hot encoding of this sublist leads to only one column, while the feature lists of other documents are transformed into two columns. Is there any way to tell the one-hot encoder to also create two columns when a sublist consists of only one and the same value, and to insert the column in the right spot? The main problem is that my feature data frames for different documents end up with different numbers of columns, which makes them not comparable.
import pandas as pd

feature = [[0,0,1,0,0], [1,1,1,0,1], [0,0,0,0,0], [1,0,1,1,1], [1,1,0,1,1], [1,0,1,1,1], [0,1,0,0,0]]

df = pd.DataFrame(feature[0])
df_features_final = pd.get_dummies(df[0])

for feature in feature[1:]:
    df = pd.DataFrame(feature)
    df_enc = pd.get_dummies(df[0])
    print(df_enc)
    df_features_final = pd.concat([df_features_final, df_enc], axis=1, join='inner')

print(df_features_final)
The result is the following data frame. As you can see from the changing column titles, column 5 is not followed by a 1:
   0  1  0  1  0  0  1  0  1  0  1  0  1
0  1  0  0  1  1  0  1  0  1  0  1  1  0
1  1  0  0  1  1  1  0  0  1  1  0  0  1
2  0  1  0  1  1  0  1  1  0  0  1  1  0
3  1  0  1  0  1  0  1  0  1  0  1  1  0
4  1  0  0  1  1  0  1  0  1  0  1  1  0
I don't see that functionality in pandas, at least. But in TensorFlow we do have:
tf.one_hot(
indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None
)
Set depth to 2.
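For example, with depth=2 even a constant sublist keeps two columns (a minimal sketch assuming TensorFlow 2.x in eager mode):

import tensorflow as tf

feature = [0, 0, 0, 0, 0]             # a sublist with only one distinct value
encoded = tf.one_hot(feature, depth=2)
print(encoded.numpy())
# [[1. 0.]
#  [1. 0.]
#  [1. 0.]
#  [1. 0.]
#  [1. 0.]]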

Counting instances in dataframe that match to another instance

So, I am working with over 100 attributes, so I clearly cannot be writing this out by hand:
df['column_name'] >= 1 & df['column_name'] <= 1
Say my dataframe looks like this:
A B C D E F G H I
1 1 1 1 1 0 1 1 0
0 0 1 1 0 0 0 0 1
0 0 1 0 0 0 1 1 1
0 1 1 1 1 0 0 0 0
I wish to find the number of instances with value 1 for both labels C and I. The answer here is two (the 2nd and 3rd rows). I am working with a lot of attributes and certainly cannot hardcode them. How can I find this frequency?
Assume I have access to the list of class labels I wish to work with, i.e. ['C', 'I'].
I think you want DataFrame.all:
df[['C','I']].eq(1).all(axis=1).sum()
#2
We can also use:
df[['C','I']].astype(bool).all(axis=1).sum()
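Since the labels you care about are already available as a list, the same expression works with a variable instead of hardcoded names (a small sketch):

labels = ['C', 'I']                         # the class labels of interest
count = df[labels].eq(1).all(axis=1).sum()  # rows where every listed label is 1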

Pandas - Get dummies for only certain values

I have a Pandas series of 10,000 rows which is populated with single letters of the alphabet, from A to Z.
However, I want to create dummy columns for only A, B, and C, using pandas get_dummies.
How do I go about doing that?
I don't want to get dummies for all the values in the column and then select the specific columns, because the column contains other redundant data, which eventually causes a MemoryError.
try this:
import pandas as pd

# create a mock dataframe
df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})

# use replace with a regex to set characters other than a-c to None
pd.get_dummies(df.replace({'[^a-c]': None}, regex=True))
output:
   alpha_a  alpha_b  alpha_c
0        1        0        0
1        1        0        0
2        0        1        0
3        0        1        0
4        0        0        1
5        0        0        0
6        0        0        0
7        0        0        0
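An alternative sketch, not part of the original answer: converting the series to a categorical with fixed categories also guarantees exactly those dummy columns, because values outside the categories become NaN and encode as all zeros:

import pandas as pd

s = pd.Series(['A', 'B', 'C', 'D', 'Z', 'A'])

# anything outside the allowed categories becomes NaN
s = pd.Series(pd.Categorical(s, categories=['A', 'B', 'C']))

pd.get_dummies(s)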
