Python - Encoding Genomic Data in a dataframe

Hi, I'm trying to encode a genome, stored as a string inside a dataframe read from a CSV.
Right now I'm looking to split each string in the dataframe under the column 'Genome' into a list of its base pairs, i.e. from ('acgt...') to ('a','c','g','t',...), then convert each base pair into a float (0.25, 0.50, 0.75, 1.00) respectively.
I thought I was looking for a split function to break each string into characters, but none seem to work on the data in the dataframe, even when converted to a string using .tostring.
Here's my most recent code:
import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array
label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','g','c','t','z']))
def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25  # A
    float_encoded[float_encoded == 1] = 0.50  # C
    float_encoded[float_encoded == 2] = 0.75  # G
    float_encoded[float_encoded == 3] = 1.00  # T
    float_encoded[float_encoded == 4] = 0.00  # anything else, z
    return float_encoded
dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath)
df = ordinal_encoder(string_to_array(dataframe[['Genome']].values.tostring()))
print(df)
I've tried writing my own function, but I don't really understand how these work. Everything I try suggests the data can't be processed while it's in a numpy array, and nothing I do seems to convert it to another type.
Thanks for the tips!
Edit: Here is the print of the dataframe-
Antibiotic ... Genome
0 isoniazid ... ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1 isoniazid ... gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2 isoniazid ... aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3 isoniazid ... gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4 isoniazid ... ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...
There are 5 columns, 'Genome' being the 5th in the list. I don't know why 1) .head() will not work and 2) print() doesn't give me all the columns...

I don't think LabelEncoder is what you want. This is a simple transformation, so I recommend doing it directly. Start with a lookup table for your base-pair mapping:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00
    # z: 0.00
}
Then apply the lookup to each value of the "Genome" column. The values attribute returns the resulting dataframe as an ndarray.
dataframe['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
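For reference, a minimal end-to-end sketch of this approach (the two-row DataFrame here is a made-up stand-in for the real CSV):
import pandas as pd

lookup = {'a': 0.25, 'g': 0.50, 'c': 0.75, 't': 1.00}  # anything else maps to 0.0

# Illustrative stand-in for pd.read_csv(dfpath)
dataframe = pd.DataFrame({'Antibiotic': ['isoniazid', 'isoniazid'],
                          'Genome': ['acgt', 'ggcz']})

# One row per genome, one column per base position, floats as values
encoded = dataframe['Genome'].apply(
    lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])
).values

print(encoded)  # [[0.25 0.75 0.5 1.0], [0.5 0.5 0.75 0.0]] (up to numpy formatting)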

Related

Python - Defining a chunk function to encode genomic data

I'm trying to encode genomes from strings stored in a dataframe to an array of corresponding numerical values.
Here is some of my dataframe (for some reason it doesn't give me all 5 columns just 2):
Antibiotic ... Genome
0 isoniazid ... ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1 isoniazid ... gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2 isoniazid ... aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3 isoniazid ... gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4 isoniazid ... ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...
So I need to split these strings character by character and assign them to floats. This is the lookup table I was using:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00
    # z: 0.00
}
I tried to apply this directly using:
dataframe['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
But I have too much data to fit into memory, so I'm trying to process it in chunks, and I'm having trouble defining a preprocessing function.
Here's my code so far:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00
    # z: 0.00
}
dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath, chunksize=10)
chunk_list = []
def preprocess(chunk):
    chunk['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
    return;

for chunk in dataframe:
    chunk_filter = preprocess(chunk)
    chunk_list.append(chunk_filter)

dataframe1 = pd.concat(chunk_list)
print(dataframe1)
Thanks in advance!
You have chunk_filter = preprocess(chunk), but your preprocess() function returns nothing, so chunk_filter is always None. Modify your preprocess function to store the result of the apply() call, then return that value. For example:
def preprocess(chunk):
    processed_chunk = chunk['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
    return processed_chunk
By doing this, you actually return the data from the preprocess function so that it can be appended to the chunk list. As you have it currently, the preprocess function works correctly but essentially discards the results.
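Putting it together, a sketch of the full chunked pipeline might look like the following (note it returns the DataFrame produced by apply rather than .values, so that pd.concat can stitch the chunks back together):
import pandas as pd

lookup = {'a': 0.25, 'g': 0.50, 'c': 0.75, 't': 1.00}  # anything else -> 0.0

def preprocess(chunk):
    # Encode each genome string in the chunk as one row of floats
    return chunk['Genome'].apply(
        lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])
    )

dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
chunk_list = []
for chunk in pd.read_csv(dfpath, chunksize=10):
    chunk_list.append(preprocess(chunk))

dataframe1 = pd.concat(chunk_list)
print(dataframe1)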

How to build features using pandas column and a dictionary efficiently?

I have a machine learning problem where I am calculating the bigram Jaccard similarity between a pandas dataframe text column and the values of a dictionary. Currently I am storing the results as a list and then converting them to columns. This is proving to be very slow in production. Is there a more efficient way to do it?
Following are the steps I am currently following:
For each key in dict:
1. Get bigrams for the pandas column and the dict[key]
2. Calculate Jaccard similarity
3. Append to an empty list
4. Store the list in the dataframe
5. Convert the list to columns
from itertools import tee, islice
def count_ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break
def n_gram_jaccard_similarity(str1, str2, n):
    a = set(count_ngrams(str1.split(), n))
    b = set(count_ngrams(str2.split(), n))
    intersection = a.intersection(b)
    union = a.union(b)
    try:
        return len(intersection) / float(len(union))
    except:
        return np.nan
def jc_list(sample_dict, row, n):
    sim_list = []
    for key in sample_dict:
        sim_list.append(n_gram_jaccard_similarity(sample_dict[key], row["text"], n))
    return str(sim_list)
Using the above functions to build the bigram Jaccard similarity features as follows:
df["bigram_jaccard_similarity"]=df.apply(lambda row: jc_list(sample_dict,row,2),axis=1)
df["bigram_jaccard_similarity"] = df["bigram_jaccard_similarity"].map(lambda x:[float(i) for i in [a for a in [s.replace(',','').replace(']', '').replace('[','') for s in x.split()] if a!='']])
df[[i for i in sample_dict]] = pd.DataFrame(df["bigram_jaccard_similarity"].values.tolist(), index= df.index)
Sample input:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
Expected output:
So, this is more difficult than I thought, due to some broadcasting issues with sparse matrices. Additionally, in the short period of time I was not able to fully vectorize it.
I added an additional text row to the frame:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
df.loc[1] = ["2","this is a second sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
We will use the following modules/functions/classes:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix
import numpy as np
and define a CountVectorizer to create character-based n-grams:
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
Feel free to choose the n-grams you need. I'd advise taking an existing tokenizer and n-gram creator; you should find plenty of those. The CountVectorizer can also be tweaked extensively (e.g. convert to lowercase, get rid of whitespace, etc.).
We concatenate all the data:
all_data = np.concatenate((df.text.to_numpy(),np.array(list(sample_dict.values()))))
We do this because our vectorizer needs a common indexing scheme for all the tokens that appear.
Now let's fit the Count vectorizer and transform the data accordingly:
ngrammed = ngram_vectorizer.fit_transform(all_data) >0
ngrammed is now a sparse boolean matrix indicating which tokens appear in each row, rather than the counts as before. You can inspect ngram_vectorizer and find a mapping from tokens to column ids.
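For example, that token-to-column mapping can be inspected through the fitted vectorizer's vocabulary_ attribute (the exact ids depend on the data):
print(ngram_vectorizer.vocabulary_)   # e.g. {'th': ..., 'hi': ..., 'is': ..., ...}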
Next we want to compare every n-grammed entry from the sample dict against every row of our n-grammed text data. We need some magic here:
texts = ngrammed[:len(df)]
samples = ngrammed[len(df):]
text_rows = len(df)
jaccard_similarities = []
for key, ngram_sample in zip(sample_dict.keys(), samples):
    repeated_row_matrix = (csr_matrix(np.ones([text_rows,1])) * ngram_sample).astype(bool)
    support = texts.maximum(repeated_row_matrix)
    intersection = texts.multiply(repeated_row_matrix).todense()
    jaccard_similarities.append(pd.Series((intersection.sum(axis=1)/support.sum(axis=1)).A1, name=key))
support is the boolean matrix that captures the union of the n-grams over both comparands; intersection is only True where a token is present in both. Note that .A1 returns the matrix object's underlying array, flattened to 1-D.
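A tiny illustration of .A1 (a sketch): the row sums of a dense matrix come back as a column np.matrix, and .A1 flattens it to a plain 1-D ndarray, which is what pd.Series expects.
import numpy as np

m = np.matrix([[0.6], [0.4]])   # column vector, like intersection.sum(axis=1) / support.sum(axis=1)
print(m.A1)                     # [0.6 0.4] -- flattened 1-D ndarray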
Now
pd.concat(jaccard_similarities, axis=1)
gives
r1 r2 r3
0 0.631579 0.444444 0.500000
1 0.480000 0.333333 0.384615
You can concat it to df as well and obtain, with
pd.concat([df, pd.concat(jaccard_similarities, axis=1)], axis=1)
id text r1 r2 r3
0 1 this is a sample text 0.631579 0.444444 0.500000
1 2 this is a second sample text 0.480000 0.333333 0.384615

Applying function on pandas column using information from another column

I have a dataframe that contains a bunch of people's text descriptions. Other than that, I also have 4 descriptions a,b,c,d. For each person's text description, I wish to compare them to each of the 4 descriptions by using cosine similarity and store these scores in the same dataframe in 4 new columns: a, b, c, d.
How can I do this in a pandas way, without using for loops? I was thinking of using the apply function, but I don't know how to reference the 'text' column as well as the 4 descriptions a, b, c, d inside the apply function.
Thank you very much for any help!!
What I have tried:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
person_one = [' '.join(['table','car','mouse'])]
person_two = [' '.join(['computer','card','can','mouse'])]
person_three = [' '.join(['chair','table','whiteboard','window','button'])]
person_four = [' '.join(['queen','king','joker','phone'])]
description_a = [' '.join(['table','yellow','car','king'])]
description_b = [' '.join(['bottle','whiteboard','queen'])]
description_c = [' '.join(['chair','car','car','phone'])]
description_d = [' '.join(['joker','blue','earphone','king'])]
mystuff = [('person 1', person_one),
           ('person 2', person_two),
           ('person 3', person_three),
           ('person 4', person_four)]
labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)
df = df.reindex(columns = ['person','text','a','b','c','d'])
def trying(cell, jd):
    vectorizer = CountVectorizer(analyzer='word', max_features=5000).fit(jd)
    jd_vector = vectorizer.transform(jd)
    person_vector = vectorizer.transform(cell['text'])
    score = cosine_similarity(jd_vector, person_vector)
    return score
df['a'] = df['a'].apply(trying(description_a))
df['b'] = df['b'].apply(trying(description_b))
df['c'] = df['c'].apply(trying(description_c))
df['d'] = df['d'].apply(trying(description_d))
This gives me an error:
df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'
The output should look something like this:
person text a b c d
0 person 1 [table, car, mouse] 0.3 0.2 0.5 0.7
1 person 2 [computer, card, can, mouse] 0.2 0.1 0.9 0.7
2 person 3 [chair, table, whiteboard, window, button] 0.3 0.5 0.1 0.4
3 person 4 [queen, king, joker, phone] 0.2 0.4 0.3 0.5
I can't post a comment yet, but to solve the error:
df['a'] = df['a'].apply(trying(description_a))
TypeError: trying() missing 1 required positional argument: 'jd'
You need to pass the parameter like this :
df['a'] = df['a'].apply(trying, args=(description_a,))
The first argument will be the column value in your case, and the other arguments will then be taken in order from the args tuple (note the trailing comma, which makes it a one-element tuple).
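A self-contained toy example of the args mechanism (unrelated to the cosine-similarity setup, just to show how the extra argument is passed through):
import pandas as pd

def scale(value, factor):
    return value * factor

s = pd.Series([1, 2, 3])
print(s.apply(scale, args=(10,)))   # 10, 20, 30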
Hope this helps.
How about this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
person_one = ['table','car','mouse']
person_two = ['computer','card','can','mouse']
person_three = ['chair','table','whiteboard','window','button']
person_four = ['queen','king','joker','phone']
description_a = ['table','yellow','car','king']
description_b = ['bottle','whiteboard','queen']
description_c = ['chair','car','car','phone']
description_d = ['joker','blue','earphone','king']
descriptors = {
    'a': description_a,
    'b': description_b,
    'c': description_c,
    'd': description_d
}
mystuff = [('person 1', person_one),
           ('person 2', person_two),
           ('person 3', person_three),
           ('person 4', person_four)]
labels = ['person','text']
df = pd.DataFrame.from_records(mystuff,columns = labels)
vocabulary_data = [
    person_one,
    person_two,
    person_three,
    person_four,
    description_a,
    description_b,
    description_c,
    description_d,
]
data = [set(sentence) for sentence in vocabulary_data]
vocabulary = set.union(*data)
cv = CountVectorizer(vocabulary=vocabulary)
def similarity(row, desc):
    a = cosine_similarity(cv.fit_transform(row['text']).sum(axis=0), cv.fit_transform(desc).sum(axis=0))
    return a.item()

for key, description in descriptors.items():
    df[key] = df.apply(lambda x: similarity(x, description), axis=1)
I used one for loop, but only to iterate over the different descriptions. The main "computation" is done by apply.

label-encoder encoding missing values

I am using the label encoder to convert categorical data into numeric values.
How does LabelEncoder handle missing values?
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)
Output:
array([1, 2, 3, 0, 4, 1])
For the above example, label encoder changed NaN values to a category. How would I know which category represents missing values?
Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you're using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().
As you can see in the source, it uses numpy.unique against the data to encode, which raises TypeError if missing values are found. If you want to encode missing values, first change their type to a string:
a[pd.isnull(a)] = 'NaN'
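A minimal sketch of that approach end-to-end, using a Series to keep the input 1-D and then asking the encoder which integer the placeholder received:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

a = pd.Series(['A', 'B', 'C', np.nan, 'D', 'A'])
a[pd.isnull(a)] = 'NaN'                # replace missing values with a string placeholder

le = LabelEncoder()
codes = le.fit_transform(a)
nan_code = le.transform(['NaN'])[0]    # the integer code that stands for missing values
print(codes, nan_code)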
You can also use a mask to replace values from the original data frame after labelling:
df = pd.DataFrame({'A': ['x', np.NaN, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.NaN]})
A B C
0 x 1 2.0
1 NaN 6 1.0
2 z 9 NaN
original = df
mask = df.isnull()
A B C
0 False False False
1 True False False
2 False False True
df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)
A B C
0 1.0 0 1.0
1 NaN 1 0.0
2 2.0 2 NaN
Here is a little computational hack I did for my own work:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
### fit with the desired col, col in position 0 for this example
fit_by = pd.Series([i for i in a.iloc[:,0].unique() if type(i) == str])
le.fit(fit_by)
### Set transformed col leaving np.NaN as they are
a["transformed"] = fit_by.apply(lambda x: le.transform([x])[0] if type(x) == str else x)
This is my solution, because I was not pleased with the solutions posted here. I needed a LabelEncoder that keeps my missing values as NaN to use an Imputer afterwards. So I have written my own LabelEncoder class. It works with DataFrames.
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder
class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        # List of column names in the DataFrame that should be encoded
        self.col = col
        # Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self, x, y=None):
        # Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el] != 'NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self, x, y=None):
        # Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el] != 'NaN']
            # Store an ndarray of the current column
            b = x[el].to_numpy()
            # Replace the elements in the ndarray that are not 'NaN'
            # using the transformer
            b[b != 'NaN'] = self.le_dic[el].transform(a)
            # Overwrite the column in the DataFrame
            x[el] = b
        # return the transformed DataFrame
        return x
You can pass in a DataFrame, not only a 1-dim Series. With col you can choose the columns that should be encoded.
I would like to hear some feedback.
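A hedged usage sketch of the class above, on a small made-up DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': ['u', 'v', np.nan]})
enc = LabelEncoderByCol(col=['A', 'B'])
encoded = enc.fit(df).transform(df)
print(encoded)   # missing cells keep the 'NaN' placeholder, everything else becomes an integer code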
I want to share my solution with you.
I created a module which takes a mixed dataset and converts it from categorical to numerical
and back.
The module is also available on my GitHub, well organized and with an example.
Please upvote if you like my solution.
Thanks,
Idan
class label_encoder_contain_missing_values:

    def __init__(self):
        pass

    def categorical_to_numeric(self, dataset):
        import numpy as np
        import pandas as pd
        self.dataset = dataset
        self.summary = None
        self.table_encoder = {}
        for index in self.dataset.columns:
            if self.dataset[index].dtypes == 'object':
                column_data_frame = pd.Series(self.dataset[index], name='column').to_frame()
                unique_values = pd.Series(self.dataset[index].unique())
                i = 0
                label_encoder = pd.DataFrame({'value_name': [], 'Encode': []})
                while i <= len(unique_values) - 1:
                    if unique_values.isnull()[i] == True:
                        label_encoder = label_encoder.append({'value_name': unique_values[i], 'Encode': np.nan}, ignore_index=True)  # np.nan = -1
                    else:
                        label_encoder = label_encoder.append({'value_name': unique_values[i], 'Encode': i}, ignore_index=True)
                    i += 1
                output = pd.merge(left=column_data_frame, right=label_encoder, how='left', left_on='column', right_on='value_name')
                self.summary = output[['column', 'Encode']].drop_duplicates().reset_index(drop=True)
                self.dataset[index] = output.Encode
                self.table_encoder.update({index: self.summary})
            else:
                pass

        # ---- Show Encode Table ----- #
        print('''\nLabel Encoding completed successfully.\n
        Next steps: \n
        1. To view table_encoder, execute the following: \n
           for index in table_encoder:
               print(f'\\n{index} \\n', table_encoder[index])
        2. For the inverse, execute the following: \n
           df = label_encoder_contain_missing_values().inverse_numeric_to_categorical(table_encoder, df) ''')

        return self.table_encoder, self.dataset

    def inverse_numeric_to_categorical(self, table_encoder, df):
        dataset = df.copy()
        for column in table_encoder.keys():
            df_column = df[column].to_frame()
            output = pd.merge(left=df_column, right=table_encoder[column], how='left', left_on=column, right_on='Encode')  # .rename(columns={'column_x': 'encode', 'column_y': 'category'})
            df[column] = output.column
        print('\nInverse Label Encoding, from numerical back to categorical, completed successfully.\n')
        return df
Execute the command to convert from categorical to numerical:
table_encoder, df = label_encoder_contain_missing_values().categorical_to_numeric(df)
Execute the command to convert from numerical to categorical:
df = label_encoder_contain_missing_values().inverse_numeric_to_categorical(table_encoder, df)
An easy way is this; it is an example using the Titanic dataset:
LABEL_COL = ["Sex", "Embarked"]

def label(df):
    _df = df.copy()
    le = LabelEncoder()
    for col in LABEL_COL:
        # Not-NaN index
        idx = ~_df[col].isna()
        _df.loc[idx, col] = le.fit(_df.loc[idx, col]).transform(_df.loc[idx, col])
    return _df
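A brief usage sketch (the CSV path here is hypothetical; the function assumes pandas and LabelEncoder are imported):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv('train.csv')        # hypothetical Titanic training file
train_encoded = label(train)
print(train_encoded[LABEL_COL].head())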
The most voted answer by @Kerem has typos, therefore I am posting the corrected and improved answer here:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
for j in a.columns.values:
    le = LabelEncoder()
    ### fit with the desired col, col in position 0 for this example
    fit_by = pd.Series([i for i in a[j].unique() if type(i) == str])
    le.fit(fit_by)
    ### Set transformed col leaving np.NaN as they are
    a["transformed"] = a[j].apply(lambda x: le.transform([x])[0] if type(x) == str else x)
You can handle missing values by replacing them with the string 'NaN'. The category used for them can then be obtained with le.transform():
le.fit_transform(a.fillna('NaN'))
category = le.transform(['NaN'])
Another option is to cast the column to string, so the label encoder simply treats missing values as the category 'nan':
a = le.fit_transform(a.astype(str))
You can fill the na's by some value and later change the dataframe column type to string to make things work.
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
a = a.fillna(99)
le = LabelEncoder()
le.fit_transform(a.astype(str))
The following encoder handles None values in each column:
from collections import defaultdict
import pandas as pd
from sklearn import preprocessing

class MultiColumnLabelEncoder:
    def __init__(self):
        self.columns = None
        self.led = defaultdict(preprocessing.LabelEncoder)

    def fit(self, X):
        self.columns = X.columns
        for col in self.columns:
            cat = X[col].unique()
            cat = [x if x is not None else "None" for x in cat]
            self.led[col].fit(cat)
        return self

    def fit_transform(self, X):
        if self.columns is None:
            self.fit(X)
        return self.transform(X)

    def transform(self, X):
        return X.apply(lambda x: self.led[x.name].transform(x.apply(lambda e: e if e is not None else "None")))

    def inverse_transform(self, X):
        return X.apply(lambda x: self.led[x.name].inverse_transform(x))
Usage example:
df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', None, 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', None]
})
print(df)
location owner pets
0 San_Diego Champ cat
1 New_York Ron dog
2 New_York Brick cat
3 San_Diego None monkey
4 San_Diego Veronica dog
5 None Ron dog
le = MultiColumnLabelEncoder()
le.fit(df)
transformed = le.transform(df)
print(transformed)
location owner pets
0 2 1 0
1 0 3 1
2 0 0 0
3 2 2 2
4 2 4 1
5 1 3 1
inverted = le.inverse_transform(transformed)
print(inverted)
location owner pets
0 San_Diego Champ cat
1 New_York Ron dog
2 New_York Brick cat
3 San_Diego None monkey
4 San_Diego Veronica dog
5 None Ron dog
This function takes a column from a dataframe and returns the column with only the non-NaN values label encoded; the rest remain untouched.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
def label_encode_column(col):
    nans = col.isnull()
    nan_lst = []
    nan_idx_lst = []
    label_lst = []
    label_idx_lst = []

    for idx, nan in enumerate(nans):
        if nan:
            nan_lst.append(col[idx])
            nan_idx_lst.append(idx)
        else:
            label_lst.append(col[idx])
            label_idx_lst.append(idx)

    nan_df = pd.DataFrame(nan_lst, index=nan_idx_lst)
    label_df = pd.DataFrame(label_lst, index=label_idx_lst)

    label_encoder = LabelEncoder()
    label_df = label_encoder.fit_transform(label_df.astype(str))
    label_df = pd.DataFrame(label_df, index=label_idx_lst)

    final_col = pd.concat([label_df, nan_df])
    return final_col.sort_index()
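A short usage sketch on an illustrative column with a missing value:
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['Paris', np.nan, 'Rome', 'Paris']})
encoded = label_encode_column(df['city'])
print(encoded)   # integer codes for the non-missing rows, NaN kept where it was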
This is how I did it:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
UNKNOWN_TOKEN = '<unknown>'
a = pd.Series(['A','B','C', 'D','A'], dtype=str).unique().tolist()
a.append(UNKNOWN_TOKEN)
le = LabelEncoder()
le.fit_transform(a)
embedding_map = dict(zip(le.classes_, le.transform(le.classes_)))
and when applying to new test data:
test_df = test_df.apply(lambda x: x if x in embedding_map else UNKNOWN_TOKEN)
le.transform(test_df)
I also wanted to contribute my workaround, as I found the others a bit tedious when working with categorical data that contains missing values.
# Create a random dataframe
foo = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
# Randomly intersperse column 'A' with missing data (NaN)
foo['A'][np.random.randint(0,len(foo), size=20)] = np.nan
# Convert this series to string, to simulate our problem
series = foo['A'].astype(str)
# np.nan are converted to the string "nan", mask these out
mask = (series == "nan")
# Apply the LabelEncoder to the unmasked series, replace the masked series with np.nan
series[~mask] = LabelEncoder().fit_transform(series[~mask])
series[mask] = np.nan
foo['A'] = series
This is my attempt!
import numpy as np
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
#Now lets encode the incomplete Cabin feature
titanic_train_le['Cabin'] = le.fit_transform(titanic_train_le['Cabin'].astype(str))
#get nan code for the cabin categorical feature
cabin_nan_code=le.transform(['nan'])[0]
#Now, retrieve the nan values in the encoded data
titanic_train_le['Cabin'].replace(cabin_nan_code,np.nan,inplace=True)
I just created my own encoder which can encode a whole dataframe at once.
With this class, None is encoded to 0, which can be handy when building a sparse matrix.
Note that the input dataframe must contain categorical columns only.
class DF_encoder():
    def __init__(self):
        self.mapping = {None: 0}
        self.inverse_mapping = {0: None}
        self.all_keys = []

    def fit(self, df: pd.DataFrame):
        for col in df.columns:
            keys = list(df[col].unique())
            self.all_keys += keys
        self.all_keys = list(set(self.all_keys))
        for i, item in enumerate(iterable=self.all_keys, start=1):
            if item not in self.mapping.keys():
                self.mapping[item] = i
                self.inverse_mapping[i] = item

    def transform(self, df):
        temp_df = pd.DataFrame()
        for col in df.columns:
            temp_df[col] = df[col].map(self.mapping)
        return temp_df

    def inverse_transform(self, df):
        temp_df = pd.DataFrame()
        for col in df.columns:
            temp_df[col] = df[col].map(self.inverse_mapping)
        return temp_df
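An illustrative usage sketch of DF_encoder on a purely categorical frame containing None (following the class's stated behaviour that None maps to 0):
import pandas as pd

df = pd.DataFrame({'color': ['red', None, 'blue'],
                   'size': ['S', 'M', None]})

enc = DF_encoder()
enc.fit(df)
encoded = enc.transform(df)              # None -> 0, other values -> positive integers
restored = enc.inverse_transform(encoded)
print(encoded)
print(restored)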
I faced the same problem, but none of the above worked for me, so I added a new row to the training data consisting only of "nan".

python alternative to scan('file', what=list(...)) in R

I have a file in following format:
10000
2
2
2
2
0.00
0.00
0 1
0.00
0.01
0 1
...
I want to create a dataframe from this file (skipping the first 5 lines) like this:
x1 x2 y1 y2
0.00 0.00 0 1
0.00 0.01 0 1
So the lines are converted to columns (where each third line is also split into two columns, y1 and y2).
In R I did this as follows:
df = as.data.frame(scan(".../test.txt", what=list(x1=0, x2=0, y1=0, y2=0), skip=5))
I am looking for a python alternative (pandas?) to this scan(file, what=list(...)) function.
Does it exist or do I have to write a more extended script?
You can skip the first 5 lines and then take groups of 4 lines to build a Python list, then put that in pandas as a start... I wouldn't be surprised if pandas offered something better though:
from itertools import islice, izip_longest
with open('input') as fin:
    # Skip header(s) at start
    after5 = islice(fin, 5, None)
    # Take remaining data and group it into groups of 4 lines each... The
    # first 2 are float data, the 3rd is two integers together, and the 4th
    # is the blank line between groups... We use izip_longest to ensure we
    # always have 4 items (padded with None if needs be)...
    for lines in izip_longest(*[iter(after5)] * 4):
        # Convert first two lines to float, and take 3rd line, split it and
        # convert to integers
        print map(float, lines[:2]) + map(int, lines[2].split())
        #[0.0, 0.0, 0, 1]
        #[0.0, 0.01, 0, 1]
As far as I know, I cannot see any option here http://pandas.pydata.org/pandas-docs/stable/io.html to organize your DataFrame the way you want;
but you can achieve it easily:
lines = open('YourDataFile.txt').read() # read the whole file
import re # import re
elems = re.split('\n| ', lines)[5:] # split each element and exclude the first 5
grouped = zip(*[iter(elems)]*4) # group them 4 by 4
import pandas as pd # import pandas
df = pd.DataFrame(grouped) # construct DataFrame
df.columns = ['x1', 'x2', 'y1', 'y2'] # columns names
It's not concise, it's not elegant, but it's clear what it does...
OK, here's how I did it (it is in fact a combo of Jon's & Giupo's answer, tnx guys!):
with open('myfile.txt') as file:
    data = file.read().split()[5:]

grouped = zip(*[iter(data)]*4)

import pandas as pd
df = pd.DataFrame(grouped)
df.columns = ['x1', 'x2', 'y1', 'y2']
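One possible follow-up (a hedged note): since the grouped values are still strings at this point, the columns come out as object dtype, so you may want to convert them afterwards, e.g.:
df = df.astype({'x1': float, 'x2': float, 'y1': int, 'y2': int})
print(df.dtypes)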
