I'm trying to encode genomes from strings stored in a dataframe to an array of corresponding numerical values.
Here is part of my dataframe (for some reason printing it only shows 2 of the 5 columns):
Antibiotic ... Genome
0 isoniazid ... ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1 isoniazid ... gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2 isoniazid ... aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3 isoniazid ... gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4 isoniazid ... ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...
So I need to split these strings character by character and map each character to a float. This is the lookup table I was using:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00
    # z: 0.00
}
I tried to apply this directly using:
dataframe['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
But I have too much data to fit into memory, so I'm trying to process it in chunks, and I'm having trouble defining a preprocessing function.
Here's my code so far:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00
    # z: 0.00
}

dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath, chunksize=10)

chunk_list = []

def preprocess(chunk):
    chunk['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
    return;

for chunk in dataframe:
    chunk_filter = preprocess(chunk)
    chunk_list.append(chunk_filter)

dataframe1 = pd.concat(chunk_list)
print(dataframe1)
Thanks in advance!
You have chunk_filter = preprocess(chunk), but your preprocess() function returns nothing, so chunk_filter is always None. Modify your preprocess function to store the result of the apply() call and then return that value. For example:
def preprocess(chunk):
    processed_chunk = chunk['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
    return processed_chunk
By doing this, you actually return the data from the preprocess function so that it can be appended to the chunk list. As you have it currently, the preprocess function works correctly but essentially discards the results.
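One follow-up to be aware of: because of the trailing .values, preprocess() returns a NumPy array, and pd.concat() only accepts pandas objects, so the final concatenation will fail. A minimal sketch of one way around that (assuming every genome string has the same length, so the per-chunk frames share the same columns) is to keep each chunk's result as a DataFrame and only convert to an array at the very end:
import pandas as pd

def preprocess(chunk):
    # keep the per-chunk result as a DataFrame so pd.concat can stitch the chunks together
    return chunk['Genome'].apply(
        lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])
    )

chunk_list = []
for chunk in pd.read_csv(dfpath, chunksize=10):
    chunk_list.append(preprocess(chunk))

encoded = pd.concat(chunk_list, ignore_index=True)
encoded_array = encoded.to_numpy()  # convert to a NumPy array only once everything is assembled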
I have an Excel sheet whose data I graph using pandas, matplotlib, and numpy, but I also want to get both the max and min values of every cycle I have. Here's an idea of what my data looks like, except it's about 50k rows and only reaches 21 cycles.
Cycle  Voltage
1       0.30
1       0.05
1      -0.25
2       0.35
2       0.07
2      -0.23
My best idea was to use a for loop to find the max and min in every cycle, but I can't seem to get it working at all and I'm not really sure how to go about it.
Here's one way to do it. For testing purposes, I put the following data in a CSV file:
Cycle,Voltage
1,0.30
1,0.05
1,-0.25
13,0.03
13,0.005
13,-0.025
2,0.35
2,0.07
2,-0.23
Here's the code:
from itertools import groupby
import csv
from operator import itemgetter
from typing import NamedTuple

class Record(NamedTuple):
    """ Define the fields and their types of a record. """
    Cycle: int
    Voltage: float

    @classmethod
    def transform(cls: 'Record', dict_: dict) -> tuple:
        """ Convert string values in given dictionary to corresponding Record
            class field type."""
        return tuple(cls.__annotations__[name](value) for name, value in dict_.items())

filepath = 'voltages.csv'

with open(filepath, newline='') as file:
    reader = csv.DictReader(file)
    data = sorted(map(Record.transform, reader))

groups = {}
for k, g in groupby(data, key=itemgetter(0)):
    voltages = tuple(map(itemgetter(1), g))
    groups[k] = (min(voltages), max(voltages))

print(groups)  # -> {1: (-0.25, 0.3), 2: (-0.23, 0.35), 13: (-0.025, 0.03)}
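Since the question already mentions pandas, a shorter alternative (just a sketch, assuming the same voltages.csv layout and column names as above) is to let groupby/agg do the work:
import pandas as pd

df = pd.read_csv('voltages.csv')
# min and max voltage per cycle, one row per cycle
print(df.groupby('Cycle')['Voltage'].agg(['min', 'max']))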
Hi, I'm trying to encode a genome, stored as a string inside a dataframe read from a CSV.
Right now I'm looking to split each string in the dataframe under the column 'Genome' into a list of its base pairs, i.e. from ('acgt...') to ('a','c','g','t', ...), and then convert each base pair into a float (0.25, 0.50, 0.75, 1.00 respectively).
I thought I was looking for a split function to split each string into characters, but none seem to work on the data in the dataframe, even when transformed to a string using .tostring().
Here's my most recent code:
import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array

label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','g','c','t','z']))

def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25  # A
    float_encoded[float_encoded == 1] = 0.50  # C
    float_encoded[float_encoded == 2] = 0.75  # G
    float_encoded[float_encoded == 3] = 1.00  # T
    float_encoded[float_encoded == 4] = 0.00  # anything else, z
    return float_encoded
dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath)
df = ordinal_encoder(string_to_array(dataframe[['Genome']].values.tostring()))
print(df)
I've tried making my own function, but I don't really have a clue how these work. Everything I try points to not being able to process the data when it's in a NumPy array, and nothing is working to transform the data to another type.
Thanks for the tips!
Edit: Here is the print of the dataframe-
Antibiotic ... Genome
0 isoniazid ... ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1 isoniazid ... gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2 isoniazid ... aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3 isoniazid ... gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4 isoniazid ... ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...
There are 5 columns, 'Genome' being the 5th in the list. I don't know why 1. .head() will not work and 2. print() doesn't give me all the columns...
I don't think LabelEncoder is what you want. This is a simple transformation, so I recommend doing it directly. Start with a lookup table for your base-pair mapping:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00
    # z: 0.00
}
Then apply the lookup to each value of the "Genome" column. The values attribute returns the resulting dataframe as an ndarray.
dataframe['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
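For example, on a tiny made-up dataframe (hypothetical data, just to illustrate the shape of the result), each genome becomes one row of floats, and shorter genomes are padded with NaN up to the length of the longest one:
import pandas as pd

lookup = {'a': 0.25, 'g': 0.50, 'c': 0.75, 't': 1.00}  # anything else -> 0.0

demo = pd.DataFrame({'Genome': ['acgt', 'ggn']})  # hypothetical example data
encoded = demo['Genome'].apply(
    lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])
).values
print(encoded)  # 2 rows: [0.25, 0.75, 0.5, 1.0] and [0.5, 0.5, 0.0, nan]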
I'm calculating the frequency of words across many text files (140 docs); the end goal is to create a CSV file where I can order the frequency of every word by single doc and across all docs.
Let say I have:
absolut_freq= {u'hello':0.001, u'world':0.002, u'baby':0.005}
doc_1= {u'hello':0.8, u'world':0.9, u'baby':0.7}
doc_2= {u'hello':0.2, u'world':0.3, u'baby':0.6}
...
doc_140={u'hello':0.1, u'world':0.5, u'baby':0.9}
So, what I need is a CSV file to open in Excel that looks like this:
WORD, ABS_FREQ, DOC_1_FREQ, DOC_2_FREQ, ..., DOC_140_FREQ
hello, 0.001, 0.8, 0.2, ..., 0.1
world, 0.002, 0.9, 0.3, ..., 0.5
baby, 0.005, 0.7, 0.6, ..., 0.9
How can I do it with Python?
You could also convert it to a pandas DataFrame and save it as a CSV file, or continue analysis in a clean format.
import pandas as pd

absolut_freq = {u'hello': 0.001, u'world': 0.002, u'baby': 0.005}
doc_1 = {u'hello': 0.8, u'world': 0.9, u'baby': 0.7}
doc_2 = {u'hello': 0.2, u'world': 0.3, u'baby': 0.6}
doc_140 = {u'hello': 0.1, u'world': 0.5, u'baby': 0.9}

all_docs = [absolut_freq, doc_1, doc_2, doc_140]

# if you have a bunch of docs, you could use enumerate and then format the
# column name as you iterate over them and build the dataframe
colnames = ['AbsoluteFreq', 'Doc1', 'Doc2', 'Doc140']

masterdf = pd.DataFrame()
for d in all_docs:
    df = pd.DataFrame([d]).T  # one column per dictionary, with the words as the index
    masterdf = pd.concat([masterdf, df], axis=1)

# assign the column names
masterdf.columns = colnames

# get a glimpse of what the data frame looks like
masterdf.head()

# save to csv
masterdf.to_csv('docmatrix.csv', index=True)

# and to sort the dataframe by frequency (DataFrame.sort was removed; use sort_values)
masterdf.sort_values('AbsoluteFreq')
You can make it a mostly data-driven process, given only the names of the dictionary variables, by first creating a table with all the data listed in it and then using the csv module to write a transposed version of it (columns and rows swapped) to the output file.
import csv

absolut_freq = {u'hello': 0.001, u'world': 0.002, u'baby': 0.005}
doc_1 = {u'hello': 0.8, u'world': 0.9, u'baby': 0.7}
doc_2 = {u'hello': 0.2, u'world': 0.3, u'baby': 0.6}
doc_140 = {u'hello': 0.1, u'world': 0.5, u'baby': 0.9}

dic_names = ('absolut_freq', 'doc_1', 'doc_2', 'doc_140')  # dict variable names
namespace = globals()

words = namespace[dic_names[0]].keys()  # assume dicts all contain the same words
table = [['WORD'] + list(words)]  # header row (becomes first column of output)
for dic_name in dic_names:  # add values from each dictionary given its name
    table.append([dic_name.upper() + '_FREQ'] + list(namespace[dic_name].values()))

# Use open('merged_dicts.csv', 'wb') for Python 2.
with open('merged_dicts.csv', 'w', newline='') as csvfile:
    csv.writer(csvfile).writerows(zip(*table))

print('done')
CSV file produced:
WORD,ABSOLUT_FREQ_FREQ,DOC_1_FREQ,DOC_2_FREQ,DOC_140_FREQ
world,0.002,0.9,0.3,0.5
baby,0.005,0.7,0.6,0.9
hello,0.001,0.8,0.2,0.1
No matter how you want to write this data, first you need an ordered data structure, for example a 2D list:
docs = []
docs.append( {u'hello':0.001, u'world':0.002, u'baby':0.005} )
docs.append( {u'hello':0.8, u'world':0.9, u'baby':0.7} )
docs.append( {u'hello':0.2, u'world':0.3, u'baby':0.6} )
docs.append( {u'hello':0.1, u'world':0.5, u'baby':0.9} )
words = docs[0].keys()
result = [ [word] + [ doc[word] for doc in docs ] for word in words ]
then you can use the built-in csv module: https://docs.python.org/2/library/csv.html
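For instance, a minimal sketch of that last step in Python 3 (the output filename and header row here are just illustrative):
import csv

with open('frequencies.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['WORD', 'ABS_FREQ', 'DOC_1_FREQ', 'DOC_2_FREQ', 'DOC_140_FREQ'])
    writer.writerows(result)  # one row per word: [word, abs_freq, doc_1, doc_2, ..., doc_140]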
My function outputs a list, for instance when I type:
My_function('TV', 'TV_Screen')
it outputs the following:
['TV', 1, 'TV_Screen', 0.04, 'True']
Now, my TV is made of several parts, such as a speaker, a transformer, etc. I can keep running my function for each part, for instance changing 'TV_Screen' to 'TV_Speaker', or 'TV_transformer', etc.
The alternative is to create a list with all the parts, such as:
TV_parts = ['TV_Screen', 'TV_Speaker', 'TV_transformer']
What I am trying to get is a pandas data frame with 5 columns (because my function outputs 5 values, see "it outputs the following" above) and in this case 3 rows (one each for 'TV_Screen', 'TV_Speaker', and 'TV_transformer'). Basically, I want the following to be in a data frame:
['TV', 1, 'TV_Screen', 0.04, 'True']
['TV', 9, 'TV_Speaker', 0.56, 'True']
['TV', 3, 'TV_transformer', 0.80, 'False']
I know I need a for loop somewhere, but I am not sure how to create this data frame. Could you please help? (I can change the output of my function to be a pd.Series or something else that would work better).
Thanks!
If you have many arrays, it may be worth converting them into a numpy matrix first and then converting them into a dataframe.
import pandas as pd
import numpy as np
a = ['TV', 1, 'TV_Screen', 0.04, 'True']
b = ['TV', 9, 'TV_Speaker', 0.56, 'True']
c = ['TV', 3, 'TV_transformer', 0.80, 'False']
matrix = np.matrix([a,b,c])
df = pd.DataFrame(data=matrix)
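One caveat worth noting: because each row mixes strings and numbers, np.matrix coerces everything to strings (and np.matrix is discouraged in recent NumPy anyway). A sketch that keeps the numeric columns numeric, with hypothetical column names, is to pass the lists straight to the DataFrame constructor:
import pandas as pd

a = ['TV', 1, 'TV_Screen', 0.04, 'True']
b = ['TV', 9, 'TV_Speaker', 0.56, 'True']
c = ['TV', 3, 'TV_transformer', 0.80, 'False']

# each list becomes one row; per-column dtypes are inferred (int, float, object)
df = pd.DataFrame([a, b, c], columns=['item', 'count', 'part', 'value', 'flag'])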
You can do it like this:
import pandas as pd

def My_function(part):
    # prepare result
    result = ['TV', 1, part, 0.04, 'True']  # dummy values for testing
    return result

TV_parts = ['TV_Screen', 'TV_Speaker', 'TV_transformer']

df = pd.DataFrame([My_function(part) for part in TV_parts])
>>> df
0 1 2 3 4
0 TV 1 TV_Screen 0.04 True
1 TV 1 TV_Speaker 0.04 True
2 TV 1 TV_transformer 0.04 True
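If you also want meaningful headers instead of the default 0-4, you can pass column names to the constructor (the names below are hypothetical; pick whatever matches your real output):
df = pd.DataFrame(
    [My_function(part) for part in TV_parts],
    columns=['item', 'count', 'part', 'value', 'flag'],  # hypothetical names
)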
I have a file in following format:
10000
2
2
2
2
0.00
0.00
0 1
0.00
0.01
0 1
...
I want to create a dataframe from this file (skipping the first 5 lines) like this:
x1 x2 y1 y2
0.00 0.00 0 1
0.00 0.01 0 1
So the lines are converted to columns (where each third line is also split into two columns, y1 and y2).
In R I did this as follows:
df = as.data.frame(scan(".../test.txt", what=list(x1=0, x2=0, y1=0, y2=0), skip=5))
I am looking for a python alternative (pandas?) to this scan(file, what=list(...)) function.
Does it exist or do I have to write a more extended script?
You can skip the first 5 lines, then take groups of 4 lines to build a Python list, and then put that into pandas as a start... I wouldn't be surprised if pandas offered something better though:
from itertools import islice, izip_longest

with open('input') as fin:
    # Skip header(s) at start
    after5 = islice(fin, 5, None)
    # Take remaining data and group it into groups of 4 lines each... The
    # first 2 are float data, the 3rd is two integers together, and the 4th
    # is the blank line between groups... We use izip_longest to ensure we
    # always have 4 items (padded with None if needs be)...
    for lines in izip_longest(*[iter(after5)] * 4):
        # Convert first two lines to float, and take 3rd line, split it and
        # convert to integers
        print map(float, lines[:2]) + map(int, lines[2].split())
        #[0.0, 0.0, 0, 1]
        #[0.0, 0.01, 0, 1]
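The snippet above is Python 2 (izip_longest and the print statement). A rough Python 3 translation of the same idea, under the same assumption that each group of lines is followed by a blank separator line, would be:
from itertools import islice, zip_longest

with open('input') as fin:
    after5 = islice(fin, 5, None)  # skip the 5 header lines
    for lines in zip_longest(*[iter(after5)] * 4):
        # first two lines are floats, the third holds the two integers
        print([float(v) for v in lines[:2]] + [int(v) for v in lines[2].split()])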
As far as I can see, there are no options at http://pandas.pydata.org/pandas-docs/stable/io.html to organize your DataFrame the way you want, but you can achieve it easily:
lines = open('YourDataFile.txt').read() # read the whole file
import re # import re
elems = re.split('\n| ', lines)[5:] # split each element and exclude the first 5
grouped = zip(*[iter(elems)]*4) # group them 4 by 4
import pandas as pd # import pandas
df = pd.DataFrame(grouped) # construct DataFrame
df.columns = ['x1', 'x2', 'y1', 'y2'] # columns names
It's not concise, it's not elegant, but it's clear what it does...
OK, here's how I did it (it's in fact a combo of Jon's & Giupo's answers, thanks guys!):
import pandas as pd

with open('myfile.txt') as file:
    data = file.read().split()[5:]

grouped = zip(*[iter(data)] * 4)

df = pd.DataFrame(grouped)
df.columns = ['x1', 'x2', 'y1', 'y2']
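One small follow-up, assuming you need real numbers afterwards (e.g. for plotting): everything parsed this way is still a string, so the columns can be converted in one go:
df = df.apply(pd.to_numeric)  # x1/x2 become floats, y1/y2 become ints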