Categorising data according to one column in Python

Hi, I have a dataset as follows, e.g.:

sample  pos  mutation
2fec2   40   TC
1f3c    40   TC
19b0    40   TC
tld3    60   CG

I want to find a Python way to, for example, find every instance where 2fec2 and 1f3c have the same mutation, and print the position. So far I have tried the following, but it simply returns everything. I am completely new to Python and trying to wean myself off awk - please help!
from sys import argv
script, vcf_file = argv
import vcf

vcf_reader = vcf.Reader(open(vcf_file, 'r'))
for record.affected_start in vcf_reader:  # .affected_start is this module's way of getting the parsed pos column from this type of bioinformatics file
    if record.sample == 2fec2 and 1f3c != 19b0 != t1d3:  # ditto .sample
        print record.affected_start

I'm assuming your data is in the format you describe and not VCF.
You can simply parse the file with standard Python techniques and, for each (pos, mutation) pair, build the set of samples that have it, as follows:
from sys import argv
from collections import defaultdict

# More convenient than a normal dict: an empty set will be
# automatically created whenever a new key is accessed.
# Keys will be (pos, mutation) pairs;
# values will be sets of sample names.
mutation_dict = defaultdict(set)

# This "with" syntax ("context manager") is recommended
# because file closing will be handled automatically.
with open(argv[1], "r") as data_file:
    # Read the first line and check the headers.
    # (assert <something False>, "message"
    # will make the program exit and display "message".)
    assert data_file.readline().strip().split() == ["sample", "pos", "mutation"], "Unexpected column names"
    # .strip() removes the end-of-line character;
    # .split() splits into a list of words
    # (by default using blanks as the separator).
    # .readline() has "consumed" the first line,
    # so we can now loop over the rest of the lines,
    # which should contain the data.
    for line in data_file:
        # Extract the fields
        sample, pos, mutation = line.strip().split()
        # Add the sample to the set of samples having
        # this (pos, mutation) combination
        mutation_dict[(pos, mutation)].add(sample)

# Now loop over the (key, value) pairs in our dict:
for (pos, mutation), samples in mutation_dict.items():
    # True only if both 2fec2 and 1f3c are in the set
    # (<= is the subset test)
    if {"2fec2", "1f3c"} <= samples:
        print("2fec2 and 1f3c share mutation %s at position %s" % (mutation, pos))
With your example data as the first argument of the script, this outputs:
2fec2 and 1f3c share mutation TC at position 40

How about this:

from sys import argv
script, vcf_file = argv
import vcf

vcf_reader = vcf.Reader(open(vcf_file, 'r'))
# Store our results outside of the loop
fecResult = ""
f3cResult = ""
# For each record
for record in vcf_reader:
    if record.sample == "2fec2":
        fecResult = record.mutation
    if record.sample == "1f3c":
        f3cResult = record.mutation
# Outside of the loop, compare the results and, if they match, print the record
if fecResult == f3cResult:
    print record.affected_start


multiple modification to a list at once

I have a text file of some IPs and MACs. The format of the MACs is xxxx.xxxx.xxxx, and I need to change all the MACs to xx:xx:xx:xx:xx:xx.
I am already reading the file and putting it into a list. Now I am looping through each line of the list and I need to make multiple modifications: I need to remove the IPs and then change the MAC format.
The problem I am running into is that I can't seem to figure out how to do this in one pass, unless I copy the list to a new list for every modification.
How can I loop through the list once, and update each element of the list with all my modifications?
count = 0
output3 = []
for line in output:
    #print(line)
    # removes any extra spaces between words in a string
    output[count] = str(" ".join(line.split()))
    # create a new list with just the MAC addresses
    output3.append(str(output[count].split(" ")[3]))
    # create a new list with MACs using a ":"
    count += 1
print(output3)
It appears you are trying to overthink the problem, so that may be where your frustration is spinning you around a bit.
First, you should always consider whether you need a count variable in Python. Usually you do not, and the enumerate() function is your friend here.
Second, there is no need to process data multiple times in Python. You can use variables to your advantage and leverage Python's expressiveness, rather than trying to hide your problem from the language.
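For instance, here is a minimal sketch of enumerate() replacing a manual counter (the list contents are hypothetical):

# enumerate() yields (index, element) pairs, so no separate count variable is needed
output = ["10.0.0.1  abcd.ef01.2345", "10.0.0.2  6789.abcd.ef01"]
for i, line in enumerate(output):
    output[i] = " ".join(line.split())  # normalise whitespace in place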
Below is an implementation example that may help you think through your approach. Good luck on solving your harder problems, and I hope Python will help you out with them!
#!/usr/bin/env python3
import re
from typing import Iterable

# non-regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: this assumes a source with '.' separators only
# reformat_mac = lambda _: ':'.join(_ for _ in _.split('.') for _ in (_[:2], _[2:]))

# regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: only requires at least two hex digits adjacent at a time
reformat_mac = lambda _: ":".join(re.findall(r"(?i)[\da-f]{2}", _))

def generate_output3(output: Iterable[str]) -> Iterable[str]:
    for line in output:
        col1, col2, col3, mac, *cols = line.split()
        mac = reformat_mac(mac)
        yield " ".join((col1, col2, col3, mac, *cols))

if __name__ == "__main__":
    output = [
        "abc def ghi 1122.3344.5566",
        "jklmn op qrst 11a2.33c4.55f6 uv wx yz",
        "zyxwu 123 next 11a2.33c4.55f6 uv wx yz",
    ]
    for line in generate_output3(output):
        print(line)
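For reference, running the __main__ example above should print:

abc def ghi 11:22:33:44:55:66
jklmn op qrst 11:a2:33:c4:55:f6 uv wx yz
zyxwu 123 next 11:a2:33:c4:55:f6 uv wx yz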
Solution
You can use the regex (regular expression) module to extract any pattern that matches that of the mac-ids ("xxxx:xxxx:xxxx") and then process it to produce the expected output ("xx-xx-xx-xx-xx-xx"), as shown below.
Note: I have used a dummy data file (see section: Dummy Data below) to make this answer reproducible. It should work with your data as well.
# import re
filepath = "input.txt"
content = read_file(filepath)
mac_ids = extract_mac_ids(content, format=True) # format=False --> "xxxx:xxxx:xxxx"
print(mac_ids)
## OUTPUT:
#
# ['a0-b1-ff-33-ac-d5',
# '11-b9-33-df-55-f6',
# 'a4-d1-e7-33-ff-55',
# '66-a1-b2-f3-b9-c5']
Code: Convenience Functions
How does the regex work? See the quick check after the code below.
def read_file(filepath: str):
    """Reads and returns the content of a file."""
    with open(filepath, "r") as f:
        content = f.read()  # read in one attempt
    return content

def format_mac_id(mac_id: str):
    """Returns a formatted mac_id.
    INPUT FORMAT: "xxxxxxxxxxxx"
    OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
    """
    mac_id = list(mac_id)
    mac_id = ''.join([f"-{v}" if (i % 2 == 0) else v for i, v in enumerate(mac_id)])[1:]
    return mac_id

def extract_mac_ids(content: str, format: bool=True):
    """Extracts and returns a list of formatted mac_ids.
    INPUT FORMAT: "xxxx:xxxx:xxxx"
    OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
    """
    import re
    # pattern = "(" + ':'.join([r"\w{4}"]*3) + "|" + ':'.join([r"\w{2}"]*6) + ")"
    # pattern = r"(\w{4}:\w{4}:\w{4}|\w{2}:\w{2}:\w{2}:\w{2}:\w{2}:\w{2})"
    pattern = r"(\w{4}:\w{4}:\w{4})"
    pat = re.compile(pattern)
    mac_ids = pat.findall(content)  # returns a list of all mac-ids
    # Replace the ":" with "" and then format
    # each mac-id as: "xx-xx-xx-xx-xx-xx"
    if format:
        mac_ids = [format_mac_id(mac_id.replace(":", "")) for mac_id in mac_ids]
    return mac_ids
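As a quick check of how the regex works (the pattern simply grabs three colon-separated groups of four word characters):

import re

print(re.findall(r"(\w{4}:\w{4}:\w{4})", "a0b1:ff33:acd5 foo 11b9:33df:55f6"))
# ['a0b1:ff33:acd5', '11b9:33df:55f6']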
Dummy Data
The following code block creates a dummy file with some sample mac-ids.
filepath = "input.txt"
s = """
a0b1:ff33:acd5 ghwvauguvwi ybvakvi
klasilvavh; 11b9:33df:55f6
haliviv
a4d1:e733:ff55
66a1:b2f3:b9c5
"""
# Create dummy data file
with open(filepath, "w") as f:
    f.write(s)

How to deal with MapReduce by identifying the keys in Python Hadoop

I have two key values from the map function: NY and Others. So the output of my map is either NY 1 or Other 1 - only these two cases.
My map function:
#!/usr/bin/env python
import sys
import csv
import string

reader = csv.reader(sys.stdin, delimiter=',')
for entry in reader:
    if len(entry) == 22:
        registration_state = entry[16]
        print('{0}\t{1}'.format(registration_state, int(1)))
Now I need to use reducers to process the map outputs. My reduce:
#!/usr/bin/env python
import sys
import string

currentkey = None
ny = 0
other = 0

# input comes from STDIN (stream data that goes to the program)
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Get key/value
    key, values = line.split('\t', 1)
    values = int(values)
    # If the key is NY, count it...
    if key == 'NY':
        ny = ny + 1
    # ...otherwise it counts as Other
    else:
        other = other + 1

# Output the result for both keys
print('{0}\t{1}'.format('NY', ny))
print('{0}\t{1}'.format('Other', other))
From these, the MapReduce job gives two output files, each containing both NY and Other counts, e.g. one contains NY 1248, Other 4677, and the other contains NY 0, Other 1000. This is because two reducers split the output from the map, so two result files are generated, and merging them gives the final output.
However, I would like to change my reduce or map functions so that each reducer processes only one key, i.e. one reducer deals only with NY as the key and the other works on Other. I expect results like one file containing NY 1258, Other 0, and the other containing NY 0, Other 5677.
How can I adjust my functions to achieve the results I expect?
Probably you need to use Python iterators and generators.
An excellent example is given in this link. I have tried rewriting your code in the same style (not tested):
Mapper:

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""
import csv
import sys

def main(separator='\t'):
    reader = csv.reader(sys.stdin, delimiter=',')
    for entry in reader:
        if len(entry) == 22:
            registration_state = entry[16]
            print '%s%s%d' % (registration_state, separator, 1)

if __name__ == "__main__":
    main()
Reducer:

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""
from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # groupby assumes the input is sorted by key, which Hadoop guarantees
    data = read_mapper_output(sys.stdin, separator=separator)
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
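To test the pair locally outside Hadoop, you can simulate the streaming shuffle by putting a sort between the two scripts (assuming they are saved as mapper.py and reducer.py):

cat data.csv | python mapper.py | sort -k1,1 | python reducer.py

Hadoop streaming sorts the mapper output by key before it reaches the reducer, which is exactly what the groupby call above relies on.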

Can't get unique word/phrase counter to work - Python

I'm having trouble getting anything written to my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter

# path to folder containing all the files
str_dir_folder = '../data'
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'

# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
    if str_each_file.endswith('data'):
        # open file and read
        with open(str_dir_folder+str_each_file,'r') as file_r_data:
            str_file_data = file_r_data.read()
        # add data to list
        list_file_data.append(str_file_data)

# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)

# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)

# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)

# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
    str_word = str_each_item
    int_freq = dict_word_count[str_each_item]
    str_out_line = '"%s",%d' % (str_word,int_freq)
    # populates output list
    list_output_data.append(str_out_line)

# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.
It would be helpful if we got more information, such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:

- list_file_data is assigned to '../data/phrases.txt', but there is then a loop through all files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm); see the short sketch after the main code below.
- You named your file 'phrases.txt' but then check for files that end with 'data'. I've removed this logic.
- You've placed the data set into a list when findall works just fine with strings and ignores the special characters that you've manually removed. Test here: https://regex101.com/ to make sure.
- Changed 'w+' to '\w+' - check out the above link.
- Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object, which has an 'iteritems' method to roll through each key and value. I also changed the variable name to 'counter_word_count' to be slightly more accurate.
- Instead of manually generating CSVs, I've imported csv and utilized the writerow method (and quoting options).
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall

# name and location of output file
str_output_file = '../data/word_count.txt'
# the input file where all the phrases are located
list_file_data = '../data/phrases.txt'

if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))

with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()

# find all the words and put them into a list
list_all_words = findall('\w+', str_file_data)

# dictionary with all the times a word has been used
counter_word_count = Counter(list_all_words)

with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    for key, value in counter_word_count.iteritems():
        output_row = [key, value]
        writer.writerow(output_row)
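As mentioned in the first point above, if you do want to walk a directory tree instead of reading a single file, a minimal os.walk() sketch (assuming a hypothetical '../data' layout) would be:

import os

for dirpath, dirnames, filenames in os.walk('../data'):
    for filename in filenames:
        if filename.endswith('.txt'):
            print(os.path.join(dirpath, filename))  # or open and read each file here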
Something like this?

from collections import Counter
from glob import glob

def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()

tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())

for k in sorted(tally, key=tally.get, reverse=True):
    print k, tally[k]
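If the sum-of-Counters idiom looks dense, here is an equivalent explicit version (a sketch, assuming the same '../data/*.data' layout):

from collections import Counter
from glob import glob

tally = Counter()
for infile in glob('../data/*.data'):
    for line in open(infile):
        tally.update(line.strip().split())  # same tokenisation as extract_words_from_line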

Python functions and for loops

I'm new to Python programming and I do not seem to get the right behavior from a for loop.
I've got a list of ids, and I want to iterate over a ".gtf" file (tab-separated, multi-line) and extract from it some values corresponding to those ids.
It seems that the construction of the regex is not working correctly inside the findgtf function. From the second iteration onward, the "id" variable passed to the function is not used for the regex pattern of the "sc" variable, and subsequently the pattern matching doesn't work. Do I need to reinitialize the variables "id" and/or "sc" before each iteration?
If so, could you tell me how to achieve that?
Here is the code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys, os, re

# Usage: gtf_parser_4.py [path_to_dir] [IDlist]

####### FUNCTIONS ######################################
def findgtf(id, gtf):
    id = id.strip()  # remove \n
    #print "Received Id: *"+id+"* post-stripped"
    for line in gtf:
        seq, source, feat, start, end, score, strand, frame, attribute = line.strip().split("\t")
        sc = re.search(str(id), str(attribute))
        if sc:
            print "Coord of "+id+" -> Start: "+str(start)+" End: "+str(end)

########################### MAIN #########################
# Arguments retrieval
mydir = sys.argv[1]
#print "Directory : "+mydir
IDlist = sys.argv[2]
#print "IDlist : "+IDlist
path2ID = os.path.join(mydir, IDlist)
#print "Full IdList: "+path2ID

# lines to list
IDlines = [line.rstrip('\n') for line in open(path2ID)]

# Open and read dir
for file in os.listdir(mydir):
    if file.endswith(".gtf"):
        path2file = os.path.join(mydir, file)
        #print "Full gtf : "+path2file
        gtf = open(path2file, "r")
        for id in IDlines:
            print "ID submitted to findgtf: "+id
            fg = findgtf(id, gtf)
        gtf.close()
And here are the results retrieved from the console (submitted an IDlist with 3 ids: LX00_00030, gyrB, LX00_00065):
ID submitted to findgtf: LX00_00030
Coord of LX00_00030 -> Start: 4299 End: 5303
ID submitted to findgtf: gyrB
ID submitted to findgtf: LX00_00065
As you can see, the first ID worked correctly but the second and third do not yield any result (although they do if their order is switched in the IDlist).
Thanks in advance for your help.
Your code is not working because you are trying to repeatedly iterate over the same file object. A file keeps track of the position you've read to internally, so when you've read to the end, you can't read any more!
To make your code work, you need to seek back to the start of the file before iterating over it again.
for id in IDlines:
    print "ID submitted to findgtf: "+id
    gtf.seek(0)  # seek to the start of the file
    fg = findgtf(id, gtf)
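For illustration, a minimal sketch of this file-position behaviour (using a hypothetical file "example.txt"):

with open("example.txt") as f:
    first_pass = list(f)   # reads every line; the file position is now at end-of-file
    second_pass = list(f)  # empty list: nothing is left to read
    f.seek(0)              # rewind to the beginning
    third_pass = list(f)   # same contents as first_pass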

Filtering a CSV file in python

I have downloaded this csv file, which creates a spreadsheet of gene information. What is important is that the HLA-* columns contain gene information. If the gene is at too low a resolution, e.g. DQB1*03, then the row should be deleted. If the data is at too high a resolution, e.g. DQB1*03:02:01, then the :01 tag at the end needs to be removed. So, ideally I want the proteins to be in the format DQB1*03:02, with two levels of resolution after DQB1*. How can I tell Python to look for these formats, regardless of the data stored in them?
e.g.
if (csvCell is of format DQB1*03:02:01):
    delete the :01  # but do this in a general format
elif (csvCell is of format DQB1*03):
    delete row
else:
    goto next line
UPDATE: Edited version of the code I referenced
import csv
import re
import sys

csvdictreader = csv.DictReader(open('mhc.csv','r+b'), delimiter=',')
csvdictwriter = csv.DictWriter(file('mhc_fixed.csv','r+b'), fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-D')]
for rowfields in csvdictreader:
    keep = True
    for field in targets:
        value = rowfields[field]
        if re.match(r'^\w+\*\d\d$', value):
            keep = False
            break  # quit processing target fields
        elif re.match(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', value):
            rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
        else:  # reduce gene resolution if too high
            # by only keeping the first two alleles if three are present
            rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
    if keep:
        csvdictwriter.writerow(rowfields)
Here's something that I think will do what you want. It's not as simple as Peter's answer, because it uses Python's csv module to process the file. It could probably be rewritten and simplified to treat the file as plain text, as his does, but that should be easy.
import csv
import re
import sys

csvdictreader = csv.DictReader(sys.stdin, delimiter=',')
csvdictwriter = csv.DictWriter(sys.stdout, fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-')]
for rowfields in csvdictreader:
    keep = True
    for field in targets:
        value = rowfields[field]
        if re.match(r'^DQB1\*\d\d$', value):  # gene resolution too low?
            keep = False
            break  # quit processing target fields
        else:  # reduce gene resolution if too high
            # by only keeping the first two alleles if three are present
            rowfields[field] = re.sub(r'^DQB1\*(\d\d):(\d\d):(\d\d)$',
                                      r'DQB1*\1:\2', value)
    if keep:
        csvdictwriter.writerow(rowfields)
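Since the script reads from stdin and writes to stdout, you would run it as (assuming it is saved as filter.py):

python filter.py < mhc.csv > mhc_fixed.csv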
The hardest part for me was determining what you wanted to do.
Here's an ultra-simple filter:
import sys

for line in sys.stdin:
    line = line.replace( ',DQB1*03:02:01,', ',DQB1*03:02,' )
    if line.find( ',DQB1*03,' ) == -1:
        sys.stdout.write( line )

Or, if you want to use regular expressions:

import re
import sys

for line in sys.stdin:
    line = re.sub( ',DQB1\\*03:02:01,', ',DQB1*03:02,', line )
    if re.search( ',DQB1\\*03,', line ) == None:
        sys.stdout.write( line )
Run it as
python script.py < data.csv
