Filtering a CSV file in python - python

I have downloaded this csv file, which creates a spreadsheet of gene information. What is important is that in the HLA-* columns, there is gene information. If the gene is too low of a resolution e.g. DQB1*03 then the row should be deleted. If the data is too high resoltuion e.g. DQB1*03:02:01, then the :01 tag at the end needs to be removed. So, ideally I want to proteins to be in the format DQB1*03:02, so that it has two levels of resolution after DQB1*. How can I tell python to look for these formats, and ignore the data stored in them.
e.g.
if (csvCell is of format DQB1*03:02:01):
delete the :01 # but do this in a general format
elif (csvCell is of format DQB1*03):
delete row
else:
goto next line
UPDATE: Edited code I referenced
import csv
import re
import sys
csvdictreader = csv.DictReader(open('mhc.csv','r+b'), delimiter=',')
csvdictwriter = csv.DictWriter(file('mhc_fixed.csv','r+b'), fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-D')]
for rowfields in csvdictreader:
keep = True
for field in targets:
value = rowfields[field]
if re.match(r'^\w+\*\d\d$', value):
keep = False
break # quit processing target fields
elif re.match(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', value):
rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$',r'\1*\2:\3', value)
else: # reduce gene resolution if too high
# by only keeping first two alles if three are present
rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+)$',r'\1*\2:\3', value)
if keep:
csvdictwriter.writerow(rowfields)

Here's something that I think will do what you want. It's not as simple as Peter's answer because it uses Python's csv module to process the file. It could probably be rewritten and simplified to just treat the file as a plain text as his does, but that should be easy.
import csv
import re
import sys
csvdictreader = csv.DictReader(sys.stdin, delimiter=',')
csvdictwriter = csv.DictWriter(sys.stdout, fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-')]
for rowfields in csvdictreader:
keep = True
for field in targets:
value = rowfields[field]
if re.match(r'^DQB1\*\d\d$', value): # gene resolution too low?
keep = False
break # quit processing target fields
else: # reduce gene resolution if too high
# by only keeping first two alles if three are present
rowfields[field] = re.sub(r'^DQB1\*(\d\d):(\d\d):(\d\d)$',
r'DQB1*\1:\2', value)
if keep:
csvdictwriter.writerow(rowfields)
The hardest part for me was determining what you wanted to do.

Here's an ultra-simple filter:
import sys
for line in sys.stdin:
line = line.replace( ',DQB1*03:02:01,', ',DQB1*03:02,' )
if line.find( ',DQB1*03,' ) == -1:
sys.stdout.write( line )
Or, if you want to use regular expressions
import re
import sys
for line in sys.stdin:
line = re.sub( ',DQB1\\*03:02:01,', ',DQB1*03:02,', line )
if re.search( ',DQB1\\*03,', line ) == None:
sys.stdout.write( line )
Run it as
python script.py < data.csv

Related

multiple modification to a list at once

I have a text file of some ip's and Mac's. The format of the Mac's are xxxx.xxxx.xxxx, I need to change all the MAC's to xx:xx:xx:xx:xx:xx
I am already reading the file and putting it into a list. Now I am looping through each line of the list and I need to make multiple modification. I need to remove the IP's and then change the MAC format.
The problem I am running into is that I cant seem to figure out how to do this in one shot unless I copy the list to a newlist for every modification.
How can I loop through the list once, and update each element on the list with all my modification?
count = 0
output3 = []
for line in output:
#print(line)
#removes any extra spaces between words in a string.
output[count] = (str(" ".join(line.split())))
#create a new list with just the MAC addresses
output3.append(str(output[count].split(" ")[3]))
#create a new list with MAC's using a ":"
count += 1
print(output3)
It appears you are trying to overthink the problem, so that may be where your frustration is spinning you around a bit.
First, you should always consider if you need a count variable in python. Usually you do not, and the enumerate() function is your friend here.
Second, there is no need to process data multiple times in python. You can use variables to your advantage and leverage python's expressiveness, rather than trying to hide your problem from the language.
PSA an implementation example that may help you think through your approach. Good luck on solving your harder problems, and I hope python will help you out with them!
#! /usr/bin/env python3
import re
from typing import Iterable
# non-regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: this assumes a source with '.' separators only
# reformat_mac = lambda _: ':'.join(_ for _ in _.split('.') for _ in (_[:2], _[2:]))
# regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: Only requires at least two hex digits adjacent at a time
reformat_mac = lambda _: ":".join(re.findall(r"(?i)[\da-f]{2}", _))
def generate_output3(output: Iterable[str]) -> Iterable[str]:
for line in output:
col1, col2, col3, mac, *cols = line.split()
mac = reformat_mac(mac)
yield " ".join((col1, col2, col3, mac, *cols))
if __name__ == "__main__":
output = [
"abc def ghi 1122.3344.5566",
"jklmn op qrst 11a2.33c4.55f6 uv wx yz",
"zyxwu 123 next 11a2.33c4.55f6 uv wx yz",
]
for line in generate_output3(output):
print(line)
Solution
You can use the regex (regular expression) module to extract any pattern that matches that of the
mac-ids: "xxxx:xxxx:xxxx" and then process it to produce the expected output ("xx-xx-xx-xx-xx-xx")
as shown below.
Note: I have used a dummy data file (see section: Dummy Data below) to make this answer
reproducible. It should work with your data as well.
# import re
filepath = "input.txt"
content = read_file(filepath)
mac_ids = extract_mac_ids(content, format=True) # format=False --> "xxxx:xxxx:xxxx"
print(mac_ids)
## OUTPUT:
#
# ['a0-b1-ff-33-ac-d5',
# '11-b9-33-df-55-f6',
# 'a4-d1-e7-33-ff-55',
# '66-a1-b2-f3-b9-c5']
Code: Convenience Functions
How does the regex work? see this example
def read_file(filepath: str):
"""Reads and returns the content of a file."""
with open(filepath, "r") as f:
content = f.read() # read in one attemp
return content
def format_mac_id(mac_id: str):
"""Returns a formatted mac_id.
INPUT FORMAT: "xxxxxxxxxxxx"
OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
"""
mac_id = list(mac_id)
mac_id = ''.join([ f"-{v}" if (i % 2 == 0) else v for i, v in enumerate(mac_id)])[1:]
return mac_id
def extract_mac_ids(content: str, format: bool=True):
"""Extracts and returns a list of formatted mac_ids after.
INPUT FORMAT: "xxxx:xxxx:xxxx"
OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
"""
import re
# pattern = "(" + ':'.join([r"\w{4}"]*3) + "|" + ':'.join([r"\w{2}"]*6) + ")"
# pattern = r"(\w{4}:\w{4}:\w{4}|\w{2}:\w{2}:\w{2}:\w{2}:\w{2}:\w{2})"
pattern = r"(\w{4}:\w{4}:\w{4})"
pat = re.compile(pattern)
mac_ids = pat.findall(content) # returns a list of all mac-ids
# Replaces the ":" with "" and then formats
# each mac-id as: "xx-xx-xx-xx-xx-xx"
if format:
mac_ids = [format_mac_id(mac_id.replace(":", "")) for mac_id in mac_ids]
return mac_ids
Dummy Data
The following code block creates a dummy file with some sample mac-ids.
filepath = "input.txt"
s = """
a0b1:ff33:acd5 ghwvauguvwi ybvakvi
klasilvavh; 11b9:33df:55f6
haliviv
a4d1:e733:ff55
66a1:b2f3:b9c5
"""
# Create dummy data file
with open(filepath, "w") as f:
f.write(s)

Adding new colums in a csv file and values from differents dictionaries-comprehension

This is my code below and I would like to write new column in my original csv , the columns are supposed to contain the values of each dictionary created during my code and I would like for the last dictionary since it contains 3 values , that each values is inserted in a single column. The code to write in the csv is at the end but maybe there is a way to write the values at each time i am producing a new dictionary.
My code for the csv route : I cannot figure it out how to add without deleting the content of the original file
# -*- coding: UTF-8 -*-
# -*- coding: UTF-8 -*-
import codecs
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
import treetaggerwrapper
from treetaggerwrapper import TreeTagger, make_tags
print("import TreeTagger OK")
except:
print("Import TreeTagger pas Ok")
from itertools import islice
from collections import defaultdict
#export le lexique de sentiments
pickle_in = open("dict_pickle", "rb")
dico_lexique = pickle.load(pickle_in)
# extraction colonne verbatim
d_verbatim = {}
with open(sys.argv[1], 'r', encoding='cp1252') as csv_file:
csv_file.readline()
for line in csv_file:
token = line.split(';')
try:
d_verbatim[token[0]] = token[1]
except:
print(line)
#print(d_verbatim)
#Using treetagger
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
d_tag = {}
for key, val in d_verbatim.items():
newvalues = tagger.tag_text(val)
d_tag[key] = newvalues
#print(d_tag)
#lemmatisation
d_lemma = defaultdict(list)
for k, v in d_tag.items():
for p in v:
parts = p.split('\t')
try:
if parts[2] == '':
d_lemma[k].append(parts[0])
else:
d_lemma[k].append(parts[2])
except:
print(parts)
#print(d_lemma)
stopWords = set(stopwords.words('french'))
d_filtered_words = {k: [w for w in l if w not in stopWords and w.isalpha()] for k, l in d_lemma.items()}
print(d_filtered_words)
d_score = {k: [0, 0, 0] for k in d_filtered_words.keys()}
for k, v in d_filtered_words.items():
for word in v:
if word in dico_lexique:
if word
print(word, dico_lexique[word])
your edit seemed to make things worse, you've ended up deleting a lot of relevant context. I think I've pieced together what you are trying to do. the core of it seems to be a routine that is performing sentiment analysis on text.
I'd start by creating a class that keeps track of this, e.g:
class Sentiment:
__slots__ = ('positive', 'neutral', 'negative')
def __init__(self, positive=0, neutral=0, negative=0):
self.positive = positive
self.neutral = neutral
self.negative = negative
def __repr__(self):
return f'<Sentiment {self.positive} {self.neutral} {self.negative}>'
def __add__(self, other):
return Sentiment(
self.positive + other.positive,
self.neutral + other.neutral,
self.negative + other.negative,
)
which will allow you to replace your convoluted bits of code like [a + b for a, b in zip(map(int, dico_lexique[word]), d_score[k])] with score += sentiment in the function below, and allows us to refer to the various values by name
I'd then suggest preprocessing your pickled data, so you don't have to convert things to ints in the middle of unrelated code, e.g:
with open("dict_pickle", "rb") as fd:
dico_lexique = {}
for word, (pos, neu, neg) in pickle.load(fd):
dico_lexique[word] = Sentiment(int(pos), int(neu), int(neg))
this puts them directly into the above class and seems to match up with other constraints in your code. but I don't have your data, so can't check.
after pulling apart all your comprehensions and loops, we are left with a single nice routine for processing a single piece of text:
def process_text(text):
"""process the specified text
returns (words, filtered words, total sentiment score)
"""
words = []
filtered = []
score = Sentiment()
for tag in make_tags(tagger.tag_text(text)):
word = tag.lemma
words.append(word)
if word not in stopWords and lemma.isalpha():
filtered.append(word)
sentiment = dico_lexique.get(word)
if sentiment is not None:
score += sentiment
return words, filtered, score
and we can put this into a loop that reads lines from the input and sends them to an output file:
filename = sys.argv[1]
tempname = filename + '~'
with open(filename) as fdin, open(tempname, 'w') as fdout:
inp = csv.reader(fdin, delimiter=';')
out = csv.writer(fdout, delimiter=';')
# get the header, and blindly append out column names
header = next(inp)
out.writerow(header + [
'd_lemma', 'd_filtered_words', 'Positive Score', 'Neutral Score', 'Negative Score',
])
for row in inp:
# assume that second item contains the text we want to process
words, filtered, score = process_text(row[1])
extra_values = [
words, filtered,
score.positive, score.neutal, score.negative,
]
# add the values and write out
assert len(row) == len(header), "code needed to pad the columns out"
out.writerow(row + extra_values)
# only replace if everything succeeds
os.rename(tempname, filename)
we write out to a different file and only rename on success, this means that if the code crashes it won't leave partially written files around. I'd discourage working like though, and tend to make my scripts read from stdin and write to stdout. that way I can run as:
$ python script.py < input.csv > output.csv
when all is OK, but also lets me run as:
$ head input.csv | python script.py
if I just want to test with the first few lines of input, or:
$ python script.py < input.csv | less
if I want to checkout output as it's generated
note that none of this code has been run, so there are probably bugs in it, but I can actually see what the code is trying to do like this. comprehensions and 'functional' style code is great, but it can easily get unreadable if you're not careful

Categorising data according to one column in python

Hi I have a dataset as follows eg:
sample pos mutation
2fec2 40 TC
1f3c 40 TC
19b0 40 TC
tld3 60 CG
I want to be able to find a python way to for example find every instance where 2fec2 and 1f3c have the same mutation and print the code. So far I have tried the following but it simply returns everything. I am completely new to python and trying to ween myself off awk - please help!
from sys import argv
script, vcf_file = argv
import vcf
vcf_reader = vcf.Reader(open(vcf_file, 'r'))
for record.affected_start in vcf_reader: #.affect_start is this modules way of calling data from the parsed pos column from a particular type of bioinformatics file
if record.sample == 2fec2 and 1f3c != 19b0 !=t1d3: #ditto .sample
print record.affected_start
I'm assuming your data is in the format you describe and not VCF.
You can try to simply parse the file with standard python techniques and for each (pos, mutation) pair, build the set of samples having it as follows:
from sys import argv
from collections import defaultdict
# More convenient than a normal dict: an empty set will be
# automatically created whenever a new key is accessed
# keys will be (pos, mutation) pairs
# values will be sets of sample names
mutation_dict = defaultdict(set)
# This "with" syntax ("context manager") is recommended
# because file closing will be handled automatically
with open(argv[1], "r") as data_file:
# Read first line and check headers
# (assert <something False>, "message"
# will make the program exit and display "message")
assert data_file.readline().strip().split() == ["sample", "pos", "mutation"], "Unexpected column names"
# .strip() removes end-of-line character
# .split() splits into a list of words
# (by default using "blanks" as separator)
# .readline() has "consumed" a first line.
# Now we can loop over the rest of the lines
# that should contain the data
for line in data_file:
# Extract the fields
[sample, pos, mutation] = line.strip().split()
# add the sample to the set of samples having
# this (pos, mutation) combination
mutation_dict[(pos, mutation)].add(sample)
# Now loop over the key, value pairs in our dict:
for (pos, mutation), samples in mutation_dict.items():
# True if set intersection (&) is not empty
if samples & {"2fec2", "1f3c"}:
print("2fec2 and 1f3c share mutation %s at position %s" % (mutation, pos))
With your example data as first argument of the script, this outputs:
2fec2 and 1f3c share mutation TC at position 40
How about this
from sys import argv
script, vcf_file = argv
import vcf
vcf_reader = vcf.Reader(open(vcf_file, 'r'))
# Store our results outside of the loop
fecResult = ""
f3cResult = ""
# For each record
for record.affected_start in vcf_reader:
if record.sample == "2fec2":
fecResult = record.mutation
if record.sample == "1f3c":
f3cResult = record.mutation
# Outside of the loop compare the results and if they match print the record.
if fecResult == f3cResult:
print record.affected_start

Can't get unique word/phrase counter to work - Python

I'm having trouble getting anything to write in my outut file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document, and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter
# path to folder containg all the files
str_dir_folder = '../data'
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
if str_each_file.endswith('data'):
# open file and read
with open(str_dir_folder+str_each_file,'r') as file_r_data:
str_file_data = file_r_data.read()
# add data to list
list_file_data.append(str_file_data)
# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)
# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)
# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)
# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
str_word = str_each_item
int_freq = dict_word_count[str_each_item]
str_out_line = '"%s",%d' % (str_word,int_freq)
# populates output list
list_output_data.append(str_out_line)
# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list.
thanks very much.
Would be helpful if we got more information such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:
list_file_data is assigned to '../data/phrases.txt' but there is then a
loop through all file in a directory. Since you don't have any handling for
multiple files elsewhere, I've removed that logic and referenced the
file listed in list_file_data (and added a small bit of error
handling). If you do want to walk through a directory, I'd suggest
using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm)
You named your file 'pharses.txt' but then check for if the files
that endswith 'data'. I've removed this logic.
You've placed the data set into a list when findall works just fine with strings and ignores special characters that you've manually removed. Test here:
https://regex101.com/ to make sure.
Changed 'w+' to '\w+' - check out the above link
Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object which has an 'iteritems' method to roll through each key and value. Also changed the variable name to 'counter_word_count' to be slightly more accurate.
Instead of manually generating csv's, I've imported csv and utilized the writerow method (and quoting options)
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall,sub
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
if not os.path.exists(list_file_data):
raise OSError('File {} does not exist.'.format(list_file_data))
with open(list_file_data, 'r') as file_r_data:
str_file_data = file_r_data.read()
# find all the words and put them into a list
list_all_words = findall('\w+',str_file_data)
# dictionary with all the times a word has been used
counter_word_count = Counter(list_all_words)
with open(str_output_file, 'w') as output_file:
fieldnames = ['word', 'freq']
writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
writer.writerow(fieldnames)
for key, value in counter_word_count.iteritems():
output_row = [key, value]
writer.writerow(output_row)
Something like this?
from collections import Counter
from glob import glob
def extract_words_from_line(s):
# make this as complicated as you want for extracting words from a line
return s.strip().split()
tally = sum(
(Counter(extract_words_from_line(line))
for infile in glob('../data/*.data')
for line in open(infile)),
Counter())
for k in sorted(tally, key=tally.get, reverse=True):
print k, tally[k]

UPDATE: Calculate vector length according to str value in specific column in Python

I am trying to measure the length of vectors based on a value of the first column of my input data.
For instance: my input data is as follows:
dog nmod+n+-n 4
dog nmod+n+n-a-commitment-n 6
child into+ns-j+vn-pass-rb-divide-v 3
child nmod+n+ns-commitment-n 5
child nmod+n+n-pledge-n 3
hello nmod+n+ns 2
The value that I want to calculate is based on an identical value in the first column. For instance, I would calculate a value based on all rows in which dog is in the first column, then I would calculate a value based on all rows in which child is in the first column... and so on.
I have worked out the mathematics to calculate the vector length (Euc. norm). However, I am unsure how to base the calculation based on grouping the identical values in the first column.
So far, this is the code that I have written:
#!/usr/bin/python
import os
import sys
import getopt
import datetime
import math
print "starting:",
print datetime.datetime.now()
def countVectorLength(infile, outfile):
with open(infile, 'rb') as inputfile:
flem, _, fw = next(inputfile).split()
current_lem = flem
weights = [float(fw)]
for line in inputfile:
lem, _, w = line.split()
if lem == current_lem:
weights.append(float(w))
else:
print current_lem,
print math.sqrt(sum([math.pow(weight,2) for weight in weights]))
current_lem = lem
weights = [float(w)]
print current_lem,
print math.sqrt(sum([math.pow(weight,2) for weight in weights]))
print "Finish:",
print datetime.datetime.now()
path = '/Path/to/Input/'
pathout = '/Path/to/Output'
listing = os.listdir(path)
for infile in listing:
outfile = 'output' + infile
print "current file is:" + infile
countVectorLength(path + infile, pathout + outfile)
This code outputs the length of vector of each individual lemma. The above data gives me the following output:
dog 7.211102550927978
child 6.48074069840786
hello 2
UPDATE
I have been working on it and I have managed to get the following working code, as updated in the code sample above. However, as you will be able to see. The code has a problem with the output of the very last line of each file --- which I have solved rather rudimentarily by manually adding it. However, because of this problem, it does not permit a clean iteration through the directory -- outputting all of the results of all files in an appended > document. Is there a way to make this a bit cleaner, pythonic way to output directly each individual corresponding file in the outpath directory?
First thing, you need to transform the input into something like
dog => [4,2]
child => [3,5,3]
etc
It goes like this:
from collections import defaultdict
data = defaultdict(list)
for line in file:
line = line.split('\t')
data[line[0]].append(line[2])
Once this is done, the rest is obvious:
def vector_len(vec):
you already got that
vector_lens = {name: vector_len(values) for name, values in data.items()}

Categories