Following up on my previous question, because I couldn't get a satisfactory answer. I now have data like this, and I don't know exactly what kind of structure it is:
["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]
I'd like my final output to be written to a csv file like below. How can I achieve this?
A ,B ,C
a1,a2 ,b1 ,c1
a2,a4 ,b3 ,ct
Assuming that ["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"] is one long string, as the original post seems to imply, i.e.:
"""["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]"""
then the following code should work:
# ORIGINAL STRING
s = """["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]"""
# GET RID OF UNNECESSARY CHARACTERS FOR OUR CSV
s = s.replace("][", "--") # temporary chars to help split into lines later on
s = s.replace("[", "")
s = s.replace("]", "")
s = s.replace("'", "")
s = s.replace('"', "")
# NOTE: with the quotes removed, 'a1,a2' is written as two separate CSV fields
# SPLIT UP INTO A LIST OF LINES OF TEXT
lines = s.split("--")
# WRITE EACH LINE IN TURN TO A CSV FILE
with open("myFile.csv", mode="w") as textFile:
    # mode = w to overwrite any other contents of an existing file, or
    # create a new one.
    # mode = a to append to an existing file
    for line in lines:
        textFile.write(line + "\n")
An alternative way, again assuming that the data is encoded as one long string:
import ast
# ORIGINAL STRING
s = """["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]"""
# PARSE INTO A LIST OF LISTS WITH STRING ELEMENTS
s2 = s.replace("][", "],[")
s2 = ast.literal_eval(s2)
s2 = [ast.literal_eval(s2[x][0]) for x in range(len(s2))]
# WRITE EACH LIST AS A LINE IN THE CSV FILE
with open("myFile.csv", mode="w") as textFile:
    # mode = w to overwrite any other contents of an existing file, or
    # create a new one.
    # mode = a to append to an existing file
    for i in range(len(s2)):
        line = ",".join(s2[i])
        textFile.write(line + "\n")
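If a value such as 'a1,a2' is meant to stay in a single column, a plain ",".join() will not preserve that, because the embedded comma looks like a field separator. A minimal sketch of one way around this, assuming the same one-long-string input, is to let the standard csv module do the quoting. The helper name parse_rows is only illustrative, and an in-memory buffer stands in for the real file:

```python
import ast
import csv
import io

s = """["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]"""

def parse_rows(text):
    # Wrap the groups in an outer list so the whole string is one valid literal.
    outer = ast.literal_eval("[" + text.replace("][", "],[") + "]")
    # Each inner list holds one string such as "'a1,a2','b1','c1'".
    return [list(ast.literal_eval(item[0])) for item in outer]

rows = parse_rows(s)

# io.StringIO stands in for open("myFile.csv", "w", newline="").
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

With the default quoting, csv.writer wraps only the fields that contain a comma in double quotes, so a spreadsheet or csv.reader will read them back as one value.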
The given input is not a valid literal for any built-in data structure, so you first need to convert it into either a string or a list of lists. The code below assumes your input is a string. You can modify the formatting as per your requirement.
#!/usr/bin/python
from ast import literal_eval
def csv(li):
    file_handle = open("test.csv", "w")
    # stripping the outer quotes and splitting the list by commas
    for outer in li:
        temp = outer[0].strip("'")
        temp = temp.split("',")
        value = ""
        # building a formatted string (change this as per your requirement)
        for inner in temp:
            value += '{0: <10}'.format(inner.strip("'")) + '{0: >10}'.format(",")
        value = value.strip(", ")
        # writing the built string into the file
        file_handle.write(value + "\n")
    file_handle.close()
# assuming your input is a string
def main():
    li_str = """["'A','B','C'"]["'a1,a2','b1','c1'"]["'a2,a4','b3','ct'"]"""
    li = []
    start_pos, end_pos = 0, -1
    # break each bracketed group into a new list and append it to li
    while(start_pos != -1):
        start_pos = li_str.find("[", end_pos+1)
        if start_pos == -1:
            break
        end_pos = li_str.find("]", start_pos+1)
        li.append(literal_eval(li_str[start_pos:end_pos+1]))
    # li now contains a list of lists, i.e. the same as the input
    csv(li)
if __name__=="__main__":
    main()
I want to write a keyword-in-context script in which I first read a text file as an enumerated list and then return a given keyword and the five next words.
I saw that similar questions were asked for C# and I found solutions for the enum module in Python, but I hope there is a solution for just using the enumerate() function.
This is what I have got so far:
# Find keywords in context
import string
# open input txt file from local path
with open('C:\\Users\\somefile.txt', 'r', encoding='utf-8', errors='ignore') as f: # open file
    data1=f.read() # read content of file as string
    data2=data1.translate(str.maketrans('', '', string.punctuation)).lower() # remove punctuation
    data3=" ".join(data2.split()) # remove additional whitespace from text
    indata=list(data3.split()) # convert string to list
print(indata[:4])
searchterms=["text", "book", "history"]
def wordsafter(keyword, source):
    for i, val in enumerate(source):
        if val == keyword: # cannot access the enumeration value here
            return str(source[i+5]) # intend to show searchterm and subsequent five words
        else:
            continue
for s in searchterms: # iterate through searchterms
    print(s)
    wordsafter(s, indata)
print("done")
I was hoping I could simply access the value of the enumeration like I did here, but that does not seem to be the case.
With credit to @jasonharper, here is your improved code:
import string
# wordsafter() for the first instance only
def wordsafter(keyword, source):
    for i, val in enumerate(source):
        if val == keyword:
            return ' '.join(source[i:i + 6]) # searchterm plus the subsequent five words
# wordsafter() for all instances (this definition replaces the one above)
def wordsafter(keyword, source):
    instances = []
    for i, val in enumerate(source):
        if val == keyword:
            instances.append(' '.join(source[i:i + 6]))
    return instances
# open input txt file from local path
with open('README.md', 'r', encoding='utf-8', errors='ignore') as f: # open file
    data1 = f.read() # read content of file as string
    data2 = data1.translate(str.maketrans('', '', string.punctuation)).lower() # remove punctuation
    data3 = " ".join(data2.split()) # remove additional whitespace from text
    indata = list(data3.split()) # convert string to list
searchterms = ["this", "book", "history"]
for term in searchterms: # iterate through searchterms (avoid shadowing the string module)
    result = wordsafter(term, indata)
    if result:
        print(result)
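To see why slicing is the safer choice over indexing with source[i+5], here is a small sketch that runs on an in-memory word list instead of a file. The name words_after, the window parameter and the sample sentence are only illustrative assumptions:

```python
def words_after(keyword, tokens, window=5):
    """Return each occurrence of keyword followed by up to `window` words."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # A slice never raises IndexError; near the end of the list
            # it simply returns fewer words.
            hits.append(" ".join(tokens[i:i + window + 1]))
    return hits

tokens = "the history of the book is a long history".split()
print(words_after("history", tokens))
# ['history of the book is a', 'history']
```

The second hit sits at the very end of the list, where tokens[i + 5] would have raised an IndexError.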
I have a file that I am trying to do a word frequency list on, but I'm having trouble with the list and string aspects. I changed my file to a string to remove numbers from the file, but that ends up messing up the tokenization. The expected output is a word count of the file I am opening excluding numbers, but what I get is the following:
Counter({'<_io.TextIOWrapper': 1, "name='german/test/polarity/negative/neg_word_list.txt'": 1, "mode='r'": 1, "encoding='cp'>": 1})
done
Here's the code:
import re
from collections import Counter
def word_freq(file_tokens):
    global count
    for word in file_tokens:
        count = Counter(file_tokens)
    return count
f = open("german/test/polarity/negative/neg_word_list.txt")
clean = re.sub(r'[0-9]', '', str(f))
file_tokens = clean.split()
print(word_freq(file_tokens))
print("done")
f.close()
This ended up working, thanks to Rakesh:
import re
from collections import Counter
def word_freq(file_tokens):
    # Counter tallies the whole list in one call; no loop or global is needed
    return Counter(file_tokens)
with open("german/test/polarity/negative/neg_word_list.txt") as f:
    # f.read() returns the file's text; str(f) only describes the file object
    clean = re.sub(r'[0-9]', '', f.read())
file_tokens = clean.split()
print(word_freq(file_tokens))
print("done")
Reading further, I noticed that you didn't "read" the file; you only opened it.
If you print the object returned by open():
f = open("german/test/polarity/negative/neg_word_list.txt")
print(f)
You'll notice it will tell you what the object is, "io.TextIOWrapper". So you need to read it:
f_path = open("german/test/polarity/negative/neg_word_list.txt")
f = f_path.read()
f_path.close() # don't forget to do this to clear stuff
print(f)
# >>> what's really inside the file
Another way to do this, without needing close():
# adjust your encoding
with open("german/test/polarity/negative/neg_word_list.txt", encoding="utf-8") as r:
    f = r.read()
Read this way, the content is one plain string rather than a list. If you want a list of lines instead, iterate over the file:
list_of_lines = []
# adjust your encoding
with open("german/test/polarity/negative/neg_word_list.txt", encoding="utf-8") as r:
    # read each line and append to list
    for line in r:
        list_of_lines.append(line)
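Putting the pieces together, here is a rough sketch of the word count on text that has already been read. A short made-up string stands in for the contents of neg_word_list.txt, so this version of word_freq takes the text itself rather than a token list:

```python
import re
from collections import Counter

def word_freq(text):
    # Drop digits first, then split on whitespace and count the tokens.
    cleaned = re.sub(r"[0-9]", "", text)
    return Counter(cleaned.split())

# Hypothetical sample standing in for f.read()
sample = "gut 1 schlecht 22 gut\nschlecht gut"
print(word_freq(sample))
# Counter({'gut': 3, 'schlecht': 2})
```

Because the digits are removed before splitting, a token that consisted only of digits disappears entirely instead of leaving an empty string in the counts.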
def writeConfusionMatrix(self, outFile):
    print("Write a confusion matrix to outFile; elements in the matrix can be frequencies (you don't need to normalize)")
    output = []
    file = open(outFile, 'w+')
    matrix = defaultdict(lambda: defaultdict(int))
    for s in range(len(self.goldenTags)):
        for w in range(len(self.goldenTags[s])):
            matrix[self.goldenTags[s][w].tag][self.myTags[s][w].tag] += 1
    row_ids = sorted(matrix.keys())
    col_ids = sorted(set(k for v in matrix.values() for k in v.keys()))
    output.append(col_ids)
    for r in row_ids:
        output.append([r] + [matrix[r].get(c, 0) for c in col_ids])
    #matKeys = matrix.keys()
    #df = DataFrame(matrix).T.fillna(0)
    #output = '\n'.join(output)
    print(output)
    file.write(str(output))
This function creates a confusion matrix and writes it into a new file.
The output currently looks like this:
[image: current matrix]
It's a nested list with no spaces, but I want to make it look like this:
[image: new matrix]
by adding new lines in between the elements.
I've tried something like adding
output = '\n'.join(output)
before
file.write(str(output))
but it gave me this error:
sequence item 0: expected str instance, list found
Any idea?
Simply write your list out row by row:
for line in output:
    print(line) # or file.write(str(line) + '\n')
Or create a new string that uses the newline separator and then write it:
output = '\n'.join(str(line) for line in output)
print(output) # or file.write(output)
Try writing this to your file:
output = '\n'.join([','.join(map(str,item)) for item in output])
You were joining lists with '\n'.join() and you need to convert those to string as well.
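As a quick check of that approach, here is a small sketch with a made-up two-tag matrix. The tag names and counts are invented for illustration, and io.StringIO stands in for the real output file:

```python
import io

# Hypothetical nested list in the same shape as `output` in the question:
# a header row of column tags, then one row per gold tag.
output = [["NN", "VB"], ["NN", 5, 1], ["VB", 0, 7]]

def format_rows(rows):
    # Convert every cell to str, join cells with commas and rows with newlines.
    return "\n".join(",".join(map(str, row)) for row in rows)

buffer = io.StringIO()  # stands in for open(outFile, 'w')
buffer.write(format_rows(output) + "\n")
print(buffer.getvalue())
```

The inner map(str, ...) is what avoids the "expected str instance" error, since the counts are ints and the rows themselves are lists.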
I've been trying to extract both the species name and the sequence from a file as depicted below, in order to compile a dictionary with the key corresponding to the species name (FOXP2_MOUSE, for example) and the value corresponding to the amino acid sequence.
Sample fasta file:
>sp|P58463|FOXP2_MOUSE
MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
>sp|Q8MJ98|FOXP2_PONPY
MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
I've tried using my code below:
import re
InFileName = "foxp2.fasta"
InFile = open(InFileName, 'r')
Species = []
Sequence = []
reg = re.compile('FOXP2_\w+')
for Line in InFile:
    Species += reg.findall(Line)
print Species
reg = re.compile('(^\w+)')
for Line in InFile:
    Sequence += reg.findall(Line)
print Sequence
dictionary = dict(zip(Species, Sequence))
InFile.close()
However, my output for my lists are:
['FOXP2_MOUSE', 'FOXP2_PONPY']
[]
Why is my second list empty? Are you not allowed to use re.compile() twice? Any suggestions on how to circumvent my problem?
Thank you,
Christy
If you want to read a file twice, you have to seek back to the beginning.
InFile.seek(0)
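The second list is empty because the first loop has already consumed the file, not because re.compile() was used twice. A minimal sketch of that behaviour, using io.StringIO as a stand-in for the opened file and a made-up two-line record:

```python
import io

# Stand-in for open("foxp2.fasta"); the record itself is hypothetical.
handle = io.StringIO(">sp|X|FOXP2_TEST\nMMQES\n")

first_pass = [line for line in handle]
second_pass = [line for line in handle]   # nothing left to read
handle.seek(0)                            # rewind to the beginning
third_pass = [line for line in handle]

print(len(first_pass), len(second_pass), len(third_pass))
# 2 0 2
```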
You can do it in a single pass, and without regular expressions:
def load_fasta(filename):
    data = {}
    species = ""
    sequence = []
    with open(filename) as inf:
        for line in inf:
            line = line.strip()
            if line.startswith(";"): # is comment?
                # skip it
                pass
            elif line.startswith(">"): # start of new record?
                # save previous record (if any)
                if species and sequence:
                    data[species] = "".join(sequence)
                species = line.split("|")[2]
                sequence = []
            else: # continuation of previous record
                sequence.append(line)
    # end of file - finish storing last record
    if species and sequence:
        data[species] = "".join(sequence)
    return data
data = load_fasta("foxp2.fasta")
On your given file, this produces data ==
{
'FOXP2_PONPY': 'MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK',
'FOXP2_MOUSE': 'MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK'
}
You could also do this in a single pass with a multiline regex:
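import re
reg = re.compile(r'(FOXP2_\w+)\n(^[\w\n-]+)', re.MULTILINE)
with open("foxp2.fasta", 'r') as file:
    # the captured sequence still contains line breaks, so strip them out
    data = {name: seq.replace("\n", "") for name, seq in reg.findall(file.read())}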
The downside is that you have to read the whole file in at once. Whether this is a problem depends on likely file sizes.
I'm working on a Python script which opens a DBF file, and then from that creates a text output of the contents (either .txt or .csv).
I've now managed to get it writing the output file, but I need to replace the space character in one of the database fields (It's a UK Car registration number, e.g. I need "AB09 CDE" to output as "AB09CDE") but I've been unable to work out how to do this as it seems to be nested lists. The field is rec[7] in the code below.
if __name__ == '__main__':
    import sys, csv
    from cStringIO import StringIO
    from operator import itemgetter
    # Read a database
    filename = 'DATABASE.DBF'
    if len(sys.argv) == 2:
        filename = sys.argv[1]
    f = open(filename, 'rb')
    db = list(dbfreader(f))
    f.close()
    fieldnames, fieldspecs, records = db[0], db[1], db[2:]
    # Remove some fields that we don't want to use...
    del fieldnames[0:]
    del fieldspecs[0:]
    # Put the relevant data into the temporary table
    records = [rec[7:8] + rec[9:12] + rec[3:4] for rec in records]
    # Create outputfile
    output_file = 'OUTPUT.txt'
    f = open(output_file, 'wb')
    csv.writer(f).writerows(records)
This also adds a lot of spaces to the end of each outputted value. How would I get rid of these?
I'm fairly new to Python so any guidance would be gratefully received!
The problem is that you are using slicing:
>>> L = [1,2,3,4,5,6,7,8,9,10]
>>> L[7]
8
>>> L[7:8] #NOTE: it's a *list* of a single element!
[8]
To replace spaces in rec[7] do:
records = [[rec[7].replace(' ', '')] + rec[9:12] + rec[3:4] for rec in records]
Note the brackets around rec[7].replace(' ', ''): without them, rec[7].replace(' ', '') + rec[9:12] tries to add a str to a list and raises a TypeError.
The Python documentation says:
str.replace(old, new[, count]) Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
This example demonstrates it:
In [13]: a = "AB09 CDE"
In [14]: a.replace(" ", "")
Out[14]: 'AB09CDE'
In [15]: print a
AB09 CDE
Note that replace() returns a new string; a itself is unchanged.
So, if rec is a string field, then:
records = [rec.replace(" ", "") for rec in records]
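The question also asks about the spaces at the end of each value. One possible approach, sketched here under the assumption that every selected field is a string and that the padding comes from fixed-width fields, is to strip each field before writing. The records, the helper name clean_record and the field order are made up for illustration, with the registration placed first:

```python
import csv
import io

# Hypothetical records with space-padded values.
records = [
    ["AB09 CDE  ", "Ford      ", "Blue  "],
    ["XY61 ZZZ  ", "Vauxhall  ", "Red   "],
]

def clean_record(rec):
    # Remove every space from the registration (first field here) and
    # strip only leading/trailing whitespace from the remaining fields.
    return [rec[0].replace(" ", "")] + [field.strip() for field in rec[1:]]

cleaned = [clean_record(rec) for rec in records]

buffer = io.StringIO()  # stands in for the real output file
csv.writer(buffer, lineterminator="\n").writerows(cleaned)
print(buffer.getvalue())
```

Using strip() rather than replace() on the other fields keeps any spaces inside a value, such as a two-word make or colour.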