I have this code to check whether indexes in a file match, but to start off I am having trouble selecting an index at all. What do I have to do to be able to select one? At the moment the values don't show up as being in a list.
def checkOS():
    fid = open("C:/Python/NSRLOS.txt", 'r')
    fhand = open("C:/Python/sha_sub_hashes.out", 'r')
    sLine = fhand.readline()
    line = fid.readline()
    outdata = []
    print line
checkOS()
Right now it prints:
"190","Windows 2000","2000","609"
I only want it to print: (so index[0])
190
And when I try index[0], I just get ' " ', the first character of the whole string. I want a list so I can select items by index.
Try using line.split(",") to split the line by the commas, then strip out the quotation marks by slicing the result.
Example:
>>> line = '"190","Windows 2000","2000","609"'
>>> sliced = line.split(',')
>>> print sliced
['"190"', '"Windows 2000"', '"2000"', '"609"']
>>> first_item = sliced[0][1:-1]
>>> print first_item
190
...and here's the whole thing, abstracted into a function:
def get_item(line, index):
    return line.split(',')[index][1:-1]
(This assumes, of course, that all the items in the line are separated by commas, that they're all wrapped in quotation marks, and that there are no spaces after the commas (although you could handle those by calling item.strip() to remove whitespace). It also fails if a quoted item contains commas, as noted in the comments.)
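For completeness: when a quoted item can itself contain commas, slicing after split(',') breaks, but the csv module parses the quoting correctly. A minimal Python 3 sketch with a made-up line:

```python
import csv
import io

# Hypothetical line where the second field contains an embedded comma
line = '"190","Windows 2000, SP4","2000","609"'

# csv.reader honors the quotes, so the embedded comma does not split the field
row = next(csv.reader(io.StringIO(line)))
print(row)     # ['190', 'Windows 2000, SP4', '2000', '609']
print(row[0])  # 190
```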
And if you try using split() to split on each comma and return the first value? Try this.
[0] applied to a string only returns the first character.
You want the first item of a comma-separated list. You could write your own parsing code, or you could use the csv module which already handles this.
import csv

def get_first_row(fname):
    with open(fname, 'rb') as inf:
        incsv = csv.reader(inf)
        try:
            row = incsv.next()
        except StopIteration:
            row = [None]
    return row

def checkOS():
    fid = get_first_row("C:/Python/NSRLOS.txt")[0]
    fhand = get_first_row("C:/Python/sha_sub_hashes.out")[0]
    print fid
csv.reader would be a good start.
import csv
from itertools import izip

with open('file1.csv') as fid, open('file2.csv') as fhand:
    fidcsv = csv.reader(fid)
    fhandcsv = csv.reader(fhand)
    for row1, row2 in izip(fidcsv, fhandcsv):
        print row1, row2, row1[1]  # etc...
Using csv.reader will handle CSV formatted files better than plain str methods. izip reads line 1 from both files, then line 2 from both, and so on (it stops at the end of the shorter file, though). row1 and row2 each end up as a list of columns, so you can just index into them: if row1[0] == row2[0]:, or whatever logic you wish to use.
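To make the pairing concrete, here is a small Python 3 sketch (there, izip is simply the built-in zip) over two hypothetical in-memory files:

```python
import csv
import io

# Hypothetical file contents standing in for the two open files
file1 = io.StringIO('"190","Windows 2000"\n"200","Windows XP"\n')
file2 = io.StringIO('"190","Windows 2000"\n"201","Windows 7"\n')

# zip pairs row 1 with row 1, row 2 with row 2, stopping at the shorter file
matches = [row1[0] == row2[0]
           for row1, row2 in zip(csv.reader(file1), csv.reader(file2))]
print(matches)  # [True, False]
```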
Any idea how I should get the largest age from the text file and print it?
The text file:
Name, Address, Age,Hobby
Abu, “18, Jalan Satu, Penang”, 18, “Badminton, Swimming”
Choo, “Vista Gambier, 10-3A-88, Changkat Bukit Gambier Dua, 11700, Penang”, 17, Dancing
Mutu, Kolej Abdul Rahman, 20, “Shopping, Investing, Youtube-ing”
This is my coding:
with open("iv.txt", encoding="utf8") as file:
    data = file.read()
    splitdata = data.split('\n')
I am not getting what I want from this.
This works! I hope it helps. Let me know if there are any questions.
This approach essentially assumes that values associated with Hobby do not have numbers in them.
import csv

max_age = 0
with open("iv.txt", newline='', encoding="utf8") as f:
    # spamreader returns reader object used to iterate over lines of f
    # delimiter=',' is the default but I like to be explicit
    spamreader = csv.reader(f, delimiter=',')
    # skip first row
    next(spamreader)
    # each row read from file is returned as a list of strings
    for row in spamreader:
        # reversed() returns reverse iterator (start from end of list of str)
        for i in reversed(row):
            try:
                i = int(i)
                break
            # ValueError raised when string i is not an int
            except ValueError:
                pass
        print(i)
        if i > max_age:
            max_age = i

print(f"\nMax age from file: {max_age}")
Output:
18
17
20
Max age from file: 20
spamreader from the csv module of Python's Standard Library returns a reader object used to iterate over lines of f. Each row (i.e. line) read from the file f is returned as a list of strings.
The delimiter (in our case, ',', which is also the default) determines how a raw line from the file is broken up into mutually exclusive but exhaustive parts -- these parts become the elements of the list that is associated with a given line.
Given a raw line, the string associated with the start of the line to the first comma is an element, then the string associated with any part of the line that is enclosed by two commas is also an element, and finally the string associated with the last comma to the end of the line is also an element.
For each line/list of the file, we start iterating from the end of the list, using the reversed built-in function, because we know that age is the second-to-last category. We assume that the hobby category does not contain numbers that would appear as separate elements of the list for the raw line. For example, for the line associated with Abu, if instead of "Badminton, Swimming" we had "Badminton, 30, Swimming", the code would not have the desired effect, as 30 would be treated as Abu's age.
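The scan-from-the-end logic can be isolated into a small helper; a sketch over one hypothetical pre-split row (note that csv.reader does not treat the curly quotes as quote characters, so the address really is split across several elements):

```python
def last_int(row):
    """Return the rightmost element of row that parses as an int, else None."""
    for item in reversed(row):
        try:
            return int(item)   # int() tolerates surrounding whitespace
        except ValueError:
            pass
    return None

# Abu's line as csv.reader would split it (curly quotes are ordinary characters)
row = ['Abu', '“18', ' Jalan Satu', ' Penang”', ' 18', ' “Badminton', ' Swimming”']
print(last_int(row))  # 18
```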
I'm sure there is a built-in feature to parse a composite string like the one you posted, but since I don't know it, I've created a CustomParser class to do the job:
class CustomParser():
    def __init__(self, line: str, delimiter: str):
        self.line = line
        self.delimiter = delimiter

    def split(self):
        word = ''
        words = []
        inside_string = False
        for letter in self.line:
            if letter in '“”"':
                inside_string = not inside_string
                continue
            if letter == self.delimiter and not inside_string:
                words.append(word.strip())
                word = ''
                continue
            word += letter
        words.append(word.strip())
        return words

with open('people_data.csv') as file:
    ages = []
    for line in file:
        ages.append(CustomParser(line, ',').split()[2])

print(max(ages[1:], key=int))  # skip the header row and compare numerically
Hope that helps.
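The split logic above, reduced to a standalone function for a quick check against one of the sample lines (same algorithm as CustomParser.split):

```python
def split_quoted(line, delimiter=','):
    # Split on the delimiter, but not while inside straight or curly quotes
    word, words, inside_string = '', [], False
    for letter in line:
        if letter in '“”"':
            inside_string = not inside_string
            continue
        if letter == delimiter and not inside_string:
            words.append(word.strip())
            word = ''
            continue
        word += letter
    words.append(word.strip())
    return words

line = 'Abu, “18, Jalan Satu, Penang”, 18, “Badminton, Swimming”'
print(split_quoted(line))
# ['Abu', '18, Jalan Satu, Penang', '18', 'Badminton, Swimming']
```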
Python noob here. I've been smashing my head trying to do this, tried several Unix tools and I'm convinced that python is the way to go.
I have two files, File1 has headers and numbers like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that id number is unique, but the number assigned to it may repeat. I have several files like this all with the same number of headers (~500).
File2 has all the numbers of File1, each with an appended sequence:
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that each sequence id is unique, as are all the sequences after them. I have the same number of these files as of File1, but the number of sequences within each File2 may vary (~150).
My desired output is File1 with the sequences from File2; it is important that File1 maintains its original order.
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract numbers from File1 and use them as a pattern to match in File2. First I am trying to make this work with only a pair of files. here is what I achieved:
#!/usr/bin/env python
import re

datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

with open(datafile, 'r') as f:
    datafile_lines = set([line.strip() for line in f])  # maybe I could use regex to get only lines with number as pattern?

print(datafile_lines)

outputlist = []
with open(schemaseqs, 'r') as f:
    for line in f:
        seqs = line.split(',')[0]
        if seqs[1:-1] in datafile_lines:
            outputlist.append(line)

print(outputlist)
This outputs a mix of patterns from File1 and the sequences from File2. Any help is appreciated.
Ps: I am open to modifications in the files' structure; I tried substituting \n in File2 with "," to no avail.
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()] = 0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i += 1

new_d = {}
with open(schemaseqs, 'r') as f:
    i = 0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()] = 0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i += 1

for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]
print(d)

with open(datafile, 'w') as filee:
    for k, v in d.items():
        filee.writelines(k)
        filee.writelines('\n')
        filee.writelines(v)
        filee.writelines('\n')
Creating two dictionaries makes this easy: build one from each file, then map the values of the first dictionary through the second.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order. If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
id_to_gene_map = {}

with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 1:  # Update ID on odd numbered lines, including line 1
            current_id = line
        else:
            id_to_gene_map[current_id] = line  # Map previous line's ID to this line's value

with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"):  # Keep ">id1" lines unchanged
            line = id_to_gene_map[line]  # Otherwise, replace with the corresponding gene
        output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.
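The same two-pass idea shown on tiny in-memory data (hypothetical IDs and sequences), using stripped lines instead of raw ones:

```python
# Pass 1: pair up alternating lines of File2 (ID line, then sequence line)
file2_lines = ['77', 'ATCGTCATA', '2', 'ATCGTCGTA']
id_to_gene = dict(zip(file2_lines[0::2], file2_lines[1::2]))

# Pass 2: walk File1 in order, replacing number lines, keeping ">" headers
file1_lines = ['>id1', '77', '>id2', '2']
output = [line if line.startswith('>') else id_to_gene[line]
          for line in file1_lines]
print(output)  # ['>id1', 'ATCGTCATA', '>id2', 'ATCGTCGTA']
```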
I need to load text from a file containing several lines, each with letters separated by commas, into a 2-dimensional list. When I run this, I get a 2-dimensional list, but the nested lists contain single strings instead of separate values, and I cannot iterate over them. How do I solve this?
def read_matrix_file(filename):
    matrix = []
    with open(filename, 'r') as matrix_letters:
        for line in matrix_letters:
            line = line.split()
            matrix.append(line)
    return matrix
result:
[['a,p,p,l,e'], ['a,g,o,d,o'], ['n,n,e,r,t'], ['g,a,T,A,C'], ['m,i,c,s,r'], ['P,o,P,o,P']]
I need each letter in the nested lists to be a single string so I can use them.
thanks in advance
The split() function splits on whitespace by default. You can fix this by passing it the string you want to split on; in this case, that's a comma. The code below should work.
def read_matrix_file(filename):
    matrix = []
    with open(filename, 'r') as matrix_letters:
        for line in matrix_letters:
            line = line.split(',')
            matrix.append(line)
    return matrix
The input format you described conforms to CSV format. Python has a library just for reading CSV files. If you just want to get the job done, you can use this library to do the work for you. Here's an example:
Input(test.csv):
a,string,here
more,strings,here
Code:
>>> import csv
>>> lines = []
>>> with open('test.csv') as file:
...     reader = csv.reader(file)
...     for row in reader:
...         lines.append(row)
...
>>>
Output:
>>> lines
[['a', 'string', 'here'], ['more', 'strings', 'here']]
Using the strip() function will get rid of the new line character as well:
def read_matrix_file(filename):
matrix = []
with open(filename, 'r') as matrix_letters:
for line in matrix_letters:
line = line.split(',')
line[-1] = line[-1].strip()
matrix.append(line)
return matrix
I'm working on a script to remove bad characters from a csv file then to be stored in a list.
The script runs fine but doesn't remove the bad characters, so I'm a bit puzzled. Any pointers or help on why it's not working is appreciated.
import csv

def remove_bad(item):
    item = item.replace("%", "")
    item = item.replace("test", "")
    return item

raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()),
                    row[1].strip().title()))
print raw
If I have a csv-file with one line:
tst%,testT
Then your script, slightly modified, should indeed filter the "bad" characters. I changed it to pass both items separately to remove_bad (because you mentioned you had to "remove bad characters from a csv", not only the first row):
import csv

def remove_bad(item):
    item = item.replace("%", "")
    item = item.replace("test", "")
    return item

raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()), remove_bad(row[1].strip()).title()))
print raw
Also, I put title() after the function call (else, "test" wouldn't get filtered out).
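The order matters because title() capitalizes the first letter, after which the lowercase substring "test" no longer matches. A quick check (remove_bad as defined above):

```python
def remove_bad(item):
    item = item.replace("%", "")
    item = item.replace("test", "")
    return item

# title() first: "testT" becomes "Testt", which no longer contains "test"
print(remove_bad("testT".title()))  # Testt

# remove_bad first: "test" is stripped before title() runs
print(remove_bad("testT").title())  # T
```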
Output (the rows will get stored in a list as tuples, as in your example):
[('tst', 'T')]
Feel free to ask questions
import re
import csv

p = re.compile('(test|%|anyotherchars)')  # insert bad chars instead of anyotherchars

def remove_bad(item):
    item = p.sub('', item)
    return item

raw = []
with open("test.csv", "rb") as f:
    rows = csv.reader(f)
    for row in rows:
        raw.append((remove_bad(row[0].strip()),
                    row[1].strip().title()  # do you really need strip() without args?
                    ))  # here you create a tuple which you append to the list
print raw
I'm an absolute beginner in python, and I'd like to get a field, e.g. the 2nd column of the 3rd row, from a text file like this:
176a AUGCACGUACGUA ACGUA AGUCU
156b GACUACAUGCAUG GCAUA AGCUA
172e AGCUCAGCUAGGC CGAGA CGACU
(The text is separated by spaces.) Is there any simple way to do that?
You could split the text and have a list of lists, where each sub list is a row, then pluck whatever you need from the list using rows[row - 1][column - 1].
f = open('test.txt', 'r')
lines = f.readlines()
f.close()

rows = []
for line in lines:
    rows.append(line.split(' '))

print rows[2][1]
If your file isn't too big, I would read it once, then split each line and get the part I want:
with open(myfile) as file_in:
    lines = file_in.readlines()

third_line = lines[2]
second_column = third_line.split(' ')[1]
print second_column
If I have a file test which contains your example data, the following will do the job:
def extract_field(data, row, col):
    '''extract_field -> string

    `data` must be an iterable file object or an equivalent
    data structure whose elements contain space delimited
    fields.

    `row` and `col` declare the wished field position which
    will be returned.'''
    # because the first list element is 0
    col -= 1
    # jump to requested `row`
    for _ in xrange(row):
        line = next(data)
    # create a list of the space delimited elements of `line`
    # and return the `col`'th element of this list
    return line.split()[col]
Use it like this:
>>> with open('test') as f:
...     extract_field(f, row=3, col=2)
...
'AGCUCAGCUAGGC'