Python - Get the largest age from a text file

Any idea how I should get the largest age from the text file and print it?
The text file:
Name, Address, Age,Hobby
Abu, “18, Jalan Satu, Penang”, 18, “Badminton, Swimming”
Choo, “Vista Gambier, 10-3A-88, Changkat Bukit Gambier Dua, 11700, Penang”, 17, Dancing
Mutu, Kolej Abdul Rahman, 20, “Shopping, Investing, Youtube-ing”
This is my code:
with open("iv.txt",encoding="utf8") as file:
    data = file.read()
splitdata = data.split('\n')
I am not getting what I want from this.

This works! I hope it helps. Let me know if there are any questions.
This approach essentially assumes that values associated with Hobby do not have numbers in them.
import csv
max_age = 0
with open("iv.txt", newline = '', encoding = "utf8") as f:
    # spamreader returns reader object used to iterate over lines of f
    # delimiter=',' is the default but I like to be explicit
    spamreader = csv.reader(f, delimiter = ',')
    # skip first row
    next(spamreader)
    # each row read from file is returned as a list of strings
    for row in spamreader:
        # reversed() returns reverse iterator (start from end of list of str)
        for i in reversed(row):
            try:
                i = int(i)
                break
            # ValueError raised when string i is not an int
            except ValueError:
                pass
        print(i)
        if i > max_age:
            max_age = i
print(f"\nMax age from file: {max_age}")
Output:
18
17
20
Max age from file: 20
csv.reader, from the csv module of Python's standard library, returns a reader object (named spamreader here) that is used to iterate over the lines of f. Each row (i.e. line) read from the file f is returned as a list of strings.
The delimiter (in our case, ',', which is also the default) determines how a raw line from the file is broken up into mutually exclusive but exhaustive parts -- these parts become the elements of the list that is associated with a given line.
Given a raw line, the string associated with the start of the line to the first comma is an element, then the string associated with any part of the line that is enclosed by two commas is also an element, and finally the string associated with the last comma to the end of the line is also an element.
For each line/list of the file, we start iterating from the end of the list, using the reversed built-in function, because we know that age is the second-to-last category. We assume that no hobby value contains a number that would end up as its own element of the list for the raw line. For example, for the line associated with Abu, if instead of "Badminton, Swimming" we had "Badminton, 30, Swimming", then the code would not have the desired effect, as 30 would be treated as Abu's age.
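If the hobby column might contain numbers, one alternative is to let csv handle the quoting so that Age can be read by name. The sketch below is only an illustration under two assumptions: the curly quotes in the file can safely be replaced by straight double quotes, and the header spells the column exactly as Age. The names SAMPLE and largest_age are made up for this example.

```python
import csv
import io

# In-memory copy of the file from the question (assumed content of iv.txt).
SAMPLE = """Name, Address, Age,Hobby
Abu, “18, Jalan Satu, Penang”, 18, “Badminton, Swimming”
Choo, “Vista Gambier, 10-3A-88, Changkat Bukit Gambier Dua, 11700, Penang”, 17, Dancing
Mutu, Kolej Abdul Rahman, 20, “Shopping, Investing, Youtube-ing”
"""

def largest_age(text):
    # csv only recognises straight quotes, so normalise the curly ones first.
    normalized = text.replace("“", '"').replace("”", '"')
    # skipinitialspace=True ignores the blank after each comma, which lets
    # csv see the opening quote at the start of a field.
    reader = csv.DictReader(io.StringIO(normalized), skipinitialspace=True)
    return max(int(row["Age"]) for row in reader)

print(largest_age(SAMPLE))  # 20
```

With a real file, the same function could be fed the result of file.read(); commas and numbers inside a quoted address or hobby no longer matter because they stay inside a single field.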

I'm sure there is a built-in feature to parse a composite string like the one you posted, but as I don't know it, I've created a CustomParser class to do the job:
class CustomParser():
    def __init__(self, line: str, delimiter: str):
        self.line = line
        self.delimiter = delimiter
    def split(self):
        word = ''
        words = []
        inside_string = False
        for letter in self.line:
            # quote characters toggle "inside a quoted field" and are dropped
            if letter in '“”"':
                inside_string = not inside_string
                continue
            if letter == self.delimiter and not inside_string:
                words.append(word.strip())
                word = ''
                continue
            word += letter
        words.append(word.strip())
        return words
with open('people_data.csv', encoding='utf8') as file:
    ages = []
    for line in file:
        ages.append(CustomParser(line, ',').split()[2])
# skip the header row and compare the ages as integers, not as strings
print(max(ages[1:], key=int))
Hope that helps.


How to read a value of file separate by tabs in Python?

I have a text file with this format
ConfigFile 1.1
;
; Version: 4.0.32.1
; Date="2021/04/08" Time="11:54:46" UTC="8"
;
Name
    John Legend
Type
    Student
Number
    s1054520
I would like to get the value of Name, Type, or Number.
How do I get it?
I tried this method, but it does not solve my problem.
import re
f = open("Data.txt", "r")
file = f.read()
Name = re.findall("Name", file)
print(Name)
My expected output is John Legend.
Can anyone help me, please? I would really appreciate it. Thank you.
First of all, re.findall is used to search for all occurrences that match a given pattern. So in your case, you are finding every "Name" in the file, because that is the pattern you are searching for.
On the other hand, the computer does not know that "John Legend" is the name; it only knows that it is the line after the word "Name".
In your case I suggest the following approach:
Find the line number of "Name"
Read the next line
Get the name without the surrounding white space
If there is more than one Name, this will work as well.
The final code looks like this:
def search_string_in_file(file_name, string_to_search):
    """Search for the given string in file and return lines containing that string,
    along with line numbers"""
    line_number = 0
    list_of_results = []
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if line contains the string
            line_number += 1
            if string_to_search in line:
                # If yes, then add the line number & line as a tuple in the list
                list_of_results.append((line_number, line.rstrip()))
    # Return list of tuples containing line numbers and lines where string is found
    return list_of_results
with open('Data.txt') as file:
    content = file.readlines()
matched_lines = search_string_in_file('Data.txt', 'Name')
print('Total Matched lines : ', len(matched_lines))
for i in matched_lines:
    # line numbers are 1-based, so content[i[0]] is the line after the match
    print(content[i[0]].strip())
Here I'm going through each line, and when I encounter Name I add the next line to the result list (you could print it directly instead):
import re
def print_hi(name):
    result = []
    regexp = re.compile(r'Name')
    gotname = False
    with open('test.txt') as f:
        for line in f:
            if gotname:
                result.append(line.strip())
                gotname = False
            match = regexp.match(line)
            if match:
                gotname = True
    print(result)
if __name__ == '__main__':
    print_hi('test')
Assuming those label lines appear in the file in the same order as in the list, you can simply scan for them:
labelList = ["Name","Type","Number"]
captures = dict()
with open("Data.txt","rt") as f:
    for label in labelList:
        # assumes every label is present; a missing label would loop forever at end of file
        while not f.readline().startswith(label):
            pass
        captures[label] = f.readline().strip()
for label in labelList:
    print(f"{label} : {captures[label]}")
I wouldn't use a regex, but rather make a parser for the file type. The rules might be:
1. The first line can be ignored.
2. Any line that starts with ; can be ignored.
3. Every line with no leading whitespace is a key.
4. Every line with leading whitespace is a value belonging to the last key.
I'd start with a generator that can return to you any unignored line:
def read_data_lines(filename):
    with open(filename, "r") as f:
        # skip the first line
        f.readline()
        # read until no more lines
        while line := f.readline():
            # skip lines that start with ;
            if not line.startswith(";"):
                yield line
Then fill up a dict by following rules 3 and 4:
def parse_data_file(filename):
    data = {}
    key = None
    for line in read_data_lines(filename):
        # No starting whitespace makes this a key
        if not line.startswith((" ", "\t")):
            key = line.strip()
        # Starting whitespace (space or tab) makes this a value for the last key
        else:
            data[key] = line.strip()
    return data
Now at this point you can parse the file and print whatever key you want:
data = parse_data_file("Data.txt")
print(data["Name"])
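To try the four rules without touching the disk, they can be applied to an in-memory string. This is a minimal sketch, assuming the value lines in the real file are indented with a tab (the question title suggests tabs, but the posted sample does not show them); SAMPLE and parse_config are names invented for this illustration.

```python
# Assumed file content; "\t" stands for the indentation of each value line.
SAMPLE = '''ConfigFile 1.1
;
; Version: 4.0.32.1
; Date="2021/04/08" Time="11:54:46" UTC="8"
;
Name
\tJohn Legend
Type
\tStudent
Number
\ts1054520
'''

def parse_config(text):
    data = {}
    key = None
    # Rule 1: ignore the first line.
    for line in text.splitlines()[1:]:
        # Rule 2: ignore comment lines (blank lines are skipped as well).
        if not line.strip() or line.startswith(";"):
            continue
        if line[0].isspace():
            # Rule 4: an indented line is the value of the most recent key.
            if key is not None:
                data[key] = line.strip()
        else:
            # Rule 3: an unindented line is a key.
            key = line.strip()
    return data

print(parse_config(SAMPLE)["Name"])  # John Legend
```

Checking line[0].isspace() rather than a literal space means the sketch accepts either tabs or spaces as indentation.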

Grabbing CSV Information with Regex in Python

I'm trying to extract all the phone numbers from a CSV document and append them to a list in string format. Here is a sample of my input:
someone#somewhere.com,John,Doe,,,(555) 555-5555
And here is the code I am using:
l = []
with open('sample.csv', 'r') as f:
    reader = csv.reader(f)
    for x in reader:
        number = re.search(r'.*?#.*?,.*?,.*?,.*?,.*?,(.*?),',x)
        if number in x:
            l.append(''.join(number))
Basically, I'm trying to check if there is a number at a certain position in the row (where the parentheses are) and then append that to a list as a string using join. However, I keep getting this error:
Traceback (most recent call last):
  File "C:/Users/svillamil/Desktop/Final Phone.py", line 14, in <module>
    number = re.search(b'.*?#.*?,.*?,.*?,.*?,.*?,(.*?),', x)
  File "C:\Users\svillamil\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
How do I get around this?
Iterating over a csv.reader gives you a list of strings for each row.
Taking the value at index 5 already gives you the phone number (if I counted correctly). You don't need a regular expression to do this.
import csv
l = []
with open('sample.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        number = row[5]
        if number:
            l.append(number)
(Conversely, if you insisted on using a regular expression, you wouldn't need csv to do the splitting and could just iterate over the raw lines of the file.)
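As a rough illustration of that regex-only route, the sketch below scans raw lines for a phone number. It assumes every number has exactly the shape shown in the sample, (555) 555-5555; the PHONE pattern and the phone_numbers helper are hypothetical names, and real data would likely need a looser pattern.

```python
import re

# Assumed format: three digits in parentheses, a space, then 3-4 digits.
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

def phone_numbers(lines):
    """Collect every phone number found in an iterable of raw text lines."""
    numbers = []
    for line in lines:
        numbers.extend(PHONE.findall(line))
    return numbers

sample = ["someone#somewhere.com,John,Doe,,,(555) 555-5555"]
print(phone_numbers(sample))  # ['(555) 555-5555']
```

Because an open file is itself an iterable of lines, the file object could be passed straight to phone_numbers.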
You should just split a file line by comma and iterate through elements checking each if it matches (...), assuming a phone number can appear at any delimited position in a file line:
import re
result = []
# compile once; a raw string keeps the backslashes literal
pattern = re.compile(r"\(...\)")
with open('sandbox.txt', 'r') as f:
    fileLines = f.readlines()
for fileLine in fileLines:
    fileLine = fileLine.strip()
    lineElems = fileLine.split(',')
    for lineElem in lineElems:
        if pattern.match(lineElem):
            print("Adding %s" % lineElem)
            result.append(lineElem)
x is a list which contains each field of the row.
So one approach is to join the list back into a string and then apply the regex:
foo = ','.join(x)
number = re.search(r'.*?#.*?,.*?,.*?,.*?,.*?,(.*?),', foo)
Or you can iterate over each field in the row and check whether it is a phone number:
for row in reader:
    for field in row:
        number = re.search(r'<phone-number-regex>', field)
        if number:
            l.append(number.group())

read fastq file into dictionary

I have a fastq file like this (part of the file):
#A80HNBABXX:4:1:1344:2224#0/1
AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG
+
\\YYWX\PX^YT[TVYaTY]^\^H\`^`a`\UZU__TTbSbb^\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
#A80HNBABXX:4:1:1515:2211#0/1
TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA
+
ee^e^\`ad`eeee\dd\ddddYeebdd\ddaYbdcYc`\bac^YX[V^\Ybb]]^bdbaZ]ZZ\^K\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_
#A80HNBABXX:4:1:1538:2220#0/1
CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT
+
fff^fd\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\a\`]aY]ZZ[XYcccYcZ\\]Y
#A80HNBABXX:4:1:1666:2222#0/1
CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT
+
deeee`bbcddddad\bbbbeee\ecYZcc^dd^ddd\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB
The FASTQ file uses four lines per sequence. Line 1 begins with a '#' character and is followed by a sequence identifier. Line 2 is the DNA sequence letters. Line 3 begins with a '+' character. Line 4 encodes the quality values for the sequence in Line 2 (the part after "+" and before the next "#"), and must contain the same number of symbols as letters in the sequence.
I want to read the fastq file into a dictionary like this (the key is the DNA sequence and the value is the quality value; the lines starting with "#" and "+" can be discarded):
{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG':'\YYWX\PX^YT[TVYaTY]^\^H`^a\UZU__TTbSbb^\a^^^[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT':'fff^fd\c^d^Ycacdcdcdedeffdfedb]beeeeecd^ddccdddddfffeaeeeffdTecacaLV[QRPa\a`]aY]ZZ[XYcccYcZ\]Y ',
....}
I wrote the following code, but it does not give me what I want. Can anyone help me fix or improve it?
class fastq(object):
    def __init__(self,filename):
        self.filename = filename
        self.__sequences = {}
    def parse_file(self):
        symbol=['#','+']
        """Stores both the sequence and the quality values for the sequence"""
        f = open(self.filename,'rU')
        for lines in self.filename:
            if symbol not in lines.startwith()
                data = f.readlines()
        return data
Here's a pretty quick and efficient way of doing it:
def parse_file(self):
    with open(self.filename, 'r') as f:
        # Strip the trailing newline from every line
        content = [line.rstrip('\n') for line in f]
    # Recreate content without lines that start with # and +
    # (this assumes no quality line itself starts with one of these characters)
    content = [line for line in content if not line.startswith(('#', '+'))]
    # Now the lines you want are alternating, so you can make a dict
    # from key/value pairs of lists content[0::2] and content[1::2]
    data = dict(zip(content[0::2], content[1::2]))
    return data
I don't think using the reads as keys is a good idea: what if two records have exactly the same read? But anyway, if you want to do it:
In [9]:
with open('temp.fastq') as f:
    lines=f.readlines()
head=[item[:-1] for item in lines[::4]] #get rid of '\n'
read=[item[:-1] for item in lines[1::4]]
qual=[item[:-1] for item in lines[3::4]]
dict(zip(read, qual))
Out[9]:
{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG': '\\\\YYWX\\PX^YT[TVYaTY]^\\^H\\`^`a`\\UZU__TTbSbb^\\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT': 'fff^fd\\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\\\a\\`]aY]ZZ[XYcccYcZ\\\\]Y',
'CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT': 'deeee`bbcddddad\\bbbbeee\\ecYZcc^dd^ddd\\\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB',
'TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA': 'ee^e^\\`ad`eeee\\dd\\ddddYeebdd\\ddaYbdcYc`\\bac^YX[V^\\Ybb]]^bdbaZ]ZZ\\^K\\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_'}
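If identical reads are a real possibility, one way to avoid silently overwriting entries is to keep a list of quality strings per read. The following is only a sketch, assuming well-formed four-line records; read_fastq is a hypothetical helper, not part of any library.

```python
from collections import defaultdict
from itertools import islice

def read_fastq(lines):
    """Map each read to the list of quality strings seen for it."""
    qualities = defaultdict(list)
    it = iter(lines)
    while True:
        # Take the next four lines: header, read, separator, quality.
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        _header, read, _plus, qual = (s.rstrip("\n") for s in record)
        qualities[read].append(qual)
    return dict(qualities)

# Two hypothetical records that share the same read.
sample = ["#id1\n", "ACGT\n", "+\n", "ffff\n",
          "#id2\n", "ACGT\n", "+\n", "eeee\n"]
print(read_fastq(sample))  # {'ACGT': ['ffff', 'eeee']}
```

Reading by position (every group of four lines) also avoids relying on the first character of a line, which matters because a quality line may itself start with '#' or '+'.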
You can use the SeqIO module from Biopython (the Bio package), like this:
from Bio import SeqIO
myf=mydir+myfile
startlist=[]
for record in SeqIO.parse(myf, "fastq"):
    startlist.append(str(record.seq)) #or without 'str'

How to create a list to select index?

I have this code to check indexes in a file to see if they match, but to start off I am having trouble being able to select an index. What do I have to do in order to be able to do so, because at this moment it doesn't show the values as being in a list.
def checkOS():
    fid = open("C:/Python/NSRLOS.txt", 'r')
    fhand = open("C:/Python/sha_sub_hashes.out", 'r')
    sLine = fhand.readline()
    line = fid.readline()
    outdata = []
    print line
checkOS()
Right now it prints:
"190","Windows 2000","2000","609"
I only want it to print: (so index[0])
190
And when I try index[0], I just get ' " '. So the first value in the whole string, I want a list to be able to select the index.
Try using line.split(",") to split the line by the commas, then strip out the quotation marks by slicing the result.
Example:
>>> line = '"190","Windows 2000","2000","609"'
>>> sliced = line.split(',')
>>> print sliced
['"190"', '"Windows 2000"', '"2000"', '"609"']
>>> first_item = sliced[0][1:-1]
>>> print first_item
190
...and here's the whole thing, abstracted into a function:
def get_item(line, index):
    return line.split(',')[index][1:-1]
(This assumes, of course, that all the items in the line are separated by commas, that they're all wrapped in quotation marks, and that there are no spaces after the commas (although you could handle that by calling item.strip() to remove whitespace). It also fails if a quoted item contains commas, as noted in the comments.)
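For the case where a quoted item does contain a comma, the csv module can parse a single line as well. A minimal sketch follows; it redefines get_item purely for illustration, and the second sample line is made up.

```python
import csv

def get_item(line, index):
    # csv.reader accepts any iterable of strings, so wrap the line in a list
    # and take the single parsed row.
    row = next(csv.reader([line]))
    return row[index]

print(get_item('"190","Windows 2000","2000","609"', 0))  # 190
print(get_item('"1","Windows 2000, SP4","2000"', 1))     # Windows 2000, SP4
```

The quotation marks are removed by the parser, so no slicing is needed.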
Have you tried using split() to split on each comma and take the first value? Try this.
[0] applied to a string only returns the first character.
You want the first item of a comma-separated list. You could write your own parsing code, or you could use the csv module which already handles this.
import csv
def get_first_row(fname):
    with open(fname, 'rb') as inf:
        incsv = csv.reader(inf)
        try:
            row = incsv.next()
        except StopIteration:
            row = [None]
    return row
def checkOS():
    fid = get_first_row("C:/Python/NSRLOS.txt")[0]
    fhand = get_first_row("C:/Python/sha_sub_hashes.out")[0]
    print fid
csv.reader would be a good start.
import csv
from itertools import izip
with open('file1.csv') as fid, open('file2.csv') as fhand:
    fidcsv = csv.reader(fid)
    fhandcsv = csv.reader(fhand)
    for row1, row2 in izip(fidcsv, fhandcsv):
        print row1, row2, row1[1] # etc...
Using csv.reader will handle CSV-formatted files better than pure str methods. izip reads line 1 from both files, then line 2 from both files, and so on (it stops when the shorter file runs out of rows); I'm not sure if this is what you want, though. row1 and row2 each end up being a list of columns, so you can then just index them, e.g. if row1[0] == row2[0]: or whatever logic you wish to use.
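In Python 3 the built-in zip is already lazy, so itertools.izip is not needed there. A small sketch of the row-by-row comparison in Python 3 is shown below; matching_keys and the in-memory sample data are hypothetical and stand in for the two real files.

```python
import csv
import io

def matching_keys(file_a, file_b):
    """Return first-column values that agree on the same row of both files."""
    rows_a = csv.reader(file_a)
    rows_b = csv.reader(file_b)
    # zip stops as soon as the shorter input is exhausted; empty rows are skipped.
    return [a[0] for a, b in zip(rows_a, rows_b) if a and b and a[0] == b[0]]

# Made-up stand-ins for the two files.
left = io.StringIO('"190","Windows 2000"\n"191","Windows XP"\n')
right = io.StringIO('"190","abc"\n"200","def"\n')
print(matching_keys(left, right))  # ['190']
```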

How can I use readline() to begin from the second line?

I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line(aka header) of this FASTA file and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
so far i have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print line
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = FASTA.next() # skip heading record
    for line in FASTA:
        line = line.strip()
        print line
Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file:
You should show your script. To read from the second line, try something like this:
f=open("file")
f.readline()
for line in f:
    print line
f.close()
You might be interested in how Biopython handles FASTA files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).
    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.
    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break
    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id
        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()
        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id = id, name = name, description = descr)
        if not line : return #StopIteration
    assert False, "Should not reach this line"
Good to see another bioinformatician :)
Just include an if clause within your for loop, above the line.strip() call:
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
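If the goal is to return the sequence rather than print it line by line, the same filter can feed a join. This is a sketch for Python 3 that assumes the file holds a single record; read_sequence and the shortened sample record are invented for the example.

```python
import io

def read_sequence(handle):
    """Return the sequence of a single-record FASTA file as one string."""
    # Skip header lines (those starting with '>') and glue the rest together.
    return "".join(
        line.strip() for line in handle if not line.startswith(">")
    )

# Hypothetical, shortened record used only for illustration.
sample = io.StringIO(">example header\nGACGGCTT\nTTAGATTT\n")
print(read_sequence(sample))  # GACGGCTTTTAGATTT
```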
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end."
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
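A few concrete cases may make the notation easier to remember. The list s below is just an example value, not something from the question.

```python
s = list(range(10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(s[2:5])     # [2, 3, 4]        from index 2 up to, but not including, 5
print(s[1:8:3])   # [1, 4, 7]        every third element between 1 and 8
print(s[8:2:-2])  # [8, 6, 4]        negative step walks backward
print(s[1:])      # [1, 2, ..., 9]   everything after the first element
print(s[:-1])     # [0, 1, ..., 8]   all but the last element
```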
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect without loading the whole thing into memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print line
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.
