I have files with a varying number of header lines in random order, followed by the data I need, which spans the number of lines given by the corresponding Lines: header. For example:
from: blah#blah.com
Subject: foobarhah
Lines: 3
Extra: More random stuff
Foo Bar Lines of Data, which take up
some arbitrary long amount characters on a single line, but no matter how long
they still only take up the number of lines as specified in the header
How can I get at that data in one read of the file??
P.S. The data is from the 20Newsgroups corpus.
Edit: The quick solution, which only works if I relax the constraint of reading the file only once, is this:
[1st read] Find total_num_of_lines and match on the first Lines: header.
[2nd read] Discard the first (total_num_of_lines - header_num_of_lines) lines and then read the rest of the file.
I'm still unaware of a way to read in the data in one pass though.
I'm not quite sure you even need the beginning of the file in order to get its contents. Consider using split:
import os
_, contents = file_contents.split(os.linesep + os.linesep, 1)  # split on the first blank line, e.g. "\n\n"
If, however, the Lines parameter does matter, you can use the technique above together with parsing the file headers:
headers, contents = file_contents.split(os.linesep + os.linesep, 1)
# Get the Lines: count from the headers
headers_list = [line.split() for line in headers.splitlines()]
lines_count = int([line[1] for line in headers_list if line[0].lower() == 'lines:'][0])
# Keep only that many lines of content
real_contents = contents.splitlines()[:lines_count]
Assuming we have the general case where there could be multiple messages following each other, maybe something like
from itertools import takewhile
def msgreader(file):
    while True:
        header = list(takewhile(lambda x: x.strip(), file))
        if not header:
            break
        header_dict = {k: v.strip() for k, v in (line.split(":", 1) for line in header)}
        line_count = int(header_dict['Lines'])
        message = [next(file) for i in xrange(line_count)]  # or itertools.islice
        yield message
would work, where
with open("53903") as fp:
for message in msgreader(fp):
print message
would give all the listed messages. For this particular use case the above would be overkill, but frankly it's not much harder to extract all the header info than it is only the one line. I'd be surprised if there weren't already a module to parse these messages, though.
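For instance, a sketch using the standard library's email parser, which already understands the "headers, blank line, body" layout of these messages (assuming one message per file):
import email

# Parse the headers and body without counting lines by hand.
with open("53903") as fp:
    msg = email.message_from_file(fp)

line_count = int(msg['Lines'])   # the Lines: header value
body = msg.get_payload()         # everything after the blank line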
You need to store the state of whether the headers have finished. That's all.
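A minimal sketch of that idea, assuming a single message per file and a blank line between the headers and the data:
def read_message(path):
    line_count = None
    in_headers = True
    data = []
    with open(path) as fp:
        for line in fp:
            if in_headers:
                if not line.strip():            # blank line: the headers have finished
                    in_headers = False
                elif line.lower().startswith("lines:"):
                    line_count = int(line.split(":", 1)[1])
            else:
                data.append(line)
                if len(data) == line_count:     # stop once the promised lines are read
                    break
    return data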
I want to know the number of header rows my csv file contains (between 0 and ~50). The file itself is huge (so reading the complete file for this is not an option) and contains numerical data.
I know that csv.Sniffer has a has_header() function, but that can only detect 1 header.
One idea I had is to recursively call the has_header function (supposing it detects the first header each time) and then count the recursions. I am sure, though, there is a much smarter way.
Googling was kind of a pain, since no matter what you search, if it includes "count" and "csv" at some point, you get all the "count rows in csv" results :D
Clarification:
By number of headers I mean the number of rows containing information that is not data. There is no general rule for the headers (they could be text, floats, or whitespace), and they may be just a single line of text. The data itself, however, is only floats. To me this was perfectly clear, because I've been working with these files for a long time, but I forgot this isn't the normal case.
I hoped there was an easy and smart builtin function in NumPy or Pandas, but it doesn't seem so.
Inspired by the comments so far, I think my best bet is to
read 100 lines
count number of separators in each line
determine most common number of separators per line
Starting from the end of those 100 lines, find the first line with a different number of separators, or one whose fields aren't all floats. That line is the last header line (a rough sketch of this plan follows below).
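A rough sketch of that plan, assuming a comma-separated file named data.csv (hypothetical name) and headers only at the top:
import csv
from collections import Counter

# Sample the first 100 lines, find the most common field count,
# then walk backwards to the first line that doesn't look like data.
with open("data.csv", newline="") as handle:
    sample = [row for _, row in zip(range(100), csv.reader(handle))]

most_common_len = Counter(len(row) for row in sample).most_common(1)[0][0]

def looks_like_data(row):
    if len(row) != most_common_len:
        return False
    try:
        [float(field) for field in row]
        return True
    except ValueError:
        return False

num_headers = 0
for i in range(len(sample) - 1, -1, -1):
    if not looks_like_data(sample[i]):
        num_headers = i + 1  # rows 0..i are header rows
        break
print(num_headers)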
Here's a sketch for finding the first line which matches a particular criterion. For demo purposes, I use the criterion "there are empty fields":
import csv
with open(filename, "r", encoding="utf-8") as handle:
    for lineno, fields in enumerate(csv.reader(handle), 1):
        if "" in fields:
            print(lineno - 1)
            break
You'd update it to look for something which makes sense for your data, like perhaps "the third and eighth fields contain numbers":
try:
    float(fields[2])
    float(fields[7])
    print(lineno - 1)
    break
except ValueError:
    continue
(notice how the list fields is indexed starting at zero, so the first field is fields[0] and the third is fields[2]), or perhaps a more sophisticated model where the first line contains no empty fields, successive lines contain more and more empty fields, and then the first data line contains fewer empty fields:
maxempty = 0
for lineno, fields in enumerate(csv.reader(handle), 1):
    empty = fields.count("")
    if empty > maxempty:
        maxempty = empty
    elif empty < maxempty:
        print(lineno - 1)
        break
We simply print the line number of the last header line, since your question asks how many there are. Perhaps printing or returning the number of the first data line would make more sense in some scenarios.
This code doesn't use Pandas at all, just the regular csv module from the Python standard library. It stops reading when you hit break so it doesn't matter for performance how many lines there are after that (though if you need to experiment or debug, maybe create a smaller file with only, say, the first 200 lines of your real file).
Use re.search to look for lines that have 2 or more letters in a row. Two is used instead of one so that scientific notation (e.g., 1.0e5) is not counted as a header.
# In the shell, create a test file:
# echo "foo,bar\nbaz,bletch\n1e4,2.0\n2E5,2" > in_file.csv
import re
num_header_lines = 0
for line in open('in_file.csv'):
    if re.search('[A-Za-z]{2,}', line):
        # count the header here
        num_header_lines += 1
    else:
        break
print(num_header_lines)
# 2
Well, I think that you could get the first line of the csv file and then split it by a ",". That will return an array with all the headers in it. Now you can just count them with len.
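A minimal sketch of that idea (assuming the file really is comma-separated):
# Count the fields in the first line; this assumes a comma-separated file.
with open('your_file.csv') as f:
    num_headers = len(f.readline().split(','))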
Try this:
import pandas as pd
df = pd.read_csv('your_file.csv', index_col=0)
num_rows, num_cols = df.shape
Since I see you're worried about file size, breaking the file into chunks would work:
chunk_size = 10000
df = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                 low_memory=False)
I think you might get a variable number of rows if you read the df chunk by chunk but if you're only interested in number of columns this would work easily.
You could also look into dask.dataframe
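For example, a minimal sketch with dask (assuming a comma-separated file; dask is not part of the standard library):
import dask.dataframe as dd

# dask builds the dataframe lazily, so inspecting the columns does not read the whole file.
df = dd.read_csv('your_file.csv')
print(len(df.columns))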
This only reads the first line of the csv:
import csv
with open('ornek.csv', newline='') as f:
    reader = csv.reader(f)
    row1 = next(reader)
    sizeOfHeader = len(row1)
So I know similar questions have been asked before, but every method I have tried is not working...
Here is the ask: I have a text file (which is a log file) that I am parsing for any occurrence of "app.task2". The following are the 2 scenarios that can occur (As they appear in the text file, independent of my code):
Scenario 1:
Mar 23 10:28:24 dasd[116] <Notice>: app.task2.refresh:556A2D:[
{name: ApplicationPolicy, policyWeight: 50.000, response: {Decision: Can Proceed, Score: 0.45}}
] sumScores:68.785000, denominator:96.410000, FinalDecision: Can Proceed FinalScore: 0.713463}
Scenario 2:
Mar 23 10:35:56 dasd[116] <Notice>: 'app.task2.refresh:C6C2FE' CurrentScore: 0.636967, ThresholdScore: 0.410015 DecisionToRun:1
The problem I am facing is that with my current code below, I am not getting the entire log entry for the first case: it only pulls the first line of the entry, not the remainder, and it appears to stop at the newline that occurs after ":[".
My Code:
all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            all.append(line)
print all
How can I get the entire log entry for the first case? I tried stripping escape characters with no luck. From here I should be able to parse the list of results returned for what I truly need, but this will help! ty!
OF NOTE: I need to be able to locate these types of log entries (which will then give us either scenario 1 or scenario 2) by the string "app.task2". So this needs to be incorporated, like in my example...
Before adding the line to all, check if it ends with [. If it does, keep reading and merge the lines until you get to ].
import re
all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            if re.search(r'\[\s*$', line):  # start of multiline log message
                for line2 in f:
                    line += line2
                    if re.search(r'^\s*\]', line2):  # end of multiline log message
                        break
            all.append(line)
print(all)
You are iterating over each line individually, which is why you only get the first line in scenario 1.
Either you can add a counter like this:
all = []
count = -1
with open(path_to_log) as f:
    for line in f:
        if count > 0:
            all.append(line)
            if count == 1:
                # merge the matched line and its 3 continuation lines into one entry
                # (the lines already end with newlines, so join them directly)
                tmp = all[-4:]
                del all[-4:]
                all.append("".join(tmp))
            count -= 1
            continue
        if "app.task2" in line:
            all.append(line)
            if line.endswith('[\n'):
                count = 3
print all
In this case I think Barmar's solution would work just as well.
Or, preferably, when writing the log file you can put some distinct delimiter between log entries and just split the log file on that delimiter.
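A small sketch of that idea, assuming a hypothetical delimiter line of "---" between entries:
# Assumes each log entry is separated by a line containing only "---" (hypothetical).
with open(path_to_log) as f:
    entries = f.read().split("\n---\n")

matches = [entry for entry in entries if "app.task2" in entry]
print matches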
I like @Barmar's solution with nested loops on the same file object, and may use that technique in the future. But prior to seeing it I would have done this with a single loop, which may or may not be more readable:
all = []
keep = False
for line in open(path_to_log, "rt"):
    if "app.task2" in line:
        all.append(line)
        keep = line.rstrip().endswith("[")
    elif keep:
        all.append(line)
        keep = not line.lstrip().startswith("]")
print(all)
or, you can print it nicer with:
print(*all,sep='\n')
I have the following .txt-File (modified bash emboss-dreg report, the original report has seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
I would like to access the elements under "sequence" only, to compare them with some variables and delete the whole lines, if the comparison does not give the desired result (using Levenshtein distance for comparison).
But I can't even get started .... :(
I am searching for something like the linux -f option, to directly get to the right "field" in the line to do my comparison.
I came across re.split:
import re

with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)
which results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
That is the closest I got to "split my lines into elements". I feel like totally going the wrong way, but searching Stack Overflow and google did not result in anything :(
I have never worked with the seqtable format before, so I tried to deal with it as .txt. Maybe there is a better approach for dealing with it?
Python is the main language I am learning, I am not so firm in Bash, but bash-answers for dealing with the issue would be ok for me, too.
I am thankful for any hint/link/help :)
The format itself seems to be using blank lines as row delimiters, while your r'\t' split isn't doing anything useful because, based on what you've pasted, the data is not tab-delimited anyway; it's padded with a varying number of spaces to align the table.
To address both, you can read the file, treat the first line as a header (if you need it), then read the rest line by line, strip the trailing/leading whitespace, check if there is any data there and, if there is, further split it on whitespace to get to your line elements:
with open("your_data", "r") as f:
header = f.readline().split() # read the first line as a header
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split() # split the data on whitespace to get your elements
print(elements[-1]) # print the last element
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
As a bonus, since you have the header, you can turn it into a map and then use 'proxied' named access to get the element you're looking for so you don't need to worry about the element position:
with open("your_data", "r") as f:
# read the header and turn it into a value:index map
header = {v: i for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split()
print(elements[header["Sequence"]]) # print the Sequence element
You can also use a header map to turn your rows into dict structures for even easier access.
UPDATE: Here's how to create a header map and then use it to build a dict out of your lines:
with open("your_data", "r") as f:
# read the header and turn it into an index:value map
header = {i: v for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
# split the line, iterate over it and use the header map to create a dict
row = {header[i]: v for i, v in enumerate(line.split())}
print(row["Sequence"]) # ... or you can append it to a list for later use
As for how to 'delete' lines that you don't want for some reason, you'll have to create a temporary file, loop through your original file, compare your values, write the ones that you want to keep into the temporary file, delete the original file and finally rename the temporary file to match your original file, something like:
import shutil
from tempfile import NamedTemporaryFile

SOURCE_FILE = "your_data"  # path to the original file to process

def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead

# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)  # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace to keep the same format
    for line in f:  # read the rest of the file line by line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or is it an empty line
            elements = row.split()  # split the row into elements
            # now let's call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)  # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace for later use

shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file
This will produce the same file sans the second row from your example, since its sequence ends in TC and our compare_func() returns False in that case.
For a bit less complexity, instead of using temporary files you can load your whole source file into the working memory and then just overwrite it, but that would work only for files that can fit your working memory while the above approach can work with files as large as your free storage space.
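A sketch of that in-memory variant, under the same assumptions as the temporary-file version above:
SOURCE_FILE = "your_data"

def compare_func(seq):  # same placeholder comparison as above
    return not seq.endswith("TC")

with open(SOURCE_FILE) as f:
    lines = f.readlines()  # the whole file has to fit into memory

header = {v: i for i, v in enumerate(lines[0].split())}
kept = [lines[0]]
for line in lines[1:]:
    row = line.split()
    # keep blank lines (to preserve the spacing) and every row whose Sequence passes the check
    if not row or compare_func(row[header["Sequence"]]):
        kept.append(line)

with open(SOURCE_FILE, "w") as f:
    f.writelines(kept)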
I'm somewhat new to Python. I'm trying to sort through a list of strings and integers. The list contains some symbols that need to be filtered out (i.e. ro!ad should end up as road). Also, they are all on one line separated by spaces. So I need to use 2 arguments: one for the input file and one for the output file. The output should be sorted with the numbers first and then the words, without the special characters, each on its own line. I've been looking at loads of list functions but am having some trouble putting this together, as I've never had to do anything like this. Any takers?
So far I have the basic stuff
#!/usr/bin/python
import sys
try:
    infilename = sys.argv[1]
    #outfilename = sys.argv[2]
except:
    print "Usage: ", sys.argv[0], "infile outfile"; sys.exit(1)
ifile = open(infilename, 'r')
#ofile = open(outfilename, 'w')
data = ifile.readlines()
r = sorted(data, key=lambda item: (int(item.partition(' ')[0])
                                   if item[0].isdigit() else float('inf'), item))
ifile.close()
print '\n'.join(r)
#ofile.writelines(r)
#ofile.close()
The output shows exactly what was in the file, exactly as the file is written, and not sorted at all. The goal is to take a file (arg1.txt), sort it, and write a new file (arg2.txt); both will be command-line arguments. I used print in this case to speed up editing, but I need it to write to a file. That's why the output-file lines are commented out, but feel free to tell me I'm stupid if I screwed that up, too! Thanks for any help!
When you have an issue like this, it's usually a good idea to check your data at various points throughout the program to make sure it looks the way you want it to. The issue here seems to be in the way you're reading in the file.
data = ifile.readlines()
is going to read in the entire file as a list of lines. But since all the entries you want to sort are on one line, this list will only have one entry. When you try to sort the list, you're passing a list of length 1, which is going to just return the same list regardless of what your key function is. Try changing the line to
data = ifile.readlines()[0].split()
You may not even need the key function any more since numbers are placed before letters by default. I don't see anything in your code to remove special characters though.
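For instance, a sketch of the whole pipeline (numbers first, symbols stripped, one item per line), assuming the cleanup only needs to drop characters that aren't letters or digits:
#!/usr/bin/python
import re
import sys

infilename, outfilename = sys.argv[1], sys.argv[2]

with open(infilename) as ifile:
    tokens = ifile.read().split()  # the whole input is one whitespace-separated line

cleaned = [re.sub(r'[^A-Za-z0-9]', '', t) for t in tokens]  # e.g. ro!ad -> road
numbers = sorted((t for t in cleaned if t.isdigit()), key=int)
words = sorted(t for t in cleaned if not t.isdigit())

with open(outfilename, 'w') as ofile:
    ofile.write('\n'.join(numbers + words) + '\n')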
Since they are on the same line you don't really need readlines:
with open('some.txt') as f:
    data = f.read()  # now data = "item 1 item2 etc..."
you can use re to filter out unwanted characters
import re
data = "ro!ad"
fixed_data = re.sub("[!?#$]","",data)
partition is maybe overkill:
data = "hello 23frank sam wilbur"
my_list = data.split() # ["hello","23frank","sam","wilbur"]
print sorted(my_list)
However, you will need to do more to force the numbers to sort numerically, maybe something like:
numbers = [x for x in my_list if x[0].isdigit()]
strings = [x for x in my_list if not x[0].isdigit()]
sorted_list = sorted(numbers, key=lambda x: int(re.sub("[^0-9]", "", x))) + sorted(strings)
Also, they are all on one line separated by a space.
So your file contains a single line?
data = ifile.readlines()
This makes data into a list of the lines in your file. All 1 of them.
r = sorted(...)
This makes r the sorted version of that list.
To get the words from the line, you can .read() the entire file as a single string, and .split() it (by default, it splits on whitespace).
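For example, a minimal version of that:
with open(infilename) as ifile:
    words = ifile.read().split()  # one entry per whitespace-separated token
r = sorted(words)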
I have a certain check to be done and if the check satisfies, I want the result to be printed. Below is the code:
import string
import codecs
import sys
y=sys.argv[1]
list_1=[]
f=1.0
x=0.05
write_in = open ("new_file.txt", "w")
write_in_1 = open ("new_file_1.txt", "w")
ligand_file=open( y, "r" ) #Open the receptor.txt file
ligand_lines=ligand_file.readlines() # Read all the lines into the array
ligand_lines=map( string.strip, ligand_lines ) #Remove the newline character from all the pdb file names
ligand_file.close()
ligand_file=open( "unique_count_c_from_ac.txt", "r" ) #Open the receptor.txt file
ligand_lines_1=ligand_file.readlines() # Read all the lines into the array
ligand_lines_1=map( string.strip, ligand_lines_1 ) #Remove the newline character from all the pdb file names
ligand_file.close()
s=[]
for i in ligand_lines:
    for j in ligand_lines_1:
        j = j.split()
        if i == j[1]:
            print j
The above code works great but when I print j, it prints like ['351', '342'] but I am expecting to get 351 342 (with one space in between). Since it is more of a python question, I have not included the input files (basically they are just numbers).
Can anyone help me?
Cheers,
Chavanak
To convert a list of strings to a single string with spaces between the list's items, use ' '.join(seq).
>>> ' '.join(['1','2','3'])
'1 2 3'
You can replace ' ' with whatever string you want in between the items.
Mark Rushakoff seems to have solved your immediate problem, but there are some other improvements that could be made to your code.
Always use context managers (with open(filename, mode) as f:) for opening files rather than relying on close getting called manually.
Don't bother reading a whole file into memory very often. Looping over some_file.readlines() can be replaced with looping over some_file directly.
For example, you could have used map(string.strip, ligand_file) or, better yet, [line.strip() for line in ligand_file]
Don't choose names to include the type of the object they refer to. This information can be found other ways.
For example, the code you posted can be simplified to something along the lines of
import sys
from contextlib import nested

some_real_name = sys.argv[1]
other_file = "unique_count_c_from_ac.txt"

with nested(open(some_real_name, "r"), open(other_file, "r")) as (ligand_1, ligand_2):
    for line_1 in ligand_1:
        # Take care of the trailing newline
        line_1 = line_1.strip()
        for line_2 in ligand_2:
            line_2 = line_2.strip()
            numbers = line_2.split()
            if line_1 == numbers[1]:
                # If the second number from this line matches the number that is
                # in the user's file, print all the numbers from this line
                print ' '.join(numbers)
which is more reliable and I believe more easily read.
Note that the algorithmic performance of this is far from ideal because of the nested loops. Depending on your needs this could potentially be improved, but since I don't know exactly what data you need to extract, I can't tell you whether it can be.
The time this takes currently in my code and yours is O(nmq), where n is the number of lines in one file, m is the number of lines in the other, and q is the length of lines in unique_count_c_from_ac.txt. If two of these are fixed/small, then you have linear performance. If two can grow arbitrarily (I sort of imagine n and m can?), then you could look into improving your algorithm, probably using sets or dicts.
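For instance, a sketch of the set/dict idea, reusing the names from the code above:
# Index the second column of unique_count_c_from_ac.txt once, then do O(1) lookups.
# (If several rows share the same second number, keep lists of rows instead.)
by_second_number = {}
with open("unique_count_c_from_ac.txt") as f:
    for line in f:
        numbers = line.split()
        if len(numbers) > 1:
            by_second_number[numbers[1]] = numbers

with open(some_real_name) as f:
    for line in f:
        key = line.strip()
        if key in by_second_number:
            print ' '.join(by_second_number[key])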