I have a certain check to be done and if the check satisfies, I want the result to be printed. Below is the code:
import string
import codecs
import sys
y=sys.argv[1]
list_1=[]
f=1.0
x=0.05
write_in = open("new_file.txt", "w")
write_in_1 = open("new_file_1.txt", "w")
ligand_file=open( y, "r" ) #Open the receptor.txt file
ligand_lines=ligand_file.readlines() # Read all the lines into the array
ligand_lines=map( string.strip, ligand_lines ) #Remove the newline character from all the pdb file names
ligand_file.close()
ligand_file=open( "unique_count_c_from_ac.txt", "r" ) #Open the receptor.txt file
ligand_lines_1=ligand_file.readlines() # Read all the lines into the array
ligand_lines_1=map( string.strip, ligand_lines_1 ) #Remove the newline character from all the pdb file names
ligand_file.close()
s=[]
for i in ligand_lines:
    for j in ligand_lines_1:
        j = j.split()
        if i == j[1]:
            print j
The above code works great, but when I print j it prints like ['351', '342'] and I am expecting to get 351 342 (with one space in between). Since it is more of a Python question, I have not included the input files (basically they are just numbers).
Can anyone help me?
Cheers,
Chavanak
To convert a list of strings to a single string with spaces in between the list's items, use ' '.join(seq).
>>> ' '.join(['1','2','3'])
'1 2 3'
You can replace ' ' with whatever string you want in between the items.
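Applied to the loop in your question, the print statement would become:

print ' '.join(j)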
Mark Rushakoff seems to have solved your immediate problem, but there are some other improvements that could be made to your code.
Always use context managers (with open(filename, mode) as f:) for opening files rather than relying on close getting called manually.
Don't read a whole file into memory when you don't need to. Looping over some_file.readlines() can be replaced with looping over some_file directly.
For example, you could have used map(string.strip, ligand_file) or better yet [line.strip() for line in ligand_file].
Don't choose names to include the type of the object they refer to. This information can be found other ways.
For example, the code you posted can be simplified to something along the lines of
import sys
from contextlib import nested  # Python 2; on Python 3, put both opens in one with-statement

some_real_name = sys.argv[1]
other_file = "unique_count_c_from_ac.txt"
with nested(open(some_real_name, "r"), open(other_file, "r")) as (ligand_1, ligand_2):
    # Read the second file once up front; iterating over a file exhausts it,
    # so it could not be re-scanned for every line of the first file.
    split_lines = [line.strip().split() for line in ligand_2]
    for line_1 in ligand_1:
        # Take care of the trailing newline
        line_1 = line_1.strip()
        for numbers in split_lines:
            if line_1 == numbers[1]:
                # If the second number from this line matches the number that is
                # in the user's file, print all the numbers from this line
                print ' '.join(numbers)
which is more reliable and I believe more easily read.
Note that the algorithmic performance of this is far from ideal because of the nested loops. Depending on your needs, this could potentially be improved, but since I don't know exactly what data you need to extract, I can't tell you whether you can.
The time this takes currently in my code and yours is O(nmq), where n is the number of lines in one file, m is the number of lines in the other, and q is the length of lines in unique_count_c_from_ac.txt. If two of these are fixed/small, then you have linear performance. If two can grow arbitrarily (I sort of imagine n and m can?), then you could look into improving your algorithm, probably using sets or dicts.
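For example, if membership testing is all you need, a set of the numbers from the user's file turns each lookup into O(1). A minimal sketch, assuming the same file layout as above and that every line of unique_count_c_from_ac.txt has at least two fields:

wanted = set()
with open(some_real_name, "r") as f:
    for line in f:
        wanted.add(line.strip())

with open("unique_count_c_from_ac.txt", "r") as f:
    for line in f:
        numbers = line.split()
        if numbers[1] in wanted:  # O(1) membership test instead of an inner loop
            print ' '.join(numbers)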
Related
I'm trying to figure out how to get the first N strings from a txt file, and store them into an array. Right now, I have code that gets every string from a txt file, separated by a space delimiter, and stores it into an array. However, I want to be able to only grab the first N number of strings from it, not every single string. Here is my code (and I'm doing it from a command prompt):
import sys
f = open(sys.argv[1], "r")
contents = f.read().split(' ')
f.close()
I'm sure that the only line I need to fix is:
contents = f.read().split(' ')
I'm just not sure how to limit it here to N number of strings.
If the file is really big, but not too big--that is, big enough that you don't want to read the whole file (especially in text mode or as a list of lines), but not so big that you can't page it into memory (which means under 2GB on a 32-bit OS, but a lot more on 64-bit), you can do this:
import itertools
import mmap
import re
import sys
n = 5
# Notice that we're opening in binary mode. We're going to do a
# bytes-based regex search. This is only valid if (a) the encoding
# is ASCII-compatible, and (b) the spaces are ASCII whitespace, not
# other Unicode whitespace.
with open(sys.argv[1], 'rb') as f:
    # map the whole file into memory--this won't actually read
    # more than a page or so beyond the last space
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # match and decode all space-separated words, but do it lazily...
    matches = re.finditer(br'(.*?)\s', m)
    bytestrings = (match.group(1) for match in matches)
    strings = (b.decode() for b in bytestrings)
    # ... so we can stop after 5 of them ...
    nstrings = itertools.islice(strings, n)
    # ... and turn that into a list of the first 5
    contents = list(nstrings)
Obviously you can combine steps together, even cramming the whole thing into a giant one-liner if you want. (An idiomatic version would be somewhere between that extreme and this one.)
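For instance, a more compact version of the same sketch (same imports and assumptions as above) might look like:

with open(sys.argv[1], 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    words = (match.group(1).decode() for match in re.finditer(br'(.*?)\s', m))
    contents = list(itertools.islice(words, n))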
If you're fine with reading the whole file (assuming it's not memory prohibitive to do so) you can just do this:
strings_wanted = 5
strings = open('myfile').read().split()[:strings_wanted]
That works like this:
>>> s = 'this is a test string with more than five words.'
>>> s.split()[:5]
['this', 'is', 'a', 'test', 'string']
If you actually want to stop reading exactly as soon as you've reached the nth word, you pretty much have to read a byte at a time. But that's going to be slow, and complicated. Plus, it's still not really going to stop reading after the nth word, unless you're reading in binary mode and decoding manually, and you disable buffering.
As long as the text file has line breaks (as opposed to being one giant 80MB line), and it's acceptable to read a few bytes past the nth word, a very simple solution will still be pretty efficient: just read and split line by line:
import sys

n = 5  # however many words you want
f = open(sys.argv[1], "r")
contents = []
for line in f:
    contents += line.split()
    if len(contents) >= n:
        del contents[n:]
        break
f.close()
What about just:
output = input[:3]
output will contain the first three strings in input.
I want to read a text file and copy text that is in between '~~~~~~~~~~~~~' into an array. However, I'm new in Python and this is as far as I got:
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a=[0]
b=0
for i,line in enumerate(searchlines):
if '~~~~~~~~~~~~~' in line:
b=b+1
if '~~~~~~~~~~~~~' not in line:
if 's1mb4d' in line:
break
a.insert(b,line)
This is what I envisioned:
First I read all the lines of the text file,
then I declare 'a' as an array in which text should be added,
then I declare 'b' because I need it as an index. The number of lines in between the '~~~~~~~~~~~~~' separators is not always the same; that's why I use 'b', so I can put lines of text into one array index until a new '~~~~~~~~~~~~~' is found.
I check for '~~~~~~~~~~~~~', if found I increase 'b' so I can start adding lines of text into a new array index.
The text file ends with 's1mb4d', so once its found, the program ends.
And if '~~~~~~~~~~~~~' is not found in the line, I add text to the array.
But things didn't go well. Only one line of the entire text between those '~~~~~~~~~~~~~' separators is being copied to each array index.
Here is an example of the text file:
~~~~~~~~~~~~~
Text123asdasd
asdasdjfjfjf
~~~~~~~~~~~~~
123abc
321bca
gjjgfkk
~~~~~~~~~~~~~
You could use a regular expression; give this a try:
import re
input_text = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = []
for line in input_text:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
What it does: it reads line by line and collects all characters except '~'; if a line consists only of '~' it is ignored, and every line with text is appended to your a list afterwards.
And just because we can, a one-liner (excluding import and source, ofc):
import re
lines = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = [re.findall(r'[^\~]+', line) for line in lines if len(re.findall(r'[^\~]+', line)) != 0]
In Python, the solution to a large part of problems is often to find the right function in the standard library that does the job. Here you should try using split instead; it should be way easier.
If I understand your goal correctly, you can do it like this:
joined_lines = ''.join(searchlines)
result = joined_lines.split('~~~~~~~~~~~~~')
The first line joins your list of lines into a single string, and the second one cuts that big string every time it encounters the '~~~~~~~~~~~~~' sequence.
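As a self-contained sketch, assuming the textfile.txt layout shown in the question:

with open("textfile.txt", "r", encoding='utf8') as f:
    blocks = f.read().split('~~~~~~~~~~~~~')
# drop the empty/whitespace-only chunks produced around leading and trailing separators
blocks = [block.strip() for block in blocks if block.strip()]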
I tried to clean it up to the best of my knowledge; try this and let me know if it works. We can work together on this! :)
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a = []
currentline = ''
for i,line in enumerate(searchlines):
currentline += line
if '~~~~~~~~~~~~~' in line:
a.append(currentline)
elif 's1mb4d' in line:
break
Some notes:
You can use elif for your break condition
append will automatically add each finished block to the end of the list
currentline keeps accumulating text from each line as long as the line doesn't contain 's1mb4d' or the '~~~' separator, which I think is what you want
import re

s = ['']
with open('path\\to\\sample.txt') as f:
    for l in f:
        a = l.strip().split("\n")
        s += a

a = []
for line in s:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
print a
>>> [['Text123asdasd asdasdjfjfjf'], ['123abc 321bca gjjgfkk']]
If you're willing to impose/accept the constraint that the separator should be exactly 13 ~ characters (actually '\n%s\n' % ('~' * 13) to be specific) ...
then you could accomplish this for relatively normal sized files using just
#!/usr/bin/env python
separator = '\n%s\n' % ('~' * 13)
with open('somefile.txt') as f:
    results = f.read().split(separator)
# Use your results, a list of the strings separated by these separators.
Note that '~' * 13 is a way, in Python, of constructing a string by repeating some smaller string thirteen times. 'xx%sxx' % 'YY' is a way to "interpolate" one string into another. Of course you could just paste the thirteen ~ characters into your source code ... but I would consider constructing the string as shown to make it clear that the length is part of the string's specification --- that this is part of your file format requirements ... and that any other number of ~ characters won't be sufficient.
If you really want any line of any number of ~ characters to serve as a separator, then you'll want to use the split() function from the regular expressions module (re.split()) rather than the .split() method provided by the built-in string objects.
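A minimal sketch of that approach, assuming each separator line is surrounded by newlines:

import re

with open('somefile.txt') as f:
    # any line consisting only of one or more '~' characters acts as a separator
    results = re.split(r'\n~+\n', f.read())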
Note that this snippet of code will return all of the text between your separator lines, including any newlines they include. There are other snippets of code which can filter those out. For example given our previous results:
# ... refine results by filtering out newlines (replacing them with spaces)
results = [' '.join(each.split('\n')) for each in results]
(You could also use the .replace() string method; but I prefer the join/split combination). In this case we're using a list comprehension (a feature of Python) to iterate over each item in our results (which we're arbitrarily naming each), performing our transformation on it; the resulting list is bound back to the name results. I highly recommend learning and getting comfortable with list comprehensions if you're going to learn Python. They're commonly used, and can be a bit exotic compared to the syntax of many other programming and scripting languages.
This should work on MS Windows as well as Unix (and Unix-like) systems because of how Python handles "universal newlines." To use these examples under Python 3 you might have to work a little on the encodings and string types. (I didn't need to for my Python3.6 installed under MacOS X using Homebrew ... but just be forewarned).
I have 2 text files (new.txt and master.txt). Each has different data stored as such:
Cory 12 12:40:12.016221
Suzy 64 12:40:33.404614
Trent 145 12:40:56.640052
(categorized by the first set of numbers appearing on each line)
I have to scan each line of new.txt for the name (e.g. Suzy), check if there is a duplicate in master.txt, and if there isn't, add that line to master.txt categorized by that line's number (e.g. 64 in Suzy 64 12:40:33.404614).
I have written the following script, but it falls into a loop of checking the first line of new.txt (I know why, I just don't know how to work around not closing fileinput.input('new.txt') so that I can then open fileinput.input('master.txt') further down the loop). I feel like I've highly overcomplicated things for myself, and any help is appreciated.
import fileinput
import re

end_of_file = False
while end_of_file == False:
    for line in fileinput.input('new.txt', inplace=1):
        end_of_file = fileinput.isstdin()  # ends while loop if on last line of new.txt
        user_f_line_list = line.split()
        master_f = open('master.txt', 'r')
        master_f_read = master_f.read()
        master_f.close()
        fileinput.close()
        if not re.findall(user_f_line_list[0], master_f_read):
            for line in fileinput.input('master.txt', inplace=1):
                master_line_list = line.split()
                if int(user_f_line_list[1]) <= int(master_line_list[1]):
                    written = False
                    while written == False:
                        written = True
                        print(' '.join(user_f_line_list))
                print(line, end='')
            fileinput.close()
And for reference, master.txt starts with startline 0 and ends with endline 1000000000000000 so that it is impossible for the categorizing to be out of range.
Some suggestions:
Open master.txt into a list with readlines().
Use an OrderedDict from the collections module - it is the same as a regular dict but preserves the order. Make each key the unique element - a tuple in this case (e.g. ("Cory", 12)). Make the value whatever comes after.
Now you can very rapidly check to see if the entry is present by if key in my_dict:.
If it isn't, you can insert it. If you need to insert in order, it'll take a bit more work, but not too much. I would insert at the end, convert to a list when all is done, and apply a sort to the list with a custom key function to specify how to sort.
Output it back to the file.
I won't say it's necessarily shorter than your solution, but it is a lot cleaner.
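Here is a minimal sketch of those suggestions. It keys on the name alone, since that is what the question checks for, and it assumes every line has the name number timestamp format shown above (the startline/endline sentinel lines would need separate handling):

from collections import OrderedDict

entries = OrderedDict()
with open('master.txt') as master:
    for line in master:
        name, number, timestamp = line.split()
        entries[name] = (int(number), timestamp)

with open('new.txt') as new:
    for line in new:
        name, number, timestamp = line.split()
        if name not in entries:  # very fast membership check
            entries[name] = (int(number), timestamp)

# sort by the numeric field, then write everything back out
with open('master.txt', 'w') as master:
    for name, (number, timestamp) in sorted(entries.items(), key=lambda item: item[1][0]):
        master.write('{} {} {}\n'.format(name, number, timestamp))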
I'm somewhat new to Python. I'm trying to sort through a list of strings and integers. The list contains some symbols that need to be filtered out (i.e. ro!ad should end up road). Also, they are all on one line separated by a space. So I need to use 2 arguments: one for the input file and one for the output file. It should be sorted with numbers first and then the words without the special characters, each on a different line. I've been looking at loads of list functions but am having some trouble putting this together as I've never had to do anything like this. Any takers?
So far I have the basic stuff
#!/usr/bin/python
import sys
try:
    infilename = sys.argv[1]  # outfilename = sys.argv[2]
except:
    print "Usage: ", sys.argv[0], "infile outfile"; sys.exit(1)

ifile = open(infilename, 'r')
#ofile = open(outfilename, 'w')
data = ifile.readlines()
r = sorted(data, key=lambda item: (int(item.partition(' ')[0])
                                   if item[0].isdigit() else float('inf'), item))
ifile.close()
print '\n'.join(r)
#ofile.writelines(r)
#ofile.close()
The output shows exactly what was in the file, exactly as the file is written and not sorted at all. The goal is to take a file (arg1.txt), sort it, and make a new file (arg2.txt), both given as command line arguments. I used print in this case to speed up the editing but need to have it write to a file. That's why the output file areas are commented, but feel free to tell me I'm stupid if I screwed that up, too! Thanks for any help!
When you have an issue like this, it's usually a good idea to check your data at various points throughout the program to make sure it looks the way you want it to. The issue here seems to be in the way you're reading in the file.
data = ifile.readlines()
is going to read in the entire file as a list of lines. But since all the entries you want to sort are on one line, this list will only have one entry. When you try to sort the list, you're passing a list of length 1, which is going to just return the same list regardless of what your key function is. Try changing the line to
data = ifile.readlines()[0].split()
You may not even need the key function any more since numbers are placed before letters by default. I don't see anything in your code to remove special characters though.
Since they are on the same line you don't really need readlines:
with open('some.txt') as f:
    data = f.read()  # now data = "item 1 item2 etc..."
You can use re to filter out unwanted characters:
import re
data = "ro!ad"
fixed_data = re.sub("[!?#$]","",data)
partition is maybe overkill here:
data = "hello 23frank sam wilbur"
my_list = data.split() # ["hello","23frank","sam","wilbur"]
print sorted(my_list)
However, you will need to do more to force the numbers to sort first; maybe something like
numbers = [x for x in my_list if x[0].isdigit()]
strings = [x for x in my_list if not x[0].isdigit()]
sorted_list = sorted(numbers, key=lambda x: int(re.sub("[^0-9]", "", x))) + sorted(strings)
Also, they are all on one line separated by a space.
So your file contains a single line?
data = ifile.readlines()
This makes data into a list of the lines in your file. All 1 of them.
r = sorted(...)
This makes r the sorted version of that list.
To get the words from the line, you can .read() the entire file as a single string, and .split() it (by default, it splits on whitespace).
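For example, a minimal sketch combining that with the sorting goal (putting digit-led words first is an assumption based on the question):

import sys

with open(sys.argv[1]) as ifile:
    words = ifile.read().split()  # splits the single line on whitespace
r = sorted(words, key=lambda w: (not w[0].isdigit(), w))
print '\n'.join(r)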
I have files where there is a varying number of header lines in random order, followed by the data that I need, which spans the number of lines given by the corresponding header (e.g. Lines: 3):
from: blah#blah.com
Subject: foobarhah
Lines: 3
Extra: More random stuff
Foo Bar Lines of Data, which take up
some arbitrary long amount characters on a single line, but no matter how long
they still only take up the number of lines as specified in the header
How can I get at that data in one read of the file?
P.S. The data is from the 20Newsgroups corpus.
Edit: The quick solution, I guess, which only works if I relax the constraint on reading only once, is this:
[1st read] Find out total_num_of_lines and match on the first Lines: header,
[2nd read] discard the first (total_num_of_lines - header_num_of_lines) lines and then read the rest of the file.
I'm still unaware of a way to read in the data in one pass though.
I'm not quite sure you even need the beginning of the file in order to get its contents. Consider using split:
_, contents = file_contents.split(os.linesep + os.linesep, 1)  # e.g. \n\n
If, however, the Lines parameter does count, you can use the technique suggested above along with parsing the file headers:
headers, contents = file_contents.split(os.linesep + os.linesep, 1)
# Get the declared line count from the "Lines:" header
headers_list = [line.split() for line in headers.splitlines()]
lines_count = int([line[1] for line in headers_list if line[0].lower() == 'lines:'][0])
# Get the first lines_count lines of the contents
real_contents = contents.splitlines()[:lines_count]
Assuming we have the general case where there could be multiple messages following each other, maybe something like
from itertools import takewhile

def msgreader(file):
    while True:
        header = list(takewhile(lambda x: x.strip(), file))
        if not header:
            break
        header_dict = {k: v.strip() for k, v in (line.split(":", 1) for line in header)}
        line_count = int(header_dict['Lines'])
        message = [next(file) for i in xrange(line_count)]  # or islice..
        yield message
would work, where
with open("53903") as fp:
for message in msgreader(fp):
print message
would give all the listed messages. For this particular use case the above would be overkill, but frankly it's not much harder to extract all the header info than it is only the one line. I'd be surprised if there weren't already a module to parse these messages, though.
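For instance, since these look like RFC 822 style messages (an assumption about the corpus format), the standard library's email module may already do the header parsing:

import email

with open("53903") as fp:  # "53903" is one message file, as above
    msg = email.message_from_file(fp)
print msg['Lines'], msg['Subject']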
You need to store the state of whether the headers have finished. That's all.
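A minimal sketch of that idea, assuming one message per file and a blank line after the headers:

def read_message(path):
    headers, body = [], []
    in_headers = True
    with open(path) as f:
        for line in f:
            if in_headers and not line.strip():
                in_headers = False  # the headers have finished
                continue
            (headers if in_headers else body).append(line)
    return headers, body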