I am trying to write a Python script
that breaks a continuous string into lines
whenever max_line_length is exceeded.
It must not break words,
so it searches for the last occurrence of a whitespace character
and replaces it with a newline character.
For some reason it does not break within the specified limit.
E.g. with max_line_length = 80,
the text sometimes breaks at 82 or 83, etc.
I have been trying to fix this for quite some time,
but it feels like I have tunnel vision
and can't see the problem here:
#!/usr/bin/python
import sys
if len(sys.argv) < 3:
    print('usage: $ python3 breaktext.py <max_line_length> <file>')
    print('example: $ python3 breaktext.py 80 infile.txt')
    exit()
filename = str(sys.argv[2])
with open(filename, 'r') as file:
    text_str = file.read().replace('\n', '')
m = int(sys.argv[1]) # max_line_length
text_list = list(text_str) # convert string to list
l = 0; # line_number
i = m+1 # line_character_index
index = m+1 # total_list_index
while index < len(text_list):
    while text_list[l * m + i] != ' ':
        i -= 1
        pass
    text_list[l * m + i] = '\n'
    l += 1
    i = m+1
    index += m+1
    pass
text_str = ''.join(text_list)
print(text_str)
I guess we'll take this from the top.
text_str = file.read().replace('\n', '')
Here's one assumption about the input data that I can't verify. You're replacing all the newline characters with nothing; if there were no spaces next to them, the words on either side of each newline get fused together, which means the code below can never break the lines in the same places again.
text_list = list(text_str) # convert string to list
This splits the input file into single character strings. I guess you might have done so to make it mutable, such that you can replace individual characters, but it's a very expensive operation and loses all the features of a string. Python is a high level language that would allow you to split into e.g. words instead.
index = m+1 # total_list_index
while index < len(text_list):
    #...
    index += m+1
Let's consider what this means. We stop entering the loop once index reaches the text_list length, and index advances in steps of m+1. So we split at most math.floor(len(text)/(max_line_length+1)) times. Unless every line is exactly max_line_length characters long, not counting the space we replace with a newline, that's too few splits. Too few splits means lines that are too long, at least at the end.
l = 0; # line_number
i = m+1 # line_character_index
#loop:
    while text_list[l * m + i] != ' ':
        i -= 1
    text_list[l * m + i] = '\n'
    l += 1
    i = m+1
This is making things difficult with index math. Quite clearly the one index we ever use is l * m + i. This moves in a quite odd way; it searches backwards for a space, then leaps forward as l increments and i resets. Whatever position it had reversed to is lost as all the leaps are in steps of m.
Let's apply m=5 to the string "Fee fie faw fum who did you see now". For the first iteration, 0 * 5 + 5+1 hits the second word, and i seeks back to the first space. The first line then is "Fee", as expected. The second search starts at 1*5 + 5+1, which is a space, and the second line becomes "fie faw", which already exceeds our limit of 5! The reason is that l * m isn't the beginning of the line; it's actually in the middle of "fie", a discrepancy which can only grow as you continue through the file. It grows whenever you split off a line that is shorter than m.
The solution involves remembering where you did your split. That could be as simple as replacing l * m with index, and updating it by index += i instead of m+1.
Another odd effect happens if you ever encounter a word that exceeds the maximum line length. Beyond meaning a line is longer than the limit, i will still search backwards until it finds a space; that space could then be in an earlier line altogether, producing extra short lines as well as too long ones. That's a result of handling the entire text as one array and not limiting which section we're looking at.
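To make the two points above concrete, here is a minimal sketch of the loop with the line start remembered explicitly. The function name break_text and the variable start are my own, not from the question, and the forward search for over-long words is just one possible policy:

```python
def break_text(text, max_len):
    """Replace spaces with newlines so that lines stay within max_len where possible."""
    chars = list(text)
    start = 0  # index of the first character of the current line (hypothetical name)
    while len(chars) - start > max_len:
        # A full line occupies start .. start+max_len-1, so the latest
        # position at which the break may sit is start+max_len.
        i = start + max_len
        # Search backwards, but never past the start of the current line.
        while i >= start and chars[i] != ' ':
            i -= 1
        if i < start:
            # No space in this line: the word is longer than max_len.
            # Assumed policy: leave the word whole and break after it.
            i = start + max_len
            while i < len(chars) and chars[i] != ' ':
                i += 1
            if i == len(chars):
                break  # the over-long word is the last one
        chars[i] = '\n'
        start = i + 1  # the next line begins right after the break
    return ''.join(chars)

print(break_text("Fee fie faw fum who did you see now", 5))
```

Because start always points just past the previous break, a short line no longer shifts the search window for the lines after it.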
Personally I'd much rather use Python's built in methods, such as str.rindex, which can find a particular character in a given region within a string:
s = "Fee fie faw fum who did you see now"
maxlen = 5
start = 8
end = s.rindex(' ', start, start+maxlen)
print(s[start:end])
start = end + 1
We also, as PaulMcG pointed out, can go full "batteries included" and use the standard library textwrap module for the entire task.
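For completeness, a short sketch of the textwrap route. The width of 7 and the sample sentence are only for illustration; break_long_words=False and break_on_hyphens=False are my guess at matching the "do not break words" requirement:

```python
import textwrap

text = "Fee fie faw fum who did you see now"
# fill() returns a single string with newlines inserted at word boundaries
wrapped = textwrap.fill(text, width=7,
                        break_long_words=False,
                        break_on_hyphens=False)
print(wrapped)
```

With break_long_words=False, a word longer than the width is kept whole on its own line, so that line will exceed the limit.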
I am reading in a .txt file that looks like this: TTACGATATACGA etc., but contains thousands of characters. I can read the file and output it as a CSV according to user input that sets the characters per column and the number of columns; however, it writes a new file each time.
Ideally I would like a format like this per file:
User enters 4 and 3.
Output: TCAG, TGCT, TACG,
My current output is this:
TCAGTGCTTACG
I have tried looking at string splitting but I can't seem to get it to work.
Here is what I've written so far, apologies if it's poor:
#user input for parameters
user_input_character = int(input("Enter how many characters you'd like per column"))
user_input_column = int(input("Enter how many columns you'd like"))
character_per_column = user_input_character
columns_per_entry = user_input_column
characters_to_read = int((character_per_column * columns_per_entry))
print("Total characters: " + str(characters_to_read))
#counts used to set letters to be taken into intake
index_start = 0
index_finish = characters_to_read
count =1
#open the file to be read
lines = []
test_file = open("dna.txt", "r")
for line in test_file:
    line = line.strip()
    if not line:
        continue
    lines.append(',')
#read the file and take note of its size for index purposes
read_file = test_file.read()
file_size = read_file.__len__()
print((file_size))
i = 1
index = 0
#use loop to make more than one file output
while(index < 50):
    #print count used to measure progress for testing
    print('the count is', count)
    count += 1
    index += characters_to_read
    print('index: ',index)
    #intake only uses letters from index count per file
    intake = read_file[index_start:index_finish]
    print(intake)
    index_start += characters_to_read
    index_finish +=characters_to_read
    #output a txt file with the 4 letters from intake as a individually numbered txt file
    text_file_output = open("Output%i.csv"%i,'w')
    i += 1
    text_file_output.write(intake)
    text_file_output.close()
    #define path to print to console for file saving
    path = os.path.abspath("Output%i")
    directory = os.path.dirname(path)
    print(path)
test_file.close()
Here's a simple way to split your DNA data into rows consisting of columns and chunks of specified sizes. It assumes that the DNA data is in a single string with no white space characters (spaces, tabs, newlines, etc).
To test this code, I create some fake data using the random module.
from random import seed, choice
seed(42)
# Make some random DNA data
num = 66
data = ''.join([choice('ACGT') for _ in range(num)])
print(data, '\n')
# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
row = []
for i in range(0, len(data), chunksize):
    chunk = data[i:i+chunksize]
    row.append(chunk)
    if len(row) == cols:
        print(' '.join(row))
        row = []
if row:
    print(' '.join(row))
output
AAGCCCAATAAACCACTCTGACTGGCCGAATAGGGATATAGGCAACGACATGTGCGGCGACCCTTG
AAGC CCAA TAAA
CCAC TCTG ACTG
GCCG AATA GGGA
TATA GGCA ACGA
CATG TGCG GCGA
CCCT TG
On my old 2GHz 32 bit machine, running Python 3.6.0, this code can process and save to disk around 100000 chars per second (that includes the time taken to generate the random data).
Here's a version of the above code that handles spaces and blank lines in the input data. It reads the input data from a file and writes the output to a CSV file.
Firstly, here's the code I used to create some fake test data, which I saved to "dnatest.txt".
from random import seed, choice, randrange
seed(123)
# Make some random DNA data containing spaces
pool = 'ACGT' * 5 + ' '
for _ in range(15):
    # Choose a random line length
    size = randrange(50, 70)
    data = ''.join([choice(pool) for _ in range(size)])
    print(data)
    # Randomly add a blank line
    if randrange(5) < 2:
        print()
Here's the file it created:
dnatest.txt
AGCATCACCGGCCAGCGTCACGTAGAGGTCGAAACCGTATCCGATGT AGG
ACC TTACTAC CGTACGGCAGGAGGAGGG TATTACAC CT TCTCACGAGCAAGGAATA
ATTGATGGCACAGC AAGATCCGCTA CCGATTG CAACCA CATACGAT CGACCAGATGG
ACAGAACAGATCTTGGGAATGGAACAGGAGAGAGTGTGGGCCACATTAAAGTGATAAT ATTT
TCTGTCGTGGGGCACCAAACCATGCTAATGCACGACTGGGT GAGGGTTGAGAGCCTACTATCCTCAG
TCGATCGAGATGACCCTCCTATCGCAACAGCTGTCAGTGTCCAGAG ACGTCGC CA
TAGGTCTGGAAAC GCACTCCCCTC GGAATAGTCTACACGAGTCCATTATGTC
GATCTGACTATGGGGACCATAACGGCTATGCGACCATGGACTGGTTCGAG
GATTCCCGTTCTACAT CACCTT ACCTCTGATAA CGACTGGTTCGA GGGTCTC CC
AAA CGTCTATTATGTCATAACGTAACTCTGC CGTAGTTTGATCAAACGTACAGCCACCAC
TGAAGC CGCCTCGAACCGCGTCCGACCCTGGGGAGCCTGGGGCCCAGCA
CCTTAGC ACTGCGA AGCTACACCCCACGAGTAATTTG T CTATCGT CCG
GCCTCGTTTCCTTGTGAAATTAT ATGGT C AGTCTTCAATCAA CACCTA CTAATAA
GTGCTAGC CCGGGGATCTTGTCCTGGTCCA GGTC AT AATCCGTGCTCAAATTACATGGCTT
TTAGTAATGAGTTCGGGC GCGCCCTCAAAGTTGGTCTAGAAGCGCGCAGTTTTCCTTAGGT
Here's the code that processes that data:
# Input & output file names
iname = 'dnatest.txt'
oname = 'dnatest.csv'
# Read the data and eliminate all whitespace
with open(iname) as f:
    data = ''.join(f.read().split())
# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
with open(oname, 'w') as f:
    row = []
    for i in range(0, len(data), chunksize):
        chunk = data[i:i+chunksize]
        row.append(chunk)
        if len(row) == cols:
            f.write(', '.join(row) + '\n')
            row = []
    if row:
        f.write(', '.join(row) + '\n')
And here's the file it creates:
dnatest.csv
AGCA, TCAC, CGGC
CAGC, GTCA, CGTA
GAGG, TCGA, AACC
GTAT, CCGA, TGTA
GGAC, CTTA, CTAC
CGTA, CGGC, AGGA
GGAG, GGTA, TTAC
ACCT, TCTC, ACGA
GCAA, GGAA, TAAT
TGAT, GGCA, CAGC
AAGA, TCCG, CTAC
CGAT, TGCA, ACCA
CATA, CGAT, CGAC
CAGA, TGGA, CAGA
ACAG, ATCT, TGGG
AATG, GAAC, AGGA
GAGA, GTGT, GGGC
CACA, TTAA, AGTG
ATAA, TATT, TTCT
GTCG, TGGG, GCAC
CAAA, CCAT, GCTA
ATGC, ACGA, CTGG
GTGA, GGGT, TGAG
AGCC, TACT, ATCC
TCAG, TCGA, TCGA
GATG, ACCC, TCCT
ATCG, CAAC, AGCT
GTCA, GTGT, CCAG
AGAC, GTCG, CCAT
AGGT, CTGG, AAAC
GCAC, TCCC, CTCG
GAAT, AGTC, TACA
CGAG, TCCA, TTAT
GTCG, ATCT, GACT
ATGG, GGAC, CATA
ACGG, CTAT, GCGA
CCAT, GGAC, TGGT
TCGA, GGAT, TCCC
GTTC, TACA, TCAC
CTTA, CCTC, TGAT
AACG, ACTG, GTTC
GAGG, GTCT, CCCA
AACG, TCTA, TTAT
GTCA, TAAC, GTAA
CTCT, GCCG, TAGT
TTGA, TCAA, ACGT
ACAG, CCAC, CACT
GAAG, CCGC, CTCG
AACC, GCGT, CCGA
CCCT, GGGG, AGCC
TGGG, GCCC, AGCA
CCTT, AGCA, CTGC
GAAG, CTAC, ACCC
CACG, AGTA, ATTT
GTCT, ATCG, TCCG
GCCT, CGTT, TCCT
TGTG, AAAT, TATA
TGGT, CAGT, CTTC
AATC, AACA, CCTA
CTAA, TAAG, TGCT
AGCC, CGGG, GATC
TTGT, CCTG, GTCC
AGGT, CATA, ATCC
GTGC, TCAA, ATTA
CATG, GCTT, TTAG
TAAT, GAGT, TCGG
GCGC, GCCC, TCAA
AGTT, GGTC, TAGA
AGCG, CGCA, GTTT
TCCT, TAGG, T
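As a possible variation on the code above, the rows could also be handed to the standard csv module instead of being joined by hand. This is only a sketch: the helper name chunk_rows is made up, and an in-memory io.StringIO buffer stands in for the real output file so the example runs without touching the disk:

```python
import csv
import io

def chunk_rows(data, chunksize, cols):
    """Yield lists of up to `cols` chunks, each up to `chunksize` characters long."""
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
    for i in range(0, len(chunks), cols):
        yield chunks[i:i + cols]

# Stand-in for open('dnatest.csv', 'w', newline='')
buffer = io.StringIO()
csv.writer(buffer).writerows(chunk_rows('TCAGTGCTTACGAA', 4, 3))
print(buffer.getvalue())
```

Note that csv.writer separates fields with a bare comma (no following space) and ends rows with '\r\n' by default, so the file differs slightly from the one shown above.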
I have a text file with 2000 words, one word on each line. I'm trying to create a code that prints out two random words from the textfile on the same line every 10 seconds. The beginning part of my text file is shown below:
slip
melt
true
therapeutic
scarce
visitor
wild
tickle
.
.
.
The code that I've written is:
from time import sleep
import random
my_file = open("words.txt", "r")
i = 1
while i > 0:
    number_1 = random.randint(0, 2000)
    number_2 = random.randint(0, 2000)
    word_1 = my_file.readline(number_1)
    word_2 = my_file.readline(number_2)
    print(word_1.rstrip() + " " + word_2.rstrip())
    i += 1
    sleep(10)
When I execute the code, instead of printing two random words it starts printing all the words in order from the top of the text file. I'm not sure why this is happening, since number_1 and number_2 are set inside the loop, so each time two words are printed they should change to two other random numbers. I don't think moving number_1 and number_2 outside of the loop will work either, since they would be fixed to two values and the code would just keep printing the same two words. Does anyone know what I can do to fix the code?
readline() doesn't take any parameters and just returns the next line in your file input*. Instead, try to create a list using readlines(), then choose randomly from that list. So here, you'd make word_list = my_file.readlines(), then choose random elements from word_list.
*Correction: readline() does take a parameter of the number of bytes to read. The documentation for the function doesn't seem to explicitly state this. Thanks E. Ducateme!
my_file.readline(number_1) does not do what you want. The argument to readline is the maximum number of characters (bytes for a file opened in binary mode) to read from the next line, not the position of the line in the file.
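The effect of that size argument can be seen in isolation with a small sketch. An io.StringIO object stands in for the opened words.txt here, which is an assumption made only to keep the example self-contained:

```python
import io

# Stand-in for open("words.txt"); behaves like a text file for readline()
fake_file = io.StringIO("slip\nmelt\ntrue\n")

first = fake_file.readline(2)     # at most 2 characters of the current line
second = fake_file.readline(2)    # continues where the last call stopped
third = fake_file.readline(2)     # stops early at the end of the line
fourth = fake_file.readline(100)  # a large size just returns the whole next line

print(repr(first), repr(second), repr(third), repr(fourth))
```

Since the words in the file are far shorter than most of the random sizes, nearly every call simply returns the next whole line, which is why the words come out in file order.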
As the other answer mentioned, a better approach is to first read the lines into a list and then randomly select words from it:
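from time import sleep
import random
# Read every word once; the with block closes the file afterwards
with open("words.txt", "r") as my_file:
    words = my_file.readlines()
while True:
    # randint includes both endpoints, so the largest valid index is len(words) - 1
    number_1 = random.randint(0, len(words) - 1)
    number_2 = random.randint(0, len(words) - 1)
    word_1 = words[number_1]
    word_2 = words[number_2]
    print(word_1.rstrip() + " " + word_2.rstrip())
    sleep(10)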
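If the index arithmetic is not needed, the random module can pick the words directly. The sketch below uses a tiny in-memory list in place of the file contents, and the function names random_pair and distinct_pair are made up for illustration:

```python
import random

def random_pair(words):
    """Pick two words independently; the same word may come up twice."""
    return random.choice(words), random.choice(words)

def distinct_pair(words):
    """Pick two different positions from the list (needs at least two words)."""
    return tuple(random.sample(words, 2))

words = ["slip", "melt", "true", "therapeutic"]  # stand-in for the file contents
print(" ".join(random_pair(words)))
print(" ".join(distinct_pair(words)))
```

Which of the two fits depends on whether printing the same word twice on one line is acceptable, which the question does not say.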
I have found so much information from previous search on this website but I seem to be stuck on the following issue.
I have two text files that looks like this
Inter.txt ( n-lines but only showed 4 lines,you get the idea)
7275
30000
6693
855
....
rules.txt (2n-lines)
7275
8500
6693
7555
....
3
1000
8
5
....
I want to compare each line of Inter.txt with rules.txt and, in case of a match, jump n lines ahead to get the score for that line. (E.g. 7275 matches, so I jump n lines to get the score 3.)
I wrote the following code, but for some reason I only get output for the first line, when I should get one for each match from my first file. With the previous example, I should have 8 as an output for 6693.
import linecache
inter = open("Inter.txt", "r")
rules = open("rules.txt", "r")
iScore = 0
jump = 266
i=0
for lineInt in inter:
    #i = i+1
    #print(i)
    for lineRul in rules:
        i = i+1
        #print(i)
        if lineInt == lineRul:
            print("Match")
            inc = linecache.getline("rules.txt", i + jump)
            #print(inc)
            iScore = iScore + int(inc)
            print(iScore)
            #break
        else:
            continue
All the print(i) are there because I checked that all the lines were read. I am a novice in Python.
To sum up, I don't understand why I only have one output. Thanks in advance !
OK, I think the main thing blocking you is that a for loop over a file moves the file pointer to the end of the file, and the pointer doesn't reset when you start the loop again.
So when you open rules.txt only once and use that instance in the inner loop, it goes through all the lines only during the first iteration of the outer loop; the second time, it tries to go over the remaining lines, of which there are none.
The solution is to open the file just before the inner loop and close it right after, on every iteration of the outer loop.
This code worked for me.
import linecache
inter = open("Inter.txt", "r")
iScore = 0
jump = 4
for lineInt in inter:
    i=0
    #i = i+1
    #print(i)
    rules = open("rules.txt", "r")
    for lineRul in rules:
        i = i+1
        #print(i)
        if lineInt == lineRul:
            print("Match")
            inc = linecache.getline("rules.txt", i + jump)
            #print(inc)
            iScore = iScore + int(inc)
            print(iScore)
            #break
        else:
            continue
    rules.close()
I also moved the place where i is set to 0 to the beginning of the outer loop, but I guess you'd have found that yourself.
And I changed jump to 4 to fit the example files you gave :p
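The exhausted-file behaviour this answer describes can be reproduced in a few lines. In this sketch an io.StringIO object stands in for the opened rules.txt; seek(0) is shown as one alternative to reopening the file, not as what the code above does:

```python
import io

rules = io.StringIO("7275\n8500\n")  # stand-in for open("rules.txt", "r")

first_pass = [line for line in rules]   # reads every line
second_pass = [line for line in rules]  # nothing left: the pointer is at the end
rules.seek(0)                           # move the pointer back to the start
third_pass = [line for line in rules]   # reads every line again

print(len(first_pass), len(second_pass), len(third_pass))
```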
Can you please try this solution:
def get_rules_values(rules_file):
    with open(rules_file, "r") as rules:
        # list() because map() returns a lazy iterator in Python 3
        return list(map(int, rules.readlines()))
def get_rules_dict(rules_values):
    # the first half of the file holds the keys, the second half the scores
    half = len(rules_values) // 2
    return dict(zip(rules_values[:half], rules_values[half:]))
def get_inter_values(inter_file):
    with open(inter_file, "r") as inter:
        return list(map(int, inter.readlines()))
rules_dict = get_rules_dict(get_rules_values("rules.txt"))
inter_values = get_inter_values("Inter.txt")
for inter_value in inter_values:
    # skip values that have no entry in rules.txt
    if inter_value in rules_dict:
        print(inter_value, rules_dict[inter_value])
Hope it's working for you!
I have a text file, IDlistfix, which contains a list of YouTube video IDs. I'm trying to make a new text file, newlist.txt, which contains the IDs from the first file with apostrophes around them and a comma between the IDs. This is what I've written to accomplish this:
n = open('IDlistfix','r+')
j = open('newlist.txt','w')
line = n.readline()
def listify(rd):
    return '\'' + rd + '\','
for line in n:
    j.write(listify(line))
This gives me an output of ','rUfg2SLliTQ where I'd expect the output to be 'rUfg2SLliTQ',. Where is my function going wrong?
You just have to strip it of newlines:
j.write(listify(line.strip())) # Notice the call of the .strip() method on the String
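To see why the output looks scrambled, it may help to look at the string before and after stripping. This sketch reuses the listify function from the question with one sample line; the variable names raw and stripped are mine:

```python
def listify(rd):
    return "'" + rd + "',"

line = "rUfg2SLliTQ\n"  # each line read from the file still ends with a newline

raw = listify(line)               # the newline sits before the closing quote
stripped = listify(line.strip())  # the newline is removed first

print(repr(raw))
print(repr(stripped))
```

In the raw version the closing quote and comma land at the start of the next output line, which matches the ','rUfg2SLliTQ pattern seen in the file.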
Try to remove trailing whitespace and return a formatted string:
n = open('IDlistfix','r+')
j = open('newlist.txt','w')
line = n.readline()
def listify(rd):
    # remove trailing whitespace
    rd = rd.rstrip()
    # return a formatted string
    # this is generally preferable to '+'
    return "'{0}',".format(rd)
for line in n:
    j.write(listify(line))
The problem must be in
return '\'' + rd + '\','
because rd ends with '\n'.
Remove the '\n' from rd and it should be fine.
The problem is the line break at the end of each line.
Change the loop to:
for line in n:
    j.write(listify(line.replace('\n','')))