Reading characters from file into list - python

I have a part of a program with the following code:
file1 = [line.strip() for line in open('sometext.txt').readlines()]
print(file1[0])
When the code is executed it gives me the whole contents of the txt file, which is one very long sentence.
How would I go about reading every letter and placing it in a list so that I can index each character separately? I have used the list() function, which seems to put the whole text file into a list rather than into individual characters.

You can use file.read() rather than file.readlines():
file1 = [char for char in open('sometext.txt').read()]
You don't really need a list comprehension here, however; instead you can do this:
file1 = list(open('sometext.txt').read())
Also, as @furas mentioned in his comment, you don't need a list to have indexing. str also has an index method, so you could write file1 = open('sometext.txt').read() and still be able to use file1.index(). Note that str also has a find method, which returns -1 if the substring is not found rather than raising a ValueError.
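For example, a quick sketch of the difference (assuming 'sometext.txt' as above):
file1 = open('sometext.txt').read()
print(file1[0])            # strings support positional indexing directly
print(file1.index('a'))    # position of the first 'a'; raises ValueError if there is none
print(file1.find('zzz'))   # returns -1 instead of raising when the substring is absent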

A read() is enough. Plus, if you want to store the list without '\n' and whitespace characters, you can use:
char_list = [ch for ch in open('test.txt').read() if ch != '\n' if ch != ' ']
You can remove the if clauses if you want to keep the newlines and spaces.

Related

How to make everything in a string lowercase

I am trying to write a function that will print a poem reading the words backwards and make all the characters lower case. I have looked around and found that .lower() should make everything in the string lowercase; however I cannot seem to make it work with my function. I don't know if I'm putting it in the wrong spot or if .lower() will not work in my code. Any feedback is appreciated!
Below is my code before entering .lower() anywhere into it:
def readingWordsBackwards( poemFileName ):
    inputFile = open(poemFileName, 'r')
    poemTitle = inputFile.readline().strip()
    poemAuthor = inputFile.readline().strip()
    inputFile.readline()
    print ("\t You have to write the readingWordsBackwards function \n")
    lines = []
    for line in inputFile:
        lines.append(line)
    lines.reverse()
    for i, line in enumerate(lines):
        reversed_line = remove_punctuation(line).strip().split(" ")
        reversed_line.reverse()
        print(len(lines) - i, " ".join(reversed_line))
    inputFile.close()
As per official documentation,
str.lower()
Return a copy of the string with all the cased characters converted to lowercase.
So you could use it at several different places, e.g.
lines.append(line.lower())
reversed_line = remove_punctuation(line).lower().strip().split(" ")
or
print(len(lines) - i, " ".join(reversed_line).lower())
(this would not store the result, but only print it, so it is likely not what you want).
Note that, depending on the language of the source text, you may need a little caution, since case-conversion rules differ between languages (the Turkish dotted/dotless i is a classic example).
See also other relevant answers for How to convert string to lowercase in Python
I think changing the second-to-last line to this may work:
print(len(lines) - i, " ".join(reversed_line).lower())
You could probably insert it here, for instance:
lines.append(line.lower())
Note that line.lower() does not do anything to line itself (strings are immutable!), but returns a new string object. To make line hold that lowercase string, you'd do:
line = line.lower()
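For instance, a tiny sketch of the difference:
s = "Hello World"
s.lower()        # returns 'hello world', but s itself is unchanged
s = s.lower()    # rebind s to the new lowercase string
print(s)         # hello world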
Store the contents of the file in a variable, then assign it the result of .lower(), like so:
fileContents = inputFile.readline()
fileContents = fileContents.lower()

Having problems with strings and arrays

I want to read a text file and copy the text that is in between the '~~~~~~~~~~~~~' lines into an array. However, I'm new to Python and this is as far as I got:
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a=[0]
b=0
for i,line in enumerate(searchlines):
if '~~~~~~~~~~~~~' in line:
b=b+1
if '~~~~~~~~~~~~~' not in line:
if 's1mb4d' in line:
break
a.insert(b,line)
This is what I envisioned:
First I read all the lines of the text file,
then I declare 'a' as an array to which text should be added,
then I declare 'b' because I need it as an index. The number of lines in between the '~~~~~~~~~~~~~' separators is not always the same, which is why I use 'b': it lets me put lines of text into one array index until a new '~~~~~~~~~~~~~' is found.
I check for '~~~~~~~~~~~~~'; if it is found I increase 'b' so I can start adding lines of text into a new array index.
The text file ends with 's1mb4d', so once it's found, the program ends.
And if '~~~~~~~~~~~~~' is not found in the line, I add the text to the array.
But things didn't go well. Only one line of the text between those '~~~~~~~~~~~~~' separators ends up in each array index.
Here is an example of the text file:
~~~~~~~~~~~~~
Text123asdasd
asdasdjfjfjf
~~~~~~~~~~~~~
123abc
321bca
gjjgfkk
~~~~~~~~~~~~~
You could use a regular expression; give this a try:
import re
input_text = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = []
for line in input_text:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
It reads line by line and looks for all characters other than '~'; if a line consists only of '~' it is ignored, and every line that contains text is appended to your a list.
And just because we can, here it is as a one-liner (excluding the import and the source list, of course):
import re
lines = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = [re.findall(r'[^\~]+', line) for line in lines if len(re.findall(r'[^\~]+', line)) != 0]
In Python, the solution to a large part of problems is often to find the right function in the standard library that does the job. Here you should try using split instead; it should be way easier.
If I understand correctly your goal, you can do it like that :
joined_lines = ''.join(searchlines)
result = joined_lines.split('~~~~~~~~~~~~~')
The first line joins your list of lines into a single string, and the second one cuts that big string every time it encounters the '~~~~~~~~~~~~~' sequence.
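For instance, with the sample file from the question, a rough sketch of what you would get (the empty and newline-only entries come from the leading and trailing separator lines):
joined_lines = ''.join(searchlines)
blocks = joined_lines.split('~~~~~~~~~~~~~')
# blocks is now roughly ['', '\nText123asdasd\nasdasdjfjfjf\n', '\n123abc\n321bca\ngjjgfkk\n', '\n']
blocks = [b.strip() for b in blocks if b.strip()]  # drop the empty pieces and surrounding newlines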
I tried to clean it up to the best of my knowledge, try this and let me know if it works. We can work together on this!:)
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a = []
currentline = ''
for i,line in enumerate(searchlines):
currentline += line
if '~~~~~~~~~~~~~' in line:
a.append(currentline)
elif 's1mb4d' in line:
break
Some notes:
You can use elif for your break condition
append will automatically add each new block to the end of the list
currentline keeps accumulating text from each line as long as it doesn't contain 's1mb4d' or the ~~~ separator, which I think is what you want
import re

s = ['']
with open('path\\to\\sample.txt') as f:
    for l in f:
        a = l.strip().split("\n")
        s += a

a = []
for line in s:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
print a
>>> [['Text123asdasd asdasdjfjfjf'], ['123abc 321bca gjjgfkk']]
If you're willing to impose/accept the constraint that the separator should be exactly 13 ~ characters (actually '\n%s\n' % ( '~' * 13) to be specific) ...
then you could accomplish this for relatively normal-sized files using just:
#!/usr/bin/env python
separator = '\n%s\n' % ('~' * 13)
with open('somefile.txt') as f:
    results = f.read().split(separator)
# Use your results, a list of the strings separated by these separators.
Note that '~' * 13 is a way, in Python, of constructing a string by repeating some smaller string thirteen times. 'xx%sxx' % 'YY' is a way to "interpolate" one string into another. Of course you could just paste the thirteen ~ characters into your source code ... but I would consider constructing the string as shown to make it clear that the length is part of the string's specification --- that this is part of your file format requirements ... and that any other number of ~ characters won't be sufficient.
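To illustrate, these expressions evaluate as follows in an interactive session:
>>> '~' * 13
'~~~~~~~~~~~~~'
>>> 'xx%sxx' % 'YY'
'xxYYxx'
>>> '\n%s\n' % ('~' * 13)
'\n~~~~~~~~~~~~~\n'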
If you really want any line of any number of ~ characters to serve as a separator, then you'll want to use re.split() from the regular expressions module rather than the .split() method provided by the built-in string objects (see the sketch at the end of this answer).
Note that the read().split() snippet above will return all of the text between your separator lines, including any newlines they contain. There are other snippets of code which can filter those out. For example, given our previous results:
# ... refine results by filtering out newlines (replacing them with spaces)
results = [' '.join(each.split('\n')) for each in results]
(You could also use the .replace() string method, but I prefer the join/split combination.) In this case we're using a list comprehension (a feature of Python) to iterate over each item in our results (which we're arbitrarily naming each), performing our transformation on it, and the resulting list is bound back to the name results. I highly recommend learning and getting comfortable with list comprehensions if you're going to learn Python; they're commonly used and can be a bit exotic compared to the syntax of many other programming and scripting languages.
This should work on MS Windows as well as Unix (and Unix-like) systems because of how Python handles "universal newlines." To use these examples under Python 3 you might have to work a little on the encodings and string types. (I didn't need to for my Python3.6 installed under MacOS X using Homebrew ... but just be forewarned).
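As a rough sketch of the re.split() approach mentioned above (assuming a "separator" is any line made up entirely of ~ characters; the pattern shown is one reasonable choice, not the only one):
import re

with open('somefile.txt') as f:
    text = f.read()

# Split on any line consisting solely of one or more ~ characters,
# consuming the trailing newline of the separator line as well.
blocks = [b for b in re.split(r'(?m)^~+$\n?', text) if b.strip()]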

Removing newline characters in a txt files

I'm doing the Project Euler problems and am at problem #8. I wanted to just copy the huge 1000-digit number into a numberToProblem8.txt file and then read it into my script, but I can't find a good way to remove the newlines from it. Here is my code:
hugeNumberAsStr = ''
with open('numberToProblem8.txt') as f:
    for line in f:
        aSingleLine = line.strip()
        hugeNumberAsStr.join(aSingleLine)

print(hugeNumberAsStr)
I'm using print() only to check if it works, and well, it doesn't; it doesn't print out anything. What's wrong with my code? I remove all the junk with strip() and then use join() to add each cleaned line into hugeNumberAsStr (I need a string to join those lines; I'm going to use int() later on), and that's repeated for all the lines.
Here is the .txt file with a number in it.
What about something like:
hugeNumberAsStr = open('numberToProblem8.txt').read()
hugeNumberAsStr = hugeNumberAsStr.strip().replace('\n', '')
Or even:
hugeNumberAsStr = ''.join([d for d in hugeNumberAsStr if d.isdigit()])
I was able to simplify it to the following to get the number from that file:
>>> int(open('numberToProblem8.txt').read().replace('\n',''))
731671765313306249192251196744265747423553491949349698352031277450632623957831801698480186947885184385861560789112949495459501737958331952853208805511125406987471585238630507156932909632952274430435576689664895044524452316173185640309871112172238311362229893423380308135336276614282806444486645238749303589072962904915604407723907138105158593079608667017242712188399879790879227492190169972088809377665727333001053367881220235421809751254540594752243525849077116705560136048395864467063244157221553975369781797784617406495514929086256932197846862248283972241375657056057490261407972968652414535100474821663704844031998900088952434506585412275886668811642717147992444292823086346567481391912316282458617866458359124566529476545682848912883142607690042242190226710556263211111093705442175069416589604080719840385096245544
You need to do hugeNumberAsStr += aSingleLine instead of hugeNumberAsStr.join(..)
str.join() joins the elements of the iterable you pass in, using the string as the separator, and returns the joined string. It doesn't update the value of hugeNumberAsStr as you think. You want to build a new string with the \n characters removed, so you need to store the result somewhere; for that you need to append each line's content to the string.
The join method for strings simply takes an iterable object and concatenates each part together. It then returns the resulting concatenated string. As stated in help(str.join):
join(...)
    S.join(iterable) -> str

    Return a string which is the concatenation of the strings in the
    iterable. The separator between elements is S.
Thus the join method really does not do what you want.
The concatenation line should be more like:
hugeNumberAsString += aSingleLine
Or even:
hugeNumberAsString += line.strip()
Which gets rid of the extra line of code doing the strip.
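Putting those suggestions together, a minimal corrected version of the original loop might look like this (same filename as in the question):
hugeNumberAsStr = ''
with open('numberToProblem8.txt') as f:
    for line in f:
        hugeNumberAsStr += line.strip()  # concatenate each cleaned line; join() alone would not update the variable

print(hugeNumberAsStr)
hugeNumber = int(hugeNumberAsStr)  # convert once all the digits are assembled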

complex regex matches in python

I have a txt file that contains the following data:
chrI
ATGCCTTGGGCAACGGT...(multiple lines)
chrII
AGGTTGGCCAAGGTT...(multiple lines)
I want to first find 'chrI' and then iterate through the multiple lines of ATGC until I find the xth char. Then I want to print the xth char until the yth char. I have been using regex but once I have located the line containing chrI, I don't know how to continue iterating to find the xth char.
Here is my code:
for i, line in enumerate(sacc_gff):
    for match in re.finditer(chromo_val, line):
        print(line)
        for match in re.finditer(r"[ATGC]{%d},{%d}\Z" % (int(amino_start), int(amino_end)), line):
            print(match.group())
What the variables mean:
chromo_val = chrI
amino_start = (some start point my program found)
amino_end = (some end point my program found)
Note: amino_start and amino_end need to be in variable form.
Please let me know if I can clarify anything for you. Thank you.
It looks like you are working with fasta data, so I will provide an answer with that in mind, but if it isn't, you can still use the sub-sequence selection part.
fasta_data = {}  # creates an empty dictionary keyed by sequence id
with open( fasta_file, 'r' ) as fh:
    for line in fh:
        if line[0] == '>':
            seq_id = line.rstrip()[1:]  # strip newline character and remove leading '>' character
            fasta_data[seq_id] = ''
        else:
            fasta_data[seq_id] += line.rstrip()

# return substring from chromosome 'chrI' with a first character at amino_start up to but not including amino_end
sequence_string1 = fasta_data['chrI'][amino_start:amino_end]

# return substring from chromosome 'chrII' with a first character at amino_start up to and including amino_end
sequence_string2 = fasta_data['chrII'][amino_start:amino_end+1]
fasta format:
>chr1
ATTTATATATAT
ATGGCGCGATCG
>chr2
AATCGCTGCTGC
Since you are working with fasta files which are formatted like this:
>Chr1
ATCGACTACAAATTT
>Chr2
ACCTGCCGTAAAAATTTCC
and are a bioinformatics major, I am guessing you will be manipulating sequences often, so I recommend installing the Perl package called FAST. Once it is installed, to get characters 2 through 14 of every sequence you would do this:
fascut 2..14 fasta_file.fa
The recent publication for FAST and its GitHub repository describe a whole toolbox for manipulating molecular sequence data on the command line.

Nested lists and converting to floats

I'm trying to read some numbers from a text file and convert them to a list of floats, but nothing I try seems to work right.
Here's my code right now:
python_data = open('C:\Documents and Settings\redacted\Desktop\python_lengths.txt', 'r')
python_lengths = []
for line in python_data:
    python_lengths.append(line.split())
python_lengths.sort()
print python_lengths
It returns:
[['12.2'], ['26'], ['34.2'], ['5.0'], ['62'], ['62'], ['62.6']]
(all brackets included)
But I can't convert it to a list of floats with any regular commands like:
python_lengths = float(python_lengths)
or:
float_lengths = [map(float, x) for x in python_lengths]
because it seems to be nested or something?
That is happening because .split() always returns a list, even if only one element is present. If you change your python_lengths.append(line.split()) to python_lengths.extend(line.split()), you will get the flat list you expected.
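A quick sketch of the difference, using one of the values from the question's file:
line = '12.2\n'
nested = []
flat = []
nested.append(line.split())   # nested is now [['12.2']] (a list inside a list)
flat.extend(line.split())     # flat is now ['12.2'] (the items themselves)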
@eumiro's answer is correct, but here is something else that can help:
numbers = []
with open('C:\Documents and Settings\redacted\Desktop\python_lengths.txt', 'r') as f:
    for line in f.readlines():
        numbers.extend(line.split())

numbers.sort()
print numbers
def floats_from_file(f):
    for line in f:
        for word in line.split():
            yield float(word)

with open('C:/Documents and Settings/redacted/Desktop/python_lengths.txt') as f:
    python_lengths = list(floats_from_file(f))

python_lengths.sort()
print python_lengths
Note that you can use forward slashes, even on Windows. If you want to use backslashes you should use a "raw" string, to avoid problems. What sort of problems? Well, some characters are special with backslash; for example, \n represents a newline. If you just put a path in plain quotes, and one of the directory names starts with n, you will get a newline there. Solutions are to double the backslashes, use raw strings, or just use forward slashes.
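For illustration, here are three ways of writing the same path from the question that avoid the backslash problem:
p1 = 'C:/Documents and Settings/redacted/Desktop/python_lengths.txt'      # forward slashes
p2 = r'C:\Documents and Settings\redacted\Desktop\python_lengths.txt'     # raw string
p3 = 'C:\\Documents and Settings\\redacted\\Desktop\\python_lengths.txt'  # doubled backslashes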
