I'm trying to read numbers like these from a text file:
00000
11111
22222
33333
44444
I'm trying to slice the string I read from the text file so that a function can work on each row as an integer, like this:
import linecache
with file('textfiletest.txt', 'r') as original: testfile = str(original.read())
lines = len(testfile.splitlines())
for i in range(lines):
    SID = int(testfile[5*i: -(len(testfile)-5*(i+1))])
    print SID
This code prints the lines and then raises an error on the last one, saying it isn't a convertible value:
ValueError: invalid literal for int() with base 10: ''
Note: every line is 5 characters long.
Line reading in Python is so much simpler than that; you are really overcomplicating things here:
with file('textfiletest.txt', 'r') as original:
    for line in original:
        if not line.strip():
            continue
        SID = int(line)
        print SID
This loops over the lines in the file directly, one by one. int() can handle extra whitespace, including the newline character, so all we need to take care of is making sure we skip lines that are empty apart from whitespace.
Your solution doesn't take into account that the newline characters take up space too; lines are 6 characters long with the newline. But why make it so hard for yourself when you clearly already found the str.splitlines() function; you could just have looped over that:
for line in testfile.splitlines():
    # ...
would have given you a loop over the lines in the file contents without trailing newlines and certainly no need for complicated slicing computations.
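As a rough sketch of the splitlines() approach, the conversion can be wrapped in a small function. The name parse_ids and the inline sample string are assumptions made for illustration; they are not part of the original code.

```python
def parse_ids(text):
    """Return one int per non-blank line of text."""
    # splitlines() drops the newline characters, so no slicing arithmetic is needed.
    return [int(line) for line in text.splitlines() if line.strip()]


# Hypothetical stand-in for the contents of textfiletest.txt.
sample = "00000\n11111\n22222\n33333\n44444\n"
print(parse_ids(sample))  # [0, 11111, 22222, 33333, 44444]
```

Note that int("00000") is 0, so leading zeros are lost; keep the values as strings if the zero padding matters.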
I am checking the position of semicolons in text files. I have length-delimited text files having thousands of rows which look like this:
AB;2;43234;343;
CD;4;41234;443;
FE53234;543;
FE;5;53;34;543;
I am using the following code to check the correct position of the semicolons. If a semicolon is missing where I would expect it, a statement is printed:
import glob
path = r'C:\path\*.txt'
for fname in glob.glob(path):
    print("Checking file", fname)
    with open(fname) as f:
        content = f.readlines()
        for count, line in enumerate(content):
            if (line[2:3]!=";"
                or line[4:5]!=";"
                or line[10:11]!=";"
                # really a lot of continuing entries like these
                or line[14:15]!=";"
            ):
                print("\nSemikolon expected, but not found!\nrow:", count+1, "\n", fname, "\n", line)
The code works. No error is thrown and it detects the data row.
My problem now is that I have a lot of semicolons to check and I have really a lot of continuing entries like
or line[xx:xx]!=";"
I think this is inefficient in two respects:
It is visually not nice to have this many code lines. I think it could be shortened.
It is logically inefficient to have this many separate or checks. I think it could be made more efficient, probably decreasing the runtime.
I am looking for an efficient solution which:
Improves the readability
Most importantly: reduces the runtime (as I think the way it is written now is inefficient, with all the or statements)
I only want to check if there are semicolons where I would expect them. Where I need them. I do not care about any additional semicolons in the data fields.
Just going off of what you've written:
filename = ...
with open(filename) as file:
    lines = file.readlines()
delimiter_indices = (2, 4, 10, 14) # The indices in any given line where you expect to see semicolons.
for line_num, line in enumerate(lines):
    if any(line[index] != ";" for index in delimiter_indices):
        print(f"{filename}: Semicolon expected on line #{line_num}")
If the line doesn't have at least 15 characters, this will raise an exception. Also, lines like ;;;;;;;;;;;;;;; are technically valid.
EDIT: Assuming you have an input file that looks like:
AB;2;43234;343;
CD;4;41234;443;
FE;5;53234;543;
FE;5;53;34;543;
(Note: the blank line at the end)
My provided solution works fine. I do not see any exceptions or Semicolon expected on line #... outputs.
If your input file ends with two blank lines, this will raise an exception. If your input file contains a blank line somewhere in the middle, this will also raise an exception. If you have lines in your file that are less than 15 characters long (not counting the last line), this will raise an exception.
You could simply say that every line must meet two criteria to be considered valid:
The current line must be at least 15 characters long (or max(delimiter_indices) + 1 characters long).
All characters at delimiter indices in the current line must be semicolons.
Code:
for line_num, line in enumerate(lines):
    is_long_enough = len(line) >= (max(delimiter_indices) + 1)
    has_correct_semicolons = all(line[index] == ';' for index in delimiter_indices)
    if not (is_long_enough and has_correct_semicolons):
        print(f"{filename}: Semicolon expected on line #{line_num}")
EDIT: My bad, I ruined the short-circuit evaluation for the sake of readability. The following should work:
is_valid_line = (len(line) >= (max(delimiter_indices) + 1)) and (all(line[index] == ';' for index in delimiter_indices))
if not is_valid_line:
    print(f"{filename}: Semicolon expected on line #{line_num}")
If the length of the line is not correct, the second half of the expression will not be evaluated due to short-circuit evaluation, which should prevent the IndexError.
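A minimal sketch of that short-circuit check, pulled out into a function so it can be tried on the sample rows. The function name is_valid_line and the default indices are assumptions for illustration only.

```python
def is_valid_line(line, delimiter_indices=(2, 4, 10, 14)):
    """Return True if line is long enough and has ';' at every expected index."""
    min_length = max(delimiter_indices) + 1
    # `and` short-circuits: the indexing on the right only runs
    # when the length check on the left has already passed.
    return len(line) >= min_length and all(
        line[index] == ";" for index in delimiter_indices
    )


for row in ("AB;2;43234;343;", "FE53234;543;", "FE;5;53;34;543;", ""):
    print(repr(row), is_valid_line(row))
```

The short row and the empty string are reported as invalid instead of raising IndexError, and the row with extra semicolons still passes because only the expected positions are inspected.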
EDIT:
Since you have so many files with so many lines and so many semicolons per line, you could do the max(delimiter_indices) calculation before the loop to avoid having to calculate that value for each line. It may not make a big difference, but you could also just iterate over the file object directly (which yields the next line each iteration), as opposed to loading the entire file into memory before you iterate via lines = file.readlines(). This isn't really required, and it's not as cute as using all or any, but I decided to turn the has_correct_semicolons expression into an actual loop that iterates over the delimiter indices. That way your error message can be a bit more explicit, pointing to the offending index of the offending line. Also, there's a separate error message for when a line is too short.
import glob
import os
delimiter_indices = (2, 4, 10, 14)
max_delimiter_index = max(delimiter_indices)
min_line_length = max_delimiter_index + 1
for path in glob.glob(r"C:\path\*.txt"):
    # glob.glob() returns plain strings, so get the file name via os.path
    filename = os.path.basename(path)
    print(filename.center(32, "-"))
    with open(path) as file:
        for line_num, line in enumerate(file):
            is_long_enough = len(line) >= min_line_length
            if not is_long_enough:
                print(f"{filename}: Line #{line_num} is too short")
                continue
            has_correct_semicolons = True
            for index in delimiter_indices:
                if line[index] != ";":
                    has_correct_semicolons = False
                    break
            if not has_correct_semicolons:
                print(f"{filename}: Semicolon expected on line #{line_num}, character #{index}")
print("All files done")
If you just want to validate the structure of the lines, you can use a regex that is easy to maintain if your requirement changes:
import re
with open(fname) as f:
    for row, line in enumerate(f, 1):
        if not re.match(r"[A-Z]{2};\d;\d{5};\d{3};", line):
            print("\nSemicolon expected, but not found!\nrow:", row, "\n", fname, "\n", line)
Regex demo here.
If you don't actually care about the content and only want to check the position of the ;, you can simplify the regex to: r".{2};.;.{5};.{3};"
Demo for the dot regex.
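If the real files have many fields, typing the dot regex by hand gets tedious. One possible sketch, assuming you know the width of each field, builds the pattern from a tuple of widths and compiles it once. The helper name build_pattern and the widths (2, 1, 5, 3) are illustrative assumptions taken from the sample rows.

```python
import re


def build_pattern(field_widths):
    """Compile a regex where each field is `width` characters followed by ';'."""
    return re.compile("".join(r".{%d};" % width for width in field_widths))


# Widths of the four fields in the sample rows (an assumption for this sketch).
pattern = build_pattern((2, 1, 5, 3))

for row in ("AB;2;43234;343;", "FE53234;543;", "FE;5;53;34;543;"):
    print(repr(row), bool(pattern.match(row)))
```

Compiling once outside the loop avoids re-parsing the pattern for every line, and a changed layout only means editing the tuple.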
So far I can write the code to filter out words that are less than 8 characters long and also words that contain the # or : symbols. However, I can't figure out how to get just the last words. My code looks like this so far:
f = open("file.txt").read()
for words in f.split():
    if len(words) >= 8 and "#" not in words and ":" not in words:
        print(words)
Edit - sorry, I'm pretty new to this, so I've probably done something wrong above as well. The file is quite long, so I'll give the first line and the expected output. The first line is:
"I wish they would show out takes of Dick Cheney #GOPdebates Candidates went after #HillaryClinton 32 times in the #GOPdebate-but remained"
The expected output is "remained"; however, my code outputs "Candidates" and "remained".
for line in open(filename):
    if some_test(line):
        do_rad_thing(line)
I think this is what you want: you already have the some_test part and the do_rad_thing part.
I think this works: you can read the file with readlines(), split each line with split() (passing a delimiter if you need one), then get the last word using [-1].
f = open("file.txt").readlines()
for line in f:
    last_word = line.split()[-1]
This should accomplish what you are trying to do.
Split words of the file into an array using .split() and then access the last value using [-1]. I also put all the illegal characters into an array and just did a check to see if any of the chars in the illegal_chars array are in last_word.
f = open("file.txt").read()
illegal_chars = ["#", ":"]
last_word = f.split()[-1]
if len(last_word) >= 8 and not any(char in last_word for char in illegal_chars):
    print(last_word)
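The expected output in the question is per line, so a possible variation applies the same checks to the last word of every line and skips blank lines. This is only a sketch: the function name last_long_words and its parameters are made up for illustration.

```python
def last_long_words(lines, min_length=8, banned="#:"):
    """Collect the last word of each line if it is long enough and has no banned symbol."""
    result = []
    for line in lines:
        words = line.split()
        if not words:  # blank line: words[-1] would raise IndexError
            continue
        last_word = words[-1]
        if len(last_word) >= min_length and not any(ch in last_word for ch in banned):
            result.append(last_word)
    return result


tweet = (
    "I wish they would show out takes of Dick Cheney #GOPdebates Candidates "
    "went after #HillaryClinton 32 times in the #GOPdebate-but remained"
)
print(last_long_words([tweet, "", "too short"]))  # ['remained']
```

With a real file you could pass the open file object directly, for example last_long_words(open("file.txt")).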
I was searching for how to define an empty array and fill it using a text file and got stuck at one point and couldn't find the solution.
My Script is :
array=[]
file=open("path\\file.txt","r")
array.append(file.readline())
array.append(file.readline()) #wanted to put only first 2 line just for learning.
incident_count=len(array)
print(incident_count)
print(array)
The first problem is that when I put elements into the array, the newline character ("\n") is attached as well. Also, is append the right function for putting elements into an array?
Second, when I try to print the count of the array, it prints the number of characters.
You can use file.readline()[:-1]; this returns the whole line except the last character, which is \n:
array.append(file.readline()[:-1])
array.append(file.readline()[:-1])
Adding [:-1] drops the last character, which is \n. Consider this example:
hello\n
world\n
When you read your file line by line, the \n is included in each line, so to drop it you slice the line from index 0 up to (but not including) index -1:
hello\n
^    ^_______ index 5 or -1 (the \n character)
|____________ index 0
to understand more, take a look at this :
Understanding Python's slice notation
With that change, the script prints:
2
['hello', 'world']
Or, as Omar Einea mentions in his comment, you can use:
array.append(file.readline().rstrip())
array.append(file.readline().rstrip())
This also works when the last line does not end with a \n.
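To illustrate the difference on a last line that has no trailing newline, here is a small sketch. The two helper names are hypothetical, and rstrip("\n") is used so that only newlines are removed (a bare rstrip() also removes trailing spaces and tabs).

```python
def drop_last_char(line):
    """Slice off the final character, whatever it is."""
    return line[:-1]


def drop_newline(line):
    """Remove trailing newline characters only."""
    return line.rstrip("\n")


print(drop_last_char("hello\n"))  # hello
print(drop_last_char("world"))    # worl  (a real character was lost)
print(drop_newline("hello\n"))    # hello
print(drop_newline("world"))      # world
```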
You could also use the str.splitlines() method.
You'll end up with something like this (note extend rather than append, because splitlines() returns a list):
array.extend(file.readline().splitlines())
If you're trying to append lines from a file to a list, each line always comes with "\n", so to avoid this we need to split the content.
Solution 1:
array=[ ]
with open(r"C:\Users\wasim akram\Desktop\was.txt") as fp:
    array = fp.read().splitlines()
print(array)
Solution 2 :
myNames=[ ]
f = open(r"C:\Users\wasim akram\Desktop\was.txt",'r')
for line in f:
    myNames.append(line.strip())
print(myNames)
I want to read a text file and copy text that is in between '~~~~~~~~~~~~~' into an array. However, I'm new in Python and this is as far as I got:
with open("textfile.txt", "r",encoding='utf8') as f:
    searchlines = f.readlines()
a=[0]
b=0
for i,line in enumerate(searchlines):
    if '~~~~~~~~~~~~~' in line:
        b=b+1
    if '~~~~~~~~~~~~~' not in line:
        if 's1mb4d' in line:
            break
        a.insert(b,line)
This is what I envisioned:
First I read all the lines of the text file,
then I declare 'a' as an array in which text should be added,
then I declare 'b' because I need it as an index. The number of lines in between the '~~~~~~~~~~~~~' is not even, that's why I use 'b' so I can put lines of text into one array index until a new '~~~~~~~~~~~~~' was found.
I check for '~~~~~~~~~~~~~', if found I increase 'b' so I can start adding lines of text into a new array index.
The text file ends with 's1mb4d', so once it's found, the program ends.
And if '~~~~~~~~~~~~~' is not found in the line, I add text to the array.
But things didn't go well. Only 1 line of the entire text between those '~~~~~~~~~~~~~' is being copied to each array index.
Here is an example of the text file:
~~~~~~~~~~~~~
Text123asdasd
asdasdjfjfjf
~~~~~~~~~~~~~
123abc
321bca
gjjgfkk
~~~~~~~~~~~~~
You could use a regular expression; give this a try:
import re
input_text = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = []
for line in input_text:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
It reads the input line by line and looks for every run of characters other than '~'. If a line consists only of '~', it is ignored; every line with text is appended to your a list.
And just because we can, here is a one-liner (excluding the import and the source, of course):
import re
lines = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = [re.findall(r'[^\~]+', line) for line in lines if len(re.findall(r'[^\~]+', line)) != 0]
In Python, the solution to a large part of problems is often to find the right function from the standard library that does the job. Here you should try using split instead; it should be way easier.
If I understand your goal correctly, you can do it like this:
joined_lines = ''.join(searchlines)
result = joined_lines.split('~~~~~~~~~~~~~')
The first line joins your list of lines into a single string, and the second one cuts that big string every time it encounters the '~~~~~~~~~~~~~' sequence.
I tried to clean it up to the best of my knowledge, try this and let me know if it works. We can work together on this!:)
with open("textfile.txt", "r",encoding='utf8') as f:
    searchlines = f.readlines()
a = []
currentline = ''
for i,line in enumerate(searchlines):
    if '~~~~~~~~~~~~~' in line:
        # a separator closes the current block; start collecting a new one
        if currentline:
            a.append(currentline)
        currentline = ''
    elif 's1mb4d' in line:
        break
    else:
        currentline += line
Some notes:
You can use elif for the check that triggers your break
append adds each new item to the end of the list, so you don't need to keep an index variable
currentline will continue to accumulate the text of each line as long as the line doesn't contain 's1mb4d' or the ~~~ separator, which I think is what you want
import re
s = ['']
with open('path\\to\\sample.txt') as f:
    for l in f:
        a = l.strip().split("\n")
        s += a
a = []
for line in s:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
print(a)
>>> [['Text123asdasd asdasdjfjfjf'], ['123abc 321bca gjjgfkk']]
If you're willing to impose/accept the constraint that the separator should be exactly 13 ~ characters (actually '\n%s\n' % ( '~' * 13) to be specific) ...
then you could accomplish this for relatively normal sized files using just
#!/usr/bin/python
## (Should be #!/usr/bin/env python; but StackOverflow's syntax highlighter?)
separator = '\n%s\n' % ('~' * 13)
with open('somefile.txt') as f:
    results = f.read().split(separator)
# Use your results, a list of the strings separated by these separators.
Note that '~' * 13 is a way, in Python, of constructing a string by repeating some smaller string thirteen times. 'xx%sxx' % 'YY' is a way to "interpolate" one string into another. Of course you could just paste the thirteen ~ characters into your source code ... but I would consider constructing the string as shown to make it clear that the length is part of the string's specification --- that this is part of your file format requirements ... and that any other number of ~ characters won't be sufficient.
If you really want any line of any number of ~ characters to serve as a separator, then you'll want to use the .split() method from the regular expressions module rather than the .split() method provided by the built-in string objects.
Note that this snippet of code will return all of the text between your separator lines, including any newlines they include. There are other snippets of code which can filter those out. For example given our previous results:
# ... refine results by filtering out newlines (replacing them with spaces)
results = [' '.join(each.split('\n')) for each in results]
(You could also use the .replace() string method, but I prefer the join/split combination.) In this case we're using a list comprehension (a feature of Python) to iterate over each item in our results (which we're arbitrarily naming each), performing our transformation on it, and binding the resulting list back to the name results. I highly recommend learning and getting comfortable with list comprehensions if you're going to learn Python. They're commonly used and can seem a bit exotic compared to the syntax of many other programming and scripting languages.
This should work on MS Windows as well as Unix (and Unix-like) systems because of how Python handles "universal newlines." To use these examples under Python 3 you might have to work a little on the encodings and string types. (I didn't need to for my Python3.6 installed under MacOS X using Homebrew ... but just be forewarned).
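For the variant mentioned above, where a line made of any number of ~ characters counts as a separator, a sketch with re.split might look like the following. The function name split_blocks and the inline sample text are assumptions for illustration; the regex treats only whole lines of ~ (optionally followed by spaces or tabs) as separators.

```python
import re

# A separator is an entire line consisting of one or more "~" characters.
SEPARATOR_LINE = re.compile(r"^~+[ \t]*$\n?", flags=re.MULTILINE)


def split_blocks(text):
    """Return the non-empty blocks of text found between separator lines."""
    parts = SEPARATOR_LINE.split(text)
    return [part.strip("\n") for part in parts if part.strip()]


# Hypothetical stand-in for the file contents shown in the question.
sample = (
    "~~~~~~~~~~~~~\n"
    "Text123asdasd\n"
    "asdasdjfjfjf\n"
    "~~~~~~~~~~~~~\n"
    "123abc\n"
    "321bca\n"
    "gjjgfkk\n"
    "~~~~~~~~~~~~~\n"
)
print(split_blocks(sample))
# ['Text123asdasd\nasdasdjfjfjf', '123abc\n321bca\ngjjgfkk']
```

Each block keeps its internal newlines, so the join/split refinement shown earlier can still be applied afterwards if you want them replaced by spaces.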
I'm working through the Project Euler problems and am at problem #8. I wanted to copy the huge 1000-digit number into a numberToProblem8.txt file and then read it into my script, but I can't find a good way to remove the newlines from it. With this code:
hugeNumberAsStr = ''
with open('numberToProblem8.txt') as f:
    for line in f:
        aSingleLine = line.strip()
        hugeNumberAsStr.join(aSingleLine)
print(hugeNumberAsStr)
I'm using print() only to check whether it works, and, well, it doesn't. It doesn't print anything. What's wrong with my code? I remove all the trash with strip() and then use join() to add that cleaned line to hugeNumberAsStr (I need a string to join those lines, and I'm going to use int() later on), and this is repeated for all the lines.
Here is the .txt file with a number in it.
What about something like:
hugeNumberAsStr = open('numberToProblem8.txt').read()
hugeNumberAsStr = hugeNumberAsStr.strip().replace('\n', '')
Or even:
hugeNumberAsStr = ''.join([d for d in hugeNumberAsStr if d.isdigit()])
I was able to simplify it to the following to get the number from that file:
>>> int(open('numberToProblem8.txt').read().replace('\n',''))
731671765313306249192251196744265747423553491949349698352031277450632623957831801698480186947885184385861560789112949495459501737958331952853208805511125406987471585238630507156932909632952274430435576689664895044524452316173185640309871112172238311362229893423380308135336276614282806444486645238749303589072962904915604407723907138105158593079608667017242712188399879790879227492190169972088809377665727333001053367881220235421809751254540594752243525849077116705560136048395864467063244157221553975369781797784617406495514929086256932197846862248283972241375657056057490261407972968652414535100474821663704844031998900088952434506585412275886668811642717147992444292823086346567481391912316282458617866458359124566529476545682848912883142607690042242190226710556263211111093705442175069416589604080719840385096245544
You need to do hugeNumberAsStr += aSingleLine instead of hugeNumberAsStr.join(..)
str.join() joins the items of the iterable you pass in and returns the resulting string, using the string it is called on as the separator. It doesn't update the value of hugeNumberAsStr as you expect. You want to build a new string with the \n characters removed, so you need to append each cleaned line to that string.
The join method for strings simply takes an iterable object and concatenates each part together. It then returns the resulting concatenated string. As stated in help(str.join):
join(...)
S.join(iterable) -> str
Return a string which is the concatenation of the strings in the
iterable. The separator between elements is S.
Thus the join method really does not do what you want.
The concatenation line should be more like:
hugeNumberAsStr += aSingleLine
Or even:
hugeNumberAsStr += line.strip()
Which gets rid of the extra line of code doing the strip.
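A short sketch of both points: join() returns a new string and leaves the original untouched, and it can also do the whole job in one pass when called on an empty separator. The helper name read_digits and the three sample lines are made up for illustration (they are the first digits of the number, split arbitrarily).

```python
def read_digits(lines):
    """Concatenate the stripped lines into one string of digits."""
    return "".join(line.strip() for line in lines)


# join() does not modify the string it is called on; it returns a new one.
separator = ""
returned = separator.join("123")
print(repr(separator), repr(returned))  # '' '123'

# Hypothetical stand-in for iterating over the open file.
sample_lines = ["73167\n", "17653\n", "13306\n"]
huge_number_as_str = read_digits(sample_lines)
print(huge_number_as_str)       # 731671765313306
print(int(huge_number_as_str))  # 731671765313306
```

With the real file, the same helper can be given the file object: read_digits(open('numberToProblem8.txt')).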