Removes white spaces while reading in a file - python

with open(filename, "r") as f:
    for line in f:
        line = (' '.join(line.strip().split())).split()
Can anyone break down the line where whitespaces get removed?
I understand line.strip().split() first removes leading and trailing spaces from line then the resulting string gets split on whitespaces and stores all words in a list.
But what does the remaining code do?

The expression ' '.join(line.strip().split()) creates a string consisting of all the list elements separated by exactly one space. Applying the split() method to this string again returns a list containing all the words in the string that were separated by whitespace.

Here's a breakdown:
# Opens the file
with open(filename, "r") as f:
    # Iterates through each line
    for line in f:
        # Rewriting this line, below:
        # line = (' '.join(line.strip().split())).split()
        # Assuming line was " foo bar quux "
        stripped_line = line.strip()   # "foo bar quux"
        parts = stripped_line.split()  # ["foo", "bar", "quux"]
        joined = ' '.join(parts)       # "foo bar quux"
        parts_again = joined.split()   # ["foo", "bar", "quux"]
Is this what you were looking for?

That code is pointlessly complicated is what it is.
There is no need to strip if you're doing a no-arg split next (no-arg split drops leading and trailing whitespace as a side effect), so line.strip().split() can simplify to line.split().
The join and re-split don't change a thing: join sticks the pieces from the first split back together with spaces, then split re-splits on those very same spaces. So you could save the time spent joining only to split again, and just keep the results of the first split, changing it to:
line = line.split()
and it would be functionally identical to the original:
line = (' '.join(line.strip().split())).split()
and faster to boot. I'm guessing the code you were handed was written by someone who didn't understand splitting and joining either, and just threw stuff at their problem without understanding what it did.
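A quick way to convince yourself of the equivalence is to run both versions on a few awkward inputs. The sketch below is illustrative only; the sample strings and the function names original and simplified are made up for the check, not taken from the question.

```python
def original(line):
    # The statement from the question, unchanged.
    return (' '.join(line.strip().split())).split()


def simplified(line):
    # No-argument split() already drops leading/trailing whitespace
    # and treats any run of whitespace as a single separator.
    return line.split()


samples = ["  foo   bar\tquux \n", "", "   \n", "single", "a  b\r\n"]
for s in samples:
    assert original(s) == simplified(s)

print(simplified("  foo   bar\tquux \n"))  # ['foo', 'bar', 'quux']
```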

Here is an explanation of the code:
with open(filename, "r") as f:
    for line in f:
        line = (' '.join(line.strip().split())).split()
First, line.strip() removes leading and trailing whitespace from the line, and .split() breaks the result into a list of words on whitespace.
Then ' '.join(...) converts that list back into a single string with the words separated by one space each. Finally, the second .split() converts that string back into a list.
All of this is superfluous. The statement should simply be:
line = line.split()
If you still want to strip each element (redundant here, because a no-argument split() never leaves whitespace inside the items), use:
line = list(map(str.strip, line.split()))

I think they are doing this to maintain a constant amount of whitespace. The strip and split remove every run of whitespace (it could be 5 spaces and a tab), and the join then puts a single space back in its place.
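If a constant amount of whitespace is the goal, the join is the step that achieves it, and the result has to stay a string. A minimal sketch, assuming a hypothetical helper name normalize_spaces:

```python
def normalize_spaces(line):
    # split() discards every run of whitespace; join puts back exactly one space.
    return ' '.join(line.split())


print(repr(normalize_spaces("  foo \t  bar     quux \n")))  # 'foo bar quux'
```

Splitting the normalized string again, as the statement in the question does, throws that formatting away and leaves the same list that line.split() would have produced.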

Related

How do I combine lines in a text file in a specific order?

I'm trying to transform the text in a file according the following rule: for each line, if the line does not begin with "https", add that word to the beginning of subsequent lines until you hit another line with a non-https word.
For example, given this file:
Fruit
https://www.apple.com//
https://www.banana.com//
Vegetable
https://www.cucumber.com//
https://www.lettuce.com//
I want
Fruit-https://www.apple.com//
Fruit-https://www.banana.com//
Vegetable-https://www.cucumber.com//
Vegetable-https://www.lettuce.com//
Here is my attempt:
one = open("links.txt", "r")
for two in one.readlines():
    if "https" not in two:
        sitex = two
    else:
        print (sitex + "-" +two)
Here is the output of that program, using the above sample input file:
Fruit
-https://www.apple.com//
Fruit
-https://www.banana.com//
Vegetable
-https://www.cucumber.com//
Vegetable
-https://www.lettuce.com//
What is wrong with my code?
First, sitex still ends with the newline character read from the file, so call rstrip() on it to remove that newline. (credit to BrokenBenchmark)
Second, print() appends a newline every time it is called, and two already ends with one, so pass end="" to avoid printing a blank line after each link.
So your code should look like this:
one = open("links.txt", "r")
for two in one.readlines():
    if "https" not in two:
        sitex = two.rstrip()
    else:
        print (sitex + "-" +two,end="")
one.close()
Also always close the file when you are done.
Lines in your file end with "\n", the newline character.
You can remove whitespace (including "\n") from a string using strip() (both ends) or rstrip() / lstrip() (one end only).
print() adds a "\n" at its end by default; you can replace it with something else using the end parameter:
print("something", end=" ")
print("more") # ==> 'something more' on one line
Fix for your code:
# use a context manager for better file handling
with open("data.txt","w") as f:
    f.write("""Fruit
https://www.apple.com//
https://www.banana.com//
Vegetable
https://www.cucumber.com//
https://www.lettuce.com//
""")
with open("data.txt") as f:
    what = ""
    # iterate file line by line instead of reading all at once
    for line in f:
        # remove whitespace from current line, including \n
        # front AND back - you could use rstrip here as well
        line = line.strip()
        # only do something for non-empty lines (your file does not
        # contain empty lines, but the last line may be empty)
        if line:
            # easier to understand condition without negation
            if line.startswith("http"):
                # printing adds a \n at the end
                print(f"{what}-{line}") # line & what are stripped
            else:
                what = line
Output:
Fruit-https://www.apple.com//
Fruit-https://www.banana.com//
Vegetable-https://www.cucumber.com//
Vegetable-https://www.lettuce.com//
See:
str.lstrip([chars])
str.rstrip([chars])
str.strip([chars])
[chars] are optional - if not given, whitespaces are removed.
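A few illustrative calls (the sample strings are made up) show the difference between the three methods and the effect of the optional chars argument:

```python
s = "  \tFruit\n"
print(repr(s.strip()))       # 'Fruit'
print(repr(s.lstrip()))      # 'Fruit\n'
print(repr(s.rstrip()))      # '  \tFruit'
print(repr(s.rstrip("\n")))  # '  \tFruit' (only the newline is removed)

# chars is a set of characters to remove, not a prefix or suffix string
trimmed = "xyxhelloyx".strip("xy")
print(repr(trimmed))         # 'hello'
```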
You need to strip the trailing newline from the line if it doesn't contain 'https':
sitex = two
should be
sitex = two.rstrip()
You need to do something similar for the else block as well, as ShadowRanger points out:
print (sitex + "-" +two)
should be
print (sitex + "-" + two.rstrip())
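Putting the two fixes together, the same logic can be written as a small generator that works on any iterable of lines, which makes it easy to try without a file. This is only a sketch; the name prefix_links and the use of startswith("https") as the test are assumptions for illustration.

```python
def prefix_links(lines):
    """Yield 'Label-URL' for each URL line, using the most recent non-URL line as the label."""
    label = ""
    for raw in lines:
        line = raw.strip()      # drop the trailing newline and stray spaces
        if not line:
            continue            # ignore blank lines
        if line.startswith("https"):
            yield label + "-" + line
        else:
            label = line


sample = [
    "Fruit\n",
    "https://www.apple.com//\n",
    "https://www.banana.com//\n",
    "Vegetable\n",
    "https://www.cucumber.com//\n",
]
for out in prefix_links(sample):
    print(out)
```

Because the function accepts any iterable, an open file object can be passed in directly in place of the sample list.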

How can this particular statement be made faster?

[[u'the', u'terse', u'announcement', u'state-run', u'news', u'agency', u'didnt', u'identify', u'the', u'aggressor', u'but', u'mister', u'has', u'accused', u'neighboring', u'country', u'of', u'threatening', u'to', u'attack', u'its', u'nuclear', u'installations'], [], [u'government', u'officials']]
I want to remove the empty lines, represented by the empty brackets [] above. I am currently using:
with codecs.open("textfile.txt", "r", "utf-8") as f:
    for line in f:
        dataset.append(line.lower().strip().split()) #dataset contains the data above in the format shown
lines=[sum((line for line in dataset if line), [])]
This statement takes pretty long to remove the empty lines. Is there a better way to remove empty lines from a list of lists and still maintain the format shown?
You can skip the whitespace-only lines when reading the file:
with codecs.open("textfile.txt", "r", "utf-8") as f:
    dataset = [line.lower().split() for line in f if not line.isspace()]
Note that split() ignores leading/trailing whitespace, so strip() is redundant.
EDIT:
Your question is very unclear, but it seems from the comments that all you want to do is read a file and remove all the empty lines. If that is correct then you simply need to do:
with codecs.open("textfile.txt", "r", "utf-8") as f:
    dataset = [line.lower() for line in f if not line.isspace()]
Now dataset is a list of lower-cased lines (i.e. strings). If you want to combine them into one string, you can do:
text = ''.join(dataset)
I am a little confused why you are doing:
lines = [sum((line for line in dataset if line), [])]
First off by adding square brackets around the call to sum you end up with a list with one element: the result of sum, not sure if that was intended...
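Regardless, the result of sum() will be a list of all the words in the file that were separated by whitespace. If this is the desired end result, then you can simply use re.split:
with open(...) as f:
    links = [re.split(r"\s+", f.read())]
    #is it possible you instead wanted:
    #links = re.split(r"\s+", f.read())
Here \s matches any whitespace character ("\n", " ", "\t" etc.) and the + means one or more, so it handles multiple newlines or multiple spaces. Do not use \W for this: it matches any non-word character, so it would also split on punctuation such as the hyphen in state-run. Note that if the text starts or ends with whitespace, re.split leaves an empty string at that end of the result; f.read().split() avoids this.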
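On the speed question itself: sum(list_of_lists, []) builds a new list at every addition, so its cost grows quadratically with the total number of words. A minimal sketch of linear-time alternatives follows; the small dataset literal is a stand-in for the data in the question.

```python
from itertools import chain

dataset = [["the", "terse", "announcement"], [], ["government", "officials"]]

# Drop empty lines but keep the list-of-lists shape.
non_empty = [words for words in dataset if words]

# Flatten to one list of words; empty sublists contribute nothing.
flat = list(chain.from_iterable(dataset))

print(non_empty)  # [['the', 'terse', 'announcement'], ['government', 'officials']]
print(flat)       # ['the', 'terse', 'announcement', 'government', 'officials']
```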

How to extract last line of text in Python (excluding new lines)?

Textfile:
1
2
3
4
5
6
\n
\n
I know lines[-1] gets you the last line, but I want to disregard any new lines and get the last line of text (6 in this case).
The best approach regarding memory is to exhaust the file. Something like this:
with open('file.txt') as f:
    last = None
    for line in (line for line in f if line.rstrip('\n')):
        last = line
print(last)
It can be done more elegantly though. A slightly different approach:
with open('file.txt') as f:
    last = None
    for last in (line for line in f if line.rstrip('\n')):
        pass
print(last)
For a small file you can just read all of the lines, discarding any empty ones. Notice that I've used an inner generator to strip the lines before excluding them in the outer one.
with open(textfile) as fp:
    last_line = [l2 for l2 in (l1.strip() for l1 in fp) if l2][-1]
with open('file') as f:
    print([i for i in f.read().split('\n') if i != ''][-1])
This is just an edit to Avinash Raj's answer (but since I'm a new account, I can't comment on it). This will preserve any None values in your data (i.e. if the data in your last line is "None" it will work, though depending on your input this may not be an issue).
with open('path/to/file') as infile:
    for line in infile:
        if not line.strip('\n'):
            continue
        answer = line
print(answer)
This will print 6 with a newline at the end. You can decide how to strip that. Following are some options:
answer.rstrip('\n') removes trailing newlines
answer.rstrip() removes trailing whitespaces
answer.strip() removes any surrounding whitespaces
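The loop-and-overwrite idea can be wrapped in a helper that also treats whitespace-only lines as empty and returns the text without its newline. A sketch, assuming the hypothetical name last_nonblank_line and using io.StringIO in place of a real file:

```python
import io


def last_nonblank_line(lines):
    """Return the last line containing non-whitespace text, or None if there is none."""
    last = None
    for line in lines:
        stripped = line.strip()
        if stripped:
            last = stripped
    return last


fake_file = io.StringIO("1\n2\n3\n4\n5\n6\n\n\n")
print(last_nonblank_line(fake_file))  # 6
```

Only one line is held in memory at a time, so this also works for large files.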
with open ('file.txt') as myfile:
    for num,line in enumerate(myfile):
        pass
# note: num is the zero-based index of the final line, not its text
print(num)

Python rstrip() (for tabs) not working as expected

I was trying out the rstrip() function, but it doesn't work as expected.
For example, if I run this:
lines = ['tra\tla\tla\t\t\t\n', 'tri\tli\tli\t\t\t\n', 'tro\tlo\tlo\t\t\t\n']
for line in lines:
    line.rstrip('\t')
print(lines)
It returns
['tra\tla\tla\t\t\t\n', 'tri\tli\tli\t\t\t\n', 'tro\tlo\tlo\t\t\t\n']
whereas I want it to return:
['tra\tla\tla\n', 'tri\tli\tli\n', 'tro\tlo\tlo\n']
What is the problem here?
The function returns the new, stripped string, but you discard that return value.
Use a list comprehension to rebuild the whole lines list. You also need to set the trailing newline aside first, because rstrip('\t') stops at the first character that is not a tab, and here that character is the newline.
lines = [line[:-1].rstrip('\t') + '\n' for line in lines]
Demo:
>>> lines = ['tra\tla\tla\t\t\t\n', 'tri\tli\tli\t\t\t\n', 'tro\tlo\tlo\t\t\t\n']
>>> [line[:-1].rstrip('\t') + '\n' for line in lines]
['tra\tla\tla\n', 'tri\tli\tli\n', 'tro\tlo\tlo\n']
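Slicing with line[:-1] assumes every string ends with exactly one newline. A slightly more defensive sketch (the helper name strip_trailing_tabs is made up here) separates the line ending first and puts it back afterwards:

```python
def strip_trailing_tabs(line):
    body = line.rstrip("\r\n")   # text without its line ending
    ending = line[len(body):]    # whatever line ending was removed, possibly ""
    return body.rstrip("\t") + ending


lines = ['tra\tla\tla\t\t\t\n', 'tri\tli\tli\t\t\t\n', 'tro\tlo\tlo\t\t\t']
# Strings are immutable, so the results must be collected in a new list.
lines = [strip_trailing_tabs(line) for line in lines]
print(lines)  # ['tra\tla\tla\n', 'tri\tli\tli\n', 'tro\tlo\tlo']
```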

\n appending at the end of each line

I am writing lines one by one to an external file. Each line has 9 columns separated by a tab delimiter. If I split each line in that file and output the last column, I can see \n appended to the end of the ninth column. My code is:
#!/usr/bin/python
with open("temp", "r") as f:
    for lines in f:
        hashes = lines.split("\t")
        print(hashes[8])
The last column values are integers, either 1 or 2. When I run this program, the output I get is,
['1\n']
['2\n']
I should only get 1 or 2. Why is '\n' being appended here?
I tried the following check to remove the problem.
with open("temp", "r") as f:
    for lines in f:
        if lines != '\n':
            hashes = lines.split("\t")
            print(hashes[8])
This does not work either. I also tried if lines != ' '. How can I make this go away? Thanks in advance.
Try using strip on the lines to remove the \n (the newline character). strip removes leading and trailing whitespace characters. Split the stripped line, not the original one, otherwise the last column still carries the \n.
with open("temp", "r") as f:
    for lines in f:
        lines = lines.strip()
        if lines:
            hashes = lines.split("\t")
            print(hashes[8])
\n is the newline character; it is how the computer knows to display the data on the next line. Every line read from a file keeps its \n, so it ends up in the last column. If you modify the last item in the list, hashes[-1], to remove its last character, then that should be fine.
Depending on the platform, your line ending may be more than just one character. Dos/Windows uses "\r\n" for example.
def clean(file_handle):
    for line in file_handle:
        yield line.rstrip()
with open('temp', 'r') as f:
    for line in clean(f):
        hashes = line.split('\t')
        print(hashes[-1])
I prefer rstrip() for times when I want to preserve leading whitespace. That and using generator functions to clean up my input.
Because each line has 9 columns, index 8 (the ninth item) is the last one on the line, so it still carries the line break that ends the line. Just slice it off:
print(hashes[8][:-1])
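Slicing off one character breaks if the last line has no trailing newline or if the file uses "\r\n". A sketch that removes only line-ending characters before splitting, so trailing tabs (empty columns) are preserved; the helper name last_column and the sample rows are assumptions for illustration:

```python
import io


def last_column(lines, sep="\t"):
    """Yield the final field of each non-blank line, without its line ending."""
    for line in lines:
        line = line.rstrip("\r\n")   # remove only the line terminator
        if line:
            yield line.split(sep)[-1]


sample = io.StringIO("a\tb\tc\td\te\tf\tg\th\t1\n\na\tb\tc\td\te\tf\tg\th\t2\r\n")
print(list(last_column(sample)))  # ['1', '2']
```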
