How can this particular statement be made faster? - python

[[u'the', u'terse', u'announcement', u'state-run', u'news', u'agency', u'didnt', u'identify', u'the', u'aggressor', u'but', u'mister', u'has', u'accused', u'neighboring', u'country', u'of', u'threatening', u'to', u'attack', u'its', u'nuclear', u'installations'], [], [u'government', u'officials']]
I want to remove the empty lines, represented by the empty brackets [] above. I am currently using:
import codecs

dataset = []
with codecs.open("textfile.txt", "r", "utf-8") as f:
    for line in f:
        dataset.append(line.lower().strip().split())  # dataset contains the data above in the format shown
lines = [sum((line for line in dataset if line), [])]
This statement takes pretty long to remove the empty lines. Is there a better way to remove empty lines from a list of lists and still maintain the format shown?

You can skip the whitespace-only lines when reading the file:
with codecs.open("textfile.txt", "r", "utf-8") as f:
    dataset = [line.lower().split() for line in f if not line.isspace()]
Note that split() ignores leading/trailing whitespace, so strip() is redundant.
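For example, a quick illustrative check (not from the original post):
print("  The Terse Announcement \n".lower().split())
# ['the', 'terse', 'announcement']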
EDIT:
Your question is very unclear, but it seems from the comments that all you want to do is read a file and remove all the empty lines. If that is correct then you simply need to do:
with codecs.open("textfile.txt", "r", "utf-8") as f:
    dataset = [line.lower() for line in f if not line.isspace()]
Now dataset is a list of lower-cased lines (i.e. strings). If you want to combine them into one string, you can do:
text = ''.join(dataset)
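Because each kept line keeps its trailing newline (if it had one), this reproduces the lower-cased file contents with the blank lines removed. A tiny illustrative check (the sample list here is made up):
dataset = ['government officials\n', 'more text\n']
print(''.join(dataset))
# government officials
# more text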

I am a little confused why you are doing:
lines = [sum((line for line in dataset if line), [])]
First off, by adding square brackets around the call to sum you end up with a list containing one element: the result of sum. I'm not sure if that was intended...
Regardless, the result of sum() will be a list of all the words in the file that were separated by whitespace. If that is the desired end result, then you can simply use re.split:
import re

with open(...) as f:
    lines = [re.split(r"\W+", f.read())]
    # is it possible you instead wanted:
    # lines = re.split(r"\W+", f.read())
the "\W" simply means any whitespace ("\n"," ","\t" etc.) and the + means (1 or more multiples) so it will handle multiple newlines or multiple spaces.

Related

Getting specific element(s) from lists after reading from a file

After I read from a file:
with open(fileName) as f:
    for line in f:
        print(line.split(","))  # split the file into multiple lists
How do I get specific element(s) from those lists?
For example, only the elements at index 0 to 3, discarding/ignoring any elements after that.
If you want to save the first three items in each line, you could use a list comprehension
with open(fileName) as f:
    firstitems = [line.rstrip().split(",")[0:3] for line in f]
Note that the rstrip() is needed to remove the final newline character if there are fewer than four items in a line. Also note that the "items" are all strings, even if they look like other types; if you want integers, for example, you will need to convert them (a sketch of that follows below).
Then you can print them:
for line in firstitems:
    print(line)
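Regarding the note above about converting to integers: a minimal sketch, assuming the first three fields of every line are numeric (hypothetical data):

with open(fileName) as f:
    first_ints = [[int(x) for x in line.rstrip().split(",")[0:3]] for line in f]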
Try the below code, which prints the first three fields of each line:
with open('f.txt') as f:
    for line in f:
        print('\n'.join(line.rstrip().split(',')[0:3]))

Removing whitespace while reading in a file

with open(filename, "r") as f:
    for line in f:
        line = (' '.join(line.strip().split())).split()
Can anyone break down the line where whitespaces get removed?
I understand that line.strip().split() first removes leading and trailing spaces from line, then the resulting string gets split on whitespace, storing all the words in a list.
But what does the remaining code do?
The expression ' '.join(line.strip().split()) creates a string consisting of all the list elements separated by exactly one space character. Applying the split() method on this string again returns a list containing all the words in the string which were separated by whitespace.
Here's a breakdown:
# Opens the file
with open(filename, "r") as f:
    # Iterates through each line
    for line in f:
        # Rewriting this line, below:
        # line = (' '.join(line.strip().split())).split()
        # Assuming line was "  foo  bar   quux  "
        stripped_line = line.strip()     # "foo  bar   quux"
        parts = stripped_line.split()    # ["foo", "bar", "quux"]
        joined = ' '.join(parts)         # "foo bar quux"
        parts_again = joined.split()     # ["foo", "bar", "quux"]
Is this what you were looking for?
That code is pointlessly complicated is what it is.
There is no need to strip if you're no-arg splitting next (no-arg split drops leading and trailing whitespace as a side-effect), so line.strip().split() can simplify to line.split().
The join and re-split don't change a thing: join sticks the first split back together with spaces, then split re-splits on those very same spaces. So you could save the time spent joining only to split again and just keep the original results from the first split, changing it to:
line = line.split()
and it would be functionally identical to the original:
line = (' '.join(line.strip().split())).split()
and faster to boot. I'm guessing the code you were handed was written by someone who didn't understand splitting and joining either, and just threw stuff at their problem without understanding what it did.
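A quick illustrative check of that equivalence (the example string is made up):

line = "  foo   bar\tquux \n"
print((' '.join(line.strip().split())).split() == line.split())
# True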
Here is an explanation of the code:
with open(filename, "r") as f:
    for line in f:
        line = (' '.join(line.strip().split())).split()
First, line.strip() removes leading and trailing whitespace from line, and .split() breaks it into a list on whitespace.
Then .join converts that list back into a single line with the items separated by single spaces. Finally, .split converts it back into a list.
The expression line = (' '.join(line.strip().split())).split() is superfluous. It should simply be:
line = line.split()
If you also want to strip each element, use:
line = map(str.strip, line.split())
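One caveat, assuming Python 3: map() returns an iterator there, so wrap it in list() if you need an actual list:

line = list(map(str.strip, line.split()))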
I think they are doing this to maintain a constant amount of whitespace. The strip is removing all the whitespace (which could be 5 spaces and a tab), and then they are adding a single space back in its place.

Importing strings from a file, into a list, using python

I have a txt file, with these contents:
a,b
c,d
e,f
g,h
i,j
k,l
And I am putting them into lists, using these lines:
keywords = []
solutions = []
for i in file:
    keywords.append((i.split(","))[0])
    solutions.append(((i.split(","))[1]))
but when I print() the solutions, here is what it displays:
['b\n', 'd\n', 'f\n', 'h\n', 'j\n', 'l']
How do I make it so that the \n's are removed from the ends of the first 5 elements but the last element is left unaltered, using as few lines as possible?
You can use str.strip() to trim the trailing whitespace. But as a more Pythonic approach, you'd do better to use the csv module to load your file content; it accepts a delimiter and returns an iterable of the separated items (here, the characters). Then use the zip() function to get the columns.
import csv

with open(file_name) as f:
    reader_obj = csv.reader(f, delimiter=',')  # passing the delimiter here is optional; comma is the default
    first_column, second_column = zip(*reader_obj)
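With the six-line sample file above, this gives (illustrative output):

print(first_column)   # ('a', 'c', 'e', 'g', 'i', 'k')
print(second_column)  # ('b', 'd', 'f', 'h', 'j', 'l')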
You need to str.strip() the whitespace/newline characters from the string after reading it to remove the \n:
keywords = []
solutions = []
for i_raw in file:
    i = i_raw.strip()  # <-- removes extraneous whitespace from the start/end of the string
    keywords.append((i.split(","))[0])
    solutions.append(((i.split(","))[1]))

How to add items from a text file into a list?

I've already tried:
with open('names.txt', 'r') as f:
    myNames = f.readlines()
I have also tried multiple other ways that I found on Stack Overflow, but they are not doing exactly what I need.
I have a text file that has one line with multiple words ex:
FLY JUMP RUN
I need these in a list but for them to be separate elements in the list like:
['FLY', 'JUMP', 'RUN']
Except when using the methods I find on Stack Overflow, I get:
['FLY JUMP RUN']
But I need them to be separate because I am using the random.choice method on the list.
As was mentioned in the comments, you should try
myNames = None
with open('names.txt', 'r') as f:
    myNames = f.read().split()
This assumes the file is written the way you say. Of course, it won't matter much, as the default behaviour of split() is to split the string on whitespace characters such as spaces and newlines, so if your file consists of
One Two
Three
then
f.read().split()
will still return
["One", "Two", "Three"]
@Trey: To answer your comment.
No. readlines() reads the contents of the file into a list of lines, while read() simply reads it all in as a single string.
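An illustrative comparison, assuming a hypothetical file example.txt containing the two lines from the example above:

with open('example.txt') as f:
    print(f.readlines())   # ['One Two\n', 'Three\n'] -- a list of line strings (assuming a trailing newline)
with open('example.txt') as f:
    print(repr(f.read()))  # 'One Two\nThree\n'       -- one single string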
Try this code:
with open('names.txt', 'r') as f:
    myNames = f.read().split()  # split on whitespace
This will work as you expect for any data, as long as the words in a line are whitespace-separated.
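Since the question mentions random.choice, a minimal sketch of using the resulting list (the output is of course random):

import random

with open('names.txt') as f:
    myNames = f.read().split()

print(random.choice(myNames))  # e.g. 'JUMP'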

Load text file python could not convert string to float

I have a text file that looks like this:
(1064.2966,1898.787,1064.2986,1898.787,1064.2986,1898.785,1064.2966,1898.785)
(1061.0567,1920.3816,1065.1361,1920.2276,1065.5847,1915.9657,1065.4726,1915.2927,1061.0985,1914.3955,1058.1824,1913.9468,1055.6028,1913.9468,1051.0044,1916.19,1051.5651,1918.8817,1056.0514,1918.9939,1058.9675,1919.6668,1060.8741,1920.4519)
etc (all rows have different lengths)
when I use
np.loadtxt(filename,dtype=float,delimiter=',')
I get
ValueError: could not convert string to float: (1031.4647
I think np.loadtxt expects numbers, so it does not know how to convert a value which starts with a '('. I think you have two choices here:
lines = []
with open('datafile') as infile:
    for line in infile:
        line = line.rstrip('\n')[1:-1]  # this removes the first and last parentheses from the line
        lines.append([float(v) for v in line.split(',')])
This way you end up with lines, which is a list of lists of values (i.e. lines[0] is a list of the values on line 1).
The other way to go is modifying the data file to remove the parentheses, which you can do in many ways depending on the platform you are working on.
In most Linux systems for instance you can just do something along the lines of this answer
EDIT: as suggested by @AlexanderHuszagh in the comments section, different systems can have different ways of representing newlines, so a more robust solution would be:
lines = []
with open('datafile') as infile:
    file_lines = infile.read().splitlines()
    for line in file_lines:
        lines.append([float(v) for v in line[1:-1].split(',')])
You got the error because of the parentheses; you can strip them out this way:
with open(filename) as f:
    s = f.read().replace('(', '').replace(')', '')
This returns a list of arrays (the if line guards against a trailing blank line):
arrays = [np.array([float(v) for v in line.split(",")]) for line in s.split("\n") if line]
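Either way you end up with one sequence of values per input row; for the two sample rows above, for example (illustrative):

print(len(arrays[0]))  # 8  -- values on the first sample row
print(len(arrays[1]))  # 24 -- values on the second sample row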
