Why am I getting an empty list when I use split()? - python

I have a textfile as:
-- Generated ]
FILEUNIT
METRIC /
Hello
-- timestep: Jan 01,2017 00:00:00
3*2 344.0392 343.4564 343.7741
343.9302 343.3884 343.7685 0.0000 341.0843
342.2441 342.5899 343.0728 343.4850 342.8882
342.0056 342.0564 341.9619 341.8840 342.0447 /
I have written code that reads the file, removes the words, stray characters and empty lines, does some further processing, and finally returns the numbers in the last four lines. I cannot work out how to put all the numbers from the text file into a single list. Right now new_line holds a string built from the lines that contain numbers:
import string
def expand(chunk):
    l = chunk.split("*")
    chunk = [str(float(l[1]))] * int(l[0])
    return chunk
with open('old_textfile.txt', 'r') as infile1:
    for line in infile1:
        if set(string.ascii_letters.replace("e","")) & set(line):
            continue
        chunks = line.split(" ")
        #Get rid of newlines
        chunks = list(map(lambda chunk: chunk.strip(), chunks))
        if "/" in chunks:
            chunks.remove("/")
        new_chunks = []
        for i in range(len(chunks)):
            if '*' in chunks[i]:
                new_chunks += expand(chunks[i])
            else:
                new_chunks.append(chunks[i])
        new_chunks[len(new_chunks)-1] = new_chunks[len(new_chunks)-1]+"\n"
        new_line = " ".join(new_chunks)
When I use
A = new_line.split()
B = list(map(float, A))
it returns an empty list. Do you have any idea how I can put all these numbers into one single list?
Currently I write new_line to a text file and read it back in, but that increases my runtime, which is not good.
f = open('new_textfile.txt').read()
A = f.split()
B = list(map(float, A))
list_1.extend(B)
Another solution I found uses a regex, but it drops 3*2, which I want to process as 2 2 2:
import re
with open('old_textfile.txt', 'r') as infile1:
    lines = infile1.read()
    nums = re.findall(r'\d+\.\d+', lines)
print(nums)

I'm not quite sure if I entirely understand what you are trying to do, but as I understand it you want to extract all numbers which either are in a decimal form \d+\.\d+ or an integer which is multiplied by another integer using an asterisk, so \d+\*\d+. You want the results all in a list of floats where the decimals are in the list directly and for the integers the second is repeated by the first.
One way to do this would be:
import re
lines = """
-- Generated ]
FILEUNIT
METRIC /
Hello
-- timestep: Jan 01,2017 00:00:00
3*2 344.0392 343.4564 343.7741
343.9302 343.3884 343.7685 0.0000 341.0843
342.2441 342.5899 343.0728 343.4850 342.8882
342.0056 342.0564 341.9619 341.8840 342.0447 /
"""
nums = []
# Match either a decimal (344.0392) or a repeat expression (3*2)
for n in re.findall(r'(\d+\.\d+|\d+\*\d+)', lines):
    split_by_ast = n.split("*")
    if len(split_by_ast) == 1:
        nums += [float(split_by_ast[0])]
    else:
        # count*value: repeat the value count times
        nums += [float(split_by_ast[1])] * int(split_by_ast[0])
print(nums)
Which returns:
[2.0, 2.0, 2.0, 344.0392, 343.4564, 343.7741, 343.9302, 343.3884, 343.7685, 0.0, 341.0843, 342.2441, 342.5899, 343.0728, 343.485, 342.8882, 342.0056, 342.0564, 341.9619, 341.884, 342.0447]
The regular expression searches for numbers matching one of the formats (decimal or int*int). Then in case of a decimal it is directly appended to the list, in case of int*int it is parsed to a smaller list repeating the second int by first int times, then the lists are concatenated.
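If the numbers should instead be collected while the file is read line by line, a minimal sketch could look like the one below. The function name parse_numbers and the rule of skipping lines that start with -- are assumptions made for illustration and are not part of the original post. It also points at one possible reason for the empty list in the question: new_line is rebound on every iteration, so after the loop it only holds the last processed line, and if that line is blank, split() returns [].

```python
def parse_numbers(lines):
    """Collect every number from an iterable of text lines into one flat list."""
    numbers = []
    for line in lines:
        stripped = line.strip()
        # Assumption: blank lines and comment lines starting with "--" carry no data.
        if not stripped or stripped.startswith("--"):
            continue
        for token in stripped.split():
            if token == "/":
                continue
            try:
                if "*" in token:
                    # "3*2" means: repeat the value 2 three times
                    count, value = token.split("*")
                    numbers.extend([float(value)] * int(count))
                else:
                    numbers.append(float(token))
            except ValueError:
                # Not a number (for example FILEUNIT or METRIC): skip the token.
                continue
    return numbers


sample = """-- Generated ]
FILEUNIT
METRIC /
Hello
-- timestep: Jan 01,2017 00:00:00
3*2 344.0392 343.4564 343.7741
343.9302 343.3884 343.7685 0.0000 341.0843
342.2441 342.5899 343.0728 343.4850 342.8882
342.0056 342.0564 341.9619 341.8840 342.0447 /
"""

numbers = parse_numbers(sample.splitlines())
print(numbers)
```

Because the list is extended inside the loop, nothing is lost between iterations and no intermediate file is needed. Note that float() also accepts words such as "nan" or "inf", so a file containing those as plain words would need an extra check.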

Related

Not Parsing Through

I am trying to parse a text file and find the index of the first character whose four preceding characters are all different. For example:
wxrgh
The h would be the marker, since it comes after four different characters, and the index would be 4. I look for the index by converting the text into an array. It works for the test but not for the actual input. Does anyone know what is wrong?
def Repeat(x):
    _size = len(x)
    repeated = []
    for i in range(_size):
        k = i + 1
        for j in range(k, _size):
            if x[i] == x[j] and x[i] not in repeated:
                repeated.append(x[i])
    return repeated
with open("input4.txt") as f:
    text = f.read()
    test_array = []
    split_array = list(text)
    woah = ""
    for i in split_array:
        first = split_array[split_array.index(i)]
        second = split_array[split_array.index(i) + 1]
        third = split_array[split_array.index(i) + 2]
        fourth = split_array[split_array.index(i) + 3]
        test_array.append(first)
        test_array.append(second)
        test_array.append(third)
        test_array.append(fourth)
        print(test_array)
        if Repeat(test_array) != []:
            test_array = []
        else:
            woah = split_array.index(i)
            print(woah)
    print(woah)
I tried a test document and unit tests, but it still does not work.
You can utilise a set to help you with this.
Read the entire file into a string (buffer). Iterate over the buffer starting at offset 4. Create a set of the 4 characters that precede the current position. If the length of the set is 4 (i.e., they're all different) and the character at the current position is not in the set, then you've found the index you're interested in.
W = 4
with open('input4.txt') as data:
    buffer = data.read()
    for i in range(W, len(buffer)):
        if len(s := set(buffer[i-W:i])) == W and buffer[i] not in s:
            print(i)
Note:
If the input data are split over multiple lines you may want to remove newline characters.
You will need to be using Python 3.8+ to take advantage of the assignment expression (walrus operator)
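For checking the idea without a file, the same test can be wrapped in a small function and run on the example string from the question. This is only a sketch: the name first_marker_index and the choice to return the first match (or None) are assumptions, not something stated in the posts above.

```python
def first_marker_index(text, width=4):
    """Return the first index whose `width` preceding characters are all
    different and do not contain the character at that index, else None."""
    for i in range(width, len(text)):
        window = set(text[i - width:i])
        if len(window) == width and text[i] not in window:
            return i
    return None


print(first_marker_index("wxrgh"))    # the example from the question
print(first_marker_index("aabcdab"))  # no position qualifies here
```

Newline characters are treated like any other character in this sketch, so multi-line input may need to be cleaned first, as the note above suggests.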

How to put a group of integers in a row in a text file into a list?

I have a text file composed mostly of numbers something like this:
3  011236547892X
9  02321489764 Q
4  031246547873B
I would like to extract each of the following (spaces 5 to 14 (counting from zero)) into a list:
1236547892
321489764
1246547873
(Please note: each "number" is 10 "characters" long - the second row has a space at the end.)
and then perform analysis on the contents of each list.
I have umpteen versions, however I think I am closest with:
with open('k_d_m.txt') as f:
    for line in f:
        range = line.split()
        num_lst = [x for x in range(3,10)]
        print(num_lst)
However I have: TypeError: 'list' object is not callable
What is the best way forward?
What I want to do with num_lst is, amongst other things, as follows:
num_lst = list(map(int, str(num)))
print(num_lst)
nth = 2
odd_total = sum(num_lst[0::nth])
even_total = sum(num_lst[1::nth])
print(odd_total)
print(even_total)
if odd_total - even_total == 0 or odd_total - even_total == 11:
    print("The number is ok")
else:
    print("The number is not ok")
Use a simple slice:
with open('k_d_m.txt') as f:
    num_lst = [x[5:15] for x in f]
Response to comment:
with open('k_d_m.txt') as f:
    for line in f:
        num_lst = list(line[5:15])
        print(num_lst)
First of all, you shouldn't name your variable range, because that is already taken for the range() function. You can easily get the 5 to 14th chars of a string using string[5:15]. Try this:
num_lst = []
with open('k_d_m.txt') as f:
    for line in f:
        num_lst.append(line[5:15])
print(num_lst)
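To connect the slice with the odd/even check from the question, a rough sketch follows. The function name alternating_totals is hypothetical, the sample rows assume two spaces after the first digit so that the number really starts at index 5, and skipping non-digit characters (such as the trailing space in the second row) is an assumption about the intended behaviour.

```python
def alternating_totals(line, start=5, stop=15):
    """Return (odd_total, even_total) for the digits found in line[start:stop]."""
    field = line[start:stop]
    # Assumption: anything that is not 0-9, such as a trailing space, is skipped.
    digits = [int(ch) for ch in field if "0" <= ch <= "9"]
    odd_total = sum(digits[0::2])   # 1st, 3rd, 5th, ... digit
    even_total = sum(digits[1::2])  # 2nd, 4th, 6th, ... digit
    return odd_total, even_total


rows = ["3  011236547892X", "9  02321489764 Q", "4  031246547873B"]
for row in rows:
    odd_total, even_total = alternating_totals(row)
    difference = odd_total - even_total
    verdict = "ok" if difference in (0, 11) else "not ok"
    print(row[5:15], odd_total, even_total, verdict)
```

Keeping the slice bounds as parameters makes it easy to adjust them if the real file uses different column positions.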

Extracted float values are stored in a list of lists instead of a list of values

I am doing an exercise on finding all the floating-point values in a text file and computing their average.
I have managed to extract all the necessary values, but they are being stored in a list of lists and I don't know how to extract them as floats in order to do the calculations.
Here is my code:
import re
fname = input("Enter file name: ")
fhandle = open(fname)
x = []
count = 0
for line in fhandle:
    if not line.startswith("X-DSPAM-Confidence:") : continue
    s = re.findall(r"[-+]?\d*\.\d+|\d+", line)
    x.append(s)
    count = count + 1
print(x)
print("Done")
and this is the output of x :
[['0.8475'], ['0.6178'], ['0.6961'], ['0.7565'], ['0.7626'], ['0.7556'], ['0.7002'], ['0.7615'], ['0.7601'], ['0.7605'], ['0.6959'], ['0.7606'], ['0.7559'], ['0.7605'], ['0.6932'], ['0.7558'], ['0.6526'], ['0.6948'], ['0.6528'], ['0.7002'], ['0.7554'], ['0.6956'], ['0.6959'], ['0.7556'], ['0.9846'], ['0.8509'], ['0.9907']]
Done
You can make x a flat list of floats from the start:
# ...
for line in fhandle:
    # ...
    s = re.findall(r"[-+]?\d*\.\d+|\d+", line)
    x.extend(map(float, s))
Note that re.findall returns a list, so we extend x by it while applying float to all the strings in it.
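Put together with the average the exercise asks for, the idea might look like the sketch below. The function name average_confidence and the in-memory sample lines are made up for illustration; the regular expression and the X-DSPAM-Confidence: prefix are taken from the question.

```python
import re


def average_confidence(lines):
    """Average all numbers on lines that start with 'X-DSPAM-Confidence:'."""
    values = []
    for line in lines:
        if not line.startswith("X-DSPAM-Confidence:"):
            continue
        # findall returns a list of strings; extend keeps `values` flat
        values.extend(map(float, re.findall(r"[-+]?\d*\.\d+|\d+", line)))
    if not values:
        return None  # avoid dividing by zero when nothing matched
    return sum(values) / len(values)


# Hypothetical sample lines standing in for the real file
sample_lines = [
    "X-DSPAM-Confidence: 0.8475\n",
    "From: someone@example.com\n",
    "X-DSPAM-Confidence: 0.6178\n",
]
print(average_confidence(sample_lines))
```

With a real file, the open file handle can be passed directly in place of sample_lines, since it is also an iterable of lines.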

Combining lines using reg ex in python

I want to combine six lines (each containing three elements) into a single line with three elements, so that the first is the sum of all the first elements, the second is the sum of all the second elements, and the third is the concatenation of all the third elements.
For example,
We have,
12.34 -79 x
-3.5 23 y
32.2E2 2 z
4.23e-10 +45 x
62E+2 -4 y
0.0 0 z
and we need
9428.84 -13 xyzxyz
Here is my current code:
f = open('data.txt', 'r')
""" opens the file """
import re
""" Imports the regular expressions module"""
# lines = f.readlines ()
lines = list(f)
""" Reads all the lines of the file """
p = re.compile(r'\s*^([-]?([1-9]\d|\d)[E|e]?[+\d]?(.)(\d+(E|e)[-]?\d+|\d+))\s*([-,+]?([1-9]\d+|\d))\s*([x|y|z])$')
for x in lines:
    m = p.match(x)
    if m:
        print (x)
You can do this by zipping the contents of the file so that all the numbers of the first column end up in the first list, all the numbers of the second column in the second list, and all the characters in the third list. Then you simply sum the first two lists and join the third list, which contains the characters:
sum1 = 0
sum2 = 0
finalStr = ""
with open("data.txt", "r") as infile:
    lines = list(zip(*[line.split() for line in list(infile)]))
    sum1 = sum(map(float,lines[0]))
    sum2 = sum(map(float,lines[1]))
    finalStr = "".join(lines[2])
# Some formatting for float numbers
print("{:.2f}".format(sum1), end=" ")
print("{:.0f}".format(sum2), end=" ")
print(finalStr)
Output:
9428.84 -13 xyzxyz
There is no need for a regex in your case. Regular expressions are used to deconstruct strings, not to combine them. If you do not mind using pandas, the solution takes two lines after the import:
import pandas as pd
data = pd.read_table("data.txt", sep=r'\s+', header=None)
data.sum().values.tolist()
#[9428.840000000422, -13, 'xyzxyz']

How to extract numbers from a text file and multiply them together?

I have a text file that contains 800 words, each paired with a number. (Each word and its number are on their own line, so the file has 800 lines.) I have to find the numbers and then multiply them together. Because the product of so many small floats underflows to zero, I have to use logarithms to prevent the underflow, but I don't know how.
This is the formula:
$c_{NB} = \arg\max_{c} \left[ \log P(c) + \log P(x \mid c) \right]$
This code doesn't print anything:
output = []
with open('c:/python34/probEjtema.txt', encoding="utf-8") as f:
    w, h = map(int, f.readline().split())
    tmp = []
    for i, line in enumerate(f):
        if i == h:
            break
        tmp.append(map(int, line.split()[:w]))
    output.append(tmp)
print(output)
The file is in Persian.
A snippet of the file:
فعالان 0.0019398642095053346
محترم 0.03200775945683802
اعتباري 0.002909796314258002
مجموع 0.0038797284190106693
حل 0.016488845780795344
مشابه 0.004849660523763337
مشاوران 0.027158098933074686
مواد 0.005819592628516004
معادل 0.002909796314258002
ولي 0.005819592628516004
ميزان 0.026188166828322017
دبير 0.0019398642095053346
دعوت 0.007759456838021339
اميد 0.002909796314258002
You can use regular expressions to find the first number in each line, e.g.
import re
output = []
with open('c:/python34/probEjtema.txt', encoding="utf-8") as f:
    for line in f:
        match = re.search(r'\d+\.?\d*', line)
        if match:
            output.append(float(match.group()))
print(output)
re.search(r'\d+\.?\d*', line) looks for the first number (an integer, or a float with a .) in each line. The dot is escaped so that it matches a literal decimal point rather than any character.
Here is a nice online regex tester: https://regex101.com/ (for debugging / testing).
/Edit: changed regex to \d+\.?\d* to catch integers and float numbers.
If I understood you correctly, you could do something along the lines of:
result = 1
with open('c:/python34/probEjtema.txt', encoding="utf-8") as f:
    for line in f:
        word, number = line.split() # line.split("\t") if numbers are separated by tab
        result = result * float(number)
This creates an output list with all the numbers (as strings), result holds the final product, and eres holds the sum of the base-10 logarithms.
import math
output = []
result = 1
eres = 0
with open('c:/python34/probEjtema.txt', encoding="utf-8") as f:
    for line in f:
        output.append(line.split()[1])
        result *= float(line.split()[1])
        eres += math.log10(float(line.split()[1]))  # result in log base 10
print(output)
print(result)
print(eres)
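To see why the logarithm helps, the sketch below multiplies 800 made-up probabilities directly and also sums their natural logarithms. The function name log_sum and the sample data are assumptions for illustration; the only thing taken from the question is the "word number" line layout.

```python
import math


def log_sum(lines):
    """Sum the natural logs of the last field on each 'word number' line."""
    total = 0.0
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # math.log raises ValueError for 0 or negative values
        total += math.log(float(parts[-1]))
    return total


# Hypothetical data: 800 lines, each with probability 0.001
lines = ["word 0.001"] * 800

naive = 1.0
for line in lines:
    naive *= float(line.split()[1])

total = log_sum(lines)
print(naive)  # the direct product underflows
print(total)  # the log of the product stays representable
```

Since log(a * b) = log(a) + log(b), comparing the summed logs between classes gives the same argmax as comparing the products, without ever forming the tiny product itself.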
