Writing white-space delimited text to be human readable in Python #2

I want to write a 2D numpy array into a human-readable text file format. I came across a question asked before, but it only assigns an equal number of spaces to each element in the array; there, every element is spaced out with 10 spaces. What I want is a different number of spaces for each column in my array.
Writing white-space delimited text to be human readable in Python
For example, I want 7 spaces for my 1st column, 10 spaces for my 2nd column, 4 spaces for my 3rd column, etc. Is there an analogue to numpy.savetxt(filename, X, delimiter = ',', fmt = '%-10s'), but where instead of '%-10s' I have, say, '%-7s, %-10s, %-4s' etc.?
Thank you

Here is an example of what it can look like (Python 2 & 3):
l = [[1,2,3,4], [3,4,5,6]]
for row in l:
    print(u'{:<7} {:>7} {:^7} {:*^7}'.format(*row))
1             2    3    ***4***
3             4    5    ***6***
The formatting options are taken from http://docs.python.org/2/library/string.html
>>> '{:<30}'.format('left aligned')
'left aligned                  '
>>> '{:>30}'.format('right aligned')
'                 right aligned'
>>> '{:^30}'.format('centered')
'           centered           '
>>> '{:*^30}'.format('centered') # use '*' as a fill char
'***********centered***********'
If you need a file, then do this:
l = [[1,2,3,4], [3,4,5,6]]
with open('file.txt', 'wb') as f:
    f.write(u'\ufeff'.encode('utf-8'))
    for row in l:
        line = u'{:<7} {:>7} {:^7} {:*^7}\r\n'.format(*row)
        f.write(line.encode('utf-8'))
The content of the file is
1             2    3    ***4***
3             4    5    ***6***
And the encoding is UTF-8. This means that you can have not only numbers but also any character in your heading: ☠ ⇗ ⌚ ② ☕ ☃ ⛷
heading = [u'☠', u'⇗', u'⌚', u'②']
with open('file.txt', 'wb') as f:
    f.write(u'\ufeff'.encode('utf-8'))
    line = u'{:<7} {:>7} {:^7} {:*^7}\r\n'.format(*heading)
    f.write(line.encode('utf-8'))
    for row in l:
        line = u'{:<7} {:>7} {:^7} {:*^7}\r\n'.format(*row)
        f.write(line.encode('utf-8'))
☠             ⇗    ⌚    ***②***
1             2    3    ***4***
3             4    5    ***6***
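Note that numpy.savetxt also accepts one format specifier per column, either as a list of specifiers or as a single multi-format string, so a minimal sketch closer to the original question (column widths chosen arbitrarily here) would be:
import numpy as np

X = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
# One '%-Ns' specifier per column: left-aligned, padded to widths 7, 10, 4 and 7.
np.savetxt('file.txt', X, fmt=['%-7s', '%-10s', '%-4s', '%-7s'])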

Related

regex on reading a txt file in Python

I have a txt file that contains data for classification purposes. The first column is the class, that is 0 or 1, and the other four columns contain the features of the class. However, the features have numbers before them, that is 1: for feature 1, 2: for feature 2, etc. I tried to use a regex in numpy split but I failed. How can I take only the columns I need? Below is the txt with the data.
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
1 1:9.133997e+01 2:2.935699e+02 3:1.423918e-01 4:1.605402e+02
1 1:5.537500e+01 2:1.792220e+02 3:1.654953e-01 4:1.112273e+02
1 1:2.956200e+01 2:1.913570e+02 3:9.901439e-02 4:1.034076e+02
1 1:1.451200e+02 2:2.088600e+02 3:-1.760859e-01 4:1.542257e+02
1 1:3.849699e+01 2:4.146600e+01 3:-1.886419e-01 4:1.239661e+02
1 1:2.927699e+01 2:1.072510e+02 3:1.149632e-01 4:1.077885e+02
1 1:2.886700e+01 2:1.090240e+02 3:-1.239433e-01 4:9.799130e+01
1 1:2.401300e+01 2:7.602000e+01 3:2.850990e-01 4:9.891692e+01
1 1:2.837900e+01 2:1.452160e+02 3:3.870011e-01 4:1.549975e+02
1 1:2.238140e+01 2:8.242810e+01 3:-2.814865e-01 4:8.998764e+01
1 1:1.232100e+02 2:4.561600e+02 3:-1.518468e-01 4:1.432996e+02
1 1:2.008405e+01 2:1.774510e+02 3:2.578101e-01 4:9.253101e+01
1 1:3.285699e+01 2:1.826750e+02 3:2.204406e-01 4:9.457175e+01
1 1:0.000000e+00 2:1.154780e+02 3:1.504970e-01 4:1.096315e+02
1 1:3.954504e+01 2:2.374420e+02 3:1.089429e-01 4:1.376333e+02
1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02
1 1:3.408200e+01 2:1.198280e+02 3:2.200156e-01 4:1.383639e+02
1 1:0.000000e+00 2:8.671080e+01 3:4.201880e-01 4:1.298851e+02
1 1:4.865997e+01 2:3.071500e+02 3:1.756066e-01 4:1.640174e+02
1 1:2.341090e+01 2:8.347140e+01 3:1.766868e-01 4:9.803250e+01
1 1:1.222390e+02 2:4.357930e+02 3:-1.812907e-01 4:1.687663e+02
1 1:1.624560e+01 2:4.830620e+01 3:5.508614e-01 4:2.632639e+01
1 1:4.389899e+01 2:2.421300e+02 3:2.006008e-01 4:1.331948e+02
1 1:6.143698e+01 2:2.338500e+02 3:2.758731e-01 4:1.612433e+02
1 1:5.952499e+01 2:2.176700e+02 3:-8.601014e-02 4:1.170831e+02
1 1:2.915850e+01 2:1.259875e+02 3:1.910455e-01 4:1.279927e+02
1 1:5.059702e+01 2:2.430620e+02 3:1.863443e-01 4:1.352273e+02
1 1:6.024097e+01 2:1.977340e+02 3:-1.319924e-01 4:1.320220e+02
1 1:2.620490e+01 2:6.270790e+01 3:-1.402450e-01 4:1.135866e+02
1 1:2.847198e+01 2:1.483760e+02 3:-1.868249e-01 4:1.672337e+02
1 1:2.707990e+01 2:7.770390e+01 3:-2.509235e-01 4:9.798032e+01
1 1:2.068600e+01 2:8.446800e+01 3:1.761782e-01 4:1.199423e+02
1 1:1.962450e+01 2:4.923090e+01 3:4.302725e-01 4:9.361318e+01
1 1:4.961401e+01 2:3.234850e+02 3:-1.963741e-01 4:1.622486e+02
1 1:7.982401e+01 2:2.017540e+02 3:-1.412161e-01 4:1.310716e+02
1 1:6.696402e+01 2:2.214030e+02 3:-1.187778e-01 4:1.416626e+02
1 1:5.842999e+01 2:1.348610e+02 3:2.876077e-01 4:1.286684e+02
1 1:6.982007e+01 2:3.693401e+02 3:-1.539849e-01 4:1.511659e+02
1 1:1.902200e+01 2:2.210120e+02 3:1.689450e-01 4:1.368066e+02
1 1:4.582898e+01 2:2.215950e+02 3:2.419124e-01 4:1.627100e+02
I do hate pandas but try these three lines:
import pandas as pd
# read_csv treats a multi-character sep as a regex (hence engine='python');
# header=None is needed so the first data row is not swallowed as a header.
x = pd.read_csv('file.txt', sep='[: ]', header=None, engine='python').to_numpy()
# Now select the required columns
output = x[:, (2, 4, 6, 8)]
print(output)
"""
array([[ 2.617300e+01,  5.886700e+01, -1.894697e-01,  1.251225e+02],
       [ 5.707397e+01,  2.214040e+02,  8.607959e-02,  1.229114e+02],
       [ 1.725900e+01,  1.734360e+02, -1.298053e-01,  1.250318e+02],
       [ 2.177940e+01,  1.249531e+02,  1.538853e-01,  1.527150e+02],
       [ 9.133997e+01,  2.935699e+02,  1.423918e-01,  1.605402e+02],
       ...])
"""
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
You can also check out How to use regex as delimiter function while reading a file into numpy array, or similar questions.
I rediscovered the solution below independently but this answer follows the same strategy via sep:
https://stackoverflow.com/a/51195469/1021819
You wish to parse a line like this:
1 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
Your parser reads the input with the usual idiom of:
for line in input_file:
Start by distinguishing the class label from the features.
label, *raw_features = line.split()
label = int(label)
Now it just remains to strip the unhelpful N: prefix from each feature.
features = [float(feat.split(':')[1])
            for feat in raw_features]
Sure, you could solve this with a regex.
But that doesn't sound like the simplest solution, in this case.
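Putting the pieces above together, a minimal sketch of the whole loop (the file name data.txt is an assumption):
# data.txt is a hypothetical name; substitute your own file.
with open('data.txt') as input_file:
    rows = []
    for line in input_file:
        # The first token is the class label; the rest are 'N:value' features.
        label, *raw_features = line.split()
        label = int(label)
        features = [float(feat.split(':')[1]) for feat in raw_features]
        rows.append((label, features))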
I was bored :-), so I thought of writing a snippet for you. See below; it is a kind of dirty text processing that loads the result into a dataframe.
import pandas as pd

lines = open("predictions.txt", "r").readlines()
# Split each line, sort the "N:value" pairs by N, and keep only the values.
# line.split() (rather than split(" ")) also strips the trailing newline.
column_lines = [
    [fline[0]] + [feat[1] for feat in sorted([tuple(feature.split(":")) for feature in fline[1:]],
                                             key=lambda f: f[0])]
    for fline in [line.split() for line in lines]
]
table = pd.DataFrame(column_lines, columns=["Class", "Feature1", "Feature2", "Feature3", "Feature4"])
Instead of this, you can also think of transforming the file to a csv using similar text processing, and then use that csv directly to create a dataframe, so you don't need to run this code every time.
I hope this is helpful.
If you want to use a regex to extract only your columns, you can apply this expression to each line:
import re
line = '1 1:1.067980e+02 2:3.237560e+02 3:-1.509505e-01 4:1.754021e+02'
reg = re.compile(r'(-?\d+\.\d+e[+-]\d+)')
# Your columns:
reg.findall(line)
>>> ['1.067980e+02', '3.237560e+02', '-1.509505e-01', '1.754021e+02']
# Assuming you also want the numeric values:
list(map(float, reg.findall(line)))
>>> [106.798, 323.756, -0.1509505, 175.4021]
What it does:
In (-?\d+\.\d+e[+-]\d+), the parentheses create a capture group. Inside the group, -? is an optional minus sign. Thereafter there is at least one digit, but there can be more than one (\d+). The number has a decimal point with decimals, therefore \.\d+. Then there is an exponent with either + or - (e[+-]) followed by digits (\d+).
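Applied to the whole file, a minimal sketch (the file name is an assumption):
import re

reg = re.compile(r'(-?\d+\.\d+e[+-]\d+)')
# data.txt is a hypothetical name; substitute your own file.
with open('data.txt') as fh:
    rows = [list(map(float, reg.findall(line))) for line in fh]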

Python to remove extra delimiter

We have a 100 MB pipe-delimited file that has 5 columns (4 delimiters) per row, each column separated by a pipe. However, there are a few rows where the second column has an extra pipe. For these few rows the total delimiter count is 5.
For example, in the below 4 rows, the 3rd is a problematic one as it has an extra pipe.
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
Is there any way we can remove the extra pipe from the second position wherever the delimiter count for the row is 5? So, post correction, the file needs to look like below.
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
Please note that the file size is 100 MB. Any help is appreciated.
Source: my_file.txt
1|B|3|D|5
A|1|2|34|5
D|This is a |text|3|5|7
B|4|5|5|6
E|1 |9 |2 |8 |Not| a |text|!!!|3|7|4
Code
# If using Python 3.10+, the two open() calls can use parenthesized context managers:
# https://docs.python.org/3.10/whatsnew/3.10.html#parenthesized-context-managers
with open('./my_file.txt') as file_src, open('./my_file_parsed.txt', 'w') as file_dst:
    # Iterate the file lazily rather than readlines(), so the 100 MB file is never held in memory at once.
    for line in file_src:
        # Split the line by the character '|'
        line_list = line.split('|')
        if len(line_list) <= 5:
            # If the number of columns doesn't exceed 5, just write the original line as is.
            file_dst.write(line)
        else:
            # If the number of columns exceeds 5, count the number of columns that should be merged.
            to_merge_columns_count = (len(line_list) - 5) + 1
            # Merge the columns from index 1 up to and including all the columns to be merged.
            merged_column = "".join(line_list[1:1 + to_merge_columns_count])
            # Replace all those items with the single merged column.
            line_list[1:1 + to_merge_columns_count] = [merged_column]
            # Write the updated line.
            file_dst.write("|".join(line_list))
Result: my_file_parsed.txt
1|B|3|D|5
A|1|2|34|5
D|This is a text|3|5|7
B|4|5|5|6
E|1 9 2 8 Not a text!!!|3|7|4
A simple regular expression pattern like this works on Python 3.7.3:
from re import compile
bad_pipe_re = compile(r"[ \w]+\|[ \w]+(\|)[ \w]+\|[ \w]+\|[ \w]+\|[ \w]+\n")
with open("input", "r") as fp_1, open("output", "w") as fp_2:
    line = fp_1.readline()
    while line != "":
        mo = bad_pipe_re.fullmatch(line)
        if mo is not None:
            # Drop the captured extra pipe by splicing around group 1.
            line = line[:mo.start(1)] + line[mo.end(1):]
        fp_2.write(line)
        line = fp_1.readline()

How to replace() a specific value within a list item directly after a keyword [duplicate]

This question already has answers here:
replace characters not working in python [duplicate]
(3 answers)
Closed 2 years ago.
I currently have a standard txt-style file that I'm trying to open, copy, and change a specific value within. However, a standard replace() call isn't producing any difference. Here's what the 14th line of the file looks like:
'    Bursts: 1 BF: 50 OF: 1 On: 2 Off: 8'
Here's the current code I have:
conf_file = 'configs/m-b1-of1.conf'
read_conf = open(conf_file, 'r')
conf_txt = read_conf.readlines()
conf_txt[14].replace(conf_txt[14][13], '6')
v_conf
Afterwards, though, no changes have been applied to the specific value I'm referencing (in this case, the first '1' in the 14th line above).
Any help would be appreciated - thanks!
There are a few things here, I think:
I copied your string, and the first 1 is actually at character index 12.
The replace result needs to be assigned back to something (it gives you a new string).
replace will replace all "1"s with "6"s!
Example:
>>> a = '    Bursts: 1 BF: 50 OF: 1 On: 2 Off: 8'
>>> a = a.replace(a[12], '6')
>>> a
'    Bursts: 6 BF: 50 OF: 6 On: 2 Off: 8'
If you only want to replace the first instance (or N instances) of that character you need to let replace() know:
>>> a = '    Bursts: 1 BF: 50 OF: 1 On: 2 Off: 8'
>>> a = a.replace(a[12], '6', 1)
>>> a
'    Bursts: 6 BF: 50 OF: 1 On: 2 Off: 8'
Note that above, only the "Bursts" value is replaced and not the "OF" value.
try this
conf_file = 'configs/m-b1-of1.conf'
read_conf = open(conf_file, 'r')
conf_txt = read_conf.readlines()
conf_txt[14] = conf_txt[14].replace(conf_txt[14][13], '6')
The replace function does not edit the actual string; it just returns the replaced value, so you have to reassign the string in the list.

Sorting lines of a text file using column with integers

I need to sort the lines of a text file using the integer values of one of the columns
(the first one).
The file (coord.xyz) looks like this
9 1 -1.379785 0.195902 -1.197553
5 4 -0.303549 0.242253 -0.810244
2 2 -0.582923 1.208243 1.566588
3 3 -0.494556 0.028594 0.763130
4 1 -0.749005 -1.209878 1.358057
1 1 -0.883509 1.111866 2.882335
6 1 -1.005786 -1.278486 2.719391
7 5 -1.128898 -0.088124 3.508042
10 1 -0.253070 -0.289294 5.424662
8 1 -1.243879 -0.217228 5.247915
I used the code
import numpy as np
with open("coord.xyz") as inf:
    data = []
    for line in inf:
        line = line.split()
        if len(line) == 5:
            data.append(line)
f_h = file('sorted.dat', 'a')
m = sorted(data, key=lambda data_entry: data_entry[0])
np.savetxt(f_h, m, fmt='%s', delimiter=' ')
f_h.close()
the sorted.dat file turned out like this
1 1 -0.883509 1.111866 2.882335
10 1 -0.253070 -0.289294 5.424662
2 2 -0.582923 1.208243 1.566588
3 3 -0.494556 0.028594 0.763130
4 1 -0.749005 -1.209878 1.358057
5 4 -0.303549 0.242253 -0.810244
6 1 -1.005786 -1.278486 2.719391
7 5 -1.128898 -0.088124 3.508042
8 1 -1.243879 -0.217228 5.247915
9 1 -1.379785 0.195902 -1.197553
The 10 is considered a smaller value than 2.
Could someone help me fix this?
What you wrote is sorting the lines as strings. Alphabetically 10 comes before 2.
Try writing your lambda as:
m = sorted(data, key=lambda data_entry: int(data_entry[0]))
If you used NumPy to import the data as well as to export it, you wouldn't have this problem. For example:
m = np.loadtxt("coord.xyz", dtype="i, i, f8, f8, f8")
Now you've got a 1D array of tuples of the appropriate types, and the default m.sort() will sort the tuples in the usual way, which is exactly what you want. So the whole thing reduces to three lines: read the array, sort the array, write the array.
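As a minimal sketch of that three-line version (the output format string here is an arbitrary choice, assuming a reasonably recent NumPy):
import numpy as np

m = np.loadtxt("coord.xyz", dtype="i, i, f8, f8, f8")  # 1D structured array of tuples
m.sort()                                               # compares field by field, first column first
np.savetxt("sorted.dat", m, fmt="%d %d %f %f %f")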
But let's show you what you did wrong with your attempt:
m = sorted(data, key=lambda data_entry: data_entry[0])
You're asking it to sort by the first string in the list of strings data_entry. So that's what it does. If you want it to sort by that first string as a number, you have to tell it that. Like this:
m = sorted(data, key=lambda data_entry: int(data_entry[0]))
And that's it.
Also, if you want to read (or write) CSV-like files without using NumPy, rather than writing your own string processing, the csv module in the standard library makes your life easier:
import csv
with open("coord.xyz") as inf:
    data = list(csv.reader(inf, delimiter=' '))
m = sorted(data, key=lambda data_entry: int(data_entry[0]))
with open("sorted.dat", "a") as outf:
    csv.writer(outf, delimiter=' ').writerows(m)

Read integers from file using struct.unpack in Python

Suppose I have a file name num.txt as below:
1 2 3 4 5
6 7 8 9 0
I want to read 3 integers from this file, that is 1 2 3.
I know that struct.unpack might do the trick, but I just cannot get it right.
Here is how I did it:
import struct
fp = open('num.txt', 'rb')
print struct.unpack('iii', fp.read(12)) #right?
Anyone can help me with this?
PS
This is how I got file num.txt:
fp = open('num.txt', 'wb')
fp.write('1 2 3 4 5\n6 7 8 9 0')
fp.close()
You don't use struct to read numbers from a text file. It is for reading data from a binary file -- where the first byte is actually 0x01 rather than a byte order mark or the encoded value of the character '1'.
You just want
three_ints = [int(x) for x in numfile.readline().strip().split(' ')[:3]]
If you're only interested in the first three numbers, or
all_ints = [[int(x) for x in line.split()] for line in numfile]
if you want a list of lists of the ints on each line.
struct is used for C-style binary representations of numbers. If you have text representations instead then you should just pass them to int().
>>> [int(x) for x in '1 2 3 4 5'.split()]
[1, 2, 3, 4, 5]
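For contrast, a minimal sketch of what struct is actually for, writing and then reading real binary data (num.bin is a hypothetical file name):
import struct

# Write three ints as 12 bytes of native binary data.
with open('num.bin', 'wb') as fp:
    fp.write(struct.pack('iii', 1, 2, 3))

# Read the same 12 bytes back.
with open('num.bin', 'rb') as fp:
    print(struct.unpack('iii', fp.read(12)))  # (1, 2, 3)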
