Fix column indentation with Python

There is a file format called .xyz that helps with visualizing molecular bonds. Basically the format asks for a specific pattern:
The first line must contain the number of atoms, which in my case is 30.
After that comes the data, where the first column is the name of the atom (in my case they are all carbon), the second column is the x value, the third column is the y value, and the last column is the z value (all 0 in my case). The columns should be aligned so that all corresponding fields start at the same position. So something like this:
30
C x1 y1 z1
C x2 y2 z2
...
...
...
and not:
30
C x1 y1 z1
C   x2   y2  z2
since this is the wrong indentation.
My generated data is stored like this in a .txt file:
C 2.99996 7.31001e-05 0
C 2.93478 0.623697 0
C 2.74092 1.22011 0
C 2.42702 1.76343 0
C 2.0079 2.22961 0
C 1.50006 2.59812 0
C 0.927076 2.8532 0
C 0.313848 2.98349 0
C -0.313623 2.9837 0
C -0.927229 2.85319 0
C -1.5003 2.5981 0
C -2.00732 2.22951 0
C -2.42686 1.76331 0
C -2.74119 1.22029 0
C -2.93437 0.623802 0
C -2.99992 -5.5509e-05 0
C -2.93416 -0.623574 0
C -2.7409 -1.22022 0
C -2.42726 -1.7634 0
C -2.00723 -2.22941 0
C -1.49985 -2.59809 0
C -0.92683 -2.85314 0
C -0.313899 -2.98358 0
C 0.31363 -2.98356 0
C 0.927096 -2.85308 0
C 1.50005 -2.59792 0
C 2.00734 -2.22953 0
C 2.4273 -1.76339 0
C 2.74031 -1.22035 0
C 2.93441 -0.623647 0
I want to correct the alignment by making all of the corresponding columns start at the same position. I tried to do this with AWK to no avail, so I turned to Python. So far I have this:
#!/usr/bin/env python
text_file = open("output.txt", "r")
lines = text_file.readlines()
myfile = open("output.xyz", "w")
for line in lines:
    atom, x, y, z = line.split()
    x, y, z = map(float, (x, y, z))
    myfile.write("{}\t {}\t {}\t {}\n".format(atom, x, y, z))
myfile.close()
text_file.close()
but I currently don't know how the column alignment can be added to this.
tl;dr: I have a data file in .txt, I want to change it into the .xyz format specified above, but I am running into problems with the column alignment.

It appears that I misinterpreted your requirement...
To achieve a fixed width output using awk, you could use printf with a format string like this:
$ awk '{printf "%-4s%12.6f%12.6f%5d\n", $1, $2, $3, $4}' data.txt
C       2.999960    0.000073    0
C       2.934780    0.623697    0
C       2.740920    1.220110    0
C       2.427020    1.763430    0
C       2.007900    2.229610    0
C       1.500060    2.598120    0
C       0.927076    2.853200    0
C       0.313848    2.983490    0
C      -0.313623    2.983700    0
# etc.
Numbers after the % specify the width of the field. A negative number means that the output should be left aligned (as in the first column). I have specified 6 decimal places for the floating point numbers.
Original answer, in case it is useful:
To ensure that there is a tab character between each of the columns of your input, you could use this awk script:
awk '{$1=$1}1' OFS="\t" data.txt > output.xyz
$1=$1 just forces awk to touch each line, which makes sure that the new Output Field Separator (OFS) is applied.
awk scripts are built up from a series of condition { action }. If no condition is supplied, the action is performed for every line. If a condition but no action is supplied, the default action is to print the line. 1 is a condition that always evaluates to true, so awk prints the line.
Note that even though the columns are all tab-separated, they are still not lined up because the content of each column is of a variable length.

Your data has already been ill-formatted and converted to strings. To correctly align the numeric and non-numeric data, you need to parse the individual fields into their respective data types (possibly using duck typing) before formatting with str.format:
def convert(st):
    try:
        return int(st)
    except ValueError:
        pass
    try:
        return float(st)
    except ValueError:
        pass
    return st

# st is assumed to hold the contents of the data file as a single string
for line in st.splitlines():
    print "{:8}{:12.5f}{:12.5f}{:5d}".format(*map(convert, line.split()))
C            2.99996     0.00007    0
C            2.93478     0.62370    0
C            2.74092     1.22011    0
C            2.42702     1.76343    0
C            2.00790     2.22961    0
C            1.50006     2.59812    0
C            0.92708     2.85320    0
C            0.31385     2.98349    0
C           -0.31362     2.98370    0
C           -0.92723     2.85319    0

Using this: awk '{printf "%s\t%10f\t%10f\t%i\n",$1,$2,$3,$4}' atoms
gives this output:
C 2.999960 0.000073 0
C 2.934780 0.623697 0
C 2.740920 1.220110 0
C 2.427020 1.763430 0
C 2.007900 2.229610 0
C 1.500060 2.598120 0
C 0.927076 2.853200 0
C 0.313848 2.983490 0
C -0.313623 2.983700 0
C -0.927229 2.853190 0
C -1.500300 2.598100 0
C -2.007320 2.229510 0
C -2.426860 1.763310 0
C -2.741190 1.220290 0
C -2.934370 0.623802 0
C -2.999920 -0.000056 0
C -2.934160 -0.623574 0
C -2.740900 -1.220220 0
C -2.427260 -1.763400 0
C -2.007230 -2.229410 0
C -1.499850 -2.598090 0
C -0.926830 -2.853140 0
C -0.313899 -2.983580 0
C 0.313630 -2.983560 0
C 0.927096 -2.853080 0
C 1.500050 -2.597920 0
C 2.007340 -2.229530 0
C 2.427300 -1.763390 0
C 2.740310 -1.220350 0
C 2.934410 -0.623647 0
Is this what you mean, or did I misunderstand?
Edit, as a side note: I used tabs (\t) for separation, but a space would do too. I used a field width of 10 for the floats, and I didn't verify your input length.

You can use string formatting to print values with consistent padding. For your case, you might write lines like this to the file:
>>> '%-12s %-12s %-12s %-12s\n' % ('C', '2.99996', '7.31001e-05', '0')
'C            2.99996      7.31001e-05  0           \n'
"%-12s" means "take the str() of the value and make it take up at least 12 characters left-justified.

Related

Is there a function to write certain values of a dataframe to a .txt file in Python?

I have a dataframe as follows:
Index  A  B  C  D  E  F
1      0  0  C  0  E  0
2      A  0  0  0  0  F
3      0  0  0  0  E  0
4      0  0  C  D  0  0
5      A  B  0  0  0  0
Basically I would like to write the dataframe to a txt file, such that every row consists of the index followed by the column names of its non-zero entries only.
For example:
txt file
1 C E
2 A F
3 E
4 C D
5 A B
The dataset is quite big, about 1k rows, 16k columns. Is there any way I can do this using a function in Pandas?
Take a matrix-vector multiplication between the boolean matrix generated by 'is this entry "0" or not' and the column names of the dataframe, and write it to a text file with to_csv (thanks to @Andreas' answer!):
df.ne("0").dot(df.columns + " ").str.rstrip().to_csv("text_file.txt")
where we right-strip the trailing space that the added " " leaves at the end of each entry.
If you don't want the name Index appearing in the text file, you can chain a rename_axis(index=None) to get rid of it i.e.,
df.ne("0").dot(df.columns + " ").str.rstrip().rename_axis(index=None)
and then to_csv as above.
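Put together, a minimal runnable sketch of this approach, using the example frame from the question (assuming the cells are the string "0", as above):
import pandas as pd

# Example frame from the question (entries are strings, zeros are "0").
df = pd.DataFrame(
    [["0", "0", "C", "0", "E", "0"],
     ["A", "0", "0", "0", "0", "F"],
     ["0", "0", "0", "0", "E", "0"],
     ["0", "0", "C", "D", "0", "0"],
     ["A", "B", "0", "0", "0", "0"]],
    index=[1, 2, 3, 4, 5],
    columns=list("ABCDEF"),
)

# Boolean mask of non-"0" cells, dotted with the column names.
result = df.ne("0").dot(df.columns + " ").str.rstrip()
print(result)  # 1 -> "C E", 2 -> "A F", 3 -> "E", 4 -> "C D", 5 -> "A B"
result.to_csv("text_file.txt")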
You can try this (replace '0' with 0 if those are numeric zeros instead of string zeros):
# Credits to Pygirl, who made the code even better.
import numpy as np

df.set_index('Index', inplace=True)
df = df.replace('0', np.nan)
df.stack().groupby(level=0).apply(list)
# Out[79]:
# variable
# 0 [C, E]
# 1 [A, F]
# 2 [E]
# 3 [C, D]
# 4 [A, B]
# Name: value, dtype: object
For the writing to text, you can use pandas as well:
df.to_csv('your_text_file.txt')
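If you want the space-separated format from the question rather than lists, the same idea can join each group into a single string before writing (a sketch building on the code above):
# Join the non-NaN column names of each row into one string, then write.
out = df.stack().groupby(level=0).apply(' '.join)
out.to_csv('your_text_file.txt')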
You could replace the string '0' with the empty string '', then do some string-list-join manipulation to get the final result. Finally, append each line to a text file. See the code:
import pandas as pd

df = pd.DataFrame([
    ['0','0','C','0','E','0'],
    ['A','0','0','0','0','F'],
    ['0','0','0','0','E','0'],
    ['0','0','C','D','0','0'],
    ['A','B','0','0','0','0']], columns=['A','B','C','D','E','F']
)
df = df.replace('0', '')
logfile = open('test.txt', 'a')
for i in range(len(df)):
    temp = ''.join(list(df.loc[i, :]))
    logfile.write(str(i + 1) + ' ' + ' '.join(list(temp)) + '\n')
logfile.close()
Output test.txt
1 C E
2 A F
3 E
4 C D
5 A B

Speed up the pattern match and replace in Python for a huge file

I have a very huge file (10 GB) which looks like this:
ref R A C G T
N 0 0 0 0 0
A 5 5 0 0 0
N 0 0 0 0 0
C 8 0 8 0 0
N 0 0 0 0 0
A 6 6 0 0 0
T 0 0 0 0 0
So for all the entries where R=0, if ref is A|C|G|T, replace the 0 with 25 in the respective column. I want something that looks like the output below:
ref R A C G T
A 5 5 0 0 0
C 8 0 8 0 0
A 6 6 0 0 0
T 0 0 0 0 25
Here is what I tried. It works fine but takes too much time. I wanted to know if there is any faster way to do it:
import pandas as pd

df = pd.read_csv("test", delimiter="\t", header=0)
for index, row in df.iterrows():
    if df.loc[index, 'R'] == 0:
        if df.loc[index, 'ref'] == "A":
            df.loc[index, 'A_pp'] = 25
        if df.loc[index, 'ref'] == "T":
            df.loc[index, 'T_pp'] = 25
        elif df.loc[index, 'ref'] == "C":
            df.loc[index, 'C_pp'] = 25
        elif df.loc[index, 'ref'] == "G":
            df.loc[index, 'G_pp'] = 25
df_filtered = df[df['ref'] != "N"]
df_filtered.to_csv('./test_formatted.txt', sep="\t")
The key is not to use python at all. This is an awk specialty and will be orders of magnitude faster done in awk alone, e.g.
Edit - Improved Efficiency
awk '
BEGIN { OFS="\t" }
FNR==1 {
    for (i=1; i<=NF; i++)
        h[$i] = ++n
    print
    next
}
$1~/[ACGT]/ {
    if ($h[$1] == 0)
        $h[$1] = 25
    print
}
' file
awk processes one record (line) at a time, with each field available as $1, ..., $NF (where NF is the number of fields). awk applies each of the rules you write in the order you write them. There are two special rules, BEGIN and END, that you can use for pre- and post-processing. All other rules have the form condition { commands }, where the condition can be a regex, or a numeric or string conditional.
In addition to the BEGIN rule above (which simply sets the Output Field Separator (OFS) to tab), there are two rules. The first only applies to FNR == 1 (file record number, i.e. line number 1) to process the header, storing each header column's position in the h array (indexed by the column name) and printing the header row. next simply skips all remaining rules and tells awk to start processing the next record (line).
The second rule matches when the first field ($1) matches the regex [ACGT], i.e. when field 1 is one of A, C, G or T (the [...] is a character class). If it is, the rule looks up in the h array which column number corresponds to that base and, if the value stored in that field ($h[$1]) is 0, sets it to 25. The line is then printed.
Example Use/Output
With your data in the file named gene you would have:
$ awk '
> BEGIN { OFS="\t" }
> FNR==1 {
>     for (i=1; i<=NF; i++)
>         h[$i] = ++n
>     print
>     next
> }
> $1~/[ACGT]/ {
>     if ($h[$1] == 0)
>         $h[$1] = 25
>     print
> }
> ' gene
ref R A C G T
A 5 5 0 0 0
C 8 0 8 0 0
A 6 6 0 0 0
T 0 0 0 0 25
Look things over and let me know if I understood your replacement correctly or if not, drop a comment and I'm happy to help further.
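As an aside, if staying in pandas is preferred, the row-by-row iterrows loop from the question can also be vectorized. A rough sketch (it assumes the columns to update are literally named A, C, G and T, as in the sample data, rather than the _pp names used in the question's code):
import pandas as pd

df = pd.read_csv("test", delimiter="\t", header=0)

# Rows where R == 0 and ref is one of A/C/G/T.
mask = (df["R"] == 0) & df["ref"].isin(list("ACGT"))

# For each base, set the matching column to 25 on the selected rows.
for base in "ACGT":
    df.loc[mask & (df["ref"] == base), base] = 25

df[df["ref"] != "N"].to_csv("test_formatted.txt", sep="\t", index=False)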

Using itertools to generate an exponential binary space

I am interested in generating all binary combinations of N variables without having to implement a manual loop iterating N times over N and each time looping over N/2 and so on.
Do we have such functionality in python?
E.g:
I have N binary variables:
pool=['A','B','C',...,'I','J']
len(pool)=10
I would like to generate 2^10=1024 space out of these such as:
[A B C ... I J]
iter0 = 0 0 0 ... 0 0
iter1 = 0 0 0 ... 0 1
iter2 = 0 0 0 ... 1 1
...
iter1022 = 1 1 1 ... 1 0
iter1023 = 1 1 1 ... 1 1
You can see that there are no repetitions here; each variable takes exactly one value in each of these iter sequences. How can I do that using Python's itertools?
itertools.product with the repeat parameter is the simplest answer:
for A, B, C, D, E, F, G, H, I, J in itertools.product((0, 1), repeat=10):
The values of each variable will cycle fastest on the right, and slowest on the left, so you'll get:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 1 0 0
etc. This may be recognizable to you: It's just the binary representation of an incrementing 10 bit number. Depending on your needs, you may actually want to just do:
for i in range(1 << 10):
then mask i with 1 << 9 to get the value of A, 1 << 8 for B, and so on down to 1 << 0 (that is, 1) for J. If the goal is just to print them, you can even get more clever, by binary stringifying and then using join to insert the separator:
for i in range(1 << 10):
    print(' '.join('{:010b}'.format(i)))
    # Or letting print insert the separator:
    print(*'{:010b}'.format(i))  # If separator isn't space, pass sep='sepstring'
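For the masking variant mentioned above, here is a small sketch of how the bits map back to the variables (N = 10 and the pool names are taken from the question):
N = 10
pool = list("ABCDEFGHIJ")  # the question's variable names

for i in range(1 << N):
    # Bit (N - 1 - k) of i holds the value of pool[k]: A is the most
    # significant bit, J the least significant.
    values = [(i >> (N - 1 - k)) & 1 for k in range(N)]
    print(' '.join(str(v) for v in values))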

Search for a string and get the two values below it in the same column in Python

"detail" has below contents:
1 2 3 4
a b c
1 2 3 4 5 6 7 8 status 10
a b c d e f g h up x
a b c d e f g h Idle y
What I am trying to do is get the values below the status string in the detail contents (up and idle, or whatever is in the next two lines below the status string, in the same column). In this case detail contains up and Idle.
I have tried the method below in my code:
var1, var2 = islice(line, 2)
but I am not able to get the two lines below from the detail contents.
Please, can anyone help me with the best method to achieve this?
Here is the code that I tried:
from itertools import islice
import string

detail = """1 2 3 4 5 6 7 8 status 10
a b c d e f g h up x
a b c d e f g h idle y"""
print detail
for line in detail.split("\n"):
    line = ' '.join(line.split())
    line = line.split(" ")
    print line
    if len(line) >= 9:
        if line[8] == "status":
            var1, var2 = islice(line, 2)
            if any("idle" in s for s in var1.lower()) or any("never" in s for s in var1.lower()):
                print var1[8]
            else:
                print var1[8]
            if any("idle" in s for s in var2.lower()) or any("never" in s for s in var2.lower()):
                print var2[8]
            else:
                print var2[8]
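One way this could be done (a sketch based on the sample data above): find the column index of "status" in the header line, then read that same column from the next two lines.
detail = """1 2 3 4 5 6 7 8 status 10
a b c d e f g h up x
a b c d e f g h idle y"""

rows = [line.split() for line in detail.splitlines()]
for i, fields in enumerate(rows):
    if "status" in fields:
        col = fields.index("status")
        # Values in the same column on the next two lines.
        var1, var2 = rows[i + 1][col], rows[i + 2][col]
        print("{} {}".format(var1, var2))  # up idle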

Is there a z.fill function in R?

I have a data frame
df=data.frame(f=c('a','ab','abc'),v=1:3)
and make a new column with:
df$c=paste(df$v,df$f,sep='')
the result is
> df
    f v    c
1   a 1   1a
2  ab 2  2ab
3 abc 3 3abc
I would like column c to be in this format:
> df
    f v    c
1   a 1 1  a
2  ab 2 2 ab
3 abc 3 3abc
such that the total length of the concatenated values is a fixed number (in this case 4 characters), and to fill it with a chosen character, such as | (in this case a space).
Is there a function like this in R? I think it is similar to the zfill function in Python, but I am not a Python programmer, and would prefer to stay in R as opposed to switching between languages for processing. Ultimately, I am creating a supervariable of 10 columns, and think this would help in downstream processing.
I guess it would be in the paste function, but I am not sure how to 'fill a factor' so that it is of a fixed width.
You can use the format() function to pretty print the values of your column. For example:
> format(df$f, width = 3, justify = "right")
[1] " a" " ab" "abc"
So your code should be:
df <- within(df, {
c <- paste0(v, format(f, width = 3, justify = "right"))
})
df
The result:
> df
    f v    c
1   a 1 1  a
2  ab 2 2 ab
3 abc 3 3abc
You can use the formatC function as follows:
df$c <- paste(df$v, formatC(as.character(df$f), width = 3, flag = " "), sep = "")
df
    f v    c
1   a 1 1  a
2  ab 2 2 ab
3 abc 3 3abc
DATA
df <- data.frame(f = c('a','ab','abc'), v=1:3)
