How can I split a column in a specific range? - python

I'm working with protein trajectories and I've got a long data frame (a file with one column and 600.000 lines).
This is an example:
100
100
0
100
100
...
n=600.000
What I want is to split this data every 3000 lines, putting each chunk in a new column alongside, as in this example:
Col1 Col2 Col3 Col4 Col...200:
n=1 n=3001 n=6001 n=9001 ...
0 0 0 0 ...
0 0 0 0 ...
100 100 100 100 ...
... ... ... ... ...
n=3000 n=6000 n=9000 n=12000 n=600.000
n= line number.
Is there any way to do this in R or bash?
Thank you very much in advance.
EDIT: I'm using this script in python to generate that column:
from decimal import Decimal

i = 1
while i <= 15:
    output = open('cache/distances_' + str(i) + '.dat.results', 'w')
    with open('cache/distances_medias_' + str(i) + '.dat', 'r') as f:
        for line in f:
            columns = line.split(' ')
            # Write 100 when column 0 is <= 2.5 and column 1 lies strictly between 120 and 180
            if Decimal(columns[0]) <= 2.5 and 120 < Decimal(columns[1]) < 180:
                output.write("100\n")
            else:
                output.write("0\n")
    output.close()
    i += 2
Is there any way to modify the script so that, when it reaches line 3000, it starts a new column?

I am not sure I understand your example, but you should be able to use a combination of split and paste:
$ cat filetosplit
1
2
3
4
5
6
7
8
9
10
$ split filetosplit "split." -l 3 -d ; paste split*
1 4 7 10
2 5 8
3 6 9
The split command will generate files of 3 lines each (you can change this to 3000). The paste command will put them all side by side. You can use sed to add a header with column names and the initial line number.

In R you can just add a dim attribute:
dim(your_vector) <- c(3000, 600000/3000)
This changes the object's class to matrix, so if you need a data frame, you will also need:
df <- data.frame(your_vector)

With awk:
awk -v n=5 '{data[(NR-1)%n FS int((NR-1)/n)]=$0}
     END {cols=NR/n;
          for (i=0;i<n;i++) {
              for (j=0;j<cols;j++) {
                  printf "%s%s", data[i FS j], FS
              }
              print ""
          }
     }'
That is: store all the content in a kind-of matrix and then loop accordingly.
Sample outputs
$ seq 15 | awk -v n=3 '{data[(NR-1)%n FS int((NR-1)/n)]=$0} END {cols=NR/n; for (i=0;i<n;i++) {for (j=0;j<cols;j++) {printf "%s%s", data[i FS j], FS} print ""}}'
1 4 7 10 13
2 5 8 11 14
3 6 9 12 15
$ seq 15 | awk -v n=7 '{data[(NR-1)%n FS int((NR-1)/n)]=$0} END {cols=NR/n; for (i=0;i<n;i++) {for (j=0;j<cols;j++) {printf "%s%s", data[i FS j], FS} print ""}}'
1 8 15
2 9
3 10
4 11
5 12
6 13
7 14
$ seq 15 | awk -v n=5 '{data[(NR-1)%n FS int((NR-1)/n)]=$0} END {cols=NR/n; for (i=0;i<n;i++) {for (j=0;j<cols;j++) {printf "%s%s", data[i FS j], FS} print ""}}'
1 6 11
2 7 12
3 8 13
4 9 14
5 10 15
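Since the question is tagged python, here is a minimal sketch of the same column-major layout in plain Python. The function name to_columns and the fill value are my own choices for illustration, not something from the answers above; for the real data you would read the 600.000 values from the results file and use n=3000.

```python
from itertools import zip_longest


def to_columns(values, n, fill=""):
    """Arrange a flat sequence into rows so that each column holds n consecutive values."""
    # Cut the flat list into chunks of n values; each chunk becomes one column.
    chunks = [values[i:i + n] for i in range(0, len(values), n)]
    # Transpose the chunks into rows, padding the last (possibly shorter) column.
    return [list(row) for row in zip_longest(*chunks, fillvalue=fill)]


rows = to_columns([str(v) for v in range(1, 11)], 3)
for row in rows:
    print(" ".join(row).rstrip())
```

With the ten sample values and n=3 this gives the same layout as the split/paste example: the first row holds 1 4 7 10, and the last column is padded.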

Related

How is print('\r') or print(' ') giving me the output?

We were asked to print the following output:
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6
7 7 7 7
8 8 8
9 9
10
I understand that it would require two loops so I tried this:
a = int(input())
i = a
f = 1
while i>0:
    for j in range(i):
        print(f,end=' ')
    f += 1
    i -= 1
    print('\r')
With this I am getting the desired output, but as soon as I remove the last line of print('\r') the output becomes something like this:
1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 7 7 7 7 8 8 8 9 9 10
The desired output also comes out when I used print(' ') instead of print('\r'), I don't understand why this is happening?
Ps: I am a noob coder, starting my freshman year, so please go easy on me, if the formatting is not up to the mark, or the code looks bulky.
Probably not helping you so much but the following code produces the expected output:
a = 10
for i, j in enumerate(range(a, 0, -1), 1):
    print(*[i] * j)
# Output:
1 1 1 1 1 1 1 1 1 1 # i=1, j=10
2 2 2 2 2 2 2 2 2 # i=2, j=9
3 3 3 3 3 3 3 3 # i=3, j=8
4 4 4 4 4 4 4 # i=4, j=7
5 5 5 5 5 5 # i=5, j=6
6 6 6 6 6 # i=6, j=5
7 7 7 7 # i=7, j=4
8 8 8 # i=8, j=3
9 9 # i=9, j=2
10 # i=10, j=1
The two important parameters here are sep (when you print a list) and end as argument of print. Let's try to use it:
a = 10
for i, j in enumerate(range(a, 0, -1), 1):
    print(*[i] * j, sep='-', end='\n\n')
# Output:
1-1-1-1-1-1-1-1-1-1
2-2-2-2-2-2-2-2-2
3-3-3-3-3-3-3-3
4-4-4-4-4-4-4
5-5-5-5-5-5
6-6-6-6-6
7-7-7-7
8-8-8
9-9
10
Update
Step by step:
# i=3; j=8
>>> print([i])
[3]
>>> print([i] * j)
[3, 3, 3, 3, 3, 3, 3, 3]
# print takes an arbitrary number of positional arguments.
# So '*' unpacks the list into positional arguments (like *args, **kwargs)
# Each one is printed, separated by the sep keyword (default is ' ')
>>> print(*[i] * j)
3 3 3 3 3 3 3 3
To make it all easier and prevent errors, you can simply do this:
n = 10
for i in range(1, n + 1):
    txt = str(i) + " "  # Generate the characters with space between
    print(txt * (n + 1 - i))  # Print the characters the inverse amount of times i.e. 1 10, 10 1
Where it generates the text which is simply the number + a space, then prints it out the opposite amount of times, (11 - current number), i.e. 1 ten times, 10 one time.
I suggest using 2 or 4 spaces for indenting. Let's take a look:
a = int(input())
i = a
f = 1
while i>0:
    for j in range(i):
        print(f,end=' ')
    f += 1
    i -= 1
    print('\r')
Notice the print(f,end=' ') within the inner loop. The end=' ' bit is important because print() appends a trailing newline \n to every call by default. Using end=' ' appends a space instead.
Now take a look at print('\r'). It does not specify end=' ', so it does still append a newline after each call to print. The fact that you additionally print a \r is inconsequential in this case. You could also just do print().
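To see exactly which characters each call emits, one option is to capture the output in a string buffer and look at its repr. This is only a small sketch; the helper name captured is made up for the illustration.

```python
import io


def captured(*args, **kwargs):
    """Return exactly what print() would write for the given arguments."""
    buf = io.StringIO()
    print(*args, file=buf, **kwargs)
    return buf.getvalue()


with_cr = captured('\r')            # carriage return, then the default newline
plain = captured()                  # only the default newline
with_space_end = captured(1, end=' ')  # the value followed by a space, no newline

print(repr(with_cr), repr(plain), repr(with_space_end))
```

The first two results both end in a newline, which is why print('\r'), print(' ') and print() all break the line in the loop above.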
You can do it this way:
rows = 10
b = 0
for i in range(rows, 0, -1):
    b += 1
    for j in range(1, i + 1):
        print(b, end=' ')
    print('\r')
No need for multiple loops.
for i in range(1,11):
    # concatenate number + a space repeatedly, on the same line
    # yes, there is an extra space at the end, which you won't see ;-)
    print(f"{i} " * (11-i))
output:
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6
7 7 7 7
8 8 8
9 9
10
As to what's happening with your code...
A basic Python print prints on a line, meaning that it ends with a line feed (which moves it to the next line).
So, taking your word for it, by the time you reach the following point you've done all the hard work of printing, say, the first line of 10 ones separated by spaces.
#your code
f += 1
i -= 1
Now, so far you've avoided that line feed by changing the end parameter to print so that it doesn't end with a newline. So you have:
1 1 1 1 1 1 1 1 1 1
And still no line feed. Great!
But if you now start printing 2 2 2 2 2 2 2 2 2 , it will just get added to... the end of the previous line, without line feed.
So to force a line feed, you print anything you want, but without setting the end parameter, so that print now ends with the line feed it uses by default.
Example:
#without line feed
print("1 " * 3, end=' ')
print("2 " * 2, end=' ')
output:
1 1 1 2 2
Let's try printing something, anything, without end=' ':
print("1 " * 3, end=' ')
#now add a line by a print that is NOT using `end = ' '`
print("!")
print("2 " * 2, end=' ')
output:
1 1 1 !
2 2
OK, so now we have a line feed after ! so you jump to the next line when printing the 2s. But you don't want to see anything after the 1s.
Simples, print anything that is invisible.
print("1 " * 3, end=' ')
#now add a line by a print, but using a non-visible character.
#or an empty string. Tabs, spaces, etc... they will all work
print(" ")
print("2 " * 2, end=' ')
output:
1 1 1
2 2
This would also work:
print("1 " * 3, end=' ')
#we could also print a linefeed and end without one...
print("\n", end="")
print("2 " * 2, end=' ')

Extract data from alternate rows with python

I want to extract the number corresponding to O2H from the following file format (The delimiter used here is space):
# Timestep No_Moles No_Specs SH2 S2H4 S4H6 S2H2 H2 S2H3 OSH2 Mo1250O3736S57H111 OSH S3H6 OH2 S3H4 O2S SH OS2H3
144500 3802 15 3639 113 1 10 18 2 7 1 3 2 1 2 1 1 1
# Timestep No_Moles No_Specs SH2 S2H4 S2H2 H2 S2H3 OSH2 Mo1250O3733S61H115 OS2H2 OSH S3H6 OS O2S2H2 OH2 S3H4 SH
149000 3801 15 3634 114 11 18 2 7 1 1 2 2 1 1 4 2 1
# Timestep No_Moles No_Specs SH2 OS2H3 S3H Mo1250O3375S605H1526 OS S2H4 O3S3H3 OSH2 OSH S2H2 H2 OH2 OS2H2 S2H O2S3H3 SH O4S4H4 OH O2S2H O6S5H3 O6S5H5 O3S4H4 O2S3H2 O3S4H3 OS3H3 O3S2H2 O4S3H4 O3S3H O6S4H5 OS4H3 O3S2H O5S4H4 OS2H O2SH2 S2H3 O4S3H3 O3S3H4 O O5S3H4 O5S3H3 OS3H4 O2S4H4 O4S4H3 O2SH O2S2H2 O5S4H5 O3S3H2 S3H6
589000 3269 48 2900 11 1 1 47 11 1 81 74 26 25 21 17 1 3 5 2 3 3 1 1 2 2 1 2 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Timestep No_Moles No_Specs SH2 Mo1250O3034S578H1742 OH2 OSH2 O3S3H5 OS2H2 OS OSH O2S3H2 OH O3S2H2 O6S6H4 SH O2S2H2 S2H2 OS2H H2 OS2H3 O5S4H2 O7S6H5 S3H2 O2SH2 OSH3 O7S6H4 O2S2H3 O6S5H3 O2SH O4S4H O3S2H3 S2 O2S2H S5H3 O7S4H4 O3S3H OS3H OS4H O5S3H3 S3H O17S12H9 O3S3H2 O7S5H4 O4SH3 O3S2H O7S8H4 O3S3H3 O11S9H6 OS3H2 S4H2 O10S8H6 O4S3H2 O5S5H4 O6S8H4 OS2 OS3H6 S3H3
959500 3254 55 2597 1 83 119 1 46 59 172 4 3 4 1 27 7 38 6 23 3 1 2 3 5 3 1 2 1 2 1 1 6 3 1 1 2 1 1 1 1 1 3 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1
That is, all the alternate rows contain the corresponding data of its previous row.
And I want the output to look like this
1
4
21
83
How it should work:
1 (14th number on 2nd row which corresponds to 14th word of 1st row i.e. O2H)
4 (16th number on 4th row which corresponds to 16th word of 3rd row i.e. O2H)
21 (15th number on 6th row which corresponds to 15th word of 5th row i.e. O2H)
83 (6th number on 8th row which corresponds to 6th word of 7th row i.e. O2H)
I was trying to extract it using regex but could not do it. Can anyone please help me to extract the data?
You can easily parse this into a dataframe and select the desired column to fetch the values.
Assuming your data looks like the sample you've provided, you can try the following:
import pandas as pd

with open("data.txt") as f:
    lines = [line.strip() for line in f.readlines()]

# Use the longest header line as the column names for every data row
header = max(lines, key=len).replace("#", "").split()
# Data rows are every second line, starting from the second one
df = pd.DataFrame([line.split() for line in lines[1::2]], columns=header)
print(df["OH2"])
df.to_csv("parsed_data.csv", index=False)
Output:
0 1
1 11
2 1
3 83
Name: OH2, dtype: object
Dumping this to a .csv would yield:
I think you want OH2 and not O2H, and that it's a typo. Assuming this:
(1) Iterate over every single line.
(2) Take into account only the even lines ( if (line_counter % 2) == 0: continue ).
(3) Split the line on spaces and, using a counter variable, find the index of OH2 in the even line. Assume it is 14 in the first line.
(4) Access the following line ( +1 index ), split it on spaces, and take the element at the index you found in point (3).
Since you haven't posted any code, I assumed your problem was more about finding a way to achieve this than about coding, so I wrote you the algorithm.
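The steps above can be sketched as follows. This is only one way to do it: the function name extract_species is made up, and the sample lines are a shortened, invented stand-in for the real file. It assumes every header line starts with a separate '#' token and is immediately followed by its data line, and it looks up the position of OH2 in each header separately, since that position changes from header to header in the posted sample.

```python
def extract_species(lines, species="OH2"):
    """Collect the value of `species` from every header/data pair of lines."""
    values = []
    header = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0] == "#":
            header = tokens[1:]  # drop the leading '#', keep the column names
        elif header is not None and species in header:
            # The data line has no '#', so positions line up with `header`.
            values.append(tokens[header.index(species)])
    return values


# Invented, shortened sample in the same layout as the question's file.
sample = [
    "# Timestep No_Moles No_Specs SH2 S2H4 OH2",
    "144500 3802 3 3639 113 1",
    "# Timestep No_Moles No_Specs SH2 OH2 S2H4",
    "149000 3801 3 3634 4 114",
]
result = extract_species(sample)
print("\n".join(result))
```

For a real file you could pass the open file object directly, e.g. extract_species(open("input.txt")), since the function only iterates over lines.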
Thank you, everyone, for the help, I figured out the solution
i=0
j=1
with open ('input.txt','r') as fin:
    with open ('output.txt','w') as fout:
        for lines in fin: #Iterating over each lines
            lists = lines.split() #Splits each line in list of words
            try:
                if i%2 == 0: #Odd lines
                    index_of_OH2 = lists.index('OH2')
                    #print(index_of_OH2)
                i=i+1
                if j%2 == 0: #Even lines
                    number_of_OH2 = lists[index_of_OH2-1]
                    print(number_of_OH2 + '\n')
                    fout.write(number_of_OH2 + '\n')
                j=j+1
            except:
                pass
Output:
1
4
21
83
try:, except: pass added so that if OH2 is not found in that line it moves on without error

Appending To Files Challenge

I wrote a program to append the times table to our poem in sample.txt.
This is all of my code:
numbers = 1
for i in range(2, 12):
    while 13 >= numbers > 0:
        multiply = numbers * i
        print('| {0} Times {1} is {2} '.format(numbers, i, multiply))
        numbers += 1
    print('=' * 21)
with open('times_table.txt', 'w') as times:
    for table in times:
        print(table, file=times)
and the output is:
| 1 Times 2 is 2
| 2 Times 2 is 4
| 3 Times 2 is 6
| 4 Times 2 is 8
| 5 Times 2 is 10
| 6 Times 2 is 12
| 7 Times 2 is 14
| 8 Times 2 is 16
| 9 Times 2 is 18
| 10 Times 2 is 20
| 11 Times 2 is 22
| 12 Times 2 is 24
| 13 Times 2 is 26
========================================
But because of the code in the last lines, which is meant to append to the file, I'm getting the error below:
for table in times:
io.UnsupportedOperation: not readable
So, in the end, I don't know how to append this times table to the sample.txt file.
I'd really appreciate it if you could help me with this.
If you want to append to the file, you should use 'a' instead of 'w'. You can also write the contents to the file directly:
with open('times_table.txt', 'a') as file:
    for i in range(2, 12):
        numbers = 1  # restart the multiplier for each table
        while 13 >= numbers > 0:
            multiply = numbers * i
            file.write(f'| {numbers} Times {i} is {multiply}\n')
            numbers += 1
        file.write('=' * 21 + '\n')
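As for the error itself: a file opened with 'w' or 'a' is write-only, so looping over it with for ... in tries to read from it and raises io.UnsupportedOperation. The sketch below reproduces that and then appends a few lines. The temporary path and the poem line are placeholders I made up so the example can run anywhere.

```python
import io
import os
import tempfile

# Hypothetical scratch file standing in for sample.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

with open(path, "w") as f:
    f.write("a line of the poem\n")

# Iterating over a write-only file object fails with "not readable".
try:
    with open(path, "a") as f:
        for line in f:
            pass
    error_name = None
except io.UnsupportedOperation as exc:
    error_name = type(exc).__name__

# Appending keeps the existing content and adds new lines at the end.
with open(path, "a") as f:
    for n in range(1, 4):
        f.write(f"| {n} Times 2 is {n * 2}\n")

with open(path) as f:
    contents = f.read().splitlines()

print(error_name)
print(contents)
```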

MatLab (or any other language) to convert a matrix or a csv to put 2nd column values to the same row if 1st column value is the same?

So for example I have
1st column | 2nd column
1 1
1 3
1 9
2 4
2 7
I want to convert it to
1st column | 2nd column | 3rd column | 4th column
1 1 3 9
2 4 7 3
The (3,4) element should be empty.
I can do it by Matlab using for and if but it takes too much time for huge data, so I need a more elegant and brilliant idea.
I prefer Matlab but other languages are ok. (I can export the matrix to csv or xlsx or txt and use the other languages, if that language can solve my problem.)
Thank you in advance!
[Updates]
If
A = [2 3 234 ; 2 44 33; 2 12 22; 3 123 99; 3 1232 45; 5 224 57]
1st column | 2nd column | 3rd column
2 3 234
2 44 33
2 12 22
3 123 99
3 1232 45
5 224 57
then running
[U ix iu] = unique(A(:,1) ); r= accumarray( iu, A(:,2:3), [], @(x) {x'} )
will show me the error
Error using accumarray
Second input VAL must be a vector with one element for each row in SUBS, or a
scalar.
I want to make
1st col | 2nd col | 3rd col | 4th col | 5th col | 6th col| 7th col
2 3 234 44 33 12 22
3 123 99 1232 45
5 224 57
How can I do this? Thank you in advance!
Use accumarray with a custom function
>> r = accumarray( A(:,1), A(:,2), [], @(x) {x'} ); %//'
r =
[1x3 double]
[1x2 double]
>> r{1}
ans =
1 3 9
>> r{2}
ans =
4 7
Update:
Converting cell r to a matrix B (accommodating further requests in comments):
>> [U ix iu] = unique( A(:,1) ); % see EitantT's comment
>> r = accumarray( iu, A(:,2), [], @(x) {x'} );
>> n = cellfun( @numel, r ); % find num elements in each row - needed for max
>> mx = max(n);
>> pad = 555555; % padding value
>> r = cellfun( @(x) [x pad*ones(1,mx - numel(x))], r, 'uni', 0 );
>> B = vertcat( r{:} ); % construct B from padded rows of r
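Since other languages were said to be acceptable, here is a rough Python sketch of the same group-and-pad idea for the three-column case from the update. The function name group_rows and the None padding are my own choices, and the sketch assumes rows with the same first value are already next to each other, as in the example matrix A.

```python
from itertools import groupby


def group_rows(rows, pad=None):
    """Concatenate the trailing values of consecutive rows that share a first value."""
    grouped = []
    # groupby only merges neighbouring rows, hence the sorted-input assumption.
    for key, members in groupby(rows, key=lambda r: r[0]):
        merged = [key]
        for row in members:
            merged.extend(row[1:])
        grouped.append(merged)
    # Pad every row to the length of the longest one.
    width = max(len(g) for g in grouped)
    return [g + [pad] * (width - len(g)) for g in grouped]


A = [[2, 3, 234], [2, 44, 33], [2, 12, 22],
     [3, 123, 99], [3, 1232, 45], [5, 224, 57]]
B = group_rows(A)
for row in B:
    print(row)
```

If the input is not sorted by its first column, sorting it first (for example with sorted(rows, key=lambda r: r[0])) would be needed before grouping.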

Comparing 2 files line by line

I have 2 files of the following form:
file1:
work1
7 8 9 10 11
1 2 3 4 5
6 7 8 9 10
file2:
work2
2 3 4 5 5
2 4 7 8 9
work1
7 8 9 10 11
1 2 4 4 5
6 7 8 9 10
work3
1 7 8 9 10
Now I want to compare the two files and, wherever the header (say work1) is equal, compare the subsequent sections and print the line at which a difference is found. E.g.
work1 (file1)
7 8 9 10 11
1 2 3 4 5
6 7 8 9 10
work1 (file2)
7 8 9 10 11
1 2 4 4 5
6 7 8 9 10
Now I want to print the line where difference occurs i.e. "1 2 4 4 5"
For doing so I have written the following code:
with open("file1",) as r, open("file2") as w:
    for line in r:
        if "work1" in line:
            for line1 in w:
                if "work1" in line1:
                    print "work1"
However, from here on I am confused about how to read both files in parallel. Can someone please help me with this? After matching the "work1" headers, I don't see how I should read the two files in parallel.
You would probably want to try out itertools module in Python.
It contains a function called izip that can do what you need, along with a function called islice. You can iterate through the second file until you hit the header you were looking for, and you could slice the header up.
Here's a bit of the code.
from itertools import *
w = open('file2')
for (i,line) in enumerate(w):
    if "work1" in line:
        iter2 = islice(open('file2'), i, None, 1) # Starts at the correct line
f = open('file1')
for (line1,line2) in izip(f,iter2):
    print line1, line2 # Place your comparisons of the two lines here.
You're guaranteed now that on the first run through of the loop you'll get "work1" on both lines. After that you can compare. Since f is shorter than w, the iterator will exhaust itself and stop once you hit the end of f.
Hopefully I explained that well.
EDIT: Added import statement.
EDIT: We need to reopen file2. This is because iterating through iterables in Python consumes the iterable. So, we need to pass a brand new one to islice so it works!
with open('f1.csv') as f1, open('f2.csv') as f2 :
    i=0
    break_needed = False
    while True :
        r1, r2 = f1.readline(), f2.readline()
        if len(r1) == 0 :
            print "eof found for f1"
            break_needed = True
        if len(r2) == 0 :
            print "eof found for f2"
            break_needed = True
        if break_needed :
            break
        i += 1
        if r1 != r2 :
            print " line %i"%i
            print "file 1 : " + r1
            print "file 2 : " + r2
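The snippets above are written for Python 2 (print statement, izip). For Python 3, one possible approach is to read each file into sections keyed by header and then compare the sections that share a header. This is only a sketch: the function names are made up, and it assumes a header is any line starting with "work", which holds for the sample but may not hold for real data.

```python
def read_sections(lines, is_header=lambda s: s.startswith("work")):
    """Group lines under the most recent header line (header test is an assumption)."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if is_header(line):
            current = sections.setdefault(line, [])
        elif current is not None:
            current.append(line)
    return sections


def differing_lines(lines1, lines2):
    """Return (header, line_from_1, line_from_2) for lines that differ in shared sections."""
    s1, s2 = read_sections(lines1), read_sections(lines2)
    diffs = []
    for header in s1:
        if header in s2:
            for a, b in zip(s1[header], s2[header]):
                if a != b:
                    diffs.append((header, a, b))
    return diffs


file1 = ["work1", "7 8 9 10 11", "1 2 3 4 5", "6 7 8 9 10"]
file2 = ["work2", "2 3 4 5 5", "2 4 7 8 9",
         "work1", "7 8 9 10 11", "1 2 4 4 5", "6 7 8 9 10",
         "work3", "1 7 8 9 10"]
diffs = differing_lines(file1, file2)
for header, a, b in diffs:
    print(header, "|", a, "|", b)
```

With real files, open file objects can be passed in place of the lists. Note that zip stops at the shorter section, so extra trailing lines in one section are not reported.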
