Add columns from a file to a specific position in another file - python

I have 3 files. They all have IDs in their header. What I want to do is find the intersecting IDs (file 1 is the template) and then copy the columns with the matching ID behind that ID in the template file.
Here is an example:
File 1 is the template:
name 123 124 125 128 131 145 156
rdt4 35 12 23 21 36 34 37
gtf2 24 18 18 29 26 12 40
hzt7 40 23 26 25 13 21 28
File 2:
name 123 124 125 126 127 128 131 132 133 145 156
rdt4 F F F T T F T T T F T
gtf2 F F F T T F T T T F T
hzt7 F F F T T F T T T F T
File 3:
name 123_a 123_b 123_c 124_a 124_b 124_c 125_a 125_b 125_c 126_a 126_b 126_c 127_a 127_b 127_c 128_a 128_b 128_c and so on
rdt4 0,087 0,265 0,632 0,220 0,851 0,271 0,436 0,148 0,080 0,899 0,636 0,467 0,508 0,460 0,393 0,689 0,427 0,798
gtf2 0,770 0,971 0,231 0,969 0,494 0,181 0,989 0,155 0,351 0,131 0,204 0,553 0,581 0,138 0,982 0,287 0,702 0,522
hzt7 0,185 0,535 0,093 0,807 0,487 0,786 0,886 0,905 0,966 0,283 0,490 0,190 0,688 0,714 0,577 0,643 0,476 0,738
The final file should look like this:
name 123 123 123_b 124 124 124_b 125 125 125_b 128 128 128_b 131 131 131_b 145 145 145_b 156 156 156_b
rdt4 35 F 0,265 12 F 0,851 23 F 0,148 21 F 0,427 36 T 34 F 37 T
gtf2 24 F 0,971 18 F 0,494 18 F 0,155 29 F 0,702 26 T 12 F 40 T
hzt7 40 F 0,535 23 F 0,487 26 F 0,905 25 F 0,476 13 T 21 F 28 T
Note: I didn't type out all of File 3 because it has the same IDs as file 2, but in file 3 every ID has 3 columns and I only need one of them (column b in the example).
What I tried so far:
I started first to do everything only with file 1 and file 2.
I copied the IDs to a new list and then looked up the positions of these IDs in file 2 to extract the data. But this seems to be very tricky (at least for me). The appending works so far, but the problem is that every list stored in the list final is the same. It would be nice if you could help me with this.
This is my code so far:
try:
    Expr_Matrix_1="file1.txt"
    #Expr_Matrix_1=raw_input('Name file with expression data: ')
    Expr_Matrix_2=open(Expr_Matrix_1)
    Expr_Matrix_3=open(Expr_Matrix_1)
except:
    print 'This is a wrong filename!'
    exit()

try:
    Probe_Detect_1="file2.txt"
    #Probe_Detect_1=raw_input('Name of file with probe detection: ')
    Probe_Detect_2=open(Probe_Detect_1)
    Probe_Detect_3=open(Probe_Detect_1)
except:
    print 'This is a wrong filename!'
    exit()

find_list=list()
for b, line2 in enumerate(Expr_Matrix_2):
    line2 = line2.rstrip()
    line2 = line2.split("\t")
    if b == 0:
        for item in line2:
            find_list.append(item)
find_list=find_list[7:]

find_list2=list()
for i, line in enumerate(Probe_Detect_2):
    line = line.rstrip()
    line = line.split("\t")
    if i == 0:
        for item in find_list:
            find_list2.append(line.index(item))
#print find_list2

index1=8
final=list()
for b, line2 in enumerate(Expr_Matrix_3):
    line2 = line2.rstrip()
    line2 = line2.split("\t")
    for c, line in enumerate(Probe_Detect_3):
        line = line.rstrip()
        line = line.split("\t")
        if line2[b]==line[c]:
            for item in find_list2:
                if len(line2)<1551:
                    line2.insert(index1, line[item])
                    index1=index1+2
                final.append(line2)
print final[1]
The first ID column in file 1 is column 7, which is why I used 7 for slicing.
The 1551 is the number of rows to which the data should be copied, but I think this is a completely wrong approach. However, I wanted to show you my attempt!
Another note: All files start with the name column, but between this column and the first ID column there are some columns which shouldn't be considered. Because file 1 is the template, those columns should also be in the final file.
What is the solution?
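Not the asker's final code, just a minimal sketch of one way this could be done. It assumes tab-separated files named file1.txt, file2.txt and file3.txt, that the first ID sits in column 7 of file 1 as described above, and that the data rows appear in the same order in all three files. (One likely issue with the original attempt: the inner for c, line in enumerate(Probe_Detect_3) loop exhausts that file object after the first outer iteration, so later rows are never processed.)

    # Sketch: merge file2 flags and file3 "_b" values behind each template ID.
    def read_table(path):
        with open(path) as f:
            return [line.rstrip("\n").split("\t") for line in f]

    file1 = read_table("file1.txt")   # template
    file2 = read_table("file2.txt")   # T/F flags
    file3 = read_table("file3.txt")   # three sub-columns per ID, "_b" wanted

    ID_START = 7                      # first ID column in the real file 1 (1 in the toy example)
    ids = file1[0][ID_START:]         # template IDs, e.g. 123, 124, ...

    # Map header -> column position in file2 / file3 (file3 headers look like "123_b").
    col2 = {h: i for i, h in enumerate(file2[0])}
    col3 = {h: i for i, h in enumerate(file3[0])}

    header = file1[0][:ID_START]
    for id_ in ids:
        header += [id_, id_, id_ + "_b"]

    rows = [header]
    # Assumes the three files list the same row names in the same order.
    for r1, r2, r3 in zip(file1[1:], file2[1:], file3[1:]):
        new_row = r1[:ID_START]
        for k, id_ in enumerate(ids):
            new_row.append(r1[ID_START + k])                 # value from file1
            if id_ in col2:
                new_row.append(r2[col2[id_]])                # flag from file2
            if id_ + "_b" in col3:
                new_row.append(r3[col3[id_ + "_b"]])         # "_b" value from file3
        rows.append(new_row)

    with open("final.txt", "w") as out:
        out.write("\n".join("\t".join(row) for row in rows) + "\n")

Building the header-to-column dictionaries up front means file 2 and file 3 are read only once, instead of re-scanning them for every ID.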

Related

Read data from txt file from a specific pattern in text

I need to extract data into a dataframe from a text file, after a specific word is found.
Here is an example of text file:
Signal:
197 198 180 140
X_values:
11 26 8 15
14 17 12 10
** Exe Trs:
[115] time1: 1 (42 ms) - tou(): 4.7 - JH: 5 (B: 0)
[230] time2: 2 (87 ms) - tou(): 3.0 - Am: 5 (B: 0)
What I want is to have two dataframes created as below:
df1:
Signal X_values1 X_values2
197 11 14
198 26 17
180 8 12
140 15 10
df2:
order time(ms)
115 42
230 87
A quick solution is to use a regex if your files always follow the same order of the fields. If you want something more efficient/robust, you need to write a custom parser.
import re
import numpy as np
import pandas as pd

with open('your_file', 'r') as f:
    text = f.read()

signal, values, exec_t = re.search(r'Signal:\n([\d ]+)\nX_values:\n([\d\n ]+)\n\*\* Exe Trs:\n([\w\W]*)',
                                   text).groups()

signal = re.split(r'\s+', signal.strip())
values = re.split(r'\s+', values.strip())
exec_t = re.findall(r'\[(\d+)][^(]+\((\d+) ms', exec_t)

df1 = pd.DataFrame(np.array(values).reshape(-1, len(signal)).T,
                   index=signal,
                   columns=['X_values%s' % (i+1) for i in range(len(values)//len(signal))]
                   ).rename_axis('Signal').reset_index()

df2 = pd.DataFrame(exec_t, columns=['order', 'time(ms)'])
output:
>>> df1
Signal X_values1 X_values2
0 197 11 14
1 198 26 17
2 180 8 12
3 140 15 10
>>> df2
order time(ms)
0 115 42
1 230 87
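The answer above mentions that a custom parser would be more robust than one big regex. A minimal sketch of that route, assuming the section headers Signal:, X_values: and ** Exe Trs: always introduce their blocks and reusing the placeholder filename your_file, might look like this:

    import re
    import pandas as pd

    signal, values, exec_rows = [], [], []
    section = None

    with open('your_file', 'r') as f:
        for line in f:
            line = line.strip()
            if line == 'Signal:':
                section = 'signal'
            elif line == 'X_values:':
                section = 'values'
            elif line.startswith('** Exe Trs'):
                section = 'exec'
            elif not line:
                continue
            elif section == 'signal':
                signal.extend(line.split())
            elif section == 'values':
                values.append(line.split())        # one X_values column per line
            elif section == 'exec':
                # e.g. "[115] time1: 1 (42 ms) - ..." -> ('115', '42')
                m = re.match(r'\[(\d+)\].*?\((\d+) ms', line)
                if m:
                    exec_rows.append(m.groups())

    df1 = pd.DataFrame({'Signal': signal})
    for i, row in enumerate(values, start=1):
        df1['X_values%d' % i] = row

    df2 = pd.DataFrame(exec_rows, columns=['order', 'time(ms)'])

The per-line approach keeps working if extra blank lines or additional timing lines show up, which the single regex would not tolerate.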

Text file parsing as fast as possible

I have a very large file with lines like the following:
....
0.040027 a b c d e 12 34 56 78 90 12 34 56
0.050027 f g h i l 12 34 56 78 90 12 34 56
0.060027 a b c d e 12 34 56 78 90 12 34 56
0.070027 f g h i l 12 34 56 78 90 12 34 56
0.080027 a b c d e 12 34 56 78 90 12 34 56
0.090027 f g h i l 12 34 56 78 90 12 34 56
....
I need to have a dictionary as follows in the fastest way possible.
I am using the following code:
import time

ascFile = open('C:\\eample.txt', 'r', encoding='UTF-8')
tag1 = ' a b c d e '
tag2 = ' f g h i l '
tags = [tag1, tag2]

temp = {'k1': [], 'k2': []}
key_tag = {'k1': tag1, 'k2': tag2}

t1 = time.time()
for line in ascFile:
    for path, tag in key_tag.items():
        if tag in line:
            columns = line.strip().split(tag, 1)
            temp[path].append([columns[0], columns[-1].replace(' ', '')])
t2 = time.time()
print(t2 - t1)
I get the following result in 6 seconds when parsing a 360 MB file; I'd like to improve on that time.
temp = {'k1': [['0.040027', '1234567890123456'], ['0.060027', '1234567890123456'], ['0.080027', '1234567890123456']],
        'k2': [['0.050027', '1234567890123456'], ['0.070027', '1234567890123456'], ['0.090027', '1234567890123456']]}
I assume you have a fixed number of words in the file that are your keys. Use split to break the string, then take a slice of the split list to compute your key directly:
import collections

# raw strings don't need \\ for backslash:
FILESPEC = r'C:\example.txt'

lines_by_key = collections.defaultdict(list)

with open(FILESPEC, 'r', encoding='UTF-8') as f:
    for line in f:
        cols = line.split()
        key = ' '.join(cols[1:6])
        pair = (cols[0], ''.join(cols[6:]))  # tuple, not list, could be changed
        lines_by_key[key].append(pair)

print(lines_by_key)
I used partition instead of split so that the 'in' test and splitting can be done in a single pass.
for line in ascFile:
    for path, tag in key_tag.items():
        val0, tag_found, val1 = line.partition(tag)
        if tag_found:
            temp[path].append([val0, val1.replace(' ', '')])
            break
Is this any better with your 360MB file?
You might also do a simple test where all you do is loop through the file a line at a time:
for line in ascFile:
    pass
This will tell you what your best possible time will be.
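For instance, wrapping that bare loop with the same timing calls used in the question gives the lower bound to compare against (a sketch; the path is the one from the question):

    import time

    t1 = time.time()
    with open('C:\\eample.txt', 'r', encoding='UTF-8') as ascFile:
        for line in ascFile:
            pass                  # no parsing at all: pure read/decode time
    t2 = time.time()
    print(t2 - t1)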

Reading a txt file with numbers and summing them - python

I have a txt file with the following text in it:
2
4 8 15 16 23 42
1 3 5
6
66 77
77
888
888 77
34
23 234 234
1
32
3
23 23 23
365
22 12
I need a way to read the file and sum all the numbers.
I have this code for now but I'm not sure what to do next. Thanks in advance.
lstComplete = []
fichNbr = open("nombres.txt", "r")
lstComplete = fichNbr
somme = 0
for i in lstComplete:
    i = i.split()
Turn them into a list and sum them:
with open('nombres.txt', 'r') as f:
    num_list = f.read().split()
print sum([int(n) for n in num_list])
Returns 3227
Open the file and use the read() method to get the content, then convert the strings to int and use sum() to get the result:
>>> sum(map(int,open('nombres.txt').read().split()))
3227

Reading values from a text file with different row and column size in python

I have read other similar posts but they don't seem to work in my case. Hence, I'm posting this as a new question here.
I have a text file which has varying row and column sizes. I am interested in the rows of values which have a specific parameter. E.g. in the sample text file below, I want the last two values of each line which has the number '1' in the second position. That is, I want the values '1, 101', '101, 2', '2, 102' and '102, 3' from the lines starting with the values '101 to 104' because they have the number '1' in the second position.
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
425
.
.
$EndNodes
$Elements
630
.
97 15 2 0 193 97
98 15 2 0 195 98
99 15 2 0 197 99
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
302 2 2 0 303 250 79 178
303 2 2 0 303 198 98 249
304 2 2 0 303 249 99 198
.
.
.
$EndElements
The problem is that with the code I have come up with (shown below), it starts from '101' but also reads the values from the other lines up to '304' or more. What am I doing wrong, or does someone have a better way to tackle this?
# Here, (additional_lines + anz_knoten_gmsh - 2) are additional lines that need to be skipped
# at the beginning of the .txt file. Initially I find out where the range
# of the lines lies which I need.
# The two_noded_elem_start is the first line having the '1' at the second position
# and four_noded_elem_start is the first line number having '2' in the second position.
# So, basically I'm reading between these two parameters.

input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"))
output_file = open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w")

for i, line in enumerate(input_file):
    if i == (additional_lines + anz_knoten_gmsh + two_noded_elem_start - 2):
        break

for i, line in enumerate(input_file):
    if i == additional_lines + anz_knoten_gmsh + four_noded_elem_start - 2:
        break
    elem_list = line.strip().split()
    del elem_list[:5]
    writer = csv.writer(output_file)
    writer.writerow(elem_list)

input_file.close()
output_file.close()
EDIT: The piece of code used to find parameters like two_noded_elem_start is as follows:
# anz_elemente_ueberg_gmsh is another parameter that is found out
# from a previous piece of code and '$EndElements' is what
# is at the end of the text file "mesh_outer_region.msh".

input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"), "r")

for i, line in enumerate(input_file):
    if line.strip() == anz_elemente_ueberg_gmsh:
        break

for i, line in enumerate(input_file):
    if line.strip() == '$EndElements':
        break
    element_list = line.strip().split()
    if element_list[1] == '1':
        two_noded_elem_start = element_list[0]
        two_noded_elem_start = int(two_noded_elem_start)
        break

input_file.close()
>>> with open('filename') as fh:              # Open the file
...     for line in fh:                       # For each line in the file
...         values = line.split()             # Split the values into a list
...         if values[1] == '1':              # Compare the second value
...             print values[-2], values[-1]  # Print the 2nd-from-last and last values
1 101
101 2
2 102
102 3
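If the goal is still to write those two values to mesh_skip_nodes.txt with csv.writer, as in the original code, the same filter can feed the writer directly. A sketch, reusing gmsh_path and the filenames from the question:

    import csv
    import os

    with open(os.path.join(gmsh_path, "mesh_outer_region.msh")) as fh, \
         open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w", newline="") as out:
        writer = csv.writer(out)
        for line in fh:
            values = line.split()
            # only element lines are long enough and have '1' in the second position
            if len(values) > 2 and values[1] == '1':
                writer.writerow(values[-2:])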

Python: How to write values to a csv file from another csv file

In the index.csv file, the fourth column has ten numbers ranging from 1 to 5. Each number can be regarded as an index, and each index corresponds to an array of numbers in filename.csv.
The row number of filename.csv represents the index, and each row has three numbers. My question is about using a nested loop to transfer the numbers in filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
import collections

data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')
out = np.zeros((len(data2), len(data1)))

for row in data2:
    for ch_row in range(len(data1)):
        if (row[3] == ch_row + 1):
            out = row.tolist() + data1[ch_row].tolist()
            print(out)

writer = csv.writer(open('dn.csv', 'w'), delimiter=',', quoting=csv.QUOTE_ALL)
writer.writerow(out)
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed row from filename.csv to index.csv and store these numbers in the 5th, 6th and 7th columns:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
If I do "print(out)", the correct answer comes out. However, when I type "out" in the shell, only one row appears, like [1.0, 1.0, 1.0, 1.0, 20.0, 30.0, 50.0].
What I need is to store all the values from the "out" variable and write them to the dn.csv file.
This ought to do the trick for you:
Code:
from csv import reader, writer

data = list(reader(open("filename.csv", "r"), delimiter=" "))
out = writer(open("output.csv", "w"), delimiter=" ")

for row in reader(open("index.csv", "r"), delimiter=" "):
    out.writerow(row + data[int(row[3])])
index.csv:
0 0 0 1
0 0 0 2
0 0 0 3
filename.csv:
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
This produces the output:
0 0 0 1 70 60 45
0 0 0 2 35 26 77
0 0 0 3 93 37 68
Note: There's no need to use numpy here. The standard library csv module will do most of the work for you.
I also had to modify your sample datasets a bit as what you showed had indexes out of bounds of the sample data in filename.csv.
Please also note that Python (like most languages) uses 0th indexes. So you may have to fiddle with the above code to exactly fit your needs.
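For example, if the fourth column is 1-based as in the question's data, only the lookup line needs adjusting (a sketch of that tweak, not part of the original answer):

    out.writerow(row + data[int(row[3]) - 1])   # shift the 1-based index to 0-based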
with open('dn.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL)
    for row in data2:
        idx = int(row[3])                       # genfromtxt gives floats, so cast to int
        out = [idx] + [x for x in data1[idx - 1]]
        writer.writerow(out)
