I have multiple files. I want to copy two columns from one file and use them to replace two columns in another file.
The first file contains:
ag-109 3.905E-07
am-241 1.121E-06
am-243 7.294E-09
cs-133 1.210E-05
eu-151 2.393E-08
eu-153 4.918E-07
gd-155 2.039E-08
mo-95 1.139E-05
nd-143 9.869E-06
.......
........
and the second file is:
h-1 10 0 0.06674 293 end
zr 11 0 0.0423 293 end
u-234 101 0 7.471e-06 293 end
u-235 101 0 0.0005265 293 end
u-236 101 0 0.0001285 293 end
u-238 101 0 0.02278 293 end
np-237 101 0 1.018e-05 293 end
pu-238 101 0 2.262e-06 293 end
pu-239 101 0 0.000147 293 end
.........
.......
.
.
u-234 1018 0 7.471e-06 293 end
u-235 1018 0 0.0005265 293 end
u-236 1018 0 0.0001285 293 end
u-238 1018 0 0.02278 293 end
np-237 1018 0 1.018e-05 293 end
pu-238 1018 0 2.262e-06 293 end
I want to replace the first column of file2 with the first column of file1, and put the 2nd column of file1 into the 4th column of file2.
file2 contains more lines that I want to keep reading without changing.
The second problem is:
file2 repeats the mixture id in the second field 18 times, from "101" to "1018".
Each of the 18 blocks of nuclides in the first column has different values in column 4.
I have tried reading file1 line by line and doing the same for file2,
then starting the replacement from the specific value '11',
including a condition on column 2 so that it changes every time an iteration over the nuclides finishes (I have 29 nuclides).
with open('100_60.inp', 'a+') as fapp:
    with open("20_3.2_10_100_18.txt") as copf:
        line = fapp.readline()
        # if not line:
        #     break
        source = re.split(r"\s+", line.strip())
        nuclide = copf.readline()
        # if not nuclide:
        #     break
        comp = re.split(r"\s+", nuclide.strip())
        if len(source) == 6 and source[1] != '11':
            for i in range(29):
                source[3][i] = nuclide[1][i]
                source[0][i] = nuclide[0][i]
                fapp.append(replace(source[0][i], nuclide[0][i]))
        if len(source) == 6 and source[1] != '101':
            for i in range(29):
                source[3][i] = nuclide[1][i]
                source[0][i] = nuclide[0][i]
                fapp.append(replace(source[0][i], nuclide[0][i]))
The expected result must look like this:
h-1 10 0 0.06674 293 end
zr 11 0 0.0423 293 end
ag-109 101 0 3.905E-07 293 end
am-241 101 0 1.121E-06 293 end
am-243 101 0 7.294E-09 293 end
cs-133 101 0 1.210E-05 293 end
eu-151 101 0 2.393E-08 293 end
eu-153 101 0 4.918E-07 293 end
gd-155 101 0 2.039E-08 293 end
....
....
....
I think that if you manage to convert the text file into a CSV, working with columns is going to be much easier.
If the columns are separated by tabs, you can also do it in Excel without having to script it yourself: https://support.geekseller.com/knowledgebase/convert-txt-file-csv/
After that you could use the csv module and read the file into a dict where you can add or remove keys (columns). I can't script up a full working solution for you right now, but I hope this gives you some hints on how to approach it.
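For what it's worth, since both files are already whitespace-separated, a plain split() is enough even without the csv module. Here is a minimal sketch of the replacement loop; the filenames are placeholders, and I am assuming that the 18 mixture ids run '101' to '1018' and that file1 holds exactly the 29 (nuclide, value) pairs that repeat in every block:

from itertools import cycle

# Placeholders -- substitute your real filenames ('20_3.2_10_100_18.txt'
# and '100_60.inp' in the question).
mixture_ids = {'10%d' % i for i in range(1, 19)}      # '101' .. '109', '1010' .. '1018'

with open('file1.txt') as f1:
    pairs = [line.split() for line in f1 if line.strip()]

nuclides = cycle(pairs)                               # restarts after every 29-row block
with open('file2.txt') as f2, open('file2_new.txt', 'w') as out:
    for line in f2:
        fields = line.split()
        if len(fields) == 6 and fields[1] in mixture_ids:
            nuc, val = next(nuclides)
            fields[0], fields[3] = nuc, val           # replace columns 1 and 4
            out.write(' '.join(fields) + '\n')
        else:
            out.write(line)                           # h-1, zr, etc. pass through unchanged

cycle() restarts the 29 pairs automatically, so the extra condition on column 2 ("change every time the nuclide iteration finishes") is not needed.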
I would like to know how to remove some variables from a dataset, specifically numbers and a list of strings. For example:
Test Num
0 bam 132
1 - 65
2 creation 47
3 MAN 32
4 41 831
... ... ...
460 Luchino 21
461 42 4126 7
462 finger 43
463 washing 1
I would like to have something like
Test Num
0 bam 132
2 creation 47
... ... ...
460 Luchino 21
462 finger 43
463 washing 1
where I removed (manually) MAN (it should be included in a list of strings, like a stop word), '-', and the numbers.
I have tried with isdigit, but it is not working, so I am sure there are errors in my code:
df['Text'].where(~df['Text'].str.isdigit())
and for my stop words:
my_stop=['MAN','-']
df['Text'].apply(lambda lst: [x for x in lst if x in my_stop])
If you want to filter, you could use .loc:
df = df.loc[~df.Text.str.isdigit() & ~df.Text.isin(['MAN']), :]
.where(cond, other) returns a DataFrame or Series of the same shape as self, but keeps the original values where cond is True and replaces them with other where it is False.
Read more in the docs
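A tiny demo of the difference, with made-up data:

import pandas as pd

df = pd.DataFrame({'Text': ['bam', '41', 'MAN'], 'Num': [132, 831, 32]})

# .where keeps the shape and puts NaN where the condition is False:
print(df['Text'].where(~df['Text'].str.isdigit()))
# 0    bam
# 1    NaN
# 2    MAN

# .loc actually drops the rows:
print(df.loc[~df['Text'].str.isdigit() & ~df['Text'].isin(['MAN']), :])
#   Text  Num
# 0  bam  132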
Hi, you should try this code:
df[df['Text']!='MAN']
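That only handles 'MAN', though. If you also need to drop the digit rows and the '-' entries, the same boolean-mask idea extends; a sketch with made-up data, reusing the my_stop list from the question:

import pandas as pd

df = pd.DataFrame({'Text': ['bam', '-', 'MAN', '41', 'finger'],
                   'Num': [132, 65, 32, 831, 43]})
my_stop = ['MAN', '-']

# Keep rows that are neither stop words nor pure digits.
print(df[~df['Text'].isin(my_stop) & ~df['Text'].str.isdigit()])
#      Text  Num
# 0     bam  132
# 4  finger   43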
I am reading a column from an Excel file using openpyxl.
I have written code to get the column of data I need, but the data is separated by empty cells.
I want to group the data wherever the cell value is not None into 19 sets of countries, so that I can later use it to calculate the mean and standard deviation for the 19 countries.
I don't want to hard-code it using list slices. Instead I want to save these integers to a list, or a list of lists, using a loop, but I'm not sure how to, because this is my first project with Python.
Here's my code:
#Read PCT rankings project ratified results
#Beta
import openpyxl

wb = openpyxl.load_workbook('PCT rankings project ratified results.xlsx', data_only=True)
sheet = wb.get_sheet_by_name('PCT by IP firms')

row_counter = sheet.max_row
column_counter = sheet.max_column
print(row_counter)
print(column_counter)

# Iterate over the column of patent filings, using empty cells as a flag:
# append the list of numbers collected so far before reaching the next
# non-empty cell, and repeat every time this happens (expecting 19 times).
list = []
for row in range(4, sheet.max_row + 1):
    patent = sheet['I' + str(row)].value
    print(patent)
    if patent == None:
        list.append(patent)
print(list)
This is the output from Python giving you a visualisation of what I am trying to do.
Column I:
412
14
493
488
339
273
238
226
200
194
153
164
151
126
None
120
None
None
133
77
62
79
24
0
30
20
16
0
6
9
11
None
None
None
None
608
529
435
320
266
264
200
272
134
113
73
23
12
52
21
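In case it helps, here is a minimal sketch of the grouping step itself, independent of openpyxl: split the column values into sublists, breaking at the None gaps (the sample values below are made up):

def group_by_gaps(values):
    """Split a sequence into sublists, breaking at runs of None."""
    groups, current = [], []
    for v in values:
        if v is None:
            if current:              # close the current group at the first gap
                groups.append(current)
                current = []
        else:
            current.append(v)
    if current:                      # keep the trailing group too
        groups.append(current)
    return groups

values = [412, 14, None, 120, None, None, 133, 77]
print(group_by_gaps(values))         # [[412, 14], [120], [133, 77]]

Feed it the list of cell values read from column I and you get the 19 groups without any hard-coded slices.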
In a CSV file, if a line starts with a # sign or is empty, I can remove or ignore it easily.
# some description here
# 1 is for good, 2 is bad and 3 for worse
(empty line)
I can ignore the empty lines and the lines starting with # using the following logic in Python:
while True:
    if len(data[0]) == 0 or data[0][0][0] == '#':
        data.pop(0)
    else:
        break
return data
But below is header data that has a few empty spaces at the start before the data begins:
0 temp_data 1 temp_flow 2 temp_record 3 temp_all
22 33 434 344
34 43 434 355
In some files I get header data like the example below, where I have to ignore only the # sign and not the column names:
#0 temp_data 1 temp_flow 2 temp_record 3 temp_all
22 33 434 344
34 43 434 355
But I have no clue how to deal with these two situations.
I would be grateful if someone could help, because my logic above fails on both of them.
You can use the string strip() function to remove leading and trailing whitespace first...
>>> ' 0 temp_data 1 temp_flow 2 temp_record 3 temp_all'.strip()
'0 temp_data 1 temp_flow 2 temp_record 3 temp_all'
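Building on that, a sketch that handles both header variants, assuming there is at most one leading '#' in front of the column names:

def clean_header(line):
    line = line.strip()              # drop the leading/trailing whitespace
    if line.startswith('#'):
        line = line[1:].strip()      # drop the '#' but keep the column names
    return line

print(clean_header('   0 temp_data 1 temp_flow 2 temp_record 3 temp_all'))
print(clean_header('#0 temp_data 1 temp_flow 2 temp_record 3 temp_all'))
# both print: 0 temp_data 1 temp_flow 2 temp_record 3 temp_all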
I have created a function that I want to run over an entire file, but I am having some trouble: I am only getting output from the last line of the file.
I have two different input files. The idea is to take the lines from one file, collect certain terms and add them to a dictionary, and then search the second file for the corresponding lines and print the output. I know the problem is most likely the placement of my call to the function.
The matrix file looks like this
Sp_ds Sp_hs Sp_log Sp_plat
c3833_g1_i2 4.00 0.07 16.84 26.37
c4832_g1_i1 24.55 116.87 220.53 28.82
c5161_g1_i1 107.49 89.39 26.95 698.97
c4399_g1_i2 27.91 72.57 5.56 36.58
c5916_g1_i1 82.57 19.03 48.55 258.22
The Blast file looks like this
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1000_g1_i1|m.799 gi|48474761|sp|O94288.1|NOC3_SCHPO 100.00 747 0 0 5 751 1 747 0.0 1506
c1001_g1_i1|m.800 gi|259016383|sp|O42919.3|RT26A_SCHPO 100.00 268 0 0 1 268 1 268 0.0 557
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
c1003_g1_i1|m.803 gi|74631197|sp|Q6BDR8.1|NSE4_SCHPO 100.00 246 0 0 1 246 1 246 1e-179 502
c1004_g1_i1|m.804 gi|74676184|sp|O94325.1|PEX5_SCHPO 100.00 598 0 0 1 598 1 598 0.0 1227
c1005_g1_i1|m.805 gi|9910811|sp|O42832.2|SPB1_SCHPO 100.00 802 0 0 1 802 1 802 0.0 1644
c1006_g1_i1|m.806 gi|74627042|sp|O94631.1|MRM1_SCHPO 100.00 255 0 0 1 255 47 301 0.0 525
c1007_g1_i1|m.807 gi|20137702|sp|O74370.1|ISY1_SCHPO 100.00 201 0 0 1 201 1 201 4e-146 412
The program I have so far is this:
def parse_blast(blast_line="NA"):
    transcript = blast_line[0][0]
    swissProt = blast_line[1][3]
    return (transcript, swissProt)

blast = open("/scratch/RNASeq/blastp.outfmt6")
for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line=line)

transcript_to_protein = {}
transcript_to_protein[transcript] = swissProt

if transcript in transcript_to_protein:
    protein = transcript_to_protein.get(transcript)

matrix = open("/scratch/RNASeq/diffExpr.P1e-3_C2.matrix")
for line in matrix:
    matrixFields = line.rstrip("\n").split("\t")
    transcript = matrixFields[0]
    Sp_ds = matrixFields[1]
    Sp_hs = matrixFields[2]
    Sp_log = matrixFields[3]
    Sp_plat = matrixFields[4]

tab = "\t"
fields = (protein, Sp_ds, Sp_hs, Sp_log, Sp_plat)
out = open("parsed_blast.txt", "w")
out.write(tab.join(fields))

matrix.close()
blast.close()
out.close()
It's a scope problem, as your indentation is not correct.
for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line=line)
So you keep looping till the last line without saving the values you get.
I think you should change your indentation to this
transcript_to_protein = {}                            # 1. declare the dictionary
for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line=line)
    transcript_to_protein[transcript] = swissProt     # 2. add the data to the dictionary
This will solve the problem with your first file, but not your second, as you don't use the dictionary inside the loop.
So you have to move these lines inside the second loop:
if transcript in transcript_to_protein:
    protein = transcript_to_protein.get(transcript)
I think you get the idea. I will leave the rest for you to do; there are a few lines that need to be moved before the loops, and one or two inside the second loop.
This:
for line in blast:
    line = [item.split('|') for item in line.split()]
    (transcript, swissProt) = parse_blast(blast_line=line)
reads all the lines, but once it has finished, (transcript, swissProt) only holds the values from the last line.
Same for:
for line in matrix:
    matrixFields = line.rstrip("\n").split("\t")
    transcript = matrixFields[0]
    Sp_ds = matrixFields[1]
    Sp_hs = matrixFields[2]
    Sp_log = matrixFields[3]
    Sp_plat = matrixFields[4]
You need to put the rest of your line processing inside your loops.
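Putting both fixes together, a sketch of the corrected structure; the paths and field layout come from the question, and skipping the matrix header row is an assumption on my part:

# Build the transcript -> protein dictionary once, from the whole blast file.
transcript_to_protein = {}
with open("/scratch/RNASeq/blastp.outfmt6") as blast:
    for line in blast:
        fields = [item.split('|') for item in line.split()]
        transcript = fields[0][0]
        swissProt = fields[1][3]
        transcript_to_protein[transcript] = swissProt

# Then look up and write each matrix row inside the same loop.
with open("/scratch/RNASeq/diffExpr.P1e-3_C2.matrix") as matrix, \
     open("parsed_blast.txt", "w") as out:
    next(matrix)                                   # skip the header row (assumption)
    for line in matrix:
        matrixFields = line.rstrip("\n").split("\t")
        transcript = matrixFields[0]
        if transcript in transcript_to_protein:    # the lookup now happens per line
            protein = transcript_to_protein[transcript]
            out.write("\t".join([protein] + matrixFields[1:5]) + "\n")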
I have read other similar posts, but they don't seem to work in my case, hence I'm posting a new question here.
I have a text file with varying row and column sizes. I am interested in the rows of values that have a specific parameter. E.g. in the sample text file below, I want the last two values of each line that has the number '1' in the second position. That is, I want the values '1, 101', '101, 2', '2, 102' and '102, 3' from the lines starting with the values '101' to '104', because they have the number '1' in the second position.
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
425
.
.
$EndNodes
$Elements
630
.
97 15 2 0 193 97
98 15 2 0 195 98
99 15 2 0 197 99
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
302 2 2 0 303 250 79 178
303 2 2 0 303 198 98 249
304 2 2 0 303 249 99 198
.
.
.
$EndElements
The problem is that with the code I have come up with, shown below, it starts from '101' but then reads values from the other lines up to '304' or more. What am I doing wrong, or does someone have a better way to tackle this?
# Here, (additional_lines + anz_knoten_gmsh - 2) are additional lines that
# need to be skipped at the beginning of the .txt file. First I find the
# range of lines that I need: two_noded_elem_start is the first line with
# '1' in the second position, and four_noded_elem_start is the first line
# with '2' in the second position. So, basically, I'm reading between
# these two parameters.
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"))
output_file = open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w")

for i, line in enumerate(input_file):
    if i == (additional_lines + anz_knoten_gmsh + two_noded_elem_start - 2):
        break

for i, line in enumerate(input_file):
    if i == additional_lines + anz_knoten_gmsh + four_noded_elem_start - 2:
        break
    elem_list = line.strip().split()
    del elem_list[:5]
    writer = csv.writer(output_file)
    writer.writerow(elem_list)

input_file.close()
output_file.close()
*EDIT: The piece of code used to find parameters like two_noded_elem_start is as follows:
# anz_elemente_ueberg_gmsh is another parameter found by a previous piece
# of code, and '$EndElements' is what is at the end of the text file
# "mesh_outer_region.msh".
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"), "r")

for i, line in enumerate(input_file):
    if line.strip() == anz_elemente_ueberg_gmsh:
        break

for i, line in enumerate(input_file):
    if line.strip() == '$EndElements':
        break
    element_list = line.strip().split()
    if element_list[1] == '1':
        two_noded_elem_start = element_list[0]
        two_noded_elem_start = int(two_noded_elem_start)
        break

input_file.close()
>>> with open('filename') as fh:                      # Open the file
...     for line in fh:                               # For each line in the file
...         values = line.split()                     # Split the values into a list
...         if len(values) > 2 and values[1] == '1':  # Guard short lines, compare the second value
...             print(values[-2], values[-1])         # Print the 2nd from last and last
...
1 101
101 2
2 102
102 3
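If you also want to restrict the scan to the $Elements block, so that nothing before $Elements or after $EndElements can ever match, you can gate on the section markers; a sketch assuming the file layout shown in the question:

with open('filename') as fh:
    in_elements = False
    for line in fh:
        values = line.split()
        if values and values[0] == '$Elements':
            in_elements = True                     # start of the section
            continue
        if values and values[0] == '$EndElements':
            break                                  # end of the section
        if in_elements and len(values) > 2 and values[1] == '1':
            print(values[-2], values[-1])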