So I have a text file that I need to trim based on a value in the second-to-last column: if it says 1, delete the line; if it says 0, keep the line.
The text looks like this, it just has thousands of rows:
#name #bunch of values #column of interest
00051079+4547116 00 05 10.896 +45 47 11.570 0 0 \n
00051079+4547117 00 05 10.896 +45 47 11.570 432 3 0 0 \n
00051079+4547118 00 05 10.896 +45 47 11.570 34 6 1 0 \n
I have tried this (plus about a hundred variations of this):
with open("Desktop/MStars.txt") as M:
    data = M.read()

data = data.split('\n')
mactivity = [row.split()[-2] for row in data]
#name = [row.split(' ')[0] for row in data]
#print ((mactivity))

with open("Desktop/MStars.txt","r") as input:
    with open("Desktop/MStarsReduced.txt","w") as output:
        for line in input:
            if mactivity =="0":
                output.write(line)
Thank you in advance, it is driving me mad.
Recall that iterating over a plain file yields each line as a single string, so you need to split it into fields before you can index the second-to-last column.

Editing your last little code block:

with open("Desktop/MStars.txt", "r") as input:
    with open("Desktop/MStarsReduced.txt", "w") as output:
        for line in input:
            fields = line.split()
            if len(fields) >= 2 and fields[-2] == "0":
                output.write(line)

This will write your line if and only if its second-to-last field is "0"; otherwise it will not be written. Note the fields are strings, so compare against "0" rather than the integer 0.
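To sanity-check the filter before running it on the real file, the same field test can be tried on the sample lines from the question (the helper name keep_line is my own):

```python
def keep_line(line):
    # Keep a line only when its second-to-last whitespace-separated
    # field is the string "0"; blank lines are dropped.
    fields = line.split()
    return len(fields) >= 2 and fields[-2] == "0"

sample = [
    "00051079+4547116 00 05 10.896 +45 47 11.570 0 0\n",
    "00051079+4547117 00 05 10.896 +45 47 11.570 432 3 0 0\n",
    "00051079+4547118 00 05 10.896 +45 47 11.570 34 6 1 0\n",
]
kept = [line for line in sample if keep_line(line)]
print(len(kept))  # the third line has a 1 in the second-to-last field
```

For the real file, replace sample with the open file object.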
I have the data file below and I want three columns with the headings "TIMESTEP", "id" and "mass".
The corresponding values are immediately below each heading. How can I do this? Please help.
Link 1 below is a snapshot of my data file and link 2 is my desired arrangement.
I agree with the comments that the question is hard to understand, but from my understanding of your problem I have found this solution:
import pandas as pd

data_input = """TIMESTEP
5000
id mass
TIMESTEP
5100
id mass
42 24
TIMESTEP
5200
id mass
99 123
32 84
79 424"""

columns = ["TIMESTEP", "id", "mass"]
data = []
previous_line = ""
for line in data_input.split("\n"):
    if columns[0] in previous_line and columns[1] not in line:
        data.append({"TIMESTEP": line})
    elif columns[1] in previous_line and columns[0] not in line:
        data[-1]["id"], data[-1]["mass"] = line.split(" ")
    elif all(col not in line for col in columns):
        data.append({"TIMESTEP": data[-1]["TIMESTEP"]})
        data[-1]["id"], data[-1]["mass"] = line.split(" ")
    previous_line = line

df = pd.DataFrame(data)
print(df)
Try to run this script and see if this is what you were looking for.
I have a file with a list of positions (columns 1 + 2) and values associated with those positions:
File1.txt:
1 20 A G
4 400 T C
1 12 A T
2 500 G C
And another file with some of the same positions. There may be multiple rows with the same positions as in File1.txt
File2.txt
#CHR POS Count_A Count_C Count_G Count_T
1 20 0 18 2 0
4 400 0 0 0 1
1 12 0 7 0 40
4 400 0 1 0 1
5 50 16 0 0 0
2 500 9 0 4 0
I need to output a version of File1.txt excluding any rows that ever meet both these two conditions:
1: If the positions (columns 1+2) match in File1.txt and File2.txt
2: If the count is > 0 in the column in File2.txt that matches the letter(A,G,C,T) in column 4 of File1.txt for that position.
So for the example above the first row of File1.txt would not be output because in file2.txt for the matching row (based on the first 2 columns: 1 20), the 4th column has the letter G and for this row in File2.txt the Count_G column is >0.
The only line that would be output for this example would be:
2 500 G C
To me the particularly tricky part is that there can be multiple matching rows in file2.txt and I want to exclude rows in File1.txt if the appropriate column in File2.txt is >0 in even just one row in File2.txt. Meaning that in the example above line 2 of File1.txt would not be included because Count_C is > 0 the second time that position appears in File2.txt (Count_C = 1).
I am not sure if that kind of filtering is possible in a single step. Would it be easier to output a list of rows in File1.txt where the count in File2.txt for the letter in the 4th column in File1.txt is >0. Then use this list to compare to File1.txt and remove any rows that appear in both files?
I've filtered one file based on values in another before with the code below, but that was for when there was only one column of values to filter on in File2.txt. I am not sure how to do the conditional filtering so that I check the right column based on the letter in column 4 of File1.txt.
My current code is in Python, but any solution is welcome:
f2 = open('file2.txt', 'r')
d2 = {}
for line in f2:
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0], fields[1])
    d2[key] = int(fields[2])

f1 = open('file1.txt', 'r')
for line in f1:
    line = line.rstrip()
    fields = line.split("\t")
    key = (fields[0], fields[1])
    if d2[key] > 1000:
        print(line)
I think my previous solution is already very verbose and feel there might be a simple tool for this kind of problem of which I am not aware.
I used Perl to solve the problem. First, it loads File2.txt into a hash table keyed by the chr, pos, and nucleotide; the value is the count for that nucleotide. Then File1.txt is processed: a line is printed only if there is no non-zero value in the hash table for its chr, pos, and nucleotide.
#!/usr/bin/perl
use warnings;
use strict;

my %gt0;
open my $in2, '<', 'File2.txt' or die $!;
<$in2>;  # Skip the header.
while (<$in2>) {
    my %count;
    (my ($chr, $pos), @count{qw{ A C G T }}) = split;
    # Keep a non-zero count if any duplicate row for this position had one.
    $gt0{$chr}{$pos}{$_} ||= $count{$_} for qw( A C G T );
}

open my $in1, '<', 'File1.txt' or die $!;
while (<$in1>) {
    my ($chr, $pos, undef, $c4) = split;
    print unless $gt0{$chr}{$pos}{$c4};
}
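Since the question welcomes any language, the same idea translates to Python; this sketch keeps the larger count when a position appears more than once in File2.txt, so a single positive row is enough to exclude a File1.txt line (file contents are passed in as lists of lines for illustration):

```python
def filter_positions(file1_lines, file2_lines):
    # Map (chr, pos) -> highest count seen per nucleotide, across
    # all (possibly duplicated) rows of File2.txt.
    counts = {}
    for line in file2_lines:
        if line.startswith("#"):
            continue  # skip the header
        chrom, pos, a, c, g, t = line.split()
        row = dict(zip("ACGT", map(int, (a, c, g, t))))
        prev = counts.setdefault((chrom, pos), {n: 0 for n in "ACGT"})
        for n in "ACGT":
            prev[n] = max(prev[n], row[n])
    # Keep a File1.txt line only if its column-4 letter never had a count > 0.
    kept = []
    for line in file1_lines:
        chrom, pos, _, c4 = line.split()
        if counts.get((chrom, pos), {}).get(c4, 0) == 0:
            kept.append(line)
    return kept

file1 = ["1 20 A G", "4 400 T C", "1 12 A T", "2 500 G C"]
file2 = ["#CHR POS Count_A Count_C Count_G Count_T",
         "1 20 0 18 2 0", "4 400 0 0 0 1", "1 12 0 7 0 40",
         "4 400 0 1 0 1", "5 50 16 0 0 0", "2 500 9 0 4 0"]
print(filter_positions(file1, file2))
```

On the question's sample data this keeps only the "2 500 G C" line. For real files, read each one with open(...) and pass in its lines.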
Your code seems pretty good to me. You can perhaps edit

d2[key] = int(fields[2])

and

if d2[key] > 1000:
    print(line)

as they puzzle me a little bit.
I would do it like this:

d2 = {}
with open('file2.txt', 'r') as f2:
    for line in f2:
        if line.startswith('#'):
            continue  # skip the header line
        fields = line.rstrip().split("\t")
        key = (fields[0], fields[1])
        counts = {'A': int(fields[2]), 'C': int(fields[3]),
                  'G': int(fields[4]), 'T': int(fields[5])}
        if key in d2:  # a position can appear more than once
            for n in counts:
                d2[key][n] = max(d2[key][n], counts[n])
        else:
            d2[key] = counts

with open('file1.txt', 'r') as f1:
    for line in f1:
        line = line.rstrip()
        fields = line.split("\t")
        key = (fields[0], fields[1])
        if key not in d2 or d2[key][fields[3]] == 0:
            print(line)
Edit:
If you have an arbitrary number of letters (and columns in File2.txt), just generalize the inner dictionary, which I hard-coded above. Easy. Let's add 2 letters:

col_names = ['A', 'C', 'G', 'T', 'K', 'L']
for i, value in enumerate(fields[2:]):
    d2[key][col_names[i]] = int(value)
I would like to parse a machine log file, rearrange the data, and write it to a .csv file that I will import into a Google spreadsheet, or write the data directly to the spreadsheet.
Here is an example of what the log looks like:
39 14 15 5 2016 39 14 15 5 2016 0
39 14 15 5 2016 40 14 15 5 2016 0.609
43 14 15 5 2016 44 14 15 5 2016 2.182
the output should look like this:
start_date,start_time,end_time,meters
15/5/16,14:39,14:39,0
15/5/16,14:39,14:40,0.609
15/5/16,14:43,14:44,2.182
I wrote the following Python code:

file = open("c:\SteelUsage.bsu")
for line in file.readlines():
    print(line)  # just for verification
    line = line.strip()
    position = []
    numbers = line.split()
    for number in numbers:
        position.append(number)
        print(number)  # just for verification
The idea is to save each number in a row to a list, so I can then rewrite the numbers in the right order according to their positions.
For example: in row #1 the string "39" would have position 0, "14" position 1, etc.
But it seems the code I wrote stores each number as a new list, because when I change print(number) to print(number[0]), the code prints the first digit of each number instead of the first number (39).
Where did I go wrong?
Thank you
Do something like this, then write the result out to your csv file.

with open('c:\\SteelUsage.bsu', 'r') as reader:
    for line in reader:
        inp = line.strip().split()
        out = '%s/%s/%s,%s:%s,%s:%s,%s' % (inp[2], inp[3], inp[4],
                                           inp[1], inp[0],
                                           inp[6], inp[5], inp[10])
        print(out)
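Since the end goal is a .csv file, the reordered fields can also be fed to the csv module; a sketch, with the output file name being my own choice:

```python
import csv

def reorder(line):
    # Map the 11 whitespace-separated log fields into
    # start_date, start_time, end_time, meters.
    f = line.split()
    return ["%s/%s/%s" % (f[2], f[3], f[4]),  # day/month/year
            "%s:%s" % (f[1], f[0]),           # start hour:minute
            "%s:%s" % (f[6], f[5]),           # end hour:minute
            f[10]]                            # meters

rows = [reorder(l) for l in ["39 14 15 5 2016 39 14 15 5 2016 0",
                             "39 14 15 5 2016 40 14 15 5 2016 0.609"]]

with open("steel_usage.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["start_date", "start_time", "end_time", "meters"])
    writer.writerows(rows)
```

For the real log, build rows from the open .bsu file instead of the inline sample.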
I have a data file with a special structure, similar to the one below:
#F A 1 1 1 3 3 2
2 1 0.002796 0.000005 0.000008 -4.938531 1.039083
3 1 0.002796 0.000005 0.000007 -4.938531 1.039083
4 0 0.004961 -0.000008 -0.000002 -4.088534 0.961486
5 0 0.004961 0.000006 -0.000002 -4.079798 0.975763
The first column is only a description (it need not be considered). I want to (1) separate all data lines whose second column is 1 from the ones whose second column is 0, and then (2) extract the data lines whose 5th number (for the first data line, 0.000008) is in a specific range, take the 6th number of each such line (in our example, -4.938531), average all of the captured 6th values, and finally write them to a new file. The code I wrote for this, below, does not yet include the first task, and it is also not working. Could anyone please help me debug it or suggest a new method?
A=0.0 #to be used for separating data as mentioned in the first task
B=0.0 #to be used for separating data as mentioned in the first task
with open('inputdatafile') as fin, open('outputfile','w') as fout:
    for line in fin:
        if line.startswith("#"):
            continue
        else:
            col = line.split()
            6th_val=float(col[-2])
            2nd_val=int(col[1])
            if (str(float(col[6])) > 0.000006 and str(float(col[6])) < 0.000009):
                fout.write(" ".join(col) + "\n")
            else:
                del line
Variable names in Python can't start with a number, so change 6th_val to val_6 and 2nd_val to val_2.
str(float(...)) produces a string, which can't be compared with the float 0.000006, so change any str(float(...)) > xxx to float(...) > xxx.
You don't have to delete line; the garbage collector does it for you, so remove del line.
A = 0.000006
B = 0.000009
S = 0.0
C = 0
with open('inputdatafile') as fin, open('outputfile', 'w') as fout:
    for line in fin:
        if line.startswith("#"):
            continue
        col = line.split()
        if col[1] == '1':
            val_6 = float(col[-2])
            val_5 = float(col[-3])  # float, not int: the values are fractions
            if val_5 > A and val_5 < B:
                fout.write(" ".join(col) + "\n")
                S += val_6
                C += 1
    fout.write("Average 6th: %f\n" % (S / C))
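The first task from the question (keeping the second-column 0 rows separate from the 1 rows) can be covered with per-group accumulators; a minimal sketch, assuming the same column layout and passing the lines in directly:

```python
A, B = 0.000006, 0.000009

def averages(lines):
    # Average the 6th value (col[-2]) over data lines whose 5th value
    # (col[-3]) lies in (A, B), separately for second-column '0' and '1'.
    sums = {"0": 0.0, "1": 0.0}
    counts = {"0": 0, "1": 0}
    for line in lines:
        if line.startswith("#"):
            continue
        col = line.split()
        if A < float(col[-3]) < B:
            sums[col[1]] += float(col[-2])
            counts[col[1]] += 1
    # Only report groups that actually had lines in range.
    return {g: sums[g] / counts[g] for g in sums if counts[g]}

data = ["#F A 1 1 1 3 3 2",
        "2 1 0.002796 0.000005 0.000008 -4.938531 1.039083",
        "3 1 0.002796 0.000005 0.000007 -4.938531 1.039083",
        "4 0 0.004961 -0.000008 -0.000002 -4.088534 0.961486",
        "5 0 0.004961 0.000006 -0.000002 -4.079798 0.975763"]
print(averages(data))
```

On this sample only the two second-column-1 lines fall in the range, so only the '1' group gets an average. To run it on the real file, pass the open file object instead of the inline list.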
I have a text file containing the values:
120 25 50
149 33 50
095 41 50
093 05 50
081 11 50
I extracted the values in the first column and put them into an array: adjusted.
How do I convert the values from text to float and add 5 to each of them using a for loop?
My desired output is:
125
154
100
098
086
Here is my code:
adjusted = [(adj + (y)) for y in floats]
A1 = adjusted[0:1]
A2 = adjusted[1:2]
A3 = adjusted[2:3]
A4 = adjusted[3:4]
A5 = adjusted[4:5]
print A1
print A2
print A3
print A4
print A5
A11= [float(x) for x in adjusted]
FbearingBC = 5 + float(A11)
print FbearingBC
It gives me an error saying I can't add float and string.
Please help.
Assuming that you have:
adjusted = ['120', '149', '095', ...]
The simplest way to convert to float and add five is:
converted = [float(s) + 5 for s in adjusted]
This will create a new list, iterating over each string s in the old list, converting it to float and adding 5.
Assigning each value in the list to a separate name (A1, etc.) is not a good way to proceed; this will break if you have more or fewer entries than expected. It is much cleaner and less error-prone to access each value by index (e.g. adjusted[0]) or iterate over them (e.g. for a in adjusted:).
This code should work:
with open('your_text_file.txt', 'rt') as f:
    for line in f:
        print(float(line.split()[0]) + 5)
It will display:
125.0
154.0
100.0
98.0
86.0
Or if you need all your values in a list:

with open('your_text_file.txt', 'rt') as f:
    values = [float(line.split()[0]) + 5 for line in f]
print(values)
Since you are reading the data from a file, this could be done as below:

with open('data.txt', 'r+') as f:  # data.txt is the file containing your data
    for line in f.readlines():  # get one line of data from the file
        number = float(line.split(' ')[0])  # split by ' ' and take the first number
        print("%03d" % (number + 5))  # add 5, then print using the format "%03d"

Hope this helps.
>>> [float(value)+5 for value in adjusted]
[125.0, 154.0, 100.0, 98.0, 86.0]
This is a list comprehension: it converts each value to a float, adds 5, and collects the results into a list, all in one line. However, if all of your numbers are integers you may want to convert them to int so you don't have the trailing .0s.
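If the zero-padded three-digit strings shown in the desired output are what is wanted, the formatting can happen in the same comprehension:

```python
adjusted = ['120', '149', '095', '093', '081']

# Convert each string, add 5, and format back to a zero-padded
# three-digit string to match the desired output.
result = ['%03d' % (float(s) + 5) for s in adjusted]
print(result)
```

The %03d specifier truncates the float to an integer and pads it to three digits, giving '098' rather than 98.0.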