I have a text file with the following content:
str1 str2 str3 str4
0 1 12 34
0 2 4 6
0 3 5 22
0 56 2 18
0 3 99 12
0 8 5 7
1 66 78 9
I want to read the above text file into a list such that the program starts reading from the first row whose first column has a value greater than zero.
How do I do it in Python 3.5?
I tried genfromtxt(), but it can only skip a fixed number of lines from the top. Since I will be reading different files, I need something else.
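For reference, a minimal plain-Python sketch of one reading of the requirement (start at the first row whose first column is positive and keep everything from there on); the header handling assumes whitespace-separated integer data, and the function name is just illustrative:

```python
def read_from_first_positive(lines):
    """Collect integer rows starting at the first row whose column 0 > 0."""
    rows = []
    started = False
    for line in lines:
        parts = line.split()
        if not parts or not parts[0].lstrip('-').isdigit():
            continue  # skip the header row (or any non-numeric line)
        row = [int(p) for p in parts]
        if not started and row[0] > 0:
            started = True  # first positive row found; keep everything from here
        if started:
            rows.append(row)
    return rows

sample = """str1 str2 str3 str4
0 1 12 34
1 66 78 9
2 50 45 4""".splitlines()
print(read_from_first_positive(sample))  # → [[1, 66, 78, 9], [2, 50, 45, 4]]
```

To read a real file, pass open('file.txt') (or its lines) instead of the sample list.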
This is one way with the csv module.
import csv
from io import StringIO
mystr = StringIO("""\
str1 str2 str3 str4
0 1 12 34
0 2 4 6
0 3 5 22
0 56 2 18
0 3 99 12
0 8 5 7
1 66 78 9
2 50 45 4
""")
res = []
# replace mystr with open('file.csv', 'r')
with mystr as f:
    reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
    next(reader)  # skip header
    for line in reader:
        row = list(map(int, filter(None, line)))  # convert to integers
        if row[0] > 0:  # apply condition
            res.append(row)
print(res)
[[1, 66, 78, 9], [2, 50, 45, 4]]
lst = []
flag = 0
with open('a.txt') as f:
    for line in f:
        try:
            if float(line.split()[0].strip('.')) > 0:
                flag = 1
            if flag == 1:
                lst += [float(i.strip('.')) for i in line.split()]
        except (IndexError, ValueError):  # skip the header or malformed lines
            pass
I want to extract the number corresponding to O2H from the following file format (The delimiter used here is space):
# Timestep No_Moles No_Specs SH2 S2H4 S4H6 S2H2 H2 S2H3 OSH2 Mo1250O3736S57H111 OSH S3H6 OH2 S3H4 O2S SH OS2H3
144500 3802 15 3639 113 1 10 18 2 7 1 3 2 1 2 1 1 1
# Timestep No_Moles No_Specs SH2 S2H4 S2H2 H2 S2H3 OSH2 Mo1250O3733S61H115 OS2H2 OSH S3H6 OS O2S2H2 OH2 S3H4 SH
149000 3801 15 3634 114 11 18 2 7 1 1 2 2 1 1 4 2 1
# Timestep No_Moles No_Specs SH2 OS2H3 S3H Mo1250O3375S605H1526 OS S2H4 O3S3H3 OSH2 OSH S2H2 H2 OH2 OS2H2 S2H O2S3H3 SH O4S4H4 OH O2S2H O6S5H3 O6S5H5 O3S4H4 O2S3H2 O3S4H3 OS3H3 O3S2H2 O4S3H4 O3S3H O6S4H5 OS4H3 O3S2H O5S4H4 OS2H O2SH2 S2H3 O4S3H3 O3S3H4 O O5S3H4 O5S3H3 OS3H4 O2S4H4 O4S4H3 O2SH O2S2H2 O5S4H5 O3S3H2 S3H6
589000 3269 48 2900 11 1 1 47 11 1 81 74 26 25 21 17 1 3 5 2 3 3 1 1 2 2 1 2 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Timestep No_Moles No_Specs SH2 Mo1250O3034S578H1742 OH2 OSH2 O3S3H5 OS2H2 OS OSH O2S3H2 OH O3S2H2 O6S6H4 SH O2S2H2 S2H2 OS2H H2 OS2H3 O5S4H2 O7S6H5 S3H2 O2SH2 OSH3 O7S6H4 O2S2H3 O6S5H3 O2SH O4S4H O3S2H3 S2 O2S2H S5H3 O7S4H4 O3S3H OS3H OS4H O5S3H3 S3H O17S12H9 O3S3H2 O7S5H4 O4SH3 O3S2H O7S8H4 O3S3H3 O11S9H6 OS3H2 S4H2 O10S8H6 O4S3H2 O5S5H4 O6S8H4 OS2 OS3H6 S3H3
959500 3254 55 2597 1 83 119 1 46 59 172 4 3 4 1 27 7 38 6 23 3 1 2 3 5 3 1 2 1 2 1 1 6 3 1 1 2 1 1 1 1 1 3 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1
That is, all the alternate rows contain the corresponding data of its previous row.
And I want the output to look like this
1
4
21
83
How it should work:
1 (14th number on 2nd row which corresponds to 14th word of 1st row i.e. O2H)
4 (16th number on 4th row which corresponds to 16th word of 3rd row i.e. O2H)
21 (15th number on 6th row which corresponds to 15th word of 5th row i.e. O2H)
83 (6th number on 8th row which corresponds to 6th word of 7th row i.e. O2H)
I was trying to extract it using regex but could not do it. Can anyone please help me extract the data?
You can easily parse this into a dataframe and select the desired column to fetch the values.
Assuming your data looks like the sample you've provided, you can try the following:
import pandas as pd
with open("data.txt") as f:
    lines = [line.strip() for line in f.readlines()]
header = max(lines, key=len).replace("#", "").split()
df = pd.DataFrame([line.split() for line in lines[1::2]], columns=header)
print(df["OH2"])
df.to_csv("parsed_data.csv", index=False)
Output:
0 1
1 11
2 1
3 83
Name: OH2, dtype: object
Dumping this to a .csv would yield:
I think you want OH2 and not O2H; it's probably a typo. Assuming this:
(1) Iterate over every single line.
(2) Skip the data lines so that only the header lines are inspected (e.g. if (line_counter % 2) == 0: continue).
(3) Split the header line on whitespace and, using a counter variable, find the index of OH2; assume it is 14 in the first line.
(4) Access the following line (+1 index), split it on whitespace too, and take the element at the index found in step (3).
Since you haven't posted any code, I assumed your problem was more about finding a way to achieve this than about coding, so I wrote out the algorithm.
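The four steps above can be sketched roughly like this (Python 3 syntax; the function and the small demo data are illustrative, not the poster's actual file):

```python
def extract_counts(lines, species='OH2'):
    """For each header/data line pair, return the count for the given species."""
    counts = []
    # pair each '#' header line with the data line that follows it
    for header, data in zip(lines[0::2], lines[1::2]):
        names = header.split()   # ['#', 'Timestep', 'No_Moles', ..., 'OH2', ...]
        values = data.split()
        if species in names:
            # subtract 1 because the data line has no leading '#'
            counts.append(int(values[names.index(species) - 1]))
    return counts

demo = [
    "# Timestep No_Moles No_Specs SH2 OH2 SH",
    "100 5 3 2 7 1",
]
print(extract_counts(demo))  # → [7]
```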
Thank you, everyone, for the help, I figured out the solution
i = 0
j = 1
with open('input.txt', 'r') as fin:
    with open('output.txt', 'w') as fout:
        for lines in fin:  # iterate over each line
            lists = lines.split()  # split each line into a list of words
            try:
                if i % 2 == 0:  # header lines
                    index_of_OH2 = lists.index('OH2')
                    #print(index_of_OH2)
                i = i + 1
                if j % 2 == 0:  # data lines
                    # minus 1 because the data line has no leading '#'
                    number_of_OH2 = lists[index_of_OH2 - 1]
                    print(number_of_OH2 + '\n')
                    fout.write(number_of_OH2 + '\n')
                j = j + 1
            except:
                pass
Output:
1
4
21
83
try/except pass is added so that if OH2 is not found in a line, the loop moves on without raising an error.
I have a csv file which contains 65000 lines (size approximately 28 MB). In each line a certain path is given at the beginning, e.g. "c:\abc\bcd\def\123\456". Now let's say the path "c:\abc\bcd\" is common to all the lines and the rest of the content is different. I have to remove the common part (in this case "c:\abc\bcd\") from all the lines using a Python script. For example, the content of the CSV file is as follows:
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert 87 116 69
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0
In the above example I need the output as below
FILE0.frag 0 0 0
FILE0.vert 0 0 0
FILE0.link-link-0.frag 17 25 2
FILE0.link-link-0.vert 85 111 68
FILE0.link-link-0.vert.bin 77 97 60
FILE0.link-link-0 0 0
FILE0.link 0 0 0
Can any of you please help me out with this?
^\S+/
You can simply use this regex on each line and replace the match with an empty string. See demo:
https://regex101.com/r/cK4iV0/17
import re
p = re.compile(ur'^\S+/', re.MULTILINE)
test_str = u"C:/Abc/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert 87 116 69\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert.bin 75 95 61\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0 0 0\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0 "
subst = u" "
result = re.sub(p, subst, test_str)
What about something like,
import csv
with open("file.csv", 'rb') as f:
    sl = []
    csvread = csv.reader(f, delimiter=' ')
    for line in csvread:
        line[0] = line[0].replace("C:/Abc/bcd/Def/Test/temp/test/GLNext/", "")
        sl.append(line)
To write the list sl out to filenew use,
with open('filenew.csv', 'wb') as f:
    csvwrite = csv.writer(f, delimiter=' ')
    for line in sl:
        csvwrite.writerow(line)
You can automatically detect the common prefix without the need to hardcode it. You don't really need regex for this; os.path.commonprefix can be used instead:
import csv
import os
with open('data.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    paths = []  # stores all paths
    rows = []   # stores all lines
    for row in reader:
        paths.append(row[0].split("/"))  # split path by "/"
        rows.append(row)
commonprefix = os.path.commonprefix(paths)  # finds prefix common to all paths
for row in rows:
    row[0] = row[0].replace('/'.join(commonprefix) + '/', "")  # remove prefix
rows now has a list of lists which you can write to a file
with open('data2.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for row in rows:
        writer.writerow(row)
The following Python script will read your file in (assuming it looks like your example) and will create a version removing the common folders:
import os.path, csv
finput = open("d:\\input.csv", "r")
csv_input = csv.reader(finput, delimiter=" ", skipinitialspace=True)
csv_output = csv.writer(open("d:\\output.csv", "wb"), delimiter=" ")
# Create a set of unique folder names
set_folders = set()
for input_row in csv_input:
    set_folders.add(os.path.split(input_row[0])[0])
# Determine the common prefix
base_folder = os.path.split(os.path.commonprefix(set_folders))[0]
nprefix = len(base_folder) + 1
# Go back to the start of the input CSV
finput.seek(0)
for input_row in csv_input:
    csv_output.writerow([input_row[0][nprefix:]] + input_row[1:])
Using the following as input:
C:/Abc/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3
C:/Abc/Def/Test/temp/test/GLNext2/FILE0.link-link-0.vert 87 116 69
C:/Abc/Def/Test/temp/test/GLNext5/FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/Def/Test/temp/test/GLNext7/FILE0.link-link-0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0
The output is as follows:
GLNext/FILE0.frag 0 0 0
GLNext/FILE0.vert 0 0 0
GLNext/FILE0.link-link-0.frag 16 24 3
GLNext2/FILE0.link-link-0.vert 87 116 69
GLNext5/FILE0.link-link-0.vert.bin 75 95 61
GLNext7/FILE0.link-link-0 0 0
GLNext/FILE0.link-link-6 0 0 0
With one space between each column, although this could easily be changed.
So I tried something like this:
for dirName, subdirList, fileList in os.walk(Directory):
    for fname in fileList:
        if fname.endswith('.csv'):
            for line in fileinput.input(os.path.join(dirName, fname), inplace=1):
                location = line.find(r'GLNext')
                if location > 0:
                    location += len('GLNext')
                    print line.replace(line[:location], ".")
                else:
                    print line
You can use the pandas library for this. Doing so, you can leverage pandas' amazing handling of big CSV files (even in the hundreds of MB).
Code:
import pandas as pd
csv_file = 'test_csv.csv'
df = pd.read_csv(csv_file, header=None)
print df
print "-------------------------------------------"
path = "C:/Abc/bcd/Def/Test/temp/test/GLNext/"
df[0] = df[0].replace({path:""}, regex=True)
print df
# df.to_csv("truncated.csv") # Export to new file.
Result:
0 1 2 3
0 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
1 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
2 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 16 24 3
3 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 87 116 69
4 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 75 95 61
5 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 0 0 NaN
6 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 0 0 0
-------------------------------------------
0 1 2 3
0 FILE0.frag 0 0 0
1 FILE0.vert 0 0 0
2 FILE0.link-link-0.frag 16 24 3
3 FILE0.link-link-0.vert 87 116 69
4 FILE0.link-link-0.vert.bin 75 95 61
5 FILE0.link-link-0 0 0 NaN
6 FILE0.link-link-6 0 0 0
Hi, please, how can I loop over a text file, identify lines with 0 at the last index, and delete those lines while retrieving the ones not deleted? Then also format the output to be tuples.
input.txt = 1 2 0
1 3 0
11 4 0.058529
...
...
...
97 7 0.0789
Desired output should look like this
[(11,4,{'volume': 0.058529})]
Thank you
Pass inplace=1 to fileinput.input() to modify the file in place. Everything that is printed inside the loop is written to the file:
import fileinput
results = []
for line in fileinput.input('input.txt', inplace=1):
    data = line.split()
    if data[-1].strip() == '0':
        print line.strip()
    else:
        results.append(tuple(map(int, data[:-1])) + ({'volume': float(data[-1])},))
print results
If the input.txt contains:
1 2 0
1 3 0
11 4 0.058529
97 7 0.0789
the code will print:
[(11, 4, {'volume': 0.058529}),
(97, 7, {'volume': 0.0789})]
And the contents of the input.txt becomes:
1 2 0
1 3 0
I checked similar topics, but the results are poor.
I have a file like this:
S1_22 45317082 31 0 9 22 1543
S1_23 3859606 40 3 3 34 2111
S1_24 48088383 49 6 1 42 2400
S1_25 43387855 39 1 7 31 2425
S1_26 39016907 39 2 7 30 1977
S1_27 57612149 23 0 0 23 1843
S1_28 42505824 23 1 1 21 1092
S1_29 54856684 18 0 2 16 1018
S1_29 54856684 18 0 2 16 1018
S1_29 54856684 18 0 2 16 1018
S1_29 54856684 18 0 2 16 1018
I wanted to count occurrences of words in the first column and, based on that, write an output file with an additional field stating uniq if the count == 1 and multi if the count > 1.
I produced the code:
import csv
import collections
infile = 'Results'
names = collections.Counter()
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        names[row[0]] += 1
print names[row[0]], row[0]
but it doesn't work properly.
I can't put everything into a list, since the file is too big.
If you want this code to work you should indent your print statement:
names[row[0]] += 1
print names[row[0]], row[0]
But what you actually want is:
import csv
import collections
infile = 'Result'
names = collections.Counter()
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        names[row[0]] += 1
for name, count in names.iteritems():
    print name, count
Edit: To show the rest of the row, you can use a second dict, as in:
names = collections.Counter()
rows = {}
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        rows[row[0]] = row
        names[row[0]] += 1
for name, count in names.iteritems():
    print rows[name], count
The print statement at the end does not look like what you want. Because of its indentation it is only executed once. It will print S1_29, since that is the value of row[0] in the last iteration of the loop.
You're on the right track. Instead of that print statement, just iterate through the keys and values of the counter and check whether each value is 1 (uniq) or greater than 1 (multi).
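A small Python 3 sketch of that idea (the demo rows are illustrative; for a file too large for memory, read it once to build the counter and a second time to write the labelled rows):

```python
import collections

def label_rows(rows):
    """Append 'uniq' or 'multi' to each row based on first-column counts."""
    counts = collections.Counter(row[0] for row in rows)
    for row in rows:
        yield row + ['uniq' if counts[row[0]] == 1 else 'multi']

demo = [['S1_28', '42505824'], ['S1_29', '54856684'], ['S1_29', '54856684']]
print(list(label_rows(demo)))
# → [['S1_28', '42505824', 'uniq'],
#    ['S1_29', '54856684', 'multi'],
#    ['S1_29', '54856684', 'multi']]
```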
Consider an input file with six columns (0-5):
1 0 937 306 97 3
2 164472 75 17 81 3
3 197154 35268 306 97 3
4 310448 29493 64 38 1
5 310541 29063 64 38 1
6 310684 33707 64 38 1
7 319091 47451 16 41 1
8 319101 49724 16 41 1
9 324746 61578 1 5 1
10 324939 54611 1 5 1
For the second column, i.e. column 1 (0, 164472, 197154, ...), I need to find the difference between consecutive numbers, so that column 1 becomes (0, 164472-0, 197154-164472, ...), i.e. (0, 164472, 32682, ...).
The output file must change only the column 1 values; all other values must remain the same as in the input file:
1 0 937 306 97 3
2 164472 75 17 81 3
3 32682 35268 306 97 3
4 113294 29493 64 38 1
5 93 29063 64 38 1
6 143 33707 64 38 1
7 8407 47451 16 41 1
8 10 49724 16 41 1
9 5645 61578 1 5 1
10 193 54611 1 5 1
If anyone could suggest Python code to do this it would be helpful.
Actually I tried to append all the columns into lists and find the differences for column 1, then write everything back to another file. But the input file I have posted is just a sample; the entire input file contains 50,000 lines, so my attempt failed.
The attempt code I tried is as follows:
import sys
import numpy
old_stdout = sys.stdout
log_file = open("newc","a")
sys.stdout = log_file
a1 = []; a2 = []; a2f = []; v = []; a3 = []; a4 = []; a5 = []; a6 = []
with open("newfileinput", 'r') as f:
    for line in f:
        job = map(int, line.split())
        a1.append(job[0])
        a3.append(job[2])
        a4.append(job[3])
        a5.append(job[4])
        a6.append(job[5])
        a2.append(job[1])
v = [a2[i+1] - a2[i] for i in range(len(a2)-1)]
print a1
print v
print a3
print a4
print a5
print a6
sys.stdout = old_stdout
log_file.close()
Now, from the output file of the code ("newc"), which contained six lists, I wrote them into a file one by one, which was time-consuming and not very efficient.
So if anyone could suggest a simpler method it would be helpful.
Try this. Let me know if there are any problems or if you want me to explain any of the code:
import sys
log_file = open("newc.txt","a")
this_no, prev_no = 0, 0
with open("newfileinput.txt", 'r') as f:
    for line in f:
        row = line.split()
        this_no = int(row[1])
        log_file.write(line.replace(str(this_no), str(this_no - prev_no)))
        prev_no = this_no
log_file.close()
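One caveat with the str.replace approach: it substitutes every occurrence of the timestep's digit sequence in the line, so a row whose other columns happen to contain the same digits could be corrupted. A variant that edits only column 1 explicitly might look like this (Python 3 syntax; the function name is illustrative, and it normalizes the whitespace between columns to single spaces):

```python
def diff_second_column(lines):
    """Replace column 1 of each line with its delta from the previous line."""
    prev = 0
    out = []
    for line in lines:
        cols = line.split()
        this = int(cols[1])
        cols[1] = str(this - prev)  # only column 1 is touched
        prev = this
        out.append(' '.join(cols))
    return out

demo = ["1 0 937 306 97 3", "2 164472 75 17 81 3", "3 197154 35268 306 97 3"]
print(diff_second_column(demo))
# → ['1 0 937 306 97 3', '2 164472 75 17 81 3', '3 32682 35268 306 97 3']
```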
Don't downvote me; this is just for fun.
import re
from time import sleep
p = re.compile(r'\s+')
data = '''1 0 937 306 97 3
2 164472 75 17 81 3
3 197154 35268 306 97 3
4 310448 29493 64 38 1
5 310541 29063 64 38 1
6 310684 33707 64 38 1
7 319091 47451 16 41 1
8 319101 49724 16 41 1
9 324746 61578 1 5 1
10 324939 54611 1 5 1\n''' * 5000
data = data.split('\n')[0:-1]
data = [p.split(one) for one in data]
data = [map(int, one) for one in data]
def list_diff(a, b):
    temp = a[:]
    temp[1] = a[1] - b[1]
    return temp

result = [
    data[0],
]
for i, _ in enumerate(data):
    if i < len(data) - 1:
        result.append(list_diff(data[i+1], data[i]))

for i, one in enumerate(result):
    one[0] = i + 1
    print one
    sleep(0.1)