I want to extract the number corresponding to O2H from the following file format (the delimiter used here is a space):
# Timestep No_Moles No_Specs SH2 S2H4 S4H6 S2H2 H2 S2H3 OSH2 Mo1250O3736S57H111 OSH S3H6 OH2 S3H4 O2S SH OS2H3
144500 3802 15 3639 113 1 10 18 2 7 1 3 2 1 2 1 1 1
# Timestep No_Moles No_Specs SH2 S2H4 S2H2 H2 S2H3 OSH2 Mo1250O3733S61H115 OS2H2 OSH S3H6 OS O2S2H2 OH2 S3H4 SH
149000 3801 15 3634 114 11 18 2 7 1 1 2 2 1 1 4 2 1
# Timestep No_Moles No_Specs SH2 OS2H3 S3H Mo1250O3375S605H1526 OS S2H4 O3S3H3 OSH2 OSH S2H2 H2 OH2 OS2H2 S2H O2S3H3 SH O4S4H4 OH O2S2H O6S5H3 O6S5H5 O3S4H4 O2S3H2 O3S4H3 OS3H3 O3S2H2 O4S3H4 O3S3H O6S4H5 OS4H3 O3S2H O5S4H4 OS2H O2SH2 S2H3 O4S3H3 O3S3H4 O O5S3H4 O5S3H3 OS3H4 O2S4H4 O4S4H3 O2SH O2S2H2 O5S4H5 O3S3H2 S3H6
589000 3269 48 2900 11 1 1 47 11 1 81 74 26 25 21 17 1 3 5 2 3 3 1 1 2 2 1 2 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Timestep No_Moles No_Specs SH2 Mo1250O3034S578H1742 OH2 OSH2 O3S3H5 OS2H2 OS OSH O2S3H2 OH O3S2H2 O6S6H4 SH O2S2H2 S2H2 OS2H H2 OS2H3 O5S4H2 O7S6H5 S3H2 O2SH2 OSH3 O7S6H4 O2S2H3 O6S5H3 O2SH O4S4H O3S2H3 S2 O2S2H S5H3 O7S4H4 O3S3H OS3H OS4H O5S3H3 S3H O17S12H9 O3S3H2 O7S5H4 O4SH3 O3S2H O7S8H4 O3S3H3 O11S9H6 OS3H2 S4H2 O10S8H6 O4S3H2 O5S5H4 O6S8H4 OS2 OS3H6 S3H3
959500 3254 55 2597 1 83 119 1 46 59 172 4 3 4 1 27 7 38 6 23 3 1 2 3 5 3 1 2 1 2 1 1 6 3 1 1 2 1 1 1 1 1 3 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1
That is, each alternate row contains the data corresponding to the header row just above it.
I want the output to look like this:
1
4
21
83
How it should work:
1 (14th number on 2nd row which corresponds to 14th word of 1st row i.e. O2H)
4 (16th number on 4th row which corresponds to 16th word of 3rd row i.e. O2H)
21 (15th number on 6th row which corresponds to 15th word of 5th row i.e. O2H)
83 (6th number on 8th row which corresponds to 6th word of 7th row i.e. O2H)
I was trying to extract it using regex but could not do it. Can anyone please help me extract the data?
You can easily parse this into a dataframe and select the desired column to fetch the values.
Assuming your data looks like the sample you've provided, you can try the following:
import pandas as pd

with open("data.txt") as f:
    lines = [line.strip() for line in f.readlines()]

# pair each header line with the data line that follows it, so every
# row is parsed against its own header (the species order differs per block)
records = []
for head, data in zip(lines[::2], lines[1::2]):
    names = head.replace("#", "").split()
    records.append(dict(zip(names, data.split())))

df = pd.DataFrame(records)
print(df["OH2"])
df.to_csv("parsed_data.csv", index=False)
Output:
0     1
1     4
2    21
3    83
Name: OH2, dtype: object
The to_csv call at the end also dumps the full table to a .csv, with one column per species, for any further processing.
I think you want OH2 and not O2H; it's a typo. Assuming this:
(1) iterate over every line
(2) take into account the alternating structure: header lines are followed by data lines (e.g. with a 0-indexed line counter, headers are the even lines)
(3) split the header line on spaces and record the index of OH2 in it (it is the 14th word in the first header)
(4) access the following line (index + 1), split it on spaces, and take the element at the index found in step (3)
Since you haven't posted any code, I assumed your problem was more about finding an approach than about coding, so I wrote out the algorithm; a rough sketch of it follows.
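A minimal sketch of those steps (assuming the data is in input.txt, and noting that the leading # makes each header one token longer than its data line):

with open('input.txt') as fin:
    lines = [line.split() for line in fin]

# pair each header line with the data line that follows it
for header, data in zip(lines[::2], lines[1::2]):
    if 'OH2' in header:
        # '#' is an extra first token in the header, hence the -1
        print(data[header.index('OH2') - 1])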
Thank you, everyone, for the help. I figured out the solution:
i = 0
j = 1
with open('input.txt', 'r') as fin:
    with open('output.txt', 'w') as fout:
        for lines in fin:              # iterating over each line
            lists = lines.split()      # splits each line into a list of words
            try:
                if i % 2 == 0:         # odd lines (headers)
                    index_of_OH2 = lists.index('OH2')
                    #print(index_of_OH2)
                if j % 2 == 0:         # even lines (data)
                    number_of_OH2 = lists[index_of_OH2 - 1]
                    print(number_of_OH2 + '\n')
                    fout.write(number_of_OH2 + '\n')
            except:
                pass
            i = i + 1                  # increment on every line so the parity stays in step
            j = j + 1
Output:
1
4
21
83
The try/except: pass is there so that if OH2 is not found in a line, the loop moves on without an error.
I've been working on a recursive solution to Pascal's Triangle, and I've found a lot of resources/code on how to have the output print as a list. However, I need the output to look like the below:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
etc.
I've been trying to adapt some of the solutions I've seen, converting the list or nested-list output to strings so I can achieve the above layout, but I am getting stuck. Below is what I have so far, but it only outputs: 1 1
Any help would be appreciated! :)
def triangle(n):
    if n == 0:
        return []
    elif n == 1:
        return "1"
    else:
        new_row = "1"
        result = triangle(n-1)
        last_row = result[-1]
        for i in range(len(last_row)-1):
            new_row = ' '.join([last_row[i]], [last_row[i+1]])
        new_row = new_row + "1"
        result = ' '.join(new_row)
        return result

if __name__ == '__main__':
    print(triangle(10))
There are several things wrong with your code. The first is the base case: there only needs to be one base case, where n == 1, and it should return [[1]], a list containing a list which contains 1.
The next is that each new_row should be a list, so I start it off as [1].
The next is that as you iterate over the previous row you need to add adjacent elements together, i.e. add an int to an int rather than concatenate strings.
Lastly, the new_row should be appended to the result of the previous call to triangle().
Here is the modified code:
def triangle(n):
    if n == 1:
        return [[1]]                   # base case: the triangle with one row
    new_row = [1]
    result = triangle(n - 1)
    last_row = result[-1]
    for i in range(len(last_row) - 1):
        new_row.append(last_row[i] + last_row[i + 1])  # sum adjacent elements
    new_row.append(1)
    result.append(new_row)
    return result

if __name__ == '__main__':
    for row in triangle(10):
        print(*row)
Output:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
1 9 36 84 126 126 84 36 9 1
I'm trying to extract tables from log files which are in .txt format. The file is loaded using read_csv() from pandas.
The log file looks like this:
aaa
bbb
ccc
=====================
A    B    C    D    E    F
=====================
1    2    3    4    5    6
7    8    9    1    2    3
4    5    6    7    8    9
1    2    3    4    5    6
---------------------
=====================
G    H    I    J
=====================
1         3    4
5         6    7
---------------------
=====================
K    L    M    N    O
=====================
1              2    3
4    5              6
7    8         9
---------------------
xxx
yyy
zzz
Here are some points about the log file:
Files start and end with some lines of comment which can be ignored.
In the example above there are three tables.
Headers for each table are located between lines of "======..."
The end of each table is signified by a line of "------..."
My code as of now:
import pandas as pd
import itertools

df = pd.read_csv("xxx.txt", sep="\n", header=None)

# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21

for i in range(len(df.index)-2):
    # if block to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # while loop to find lines which are table rows & append to one list
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            r.append(df.iloc[i+x].str.split().tolist())
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
This code returns AssertionError: 14 columns passed, passed data had 15 columns. I know this is because, for the table rows, I am using .str.split(), which by default splits on whitespace. Since some columns have missing values, the number of elements in the table header and the number of elements in a table row do not match for the second and third tables. I am struggling to get around this, since the number of whitespace characters that signifies a missing value is different for each table.
My question is: is there a way to account for missing values in some of the columns, so that I get a DataFrame as output with null or NaN for missing values as appropriate?
Using Victor Ruiz's method, I added if conditions to handle different header sizes.
=^..^=
Description in code:
import re
import pandas as pd
import itertools

df = pd.read_csv("stack.txt", sep="\n", header=None)

# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21

for i in range(len(df.index)-2):
    # if block to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # get header string
        head = df.iloc[i+1].to_string()
        # get space distance in header
        space_range = 0
        for result in re.findall('([ ]*)', head):
            if len(result) > 0:
                space_range = len(result)
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            # strip line
            line = df.iloc[i+x].to_string()[5::]
            # collect items based on element distance
            items = []
            for result in re.finditer(r'(\d+)([ ]*)', line):
                item, delimiter = result.groups()
                items.append(item)
                if len(delimiter) > space_range*2+1:
                    items.append('NaN')
                    items.append('NaN')
                if len(delimiter) < space_range*2+2 and len(delimiter) > space_range:
                    items.append('NaN')
            r.append([items])
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
Output:
   A  B  C  D  E  F
0  1  2  3  4  5  6
1  7  8  9  1  2  3
2  4  5  6  7  8  9
3  1  2  3  4  5  6

   G    H  I  J
0  1  NaN  3  4
1  5  NaN  6  7

   K    L    M    N     O
0  1  NaN  NaN    2     3
1  4    5  NaN  NaN     6
2  7    8  NaN    9  None
Maybe this can help you.
Suppose we have the following line of text:
1         3    4
The problem is to identify how many spaces delimit two consecutive items, as opposed to how many indicate a missing value between them.
Let's consider that up to 5 spaces is a delimiter, and more than 5 means a missing value.
You can use regex to parse the items:
from re import finditer
from numpy import nan

line = '1         3    4'

items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    if len(delimiter) > 5:
        items.append(nan)
print(items)
Output is:
['1', nan, '3', '4']
A more complex situation would be if two or more consecutive missing values can appear (the code above will inject only one nan); a sketch of handling that follows.
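One hedged way to handle that case, assuming each column in the log (value plus its delimiter) occupies a fixed width, here taken to be 5 characters, is to count how many full column widths each gap spans:

from re import finditer
from numpy import nan

line = '1              2    3'   # first row of the K L M N O table: L and M are missing
col_width = 5                    # assumed fixed width of one column

items = []
for result in finditer(r'(\d+)( *)', line):
    item, gap = result.groups()
    items.append(item)
    # every full column width of trailing spaces marks one missing value
    items.extend([nan] * (len(gap) // col_width))
print(items)   # ['1', nan, nan, '2', '3']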
I have the following output containing two columns (line# and ID):
1 Q50331
2 P75247
3 P75544
4 P22446
5 P78027
6 P75271
7 P75176
8 P0ABB4
9 P63284
10 P0A6M8
11 P0AES4
12 P39452
13 P0A8T7
14 P0A698
How can I make the ID values in the second column line up with each other, like the following:
1  Q50331
2  P75247
3  P75544
4  P22446
5  P78027
6  P75271
7  P75176
8  P0ABB4
9  P63284
10 P0A6M8
11 P0AES4
12 P39452
13 P0A8T7
14 P0A698
The problem I am facing is how to incorporate a solution into my code. I tried to use Python's tabulate, but found it does not work properly here, since what I am printing, row[0], is a unicode string from the tuple row (see the following code).
count = 0
for row in c:
    count += 1
    print count, row[0]
Any idea how I can incorporate tabulate or other methods to align the unicode-type values in the column?
Use alignment specifiers:
data = {
    1: 'Q50331',
    2: 'P75247',
    3: 'P75544',
    4: 'P22446',
    5: 'P78027',
    6: 'P75271',
    7: 'P75176',
    8: 'P0ABB4',
    9: 'P63284',
    10: 'P0A6M8',
    11: 'P0AES4',
    12: 'P39452',
    13: 'P0A8T7',
    14: 'P0A698',
    333: 'P00bar'
}

length = len(str(max(data.keys()))) + 1
for k, v in data.items():
    print "{:<{}}{}".format(k, length, v)
Output:
1   Q50331
2   P75247
3   P75544
4   P22446
5   P78027
6   P75271
7   P75176
8   P0ABB4
9   P63284
10  P0A6M8
11  P0AES4
12  P39452
13  P0A8T7
14  P0A698
333 P00bar
I've created length, which holds the length of the largest key in data, plus 1. I then pass that length to the alignment specifier, which in this case effectively becomes:
{:<4}{}
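Applied to the loop from the question (assuming c yields the same rows), the same idea would look like:

count = 0
for row in c:
    count += 1
    print "{:<4}{}".format(count, row[0])   # width 4 is enough for counts up to 999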
I checked similar topics, but the results are poor.
I have a file like this:
S1_22 45317082 31 0 9 22 1543
S1_23 3859606 40 3 3 34 2111
S1_24 48088383 49 6 1 42 2400
S1_25 43387855 39 1 7 31 2425
S1_26 39016907 39 2 7 30 1977
S1_27 57612149 23 0 0 23 1843
S1_28 42505824 23 1 1 21 1092
S1_29 54856684 18 0 2 16 1018
S1_29 54856684 18 0 2 16 1018
S1_29 54856684 18 0 2 16 1018
S1_29 54856684 18 0 2 16 1018
I wanted to count occurrences of the words in the first column and, based on that, write an output file with an additional field stating uniq if count == 1 and multi if count > 1.
I produced the code:
import csv
import collections

infile = 'Results'

names = collections.Counter()
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        names[row[0]] += 1
print names[row[0]], row[0]
but it doesn't work properly.
I can't put everything into a list, since the file is too big.
If you want this code to work you should indent your print statement:
        names[row[0]] += 1
        print names[row[0]], row[0]
But what you actually want is:
import csv
import collections

infile = 'Results'

names = collections.Counter()
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        names[row[0]] += 1

for name, count in names.iteritems():
    print name, count
Edit: To show the rest of the row, you can use a second dict, as in:
names = collections.Counter()
rows = {}
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        rows[row[0]] = row
        names[row[0]] += 1

for name, count in names.iteritems():
    print rows[name], count
The print statement at the end does not look like what you want. Because of its indentation it is only executed once. It will print S1_29, since that is the value of row[0] in the last iteration of the loop.
You're on the right track. Instead of that print statement, just iterate through the names and counts in the counter and check whether each count is 1 (uniq) or greater (multi); a sketch of writing that output follows.
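A minimal sketch of that last step, assuming a hypothetical output file named Results_annotated, and re-reading the input rather than holding it in memory (since the file is too big for a list):

import csv
import collections

infile = 'Results'
outfile = 'Results_annotated'   # hypothetical output name

# first pass: count occurrences of the first column
names = collections.Counter()
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        names[row[0]] += 1

# second pass: re-read the file and append uniq/multi to each row,
# so only the counts are ever held in memory
with open(infile) as input_file, open(outfile, 'w') as output_file:
    writer = csv.writer(output_file, delimiter='\t')
    for row in csv.reader(input_file, delimiter='\t'):
        label = 'uniq' if names[row[0]] == 1 else 'multi'
        writer.writerow(row + [label])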
I have two arrays (a and b) with n integer elements in the range (0,N).
typo: arrays with 2^n integers where the largest integer takes the value N = 3^n
I want to calculate the sum of every combination of elements in a and b (sum_ij = a_i + b_j for all i, j), then take the modulus N (sum_ij = sum_ij % N), and finally calculate the frequency of the different sums.
In order to do this fast with numpy, without any loops, I tried to use the meshgrid and the bincount function.
A,B = numpy.meshgrid(a,b)
A = A + B
A = A % N
A = numpy.reshape(A,A.size)
result = numpy.bincount(A)
Now, the problem is that my input arrays are long, and meshgrid gives me a MemoryError when I use inputs with 2^13 elements. I would like to calculate this for arrays with 2^15-2^20 elements.
that is, n in the range 15 to 20
Are there any clever tricks to do this with numpy?
Any help will be highly appreciated.
--
jon
Try chunking it. Your meshgrid is an n x n matrix: block it up into, say, a 10x10 grid of (n/10) x (n/10) submatrices, compute the bincount of each of the 100 blocks, and add them up at the end. Each block then uses only ~1% as much memory as doing the whole thing; a sketch follows.
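A hedged sketch of that idea (the sizes here are illustrative, not the asker's actual 2^20):

import numpy as np

n = 2**15
N = 3**5                      # illustrative sizes only
a = np.random.randint(0, N, size=n)
b = np.random.randint(0, N, size=n)

chunk = 256                   # rows of the virtual n x n sum matrix per block
result = np.zeros(N, dtype=np.int64)
for start in range(0, n, chunk):
    # one chunk x n block of (a_i + b_j) % N, instead of the full n x n grid
    block = (a[start:start+chunk, None] + b[None, :]) % N
    result += np.bincount(block.ravel(), minlength=N)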
Edit in response to jonalm's comment:
jonalm: N~3^n not n~3^N. N is max element in a and n is number of
elements in a.
n is ~ 2^20. If N is ~ 3^n then N is ~ 3^(2^20) > 10^(500207).
Scientists estimate (http://www.stormloader.com/ajy/reallife.html) that there are only around 10^87 particles in the universe. So there is no (naive) way a computer can handle an int of size 10^(500207).
jonalm: I am however a bit curious about the pv() function you define. (I
do not manage to run it, as text.find() is not defined (guess it's in another
module)). How does this function work and what is its advantage?
pv is a little helper function I wrote to debug the value of variables. It works like
print() except when you say pv(x) it prints both the literal variable name (or expression string), a colon, and then the variable's value.
If you put
#!/usr/bin/env python
import traceback

def pv(var):
    (filename, line_number, function_name, text) = traceback.extract_stack()[-2]
    print('%s: %s' % (text[text.find('(')+1:-1], var))

x = 1
pv(x)
in a script you should get
x: 1
The modest advantage of using pv over print is that it saves you typing. Instead of having to
write
print('x: %s'%x)
you can just slap down
pv(x)
When there are multiple variables to track, it's helpful to label the variables.
I just got tired of writing it all out.
The pv function works by using the traceback module to peek at the line of code
used to call the pv function itself. (See http://docs.python.org/library/traceback.html#module-traceback) That line of code is stored as a string in the variable text.
text.find() is a call to the usual string method find(). For instance, if
text='pv(x)'
then
text.find('(') == 2 # The index of the '(' in string text
text[text.find('(')+1:-1] == 'x' # Everything in between the parentheses
I'm assuming n ~ 3^N, and n~2**20
The idea is to work modulo N. This cuts down on the size of the arrays.
The second idea (important when n is huge) is to use numpy ndarrays of 'object' dtype, because with an integer dtype you run the risk of overflowing the maximum integer size allowed.
#!/usr/bin/env python
import traceback
import numpy as np

def pv(var):
    (filename, line_number, function_name, text) = traceback.extract_stack()[-2]
    print('%s: %s' % (text[text.find('(')+1:-1], var))
You can change n to 2**20, but below I show what happens with a small n so the output is easier to read.
n=100
N=int(np.exp(1./3*np.log(n)))
pv(N)
# N: 4
a=np.random.randint(N,size=n)
b=np.random.randint(N,size=n)
pv(a)
pv(b)
# a: [1 0 3 0 1 0 1 2 0 2 1 3 1 0 1 2 2 0 2 3 3 3 1 0 1 1 2 0 1 2 3 1 2 1 0 0 3
# 1 3 2 3 2 1 1 2 2 0 3 0 2 0 0 2 2 1 3 0 2 1 0 2 3 1 0 1 1 0 1 3 0 2 2 0 2
# 0 2 3 0 2 0 1 1 3 2 2 3 2 0 3 1 1 1 1 2 3 3 2 2 3 1]
# b: [1 3 2 1 1 2 1 1 1 3 0 3 0 2 2 3 2 0 1 3 1 0 0 3 3 2 1 1 2 0 1 2 0 3 3 1 0
# 3 3 3 1 1 3 3 3 1 1 0 2 1 0 0 3 0 2 1 0 2 2 0 0 0 1 1 3 1 1 1 2 1 1 3 2 3
# 3 1 2 1 0 0 2 3 1 0 2 1 1 1 1 3 3 0 2 2 3 2 0 1 3 1]
wa holds the number of 0s, 1s, 2s, 3s in a
wb holds the number of 0s, 1s, 2s, 3s in b
wa=np.bincount(a)
wb=np.bincount(b)
pv(wa)
pv(wb)
# wa: [24 28 28 20]
# wb: [21 34 20 25]
result=np.zeros(N,dtype='object')
Think of a 0 as a token or chip. Similarly for 1,2,3.
Think of wa=[24 28 28 20] as meaning there is a bag with 24 0-chips, 28 1-chips, 28 2-chips, 20 3-chips.
You have a wa-bag and a wb-bag. When you draw a chip from each bag, you "add" them together and form a new chip. You "mod" the answer (modulo N).
Imagine taking a 1-chip from the wb-bag and adding it with each chip in the wa-bag.
1-chip + 0-chip = 1-chip
1-chip + 1-chip = 2-chip
1-chip + 2-chip = 3-chip
1-chip + 3-chip = 4-chip = 0-chip (we are mod'ing by N=4)
Since there are 34 1-chips in the wb bag, when you add them against all the chips in the wa=[24 28 28 20] bag, you get
34*24 1-chips
34*28 2-chips
34*28 3-chips
34*20 0-chips
This is just the partial count due to the 34 1-chips. You also have to handle the other
types of chips in the wb-bag, but this shows you the method used below:
for i, count in enumerate(wb):
    partial_count = count * wa
    pv(partial_count)
    shifted_partial_count = np.roll(partial_count, i)
    pv(shifted_partial_count)
    result += shifted_partial_count
# partial_count: [504 588 588 420]
# shifted_partial_count: [504 588 588 420]
# partial_count: [816 952 952 680]
# shifted_partial_count: [680 816 952 952]
# partial_count: [480 560 560 400]
# shifted_partial_count: [560 400 480 560]
# partial_count: [600 700 700 500]
# shifted_partial_count: [700 700 500 600]
pv(result)
# result: [2444 2504 2520 2532]
This is the final result: 2444 0s, 2504 1s, 2520 2s, 2532 3s.
# This is a test to make sure the result is correct.
# It uses a very memory-intensive method.
# c is too huge when n is large.
if n > 1000:
    print('n is too large to run the check')
else:
    c = (a[:] + b[:, np.newaxis])
    c = c.ravel()
    c = c % N
    result2 = np.bincount(c)
    pv(result2)
    assert(all(r1 == r2 for r1, r2 in zip(result, result2)))
# result2: [2444 2504 2520 2532]
Check your math; that's a lot of space you're asking for:
2^20 * 2^20 = 2^40 = 1 099 511 627 776
If each of your elements were just one byte, that would already be one terabyte of memory.
Add a loop or two. This problem is not suited to maxing out your memory while minimizing your computation.