Data extraction from a text file in Python

I have a text file that represents motion vector data from a video clip.
# pts=-26 frame_index=2 pict_type=P output_type=raw shape=3067x4
8 8 0 0
24 8 0 -1
40 8 0 0
...
8 24 0 0
24 24 3 1
40 24 0 0
...
8 40 0 0
24 40 0 0
40 40 0 0
# pts=-26 frame_index=3 pict_type=P output_type=raw shape=3067x4
8 8 0 1
24 8 0 0
40 8 0 0
...
8 24 0 0
24 24 5 -3
40 24 0 0
...
8 40 0 0
24 40 0 0
40 40 0 0
...
So it is some sort of grid where the first two numbers are the x and y coordinates and the third and fourth are the x and y components of the motion vectors.
To use this data further, I need to extract the pairs of vector x and y values where at least one value differs from 0 and organize them in lists.
For example:
(0, -1, 2)
(3, 1, 2)
(0, 1, 3)
(5, -3, 3)
The third number is the frame_index.
I would appreciate it a lot if somebody could help me with a plan for how to crack this task, and what I should start from.

This is actually quite simple since there is only one type of data.
We can do this without resorting to e.g. regular expressions.
Disregarding any error checking (Did we actually read 3067 points for frame 2, or only 3065? Is a line malformed? ...) it would look something like this
frame_data = {}  # maps frame_idx -> list of (x, y, vx, vy)
for line in open('mydatafile.txt', 'r'):
    if line.startswith('#'):  # a header line
        options = {key: value for key, value in
                   [token.split('=') for token in line[1:].split()]}
        curr_frame = int(options['frame_index'])
        curr_data = []
        frame_data[curr_frame] = curr_data
    else:  # not a header line
        x, y, vx, vy = map(int, line.split())
        curr_data.append((x, y, vx, vy))
You now have a dictionary that maps a frame number to a list of (x, y, vx, vy) tuples.
Extracting the new list from the dictionary is now easy:
result = []
for frame_number, data in frame_data.items():
    for x, y, vx, vy in data:
        if not (vx == 0 and vy == 0):
            result.append((vx, vy, frame_number))
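As a quick sanity check you can just print the result; this sketch assumes Python 3.7+ (where dicts keep insertion order) and the two sample frames shown in the question:
print(result)
# With the sample frames above this prints:
# [(0, -1, 2), (3, 1, 2), (0, 1, 3), (5, -3, 3)]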

Related

outputting python/numpy arrays as columns

I'm very new to python, but have been using it to calculate and filter through data. I'm trying to output my array so I can pass it to other programs, but the output is one solid piece of text, with brackets and commas separating it.
I understand there are ways of manipulating this, but I want to understand why my code has output it in this format, and how to make it output it in nice columns instead.
The array was generated with:
#!/usr/bin/env python
import numpy as np
import networkx
import gridData
from scipy.spatial.distance import euclidean

INPUT1 = open("test_area.xvg",'r')
INPUT2 = open("test_atom.xvg",'r')
OUTPUT1 = open("negdist.txt",'w')

area = []
pointneg = []
posneg = []
negdistance = []
negresarea = []

while True:
    line = INPUT1.readline()
    if not line:
        break
    col = line.split()
    if col:
        area.append(((col[0]), float(col[1])))

pointneg.append((-65.097000, 5.079000, -9.843000))

while True:
    line = INPUT2.readline()
    if not line:
        break
    col = line.split()
    if col:
        pointneg.append((float(col[5]), float(col[6]), float(col[7])))
        posneg.append((col[4]))

for col in posneg:
    negresarea.append(area[int(col)-1][1])

a = len(pointneg)
for x in xrange(a-1):
    negdistance.append((-1, (negresarea[x]), euclidean((pointneg[0]), (pointneg[x]))))

print >> OUTPUT1, negdistance
example output:
[(-1, 1.22333, 0.0), (-1, 1.24223, 153.4651968428021), (-1, 1.48462, 148.59335545709976), (-1, 1.39778, 86.143305392816202), (-1, 0.932278, 47.914688322058403), (-1, 1.04997, 28.622555546282022),
desired output:
[-1, 1.22333, 0.0
-1, 1.24223, 153.4651968428021
-1, 1.48462, 148.59335545709976
-1, 1.39778, 86.143305392816202
-1, 0.932278, 47.914688322058403
-1, 1.04997, 28.622555546282022...
Example inputs:
example input1
1 2.12371 0
2 1.05275 0
3 0.865794 0
4 0.933986 0
5 1.09092 0
6 1.22333 0
7 1.54639 0
8 1.24223 0
9 1.10928 0
10 1.16232 0
11 0.60942 0
12 1.40117 0
13 1.58521 0
14 1.00011 0
15 1.18881 0
16 1.68442 0
17 0.866275 0
18 1.79196 0
19 1.4375 0
20 1.198 0
21 1.01645 0
22 1.82221 0
23 1.99409 0
24 1.0728 0
25 0.679654 0
26 1.15578 0
27 1.28326 0
28 1.00451 0
29 1.48462 0
30 1.33399 0
31 1.13697 0
32 1.27483 0
33 1.18738 0
34 1.08141 0
35 1.15163 0
36 0.93699 0
37 0.940171 0
38 1.92887 0
39 1.35721 0
40 1.85447 0
41 1.39778 0
42 1.97309 0
Example Input2
ATOM 35 CA GLU 6 56.838 -5.202 -102.459 1.00273.53 C
ATOM 55 CA GLU 8 54.729 -6.650 -96.930 1.00262.73 C
ATOM 225 CA GLU 29 5.407 -2.199 -58.801 1.00238.62 C
ATOM 321 CA GLU 41 -24.633 -0.327 -34.928 1.00321.69 C
The problem is the multiple parentheses when you append: you are appending tuples.
What you want is to append lists, i.e. the ones with square brackets.
import numpy as np

area = []
with open('example2.txt') as filehandle:
    for line in filehandle:
        if line.strip() == '':
            continue
        line = line.strip().split(',')
        area.append([int(line[0]), float(line[1]), float(line[2])])
area = np.array(area)
print(area)
'example2.txt' is the data you provided made into a csv
I didn't really get an answer that enabled me to understand the problem; the one suggested above just prevented the whole code from working properly. I did find a workaround by including the print command in the loop that defines my final output.
for x in xrange(a-1):
    negdistance.append((-1, (negresarea[x]), euclidean((pointneg[0]), (pointneg[x]))))
    print negdistance
    negdistance = []
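For what it's worth, the bracket-and-comma look comes from printing the whole Python list at once; writing each tuple with a format string gives plain columns instead. A minimal sketch, assuming negdistance is still the full list of 3-tuples built by the original loop (the loop variable names here are just illustrative):
with open("negdist.txt", "w") as out:
    for flag, area_val, dist in negdistance:
        # one whitespace-separated row per (flag, area, distance) tuple
        out.write("{} {} {}\n".format(flag, area_val, dist))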

Pandas dataframe - pairing off rows within a bucket

I have a dataframe that looks like this:
bucket type v
0 1 X 14
1 1 X 10
2 1 Y 11
3 1 X 15
4 2 X 16
5 2 Y 9
6 2 Y 10
7 3 Y 20
8 3 X 18
9 3 Y 15
10 3 X 14
The desired output looks like this:
bucket type v v_paired
0 1 X 14 nan (no Y coming before it)
1 1 X 10 nan (no Y coming before it)
2 1 Y 11 14 (highest X in bucket 1 before this row)
3 1 X 15 11 (lowest Y in bucket 1 before this row)
4 2 X 16 nan (no Y coming before it in the same bucket)
5 2 Y 9 16 (highest X in same bucket coming before)
6 2 Y 10 16 (highest X in same bucket coming before)
7 3 Y 20 nan (no X coming before it in the same bucket)
8 3 X 18 20 (single Y coming before it in same bucket)
9 3 Y 15 18 (single X coming before it in same bucket)
10 3 X 14 15 (smallest Y coming before it in same bucket)
The goal is to construct the v_paired column, and the rules are as follows:
Look for rows in the same bucket, coming before this one, that have opposite type(X vs Y), call these 'pair candidates'
If the current row is X, choose the min. v out of the pair candidates to become v_paired for the current row, if the current row is Y, choose the max. v out of the pair candidates to be the v_paired for the current row
Thanks in advance.
I believe this should be done in a sequential manner...
First, group by bucket:
groups = df.groupby('bucket', group_keys=False)
This function will be applied to each bucket group:
def func(group):
    y_value = None
    x_value = None
    result = []
    for _, row in group.iterrows():
        value_type, value = row['type'], row['v']
        if value_type == 'X':
            x_value = max(filter(None, (x_value, value)))
            # an X row pairs with the smallest Y seen so far in this bucket
            result.append(y_value)
        elif value_type == 'Y':
            y_value = min(filter(None, (y_value, value)))
            # a Y row pairs with the largest X seen so far in this bucket
            result.append(x_value)
    # return a Series aligned on the group's original index so the
    # assignment below puts each value on the right row
    return pd.Series(result, index=group.index)

df['v_paired'] = groups.apply(func)
Hopefully this will do the job.
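A minimal usage sketch, re-creating the example frame from the question and applying func as defined above (this assumes a pandas version where a groupby apply returning an index-aligned Series can be assigned straight back to the frame):
import pandas as pd

df = pd.DataFrame({
    'bucket': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'type':   ['X', 'X', 'Y', 'X', 'X', 'Y', 'Y', 'Y', 'X', 'Y', 'X'],
    'v':      [14, 10, 11, 15, 16, 9, 10, 20, 18, 15, 14],
})

groups = df.groupby('bucket', group_keys=False)
df['v_paired'] = groups.apply(func)
print(df)
# v_paired comes out as None, None, 14, 11, None, 16, 16, None, 20, 18, 15,
# matching the desired output in the question (None standing in for nan).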

Fraction of values in (x, y) space

I have a data frame that looks like this, but with several hundred thousand rows:
df
D x y
0 y 5.887672 6.284714
1 y 9.038657 10.972742
2 n 2.820448 6.954992
3 y 5.319575 15.475197
4 n 1.647302 7.941926
5 n 5.825357 13.747091
6 n 5.937630 6.435687
7 y 7.789661 11.868023
8 n 2.669362 11.300062
9 y 1.153347 17.625158
I want to know what proportion of values ("D") in each x:y grid space is "n".
I can do it by brute force, by stepping through x and y and calculating the percentage:
zonexy = {}
for x in np.arange(0, 10, 2.5):
    dfx = df[(df['x'] >= x) & (df['x'] < x+2.5)]
    zonexy[x] = {}
    for y in np.arange(0, 24, 6):
        dfy = dfx[(dfx['y'] >= y) & (dfx['y'] < y+6)]
        try:
            pctn = len(dfy[dfy['D'] == 'n']) / len(dfy) * 100.0
        except ZeroDivisionError:
            pctn = 0
        zonexy[x][y] = pctn
Output:
pd.DataFrame(zonexy)
0.0 2.5 5.0 7.5
0 0 0 0 0
6 100 100 50 0
12 0 0 50 0
18 0 0 0 0
But this, and all the variations on this theme that I've tried, is very slow. It seems like there should be a much more efficient way (probably via numpy), but I'm blanking on it.
One way would be to use the 2D histogram function of numpy:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram2d.html
Then:
Run it once on the data where the criterion is matched (here, where "D" is "n").
Run it again on all of the data.
Divide the first result, element by element, by the second result.
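For example, a sketch of those three steps, assuming df is the frame shown in the question and the same bin edges as the brute-force loop above (x in steps of 2.5 up to 10, y in steps of 6 up to 24):
import numpy as np
import pandas as pd

x_edges = np.arange(0, 12.5, 2.5)   # [0, 2.5, 5, 7.5, 10]
y_edges = np.arange(0, 30, 6)       # [0, 6, 12, 18, 24]

# 1. Histogram of only the rows where D == 'n'
n_rows = df[df['D'] == 'n']
h_n, _, _ = np.histogram2d(n_rows['x'], n_rows['y'], bins=[x_edges, y_edges])

# 2. Histogram of all rows
h_all, _, _ = np.histogram2d(df['x'], df['y'], bins=[x_edges, y_edges])

# 3. Element-by-element percentage; empty cells become 0 instead of NaN
pct_n = np.divide(100.0 * h_n, h_all, out=np.zeros_like(h_n), where=h_all > 0)

# Same orientation as the brute-force output: rows are y bins, columns are x bins
print(pd.DataFrame(pct_n.T, index=y_edges[:-1], columns=x_edges[:-1]))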

Cumulative style calculations of entries in a data table using class initialisation in Python

I am trying to determine the optimum value of Z in a data table using Python. The optimum Z occurs when the difference between consecutive Y values is greater than 10. In my code I am assigning the elements of each entry to a class. In order to determine the optimum I therefore need to access the previously calculated value of Y and subtract it from the new value. This all seems very cumbersome to me, so if you know of a better way to perform these kinds of calculations please let me know. My sample data table is:
X Y Z
1 5 10
2 3 20
3 4 30
4 6 40
5 12 50
6 12 60
7 34 70
8 5 80
My code so far is:
class values:
    def __init__(self, X, Y, Z):
        self.X = X
        self.Y = Y
        self.Z = Z
        #Diff = Y2 - Y1
        #if Diff > 10:
        #    optimum = Z
        #else:
        #    pass
        #optimum

valueLst = []
f = open('sample.txt','r')
for i in f:
    X = i.split('\t')[0]
    Y = i.split('\t')[1]
    Z = i.split('\t')[2]
    x = values(X,Y,Z)
    valueLst.append(x)
An example of the operation I would like to achieve is shown in the following table. The difference in Y values is calculated in the third column; I would like to return the value of Z where the difference is 22, i.e. the Z value of 70.
1 2 10
2 3 1 20
3 4 1 30
4 6 2 40
5 12 6 50
6 12 0 60
7 34 22 70
8 35 1 80
Any help would be much appreciated.
A class seems like overkill for this. Why not a list of (x, y, z) tuples?
valueLst = []
for i in f:
    valueLst.append(tuple(map(int, i.split('\t'))))
You can then determine the differences between the y values and get the last item z from the 3-tuple corresponding to the largest delta-y:
yDiffs = [0] + list(valueLst[i][1] - valueLst[i-1][1]
                    for i in range(1, len(valueLst)))
bestZVal = valueLst[yDiffs.index(max(yDiffs))][2]
To start, you can put the columns into list data structures:
f = open('sample.txt','r')
x, y, z = [], [], []
for i in f:
    ix, iy, iz = map(int, i.split('\t'))  # map converts each string to an integer
    x.append(ix)
    y.append(iy)
    z.append(iz)
When you have data structures, you can use them together to get other data structures you want.
Then you can get each difference starting from the second y:
differences = [y[i] - y[i-1] for i in range(1, len(y))]
What you want is the z at the index just past the max of the differences (since differences[i-1] is y[i] - y[i-1]), so:
maxIndex = differences.index(max(differences))
answer = z[maxIndex + 1]
Skipping the building of tuples, and working directly with the y and z lists from above:
from itertools import islice, izip  # on Python 3, use the built-in zip instead of izip

diffs = [curr - prev for curr, prev in izip(islice(y, 1, None), islice(y, len(y)-1))]
max_diff = max(diffs)
Z = z[diffs.index(max_diff) + 1]
Given a file with this content:
1 5 10
2 3 20
3 4 30
4 6 40
5 12 50
6 12 60
7 34 70
8 5 80
You can read the file and convert to a list of tuples like so:
data = []
with open('value_list.txt') as f:
    for line in f:
        x, y, z = map(int, line.split())
        data.append((x, y, z))
print(data)
Prints:
[(1, 5, 10), (2, 3, 20), (3, 4, 30), (4, 6, 40), (5, 12, 50), (6, 12, 60), (7, 34, 70), (8, 5, 80)]
Then you can use that data to find tuples that meet your criteria using a list comprehension. In this case y-previous y>10:
tgt=10
print([data[i][2] for i in range(1,len(data)) if data[i][1]-data[i-1][1]>tgt])
[70]

Count number of 0s from [1,2,....num]

We are given a large number 'num', which can have up to 10^4 digits (num <= 10^(10000)). We need to count the zeroes in the decimal representations of the numbers from 1 up to 'num'.
eg:
countZeros('9') = 0
countZeros('100') = 11
countZeros('219') = 41
The only way I could think of is brute force, which obviously is too slow for large inputs.
I found the following Python code in this link, which does the required count in O(L), L being the length of 'num'.
def CountZeros(num):
    Z = 0
    N = 0
    F = 0
    for j in xrange(len(num)):
        F = 10*F + N - Z*(9 - int(num[j]))
        if num[j] == '0':
            Z += 1
        N = 10*N + int(num[j])
    return F
I can't understand the logic behind it. Any kind of help will be appreciated.
from 0 - 9 : 0 zeros
from 10 - 99: 9 zeros ( 10, 20, ... 90)
--100-199 explained-----------------------
100, 101, ..., 109 : 11 zeros (two in 100)
110, 120, ..., 199: 9 zeros (this is just the same as 10-99) This is important
Total: 20
------------------------------------------
100 - 999: 20 * 9 = 180
total up to 999 is: 180 + 9: 189
CountZeros('999') -> 189
Continue this pattern and you might start to see the overall pattern, and eventually the algorithm.
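To convince yourself of both the pattern and the expected values in the question, a quick brute-force check (only feasible for small numbers, unlike the O(L) routine above) could look like this sketch:
# Naive count of zero digits in 1..num, for cross-checking small cases.
def count_zeros_naive(num):
    return sum(str(i).count('0') for i in range(1, num + 1))

for n in (9, 100, 219, 999):
    print(n, count_zeros_naive(n))
# -> 0, 11, 41 and 189 respectively, matching the question and the tally above.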
Does the following help your understanding:
>>> for i in range(10, 100, 10):
... print(CountZeros(str(i)))
...
1
2
3
4
5
6
7
8
9
>>>
What about this trace, which prints j, Z, N and F at the start and end of each loop iteration:
>>> CountZeros("30")
j Z N  F
0 0 0  0
0 0 3  0
1 0 3  0
1 1 30 3
3
