Resampling (upsampling, interpolating) a series of numbers - python

I have a comma separated series of integer values that I'd like to resample so that I have twice as many, where a new value is added half way between each of the existing values. For example, if this is my source:
1,5,11,9,13,21
the result would be:
1,3,5,8,11,10,9,11,13,17,21
In case that's not clear, I'm trying to add a number between each of the values in my source series, like this:
1 5 11 9 13 21
1 3 5 8 11 10 9 11 13 17 21
I've searched quite a bit and it seems that something like scipy.signal.resample or pandas should work, but I'm completely new at this and I haven't been able to get it working. For example, here's one of my attempts with scipy:
import numpy as np
from scipy import signal
InputFileName = "sample.raw"
DATA250 = np.loadtxt(InputFileName, delimiter=',', dtype=int)
print(DATA250)
DATA500 = signal.resample(DATA250, 11)
print(DATA500)
Which outputs:
[ 1 5 11 9 13 21]
[ 1. -0.28829461 6.12324489 10.43251996 10.9108191 9.84503237
8.40293529 10.7641676 18.44182898 21.68506897 12.68267746]
Obviously I'm using signal.resample incorrectly. Is there a way I can do this with signal.resample or pandas? Should I be using some other method?
Also, in my example all of the source numbers have an integer halfway in between. In my actual data, that won't be the case. So if two of the numbers are 10 and 15, the new number would be 12.5. However, I'd like all of the resulting numbers to be integers, so the new number that gets inserted would need to be either 12 or 13 (it doesn't matter to me which it is).
Note that once I get this working, the source file will actually be a comma separated list of 2,000 numbers and the output should be 4,000 numbers (or technically 3,999, since there won't be one added to the end). Also, this is going to be used to process something similar to an ECG recording. Currently the ECG is sampled at 250 Hz for 8 seconds, which is then passed to a separate process to analyze the recording. However, that separate process needs the recording to be sampled at 500 Hz. So the workflow will be that I'll take a 250 Hz recording every 8 seconds, upsample it to 500 Hz, then pass the resulting output to the analysis process.
Thanks for any guidance you can provide.

Since the interpolation is simple, you can do it by hand:
import numpy as np
a = np.array([1,5,11,9,13,21])
b = np.zeros(2*len(a)-1, dtype=np.uint32)
b[0::2] = a
b[1::2] = (a[:-1] + a[1:]) // 2
You can also use scipy.signal.resample this way:
import numpy as np
from scipy import signal
a = np.array([1,5,11,9,13,21])
b = signal.resample(a, len(a) * 2)
b_int = b.astype(int)
The trick is to have exactly twice the number of elements, so that odd points match your initial points. Also I think that the Fourier interpolation done by scipy.signal.resample is better for your ECG signal than the linear interpolation you're asking for.
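For example, here is a quick sanity-check sketch (using the same toy array) that rounds the resampled values to integers and confirms that every second point lines up with the original samples, up to small floating-point error:
import numpy as np
from scipy import signal

a = np.array([1, 5, 11, 9, 13, 21])

b = signal.resample(a, len(a) * 2)
b_int = np.rint(b).astype(int)   # round to nearest integer rather than truncating

print(b_int[0::2])   # every second point should match a
print(b_int[:-1])    # drop the last point if you want exactly 2*n - 1 values
Note that the Fourier-interpolated in-between values can overshoot their neighbours; that is expected and usually fine for a signal like an ECG.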

Although I probably would just use NumPy here, pretty similar to J. Martinot-Lagarde's answer, you don't actually have to.
First, you can read a single row of comma-separated numbers with just the csv module:
import csv

with open(path) as f:
    numbers = map(int, next(csv.reader(f)))
… or just string operations:
with open(path) as f:
    numbers = map(int, next(f).split(','))
And then you can interpolate that easily:
def interpolate(numbers):
    last = None
    for number in numbers:
        if last is not None:
            yield (last + number) // 2
        yield number
        last = number
If you want it to be fully general and reusable, just take a function argument and yield function(last, number), and replace None with sentinel = object().
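For example, a minimal sketch of that generalization (interpolate_with is just a name I'm making up here):
def interpolate_with(function, numbers):
    # yields function(previous, current) between each pair of neighbouring values
    sentinel = object()
    last = sentinel
    for number in numbers:
        if last is not sentinel:
            yield function(last, number)
        yield number
        last = number

# the original behaviour is then interpolate_with(lambda a, b: (a + b) // 2, numbers)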
And now, all you need to do is join the results and write them:
with open(outpath, 'w') as f:
    f.write(','.join(map(str, interpolate(numbers))))
Are there any advantages to this solution? Well, other than the read/split and join/write, it's purely lazy. And we can write lazy split and join functions pretty easily (or just do it manually). So if you ever had to deal with a billion comma-separated numbers instead of a thousand, that's all you'd have to change.
Here's a lazy split:
def isplit(s, sep):
    start = 0
    while True:
        nextpos = s.find(sep, start)
        if nextpos == -1:
            yield s[start:]
            return
        yield s[start:nextpos]
        start = nextpos + 1
And you can use an mmap as a lazily-read string (well, bytes, but our data are pure ASCII, so that's fine):
import mmap

with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        numbers = map(int, isplit(mm, b','))
And let's use a different solution for lazy writing, just for variety:
def icsvwrite(f, seq, sep=','):
    first = next(seq, None)
    if not first:
        return
    f.write(first)
    for value in seq:
        f.write(sep)
        f.write(value)
So, putting it all together:
with open(inpath, 'rb') as inf, open(outpath, 'w') as outf:
    with mmap.mmap(inf.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        numbers = map(int, isplit(mm, b','))
        icsvwrite(outf, map(str, interpolate(numbers)))
But, even though I was able to slap this together pretty quickly, and all of the pieces are nicely reusable, I'd still probably use NumPy for your specific problem. You're not going to read a row of a billion numbers. You already have NumPy installed on the only machine that's ever going to run this script. The cost of importing it every 8 seconds is negligible (and you could avoid even that by having the script sleep between runs). So, it's hard to beat an elegant 3-line solution.
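For reference, a rough end-to-end sketch of that NumPy version (sample.raw comes from the question; the output name is just a placeholder):
import numpy as np

a = np.loadtxt("sample.raw", delimiter=",", dtype=int)
b = np.empty(2 * len(a) - 1, dtype=int)
b[0::2] = a
b[1::2] = (a[:-1] + a[1:]) // 2
np.savetxt("sample500.raw", b[None, :], delimiter=",", fmt="%d")  # writes one comma-separated row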

Since you suggested a pandas solution, here is one possibility:
import pandas as pd
import numpy as np
l = [1,4,11,9,14,21]
n = len(l)
df = (pd.DataFrame(l, columns=["l"])
        .reindex(np.linspace(0, n - 1, 2 * n - 1))
        .interpolate()
        .astype(int))
print(df)
It feels unnecessarily complicated, though. I'm tagging this with pandas so that people more familiar with pandas functionality can see it.
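If you then need the result back as a comma-separated string of integers, a small sketch on top of the DataFrame above:
out = ",".join(str(v) for v in df["l"].tolist())
print(out)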

Related

Going out of memory for python dictionary when the numbers are integer

I have some Python code that is supposed to read large files into a dictionary in memory and do some operations. What puzzles me is that it only goes out of memory in one case: when the values in the file are integers...
The structure of my file is like this:
string value_1 .... value_n
The files I have vary in size from 2G to 40G, and I have 50G of memory to read the file into. When I have something like this:
string 0.001334 0.001473 -0.001277 -0.001093 0.000456 0.001007 0.000314 ...
with n=100 and the number of rows equal to 10M, I can read it into memory relatively fast; the file size is about 10G. However, when I have
string 4 -2 3 1 1 1 ...
with the same dimension (n=100) and the same number of rows, I'm not able to read it into memory.
for line in f:
    tokens = line.strip().split()
    if len(tokens) <= 5:  # ignore w2v first line
        continue
    word = tokens[0]
    number_of_columns = len(tokens) - 1
    features = {}
    for dim, val in enumerate(tokens[1:]):
        val = float(val)
        features[dim] = val
    matrix[word] = features
This results in Killed in the second case, while it works in the first case.
I know this does not answer the question specifically, but it probably offers a better solution to the problem you're trying to solve:
May I suggest you use pandas for this kind of work?
It seems a lot more appropriate for what you're trying to do. http://pandas.pydata.org/index.html
import pandas as pd
pd.read_csv('file.txt', sep=' ', skiprows=1)
then do all your manipulations
Pandas is a package designed specifically for handling and processing large datasets. It has tons of useful features you'll probably end up needing if you're dealing with big data.
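As a rough sketch (assuming the layout from the question: one word followed by 100 space-separated values per row, plus a word2vec-style header line to skip), you can also roughly halve the memory by downcasting to float32:
import numpy as np
import pandas as pd

df = pd.read_csv("file.txt", sep=" ", header=None, skiprows=1)
words = df[0]                              # first column holds the words
values = df.loc[:, 1:].astype(np.float32)  # float32 uses half the memory of float64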

Most Efficient Way of Chunking a Large Iterable in Python for Brute Forcing

I am trying to develop a way to address large parallel tasks for brute-forcing a keyspace. I'd like to come up with a way to pass a worker a value such that, given a chunk size, that value tells the worker what to output.
Simply Speaking:
given a charset (a-z), a max length of 1 (so basically a-z), and a chunk size of 5:
If I send worker 1 the number 0, it will take items 0-4 of the iterator (a, b, c, d, e); if I send worker 2 the number 1, it will take items 5-9, and so on. I have that code basically working:
#!/usr/bin/python
import itertools

maxlen = 5
charset = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
chunksize = 1000
chunkpart = 5

keyspace = itertools.chain.from_iterable(
    (''.join(l) for l in itertools.product(charset, repeat=i))
    for i in range(1, maxlen + 1))

for s in itertools.islice(keyspace, chunksize * chunkpart, chunksize * (chunkpart + 1)):
    print s
OK, that works great: if I send chunkpart 5 to worker 1, it will do what it needs to do on that chunk.
The issue comes into play when I need to get a small chunk (1,000 records) that is far into a large set. Let's say maxlen was 10 and chunkpart was 50,000,000. It takes a LONG time for Python to get to that point.
So, I think I know WHY this happens: it needs to do quite a bit of work to figure out where to get to in the iterator. What I am wondering is: is there a better way to do this that shortcuts something? My gut tells me itertools has the answer; my brain says I need to understand itertools better.
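One possible shortcut, sketched below with a made-up helper name (kth_string), is to compute the k-th candidate directly by treating its index as a number written in base len(charset), instead of islicing from the start:
def kth_string(k, charset, maxlen):
    # return the k-th string (0-based) in the same order as chaining
    # itertools.product over lengths 1..maxlen, without iterating from the start
    base = len(charset)
    for length in range(1, maxlen + 1):
        count = base ** length
        if k < count:
            digits = []
            for _ in range(length):
                k, r = divmod(k, base)
                digits.append(charset[r])
            return ''.join(reversed(digits))
        k -= count
    raise IndexError("k is past the end of the keyspace")
A worker given chunkpart could then produce kth_string(k, charset, maxlen) for each k in its 1,000-index range, at a cost that does not depend on how far into the keyspace that range sits.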

Efficiency with very large numpy arrays

I'm working with some very large arrays. One issue I'm dealing with, of course, is running out of RAM, but even before that my code runs so slowly that, even with infinite RAM, it would still take way too long. I'll give a bit of my code to show what I'm trying to do:
#samplez is a 3 million element 1-D array
#zfit is a 10,000 x 500 2-D array
b = np.arange(len(zfit))
for x in samplez:
    a = x - zfit
    mask = np.ma.masked_array(a)
    mask[a <= 0] = np.ma.masked
    index = mask.argmin(axis=1)
    # These past 4 lines give me an index array of the smallest positive number
    # in x - zfit
    d = zfit[b, index]
    e = zfit[b, index + 1]
    f = (x - d) / (e - d)
    # f is the calculation I am after
    if x == samplez[0]:
        g = f
        index_stack = index
    else:
        g = np.vstack((g, f))
        index_stack = np.vstack((index_stack, index))
I need to use g and index_stack, each of which is a 3 million x 10,000 2-D array, in a further calculation. Each iteration of this loop takes almost 1 second, so that's 3 million seconds total, which is way too long.
Is there anything I can do so that this calculation runs much faster? I've tried to think how I could do without this for loop, but the only way I can imagine is making 3 million copies of zfit, which is unfeasible.
And is there some way I can work with these arrays without keeping everything in RAM? I'm a beginner, and everything I've found while searching about this is either irrelevant or something I can't understand. Thanks in advance.
It is good to know that the smallest positive number will never show up at the end of a row.
In samplez there are 1 million unique values, but in zfit each row can have at most 500 unique values. The entire zfit can have as many as 50 million unique values. The algorithm can be greatly sped up if the number of 'find the smallest positive number > each_element_in_samplez' calculations can be greatly reduced. Doing all 5e13 comparisons is probably overkill, and careful planning will get rid of a large proportion of them. That will depend heavily on your actual underlying mathematics.
Without knowing it, there are still some small things that can be done. First, there are not that many possible values of (e-d), so that can be taken out of the loop. Second, the loop can be eliminated by map. These two small fixes result in about a 22% speed-up on my machine.
def function_map(samplez, zfit):
    diff = zfit[:, :-1] - zfit[:, 1:]
    def _fuc1(x):
        a = x - zfit
        mask = np.ma.masked_array(a)
        mask[a <= 0] = np.ma.masked
        index = mask.argmin(axis=1)
        d = zfit[:, index]
        f = (x - d) / diff[:, index]  # constraint: smallest value never at the very end
        return (index, f)
    result = map(_fuc1, samplez)
    return (np.array([item[1] for item in result]),
            np.array([item[0] for item in result]))
Next: masked_array can be avoided completely (which should bring significant improvement). samplez needs to be sorted as well.
>>> x1 = arange(50)
>>> x2 = random.random(size=(20, 10)) * 120
>>> x2 = sort(x2, axis=1)  # just to make sure the last elements of each col > largest val in x1
>>> x3 = x2 * 1
>>> f1 = lambda: function_map2(x1, x3)
>>> f0 = lambda: function_map(x1, x2)
>>> def function_map2(samplez, zfit):
        _diff = diff(zfit, axis=1)
        _zfit = zfit * 1
        def _fuc1(x):
            _zfit[_zfit < x] = inf
            index = nanargmin(_zfit, axis=1)
            d = zfit[:, index]
            f = (x - d) / _diff[:, index]  # constraint: smallest value never at the very end
            return (index, f)
        result = map(_fuc1, samplez)
        return (np.array([item[1] for item in result]),
                np.array([item[0] for item in result]))
>>> import timeit
>>> t1=timeit.Timer('f1()', 'from __main__ import f1')
>>> t0=timeit.Timer('f0()', 'from __main__ import f0')
>>> t0.timeit(5)
0.09083795547485352
>>> t1.timeit(5)
0.05301499366760254
>>> t0.timeit(50)
0.8838210105895996
>>> t1.timeit(50)
0.5063929557800293
>>> t0.timeit(500)
8.900799036026001
>>> t1.timeit(500)
4.614129018783569
So, that is another 50% speed-up.
masked_array is avoided, and that saves some RAM. I can't think of anything else to reduce RAM usage; it may be necessary to process samplez in parts. Also, depending on the data and the required precision, using float16 or float32 instead of the default float64 can save you a lot of RAM.
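For instance, with stand-in data of the same 10,000 x 500 shape as zfit, just to illustrate the saving:
import numpy as np

zfit = np.sort(np.random.random((10000, 500)) * 120, axis=1)  # stand-in data
print(zfit.nbytes / 1e6)    # ~40 MB as float64
zfit32 = zfit.astype(np.float32)
print(zfit32.nbytes / 1e6)  # ~20 MB, half the memory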

Project Euler #13 in Python, trying to find smart solution

I'm trying to solve problem 13 from Project Euler, and I'm trying to make the solution beautiful (at least, not ugly). The only "ugly" thing I do is pre-format the input and keep it in the solution file (due to some technical reasons, and because I want to concentrate on the numeric part of the problem).
The problem is "Work out the first ten digits of the sum of the following one-hundred 50-digit numbers."
I wrote some code that should work, as far as I know, but it gives the wrong result. I've checked the input several times, and it seems to be OK...
nums=[37107287533902102798797998220837590246510135740250,
46376937677490009712648124896970078050417018260538,
74324986199524741059474233309513058123726617309629,
91942213363574161572522430563301811072406154908250,
23067588207539346171171980310421047513778063246676,
89261670696623633820136378418383684178734361726757,
28112879812849979408065481931592621691275889832738,
44274228917432520321923589422876796487670272189318,
47451445736001306439091167216856844588711603153276,
70386486105843025439939619828917593665686757934951,
62176457141856560629502157223196586755079324193331,
64906352462741904929101432445813822663347944758178,
92575867718337217661963751590579239728245598838407,
58203565325359399008402633568948830189458628227828,
80181199384826282014278194139940567587151170094390,
35398664372827112653829987240784473053190104293586,
86515506006295864861532075273371959191420517255829,
71693888707715466499115593487603532921714970056938,
54370070576826684624621495650076471787294438377604,
53282654108756828443191190634694037855217779295145,
36123272525000296071075082563815656710885258350721,
45876576172410976447339110607218265236877223636045,
17423706905851860660448207621209813287860733969412,
81142660418086830619328460811191061556940512689692,
51934325451728388641918047049293215058642563049483,
62467221648435076201727918039944693004732956340691,
15732444386908125794514089057706229429197107928209,
55037687525678773091862540744969844508330393682126,
18336384825330154686196124348767681297534375946515,
80386287592878490201521685554828717201219257766954,
78182833757993103614740356856449095527097864797581,
16726320100436897842553539920931837441497806860984,
48403098129077791799088218795327364475675590848030,
87086987551392711854517078544161852424320693150332,
59959406895756536782107074926966537676326235447210,
69793950679652694742597709739166693763042633987085,
41052684708299085211399427365734116182760315001271,
65378607361501080857009149939512557028198746004375,
35829035317434717326932123578154982629742552737307,
94953759765105305946966067683156574377167401875275,
88902802571733229619176668713819931811048770190271,
25267680276078003013678680992525463401061632866526,
36270218540497705585629946580636237993140746255962,
24074486908231174977792365466257246923322810917141,
91430288197103288597806669760892938638285025333403,
34413065578016127815921815005561868836468420090470,
23053081172816430487623791969842487255036638784583,
11487696932154902810424020138335124462181441773470,
63783299490636259666498587618221225225512486764533,
67720186971698544312419572409913959008952310058822,
95548255300263520781532296796249481641953868218774,
76085327132285723110424803456124867697064507995236,
37774242535411291684276865538926205024910326572967,
23701913275725675285653248258265463092207058596522,
29798860272258331913126375147341994889534765745501,
18495701454879288984856827726077713721403798879715,
38298203783031473527721580348144513491373226651381,
34829543829199918180278916522431027392251122869539,
40957953066405232632538044100059654939159879593635,
29746152185502371307642255121183693803580388584903,
41698116222072977186158236678424689157993532961922,
62467957194401269043877107275048102390895523597457,
23189706772547915061505504953922979530901129967519,
86188088225875314529584099251203829009407770775672,
11306739708304724483816533873502340845647058077308,
82959174767140363198008187129011875491310547126581,
97623331044818386269515456334926366572897563400500,
42846280183517070527831839425882145521227251250327,
55121603546981200581762165212827652751691296897789,
32238195734329339946437501907836945765883352399886,
75506164965184775180738168837861091527357929701337,
62177842752192623401942399639168044983993173312731,
32924185707147349566916674687634660915035914677504,
99518671430235219628894890102423325116913619626622,
73267460800591547471830798392868535206946944540724,
76841822524674417161514036427982273348055556214818,
97142617910342598647204516893989422179826088076852,
87783646182799346313767754307809363333018982642090,
10848802521674670883215120185883543223812876952786,
71329612474782464538636993009049310363619763878039,
62184073572399794223406235393808339651327408011116,
66627891981488087797941876876144230030984490851411,
60661826293682836764744779239180335110989069790714,
85786944089552990653640447425576083659976645795096,
66024396409905389607120198219976047599490197230297,
64913982680032973156037120041377903785566085089252,
16730939319872750275468906903707539413042652315011,
94809377245048795150954100921645863754710598436791,
78639167021187492431995700641917969777599028300699,
15368713711936614952811305876380278410754449733078,
40789923115535562561142322423255033685442488917353,
44889911501440648020369068063960672322193204149535,
41503128880339536053299340368006977710650566631954,
81234880673210146739058568557934581403627822703280,
82616570773948327592232845941706525094512325230608,
22918802058777319719839450180888072429661980811197,
77158542502016545090413245809786882778948721859617,
72107838435069186155435662884062257473692284509516,
20849603980134001723930671666823555245252804609722,
53503534226472524250874054075591789781264330331690]
result_sum = []
tmp_sum = 0
for j in xrange(50):
    for i in xrange(100):
        tmp_sum += nums[i] % 10
        nums[i] = nums[i] / 10
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum = tmp_sum / 10
for i in xrange(10):
    print result_sum[i]
Your code works by adding all the numbers in nums like a person would: adding column by column. Your code does not work because when you are summing the far left column, you treat it like every other column. Whenever people get to the far left, they write down the entire sum. So this line
result_sum.insert(0,int(tmp_sum % 10))
doesn't work for the far left column; you need to insert something else into result_sum in that case. I would post the code, but 1) I'm sure you don't need it, and 2) it's against the Project Euler tag rules. If you would like, I can email it to you, but I'm sure that won't be necessary.
You could save the numbers in a file (with a number on each line), and read from it:
nums = []
with open('numbers.txt', 'r') as f:
    for num in f:
        nums.append(int(num))
# nums is now populated with all of the numbers, so do your actual algorithm
Also, it looks like you want to store the sum as an array of digits. The cool thing about Python is that it automatically handles large integers. Here is a quote from the docs:
Plain integers (also just called integers) are implemented using long in C, which gives them at least 32 bits of precision (sys.maxint is always set to the maximum plain integer value for the current platform, the minimum value is -sys.maxint - 1). Long integers have unlimited precision.
So using an array of digits isn't really necessary if you are working with Python. In C, it is another story...
Also, regarding your code, you need to factor in the digits in tmp_sum, which contains your carry-over digits. You can add them into result_sum like this:
while tmp_sum:
    result_sum.insert(0, int(tmp_sum % 10))
    tmp_sum /= 10
This will fix your issue.
Since you already have all the numbers in a list, you should be able to take the sum of them pretty easily. Then you just need to take the first ten digits of the sum. I won't put any code here, though.
As simple as this (values.txt contains all the numbers, one per line):
nums = []
with open("values.txt", 'r') as f:
    for num in f:
        nums.append(int(num))
print(str(sum(nums))[:10])
Just as easy is storing the numbers in a CSV file and using pandas:
def foo():
    import pandas as pd
    table = pd.read_csv("data.txt", header=None, usecols=[0])
and then iterate through the pandas DataFrame:
    total = 0
    for x in range(len(table)):
        total += int(table[0][x])
    return str(total)[:10]
Just keep in mind that Python handles the large numbers for you.

Fast way to read interleaved data?

I've got a file containing several channels of data. The file is sampled at a base rate, and each channel is sampled at that base rate divided by some number -- it seems to always be a power of 2, though I don't think that's important.
So, if I have channels a, b, and c, sampled at dividers of 1, 2, and 4, my stream will look like:
a0 b0 c0 a1 a2 b1 a3 a4 b2 c1 a5 ...
For added fun, the channels can independently be floats or ints (though I know which for each one), and the data stream does not necessarily end on a power of 2: the example stream would be valid without further extension. The values are sometimes big-endian and sometimes little-endian, though I know what I'm dealing with up front.
I've got code that properly unpacks these and fills numpy arrays with the correct values, but it's slow: it looks something like (hope I'm not glossing over too much; just giving an idea of the algorithm):
for sample_num in range(total_samples):
    channels_to_sample = [ch for ch in all_channels if ch.samples_for(sample_num)]
    format_str = ...  # build format string from channels_to_sample
    data = struct.unpack(format_str, my_file.read( ... ))  # read and unpack the data
    # iterate over the data tuple and put values in channels_to_sample
    for val, ch in zip(data, channels_to_sample):
        ch.data[sample_num / ch.divider] = val
And it's slow -- a few seconds to read a 20MB file on my laptop. Profiler tells me I'm spending a bunch of time in Channel#samples_for() -- which makes sense; there's a bit of conditional logic there.
My brain feels like there's a way to do this in one fell swoop instead of nesting loops -- maybe using indexing tricks to read the bytes I want into each array? The idea of building one massive, insane format string also seems like a questionable road to go down.
Update
Thanks to those who responded. For what it's worth, the numpy indexing trick reduced the time required to read my test data from about 10 second to about 0.2 seconds, for a speedup of 50x.
The best way to really improve the performance is to get rid of the Python loop over all samples and let NumPy do this loop in compiled C code. This is a bit tricky to achieve, but it is possible.
First, you need a bit of preparation. As pointed out by Justin Peel, the pattern in which the samples are arranged repeats after some number of steps. If d_1, ..., d_k are the divisors for your k data streams and b_1, ..., b_k are the sample sizes of the streams in bytes, and lcm is the least common multiple of these divisors, then
N = lcm * (b_1/d_1 + ... + b_k/d_k)
will be the number of bytes which the pattern of streams will repeat after. If you have figured out which stream each of the first N bytes belongs to, you can simply repeat this pattern.
You can now build the array of stream indices for the first N bytes by something similar to
stream_index = []
for sample_num in range(lcm):
stream_index += [i for i, ch in enumerate(all_channels)
if ch.samples_for(sample_num)]
repeat_count = [b[i] for i in stream_index]
stream_index = numpy.array(stream_index).repeat(repeat_count)
Here, d is the sequence d_1, ..., d_k and b is the sequence b_1, ..., b_k.
Now you can do
data = numpy.fromfile(my_file, dtype=numpy.uint8).reshape(-1, N)
streams = [data[:,stream_index == i].ravel() for i in range(k)]
You possibly need to pad the data a bit at the end to make the reshape() work.
Now you have all the bytes belonging to each stream in separate NumPy arrays. You can reinterpret the data by simply assigning to the dtype attribute of each stream. If you want the first stream to be intepreted as big endian integers, simply write
streams[0].dtype = ">i"
This won't change the data in the array in any way, just the way it is interpreted.
This may look a bit cryptic, but should be much better performance-wise.
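To make the idea concrete, here is a toy sketch with three one-byte (uint8) channels at dividers 1, 2 and 4, so the pattern repeats every N = 7 bytes:
import numpy as np

# byte pattern per repeat: tick 0 -> a b c, tick 1 -> a, tick 2 -> a b, tick 3 -> a
stream_index = np.array([0, 1, 2, 0, 0, 1, 0])
N = len(stream_index)

# two repeats of the pattern: a0 b0 c0 a1 a2 b1 a3  a4 b2 c1 a5 a6 b3 a7
raw = np.array([0, 100, 200, 1, 2, 101, 3,
                4, 102, 201, 5, 6, 103, 7], dtype=np.uint8)

data = raw.reshape(-1, N)
streams = [data[:, stream_index == i].ravel() for i in range(3)]
print(streams[0])  # [0 1 2 3 4 5 6 7]  -> channel a
print(streams[1])  # [100 101 102 103]  -> channel b
print(streams[2])  # [200 201]          -> channel c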
Replace channel.samples_for(sample_num) with an iter_channels(channels_config) iterator that keeps some internal state and lets you read the file in one pass. Use it like this:
for (chan, sample_data) in izip(iter_channels(), data):
    decoded_data = chan.decode(sample_data)
To implement the iterator, think of a base clock with a period of one. The periods of the various channels are integers. Iterate the channels in order, and emit a channel if the clock modulo its period is zero.
for i in itertools.count():
    for chan in channels:
        if i % chan.period == 0:
            yield chan
The grouper() recipe along with itertools.izip() should be of some help here.
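A rough sketch of how those pieces could fit together (chan.period, chan.fmt and chan.data are assumed attributes here, not anything from the original code):
import itertools
import struct

def iter_channels(channels):
    # base clock: emit a channel whenever the tick is a multiple of its period
    for tick in itertools.count():
        for chan in channels:
            if tick % chan.period == 0:
                yield chan

def read_stream(f, channels):
    # single-pass reader; chan.fmt is assumed to be a struct format such as '<i' or '>f'
    for chan in iter_channels(channels):
        size = struct.calcsize(chan.fmt)
        raw = f.read(size)
        if len(raw) < size:
            break  # the stream is allowed to end partway through a pattern
        value, = struct.unpack(chan.fmt, raw)
        chan.data.append(value)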
