I'm trying to create a program that will read and analyze .wav files, find the magnitude and frequency with respect to time, store them in an array ([time, magnitude, frequency], i.e. [x, y, z]), and run it through a neural network. My issue is that I need a constant number of values, but the audio clips have differing lengths.
How can I 'average' (in a sense) the data in order to create an array of, for example, 20,000 values out of, say, 22,050, as well as doing the same to an array of 17,500?
Also, would it be best to do this (if possible) with the raw .wav data, or with the magnitude/frequency data?
Edit: For clarification, I'm looking to keep the audio unchanged, so no speeding it up or slowing it down, since I'm using it for a voice recognition program for my voice specifically.
I'm also looking to avoid adding null values to the end of the array, but I might have to resort to this.
I'd suggest pre-processing the audio files in this case. If you are using the Python "wave" library, you can use it to read a file and then write it out with a modified frame rate.
Eg:
...
outfile = wave.open("processed.wav", "wb")
outfile.setframerate(original_rate/factor)
...
If factor is > 1 you will "stretch" the file (slow-play); if it's < 1 you will shorten it (fast-play). You can calculate the factor easily as goal_length/actual_length.
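For completeness, a slightly fuller sketch of the same idea; the file names and the factor value here are illustrative, not taken from the question:

import wave

factor = 20000 / 22050          # goal_length / actual_length (illustrative)

infile = wave.open("original.wav", "rb")
params = infile.getparams()
frames = infile.readframes(infile.getnframes())
infile.close()

outfile = wave.open("processed.wav", "wb")
outfile.setparams(params)
outfile.setframerate(int(params.framerate / factor))  # override the frame rate
outfile.writeframes(frames)
outfile.close()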
I have a data acquisition system which takes measurements for a few minutes and generates a csv file with 10 million rows and 10 columns. Then I import this csv file in Python (csv.reader), perform a bunch of operations on the acquired numeric data (but ‘only’ 10000 rows at a time, otherwise the computer memory would be overwhelmed). In the end, I export the results in another much smaller csv file (csv.writer).
The problem is that the runtime is very long and I want to speed it up. When I open the original csv file with Notepad I see that the numbers have up to 16 digits each, like 0.0015800159870059, 12.0257771094508 etc. I know that the accuracy of the DAQ is 0.1% at best and most of the trailing digits are noise. Is there an elegant way of forcing Python to operate globally with only 7-8 digits from start to finish, to speed up the calculations? I know about error propagation and I’m going to try different settings for the number of digits to see what the optimum is.
Please note that it is not enough for me to build a temporary csv file with 'truncated' data (e.g. containing 0.0015800, 12.0257771, etc.) and simply import that into Python. The calculations in Python should use reduced precision as well. I looked into the decimal module, with no success so far.
with open('datafile', newline='') as DAQfile:
    reader = csv.reader(DAQfile, delimiter=',')
    for row in reader:
        # ... calculate stuff ...
with open('results.csv', 'w', newline='') as myfile:
    mywriter = csv.writer(myfile)
    # ... write stuff ...
Adding some details, based on the comments so far:
The program calculates the peak of the moving average of the 'instantaneous power'. The data in the csv file can be described like this, where 'col' means column, V means voltage and I means current: col1=time, col2=V1, col3=I1, col4=V2, col5=I2 and so on until col11=V10, col12=I10. So each row represents a data sample taken by the DAQ.
The instantaneous power is Pi = V1*I1 + V2*I2 + ... + V10*I10.
To calculate the moving average over 10000 rows at a time, I built a buffer (initialized with Buffer = [0]*10000). This buffer stores the Pi's for 10000 consecutive rows and is updated every time csv.reader moves to the next row. The buffer works exactly like a shift register.
This way the memory usage is insignificant (verified). In summary, the calculations are multiplications, additions, the min(a,b) function (to detect the peak of the moving average), and del/append for refreshing the buffer. The moving average itself is iterative too, something like newavg = oldavg + (newlast - oldfirst)/bufsize.
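For illustration, a small sketch of that shift-register moving average; the names and buffer size are illustrative, not the exact code:

bufsize = 10000
buf = [0.0] * bufsize            # shift register holding the last 10000 Pi values
avg = 0.0

def update(avg, new_p):
    old_p = buf.pop(0)           # drop the oldest instantaneous power (the "del")
    buf.append(new_p)            # append the newest one
    return avg + (new_p - old_p) / bufsize   # newavg = oldavg + (newlast - oldfirst)/bufsize

# inside the csv loop: avg = update(avg, Pi), then track the extreme value of avg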
My thinking is that it does not make any sense to let Python work with all those decimals when I know that most of the trailing figures are garbage.
Forgot to mention that the size of the csv file coming from the DAQ is just under 1 GB.
Yes, there is a way: use NumPy. First, there are tons of vector operations which can be performed with one command; for example
a = b + c
will efficiently sum two vectors.
Second, and this is the answer to your question, you can specify a 4-byte float type (numpy.float32), greatly reducing memory requirements and increasing speed.
You should read your file directly using
import numpy as np
data = np.genfromtxt('datafile.csv', dtype=np.float32, delimiter=',')
...
data will be made up of standard 32-bit floats, with circa 7 digits of precision.
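With the data in a float32 array, the instantaneous-power sum described in the question becomes a single vectorized expression; a sketch, assuming the column layout col1=time, col2=V1, col3=I1, and so on:

V = data[:, 1::2]            # the V columns (V1, V2, ...)
I = data[:, 2::2]            # the I columns (I1, I2, ...)
Pi = (V * I).sum(axis=1)     # instantaneous power for every row at once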
The CSV file can also be read in parts/chunks:
numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None,
skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None,
usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_',
autostrip=False, case_sensitive=True, defaultfmt='f%i',
unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None,
encoding='bytes')
That is the full list of parameters. If max_rows is set to, say, 10, only 10 rows will be read; the default is to read the whole file. You can read anything in the middle of the file by skipping some initial records via the skip_header option.
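A rough sketch of chunked reading (the row count and chunk size are assumptions based on the question; note that each call re-scans the skipped lines, so it is not free):

import numpy as np

n_rows, chunk_size = 10_000_000, 10_000
for start in range(0, n_rows, chunk_size):
    chunk = np.genfromtxt('datafile.csv', dtype=np.float32, delimiter=',',
                          skip_header=start, max_rows=chunk_size)
    # ... process the 10000-row chunk here (power, moving average, peak) ...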
To build on DyZ's comment: if there is a way to speed up the calculations (e.g. using << or >> for multiplications or divisions, respectively, when the second operand or the divisor is a power of 2), you should take it.
example:
>>> 22 * 16
352
>>> 22 << 4
352
In that scenario, I did the exact same operation with a marginal decrease in time. However, if that scales up to 100 trillion calculations, the difference becomes much more notable.
I am dealing with a data frame with over 8000 rows. Each tuple consists of a path to an audio file for which a spectrogram needs to be created. One solution would be to use itertuples, iterating row-wise and converting each audio sample one at a time.
for row in valid_data.itertuples():
    y, sr = librosa.load('code/UrbanSound8K/audio/' + row.path, duration=2.97)
    ps = librosa.feature.melspectrogram(y=y, sr=sr)
    D.append((ps, row.classID))
But I'm not a fan of iterating over a dataframe. Is there a more efficient way to do this by applying an operation over all rows at the same time?
I don't know if it is going to help you, but my idea is to use threads (available in many OO languages). I don't know if your computer will be able to handle all the rows, but you could work with, for example, 4 rows at the same time; of course, you would have to modify the "for" counter accordingly.
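A rough sketch of that idea using a thread pool, assuming the same dataframe (valid_data) and columns (path, classID) as in the question:

from concurrent.futures import ThreadPoolExecutor
import librosa

def load_one(row):
    y, sr = librosa.load('code/UrbanSound8K/audio/' + row.path, duration=2.97)
    return (librosa.feature.melspectrogram(y=y, sr=sr), row.classID)

# 4 rows are processed at the same time; the pool takes care of the "for" counter
with ThreadPoolExecutor(max_workers=4) as pool:
    D = list(pool.map(load_one, valid_data.itertuples()))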
I am new to TensorFlow. I have a large amount of data in my database and I want a way to train a TensorFlow model on that data. I understand how to do this if I write the data to a csv file and then read the data back from the csv.
But how do I do this directly from the database? I can connect to the database from my script (Python) and run an SQL query to retrieve the data, but what if I want to learn in batches or epochs and shuffle the data?
Also the data is too big to hold in memory all at once.
Any tips on where to start?
Thank you
Let's reiterate the problem:
it is impossible to load all the data into memory (even if the data is trimmed of all unneeded meta data)
it is not possible (for technical or policy reasons) to first query the database, save the results to disk as a csv file, and then work with the csv file.
If we could do either of the above then we wouldn't have the problem. We are stuck with querying the database somehow, and we want to:
get the data in smallish chunks
Well, that is easy enough! Let's say that our database has a numeric primary key. Simply decide how many chunks you want the data in and use the modulo operator:
# for 7 batches
key % 7 == 0 gets you the first batch
key % 7 == 1 gets you the second batch
... etc
Okay, so you want to add another requirement
get the data in random smallish chunks
Well, that's not too difficult. Let's just pick two random numbers, X (preferably a prime) and Y (less than the number of batches), and do the same thing, but like so:
# for 7 batches
( key * X + Y ) % 7 == 0 gets you the first batch
( key * X + Y ) % 7 == 1 gets you the second batch
... etc
You don't have a list of primes handy? No problem, just get a whole bunch and pick one at random.
For the next epoch use a different X and Y and you'll get different batches.
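A minimal sketch of the idea with sqlite3; the table and column names here are illustrative assumptions, not from the question, and any DB-API driver would look similar:

import random
import sqlite3

n_batches = 7
X = random.choice([101, 211, 307, 401, 503])    # a prime, re-picked each epoch
Y = random.randrange(n_batches)

conn = sqlite3.connect('training.db')
for batch_no in range(n_batches):
    batch = conn.execute(
        "SELECT * FROM samples WHERE (id * ? + ?) % ? = ?",
        (X, Y, n_batches, batch_no)).fetchall()
    # ... feed `batch` to the model as one training batch ...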
I have a question as to how I can perform this task in Python.
I have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
e.g.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
up to about 110,000 entries, with about 4,000 different combinations of longitude-latitude.
I want to compute the average connections, average policy status, and average activity flag for each location,
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
and so on.
I have about 195 files with ~110,000 entries each (sort of a big data problem).
My files are in .csv but I'm using them as .txt to work with them more easily in Python (not sure if this is the best idea).
I'm still new to Python, so I'm not really sure what the best approach is, but I sincerely appreciate any help or guidance with this problem.
Thanks in advance!
No, if you have the files as .csv, treating them as text does not make sense, since Python ships with the excellent csv module.
You could read the csv rows into a dict to group them, but I'd suggest writing the data into a proper database and using SQL's AVG() and GROUP BY. Python ships with bindings for most databases. If you have none installed, consider using the sqlite3 module.
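A rough sketch of the sqlite3 route; the table layout and file names are assumptions based on the columns described in the question:

import csv
import sqlite3

conn = sqlite3.connect('locations.db')
conn.execute("""CREATE TABLE IF NOT EXISTS events
                (ip TEXT, connections INTEGER, policy INTEGER, activity INTEGER,
                 lon TEXT, lat TEXT)""")
with open('entries.csv', newline='') as f:          # repeat for all 195 files
    conn.executemany("INSERT INTO events VALUES (?,?,?,?,?,?)", csv.reader(f))
conn.commit()

query = """SELECT lon, lat, AVG(connections), AVG(policy), AVG(activity)
           FROM events GROUP BY lon, lat"""
for row in conn.execute(query):
    print(row)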
I'll only give you the algorithm; you would learn more by writing the actual code yourself.
Use a dictionary with the key being a pair of the form (longitude, latitude) and the value a list of the form [ConnectionSum, policystatusSum, ActivityFlagSum, count].
1. Loop over the entries once:
a. for each entry, if the location already exists, add the conn, policystat and activity values to the existing sums and increment the count;
b. if the location does not exist yet, initialize it with the entry's values and a count of 1.
2. Do step 1 for all files.
3. After all the entries have been scanned, loop over the dictionary and divide each element of [ConnectionSum, policystatusSum, ActivityFlagSum] by that location's count to get the average values.
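A compact sketch of that algorithm (the file names are placeholders; the column order follows the entries shown in the question):

import csv

sums = {}                         # (lon, lat) -> [connSum, poliSum, actiSum, count]
for path in ['entries1.csv']:     # loop over all 195 files here
    with open(path, newline='') as f:
        for ip, conn, poli, acti, lon, lat in csv.reader(f):
            s = sums.setdefault((lon, lat), [0, 0, 0, 0])
            s[0] += int(conn); s[1] += int(poli); s[2] += int(acti); s[3] += 1

for (lon, lat), (c, p, a, n) in sums.items():
    print(lon, lat, c / n, p / n, a / n)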
As long as your locations are restricted to a single file (or even close to each other within a file), all you need is the stream-processing paradigm. For example, if you know that duplicate locations only appear within one file, read each file, calculate the averages, then close the file. As long as you let the old data float out of scope, the garbage collector will get rid of it for you. Basically, do this:
def processFile(pathToFile):
    ...

totalResults = ...
for path in filePaths:
    partialResults = processFile(path)
    totalResults = combine...partialResults...with...totalResults
An even more elegant solution would be to use the constant-memory method of calculating averages "on-line". If, for example, you are averaging 5, 6, 7, you would do 5/1 = 5.0, (5.0*1 + 6)/2 = 5.5, (5.5*2 + 7)/3 = 6. At each step, you only keep track of the current average and the number of elements. This solution yields the minimal amount of memory (no more than the size of your final result!), and doesn't care about the order in which you visit elements. It would go something like the code below; see http://docs.python.org/library/csv.html for the functions you'll need from the csv module.
import csv

def allTheRecords():
    for path in filePaths:
        for row in csv.somehow_get_rows(path):
            yield SomeStructure(row)

averages = {}  # dict: keys are tuples (lat, long), values are an arbitrary
               # datastructure, e.g. dict, representing {avgConn, avgPoli, avgActi, num}
for record in allTheRecords():
    position = (record.lat, record.long)
    currentAverage = averages.get(position, {'avgConn': 0, 'avgPoli': 0, 'avgActi': 0, 'num': 0})
    newAverage = ...  # apply the math I mentioned above to currentAverage and record
    averages[position] = newAverage
(Do note that the notion of an "average at a location" is not well-defined. Well, it is well-defined, but not very useful: if you knew the exact location of every IP event to infinite precision, the average of everything would just be itself. The only reason you can compress your dataset is that your latitude and longitude have finite precision. If you run into this issue because you acquire more precise data, you can choose to round to an appropriate precision; it may be reasonable to round to within 10 meters or so (see latitude and longitude). This requires just a little bit of math/geometry.)
I have a simple text file containing two columns, both integers:
1 5
1 12
2 5
2 341
2 12
and so on..
I need to group the dataset by the second value,
such that the output will be:
5 1 2
12 1 2
341 2
Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group the values into a dictionary whose values are arrays of integers, but it still takes way too long (I guess a lot of the time goes into allocating the array('i') objects and extending them on append).
I am now planning to write a Pig script which I plan to run on a pseudo-distributed Hadoop machine (an Amazon EC2 High-Memory Large instance).
data = load 'Net.txt';
gdata = Group data by $1; -- I know it will lead to 5 (1,5) (2,5) but that's okay for this snippet
store gdata into 'res.txt';
I wanted to know if there was any simpler way of doing this.
Update:
Keeping such a big file in memory is out of the question. For the Python solution, what I planned was to do 4 runs: in the first run, only second-column values from 1 to 10 million are considered; in the next run, 10 million to 20 million are considered; and so on. But this turned out to be really slow.
The Pig / Hadoop solution is interesting because it keeps everything on disk (well, most of it).
For better understanding: this dataset contains information about the connectivity of ~45 million Twitter users, and the format in the file means that the user id given by the second number is following the first one.
The solution which I had used:
import array

class AdjDict(dict):
    """
    A special dictionary class to hold an adjacency list
    """
    def __missing__(self, key):
        """
        __missing__ is overridden so that when a key is not found,
        an integer array is initialized for it
        """
        self.__setitem__(key, array.array('i'))
        return self[key]

Adj = AdjDict()

for line in open("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:
        Adj[node].append(follower)

# Code for writing the Adj matrix to the file:
Assuming you have ~17 characters per line (a number I picked randomly to make the math easier), you have about 2 billion records in this file. Unless you are running with much physical memory on a 64-bit system, you will thrash your pagefile to death trying to hold all this in memory in a single dict. And that's just to read it in as a data structure - one presumes that after this structure is built, you plan to actually do something with it.
With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At minimum, just to hold 2 billion 4-byte integers would be 8 GB (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2 - if they will fit within a byte or a short, then you can use smaller int variables, which will be worth the trouble for a data set of this size).
If I had to solve this on my current hardware, I'd probably write a few small programs:
The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)
Then I'd call the OS-supplied sort(1) on each small file. (This might take a few gigs of memory.)
Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)
Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:
5 1
5 2
12 1
12 2
and you'll need to return:
5 1 2
12 1 2
(This won't take much memory.)
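As an illustration of that last pass (the answer suggests C, but the logic is the same; file names are placeholders), grouping consecutive lines of the sorted, column-swapped file could look like:

from itertools import groupby

with open('sorted.txt') as src, open('grouped.txt', 'w') as dst:
    for key, lines in groupby(src, key=lambda line: line.split()[0]):
        followers = (line.split()[1] for line in lines)
        dst.write(key + ' ' + ' '.join(followers) + '\n')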
By breaking it into smaller chunks, hopefully you can keep the RSS down to something that would fit a reasonable machine -- it will take more disk I/O, but on anything but astonishing hardware, swap use would kill attempts to handle this in one big program.
Maybe you can do a multi-pass through the file.
Process a range of keys on each pass through the file; for example, if you picked a range size of 100:
1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
..and so on.
for your sample, the 1st pass would output
5 1 2
12 1 2
and the 4th pass would output
341 2
Choose the range size so that the dict you are creating fits into your RAM
I wouldn't bother using multiprocessing to try to speed it up by using multiple cores; unless you have a very fast hard drive, this should be IO bound and you would just end up thrashing the disk.
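A rough sketch of the multi-pass idea, assuming the tab-separated two-column file from the question and a range size of 10 million keys per pass:

from collections import defaultdict

range_size = 10_000_000
for pass_no in range(5):                              # ~45M users -> 5 passes
    lo, hi = pass_no * range_size, (pass_no + 1) * range_size
    groups = defaultdict(list)
    with open('Net.txt') as src:
        for line in src:
            follower, node = line.split()
            if lo <= int(node) < hi:
                groups[node].append(follower)
    with open('res.txt', 'a') as dst:                 # append this pass's output
        for node, followers in groups.items():
            dst.write(node + ' ' + ' '.join(followers) + '\n')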
If you are working with a 34 GB file, I'm assuming that the hard drive, both in terms of storage and access time, is not a problem. How about reading the pairs sequentially and, when you find pair (x,y), opening file "x", appending " y" and closing file "x"? In the end, you will have one file per Twitter user id, each containing all the users this one is connected to. You can then concatenate all those files if you want your result in the output format you specified.
THAT SAID HOWEVER, I really do think that:
(a) for such a large data set, exact resolution is not appropriate and that
(b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.
Indeed, you have a very large graph, and a lot of efficient techniques have been devised to study the shape and properties of huge graphs; most of these techniques are built to work as streaming, online algorithms.
For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently and speedily provides information on the cliques contained in your graph. For a better idea on the triangle counting aspect, and how it is relevant to graphs, see for example this (randomly chosen) article.
I had a similar requirement, and you just need one more Pig statement to remove the redundancies in 5 (1,5) (2,5).
a = LOAD 'edgelist' USING PigStorage('\t') AS (user:int,following:int);
b = GROUP a BY user;
x = FOREACH b GENERATE group.user, a.following;
STORE x INTO 'following-list';