I need to do voice activity detection as a step to classify audio files.
Basically, I need to know with certainty if a given audio has spoken language.
I am using py-webrtcvad, which I found on GitHub and which is scarcely documented:
https://github.com/wiseman/py-webrtcvad
The thing is, when I try it on my own audio files, it works fine with the ones that contain speech, but it keeps yielding false positives when I feed it other types of audio (like music or birdsong), even with aggressiveness set to 3.
The audio files are sampled at 8000 Hz.
The only thing I changed in the source code is how the arguments are passed to the main function (instead of sys.argv).
def main(file, agresividad):
    audio, sample_rate = read_wave(file)
    vad = webrtcvad.Vad(int(agresividad))
    frames = frame_generator(30, audio, sample_rate)
    frames = list(frames)
    segments = vad_collector(sample_rate, 30, 300, vad, frames)
    for i, segment in enumerate(segments):
        path = 'chunk-%002d.wav' % (i,)
        print(' Writing %s' % (path,))
        write_wave(path, segment, sample_rate)

if __name__ == '__main__':
    file = 'myfilename.wav'
    agresividad = 3  # aggressiveness
    main(file, agresividad)
I'm seeing the same thing. I'm afraid that's just the extent to which it works. Speech detection is a difficult task, and webrtcvad wants to be light on resources, so there's only so much you can do. If you need more accuracy, you would need different packages/methods that will necessarily take more computing power.
On aggressiveness, you're right that even at 3 there are still a lot of false positives. I'm also seeing false negatives, however, so one trick I'm using is running three instances of the detector, one for each aggressiveness setting. Then, instead of classifying a frame as 0 or 1, I give it the value of the highest aggressiveness that still said it was speech. In other words, each sample now gets a score from 0 to 3, where 0 means even the least strict detector said it wasn't speech and 3 means even the strictest setting said it was. I get a little more resolution that way, and even with the false positives it is good enough for me.
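A minimal sketch of that scoring idea, assuming 16-bit mono PCM frames of 10/20/30 ms as webrtcvad expects (the frame splitting itself is left out):

import webrtcvad

# One detector per aggressiveness setting (1 = least strict, 3 = strictest).
vads = {a: webrtcvad.Vad(a) for a in (1, 2, 3)}

def speech_score(frame_bytes, sample_rate=8000):
    """Return 0-3: the highest aggressiveness that still called the frame speech."""
    score = 0
    for aggressiveness in (1, 2, 3):
        if vads[aggressiveness].is_speech(frame_bytes, sample_rate):
            score = aggressiveness
        else:
            break  # stop at the first setting that rejects the frame
    return score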
I am trying to find the profanity score of text received in chats.
For this I went through a couple of Python libraries and found some relevant ones:
profanity-check
alt-profanity-check -- (currently using)
profanity-filter
detoxify
Now, the one I am using is giving me proper results when calling predict and predict_prob, which run against the calibrated classifier used under the hood after training.
The problem is that I am unable to identify which words drove the prediction or the probability. In short, I want the list of feature names (profane words) found in the test data passed as input.
I know there are no methods that return this, but I would be happy to fork the library and add one.
I wanted to understand whether we can add something to this place (edit) to create such a method.
e.g.
text = ["this is crap"]
predict(text) -> array([1])
predict_prob(text) -> array([0.99868968])
predict_words(text) -> array(["crap"])  <-- (need this)
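Before forking, one workaround sketch that only uses the library's public predict_prob; the predict_words helper and the 0.5 threshold below are my own, not part of the library. It scores each whitespace-separated word on its own and keeps the ones the model itself flags:

from profanity_check import predict_prob

def predict_words(texts, threshold=0.5):
    # Hypothetical helper: return, per input text, the words whose individual
    # profanity probability crosses the threshold.
    results = []
    for text in texts:
        words = text.split()
        probs = predict_prob(words)
        results.append([w for w, p in zip(words, probs) if p >= threshold])
    return results

print(predict_words(["this is crap"]))  # e.g. [['crap']]

It is cruder than reading feature weights out of the underlying vectorizer and classifier, but it doesn't require touching the library's internals.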
I'm writing code for Tacotron 2 that fetches transcripts from YouTube and formats them into a file. Unfortunately the data it receives from YT doesn't specify where sentences end, so I tried adding a full stop at the end of each caption, but most captions aren't full sentences. How can I make it add full stops only at the end of a sentence? The only other data it receives are timestamps.
# Batch file for Tacotron 2
from youtube_transcript_api import YouTubeTranscriptApi

transcript_txt = YouTubeTranscriptApi.get_transcript('DY0ekRZKtm4')

def write_transcript():
    with open('transcript.txt', 'a+') as transcript_object:
        transcript_object.seek(0)
        subtitles = transcript_object.read(100)
        if len(subtitles) > 0:
            transcript_object.write('\n')
        for i in transcript_txt:
            ii = i['text']
            if ii[-1] != '.':
                iii = ii + '.'
            else:
                iii = ii
            print(iii)
            transcript_object.write(iii + '\n')
        transcript_object.close()

write_transcript()
Here's an example:
What it saves:
sometimes it was possible to completely.
fall.
out of the world if the lag was bad.
enough.
What I want:
sometimes it was possible to completely
fall
out of the world if the lag was bad
enough.
There is no easy solution. The least effort way I can think of is to set up spaCy, nlp the whole transcript and hope for the best. It's not trained on data without punctuation though, so don't expect perfect results, but it will detect some sentence boundaries (based on syntax for the most part).
import spacy

nlp = spacy.load('en_core_web_trf')

text = """sometimes it was possible to completely
fall
out of the world if the lag was bad
enough
we solved that by
adding more test data"""

doc = nlp(text)
for s in doc.sents:
    print(f"'{s}'")
Output:
'sometimes it was possible to completely
fall
out of the world if the lag was bad
enough
'
'we solved that by
adding more test data'
So in this case, it worked. Once you have that, you could do some additional processing, add punctuation manually, etc.
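For example, a rough way to rebuild the transcript with a full stop at each boundary spaCy found (this just flattens the line breaks inside each sentence and appends a dot, stripping any dot that is already there):

punctuated = " ".join(" ".join(s.text.split()).rstrip(".") + "." for s in doc.sents)
print(punctuated)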
I am using Python and Music21 to code an algorithm that composes melodies from input music files of violin accompanied by piano. My problem is that when I input a MIDI file that has two instruments, the output is only in one instrument. I can currently change the output instrument to a guitar, trumpet, etc., even though those instruments are not present in my original input files. I would like to know whether I could write some code that identifies the instruments in the input files and outputs those specific instruments. Alternatively, is there any way I could code for two output instruments rather than one? I have tried copying the existing code with another instrument, but the algorithm only outputs the last instrument detected in the code. Below is my current running code:
def convert_to_midi(prediction_output):
    offset = 0
    output_notes = []
    # Create note and chord objects based on the values generated by the model
    for pattern in prediction_output:
        # Pattern is a chord
        if ('.' in pattern) or pattern.isdigit():
            notes_in_chord = pattern.split('.')
            notes = []
            for current_note in notes_in_chord:
                output_notes.append(instrument.Guitar())
                cn = int(current_note)
                new_note = note.Note(cn)
                notes.append(new_note)
            new_chord = chord.Chord(notes)
            new_chord.offset = offset
            output_notes.append(new_chord)
        # Pattern is a note
        else:
            output_notes.append(instrument.Guitar())
            new_note = note.Note(pattern)
            new_note.offset = offset
            output_notes.append(new_note)
Instrument objects go directly into the Stream object, not on a Note, and each Part can have only one Instrument object active at a time.
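A minimal sketch of one way to do that: give each instrument its own Part and combine the Parts in a Score (the notes below are placeholders; mapping your prediction output onto the two parts is up to you):

from music21 import stream, note, chord, instrument

violin_part = stream.Part()
violin_part.insert(0, instrument.Violin())
violin_part.append(note.Note('E5', quarterLength=1.0))

piano_part = stream.Part()
piano_part.insert(0, instrument.Piano())
piano_part.append(chord.Chord(['C3', 'E3', 'G3'], quarterLength=1.0))

score = stream.Score([violin_part, piano_part])
score.write('midi', fp='two_instruments.mid')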
I have a video file and I'd like to cut out some scenes (identified either by a time position or by a frame). As far as I understand, that should be possible with gnonlin, but so far I wasn't able to find a sample of how to do it (ideally using Python). I don't want to modify the video/audio streams if possible (but conversion to mp4/webm would be acceptable).
Am I correct that gnonlin is the right component in the gstreamer universe to do that? I'd also be glad for some pointers/recipes on how to approach the problem (gstreamer newbie here).
Actually it turns out that gnonlin is too low-level and still requires a lot of gstreamer knowledge. Luckily there is "gstreamer-editing-services" (gst-editing-services), a library offering a higher-level API on top of gstreamer and gnonlin.
With a tiny bit of RTFM reading and a helpful blog post with a Python example I was able to solve my basic problem:
Load the asset (video)
Create a Timeline with a single layer
add the asset multiple times to the layer, adjusting start, inpoint and duration so only the relevant parts of a video are present in the output video
Most of my code is directly taken from the referenced blog post above so I don't want to dump all of that here. The relevant stuff is this:
import gi
gi.require_version('Gst', '1.0')
gi.require_version('GES', '1.0')
from gi.repository import Gst, GES

Gst.init(None)
GES.init()

asset = GES.UriClipAsset.request_sync(source_uri)  # source_uri set elsewhere, e.g. 'file:///path/to/video.mp4'
timeline = GES.Timeline.new_audio_video()
layer = timeline.append_layer()

start_on_timeline = 0
start_position_asset = 10 * 60 * Gst.SECOND
duration = 5 * Gst.SECOND
# GES.TrackType.UNKNOWN => add every kind of stream to the timeline
clip = layer.add_asset(asset, start_on_timeline, start_position_asset,
                       duration, GES.TrackType.UNKNOWN)

start_on_timeline = duration
start_position_asset = start_position_asset + 60 * Gst.SECOND
duration = 20 * Gst.SECOND
clip2 = layer.add_asset(asset, start_on_timeline, start_position_asset,
                        duration, GES.TrackType.UNKNOWN)

timeline.commit()
The resulting video includes the segments 10:00–10:05 and 11:00–11:20, so essentially there are two cuts: one in the beginning and one in the middle.
From what I have seen this worked perfectly fine, audio and video in sync, no worries about key frames and whatnot. The only part left is to find out if I can translate the "frame number" into a timing reference for gst editing services.
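If the frame rate of the source is known and constant, my guess is that the conversion is plain arithmetic (the 25 fps below is an assumption about the input file):

framerate = 25  # frames per second of the source video (assumption)
frame_number = 1500
position_ns = int(frame_number / framerate * Gst.SECOND)  # nanoseconds, usable as inpoint/start/duration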
How do I go about extracting move information from a PGN file in Python? I'm new to programming and any help would be appreciated.
Try pgnparser.
Example code:
import pgn
import sys

f = open(sys.argv[1])
pgn_text = f.read()
f.close()

games = pgn.loads(pgn_text)
for game in games:
    print(game.moves)
@Denis Golomazov
I like what Denis did above. To add to it, if you want to extract move information from more than one game in a PGN file (say, games in a chess database PGN file), use chess.pgn.
import chess.pgn

pgn_file = open('sample.pgn')
current_game = chess.pgn.read_game(pgn_file)
pgn_text = str(current_game.mainline_moves())
The read_game() method acts like an iterator: calling it again will grab the next game in the PGN file.
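So a loop over a whole database file could look roughly like this (the file name is just an example):

import chess.pgn

with open('sample.pgn') as pgn_file:
    while True:
        game = chess.pgn.read_game(pgn_file)
        if game is None:  # end of file reached
            break
        print(game.mainline_moves())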
I can't give you any Python-specific directions, but I wrote a PGN converter recently in Java, so I'll try to offer some advice. The main disadvantage of Miku's link is that the site doesn't allow for variance in .pgn files, and every site seems to vary slightly in the exact format.
Some .pgn files have the move number attached to the move itself (1.e4 instead of 1. e4), so if you tokenise the string you can check the placement of the dot, since it only occurs in move numbers (see the small sketch after these points).
Work out all the different move combinations you can have. If a move is 5 characters long it could be O-O-O (queenside castling), Nge2+ (knight from the g-file to e2, with check (+) or checkmate (#)), or Rexb5 (rook on the e-file takes b5).
The longest string a move can be is 7 characters (when you must specify the origin rank AND file AND a capture AND check). The shortest is 2 characters (a pawn advance).
Plan early for castling and en passant moves. You may realise too late that the way you have built your program doesn't easily adapt to them.
The details given at the start (Elo ratings, location, etc.) vary from file to file.
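For the move-number point above, a rough sketch (in Python, since that's what the question uses) of stripping the move numbers, with or without a space after the dot, before tokenising; the regex is only a starting point and ignores comments, NAGs and variations:

import re

movetext = "1.e4 e5 2. Nf3 Nc6 3.Bb5 a6"  # mixes '1.e4' and '2. Nf3' styles
tokens = re.sub(r"\d+\.(\.\.)?", " ", movetext).split()
print(tokens)  # ['e4', 'e5', 'Nf3', 'Nc6', 'Bb5', 'a6']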
I don't have a PGN parser for Python, but you can get the source code of a PGN parser for Xcode from this place; it may be of assistance.