Introduction

Thank you for choosing VocaliD to meet your custom voice needs. To ensure that we can provide you with the highest quality voice, we request that the data you provide to us meet or exceed the following specifications:

Duration

Ideally, we would like 2000+ recorded sentences from the target speaker, totaling at least 2 hours of audio. The audio should contain the target speaker alone, with other speakers and sounds removed.

Each recording should be between 1 and 15 seconds long, and it is best if most recordings are between 3 and 10 seconds. If a sentence is too long, it can be divided into multiple recordings at a logical break in the sentence, such as a pause between phrases.
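The duration guidelines above can be checked before upload. The following is a minimal sketch, not a VocaliD tool: it assumes the recordings are uncompressed PCM WAV files (which Python's standard `wave` module can read), and the function names and constants are hypothetical, chosen only for illustration.

```python
import wave

MIN_SECONDS = 1.0                   # per-recording lower bound from the guideline
MAX_SECONDS = 15.0                  # per-recording upper bound from the guideline
TARGET_TOTAL_SECONDS = 2 * 60 * 60  # "at least 2 hours" of audio
TARGET_SENTENCES = 2000             # "2000+ recorded sentences"


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize_durations(durations):
    """Summarize a mapping of file name -> duration in seconds."""
    out_of_range = sorted(
        name for name, seconds in durations.items()
        if not MIN_SECONDS <= seconds <= MAX_SECONDS
    )
    total = sum(durations.values())
    return {
        "count": len(durations),
        "total_seconds": total,
        "out_of_range": out_of_range,
        "meets_count_target": len(durations) >= TARGET_SENTENCES,
        "meets_duration_target": total >= TARGET_TOTAL_SECONDS,
    }
```

A typical use would be to build the `durations` mapping by calling `wav_duration_seconds` on each file, then review the `out_of_range` list and split or drop those recordings.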

Audio Quality

The audio should be clean and free of:

Consistent background hiss
Continual background noise such as TVs or crowd noise
Other foreground talkers (e.g. interviewers). If other talkers are present, VocaliD must extract the target speaker's audio, which incurs additional costs.
Random interjections of noise or background sounds (e.g. ambulance sirens)

Transcripts

An accurate transcript should be provided for each audio file in a comma-separated (“,”) spreadsheet file (i.e. .csv) named data_mapping.csv, with the file path in the first column, the associated transcript in the second column and the source name or link in the third column. The header strings must match those described below.

Every utterance should be transcribed.
If the speaker repeats a word, transcribe each repetition.
Filler words such as “ah” and “um” should be transcribed.
Use standard English capitalization for sentences, proper names, etc.
Standard English punctuation such as commas and periods should be used.
Spell out numbers (both ordinal and cardinal).
Do not abbreviate the spoken words.
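Some of the rules above can be checked automatically. The sketch below is a heuristic only, and `lint_transcript` is a hypothetical helper rather than part of any VocaliD tooling: it flags digits, a lowercase first letter, and missing end punctuation. It cannot tell whether fillers, repeated words, or abbreviations were transcribed correctly, so a manual review is still needed.

```python
import re


def lint_transcript(text):
    """Return a list of warnings for one transcript string (heuristic only)."""
    stripped = text.strip()
    if not stripped:
        return ["transcript is empty"]

    warnings = []
    # Numbers should be spelled out, so any digit is suspicious.
    if re.search(r"\d", stripped):
        warnings.append("contains digits; numbers should be spelled out")
    # The first letter (skipping any opening quote) should be capitalized.
    first_alpha = next((ch for ch in stripped if ch.isalpha()), "")
    if first_alpha and not first_alpha.isupper():
        warnings.append("does not start with a capital letter")
    # Allow a closing quote after the final punctuation mark.
    if stripped[-1] not in ".?!\"'”’":
        warnings.append("does not end with sentence punctuation")
    return warnings
```

For example, `lint_transcript("I have 2 cats.")` reports the digit, while a transcript such as "Wow, um, so many transcripts." passes without warnings.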

(See below for an example of the csv file format and content.)

Header names and column values

Column name	Description
filepath	The path of the audio file, relative to the csv file.
transcript	The text transcript for the audio file.
source_identifier	A string identifier that groups audio files from the same source, recorded under the same conditions, for example a source name, a link or an arbitrary string. If all files are from the same source, this value can be left blank in every row, but the header is still required.


filepath,transcript,source_identifier
recordings/source_a/audio_sample_001.wav,This is a sample one of a transcript.,A
recordings/source_a/audio_sample_002.wav,This this is the second example of a transcript and includes a repeated word.,A
recordings/source_b/audio_sample_003.wav,"Wow, um, so many transcripts.",B
recordings/source_c/audio_sample_004.wav,"""Thank you"", said Doctor Doolittle.",C
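Transcripts that contain commas or double quotes must be quoted, as in the last two rows above. One way to get this right is to let a CSV library do the quoting instead of joining strings by hand. The following sketch uses Python's standard `csv` module; the function name `write_data_mapping` is hypothetical, and the UTF-8 encoding in the usage note is an assumption, since this guide does not specify one.

```python
import csv

FIELDNAMES = ["filepath", "transcript", "source_identifier"]


def write_data_mapping(rows, stream):
    """Write the required header and the given row dicts as CSV to a text stream."""
    # The default dialect quotes only fields that need it (commas, quotes,
    # newlines) and doubles any embedded double quotes.
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```

To write the file itself, open it with `open("data_mapping.csv", "w", newline="", encoding="utf-8")` and pass the file object as `stream`. Each row is a dict with the three header names as keys; `source_identifier` may be an empty string.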

Example data

filepath	transcript	source_identifier
recordings/source_a/audio_sample_001.wav	This is a sample one of a transcript.	A
recordings/source_a/audio_sample_002.wav	This this is the second example of a transcript and includes a repeated word.	A
recordings/source_b/audio_sample_003.wav	Wow, um, so many transcripts.	B
recordings/source_c/audio_sample_004.wav	"Thank you", said Doctor Doolittle.	C

Uploading

The csv file and the audio files should be uploaded to VocaliD together as a single zip file. The file paths should be relative to the spreadsheet’s location. The file size limit for uploaded zip files is 1 GB.

audio_uploads.zip
├── data_mapping.csv
└── recordings
    ├── source_a
    │   ├── audio_sample_001.wav
    │   └── audio_sample_002.wav
    ├── source_b
    │   └── audio_sample_003.wav
    └── source_c
        └── audio_sample_004.wav
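Before uploading, the archive can be checked against the layout above. This is a minimal sketch under stated assumptions, not an official validator: `check_upload_zip` is a hypothetical name, the size limit is read as 10**9 bytes, the csv is assumed to be UTF-8 (with or without a byte order mark), and data_mapping.csv is assumed to sit at the top level of the archive as shown in the tree.

```python
import csv
import io
import os
import zipfile

REQUIRED_HEADER = ["filepath", "transcript", "source_identifier"]
MAX_ZIP_BYTES = 10**9  # assumption: the 1 GB limit interpreted as 10**9 bytes


def check_upload_zip(zip_path, csv_name="data_mapping.csv"):
    """Return a list of problems found in the archive (empty if none)."""
    problems = []
    if os.path.getsize(zip_path) > MAX_ZIP_BYTES:
        problems.append("archive exceeds the size limit")

    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
        if csv_name not in names:
            problems.append(f"{csv_name} not found at the top level")
            return problems

        with archive.open(csv_name) as raw:
            # "utf-8-sig" also accepts files that start with a byte order mark.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            reader = csv.DictReader(text)
            if reader.fieldnames != REQUIRED_HEADER:
                problems.append(f"unexpected header: {reader.fieldnames}")
                return problems
            # Row 1 is the header, so data rows are numbered from 2.
            for row_number, row in enumerate(reader, start=2):
                if row["filepath"] not in names:
                    problems.append(
                        f"row {row_number}: audio file not in archive: {row['filepath']}"
                    )
                if not (row["transcript"] or "").strip():
                    problems.append(f"row {row_number}: empty transcript")
    return problems
```

Calling `check_upload_zip("audio_uploads.zip")` returns an empty list when the header matches, every `filepath` resolves to a file in the archive, and every row has a transcript. Some zip tools wrap the contents in an extra top-level folder, which this check would report as a missing data_mapping.csv.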