Dataset: jamescalam/youtube-transcriptions
Languages:
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: no-annotation
Source datasets: original
License:
The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row represents a roughly sentence-length chunk of text alongside the corresponding video URL and timestamp.
Note that each item in the dataset contains only a short chunk of text. For most use cases you will likely need to merge multiple rows to create more substantive chunks of text. If you need to do that, this code snippet will help:
from datasets import load_dataset

# first download the dataset
data = load_dataset(
    'jamescalam/youtube-transcriptions',
    split='train'
)

new_data = []  # this will store adjusted data
window = 6  # number of sentences to combine
stride = 3  # number of sentences to 'stride' over, used to create overlap

for i in range(0, len(data), stride):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    # create larger text chunk
    # (slicing a Dataset returns a dict of lists, so ['text'] is a list)
    text = ' '.join(data[i:i_end]['text'])
    # add to adjusted data list
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published']
    })
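To see what the window/stride merging does without downloading the dataset, here is a minimal sketch of the same logic run on a handful of synthetic rows (the field names mirror the dataset's columns, but the row values here are made up for illustration):

```python
# Synthetic rows imitating the dataset's schema: one short text chunk
# per row, each with a start/end timestamp and video metadata.
rows = [
    {'title': 'video-a', 'start': i * 5.0, 'end': i * 5.0 + 5.0,
     'text': f'sentence {i}', 'id': f'a-{i}',
     'url': 'https://youtu.be/a', 'published': '2022-01-01'}
    for i in range(8)
]

window = 6   # number of sentences to combine
stride = 3   # step between windows; stride < window creates overlap

merged = []
for i in range(0, len(rows), stride):
    i_end = min(len(rows) - 1, i + window)
    if rows[i]['title'] != rows[i_end]['title']:
        continue  # window would span two different videos
    merged.append({
        'start': rows[i]['start'],
        'end': rows[i_end]['end'],
        'title': rows[i]['title'],
        'text': ' '.join(r['text'] for r in rows[i:i_end]),
        'id': rows[i]['id'],
        'url': rows[i]['url'],
        'published': rows[i]['published'],
    })

print(len(merged))          # 3 windows over 8 rows
print(merged[0]['text'])    # first six sentences joined into one chunk
```

Because the stride (3) is smaller than the window (6), consecutive chunks share half their sentences; that overlap helps avoid splitting an idea across two chunks when the merged text is later embedded or indexed.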