Dataset: jamescalam/youtube-transcriptions
Languages:
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: no-annotation
Source datasets: original
License:
The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row represents a roughly sentence-length chunk of text alongside the corresponding video URL and timestamp.
Note that each item in the dataset contains only a short chunk of text. For most use cases you will likely need to merge multiple rows to create more substantive chunks of text. If you need to do that, this code snippet will help:
from datasets import load_dataset

# first download the dataset
data = load_dataset(
    'jamescalam/youtube-transcriptions',
    split='train'
)

new_data = []  # this will store adjusted data
window = 6  # number of sentences to combine
stride = 3  # number of sentences to 'stride' over, used to create overlap

for i in range(0, len(data), stride):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    # create larger text chunk
    # (slicing a Dataset returns a dict of lists, so ['text'] is a list)
    text = ' '.join(data[i:i_end]['text'])
    # add to adjusted data list
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published']
    })
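To see what the window/stride merging does without downloading the dataset, here is a minimal sketch of the same logic run on a handful of synthetic rows (the field names mirror the dataset's columns, but the row values here are made up for illustration):

```python
# Synthetic rows imitating the dataset's schema: one short text chunk
# per row, each with a start/end timestamp and video metadata.
rows = [
    {'title': 'video-a', 'start': i * 5.0, 'end': i * 5.0 + 5.0,
     'text': f'sentence {i}', 'id': f'a-{i}',
     'url': 'https://youtu.be/a', 'published': '2022-01-01'}
    for i in range(8)
]

window = 6   # number of sentences to combine
stride = 3   # step between windows; stride < window creates overlap

merged = []
for i in range(0, len(rows), stride):
    i_end = min(len(rows) - 1, i + window)
    if rows[i]['title'] != rows[i_end]['title']:
        continue  # window would span two different videos
    merged.append({
        'start': rows[i]['start'],
        'end': rows[i_end]['end'],
        'title': rows[i]['title'],
        'text': ' '.join(r['text'] for r in rows[i:i_end]),
        'id': rows[i]['id'],
        'url': rows[i]['url'],
        'published': rows[i]['published'],
    })

print(len(merged))          # 3 windows over 8 rows
print(merged[0]['text'])    # first six sentences joined into one chunk
```

Because the stride (3) is smaller than the window (6), consecutive chunks share half their sentences; that overlap helps avoid splitting an idea across two chunks when the merged text is later embedded or indexed.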