Dataset Card for "stack-smol-xxl"

This is a subset of the deduplicated Stack dataset (bigcode/the-stack-dedup).

It was generated like so:

from datasets import load_dataset, Dataset
languages = ["css", "prolog", "c", "fortran", "solidity", "kotlin", "literate-agda", "julia", "java-server-pages",
             "isabelle", "idris", "lean", "powershell", "go", "erlang", "f-sharp", "ada", "pascal", "perl", "r", "protocol-buffer",
             "cmake", "sas", "ruby", "rust", "rmarkdown", "c-sharp", "smalltalk", "haskell", "maple", "mathematica", "ocaml",
             "makefile", "lua", "literate-coffeescript", "literate-haskell", "restructuredtext", "racket", "standard-ml",
             "systemverilog", "tex", "awk", "assembly", "alloy", "agda", "emacs-lisp", "dart", "cuda", "bluespec", "augeas", "batchfile",
             "tcsh", "stan", "scala", "tcl", "stata", "applescript", "shell", "clojure", "scheme", "antlr", "sparql", "sql",
             "glsl", "elm", "dockerfile", "cpp", "coffeescript", "common-lisp", "elixir", "groovy", "html", "java", "javascript",
             "markdown", "php", "python", "typescript", "verilog", "visual-basic", "vhdl", "thrift", "matlab", "yacc", "zig", "xslt", "json", "yaml"]

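def dset_gen():
    # Stream each language directory in turn so nothing is downloaded in full.
    for language in languages:
        dset = load_dataset("bigcode/the-stack-dedup", data_dir=f"data/{language}", streaming=True, split="train")
        # Keep at most the first 250,000 files of this language.
        sample = dset.take(250_000)
        for row in sample:
            yield row

# Materialize the concatenated per-language samples as a regular Dataset.
dset = Dataset.from_generator(dset_gen)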
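The sampling pattern above (take at most N rows from each source, then concatenate) can be illustrated without network access. The following is a minimal sketch, not the code used to build the dataset: `capped_concat` and `toy_sources` are hypothetical names, and `itertools.islice` stands in for the streaming dataset's `take`.

```python
from itertools import islice


def capped_concat(sources, cap):
    """Yield at most `cap` rows from each source, one source after another."""
    for rows in sources.values():
        # islice stops early if the source has fewer than `cap` rows.
        for row in islice(rows, cap):
            yield row


# Hypothetical toy sources standing in for per-language streams.
toy_sources = {
    "python": ({"content": f"print({i})"} for i in range(5)),
    "zig": ({"content": f"// {i}"} for i in range(2)),
}

rows = list(capped_concat(toy_sources, cap=3))
print(len(rows))
```

A source with fewer rows than the cap contributes everything it has, so the total can be smaller than the cap multiplied by the number of sources.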

Dataset Structure

num_examples: 11658586
download_size: 28807934580
dataset_size: 78577965159
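To make these figures easier to read, the sketch below converts them to gigabytes and derives an average size per example. It assumes that download_size and dataset_size are byte counts, as is conventional for Hugging Face dataset metadata; the variable and function names are illustrative only.

```python
# Values copied from the Dataset Structure section; sizes assumed to be bytes.
num_examples = 11_658_586
download_size = 28_807_934_580
dataset_size = 78_577_965_159


def to_gb(n_bytes):
    """Convert a byte count to decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_bytes / 1e9


avg_bytes_per_example = dataset_size / num_examples
size_ratio = dataset_size / download_size

print(f"download: {to_gb(download_size):.2f} GB")
print(f"in memory: {to_gb(dataset_size):.2f} GB")
print(f"average per example: {avg_bytes_per_example:.0f} bytes")
print(f"dataset_size / download_size: {size_ratio:.2f}")
```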

Data Instances

Each data instance corresponds to one file. The content of the file is in the content feature, and other features (repository_name, licenses, etc.) provide some metadata. Note that a given file can appear in several different repositories that satisfy our safe-license criterion. If that is the case, only the first of these repositories, in alphabetical order, is shown for simplicity.

Data Fields

content (string): the content of the file.
size (integer): size of the uncompressed file.
lang (string): the programming language.
ext (string): the file extension.
avg_line_length (float): the average line-length of the file.
max_line_length (integer): the maximum line-length of the file.
alphanum_fraction (float): the fraction of characters in the file that are alphabetical or numerical characters.
hexsha (string): unique git hash of the file.
max_{stars|forks|issues}_repo_path (string): path to the file in the repository that contains it and has the maximum number of {stars|forks|issues}.
max_{stars|forks|issues}_repo_name (string): name of the repository that contains this file and has the maximum number of {stars|forks|issues}.
max_{stars|forks|issues}_repo_head_hexsha (string): hexsha of the repository head.
max_{stars|forks|issues}_repo_licenses (string): licenses in the repository.
max_{stars|forks|issues}_count (integer): number of {stars|forks|issues} in the repository.
max_{stars|forks|issues}_repo_{stars|forks|issues}_min_datetime (string): first timestamp of a {stars|forks|issues} event.
max_{stars|forks|issues}_repo_{stars|forks|issues}_max_datetime (string): last timestamp of a {stars|forks|issues} event.
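The per-file statistics listed above (size, avg_line_length, max_line_length, alphanum_fraction) can be approximated from content alone. This is a minimal sketch under stated assumptions, not the exact procedure used for the Stack: it assumes line lengths exclude the newline character, that size is the UTF-8 byte length, and that "alphabetical or numerical" matches Python's `str.isalnum`. The function name `file_stats` is hypothetical.

```python
def file_stats(content: str) -> dict:
    """Approximate the per-file statistics described in Data Fields."""
    lines = content.splitlines()
    # Fall back to a single zero-length line so empty files do not divide by zero.
    lengths = [len(line) for line in lines] or [0]
    alnum_chars = sum(ch.isalnum() for ch in content)
    return {
        "size": len(content.encode("utf-8")),  # assumption: UTF-8 byte count
        "avg_line_length": sum(lengths) / len(lengths),
        "max_line_length": max(lengths),
        "alphanum_fraction": alnum_chars / len(content) if content else 0.0,
    }


stats = file_stats("ab\ncdef\n")
print(stats)
```

Values computed this way may differ slightly from the stored fields if the original pipeline counted newlines or bytes differently.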

Author:

cakiki

Dataset size:

8.25 GB