In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes:
(a). Scaling-Inputs: Amplifying (input, output) pairs per task instruction, aiming for better instruction adherence.
(b). Scaling Input-Free Tasks: Enlarging the pool of tasks, each composed of an (instruction, output) pair, without requiring a separate input anymore.
However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation or non-compliance with instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective in dealing with instances that require additional input text (e.g., reading comprehension tasks).
This work introduces MUFFIN, a new scheme of instruction-following dataset curation:
(c). Scaling Tasks per Input (Ours): Scaling the number of tasks for each input by exploiting the input's various facets, as sketched below.
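To make the contrast concrete, below is a minimal sketch of what a training unit looks like under each scheme. The instances are invented for illustration and are not drawn from any of the datasets:

```python
# (a) Scaling-Inputs: one instruction, many (input, output) pairs.
scaling_inputs = {
    "instruction": "Classify the sentiment of the given review.",
    "instances": [
        {"input": "The movie was a delight.", "output": "positive"},
        {"input": "I walked out halfway through.", "output": "negative"},
    ],
}

# (b) Scaling Input-Free Tasks: many (instruction, output) pairs, no input field.
scaling_input_free = [
    {"instruction": "List three renewable energy sources.",
     "output": "Solar, wind, and hydro power."},
]

# (c) MUFFIN (ours): one input, multiple tasks derived from its different facets.
scaling_tasks_per_input = {
    "input": "The movie was a delight.",
    "tasks": [
        {"instruction": "Classify the sentiment of this text.", "output": "positive"},
        {"instruction": "Count the number of words in this text.", "output": "5"},
        {"instruction": "Translate this text into German.",
         "output": "Der Film war ein Vergnügen."},
    ],
}
```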
Experimental results across four zero-shot benchmarks, spanning both the (a) and (b) schemes, reveal that LLMs of various scales trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes.
Table 1 lists the detailed statistics of the 68K (instruction, input, output) instances of MUFFIN. It is worth noting that some inputs share the same instruction because of our "instruction rematching" mechanism.
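For illustration, a minimal sketch of how rematched instructions could be spotted, assuming instances are stored as flat (instruction, input, output) dicts (the storage format is our assumption, not the repo's actual schema):

```python
from collections import defaultdict

def inputs_per_instruction(instances):
    """Group inputs by instruction; any group with more than one input
    indicates an instruction that was "rematched" to multiple inputs."""
    groups = defaultdict(list)
    for ex in instances:
        groups[ex["instruction"]].append(ex["input"])
    return groups
```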
To assess data quality, we randomly selected 200 instances from our dataset (200 random inputs, one random instruction per input) and asked two NLP graduate students to evaluate them. Following previous work, each annotator answered three questions per instance: (A1) given only the instruction, does it describe a valid task? (A2) given the instruction-input pair, does the instruction appropriately match the input? (A3) does the output correctly respond to the instruction and input? We also recorded the instances for which all three fields were correct (A4). Figure 3 shows the correctness ratio for each question.
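As a rough sketch of how such acceptance ratios could be computed, assuming each annotation is stored as a dict of boolean judgments (the record format and field names below are our own illustration, not the paper's):

```python
# Hypothetical annotation records: one dict of boolean judgments per
# (annotator, instance) pair.
annotations = [
    {"A1": True, "A2": True, "A3": False},
    {"A1": True, "A2": False, "A3": False},
    # ... 200 instances x 2 annotators
]

def acceptance_ratios(annotations):
    n = len(annotations)
    # Per-question correctness ratios (booleans sum as 0/1).
    ratios = {q: sum(a[q] for a in annotations) / n for q in ("A1", "A2", "A3")}
    # A4: all three fields judged correct for the same instance.
    ratios["A4"] = sum(all(a.values()) for a in annotations) / n
    return ratios
```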
We evaluate models (T5-3B & T5-11B) trained with MUFFIN on four widely adopted zero-shot benchmarks: SuperNI-Test, MMLU, T0-Eval, and BBH. Compared with previous LLM-generated datasets, the models tuned on our MUFFIN consistently achieve better performance on 3 out of 4 benchmarks, under various metrics. We also find that the larger model benefits more from tuning on MUFFIN (average improvement over the strongest baseline: 4.42 for T5-3B vs. 8.03 for T5-11B). Notably, MUFFIN achieves comparable or even better performance than models trained on human-crafted datasets (e.g., SuperNI-Train, T0, T0++).
Additionally, we report results based on Llama2 to further support the above conclusion, fine-tuning Llama2 with LoRA (low-rank adaptation) on the various datasets. The gains are consistent across model architectures (i.e., encoder-decoder and decoder-only).
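For reference, below is a minimal LoRA fine-tuning sketch using the Hugging Face `peft` library; the checkpoint name and hyperparameters are illustrative placeholders, not the exact settings used in our experiments:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```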
Table 4 shows the human evaluation results (acceptance ratios). Compared with the strongest result among all 8 baselines, MUFFIN mostly achieves better human acceptance ratios by large margins (except for a slightly lower result on T0-Eval).
Table 5 shows the pairwise comparison results between MUFFIN and three other baselines. Compared with the LLM-generated datasets (Self-Instruct and Unnatural Instructions), our MUFFIN consistently gains higher human preference across the four benchmarks, in line with the previous conclusion. Notably, owing to its diverse tasks and our novel paradigm, MUFFIN even surpasses the high-quality human-crafted SuperNI on three out of four benchmarks.
For all the baseline datasets from the "direct comparison" in Table 2, we randomly sample subsets (10%, 30%, 50%, 80%, 100%) and train T5-3B on each subset to show the performance trends. As shown in Figure 4, MUFFIN exceeds the baselines by a noteworthy margin (average scores on the four evaluation benchmarks). The baselines would only become comparable to our results if scaled to several times the size of our data. We therefore conjecture that our paradigm is more efficient than simply collecting more data under the other paradigms.
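A minimal sketch of this subsampling setup, assuming each dataset is a list of instances (the helper below is our own illustration):

```python
import random

def sample_subset(dataset, fraction, seed=42):
    """Draw a fixed-seed random subset covering `fraction` of the dataset."""
    rng = random.Random(seed)
    k = int(len(dataset) * fraction)
    return rng.sample(dataset, k)

# for frac in (0.1, 0.3, 0.5, 0.8, 1.0):
#     subset = sample_subset(baseline_data, frac)
#     # ... train T5-3B on `subset`, then evaluate on the four benchmarks
```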
We mix MUFFIN with the training set of SuperNI at various proportions, where the proportion denotes the fraction of instances drawn from MUFFIN. We then train T5-3B on these mixtures and report the average performance on the four benchmarks. As illustrated in Figure 5, interestingly, mixing MUFFIN with the human-annotated SuperNI makes performance drop below that of training on MUFFIN alone. We conjecture that the dataset paradigm affects learning efficiency: different paradigms have markedly different impacts on the model's performance, so combining two paradigms can even hurt the model's instruction-following ability. This suggests the paradigm itself is critical.
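A minimal sketch of building such a mixture, assuming both datasets are lists of instances (the helper and its parameters are our own illustration):

```python
import random

def mix_datasets(muffin, superni, proportion, total, seed=42):
    """Build a `total`-sized mixture where `proportion` is the
    fraction of instances drawn from MUFFIN."""
    rng = random.Random(seed)
    n_muffin = int(total * proportion)
    mixture = (rng.sample(muffin, n_muffin)
               + rng.sample(superni, total - n_muffin))
    rng.shuffle(mixture)
    return mixture
```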
@inproceedings{Lou2023MUFFIN,
title={{MUFFIN}: Curating Multi-Faceted Instructions for Improving Instruction Following},
author={Renze Lou and Kai Zhang and Jian Xie and Yuxuan Sun and Janice Ahn and Hanzi Xu and Yu Su and Wenpeng Yin},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=1vrS1zwekw}
}