MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following

♠️Renze Lou, Kai Zhang, Jian Xie, Yuxuan Sun, ♠️Janice Ahn, Hanzi Xu, Yu Su, ♠️Wenpeng Yin
♠️The Pennsylvania State University, The Ohio State University, Fudan University, Westlake University, Temple University

{renze.lou, wenpeng}@psu.edu

Abstract

Figure 1: Three different paradigms for designing instruction-following datasets.

In the realm of large language models (LLMs), enhancing instruction-following capability often involves curating expansive training data. This is achieved through two primary schemes:

(a). Scaling-Inputs: Amplifying (input, output) pairs per task instruction, aiming for better instruction adherence.

(b). Scaling Input-Free Tasks: Enlarging the number of tasks, each composed of an (instruction, output) pair that no longer requires a separate input.

However, LLMs under Scaling-Inputs tend to be overly sensitive to inputs, leading to misinterpretation or non-compliance with instructions. Conversely, Scaling Input-Free Tasks demands a substantial number of tasks but is less effective in dealing with instances that require additional input text (e.g., reading comprehension tasks).

This work introduces MUFFIN, a new scheme of instruction-following dataset curation:

(c). Scaling Tasks per Input (Ours): Scaling the number of tasks for each input by deriving tasks from the input's various facets.

Experimental results across four zero-shot benchmarks, spanning both (a) and (b) schemes, reveal that LLMs, at various scales, trained on MUFFIN generally demonstrate superior instruction-following capabilities compared to those trained on the two aforementioned schemes.
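To make the three schemes concrete, below is a minimal, hypothetical sketch of what a training instance looks like under each paradigm; the texts, labels, and field names are illustrative only and are not drawn from any of the actual datasets.

```python
# Hypothetical examples illustrating the three curation schemes (texts are made up).

# (a) Scaling-Inputs: one instruction, many (input, output) pairs collected for it.
scaling_inputs = {
    "instruction": "Classify the sentiment of the movie review.",
    "instances": [
        {"input": "The plot was gripping from start to finish.", "output": "positive"},
        {"input": "I walked out halfway through.", "output": "negative"},
    ],
}

# (b) Scaling Input-Free Tasks: many tasks, each an (instruction, output) pair with no input.
scaling_input_free_tasks = [
    {"instruction": "Write a haiku about autumn.", "output": "..."},
    {"instruction": "List three prime numbers.", "output": "2, 3, 5"},
]

# (c) Scaling Tasks per Input (MUFFIN): one input, many instructions derived from its facets.
scaling_tasks_per_input = {
    "input": "The plot was gripping from start to finish.",
    "tasks": [
        {"instruction": "Classify the sentiment of the review.", "output": "positive"},
        {"instruction": "Which aspect of the movie does the review focus on?", "output": "the plot"},
        {"instruction": "Rewrite the review in a neutral tone.",
         "output": "The plot held the viewer's attention throughout."},
    ],
}
```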

Data Curation

Figure 2: The overall data curation pipeline of MUFFIN.

As shown in Figure 2, the data curation strategy of MUFFIN mainly consists of the following steps:
  1. Input Collection:
        - We randomly sample input texts from two distinct sources to ensure textual diversity.
  2. Instruction Collection (we use two different instruction collection methods; a code sketch follows this list):
        - Instruction Brainstorm Based on Inputs' Facets. We first prompt LLMs (i.e., ChatGPT) to generate various textual facets (or attributes) of the given input text. Next, treating these facets as hints, we further prompt the LLMs to brainstorm corresponding task instructions, ensuring input-instruction relevance.
        - Instruction Rematching. We gather suitable instructions from existing human-annotated sets. For each input-instruction pair, we ask LLMs (i.e., GPT-4) to determine whether the instruction aligns with the input to form a valid task.
  3. Output Annotation:
        - We use LLMs (i.e., ChatGPT) to annotate outputs, or mark them as “None” for instances that ChatGPT deems unanswerable.
  4. Instruction Filtering:
        - We filter out overlapping task instructions (those whose ROUGE-L score with any existing instruction exceeds 0.7) and unanswerable instructions (those whose outputs are marked as “None” by the LLMs).
  5. Classification Expansion:
        - For each brainstormed instruction, we ask LLMs (i.e., ChatGPT) to generate several "wrong outputs" that are suboptimal compared to the correct output. We then combine the correct output with these wrong outputs to form the full output space of the task (i.e., transforming it into a classification paradigm).
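Below is a minimal sketch of the facet-based instruction brainstorming (step 2) and the ROUGE-L-based instruction filtering (step 4). It assumes the OpenAI Python client and the rouge-score package; the prompt wordings, model names, and helper functions are illustrative placeholders rather than the exact ones used to build MUFFIN.

```python
# pip install openai rouge-score
from openai import OpenAI
from rouge_score import rouge_scorer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Single-turn helper around the chat-completions API."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def brainstorm_instructions(input_text: str, n_facets: int = 5) -> list[str]:
    """Step 2a (sketch): derive facets of the input, then one task instruction per facet."""
    facet_prompt = (
        f"List {n_facets} distinct textual facets (attributes) of the following text, "
        f"one per line:\n\n{input_text}"
    )
    facets = [line.strip("-• ").strip() for line in chat(facet_prompt).splitlines() if line.strip()]

    instructions = []
    for facet in facets:
        task_prompt = (
            "Given a text and one of its facets, write a task instruction that is "
            "answerable from the text and focuses on that facet.\n\n"
            f"Text: {input_text}\nFacet: {facet}\nInstruction:"
        )
        instructions.append(chat(task_prompt))
    return instructions

def filter_overlapping(new_instructions: list[str], existing: list[str], threshold: float = 0.7) -> list[str]:
    """Step 4 (sketch): drop instructions whose ROUGE-L F1 with any kept instruction exceeds the threshold."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    kept = list(existing)
    result = []
    for inst in new_instructions:
        if all(scorer.score(prev, inst)["rougeL"].fmeasure <= threshold for prev in kept):
            result.append(inst)
            kept.append(inst)
    return result
```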

Data Analysis


Table 1: Statistics of MUFFIN.

Table 1 lists the detailed statistics of the 68K (instruction, input, output) instances of MUFFIN. It is worth mentioning that some inputs share the same instructions because of our "instruction rematching" mechanism.


Figure 3: Human evaluation on the data quality.

To assess data quality, we randomly selected 200 instances from our dataset (200 random inputs, 1 random instruction per input). Two NLP graduate students were involved in the evaluation. Following previous work, each annotator answered three questions for each instance: i) whether the instruction describes a valid task (A1), given only the instruction; ii) whether the instruction appropriately matches the input (A2), given only the instruction-input pair; and iii) whether the output correctly responds to the instruction and input (A3). We also recorded instances where all three fields were correct (A4). Figure 3 shows the correct ratio for each question.
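As a minimal sketch of how these per-question correct ratios (A1-A4) can be aggregated, assuming each instance has already been reduced to a single boolean per question (e.g., after adjudicating between the two annotators); the field names are illustrative.

```python
# Hypothetical per-instance judgments after adjudication: one boolean per question.
annotations = [
    {"A1": True, "A2": True, "A3": True},
    {"A1": True, "A2": False, "A3": False},
    # ... 200 instances in total
]

def correct_ratios(records):
    """Correct ratio for A1-A3, plus A4 = all three fields correct."""
    n = len(records)
    ratios = {q: sum(r[q] for r in records) / n for q in ("A1", "A2", "A3")}
    ratios["A4"] = sum(all(r.values()) for r in records) / n
    return ratios

print(correct_ratios(annotations))
```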

Automatic Evaluation Results


Table 2: Main results.

We evaluate models (T5-3B & T5-11B) trained with MUFFIN on four widely adopted zero-shot benchmarks: SuperNI-Test, MMLU, T0-Eval, and BBH. Compared with previous LLM-generated datasets, the models tuned on our MUFFIN achieve better performance on 3 out of 4 benchmarks under various metrics. We also find that the larger model benefits more from tuning on MUFFIN (an average improvement of 4.42 for T5-3B vs. 8.03 for T5-11B over the strongest baseline). Notably, MUFFIN achieves comparable or even better performance than models trained on human-crafted datasets (e.g., SuperNI-Train, T0, T0++).


Table 3: Results based on Llama2-13B.

Additionally, we report results based on Llama2 to further support the above conclusion, fine-tuning Llama2 with LoRA (low-rank adaptation) on the various datasets. The advantage of MUFFIN is consistent across different model architectures (i.e., encoder-decoder and decoder-only).
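For reference, a minimal sketch of wrapping Llama2-13B with LoRA adapters via the Hugging Face peft library is shown below; the rank, target modules, and other hyperparameters are illustrative assumptions rather than the exact configuration used in the paper, and the wrapped model would then be fine-tuned on the instruction data with a standard causal-LM training loop.

```python
# pip install torch transformers peft
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-13b-hf"  # gated checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# LoRA trains small low-rank adapter matrices instead of all 13B parameters.
lora_cfg = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```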

Human Evaluation Results


Table 4: Human evaluation acceptance ratio. We randomly sample 200 instances from each benchmark and let workers evaluate different systems’ outputs.

Table 4 shows the human evaluation results (acceptance ratios). Compared with the strongest results among all 8 baselines, MUFFIN mostly achieves better human acceptance ratios by large margins (except for a slightly lower result on T0-Eval).


Table 5: Pair-wise comparison between MUFFIN (Ours) and three strong baselines, namely Self-Instruction (Self-Inst.), Unnatural Instruction (Unnatural), and SuperNI, across four benchmarks.

Table 5 shows the pair-wise comparison results between MUFFIN and three other baselines. Compared with the LLM-generated datasets (Self-Instruction and Unnatural Instruction), our MUFFIN consistently gains higher human preference across the four benchmarks, aligning with the previous conclusion. Notably, owing to its diverse tasks and our novel paradigm, MUFFIN can even surpass the high-quality, human-crafted SuperNI on three out of four benchmarks.

Further Analysis


Figure 4: The scaling trends comparison between MUFFIN and the previous baseline datasets (average performances on all four benchmarks).

For all baseline datasets under the "direct comparison" setting in Table 2, we randomly sample subsets (10%, 30%, 50%, 80%, 100%) and train T5-3B on each subset to show the performance trends. As shown in Figure 4, MUFFIN exceeds the baselines by a noteworthy margin (average scores over the four evaluation benchmarks). The other baselines would only become comparable to our results if they were scaled to several times the size of our data. Therefore, we conjecture that our paradigm is more efficient than simply collecting more data under the other paradigms.


Figure 5: The performance of mixtures of MUFFIN and SuperNI (human-crafted data).

We mix MUFFIN with the training set of SuperNI at various proportions, where the proportion denotes the fraction of instances drawn from MUFFIN. We then train T5-3B on these mixtures and report the average performance on the four benchmarks. As illustrated in Figure 5, mixing MUFFIN with the human-annotated SuperNI actually performs worse than using MUFFIN alone for training. We conjecture that the dataset paradigm affects learning efficiency: different paradigms clearly have different impacts on the model's performance, so combining two paradigms can even hurt instruction-following, underscoring that the paradigm itself is critical.
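For clarity, here is a minimal sketch of how such a mixture could be constructed under a fixed total budget, with the stated proportion of instances drawn from MUFFIN and the remainder from SuperNI; the exact sampling procedure in the paper may differ.

```python
import random

def mix_datasets(muffin, superni, muffin_proportion, total, seed=42):
    """Sample a training mixture where `muffin_proportion` of the instances come from MUFFIN."""
    rng = random.Random(seed)
    n_muffin = int(total * muffin_proportion)
    mixture = rng.sample(muffin, n_muffin) + rng.sample(superni, total - n_muffin)
    rng.shuffle(mixture)
    return mixture

# e.g., a 10K-instance mixture with 30% MUFFIN instances:
# train_set = mix_datasets(muffin_data, superni_data, muffin_proportion=0.3, total=10_000)
```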

Citation

Please kindly cite our paper if you use our code, data, models or results:


@inproceedings{Lou2023MUFFIN,
  title={{MUFFIN}: Curating Multi-Faceted Instructions for Improving Instruction Following},
  author={Renze Lou and Kai Zhang and Jian Xie and Yuxuan Sun and Janice Ahn and Hanzi Xu and Yu Su and Wenpeng Yin},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=1vrS1zwekw}
}