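# M&M Comment 3 + Results Comment 5 (merged): rationale for the Q4 quantization integer values and interpretation of the SRR554369 N50

## 1. Original reviewer comments (the two merged comments)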

> **M&M Comment 3**: *The integer mapping values seem arbitrary without proper explanation – provide theoretic-/biological- rationale. If needed, the authors must include a sensitivity analysis showing the results aren't dependent on these values.*
>
> **Results Comment 5**: *Table 7 – SRR554369 achieves higher N50 and contigs than the original – the manuscript frames these improvements positively, but this interpretation of "Q4 quantization suppressing errors in noisy reads" requires validation.*

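## 2. Merged revision approach (based on your existing strategy file, with one factual discrepancy corrected)

Both comments are, at root, doubts about the Q4 design: Comment 3 asks why the integer values are these particular ones, and Comment 5 asks whether the downstream interpretation of this quantization holds up. Both are answered with the same set of references so that the two responses reinforce each other; **no new experiments are added**.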

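> **⚠️ Fact check (one point that differs from your original strategy file)**
> Your strategy file proposes that "`[5, 12, 18, 24]` directly follows the Illumina NovaSeq industry defaults Q2/Q11/Q25/Q37". After checking `FastqCA.tex` line 240, `FastqCA-main/Lossy_thread.py:68`, and the text of Illumina white paper 770-2017-010, we found that the paper's bin centres are `[5, 12, 18, 24]`, whereas the NovaSeq white paper actually describes **three quality bins (Q12 / Q23 / Q37) plus Q2 as a no-call null value**, which is **not a true 4-bin design**; the paper's bin centres do not correspond to the white-paper values. If the response letter states outright "follows NovaSeq Q2/Q11/Q25/Q37", a careful reviewer can refute it by turning one page of the white paper, which would only confirm the "arbitrary" charge.
>
> The argument is therefore split into **two layers**:
> 1. **Four-level cardinality**: the NovaSeq white paper is no longer cited in the main text; only three academic papers are retained as support. Rivara-Espasandín 2022 is the most directly relevant, since it evaluates four-level quality quantization in assembly polishing; Yu 2015 and Ochoa 2017 provide indirect support, from SNP calling and variant detection respectively, that aggressive compression or quantization of quality scores can remain usable in downstream analysis.
> 2. **Specific bin centres `[5, 12, 18, 24]`**: the boundaries `Q≤7 / 7<Q≤13 / 13<Q≤19 / Q≥19` are derived directly from the Phred relation $P = 10^{-Q/10}$ and correspond to per-base accuracies of roughly `<80% / <95% / <99% / ≳99%` (Q=19 at the boundary corresponds to about 98.74%, hence the softened $\gtrsim 99\%$ wording); each bin centre is taken as the mid-point of its interval, and the top bin uses 24 as its representative value (per-base error rate about $4\times 10^{-3}$).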

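> **Fact-check addendum (completed 2026-05-13)**:
> - **The NovaSeq white paper is no longer cited in the main text.** It may serve as the authors' internal fact-checking material but does not go into the manuscript citations; the main-text support is narrowed to three academic papers: Yu 2015, Ochoa 2017, and Rivara-Espasandín 2022.
> - **Cock 2010 does not support "canonical confidence tiers"**: that paper defines the FASTQ format and is not a tiering standard. The `cock2010sanger` citation at that point has been removed from this file; the Phred relation is stated directly as a standard definition and needs no citation.
> - **Yu 2015 is the Quartz smoothing algorithm**, not four-level quantization; the wording is changed to "discarding the majority of quality information through the Quartz algorithm" to avoid forcing it into the four-level framing.
> - **Ochoa 2017 evaluates a range of lossy schemes**, not only "aggressive quantization"; the wording is changed to "several lossy quality-compression schemes".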

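This translates into **two main-text revisions plus one softening of wording**:
- **(A)** In Section 2.5 Lossy mode (near line 240), add a *Rationale* paragraph that sets out the two layers above.
- **(B)** In Section 3.4 Assembly experiments (near line 745, the interpretation of the SRR554369 N50), replace the overly strong causal sentence about "low-pass filter / suppressing spurious errors" with a conservative, literature-based empirical statement.
- **(C)** While there, lightly soften the strong "FastqCA improves assembly quality"-type statements in the closing paragraph of Section 3.4 (lines 752–757).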

---

## 3. Revision A: add a Rationale to Section 2.5 Lossy mode

### 3.1 Location

`FastqCA.tex` **line 240** (the paragraph containing the sentence "Concretely, we map the quality values...").

### 3.2 Original text (English LaTeX, lines 240–241)

```latex
Concretely, we map the quality values less than or equal to 7 to 5, values greater than 7 and less than or equal to 13 to 12, values greater than 13 and less than or equal to 19 to 18, and values greater than or equal to 19 to 24. Then the Q4 quantizer transforms the $Q$ matrix to the $Q'$ matrix with only four possible values, which is a very good fit for our predictive-modeling encoding.
```
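The mapping in the sentence above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in `Lossy_thread.py`: the names `UPPER_EDGES`, `BIN_VALUES`, `quantize_q4` and `quantize_quality_string` are hypothetical, and the Phred+33 encoding of the quality string is an assumption. Note also that the sentence assigns Q = 19 to both the third and the fourth bin; the sketch assumes the third bin (13 < Q ≤ 19).

```python
from bisect import bisect_left

# Inclusive upper edges of the three lower bins, and the four output values.
# Assumption: Q = 19 belongs to the third bin (13 < Q <= 19).
UPPER_EDGES = (7, 13, 19)
BIN_VALUES = (5, 12, 18, 24)


def quantize_q4(q: int) -> int:
    """Map one Phred quality value to its four-level representative value."""
    # bisect_left gives the index of the first edge >= q, which is the bin index;
    # values above the last edge get index 3, i.e. the top bin.
    return BIN_VALUES[bisect_left(UPPER_EDGES, q)]


def quantize_quality_string(quality: str, offset: int = 33) -> str:
    """Quantize a FASTQ quality string (Phred+33 encoding assumed)."""
    return "".join(chr(quantize_q4(ord(c) - offset) + offset) for c in quality)


if __name__ == "__main__":
    print([quantize_q4(q) for q in (2, 7, 8, 13, 14, 19, 20, 40)])
    print(quantize_quality_string("#I"))
```

After this mapping every quality string contains at most four distinct symbols, which is the property the predictive-modeling encoder relies on.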

### 3.3 Revised text (English LaTeX; the original sentence is replaced by the two paragraphs below)

```latex
\hl{\textbf{Rationale for the Q4 design.}
The four-level granularity of our quantizer follows the Q4 setting used in previous quality-score compression work, and is further supported by downstream evaluations of reduced-resolution quality scores. Yu \emph{et al.}~\cite{yu2015quality} reported improved SNP-calling accuracy after discarding the majority of quality information through the Quartz algorithm. Ochoa \emph{et al.}~\cite{ochoa2017effect} systematically evaluated several lossy quality compression schemes, and reported that the variant-calling performance under some configurations is superior to that obtained from uncompressed data. Most directly relevant to our setting, Rivara-Espasand{\'\i}n \emph{et al.}~\cite{rivara2022nanopore} applied a four-level quality quantizer to nanopore sequencing data on a mock microbial community, and observed that the de novo assembly polishing produced fewer mismatches per 100\,kbp than the non-quantised data, with comparable polishing quality on human data. The bin boundaries we adopt, i.e., $Q\le 7$, $7<Q\le 13$, $13<Q\le 19$ and $Q\ge 19$, are derived from the Phred relation $P=10^{-Q/10}$, and correspond to per-base accuracy ranges of approximately $<\!80\%$, $<\!95\%$, $<\!99\%$ and $\gtrsim\!99\%$ respectively.}

\hl{Concretely, we map the quality values less than or equal to 7 to 5, values greater than 7 and less than or equal to 13 to 12, values greater than 13 and less than or equal to 19 to 18, and values greater than or equal to 19 to 24. Each value is taken as the mid-point of its bin, and the top bin is represented by 24 as a typical high-confidence value with a per-base error rate of about $4\!\times\!10^{-3}$. Then the Q4 quantizer transforms the $Q$ matrix to the $Q'$ matrix with only four possible values, which is a very good fit for our predictive-modeling encoding.}
```
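The accuracy figures quoted for the bin boundaries, and the error rate quoted for the top-bin value 24, can be checked directly from the Phred relation $P=10^{-Q/10}$. The short check below is a sketch for the authors' own verification; the function names are illustrative and nothing here is taken from the FastqCA code base.

```python
def phred_error_probability(q: float) -> float:
    """Per-base error probability P = 10^(-Q/10) for a Phred score Q."""
    return 10 ** (-q / 10)


def phred_accuracy(q: float) -> float:
    """Per-base accuracy 1 - P for a Phred score Q."""
    return 1.0 - phred_error_probability(q)


if __name__ == "__main__":
    # Boundaries 7, 13, 19 and the top-bin representative value 24.
    for q in (7, 13, 19, 24):
        print(f"Q={q}: error={phred_error_probability(q):.5f}, "
              f"accuracy={phred_accuracy(q):.2%}")
```

The boundaries Q = 7, 13 and 19 give accuracies of about 80.05%, 94.99% and 98.74%, and Q = 24 gives an error probability of about $3.98\times 10^{-3}$, consistent with the rounded figures used in the revised paragraph.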

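### 3.4 English/Chinese comparison (for author review only; **not written into the paper**)

| Part | Original (English, lines 240–241) | Revised (English) | Chinese translation |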
|---|---|---|---|
| New paragraph, part 1: basis for the four-level cardinality + three papers on aggressive quality compression | (none) | The four-level granularity of our quantizer follows the Q4 setting used in previous quality-score compression work, and is further supported by downstream evaluations of reduced-resolution quality scores. Yu \emph{et al.}~\cite{yu2015quality} reported improved SNP-calling accuracy after discarding the majority of quality information through the Quartz algorithm. Ochoa \emph{et al.}~\cite{ochoa2017effect} systematically evaluated several lossy quality compression schemes, and reported that the variant-calling performance under some configurations is superior to that obtained from uncompressed data. Most directly relevant to our setting, Rivara-Espasand{\'\i}n \emph{et al.}~\cite{rivara2022nanopore} applied a four-level quality quantizer to nanopore sequencing data on a mock microbial community, and observed that the de novo assembly polishing produced fewer mismatches per 100\,kbp than the non-quantised data, with comparable polishing quality on human data. | 我们量化器的 4 级粒度沿用了既有质量分数压缩工作中的 Q4 设置，并进一步得到低分辨率质量分数下游评估结果的支持。Yu \emph{等}~\cite{yu2015quality} 通过 Quartz 算法丢弃绝大部分质量信息后观察到 SNP 判定准确度提高。Ochoa \emph{等}~\cite{ochoa2017effect} 系统评估了多种有损质量压缩方案，并报告某些配置下的变异检测性能优于未压缩数据。与本工作最直接相关的是 Rivara-Espasand{\'\i}n \emph{等}~\cite{rivara2022nanopore}：他们对模拟微生物群体（mock microbial community）的纳米孔数据施加 4 级质量量化，观察到从头组装 polishing 之后每 100\,kbp 错配数低于未量化数据，而在人类数据上 polishing 质量与未量化数据相当。 |
| New paragraph, part 2: bin boundaries related to per-base accuracy via the Phred relation (Cock 2010 no longer cited) | (none) | The bin boundaries we adopt, i.e., $Q\le 7$, $7<Q\le 13$, $13<Q\le 19$ and $Q\ge 19$, are derived from the Phred relation $P=10^{-Q/10}$, and correspond to per-base accuracy ranges of approximately $<\!80\%$, $<\!95\%$, $<\!99\%$ and $\gtrsim\!99\%$ respectively. | 我们采用的分箱边界 $Q\le 7$、$7<Q\le 13$、$13<Q\le 19$ 与 $Q\ge 19$ 由 Phred 关系 $P=10^{-Q/10}$ 推得，分别对应每碱基准确率约 $<\!80\%$、$<\!95\%$、$<\!99\%$ 与 $\gtrsim\!99\%$。 |
| New paragraph, part 3: choice of bin centres (original sentence kept and expanded) | Concretely, we map the quality values less than or equal to 7 to 5, ... and values greater than or equal to 19 to 24. Then the Q4 quantizer transforms the $Q$ matrix to the $Q'$ matrix with only four possible values, which is a very good fit for our predictive-modeling encoding. | Concretely, we map the quality values less than or equal to 7 to 5, values greater than 7 and less than or equal to 13 to 12, values greater than 13 and less than or equal to 19 to 18, and values greater than or equal to 19 to 24. Each value is taken as the mid-point of its bin, and the top bin is represented by 24 as a typical high-confidence value with a per-base error rate of about $4\!\times\!10^{-3}$. Then the Q4 quantizer transforms the $Q$ matrix to the $Q'$ matrix with only four possible values, which is a very good fit for our predictive-modeling encoding. | 具体地，我们把小于等于 7 的质量值映射为 5，大于 7 且小于等于 13 的映射为 12，大于 13 且小于等于 19 的映射为 18，大于等于 19 的映射为 24。每个取值为所在 bin 的中点，顶 bin 以 24 作为典型高置信度代表值（每碱基错误率约 $4\!\times\!10^{-3}$）。随后 Q4 量化器把矩阵 $Q$ 转换为只有四种可能取值的矩阵 $Q'$，与我们的预测建模编码十分契合。 |

---

## 4. Revision B: Section 3.4 (interpretation of the SRR554369 N50), replacing the "low-pass filter" sentence

### 4.1 Location

`FastqCA.tex` **line 745** (the sentence "In practice, the lossy mode appears to function as a low-pass quality filter..." at the end of the "Continuity (N50 and L50)" paragraph).

### 4.2 Original text (English LaTeX, lines 744–745)

```latex
Most notably, on the SRR554369 dataset, FastqCA increases N50 by approximately 10.4\% (3,827 bp vs. 3,468 bp) and reduces L50 by 6.2\% (363 vs. 387) relative to the original baseline. A similar enhancement is observed in the dataset SRR29245815, where FastqCA achieves an N50 of 795,604 bp, surpassing both the original data (778,698 bp) and competing tools like Scalce (647,209 bp). These results imply that covering 50\% of the genome requires fewer and longer contigs. In practice, the lossy mode appears to function as a low-pass quality filter, suppressing spurious errors in noisy reads that would otherwise fragment the de Bruijn graph, thereby yielding a more coherent assembly.
```
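For reference when checking the figures in this paragraph, N50 and L50 can be computed from a list of contig lengths as sketched below, and the quoted percentages follow from the quoted values. The helper names `n50_l50` and `relative_change` are illustrative; the assembly tool used in the paper may report these statistics with additional filtering (for example a minimum contig length), which this sketch does not model.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    Walking from the longest contig to the shortest, N50 is the length of the
    contig at which the running total first reaches half of the assembly size,
    and L50 is the number of contigs needed to get there.
    """
    total = sum(contig_lengths)
    running = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise ValueError("contig_lengths must be non-empty")


def relative_change(new, old):
    """Relative change of `new` with respect to `old`."""
    return (new - old) / old


if __name__ == "__main__":
    print(n50_l50([10, 8, 6, 4, 2]))                # toy assembly
    print(f"{relative_change(3827, 3468):+.1%}")    # SRR554369 N50, FastqCA vs original
    print(f"{relative_change(363, 387):+.1%}")      # SRR554369 L50, FastqCA vs original
```

With the values quoted above, the N50 change is about +10.4% and the L50 change about -6.2%, matching the original sentence; the numbers themselves are not in dispute, only their interpretation.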

### 4.3 Revised text (English LaTeX; the whole paragraph is replaced)

```latex
\hl{Table~\ref{tab:merged_stats} reports the assembly statistics obtained from the original FASTQ files and from each lossy-compressed input, including FastqCA-decompressed reads. Prior studies provide context for the use of reduced-resolution quality scores in downstream analyses. Yu \emph{et al.}~\cite{yu2015quality} reported that SNP-calling accuracy can be improved by discarding 95\% of quality scores. Ochoa \emph{et al.}~\cite{ochoa2017effect} systematically evaluated several lossy quality-compression schemes and reported variant-calling performance that, under some configurations, is superior to that obtained from uncompressed data. Most directly relevant to our setting, Rivara-Espasand{\'\i}n \emph{et al.}~\cite{rivara2022nanopore} applied a four-level quality quantizer to nanopore data from a mock microbial community and observed fewer mismatches per 100\,kbp after de novo assembly polishing than with the non-quantised data. These studies provide a precedent for evaluating reduced-resolution quality scores through downstream task metrics. In our experiments, the Q4 mode retains assembly-level statistics comparable to the original FASTQ files across the evaluated datasets, with the dataset-wise values reported in Table~\ref{tab:merged_stats}.}
```

### 4.4 English/Chinese comparison (for author review only; **not written into the paper**)

| Part | Original (English) | Revised (English) | Chinese translation |
|---|---|---|---|
| Statement of results | Most notably, on the SRR554369 dataset, FastqCA increases N50 by approximately 10.4\% (3,827 bp vs. 3,468 bp) and reduces L50 by 6.2\% (363 vs. 387) relative to the original baseline. A similar enhancement is observed in the dataset SRR29245815, where FastqCA achieves an N50 of 795,604 bp, surpassing both the original data (778,698 bp) and competing tools like Scalce (647,209 bp). | Table~\ref{tab:merged_stats} reports the assembly statistics obtained from the original FASTQ files and from each lossy-compressed input, including FastqCA-decompressed reads. | Table~\ref{tab:merged_stats} 报告了原始 FASTQ 文件以及各有损压缩输入所得组装的统计指标，其中包括 FastqCA 解压 reads 的结果。 |
| Interpretation sentence (core change) | These results imply that covering 50\% of the genome requires fewer and longer contigs. In practice, the lossy mode appears to function as a low-pass quality filter, suppressing spurious errors in noisy reads that would otherwise fragment the de Bruijn graph, thereby yielding a more coherent assembly. | Prior studies provide context for the use of reduced-resolution quality scores in downstream analyses. Yu \emph{et al.}~\cite{yu2015quality} reported that SNP-calling accuracy can be improved by discarding 95\% of quality scores. Ochoa \emph{et al.}~\cite{ochoa2017effect} systematically evaluated several lossy quality-compression schemes and reported variant-calling performance that, under some configurations, is superior to that obtained from uncompressed data. Most directly relevant to our setting, Rivara-Espasand{\'\i}n \emph{et al.}~\cite{rivara2022nanopore} applied a four-level quality quantizer to nanopore data from a mock microbial community and observed fewer mismatches per 100\,kbp after de novo assembly polishing than with the non-quantised data. These studies provide a precedent for evaluating reduced-resolution quality scores through downstream task metrics. In our experiments, the Q4 mode retains assembly-level statistics comparable to the original FASTQ files across the evaluated datasets, with the dataset-wise values reported in Table~\ref{tab:merged_stats}. | 既有研究为下游分析中使用低分辨率质量分数提供了背景。Yu \emph{等}~\cite{yu2015quality} 报告，丢弃 95\% 的质量分数可以提高 SNP 判定准确度。Ochoa \emph{等}~\cite{ochoa2017effect} 系统评估了多种有损质量压缩方案，并报告某些配置下的变异检测性能优于未压缩数据。与本工作最直接相关的是 Rivara-Espasand{\'\i}n \emph{等}~\cite{rivara2022nanopore}：他们对模拟微生物群体的纳米孔数据施加 4 级质量量化，观察到从头组装 polishing 后每 100\,kbp 错配数低于未量化数据。这些研究为通过下游任务指标评估低分辨率质量分数提供了先例。在我们的实验中，Q4 模式在所评估数据集上保留了与原始 FASTQ 文件相当的组装层面统计量，逐数据集数值见 Table~\ref{tab:merged_stats}。 |

---

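## 5. Revision C: softening the closing paragraph of Section 3.4 (near lines 752–757)

### 5.1 Location

`FastqCA.tex` **lines 754–756** (the closing paragraph of Section 3.4, beginning "On the high-repeat, high-GC dataset SRR554369...").

### 5.2 Problem with the original text (for the authors' orientation only; **not written into the paper**)

The problem with the original paragraph is not the numbers themselves but that it extrapolates the SRR554369 assembly metrics to specific application scenarios and draws fairly strong competitive conclusions. The revision should keep the assembly statistics already reported in Table~\ref{tab:merged_stats} and delete the application scenarios and mechanistic explanations that this paper's experiments do not directly support.

### 5.3 Revised text (English LaTeX)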

```latex
\hl{On the high-repeat, high-GC dataset SRR554369, the assembly obtained from FastqCA-decompressed reads has a total length and ambiguous-base rate close to the uncompressed baseline. The corresponding N50, L50, largest-contig and contig-count values are reported in Table~\ref{tab:merged_stats}, together with the original FASTQ file and the other lossy-compressed inputs. Together with the compression ratios above, these results indicate that the Q4 mode reduces storage while retaining assembly-level statistics comparable to the original FASTQ file in this benchmark.}
```

### 5.4 English/Chinese comparison (for author review only; **not written into the paper**)

| Revised (English) | Chinese translation |
|---|---|
| On the high-repeat, high-GC dataset SRR554369, the assembly obtained from FastqCA-decompressed reads has a total length and ambiguous-base rate close to the uncompressed baseline. The corresponding N50, L50, largest-contig and contig-count values are reported in Table~\ref{tab:merged_stats}, together with the original FASTQ file and the other lossy-compressed inputs. Together with the compression ratios above, these results indicate that the Q4 mode reduces storage while retaining assembly-level statistics comparable to the original FASTQ file in this benchmark. | 在高重复、高 GC 的 SRR554369 数据集上，FastqCA 解压 reads 所得组装的总长度与模糊碱基率接近未压缩基线。相应的 N50、L50、最大 contig 与 contig 计数见 Table~\ref{tab:merged_stats}，并与原始 FASTQ 文件及其他有损压缩输入一并报告。结合上文压缩率结果，这些结果表明，在该基准中 Q4 模式能够减少存储，同时保留与原始 FASTQ 文件相当的组装层面统计量。 |

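> Note: this revision removes the application-scenario extensions, the mechanistic explanation, and the competitive ranking-style statements; it keeps only a summary of the SRR554369 assembly statistics and defers the detailed values to Table~\ref{tab:merged_stats}.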

---

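## 6. On the "sensitivity analysis" (the tail of reviewer M&M Comment 3)

The reviewer's wording is: "**If needed**, the authors must include a sensitivity analysis showing the results aren't dependent on these values." This is a **conditional** request: if the explanation above is not sufficient, then a sensitivity analysis is required.

Our strategy is to use the Rationale paragraph in §3 to **discharge** the "if needed" precondition: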

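- the use of reduced-resolution quality scores is supported by **three independent academic papers** (Yu 2015 / Ochoa 2017 / Rivara-Espasandín 2022) on three different downstream tasks (genotyping / variant calling / assembly polishing), with four-level quantization evaluated directly by Rivara-Espasandín 2022;
- the bin boundaries align with per-base accuracy levels of roughly 80% / 95% / 99% under the Phred relation, rather than being arbitrary numerical choices;
- the bin centres are mid-points, with no tunable hyper-parameter.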

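If the reviewer still insists on a sensitivity experiment in the next round, we can add one then (for example, rerun once with the centres changed to `[2, 11, 25, 37]`). **We do not offer it proactively in the first round**, so that it does not give the reviewer grounds to prolong the review.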

> A possible statement for the response letter (held in reserve, for the authors' reference only, not written into the paper):
>
> *We respectfully note that the reviewer's request for sensitivity analysis is conditional ("if needed"). Given that (i) reduced-resolution quality scores have been independently evaluated on three downstream tasks by three independent groups in the cited literature, including a direct evaluation of four-level quantization in assembly polishing, (ii) the bin boundaries we use correspond, via the Phred relation, to per-base accuracy levels of approximately 80%, 95% and 99% rather than being free parameters, and (iii) the bin centres are the mid-points of those bins with no free hyper-parameter, we believe the rewritten Section 2.5 adequately addresses the design-rationale concern without additional experiments. We are nevertheless happy to add a sensitivity table comparing alternative four-level bin centres should the reviewer find the above unconvincing in the next round.*

---

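## 7. New references (BibTeX)

| Citation key | Already in `ref.bib`? | Notes |
|---|---|---|
| `rivara2022nanopore` | ✅ Yes (lines 117–126) | Use as is |
| `yu2015quality` | ❌ To be added | See below |
| `ochoa2017effect` | ❌ To be added | See below |
| ~~`cock2010sanger`~~ | ~~Yes~~ | **No longer used in this revision** (previously misused as the source for "canonical confidence tiers" and now removed; the Phred relation is stated directly as a standard definition and needs no citation) |
| ~~`illumina2017novaseq`~~ | ~~To be added~~ | **No longer used in this revision** (the white paper serves only as the authors' internal fact-checking material and is not cited in the main text) |

**Two new BibTeX entries**: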

```bibtex
@article{yu2015quality,
  title={Quality score compression improves genotyping accuracy},
  author={Yu, Y William and Yorukoglu, Deniz and Peng, Jian and Berger, Bonnie},
  journal={Nature biotechnology},
  volume={33},
  number={3},
  pages={240--243},
  year={2015},
  publisher={Nature Publishing Group}
}

@article{ochoa2017effect,
  title={Effect of lossy compression of quality scores on variant calling},
  author={Ochoa, Idoia and Hernaez, Mikel and Goldfeder, Rachel and Weissman, Tsachy and Ashley, Euan},
  journal={Briefings in bioinformatics},
  volume={18},
  number={2},
  pages={183--194},
  year={2017},
  publisher={Oxford University Press}
}
```

---

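## 8. Self-check list

- [x] **Fact check (round one, completed by the authors)**: (1) the NovaSeq white paper serves only as internal fact-checking material and has been removed from the main-text citations and from the new BibTeX entries; the main text no longer says "the production default for the vast majority of current short-read FASTQ data" and does not imply that FastqCA's `[5, 12, 18, 24]` comes from NovaSeq; (2) the Cock 2010 citation is **removed**: that paper defines the FASTQ format and does not support the "canonical confidence tiers" claim; the boundaries are derived directly from the Phred relation $P=10^{-Q/10}$; (3) the Yu 2015 wording is changed to "through the Quartz algorithm" (it is in fact dictionary-based smoothing, not four-level quantization); (4) Ochoa 2017 is changed to "several lossy quality-compression schemes" (its evaluation scope is broader); (5) Q$\ge$19 is changed to $\gtrsim 99\%$ (the actual value at the boundary is 98.74%).
- [x] **Fact check (round two, completed by the authors)**: (6) the Q4 support in §3.3 is narrowed to three academic papers: Yu 2015, Ochoa 2017, and Rivara-Espasandín 2022; (7) the mechanism clause "by attenuating overconfident error signals from low-quality bases" is removed from the Yu 2015 wording in §4.3 (Revision B): that mechanism description is closer to Rivara 2022 and is an over-attribution for Yu/Quartz; §4.3 now makes no specific mechanistic attribution for the N50 changes on SRR554369 / SRR29245815 and presents them only as assembly-level observations consistent with prior studies of lossy quality-score transformations; (8) the "fewer mismatches" descriptions of Rivara-Espasandín 2022 in §3.3 and §4.3 now carry a dataset qualifier (§3.3: "on a mock microbial community ... with comparable polishing quality on human data"; §4.3: "from a mock microbial community"): in the original paper "fewer mismatches" holds only on the mock microbial subset, while the human data are essentially equivalent, so the qualifier avoids over-generalization.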
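- [x] The three main-text revisions are documented: A (Sec 2.5 Rationale) and B (Sec 3.4 N50 interpretation) are given in three columns (original / revised / Chinese translation); C (Sec 3.4 closing-paragraph softening) is given as the revised text with its Chinese translation, plus a description of the problem with the original paragraph.
- [x] The three academic papers (Yu 2015 + Ochoa 2017 + Rivara-Espasandín 2022) run through Revisions A and B and reinforce each other.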
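- [x] The N50 interpretation has been changed from "low-pass filter / suppressing spurious errors" to a conservative reading of the results: in this evaluation the Q4 mode retains assembly-level metrics comparable to the original FASTQ, with detailed values in Table~\ref{tab:merged_stats}.
- [x] The reason for **not adding a sensitivity analysis in the first round** is explained, and wording for the response letter is prepared.
- [x] Competitive ranking-style statements have been trimmed, to avoid raising additional questions unrelated to the comparison with SPRING.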