Upload files to "/"

2026-05-14 01:54:20 +09:00
parent 345ce58f36
commit 684d85b4fe
2 changed files with 283 additions and 0 deletions
--- a/Introduction-意见2-致谢先前工作.md
+++ b/Introduction-意见2-致谢先前工作.md
@@ -0,0 +1,117 @@
+# Introduction, Comment 2: Acknowledging prior work on exploiting redundancy across reads and positions
+
+## 1. Reviewer's original comment
+
+> The idea of exploiting redundancy across reads and positions (FQSqueezer, SPRING and FaStore do this in different contexts) – other than the spatial aspect is not new. The distinction must be made explicit by the authors, and they should acknowledge prior work in this area without overstating novelty of this aspect in the manuscript.
+
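+## 2. Revision approach
+
+The reviewer asks for two things:
+1. **Acknowledge**: FQSqueezer, SPRING, and FaStore already exploit redundancy across reads (each through a different mechanism), so FastqCA cannot present "exploiting redundancy across reads and positions" as its own original contribution.
+2. **Clarify**: what is actually new in FastqCA is the specific angle of explicitly treating the read collection as a **two-dimensional spatial matrix** and using a **cellular automaton** to predict each position from its neighbours.
+
+This is implemented through two changes:
+
+- **(a)** Add a new paragraph to the Introduction **before** the paragraph beginning "In this paper, we propose FastqCA..." (currently line 108 of `FastqCA.tex`). The paragraph describes what the three works do and what they have in common, then states explicitly how FastqCA differs from them. This is the main response to the reviewer.
+- **(b)** Moderately narrow the wording of the current second Contribution (lines 114–115) so that the novelty claim converges on the specific "2D + cellular automaton" angle and no longer broadly presents "making use of spatial context" as new.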
+
+> **Fact-check results (each item checked against the original papers)**:
+> - **FaStore**: similarity-based clustering + dictionary/differential coding within each cluster + optional reordering. The description is accurate.
+> - **SPRING**: **not** "BWT/MEM-style". SPRING inherits from HARC and uses **hash-based read reordering**, bringing similar reads together through lookups in hash tables indexed by read **prefixes and suffixes** (SPRING Supplementary §1.2: "matches the prefix or the suffix of the current read"). In SPRING, BWT appears only in BSC, the general-purpose compressor used in the final stage, not in the read-alignment stage.
+> - **FQSqueezer**: **not** "context-mixing". Deorowicz 2020 states explicitly: "*make use of the ideas from the prediction by partial matching (PPM) and dynamic Markov coder (DMC) ... we designed a few fixed-k dictionaries for k-mers found in the reads.*" Context mixing is a term specific to the PAQ/LPAQ family and does not match FQSqueezer's actual mechanism.
+> - **FQSqueezer BibTeX DOI**: the correct DOI is `10.1038/s41598-020-57452-6` (2020 prefix), not the previously used `10.1038/s41598-019-57452-3`; `Scientific Reports` does not use issue numbers, so `number = {1}` should be removed.
+
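To make the 1D side of the contrast concrete, here is a minimal sketch of what "fixed-$k$ statistics on a linear symbol stream" can look like. It is a toy order-$k$ predictor written for this note, not FQSqueezer's implementation: the function name `kmer_hit_rate`, the single value of `k`, and the most-frequent-successor rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmer_hit_rate(reads, k):
    """Toy 1D model: predict each base from the k bases to its left in the same read.

    Statistics are pooled across all reads seen so far (a hypothetical scheme,
    not FQSqueezer's actual algorithm). Returns (correct_predictions, positions).
    """
    successors = defaultdict(Counter)  # k-mer -> counts of the base that followed it
    correct = positions = 0
    for read in reads:
        for i in range(k, len(read)):
            context = read[i - k:i]
            counts = successors[context]
            # Predict the most frequent successor seen so far, if any.
            if counts and counts.most_common(1)[0][0] == read[i]:
                correct += 1
            positions += 1
            counts[read[i]] += 1  # update only after predicting, as a decoder could
    return correct, positions
```

For `reads = ["ACGTAC", "ACGTAC"]` and `k = 3`, the call returns `(3, 6)`: the three predictable positions of the second read are all hit using contexts learned from the first. The model benefits from cross-read redundancy, but only through a shared pool of left-contexts; it has no notion of the read lying directly above.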
+---
+
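+## 3. Change A: New paragraph in the Introduction
+
+### 3.1 Insertion point
+
+`FastqCA.tex`, **after line 107 (the "Hybrid schemes..." paragraph) and before line 108 (the "In this paper, we propose FastqCA..." paragraph)**.
+
+### 3.2 Original text (context showing the two paragraphs between which the new paragraph is inserted)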
+
+```latex
+% ---------- line 106 (end of paragraph) ----------
+... and that calibrated binning is applied to quality values where appropriate.
+
+% ---------- insert the new paragraph here: after the paragraph above (↑) and before the paragraph below (↓) ----------
+
+% ---------- line 108 ----------
+In this paper, we propose FastqCA, a reference-free, cellular-automaton–based compressor for FASTQ files that reduces storage by exploiting two-dimensional spatial redundancy across reads and positions while preserving the original read order. ...
+```
+
+### 3.3 Revised text (new paragraph, English LaTeX)
+
+```latex
+\hl{The idea of exploiting redundancy across multiple reads instead of coding each read in isolation has been explored by several reference-free compressors. FaStore performs similarity-based clustering of reads and applies dictionary substitution within each cluster, with an optional read reordering stage that increases local similarity for the back-end coder \cite{roguski2018fastore}. SPRING brings similar reads together by hash-based read reordering, where reads are looked up in a hash table indexed by their prefixes and suffixes, and then applies specialized component-wise compression to identifiers (token-based), nucleotide sequences (HARC-derived encoding plus BSC), and quality scores (BSC, optionally with QVZ or 8-level binning quantization), all in a reference-free setting \cite{chandak2019spring}. FQSqueezer builds on ideas from prediction by partial matching (PPM) and dynamic Markov coding (DMC), and uses fixed-$k$ $k$-mer dictionaries to aggregate statistical evidence across the read stream for the prediction of each base \cite{deorowicz2020fqsqueezer}. All three methods target redundancy across reads, but they treat the data as a one-dimensional sequence, whether through pairwise read similarity, dictionary lookup of read prefixes and suffixes, or fixed-$k$ statistics on a linear symbol stream. FastqCA takes a different view of the data. The reads in a chunk are organized into a two-dimensional matrix indexed by (read, position), and the value at every position is predicted from its above, above-left and left neighbours by a cellular automaton whose rule table is dynamically updated during scanning. The novelty of this work lies in this two-dimensional spatial formulation based on a cellular automaton, rather than in the use of cross-read redundancy itself.}
+```
+
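+### 3.4 Reference rendering (for author review only; **not to be included in the paper**)
+
+> Before this work, the idea of exploiting redundancy across multiple reads, rather than coding each read in isolation, had already been explored by several reference-free compressors. FaStore clusters reads by similarity, performs dictionary substitution within each cluster, and adds an optional read-reordering stage that increases the local similarity seen by the back-end coder \cite{roguski2018fastore}. SPRING performs hash-based read reordering, bringing similar reads together through lookups in a hash table indexed by read prefixes **and suffixes**, and then applies specialized compression to each component in a reference-free setting: a token-based scheme for identifiers, HARC-derived encoding plus BSC for nucleotide sequences, and BSC for quality scores (optionally with QVZ or 8-level binning quantization) \cite{chandak2019spring}. FQSqueezer follows the ideas of prediction by partial matching (PPM) and dynamic Markov coding (DMC), using fixed-$k$ $k$-mer dictionaries to aggregate statistical evidence across the read stream when predicting each base \cite{deorowicz2020fqsqueezer}. All three methods target redundancy across reads, but all of them treat the data as a one-dimensional sequence: pairwise similarity between reads, dictionary lookup by read prefix/suffix, or fixed-$k$ statistics over a linear symbol stream. FastqCA takes a different view of the data: it organizes the reads in a chunk into a two-dimensional matrix indexed by (read, position), and predicts the value at each position from its above, above-left, and left neighbours using a cellular automaton whose rule table is updated dynamically during scanning. The novelty of this work lies in this "two-dimensional spatial formulation based on a cellular automaton", not in "exploiting cross-read redundancy" itself.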
+
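The neighbourhood prediction described in the new paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not FastqCA's implementation: the padding symbol `PAD`, the frequency-count rule table, the fallback to the symbol above, and the function names `neighbours`, `predict`, `encode`, and `decode` are hypothetical choices. The aim is only to show that an (above, above-left, left) predictor with an adaptively updated table yields a sparse residual that can be inverted losslessly.

```python
from collections import Counter, defaultdict

PAD = "-"  # hypothetical symbol for neighbours that fall outside the matrix


def neighbours(prev, cur, c):
    """Return (above, above_left, left) for column c of the current row."""
    above = prev[c] if c < len(prev) else PAD
    above_left = prev[c - 1] if 0 < c <= len(prev) else PAD
    left = cur[c - 1] if c > 0 else PAD
    return above, above_left, left


def predict(rule_table, key):
    """Most frequent symbol seen for this neighbourhood; fall back to the symbol above."""
    counts = rule_table[key]
    return counts.most_common(1)[0][0] if counts else key[0]


def encode(reads):
    """Scan the (read, position) matrix in raster order; keep only mispredicted cells."""
    rule_table = defaultdict(Counter)  # assumed rule table: neighbourhood -> symbol counts
    residuals = {}                     # sparse residual: {(row, col): actual symbol}
    prev = ""
    for r, read in enumerate(reads):
        for c, actual in enumerate(read):
            key = neighbours(prev, read, c)
            if predict(rule_table, key) != actual:
                residuals[(r, c)] = actual
            rule_table[key][actual] += 1  # update the rule after coding the cell
        prev = read
    return residuals


def decode(residuals, lengths):
    """Replay the same scan; the decoder rebuilds an identical rule table."""
    rule_table = defaultdict(Counter)
    reads, prev = [], ""
    for r, n in enumerate(lengths):
        cur = []
        for c in range(n):
            key = neighbours(prev, cur, c)
            symbol = residuals.get((r, c), predict(rule_table, key))
            rule_table[key][symbol] += 1
            cur.append(symbol)
        prev = "".join(cur)
        reads.append(prev)
    return reads
```

Because the table is updated only after each cell is coded, `decode(encode(reads), [len(r) for r in reads])` reproduces `reads` in their original order. Passing the read lengths separately is a simplification of this sketch.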
+---
+
+## 4. Change B: Narrowing the wording of the second Contribution
+
+### 4.1 Location
+
+`FastqCA.tex`, **lines 114–115** (the second item in the Contributions list).
+
+### 4.2 Original text (English LaTeX)
+
+```latex
+\item A \textbf{predictive-modeling encoding algorithm} is proposed for transforming nucleotide sequences and quality scores to compact encodings through making use of spatial context to model their occurrences. The modeling procedure is predictive and data streams are encoded as a sparse matrix of the difference between the original data and the predicted data. 
+```
+
+### 4.3 Revised text (English LaTeX)
+
+```latex
+\item \hl{A \textbf{two-dimensional predictive-modeling encoding algorithm based on a cellular automaton} is proposed. Existing reference-free compressors exploit cross-read redundancy through similarity-based clustering \cite{roguski2018fastore}, hash-based read reordering \cite{chandak2019spring}, or PPM/DMC-style $k$-mer statistical modeling on a linear symbol stream \cite{deorowicz2020fqsqueezer}. Differently, FastqCA organizes the reads in a chunk into a two-dimensional matrix indexed by (read, position), and predicts each symbol from its above, above-left and left neighbours through a dynamically updated rule table. The data stream is then encoded as a sparse matrix of the difference between the original and predicted values.}
+```
+
+### 4.4 Sentence-by-sentence comparison (for author review only; **not to be included in the paper**)
+
+| # | Original (before revision) | Revised (English) | Revised (reference rendering) |
+|---|---|---|---|
+| 1 | A **predictive-modeling encoding algorithm** is proposed for transforming nucleotide sequences and quality scores to compact encodings through making use of spatial context to model their occurrences. | A **two-dimensional predictive-modeling encoding algorithm based on a cellular automaton** is proposed. | Proposes a **two-dimensional predictive-modeling encoding algorithm based on a cellular automaton**. |
+| 2 | (The original makes no comparison with existing work.) | Existing reference-free compressors exploit cross-read redundancy through similarity-based clustering \cite{roguski2018fastore}, hash-based read reordering \cite{chandak2019spring}, or PPM/DMC-style $k$-mer statistical modeling on a linear symbol stream \cite{deorowicz2020fqsqueezer}. Differently, FastqCA organizes the reads in a chunk into a two-dimensional matrix indexed by (read, position), and predicts each symbol from its above, above-left and left neighbours through a dynamically updated rule table. | Existing reference-free compressors exploit cross-read redundancy through similarity-based clustering \cite{roguski2018fastore}, hash-based read reordering \cite{chandak2019spring}, or PPM/DMC-style $k$-mer statistical modeling on a linear symbol stream \cite{deorowicz2020fqsqueezer}. FastqCA instead arranges the reads in a chunk as a two-dimensional matrix indexed by (read, position) and predicts each symbol from its above, above-left, and left neighbours via a dynamically updated rule table. |
+| 3 | The modeling procedure is predictive and data streams are encoded as a sparse matrix of the difference between the original data and the predicted data. | The data stream is then encoded as a sparse matrix of the difference between the original and predicted values. | The data stream is then encoded as a sparse matrix of differences between the original values and the predicted values. |
+
+---
+
+## 5. New reference (BibTeX)
+
+`roguski2018fastore` (FaStore) and `chandak2019spring` (SPRING) already exist in `ref.bib`, so **only the FQSqueezer entry needs to be added**:
+
+```bibtex
+@article{deorowicz2020fqsqueezer,
+  title   = {{FQSqueezer}: $k$-mer-based compression of sequencing data},
+  author  = {Deorowicz, Sebastian},
+  journal = {Scientific Reports},
+  volume  = {10},
+  pages   = {578},
+  year    = {2020},
+  doi     = {10.1038/s41598-020-57452-6},
+  publisher = {Nature Publishing Group}
+}
+```
+
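+> **Verification notes**:
+> - The `020` in the DOI prefix `10.1038/s41598-020-...` corresponds to publication in 2020 and is consistent with the PMC metadata for article number 578. The `019` in the previously used `10.1038/s41598-019-57452-3` is the 2019 DOI prefix, which contradicts the actual publication year, so it has been corrected.
+> - `Scientific Reports` uses article numbers (`pages` is 578) rather than issue numbers, so `number = {1}` has been removed.
+> - The `url` field pointed to the GitHub repository of the implementation. Keeping it separate from the journal entry is cleaner, so it has been removed; if the implementation location should be retained, it can be given in a footnote in the main text.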
+
+---
+
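+## 6. Self-check list
+
+- [x] The new paragraph explicitly cites FaStore (`\cite{roguski2018fastore}`), SPRING (`\cite{chandak2019spring}`), and FQSqueezer (`\cite{deorowicz2020fqsqueezer}`).
+- [x] The text states that no originality is claimed for exploiting cross-read redundancy, which directly addresses the reviewer's "without overstating novelty".
+- [x] The difference between FastqCA and the three methods is drawn clearly: 1D vs 2D; similarity clustering / hash-based reordering / fixed-$k$ $k$-mer statistics vs cellular-automaton neighbourhood prediction.
+- [x] **Factual correction 1**: the SPRING description was changed from "BWT/MEM-style procedure" to "hash-based read reordering through hash-table lookups indexed by read prefixes and suffixes", consistent with the SPRING paper §2 / Supplementary §1.2 ("matches the prefix or the suffix of the current read") and with its bidirectional-matching improvement over its predecessor HARC.
+- [x] **Factual correction 2**: the FQSqueezer description was changed from "high-order context-mixing predictor" to "PPM/DMC with fixed-$k$ $k$-mer dictionaries", consistent with the wording of Deorowicz 2020.
+- [x] **Factual correction 3**: the FQSqueezer BibTeX DOI was corrected from `10.1038/s41598-019-57452-3` to `10.1038/s41598-020-57452-6`, and `number = {1}` and `url` were removed.
+- [x] The second Contribution now uses the same framing as the new paragraph (clustering / hash-based reordering / PPM/DMC-style $k$-mer modeling), avoiding an internal inconsistency between the main text and the contribution list.
+- [x] Every LaTeX change is presented in three parts: original / revised / reference rendering.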