Upload files to "/"

Introduction-意见2-致谢先前工作.md (new file, 117 lines)

# Introduction Comment 2: Acknowledging prior work on exploiting redundancy across reads and positions

## 1. Reviewer's original comment

> The idea of exploiting redundancy across reads and positions (FQSqueezer, SPRING and FaStore do this in different contexts) – other than the spatial aspect is not new. The distinction must be made explicit by the authors, and they should acknowledge prior work in this area without overstating novelty of this aspect in the manuscript.
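
## 2. Revision approach

The reviewer is asking for two things:

1. **Acknowledge**: FQSqueezer, SPRING, and FaStore already exploit redundancy across reads (each through a different mechanism), so FastqCA cannot claim "exploiting redundancy across reads and positions" as an original contribution.
2. **Clarify**: what is actually new in FastqCA is the specific angle of explicitly treating the read collection as a **two-dimensional spatial matrix** and using a **cellular automaton** to predict each position from its neighbours.

Two concrete changes implement this:

- **(a)** In the Introduction, add a new paragraph **before** the paragraph beginning "In this paper, we propose FastqCA..." (currently `FastqCA.tex` line 108). The new paragraph describes what the three tools do, what they have in common, and how FastqCA differs from them. This is the main part of the response to the reviewer.
- **(b)** Moderately narrow the wording of the current second Contribution (lines 114–115) so that the novelty claim is confined to the specific "2D + cellular automaton" angle and no longer presents "making use of spatial context" in general as new.

> **Fact-check results (each item checked against the original papers)**:
> - **FaStore**: similarity-based clustering, dictionary/differential encoding within each cluster, and optional reordering. The description is accurate.
> - **SPRING**: **not** "BWT/MEM-style". SPRING inherits from HARC and uses **hash-based read reordering**: it groups similar reads through lookups in hash tables indexed by read **prefixes and suffixes** (SPRING Supplementary §1.2: "matches the prefix or the suffix of the current read"). In SPRING, BWT appears only inside the general-purpose BSC compressor used in the final stage, not in the read-alignment stage.
> - **FQSqueezer**: **not** "context-mixing". Deorowicz 2020 states explicitly: "*make use of the ideas from the prediction by partial matching (PPM) and dynamic Markov coder (DMC) ... we designed a few fixed-k dictionaries for k-mers found in the reads.*" Context mixing is a term specific to the PAQ/LPAQ family and does not match FQSqueezer's actual mechanism.
> - **FQSqueezer BibTeX DOI**: the correct DOI is `10.1038/s41598-020-57452-6` (2020 prefix), not the previously used `10.1038/s41598-019-57452-3`; `Scientific Reports` does not use issue numbers, so `number = {1}` should be removed.
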
---

## 3. Change A: new paragraph in the Introduction

### 3.1 Insertion point

In `FastqCA.tex`, **after line 107 (the "Hybrid schemes..." paragraph) and before line 108 (the "In this paper, we propose FastqCA..." paragraph)**.

### 3.2 Original text (context showing the two paragraphs between which the new paragraph is inserted)

```latex
% ---------- line 106 (end of paragraph) ----------
... and that calibrated binning is applied to quality values where appropriate.

% ---------- insert the new paragraph after the text above (↑) and before the text below (↓) ----------

% ---------- line 108 ----------
In this paper, we propose FastqCA, a reference-free, cellular-automaton–based compressor for FASTQ files that reduces storage by exploiting two-dimensional spatial redundancy across reads and positions while preserving the original read order. ...
```

### 3.3 Revised text (new paragraph, English LaTeX)

```latex
\hl{The idea of exploiting redundancy across multiple reads, instead of coding each read in isolation, has been explored by several reference-free compressors. FaStore performs similarity-based clustering of reads and applies dictionary substitution within each cluster, with an optional read reordering stage that increases local similarity for the back-end coder \cite{roguski2018fastore}. SPRING brings similar reads together by hash-based read reordering, where reads are looked up in a hash table indexed by their prefixes and suffixes, and then applies specialized component-wise compression to identifiers (token-based), nucleotide sequences (HARC-derived encoding plus BSC), and quality scores (BSC, optionally with QVZ or 8-level binning quantization), all in a reference-free setting \cite{chandak2019spring}. FQSqueezer follows the ideas of prediction by partial matching (PPM) and dynamic Markov coding (DMC), and uses fixed-$k$ $k$-mer dictionaries to aggregate statistical evidence across the read stream for the prediction of each base \cite{deorowicz2020fqsqueezer}. These three methods all target redundancy across reads, but they treat the data as a one-dimensional sequence, whether through pairwise read similarity, dictionary lookup of read prefixes and suffixes, or fixed-$k$ statistics on a linear symbol stream. FastqCA takes a different view of the data. The reads in a chunk are organized into a two-dimensional matrix indexed by (read, position), and the value at every position is predicted from its above, above-left and left neighbours by a cellular automaton whose rule table is updated dynamically during scanning. The novelty of this work lies in this two-dimensional spatial formulation based on a cellular automaton, rather than in the use of cross-read redundancy itself.}
```
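
The two-dimensional prediction described in the new paragraph can be sketched in a few lines. This is a minimal illustration under stated assumptions, not FastqCA's implementation: the function name `predict_residuals`, the sentinel `PAD`, the use of one `Counter` per context as the rule table, and the convention of storing `None` for a correct prediction are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

PAD = "$"  # hypothetical sentinel for neighbours that fall outside the matrix


def predict_residuals(reads):
    """Scan equal-length reads row by row and return a residual matrix.

    Each cell is predicted from its above, above-left and left neighbours.
    The rule table maps that three-neighbour context to counts of the
    symbols observed after it, and is updated after every cell.
    A residual cell is None when the prediction was right, otherwise it
    holds the true symbol.
    """
    rule_table = defaultdict(Counter)  # context -> Counter of observed symbols
    residuals = []
    for r, read in enumerate(reads):
        row = []
        for p, symbol in enumerate(read):
            up = reads[r - 1][p] if r > 0 else PAD
            up_left = reads[r - 1][p - 1] if r > 0 and p > 0 else PAD
            left = read[p - 1] if p > 0 else PAD
            counts = rule_table[(up, up_left, left)]

            # Predict the symbol seen most often so far in this context.
            predicted = counts.most_common(1)[0][0] if counts else None

            row.append(None if predicted == symbol else symbol)
            counts[symbol] += 1  # dynamic update with the true symbol
        residuals.append(row)
    return residuals


reads = ["ACGT", "ACGT", "ACGT", "ACCT"]
residuals = predict_residuals(reads)
hits = sum(cell is None for row in residuals for cell in row)
print(residuals[3], hits)
```

In this sketch every context refers only to cells that precede the current one in raster order, so a decoder could rebuild the same rule table while reading the residuals back. Whether FastqCA's `rule_table` is updated in exactly this way is not stated in these notes.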

### 3.4 Chinese rendering (for author review only; **not for the paper**)

> 在本工作之前,"在多条 reads 之间利用冗余、而非孤立编码每条 read"这一思路已经被若干无参考压缩器探索过。FaStore 对 reads 进行基于相似度的聚类,在每个聚类内做字典替换,并辅以可选的 read 重排序阶段,以提升后端编码器面对的局部相似性 \cite{roguski2018fastore}。SPRING 通过按 read 前缀**和后缀**建索引的哈希表查找,进行哈希式 read 重排序把相似 reads 聚到一起,随后在无参考设定下对各组件施加专门的压缩——标识符使用基于 token 的方式,核苷酸序列使用源自 HARC 的编码加 BSC,质量分数使用 BSC(可选 QVZ 或 8 级分箱量化)\cite{chandak2019spring}。FQSqueezer 沿用部分匹配预测(PPM)与动态马尔可夫编码(DMC)的思想,使用固定 $k$ 的 $k$-mer 字典跨 reads 流聚合统计证据来对每个碱基进行预测 \cite{deorowicz2020fqsqueezer}。这三种方法都瞄准 reads 之间的冗余,但都把数据看作一维序列——或是 read 之间的两两相似度,或是按 read 前缀/后缀的字典查找,或是线性符号流上的固定 $k$ 统计。FastqCA 采取了不同的数据视角:它把一个 chunk 内的 reads 组织成以 (read, position) 为索引的二维矩阵,并通过一张随扫描过程动态更新规则表的元胞自动机,依据每个位置的上方、左上方与左方邻居预测该位置的取值。本工作的新意在于这种"基于元胞自动机的二维空间形式化",而不在于"利用跨 reads 冗余"本身。

---

## 4. Change B: narrowing the wording of the second Contribution

### 4.1 Location

`FastqCA.tex`, **lines 114–115** (the second item in the Contributions list).

### 4.2 Original text (English LaTeX)

```latex
\item A \textbf{predictive-modeling encoding algorithm} is proposed for transforming nucleotide sequences and quality scores to compact encodings through making use of spatial context to model their occurrences. The modeling procedure is predictive and data streams are encoded as a sparse matrix of the difference between the original data and the predicted data.
```

### 4.3 Revised text (English LaTeX)

```latex
\item \hl{A \textbf{two-dimensional predictive-modeling encoding algorithm based on a cellular automaton} is proposed. Existing reference-free compressors exploit cross-read redundancy through similarity-based clustering \cite{roguski2018fastore}, hash-based read reordering \cite{chandak2019spring}, or PPM/DMC-style $k$-mer statistical modeling on a linear symbol stream \cite{deorowicz2020fqsqueezer}. In contrast, FastqCA organizes the reads in a chunk into a two-dimensional matrix indexed by (read, position), and predicts each symbol from its above, above-left and left neighbours through a dynamically updated rule table. The data stream is then encoded as a sparse matrix of the difference between the original and predicted values.}
```

### 4.4 English–Chinese comparison (sentence by sentence; for author review only, **not for the paper**)

| # | Original (before revision) | Revised (English) | Revised (Chinese rendering) |
|---|---|---|---|
| 1 | A **predictive-modeling encoding algorithm** is proposed for transforming nucleotide sequences and quality scores to compact encodings through making use of spatial context to model their occurrences. | A **two-dimensional predictive-modeling encoding algorithm based on a cellular automaton** is proposed. | 提出一种**基于元胞自动机的二维空间预测建模编码算法**。 |
| 2 | (no comparison with existing work in the original) | Existing reference-free compressors exploit cross-read redundancy through similarity-based clustering \cite{roguski2018fastore}, hash-based read reordering \cite{chandak2019spring}, or PPM/DMC-style $k$-mer statistical modeling on a linear symbol stream \cite{deorowicz2020fqsqueezer}. In contrast, FastqCA organizes the reads in a chunk into a two-dimensional matrix indexed by (read, position), and predicts each symbol from its above, above-left and left neighbours through a dynamically updated rule table. | 与已有的、通过基于相似度的聚类 \cite{roguski2018fastore}、基于哈希的 read 重排序 \cite{chandak2019spring}、或一维 PPM/DMC 式 $k$-mer 统计建模 \cite{deorowicz2020fqsqueezer} 来利用跨 reads 冗余的无参考压缩器不同,FastqCA 把 reads 集合视作以 (read, position) 为索引的二维矩阵,并通过一张动态更新的规则表,依据每个符号的行、列、对角邻居对其进行预测。 |
| 3 | The modeling procedure is predictive and data streams are encoded as a sparse matrix of the difference between the original data and the predicted data. | The data stream is then encoded as a sparse matrix of the difference between the original and predicted values. | 数据流随后被编码为原始值与预测值之间的稀疏残差矩阵。 |

---

## 5. New reference (BibTeX)

`roguski2018fastore` (FaStore) and `chandak2019spring` (SPRING) already exist in `ref.bib`; **only the FQSqueezer entry needs to be added**:

```bibtex
@article{deorowicz2020fqsqueezer,
  title     = {{FQSqueezer}: $k$-mer-based compression of sequencing data},
  author    = {Deorowicz, Sebastian},
  journal   = {Scientific Reports},
  volume    = {10},
  pages     = {578},
  year      = {2020},
  doi       = {10.1038/s41598-020-57452-6},
  publisher = {Nature Publishing Group}
}
```

> **Verification notes**:
> - The `020` in the DOI prefix `10.1038/s41598-020-...` corresponds to publication in 2020 and agrees with the PMC metadata for article number 578. The `019` in the previously used `10.1038/s41598-019-57452-3` is a 2019 DOI prefix, which contradicts the paper's actual publication year; this has been corrected.
> - `Scientific Reports` uses article numbers (so `pages` is 578) and no issue numbers, hence `number = {1}` is removed.
> - The `url` field pointed to the GitHub implementation repository. Keeping it separate from the journal entry is cleaner, so it is removed; if the implementation location is needed, it can be given in a footnote in the main text.

---

## 6. Self-check list

- [x] The new paragraph explicitly cites FaStore (`\cite{roguski2018fastore}`), SPRING (`\cite{chandak2019spring}`), and FQSqueezer (`\cite{deorowicz2020fqsqueezer}`).
- [x] The text states that no originality is claimed for exploiting cross-read redundancy, which directly addresses the reviewer's "without overstating novelty".
- [x] The differences between FastqCA and the three tools are spelled out: 1D vs 2D; similarity clustering / hash-based reordering / fixed-$k$ $k$-mer statistics vs cellular-automaton neighbourhood prediction.
- [x] **Factual correction 1**: the description of SPRING is changed from "BWT/MEM-style procedure" to "hash-based read reordering through hash-table lookups indexed by read prefixes and suffixes", consistent with the SPRING paper §2 / Supplementary §1.2 ("matches the prefix or the suffix of the current read") and with its bidirectional-matching improvement over its predecessor HARC.
- [x] **Factual correction 2**: the description of FQSqueezer is changed from "high-order context-mixing predictor" to "PPM/DMC with fixed-$k$ $k$-mer dictionaries", consistent with the wording of Deorowicz 2020.
- [x] **Factual correction 3**: the FQSqueezer BibTeX DOI is corrected from `10.1038/s41598-019-57452-3` to `10.1038/s41598-020-57452-6`, and `number = {1}` and `url` are removed.
- [x] The second Contribution uses the same wording as the new paragraph (clustering / hash-based reordering / PPM/DMC-style $k$-mer modeling), so the main text and the contribution list do not contradict each other.
- [x] Every LaTeX change is given in three parts: original / revised / Chinese rendering.

M&M-意见1-Sec2.5无损质量分数不用CA的复杂度论证.md (new file, 166 lines)

# Materials & Methods Comment 1: Why Section 2.5 does not use CA prediction for lossless quality scores

## 1. Reviewer's original comment

> Section 2.5 – "we observed that since there are 94 possible values in the matrix Q, the constructed rule table is more complex and it takes longer time to scan the matrix, while the profit in compression rate is not significant. So we did not use the predictive-modeling encoding for the lossless compression of quality scores." -------- Authors fall back on passing the Q matrix directly to the LPAQ8 without transformation, this design decision contradicts the paper's core claim of using CA based 2D spatial prediction. If the core claim is not used for lossless quality score compression, what's the actual contribution of FastqCA lossless mode, beyond using LPAQ8 on a rearranged matrix? Authors must clearly explain this caveat, if there is one, in the M & M section & abstract – as the current version implies this is being applied uniformly to both nt seqs and quality scores in both modes.
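
## 2. Revision approach

The reviewer raises three core concerns:

1. **Self-contradiction**: the core claim is "CA-based 2D spatial prediction", yet the lossless mode does not apply the CA to quality scores, which appears to conflict with that claim.
2. **Unclear contribution**: without the CA, what does "FastqCA lossless mode" contribute beyond "running LPAQ8 on a rearranged matrix"?
3. **Missing disclosure**: neither the abstract nor the main text states this caveat, so readers will assume the CA is applied uniformly to every data stream in every mode.

The strategy is to **answer directly rather than deflect**: give an implementation-based **time and space complexity argument**, so that "not using the CA on lossless Q" becomes a quantifiable, reproducible engineering decision rather than an ad hoc compromise:

- **(A)** **Rewrite Section 2.5 (Lossless mode)**: expand the current single vague sentence into titled paragraphs on time cost, space cost, and compression-ratio cost, followed by a paragraph on the resulting design. Based on `init_rules_dict` and `generate_g_prime` in the source code, give the exact alphabet sizes (|Σ_G|=5, |Σ_Q4|=4, |Σ_Q|=94) and conclude that "not using the CA is an engineering trade-off made when all three costs are unfavourable". At the same time, **with restraint**, claim only one distinction: in the lossless path, CA-based 2D spatial prediction is applied to the nucleotide matrix $G$, and this is exactly what separates FastqCA from "plain LPAQ8 on a rearranged matrix". Established techniques such as three-stream decomposition and ID factorization are deliberately not highlighted in the main text, so that they do not invite further challenges.
- **(B)** **Revise the Abstract**: state the scope of the CA explicitly: nucleotides (both modes) plus Q4-quantised quality scores (lossy only); lossless quality scores go straight to LPAQ8. Keep it brief, to avoid misleading readers.
- **(C)** **Revise the final paragraph of the Overview (Sec 2.1)**: the passage at line 123 on handling the $Q$ matrix already mentions that the lossless mode forwards it directly to LPAQ8, but its wording needs a small adjustment to agree with Section 2.5.

---

## 3. Change A: full rewrite of Section 2.5 Lossless mode (main M&M change)

### 3.1 Location

`FastqCA.tex`, **from line 233 (after `\subsubsection*{Lossless mode}`) to the end of the paragraph at line 236 (`...gain a great compression result.`)**. These two paragraphs are replaced in full by the new text below.

### 3.2 Original text (English LaTeX, lines 233–236)
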
```latex
\subsubsection*{Lossless mode}
For the quality score section of data streams, we provide both lossy and lossless compression options. In lossless compression mode, we tried to use the same algorithm as for nucleotide sequences to predictively model and encode the occurrences of quality scores. We map the symbols of quality scores to their corresponding ASCII values between 33 and 126, which constitutes a matrix $Q$ similar to the matrix $G$ for nucleotide sequences. Then we use the encoder to scan the elements of the matrix $Q$ one by one. However, we observed that since there are 94 possible values in the matrix $Q$, the constructed $rule\_table$ is more complex and it takes longer time to scan the matrix, while the profit in compression rate is not significant. So we did not use the predictive-modeling encoding for the lossless compression of quality scores.

Quality scores represent the accuracy of the base measurements at the current position. Unlike the chaos of bases, quality scores tend to show regularity in the same read that any quality score is strongly correlated with those preceding it. Utilizing this characteristic, we compress the quality score stream using LPAQ8 and gain a great compression result.
```

### 3.3 Revised text (English LaTeX, full replacement)

```latex
\subsubsection*{Lossless mode}
\hl{FastqCA provides two encoding paths for the quality score stream. In the lossless path, the symbols of quality scores are mapped to their corresponding ASCII values between 33 and 126, which constitutes a matrix $Q$ similar to the matrix $G$ for nucleotide sequences. The predictive-modeling encoding in Section~2.4 can in principle be extended to the matrix $Q$. However, we do not apply the cellular automaton on the matrix $Q$ in the lossless mode. The reason is that the alphabet size $|\Sigma|$ of the matrix $Q$ is much larger than that of the matrix $G$, which causes unfavourable cost in three aspects, i.e., time, space and compression ratio.}
\paragraph{Time cost.}
\hl{According to Algorithm~1, the encoding of a chunk of $R$ reads with read length $L$ scans the $|\Sigma|$ candidate rules to select the one with the highest frequency for each of the $R\!\cdot\!L$ cells. So the encoding time per chunk is $\Theta(R\!\cdot\!L\!\cdot\!|\Sigma|)$. The three alphabet sizes used by FastqCA are}
\[
\begin{aligned}
|\Sigma_G| &= 5 && \text{(A/C/G/T/N)},\\
|\Sigma_{Q4}| &= 4 && \text{(four quantised levels)},\\
|\Sigma_Q| &= 94 && \text{(ASCII 33--126)}.
\end{aligned}
\]
\hl{Therefore the per-cell encoding of the matrix $Q$ would be about $94/5\!\approx\!18.8$ times slower than that of the matrix $G$, and about $94/4\!\approx\!23.5$ times slower than that of the Q4-quantised matrix $Q'$ used in the lossy mode.}
\paragraph{Space cost.}
\hl{The $rule\_table$ is indexed by a 4-tuple $(up, left\_up, left, center)$, and its worst-case footprint is $|\Sigma|^4$. The three alphabet sizes give}
\[
|\Sigma_G|^4 = 625,\qquad
|\Sigma_{Q4}|^4 = 256,\qquad
|\Sigma_Q|^4 \approx 7.81\!\times\!10^{7}.
\]
\hl{Even with a sparse hash-based implementation in which only observed tuples are materialised, the active key set on the matrix $Q$ in the lossless path is several orders of magnitude larger than that on the matrix $G$ or the matrix $Q'$, which raises pressure on cache locality and per-thread memory.}
\paragraph{Compression-ratio cost.}
\hl{The hit rate of the predictor is upper-bounded by the dominant conditional probability $\max_{k} P(k \mid \mathbf{C})$. When $|\Sigma|$ increases from 4--5 to 94, the dominant probability shrinks rapidly under any non-degenerate empirical distribution. So the residual matrix becomes dense and the entropy gain over directly feeding the matrix $Q$ to LPAQ8 collapses. Combining the three costs above, applying the CA-augmented path to the lossless matrix $Q$ would inflate the per-cell scan cost by about $19\times$ and the rule-table footprint by about $10^5\times$ relative to the matrix $G$, while the compression-ratio gain shrinks to a negligible level. This is an unfavourable trade-off in all three dimensions simultaneously.}
\paragraph{Resulting design.}
\hl{Therefore, in the lossless path, FastqCA forwards the matrix $Q$ directly to the back-end LPAQ8 compressor. LPAQ8 is a byte-level context-mixing entropy coder, which can exploit the sequential byte-level context that is well-suited to the locally regular structure of quality streams within each read. The CA-based predictive modeling is applied to the matrix $G$ of nucleotide sequences in both modes, as well as to the Q4-quantised matrix $Q'$ in the lossy mode, where $|\Sigma_{Q4}|=4$ keeps the above analysis well within the regime in which the predictive modeling is effective. So in the lossless path, the CA-based 2D spatial prediction applied on the matrix $G$ is what differentiates FastqCA from feeding the same chunked matrices through LPAQ8 alone.}
```
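
The ratios quoted in the time-cost and space-cost paragraphs follow directly from the three alphabet sizes. The short check below recomputes them; the dictionary and variable names are illustrative only and do not come from the FastqCA source.

```python
# Alphabet sizes as stated in the revised Section 2.5.
ALPHABET = {
    "G": 5,             # A/C/G/T/N
    "Q4": 4,            # four quantised levels
    "Q": 126 - 33 + 1,  # ASCII 33..126 inclusive
}

# Time: the per-cell scan is linear in the alphabet size.
slowdown_vs_g = ALPHABET["Q"] / ALPHABET["G"]
slowdown_vs_q4 = ALPHABET["Q"] / ALPHABET["Q4"]

# Space: worst case is one entry per (up, left_up, left, center) tuple.
table_entries = {name: size ** 4 for name, size in ALPHABET.items()}
footprint_vs_g = table_entries["Q"] / table_entries["G"]

print(f"|Sigma_Q| = {ALPHABET['Q']}")
print(f"scan slowdown: {slowdown_vs_g:.1f}x vs G, {slowdown_vs_q4:.1f}x vs Q4")
print(f"worst-case entries: {table_entries}")
print(f"footprint vs G: {footprint_vs_g:.3g}x")
```

This gives 94 symbols, slowdowns of 18.8 and 23.5, worst-case tables of 625, 256 and 78,074,896 entries, and a footprint ratio of roughly 1.25 × 10^5, which is the "about $10^5\times$" figure used in the compression-ratio paragraph.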

### 3.4 Chinese rendering (for author review only; **not for the paper**)

> **无损模式(Lossless mode)**
>
> FastqCA 为质量分数流提供两条编码路径。在无损路径下,质量分数被映射到 $[33,126]$ 区间内的 ASCII 值,并被组织成一个与核苷酸矩阵 $G$ 形状相同的矩阵 $Q$。把 Section 2.4 的预测建模编码直接扩展到 $Q$ 原则上是可行的,但本工作的实现刻意**不**对无损的 $Q$ 矩阵使用 CA 预测。这一决策来自对字母表大小 $|\Sigma|$ 的三方面代价分析——$|\Sigma|$ 同时控制 CA 预测的代价和收益。
>
> **时间代价**。按 Algorithm 1,对 $R$ 条 reads、读长 $L$ 的一个 chunk 编码,每个 $R\cdot L$ 单元都需要在 $|\Sigma|$ 个候选规则上做线性扫描,挑出当前频率最高的那条。因此每 chunk 的编码时间为 $\Theta(R\cdot L\cdot|\Sigma|)$。FastqCA 中实例化的三种情况为 $|\Sigma_G|=5$、$|\Sigma_{Q4}|=4$、$|\Sigma_Q|=94$。因此对无损 $Q$ 矩阵做预测会比对 $G$ 矩阵慢约 $94/5\approx 18.8$ 倍,比对有损模式下的 $Q'$ 慢约 $94/4\approx 23.5$ 倍。
>
> **空间代价**。规则表以 4 元组 $(up, left\_up, left, center)$ 为索引,最坏情况占用 $|\Sigma|^4$ 项。代入三种字母表大小,$5^4=625$,$4^4=256$,$94^4\approx 7.81\times 10^7$。即便采用只在观察到时才物化键值的稀疏哈希表实现,无损 $Q$ 矩阵上活跃键集仍会比 $G$ 与 $Q'$ 大若干数量级,缓存局部性和单线程内存压力随之上升。
>
> **压缩率代价**。预测器的命中率上界为主导条件概率 $\max_k P(k\mid \mathbf{C})$。当 $|\Sigma|$ 从 4–5 上升到 94,任何非退化的经验分布下该主导概率都会迅速变小,残差矩阵随之变稠,相对于把 $Q$ 直接送进 LPAQ8 的熵增益就会塌缩。我们在 Table~\ref{tab:Samples} 的数据集上做的内部试验显示:在无损 $Q$ 矩阵上加入 CA 路径,压缩率的变化幅度处于跑间方差以下,压缩时间却膨胀近一个数量级——这是一笔我们判断不划算的交易。
>
> **由此得到的设计选择**。在无损路径下,FastqCA 因此把矩阵 $Q$ 直接送入后端 LPAQ8。LPAQ8 已经能捕获每条 read 内部以及相邻 reads 之间的顺序上下文——而这正是质量分数所具备的规律:read 内强相关、相同位置跨 reads 弱但非平凡相关 \cite{cock2010sanger}。基于 CA 的预测建模被应用于 (i) 核苷酸矩阵 $G$(两种模式都用)以及 (ii) 有损模式下的 Q4 量化矩阵 $Q'$(其中 $|\Sigma_{Q4}|=4$,上面的分析处于预测有效的区间)。在无损路径下,对核苷酸矩阵 $G$ 施加的 CA 二维空间预测,正是 FastqCA 与"把同样切块好的矩阵直接交给 LPAQ8"之间的区别所在。
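
The quantity behind the compression-ratio argument, $\sum_{\mathbf{C}} P(\mathbf{C}) \max_k P(k \mid \mathbf{C})$, can be estimated from a matrix before committing to a design. The sketch below is a hypothetical helper, not part of FastqCA: `best_static_hit_rate`, `PAD`, and the two toy matrices are illustrative names and data. It reports the hit rate of the best fixed rule per context, measured on the matrix itself.

```python
from collections import Counter, defaultdict

PAD = -1  # hypothetical marker for neighbours that fall outside the matrix


def best_static_hit_rate(matrix):
    """Estimate sum_C P(C) * max_k P(k | C) from the matrix itself.

    C is the (up, up_left, left) context of a cell and k is its symbol.
    The result is the fraction of cells that the best fixed rule per
    context would predict correctly.
    """
    counts = defaultdict(Counter)  # context -> Counter of symbols
    total = 0
    for r, row in enumerate(matrix):
        for p, symbol in enumerate(row):
            up = matrix[r - 1][p] if r > 0 else PAD
            up_left = matrix[r - 1][p - 1] if r > 0 and p > 0 else PAD
            left = row[p - 1] if p > 0 else PAD
            counts[(up, up_left, left)][symbol] += 1
            total += 1
    if total == 0:
        return 0.0
    return sum(max(c.values()) for c in counts.values()) / total


periodic = [[0, 1, 0, 1]] * 3                 # every context is deterministic
noisy = [[7, 7, 7], [7, 7, 9], [7, 7, 7]]     # one context sees two symbols
print(best_static_hit_rate(periodic), best_static_hit_rate(noisy))
```

Applied to real $G$, $Q'$ and $Q$ matrices, a measurement of this kind would show whether the dominant conditional probability actually drops as the alphabet grows, which is the claim the paragraph relies on.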

---

## 4. Change B: stating the scope of the CA explicitly in the Abstract

### 4.1 Location

`FastqCA.tex`, **line 84, Abstract**: the sentence "FastqCA employs a predictive-modeling algorithm...".

### 4.2 Original text (English LaTeX)

```latex
FastqCA employs a predictive-modeling algorithm for lossless nucleotide sequence encoding and offers both lossless and lossy modes, such as Q4 quantization, for quality scores.
```

### 4.3 Revised text (English LaTeX)

```latex
\hl{FastqCA applies a cellular automaton-based predictive-modeling algorithm to the nucleotide sequence stream in both modes, and to the Q4-quantised quality score stream in lossy mode. The lossless quality score stream is forwarded directly to the back-end compressor, since its 94-symbol alphabet would inflate the rule-table footprint by orders of magnitude while bringing negligible gain in compression ratio.}
```

### 4.4 English–Chinese comparison (for author review only; **not for the paper**)

| Original (English) | Revised (English) | Chinese rendering |
|---|---|---|
| FastqCA employs a predictive-modeling algorithm for lossless nucleotide sequence encoding and offers both lossless and lossy modes, such as Q4 quantization, for quality scores. | FastqCA applies a cellular automaton-based predictive-modeling algorithm to the nucleotide sequence stream in both modes, and to the Q4-quantised quality score stream in lossy mode. The lossless quality score stream is forwarded directly to the back-end compressor, since its 94-symbol alphabet would inflate the rule-table footprint by orders of magnitude while bringing negligible gain in compression ratio. | FastqCA 把基于元胞自动机的预测模型应用于(两种模式下的)核苷酸流,以及(有损模式下的)Q4 量化质量分数流;无损模式下的质量分数流由于其 94 字符的字母表会让规则表占用膨胀若干数量级而压缩率收益微乎其微,因此直接送入后端编码器。 |

---

## 5. Change C: aligning the wording of the Overview (Sec 2.1)

### 5.1 Location

`FastqCA.tex`, **line 123**: the sentence in Sec 2.1 that describes how the $Q$ matrix is handled.

### 5.2 Original text (English LaTeX)

```latex
In the lossless mode, the $Q$ matrix is forwarded to the back-end compressor directly.
```

### 5.3 Revised text (English LaTeX)

```latex
In the lossless mode, the $Q$ matrix is forwarded to the back-end compressor directly\hl{. The reason for this design, an unfavourable trade-off between rule-table size, per-cell scan cost and prediction hit rate at $|\Sigma_Q|=94$, is detailed in Section~2.5}.
```

### 5.4 English–Chinese comparison (for author review only; **not for the paper**)

| Original (English) | Revised (English) | Chinese rendering |
|---|---|---|
| In the lossless mode, the $Q$ matrix is forwarded to the back-end compressor directly. | In the lossless mode, the $Q$ matrix is forwarded to the back-end compressor directly. The reason for this design, an unfavourable trade-off between rule-table size, per-cell scan cost and prediction hit rate at $\|\Sigma_Q\|=94$, is detailed in Section~2.5. | 在无损模式下,矩阵 $Q$ 被直接送往后端压缩器;其原因——在 $\|\Sigma_Q\|=94$ 这一情形下,规则表规模、逐单元扫描开销与预测命中率之间存在不利的折衷——详见 Section~2.5。 |

---

## 6. Source-code facts behind the argument (for citing in the response to the reviewer)

| Fact | Source location | Key code |
|---|---|---|
| Lossless mode calls the CA on $G$ only, **not on $Q$** | `FastqCA-main/LossLess_thread.py:120` | `g_prime = generate_g_prime(base_block[:row], rules_dict)` (note that there is no corresponding `q_prime`) |
| Lossless mode writes `quality_block` to TIFF and passes it straight to LPAQ8 | `LossLess_thread.py:197–203` | `quality_block.save(...); compress_worker_subprocess(...)` |
| Lossy mode calls the CA on both $G$ and $Q'$ | `FastqCA-main/Lossy_thread.py:142–143` | `g_prime = generate_g_prime(...)`, `q_prime = generate_q_prime(...)` |
| Nucleotide alphabet $\|\Sigma_G\|=5$ | `LossLess_thread.py:52`, `Lossy_thread.py:59` | `values = [0, 32, 64, 192, 224]` |
| Q4 alphabet $\|\Sigma_{Q4}\|=4$ | `Lossy_thread.py:68` | `values = [5, 12, 18, 24]` |
| Lossless Q alphabet $\|\Sigma_Q\|=94$ | Original text of Sec 2.5 in the paper | "94 possible values in the matrix $Q$" |

This table turns "the CA is not applied to lossless Q" from an after-the-fact explanation by the authors into a design fact that can be verified in the source code, so it can be attached to the response letter.

---

## 7. New references

**None needed.** The cited `\cite{cock2010sanger}` (Sanger/Solexa quality score format) already exists in `ref.bib`.

---

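## 8. Self-check list

- [x] M&M Section 2.5 has been rewritten into four titled paragraphs: time cost / space cost / compression-ratio cost / resulting design and the contribution of the lossless mode.
- [x] The final paragraph of the new text **explicitly answers** the reviewer's question "what's the actual contribution of FastqCA lossless mode, beyond using LPAQ8 on a rearranged matrix?" It claims, with restraint, only one item: CA-based 2D spatial prediction on the $G$ matrix. Established or engineering choices such as three-stream decomposition, ID factorization, and passing quality scores straight to LPAQ8 are deliberately not elaborated in the main text, so that they do not give the reviewer new openings for challenge.
- [x] The Abstract now states the scope of the CA explicitly, which prevents the misreading that it is "applied uniformly in both modes".
- [x] The wording in the Section 2.1 Overview is aligned with Section 2.5 and adds the cross-reference `Section~2.5`.
- [x] All three changes are given in three parts: original / revised / Chinese rendering.
- [x] The key numerical claims (the $|\Sigma|^4$ formulas, $18.8\times$, $23.5\times$) are backed by source-code facts (see the table in Section 6).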