Upload files to "/"

2026-05-14 01:54:32 +09:00
parent 52a67a85c7
commit 3aa852f35b
2 changed files with 262 additions and 0 deletions
--- a/Results-意见8-paired-end-reads压缩方式.md
+++ b/Results-意见8-paired-end-reads压缩方式.md
@@ -0,0 +1,77 @@
+# Results Comment 8: How paired-end data are compressed
+
+## 1. Reviewer's original comment
+
+> How are the paired end reads compressed? Explain in the manuscript
+
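+## 2. Revision approach (per your instruction: the authors apply no special handling to paired-end data and compress the two FASTQ files independently)
+
+Strategy: **append a short paragraph** after the last paragraph of Section 2.2 (Partitioning) (`FastqCA.tex` line 135) that spells out the paired-end handling. FastqCA treats the R1 / R2 files of a paired-end run as two independent FASTQ inputs and compresses each one separately. Because FastqCA **preserves the original read order** (stated explicitly at line 108), the *k*-th read of R1 still corresponds by position to the *k*-th read of R2, so the pairing is maintained by read-out order alone, without any additional joint paired-end coding structure. The separate reporting of `SRR14139158_1` and `SRR14139158_2`, and of `SRR14626645_1` and `SRR14626645_2`, in Tables 1 / 2 of the paper serves only as an internal cross-check for the authors and is not written into the new text.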
+
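+> **Why this rationale holds up** (for the authors' reference, **not to be included in the paper**):
+> - **Fact 1 (argparse in `main_new.py:49` and `LossLess_thread.py:691` / `Lossy_thread.py`)**: the CLI exposes only a single `--input_path` argument, with no joint paired-end input such as `--input_1` / `--input_2`, and no R1↔R2 joint coding logic appears anywhere. In other words, "compress the two files independently" is the only path the current implementation offers.
+> - **Fact 2 (paper line 108)**: "Preserving read order enables strictly streaming operations..." Order preservation is a core design trade-off that FastqCA already emphasizes in the Intro. It is also the key premise behind "paired-end data need no special handling": as long as each file preserves its internal order, the positional correspondence R1[k] ↔ R2[k] holds automatically.
+> - **Fact 3 (paper line 101)**: "reordering of reads may cause the mapping of paired-end sequencing data failed." The paper already cites "reordering breaks paired-end pairing" as a negative example; adding the positive statement (FastqCA does not reorder, so paired-end files can be compressed independently) simply closes the loop on the existing argument.
+> - **Fact 4 (paper Tables 2 / 3 / 5 / 6, lines 456–502, etc.)**: the `_1` and `_2` files are reported as separate rows throughout and are never merged into a single "paired-end CR"; the tables themselves are verifiable evidence of "compressing the two files independently".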
+
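+> These facts **do not need to be laid out one by one in the paper**, but the claim in the new paragraph ("independent compression + order preservation ⇒ pairing maintained automatically") can be cross-checked by reviewers against both the source code and the published tables, and will not contradict either.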
+
+---
+
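+## 3. Location of the change
+
+**After line 135** of `FastqCA.tex` (after the end of Section 2.2, "...before feeding them to the back-end compressor."), **add one new paragraph**.
+
+> Note: this paragraph should go **after** the new chunk-size paragraph proposed in `M&M-意见4-128MB-chunk-size-依据.md`, so that the last two paragraphs of Section 2.2 are, in order: (i) the chunk-size trade-off, and (ii) the handling of paired-end data. Both paragraphs supplement the description of how the input is split and fed to the back end, so the topics are adjacent and the narrative flows naturally.
+
+## 4. Original text (English LaTeX; context: end of the paragraph at lines 134–135)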
+
+```latex
+% ---------- end of the paragraph at line 135 ----------
+Since the IDs, nucleotide sequences and quality scores in the reads have different characteristics, we split the reads into chunks of 128 MB by default and then partition them into streams of IDs, nucleotide sequences and quality scores. Our idea of partitioning compression is to extract the characteristics of different types of data streams and compress them separately before feeding them to the back-end compressor.
+
+% ---------- ↑ insert the new paragraph after this paragraph and after the chunk-size paragraph added for M&M Comment 4, ↓ before Section 2.3 ----------
+
+\subsection{Identifier compression}
+...
+```
+
+## 5. Revised text (English LaTeX, new paragraph)
+
+```latex
+\hl{For synchronized paired-end FASTQ files, that is, the standard convention where the $k$-th read in R1 is the mate of the $k$-th read in R2, FastqCA does not implement a dedicated joint coding scheme between the two mate files. The R1 and R2 FASTQ files of a paired-end run are each supplied as an independent input, and compressed separately under exactly the same pipeline described above. Because FastqCA preserves the original read order within every file, the $k$-th read in the decompressed R1 file is guaranteed to correspond to the $k$-th read in the decompressed R2 file, so the mate-pair correspondence is maintained by positional alignment alone and no auxiliary pairing index is required.}
+```
+
+## 6. English-Chinese comparison (for the authors' review only, **not to be included in the paper**)
+
+| Part | English (revised) | Chinese translation |
+|---|---|---|
+| Overview: no joint coding; the two paired-end files are compressed independently | For synchronized paired-end FASTQ files, that is, the standard convention where the $k$-th read in R1 is the mate of the $k$-th read in R2, FastqCA does not implement a dedicated joint coding scheme between the two mate files. The R1 and R2 FASTQ files of a paired-end run are each supplied as an independent input, and compressed separately under exactly the same pipeline described above. | 对于已同步的双端 FASTQ 文件（即 R1 的第 $k$ 条 read 与 R2 的第 $k$ 条配对——这是标准约定），FastqCA 没有为两个配对文件实现专门的联合编码。R1 与 R2 两个 FASTQ 文件分别作为独立输入，按照上文所述完全相同的流程独立压缩。 |
+| How pairing is maintained: by order preservation | Because FastqCA preserves the original read order within every file, the $k$-th read in the decompressed R1 file is guaranteed to correspond to the $k$-th read in the decompressed R2 file, so the mate-pair correspondence is maintained by positional alignment alone and no auxiliary pairing index is required. | 由于 FastqCA 在每个文件内部都保留原始 read 顺序，解压后 R1 文件中第 $k$ 条 read 与 R2 文件中第 $k$ 条 read 必然一一对应，配对关系仅靠位置对齐即可维持，不需要任何额外的配对索引。 |
+
+---
+
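+## 7. Consistency check against the existing text
+
+| Existing claim in the paper | Line | Consistent with the new paragraph? |
+|---|---|---|
+| "reordering of reads may cause the mapping of paired-end sequencing data failed" | line 101 | ✅ The new paragraph turns this negative argument into a positive statement: FastqCA does not reorder, so pairing by position is preserved automatically |
+| "Preserving read order enables strictly streaming operations..." | line 108 | ✅ The second part of the new paragraph builds on this Section 1 statement as the basis for "positional alignment is pairing" |
+| `_1` / `_2` files listed as separate rows in Tables 2–6 | lines 456–502, etc. | ✅ Kept as an internal cross-check for the authors; not written into the new paragraph |
+| The CLI accepts only a single `--input_path` (source `main_new.py:49`) | n/a | ✅ The new paragraph matches the code as implemented and does not contradict the source |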
+
+---
+
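+## 8. New references
+
+**None needed**. The new paragraph rests entirely on facts already in the paper (line 101 / line 108 / Tables 2–6) and on the source code; it introduces no external literature.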
+
+---
+
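+## 9. Self-check list
+
+- [x] Responds directly to the reviewer's comment by stating in M&M how paired-end data are compressed: the two files are compressed independently, and pairing is maintained by order preservation.
+- [x] Consistent with the authors' verbal instruction: "paired-end data receive no special handling and are compressed directly".
+- [x] Fully consistent with the paper's existing line 101 / line 108 / Tables 2–6, and with the CLI design in the source `main_new.py:49`.
+- [x] Makes no commitment to add an "R1+R2 joint coding" implementation or extra paired-end experiments. The reviewer asked only to "explain in the manuscript", which the text now does; if a reviewer in the next round asks whether joint coding could further improve CR, whether to run that work can be discussed then.
+- [x] Provides the three views "original context / revised text / English-Chinese comparison", in the same format as existing revision documents such as `M&M-意见4-128MB-chunk-size-依据.md`.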
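
The positional-pairing argument in Sections 2 and 5 ("independent compression + order preservation ⇒ pairing maintained") can be checked mechanically. The following is a minimal sketch under stated assumptions, not FastqCA's pipeline: Python's `gzip` module stands in for any lossless, order-preserving codec, the helper names (`read_ids`, `mate_key`, `roundtrip`, `pairs_aligned`) are hypothetical, and the `/1` / `/2` read-ID suffix is only one common naming convention.

```python
import gzip


def read_ids(fastq_text):
    """Return the read IDs of a FASTQ text, in file order."""
    lines = fastq_text.splitlines()
    # Each FASTQ record is 4 lines; the header is the first and starts with '@'.
    return [lines[i][1:].split()[0] for i in range(0, len(lines), 4)]


def mate_key(read_id):
    """Strip a trailing /1 or /2 so both mates share one key (assumed convention)."""
    return read_id[:-2] if read_id.endswith(("/1", "/2")) else read_id


def roundtrip(fastq_text):
    """Compress and decompress one file on its own (gzip stand-in, NOT FastqCA)."""
    return gzip.decompress(gzip.compress(fastq_text.encode())).decode()


def pairs_aligned(r1_text, r2_text):
    """True if the k-th read of R1 and the k-th read of R2 are mates for every k."""
    ids1, ids2 = read_ids(r1_text), read_ids(r2_text)
    return len(ids1) == len(ids2) and all(
        mate_key(a) == mate_key(b) for a, b in zip(ids1, ids2)
    )


# Two tiny synchronized mate files (toy data, for illustration only).
R1 = "@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nIIII\n"
R2 = "@read1/2\nGGCA\n+\nIIII\n@read2/2\nCCAT\n+\nIIII\n"

# Each file goes through the codec independently; no joint structure is used.
r1_out = roundtrip(R1)
r2_out = roundtrip(R2)
aligned = pairs_aligned(r1_out, r2_out)
```

The check only compares read order after an independent round trip, so it applies to any codec that is lossless and order-preserving; a codec that reorders reads within a file would make `pairs_aligned` return `False` for otherwise synchronized inputs.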