M&M Comment 4: Rationale for the 128 MB default chunk size and the CR / speed / memory trade-off

1. Reviewer's original comment

As is the 128MB chunk size – why? Does the CR, speed, memory vary with chunk size?

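2. Revision approach (per your instructions: text-only response, no new experiments)

The reviewer's two sub-questions are answered together: why 128 MB, and how chunk size affects the three metrics.

Strategy: append one paragraph after the last paragraph of Section 2.2 (Partitioning) (FastqCA.tex line 135) that presents 128 MB as a default chosen by weighing three kinds of constraints. The paragraph states the trade-off direction explicitly (larger chunk → compression ratio ↑ / memory ↑ / speed ↓) and ties the value to facts that can be verified in the paper or the code: the input-size limit of LPAQ8, the per-thread working set, the typical deployment of a workstation with 32–64 GB RAM and 4–8 threads, and the fact that --block_size is exposed as a CLI parameter.

Order relative to the new paired-end paragraph: if the paired-end reads explanation from Results-意见8-paired-end-reads压缩方式.md is also adopted, this paragraph goes first and the paired-end paragraph second. The end of Section 2.2 then reads, in order: (i) the engineering trade-off behind the 128 MB chunk size; (ii) how paired-end R1/R2 are handled as two independent FASTQ inputs; (iii) \subsection{Identifier compression}. This order is the most natural, because "the same pipeline described above" in the paired-end paragraph then points to a chunking / partitioning description that is already complete.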

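Why the rationale holds up (for the authors' reference; not to be included in the paper):

Fact 1 (paper line 123): the authors themselves write "One disadvantage of LPAQ8 is that it cannot handle files that are too large". The back-end encoder has a hard upper bound on the size of a single input, so the chunk size must be capped.

Fact 2 (process_records and back_compress_worker in LossLess_thread.py / Lossy_thread.py): each chunk materializes several integer matrices in memory, $G$, $G_{\text{prime}}$ and Q (or Q'), plus the temporary TIFF files for LPAQ8, so the working set of a single chunk grows approximately linearly with the chunk size. The argparse block at the end of the code exposes --block_size as a CLI parameter (LossLess_thread.py:691), with a default of 128 * 1024 * 1024.

Fact 3 (RSS data already reported in Tables 3 / 5 / 6 of the paper): the RSS of a single FastqCA compression run falls in the 6–15 GB range. This is consistent with the expectation of "4 threads × a per-thread peak of about 1.5–4 GB" and supports the claim that the 128 MB setting is affordable on common hardware.

Fact 4 (Section 3.3 of the paper already uses 4 threads by default): this matches the process-pool granularity of back_compress_worker, multiprocessing.Pool(processes=max_workers). Too few chunks limit parallelism, while too many chunks cause repeated "cold starts" of context initialization. At 128 MB, typical inputs of 1–15 GB yield roughly 8–120 chunks, a reasonable range for both parallelism and efficiency.

These facts need not be spelled out in the paper, but the design trade-off described in the paper must point in the same direction as they do, so that it withstands a reviewer's cross-check against the source code or the numbers in the tables.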
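The arithmetic behind Facts 2 to 4 can be checked with a short script. This is a minimal sketch, not the code in LossLess_thread.py: only the flag name --block_size and its default of 128 * 1024 * 1024 come from this note, while the parser description, the help text, and the helper names chunk_count and memory_envelope are assumptions made for illustration. Sizes are taken as binary units (1 GB = 1024^3 bytes), which is what reproduces the 8–120 chunk range.

```python
import argparse

MIB = 1024 * 1024
GIB = 1024 * MIB


def build_parser() -> argparse.ArgumentParser:
    # Assumption: only the flag name and default value are taken from the note;
    # the type, description and help text are illustrative.
    parser = argparse.ArgumentParser(description="chunk-size sanity check")
    parser.add_argument("--block_size", type=int, default=128 * MIB,
                        help="chunk size in bytes")
    return parser


def chunk_count(input_bytes: int, block_size: int) -> int:
    """Number of chunks needed to cover the input (integer ceiling division)."""
    return (input_bytes + block_size - 1) // block_size


def memory_envelope(threads: int, per_thread_gb: tuple) -> tuple:
    """Scale a (low, high) per-thread peak estimate by the thread count."""
    low, high = per_thread_gb
    return threads * low, threads * high


# Empty argv, so the default block size applies.
args = build_parser().parse_args([])

print(chunk_count(1 * GIB, args.block_size))    # 8
print(chunk_count(15 * GIB, args.block_size))   # 120
print(memory_envelope(4, (1.5, 4.0)))           # (6.0, 16.0)
```

The last line gives a 6–16 GB envelope for 4 threads, which brackets the 6–15 GB RSS range cited in Fact 3.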

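3. Location of the change

After FastqCA.tex line 135 (that is, after "...before feeding them to the back-end compressor." at the end of Section 2.2), add one new paragraph.

If the new paired-end reads paragraph is added as well, this paragraph goes before it.

4. Original text (English LaTeX; context: end of the paragraph at lines 134–135)

% ---------- end of the paragraph at line 135 ----------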
Since the IDs, nucleotide sequences and quality scores in the reads have different characteristics, we split the reads into chunks of 128 MB by default and then partition them into streams of IDs, nucleotide sequences and quality scores. Our idea of partitioning compression is to extract the characteristics of different types of data streams and compress them separately before feeding them to the back-end compressor.

% ---------- ↑ insert the chunk-size paragraph from this file after the paragraph above ----------
% ---------- then insert the paired-end paragraph ----------
% ---------- then continue with Section 2.3 ----------

\subsection{Identifier compression}
...

5. Revised text (English LaTeX, new paragraph)

\hl{The default chunk size of 128\,MB is determined by an engineering trade-off. First, the back-end compressor LPAQ8 is not designed for arbitrarily large single-call inputs, which imposes a hard upper bound on the chunk size. Second, the chunk size governs the balance among compression ratio, peak memory and compression speed. A larger chunk yields a higher compression ratio, while raising peak memory and lowering speed. A smaller chunk produces the opposite effect. 128\,MB is the default value we adopted on the experimental platform reported in Section~3. It is also exposed as a command-line argument (\texttt{--block\_size}), so that users may adjust it according to their own hardware.}

6. Chinese reference translation (for the authors' review only; not to be included in the paper)

Part	English (revised)	Chinese translation
Overview sentence + constraint 1 (back-end upper bound)	The default chunk size of 128\,MB is a deliberate engineering compromise. First, the back-end coder LPAQ8 is not designed for arbitrarily large single-call inputs, which imposes a hard upper bound on the chunk size.	128\,MB 的默认 chunk 大小是经过权衡的工程折衷。第一，后端编码器 LPAQ8 并非为任意大输入设计，这一点对 chunk 大小施加了硬性上界。
Constraint 2: three-way trade-off among CR / memory / throughput (directional statement)	Second, the chunk size governs a trade-off between compression ratio, peak memory and compression throughput: larger chunks yield higher compression ratios while raising peak memory and lowering throughput, and smaller chunks invert this relationship.	第二，chunk 大小控制着压缩率、峰值内存、压缩吞吐三者之间的折衷：chunk 越大，压缩率越高，峰值内存越大，吞吐越低；chunk 越小则反之。
Closing sentence + tunable parameter	128\,MB is the default value we adopted on the experimental platform reported in Section~3, and is exposed as a command-line argument (\texttt{--block\_size}) so that users may adjust it to their own hardware.	128\,MB 是我们在 Section~3 报告的实验平台上采用的默认值，并通过命令行参数（\texttt{--block\_size}）对外暴露，便于用户按自身硬件调整。

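7. Consistency check against the existing text

Existing statement in the paper	Line	Consistent with the new paragraph?
"One disadvantage of LPAQ8 is that it cannot handle files that are too large"	line 123	✅ Corresponds to constraint 1 of the new paragraph
"we split the reads into chunks of 128 MB by default"	line 135	✅ The new paragraph follows immediately and elaborates on it
"All tools were executed on a standardized platform (Intel Core i9, 64 GB RAM)"	line 252	✅ The closing sentence is anchored to the experimental platform of Section~3 and no longer makes an overreaching "standard workstation" claim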

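8. New references

None needed. All statements rest on facts already in the paper and on the source-code implementation; no external literature is introduced.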

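9. Self-check list

The new paragraph directly answers the reviewer's two sub-questions: why 128 MB, and whether CR / speed / memory vary.
The three-way trade-off direction is stated explicitly: chunk ↑ → CR ↑, memory ↑, speed ↓ (consistent with the authors' instructions).
Every statement can be cross-checked against facts already in the paper (line 123, LPAQ8 limit; line 135, 128 MB default; line 477, 4 threads) and against the source code (the --block_size parameter and the way process_records materializes data), so a reviewer's cross-check of the source code or the table numbers will not turn up a contradiction.
No conflict with the new paired-end reads paragraph; when merged, this paragraph comes first, the paired-end paragraph second, followed by \subsection{Identifier compression}.
No proactive commitment to a chunk-size sweep experiment: the reviewer only asks "why" and "does X vary", which the narrative already answers. If the reviewer insists on data in the next round, a sweep can be added then.
The three columns "original context / revised text / Chinese–English comparison" are provided.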
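
If a later round does call for a chunk-size sweep, its shape could look like the sketch below. It is only an illustration under loud assumptions: zlib stands in for LPAQ8, the input is a synthetic repeated motif and not real FASTQ data, and sweep_chunk_sizes is a hypothetical helper, not a function from the FastqCA code. It measures compression ratio and wall time only; peak memory would need a separate measurement, and numbers from the real pipeline may behave differently.

```python
import random
import time
import zlib


def sweep_chunk_sizes(data: bytes, chunk_sizes: list) -> list:
    """Compress `data` in independent chunks; record ratio and wall time."""
    results = []
    for size in chunk_sizes:
        start = time.perf_counter()
        compressed_total = 0
        for offset in range(0, len(data), size):
            # Each chunk is compressed on its own, so no context is shared
            # across chunk boundaries.
            chunk = data[offset:offset + size]
            compressed_total += len(zlib.compress(chunk, 9))
        elapsed = time.perf_counter() - start
        results.append({
            "chunk_size": size,
            "ratio": len(data) / compressed_total,
            "seconds": elapsed,
        })
    return results


# Synthetic stand-in input: a random 1000-base motif repeated 256 times.
rng = random.Random(0)
motif = bytes(rng.choice(b"ACGT") for _ in range(1000))
data = motif * 256

results = sweep_chunk_sizes(data, [512, 4096, 65536])
for row in results:
    print(row)
```

On this synthetic input the ratio rises with chunk size, because chunks shorter than the motif period cannot exploit the repetition. A real sweep would replace the zlib call with the actual back-end and run on the datasets already used in the paper.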
