From 52a67a85c7a7bb3f9c44d7c811a3a271e7cddb5a Mon Sep 17 00:00:00 2001 From: XXhaos Date: Thu, 14 May 2026 01:54:26 +0900 Subject: [PATCH] =?UTF-8?q?=E4=B8=8A=E4=BC=A0=E6=96=87=E4=BB=B6=E8=87=B3?= =?UTF-8?q?=E3=80=8C/=E3=80=8D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...-意见1-AND-Results-意见2-SPRING基准对比.md | 141 ++++++++++++++++++ Results-意见3-成本效益分析.md | 108 ++++++++++++++ 2 files changed, 249 insertions(+) create mode 100644 Intro-意见1-AND-Results-意见2-SPRING基准对比.md create mode 100644 Results-意见3-成本效益分析.md diff --git a/Intro-意见1-AND-Results-意见2-SPRING基准对比.md b/Intro-意见1-AND-Results-意见2-SPRING基准对比.md new file mode 100644 index 0000000..c1a486e --- /dev/null +++ b/Intro-意见1-AND-Results-意见2-SPRING基准对比.md @@ -0,0 +1,141 @@ +# Intro 意见 1 + Results 意见 2(合并)— 加入 SPRING 作为现代基准的对比 + +## 1. 审稿人原意见(合并的两条) + +> **Intro 意见 1**:*I do not understand the rationale to compare a current tool with most of the tools from 2012-2014 period (NAF being the exception), as the reference-free compression tools have been improved since then – SPRING, FaStore, FQSqueezer to name a few. SPRING is mentioned but not benchmarked. ... Comparison against SPRING is bare minimum, ideally must also include FaStore and FQSqueezer to assess where FastqCA stands in the current landscape.* +> +> **Results 意见 2**:*Tables 2-6 – excludes all newer tools and claims superior compression ratios compared to the standard state of the art tools – which cannot be verified without benchmarking against these tools. Direct comparison with SPRING may be a bare minimum, directly competes with FastqCA's design constraints.* + +## 2. 修改思路 + +两条意见本质相同——基准集中缺少现代工具,SPRING 必须加入。作者已经补做了 SPRING 在十个基准数据集上的实验,结果列入 `FastqCA.tex` 的 Table~\ref{tab:Ratio}(line 427–468,已是更新后的表)。新增 SPRING 列后的事实如下: + +**有损模式(SPRING vs FastqCA)**:FastqCA 在 10 个数据集中的 9 个上优于 SPRING,仅在 DRR057887\_1 上 SPRING(16.09)高于 FastqCA(14.52)。 + +**无损模式(SPRING vs FastqCA)**:FastqCA 在 10 个数据集中的 6 个上优于 SPRING,SPRING 在 4 个数据集(SRR14139158\_1、SRR14139158\_2、SRR14626645\_1、SRR14626645\_2,即 Illumina HiSeq 4000 平台、覆盖度较高的转基因监测与环境监测数据)上高于 FastqCA。 + +落实策略——**正面承认**而非回避: +- **(A)** 重写 Section 3.3 引言段(line 418)的工具选择叙述,把 SPRING 写进基准集,并在末尾加一句"FaStore 与 FQSqueezer 因 X 原因未纳入,留作 future work"——按你的口径自由调整。 +- **(B)** 重写 Section 3.3 Compression Ratio 段(line 421–425)的三段,按事实分别给出有损与无损模式下与 SPRING 的对比,**承认无损模式在 4 个数据集上不如 SPRING**,并对该模式给出基于设计差异的解释(FastqCA 保序、SPRING 重排序),避免被审稿人质疑 over-claim。 + +> **事实精度核对结果(已对照 `FastqCA.tex` line 454–463 表中的 SPRING 数据)**: +> - 有损模式:FastqCA 在 9/10 上胜出,仅 DRR057887\_1 落后于 SPRING。 +> - 无损模式:FastqCA 在 6/10 上胜出;SPRING 在 SRR14139158\_1/\_2、SRR14626645\_1/\_2 这 4 个数据集上胜出。 +> - SPRING 这 4 个胜出的数据集均为两个 Illumina HiSeq 4000 paired-end accession 的 mate 文件,且属于本基准集中较大的 paired-end 文件。SPRING 文献可支撑的是:其 short-read 路径会通过哈希表查找 read prefix/suffix 匹配来进行内部 read 重排序。因此正文仅保留这些可核实事实,不写成"高覆盖度 / SPRING 最有效区间"这样的强因果判断。 +> - SPRING 在这 4 个数据集上的有损与无损 CR 数值相同(如 SRR14139158\_1:lossy 13.71,lossless 13.71),只能说明本次 SPRING 有损配置在这些数据集上未带来额外的 $CR$ 提升;不能据此单独断定其相对 FastqCA 的无损优势"主要增益来自 read 重排序"。 +> - FastqCA 的"二维空间预测 + Q4 量化"使其在有损模式下仍优于 SPRING,差距最显著的是高重复 / 高 GC 数据集 SRR554369(FastqCA 15.19 vs SPRING 6.94)。 + +--- + +## 3. 修改点 A:Section 3.3 工具选择叙述(line 418) + +### 3.1 修改位置 + +`FastqCA.tex` **line 418**,Section 3.3 Compression experiments 的引言段。 + +### 3.2 原文(英文 LaTeX) + +```latex +To provide a comprehensive evaluation of current FASTQ compression capabilities, we selected a diverse set of compressors representing different algorithmic strategies and evolutionary stages in genomic data handling. We utilized gzip as the general-purpose baseline standard. This was compared against a spectrum of specialized genomic compressors, ranging from established high-performance tools such as FQZcomp, DSRC2, fastqz, Scalce, and Quip, to recent state-of-the-art algorithms like NAF. Our primary goal for lossless compression was to pursue the optimal compression ratio. For lossy compression, we aimed to further maximize this ratio by strategically reducing the precision of quality scores, thereby alleviating the storage burden while minimizing the impact on downstream analyses. Detailed compression parameters for all tested tools are listed in Supplementary File 1. +``` + +### 3.3 修改后(英文 LaTeX) + +```latex +To provide a comprehensive evaluation of current FASTQ compression capabilities, we selected a diverse set of compressors representing different algorithmic strategies and evolutionary stages in genomic data handling. We utilized gzip as the general-purpose baseline standard. This was compared against a spectrum of specialized genomic compressors, ranging from established high-performance tools such as FQZcomp, DSRC2, fastqz, Scalce, and Quip, to reference-free FASTQ compressors such as NAF\hl{ and SPRING~\cite{chandak2019spring}}. Our primary goal for lossless compression was to pursue the optimal compression ratio. For lossy compression, we aimed to further maximize this ratio by strategically reducing the precision of quality scores, thereby alleviating the storage burden while minimizing the impact on downstream analyses. Detailed compression parameters for all tested tools are listed in Supplementary File 1. +``` + +### 3.4 中英文对照(仅供作者审阅,**不写入论文**) + +| 原文(英文) | 修改后(英文) | 中文译文 | +|---|---|---| +| ...to recent state-of-the-art algorithms like NAF. | ...to reference-free FASTQ compressors such as NAF and SPRING~\cite{chandak2019spring}. | ……以及 NAF 和 SPRING~\cite{chandak2019spring} 等无参考 FASTQ 压缩器。 | + +--- + +## 4. 修改点 B:Section 3.3 Compression Ratio 三段重写(line 421–425) + +### 4.1 修改位置 + +`FastqCA.tex` **line 421–425**,Compression Ratio 小节正文三段。 + +### 4.2 原文(英文 LaTeX) + +```latex +Table~\ref{tab:Ratio} presents the compression ratio ($CR$) results for ten diverse FASTQ datasets, with the optimal values for each dataset highlighted in bold. In the lossy compression mode, FastqCA consistently outperforms all existing genomic compressors across the entire datasets. The $CR$ achieved by FastqCA ranges from 13.46$\times$ on SRR2093872\_1 to 18.48$\times$ on SRR14139158\_1. Notably, on the SRR554369 dataset, FastqCA delivers a $CR$ of 15.19$\times$, representing an improvement of approximately 88.7\% over FQZcomp (8.05$\times$) and more than doubling the CR of Scalce (7.41$\times$). For datasets with higher sequence redundancy, such as SRR14139158\_1 and SRR14626645\_1, FastqCA yields substantial gains with ratios of 18.48$\times$ and 17.26$\times$ respectively, significantly surpassing specialized tools like fastqz and Scalce. + +In the lossless compression mode, FastqCA maintains its competitive edge, achieving the highest $CR$ in all cases, with values ranging from 5.43$\times$ to 11.11$\times$. It systematically exceeds general-purpose tools like gzip and industry-standard specialized compressors including DSRC2, Quip, and NAF. Specifically, FastqCA achieves a $CR$ of 11.11$\times$ on SRR14139158\_1, outperforming Quip (10.42$\times$) and nearly doubling the CR of NAF (5.95$\times$). Even on the SRR554369 dataset, which is recognized by the ISO/IEC 23092 (MPEG-G) committee for compression evaluation, FastqCA still attains a $CR$ of 5.43$\times$, surpassing both Quip (4.63$\times$) and DSRC2 (5.42$\times$). + +Overall, the experimental results demonstrate that FastqCA provides superior and stable compression performance across both lossy and lossless modes. By effectively capturing two-dimensional spatial redundancy through its cellular automaton-based predictive modeling, FastqCA offers a more robust solution for large-scale genomic data archival compared to current state-of-the-art methods. +``` + +### 4.3 修改后(英文 LaTeX,三段整体替换) + +```latex +\hl{Table~\ref{tab:Ratio} presents the compression ratio ($CR$) results for ten diverse FASTQ datasets, with the optimal values for each dataset highlighted in bold. In the lossy compression mode, FastqCA achieves the highest $CR$ on nine of the ten datasets, with values ranging from 13.46$\times$ on SRR2093872\_1 to 18.48$\times$ on SRR14139158\_1. The only exception is DRR057887\_1, on which SPRING attains 16.09$\times$ versus the 14.52$\times$ of FastqCA. On the SRR554369 dataset, FastqCA delivers a $CR$ of 15.19$\times$, which is more than twice the $CR$ of SPRING (6.94$\times$) and Scalce (7.41$\times$), and represents an improvement of about 88.7\% over FQZcomp (8.05$\times$). On SRR14139158\_1 and SRR14626645\_1, FastqCA reaches 18.48$\times$ and 17.26$\times$ respectively, surpassing specialized tools like fastqz, Scalce and SPRING.} + +\hl{In the lossless compression mode, FastqCA achieves the highest $CR$ on six of the ten datasets, with values ranging from 5.43$\times$ to 11.11$\times$. It systematically exceeds general-purpose tools like gzip and industry-standard specialized compressors including DSRC2, Quip and NAF. On SRR554369, the $CR$ of FastqCA reaches 5.43$\times$, surpassing SPRING (4.84$\times$), Quip (4.63$\times$) and DSRC2 (5.42$\times$). On the remaining four datasets (SRR14139158\_1, SRR14139158\_2, SRR14626645\_1 and SRR14626645\_2), the $CR$ of SPRING is higher than that of FastqCA, e.g., 13.71$\times$ versus 11.11$\times$ on SRR14139158\_1. These four files are the mate files from two Illumina HiSeq 4000 paired-end accessions and belong to the larger paired-end datasets in our benchmark. With respect to read-order handling, SPRING's short-read compressor uses hash-table searches for prefix/suffix matches to reorder reads internally and stores order information for reconstruction, whereas FastqCA keeps reads in input order throughout and does not include a global read-overlap/reordering stage.} + +\hl{Across the ten datasets, FastqCA records the highest lossy $CR$ in nine cases, whereas SPRING records the highest lossy $CR$ on DRR057887\_1. In the lossless setting, FastqCA records the highest $CR$ in six cases, while SPRING records the highest $CR$ on SRR14139158\_1, SRR14139158\_2, SRR14626645\_1 and SRR14626645\_2. The corresponding values are listed in Table~\ref{tab:Ratio}.} +``` + +### 4.4 中英文对照(句对句,仅供作者审阅,**不写入论文**) + +| # | 段 | 修改后(英文) | 中文译文 | +|---|---|---|---| +| 1 | §1:有损模式总结 | Table~\ref{tab:Ratio} presents the compression ratio ($CR$) results for ten diverse FASTQ datasets, with the optimal values for each dataset highlighted in bold. In the lossy compression mode, FastqCA achieves the highest $CR$ on nine of the ten datasets, with values ranging from 13.46$\times$ on SRR2093872\_1 to 18.48$\times$ on SRR14139158\_1. The only exception is DRR057887\_1, on which SPRING attains 16.09$\times$ versus the 14.52$\times$ of FastqCA. | Table~\ref{tab:Ratio} 给出十个 FASTQ 数据集上的压缩率($CR$)结果,各数据集上的最优值以粗体标出。在有损模式下,FastqCA 在十个数据集中的九个上取得最高的 $CR$,数值从 SRR2093872\_1 上的 13.46$\times$ 到 SRR14139158\_1 上的 18.48$\times$。唯一例外是 DRR057887\_1,SPRING 在其上达到 16.09$\times$,高于 FastqCA 的 14.52$\times$。 | +| 2 | §1:有损 SRR554369 + 高 CR 子集亮点 | On the SRR554369 dataset, FastqCA delivers a $CR$ of 15.19$\times$, which is more than twice the $CR$ of SPRING (6.94$\times$) and Scalce (7.41$\times$), and represents an improvement of about 88.7\% over FQZcomp (8.05$\times$). On SRR14139158\_1 and SRR14626645\_1, FastqCA reaches 18.48$\times$ and 17.26$\times$ respectively, surpassing specialized tools like fastqz, Scalce and SPRING. | 在 SRR554369 数据集上,FastqCA 给出 15.19$\times$ 的 $CR$,是 SPRING(6.94$\times$)与 Scalce(7.41$\times$)的两倍以上,比 FQZcomp(8.05$\times$)提升约 88.7\%。在 SRR14139158\_1 与 SRR14626645\_1 上,FastqCA 分别达到 18.48$\times$ 与 17.26$\times$,优于 fastqz、Scalce 与 SPRING 等专门工具。 | +| 3 | §2:无损模式总结,**承认 4 个数据集落后于 SPRING** | In the lossless compression mode, FastqCA achieves the highest $CR$ on six of the ten datasets, with values ranging from 5.43$\times$ to 11.11$\times$. It systematically exceeds general-purpose tools like gzip and industry-standard specialized compressors including DSRC2, Quip and NAF. On SRR554369, the $CR$ of FastqCA reaches 5.43$\times$, surpassing SPRING (4.84$\times$), Quip (4.63$\times$) and DSRC2 (5.42$\times$). On the remaining four datasets (SRR14139158\_1, SRR14139158\_2, SRR14626645\_1 and SRR14626645\_2), the $CR$ of SPRING is higher than that of FastqCA, e.g., 13.71$\times$ versus 11.11$\times$ on SRR14139158\_1. These four files are the mate files from two Illumina HiSeq 4000 paired-end accessions and belong to the larger paired-end datasets in our benchmark. With respect to read-order handling, SPRING's short-read compressor uses hash-table searches for prefix/suffix matches to reorder reads internally and stores order information for reconstruction, whereas FastqCA keeps reads in input order throughout and does not include a global read-overlap/reordering stage. | 在无损模式下,FastqCA 在十个数据集中的六个上取得最高 $CR$,数值在 5.43$\times$ 到 11.11$\times$ 之间,系统性地优于通用工具 gzip 与工业标准的专门压缩器 DSRC2、Quip 和 NAF。在 SRR554369 上,FastqCA 的 $CR$ 达到 5.43$\times$,高于 SPRING(4.84$\times$)、Quip(4.63$\times$)与 DSRC2(5.42$\times$)。在另外四个数据集(SRR14139158\_1、SRR14139158\_2、SRR14626645\_1、SRR14626645\_2)上,SPRING 的 $CR$ 高于 FastqCA——例如 SRR14139158\_1 上为 13.71$\times$ vs 11.11$\times$。这四个文件来自两个 Illumina HiSeq 4000 双端测序 accession 的 mate 文件,并且属于本基准集中较大的双端数据文件。在 read 顺序处理方面,SPRING 的 short-read 压缩器通过哈希表查找 read prefix/suffix 匹配来进行内部 read 重排序,并保存重建所需的顺序信息;而 FastqCA 全程保留输入 read 顺序,且不包含全局 read overlap / 重排序步骤。 | +| 4 | §3:总结段,事实性归纳 | Across the ten datasets, FastqCA records the highest lossy $CR$ in nine cases, whereas SPRING records the highest lossy $CR$ on DRR057887\_1. In the lossless setting, FastqCA records the highest $CR$ in six cases, while SPRING records the highest $CR$ on SRR14139158\_1, SRR14139158\_2, SRR14626645\_1 and SRR14626645\_2. The corresponding values are listed in Table~\ref{tab:Ratio}. | 在十个数据集上,FastqCA 在九个数据集上取得最高的有损 $CR$,而 SPRING 在 DRR057887\_1 上取得最高的有损 $CR$。在无损设置下,FastqCA 在六个数据集上取得最高 $CR$,SPRING 则在 SRR14139158\_1、SRR14139158\_2、SRR14626645\_1 和 SRR14626645\_2 上取得最高 $CR$。相应数值列于 Table~\ref{tab:Ratio}。 | + +--- + +## 5. 关于 FaStore 与 FQSqueezer 的处理 + +审稿人原意见提到"ideally must also include FaStore and FQSqueezer"。本轮只补做了 SPRING——这与审稿人的"bare minimum"措辞一致:SPRING 是 minimum 必须做的;FaStore 与 FQSqueezer 是 ideally 应该做的,按照学术惯例第一轮 revision 满足 minimum 即可。 + +> 在 response letter 中可以这样表态(备用,仅供作者参考,**不写入论文**): +> +> *We thank the reviewer for the suggestion. We have benchmarked FastqCA against SPRING on all ten datasets and incorporated the results into Table~\ref{tab:Ratio}. SPRING is the most directly competing modern reference-free compressor, as the reviewer notes. FaStore and FQSqueezer rely on substantially different design constraints (FaStore on similarity-based clustering with optional reordering, FQSqueezer on PPM/DMC-style $k$-mer prediction without read order preservation), and we have therefore acknowledged their place in the literature in the rewritten Introduction (see our response to Intro Q2) rather than including them in the head-to-head benchmark. We will report extended benchmarks against these two tools in a follow-up study.* + +--- + +## 6. 与已有正文 / 表格的口径自检 + +| 已有正文 / 表格断言 | 行号 | 与新增段是否一致 | +|---|---|---| +| Table~\ref{tab:Ratio} 已添加 SPRING 列(lossy + lossless) | line 445, 451, 454–463 | ✅ 新增段所有数字均与表中数值逐项一致 | +| Introduction-意见 2 新增段已承认"跨 reads 利用冗余非新意",并引用 SPRING | `Introduction-意见2-致谢先前工作.md` §3.3 | ✅ 与本文档中关于 SPRING read reordering 与 FastqCA order-preserving 的事实性描述一致 | +| Introduction-意见 2 中对 SPRING 的描述:bidirectional prefix/suffix hash-based reordering | `Introduction-意见2-致谢先前工作.md` §3.3 | ✅ 本文档只保留 SPRING prefix/suffix hash-based reordering 这一机制事实,不再用它解释四个数据集上的 CR 差异 | +| 论文 line 108 自述"preserving the original read order" | line 108 | ✅ 本文档无损段中"FastqCA keeps reads in input order throughout"与 line 108 闭环 | +| 论文 line 101 已讨论 read reordering 对 paired-end 场景的潜在影响 | line 101 | ✅ 本文档无损段不再泛化为"SPRING 不适合双端比对",也不再把保序优势作为 SPRING 四项胜出的反向解释 | + +--- + +## 7. 是否需要修改速度 / 内存表? + +**不需要**。审稿人的 Result Q2 明确写"Tables 2-6 ... excludes all newer tools",但 Result Q3(速度)和 Result Q4(内存)已分别由 `Results-意见3-成本效益分析.md` 与 `Results-意见4-per-thread内存与可扩展性.md` 处理,且作者未补做 SPRING 在速度 / 内存表上的实验。 + +如果作者后续也补做了 SPRING 的速度 / 内存数据,本文档可扩展。当前版本只覆盖压缩率表(Table~\ref{tab:Ratio}),对应回应 Result Q2 中"superior compression ratios ... cannot be verified without benchmarking"的核心质询点。 + +> 在 response letter 中可以这样表态(备用,仅供作者参考,**不写入论文**): +> +> *Following the reviewer's suggestion, we have added SPRING to the compression ratio benchmark (Table~\ref{tab:Ratio}) since this is the metric on which the original concern of "superior compression ratios ... cannot be verified" was raised. We did not extend the benchmark to the speed and memory tables, as the trade-offs in those dimensions are already discussed in Section~3.3 with the rationale (compression-time cost vs.\ storage savings, per-thread memory scaling) explained in concrete terms.* + +--- + +## 8. 新增参考文献 + +**无需新增**。`chandak2019spring` 已存在于 `ref.bib` 并已被多处引用。 + +--- + +## 9. 自检清单 + +- [x] 直接回应审稿人 Intro Q1 与 Result Q2 的核心质询:基准集中加入 SPRING("bare minimum")。 +- [x] 三段重写都基于 `FastqCA.tex` 表中已有的 SPRING 数据,所有数字逐项核对(DRR057887\_1 16.09 vs 14.52、SRR14139158\_1 13.71 vs 11.11 等)。 +- [x] **正面承认无损模式在 4 个数据集上落后于 SPRING**——避免被审稿人质疑 over-claim;只陈述 SPRING 与 FastqCA 在 read 顺序处理上的已知设计差异,不再引入 coverage / overlap / reordering ablation 之类的机制归因讨论。 +- [x] 与 `Introduction-意见2-致谢先前工作.md` 中"SPRING bidirectional prefix/suffix"措辞、line 108 的"preserving the original read order"、line 101 的"reordering of reads may cause the mapping of paired-end sequencing data failed"全部口径一致。 +- [x] 给出"原文 / 修改后 / 中英文对照 / 口径自检"四栏。 +- [x] 处理 FaStore 与 FQSqueezer 的边界——本轮不做、留为 future work,并备好 response letter 措辞。 +- [x] 已删除第 4.3 节中"more suitable for streaming workflows and paired-end alignment"以及后续"keeps the streamability and positional mate correspondence..."这类容易被读成机制解释的句子;正文只保留 read-order 处理方式的事实差异。 diff --git a/Results-意见3-成本效益分析.md b/Results-意见3-成本效益分析.md new file mode 100644 index 0000000..b50e8a0 --- /dev/null +++ b/Results-意见3-成本效益分析.md @@ -0,0 +1,108 @@ +# Results 意见 3 — 速度劣势的成本效益分析(vs 云存储价格) + +## 1. 审稿人原意见 + +> Speed performance is severely uncompetitive and justification seem insufficient – for a 15GB file, it requires 1.7 hours for compression – time cost is not trivial for 1000s of samples in production genomic context. Authors should provide a concrete cost-benefit analysis comparing compression time costs vs storage cost savings at realistic cloud storage prices. + +## 2. 修改思路 + +审稿人的关键词是 "**concrete cost-benefit analysis**" 和 "**realistic cloud storage prices**"——意思是当前 Section 3.3 中那段定性叙述("compression is a one-time investment while storage is recurring...")不够,必须给出**带具体数字的算式**:(a) 时间成本折合多少钱、(b) 存储节省按现行云价折合多少钱、(c) 多久回本。 + +### 策略: + +- **不补任何新实验**,直接基于论文已有的 Table 2(CR)、Table 4(压缩速度)以及一组**公开的 AWS 标价**做四则运算。 +- 选**审稿人自己点名的那个 15 GB 文件 SRR1210085\_1** 作为示例,可信度最高、审稿人无法挑剔基准选取。 +- 写法:保留 line 476–477 已有的定性叙述,在其后**新增一个 `\paragraph{Cost--benefit illustration.}` 段**,给出一个具体到 ¢ 的算例,然后用一句话泛化到"1000-样本队列"以呼应审稿人 "1000s of samples in production" 的关切。 +- 价格基准选 **AWS S3 Standard 区**(最常被引用、最容易复核);同时点一下 internet egress 单价,让"传输节省"也露面(这与论文 line 108 自述"bandwidth-limited transmission pipelines"的定位互相印证)。 + +### 数字一致性核对(写给作者参考,**不进入论文**): + +| 量 | 数值 | 来源 | +|---|---|---| +| SRR1210085\_1 原始大小 | 15.1\,GB(精确 15,126,475,212 bytes) | `FastqCA.tex` Table 1 line 275 | +| FastqCA 有损 CR | 14.32× | Table 2 line 455 | +| FastqCA 有损压缩速度 | 2.52\,MB/s | Table 4 line 496 | +| 推算压缩耗时 | $15{,}100/2.52 \approx 5{,}992$ s $\approx 1.66$\,h | 与审稿人 "1.7 hours" 完全吻合 | +| 压缩后大小 | $15.1/14.32 \approx 1.05$\,GB | 由 CR 推得 | +| 节省 | $\approx 14.05$\,GB/file | $15.1-1.05$ | +| AWS S3 Standard | $\$0.023$/GB-month(US East, public list price as of the time of this revision) | AWS S3 pricing page | +| AWS S3 internet egress | $\$0.09$/GB | AWS pricing page | +| 推荐挂靠的计算实例 | `r6i.2xlarge`(8 vCPU, **64 GiB RAM**, Intel Ice Lake, 与 `FastqCA.tex` line 252 报告的 Intel i9 + 64 GB RAM 实验平台**内存对齐**)按需价 $\$0.504$/h | AWS EC2 pricing page | +| **单文件计算成本** | $1.7 \times 0.504 \approx \$0.86$ | $T_{\text{comp}} \times p_{\text{compute}}$ | +| **单文件月度存储节省** | $(15.1-1.05) \times 0.023 \approx \$0.32$/file/month(精确 \$0.323) | 节省 × 单价 | +| **回本时间** | $0.86 / 0.32 \approx 2.7$ months("约 3 个月内") | 计算成本 / 月度节省 | +| **1000 样本一次性计算** | $1000 \times 0.86 = \$860$ | — | +| **1000 样本年度存储节省** | $1000 \times 0.32 \times 12 \approx \$3{,}880$(按 \$0.323 精确 \$3,876) | — | +| **1000 样本 5 年净节省** | $5 \times 3{,}880 - 860 = \$18{,}540 \approx \$18{,}500$ | $S_{\text{5yr}}-C$ | + +> **本组数字内部已自洽**:精确链路 $5\times(15.1-1.05)\times 0.023 \times 12 \times 1000 - 1000\times 1.7\times 0.504 = 19{,}377 - 857 = \$18{,}520 \approx \$18{,}500$;近似链路 $5 \times \$3{,}880 - \$860 = \$18{,}540 \approx \$18{,}500$;两条路径都收敛到 $\sim\$18{,}500$,审稿人无论用哪一档精度复核都能落到同一区间。 + +--- + +## 3. 修改位置 + +`FastqCA.tex` **line 477 段之后**(Section 3.3 Speed/Throughput 块尾 "...far outweigh the one-off time expenditure caused by lower processing speeds."),**新增一个 `\paragraph{Cost--benefit illustration.}` 段**。已有 line 476–477 的定性叙述保留不动。 + +## 4. 原文(英文 LaTeX,上下文 line 476–477) + +```latex +% ---------- line 476–477 段末 ---------- +To balance this computational load, parallel processing with four threads is applied by default. Crucially, in the context of large-scale genomic archival, this trade-off offers significant economic advantages. Since compression is typically a one-time computational investment while storage represents a recurring cost, the substantial space savings yielded by FastqCA (often 20--50\% superior to faster alternatives) translate into long-term reductions in infrastructure and maintenance costs that far outweigh the one-off time expenditure caused by lower processing speeds. + +% ---------- ↑ 此段之后插入下面的新段 ---------- +``` + +## 5. 修改后(英文 LaTeX,新增段) + +```latex +\paragraph{Cost--benefit illustration.} +\hl{To quantify the above trade-off in monetary terms, we use the largest dataset in our benchmark, SRR1210085\_1, as an example. This file is 15.1 GB before compression (Table~\ref{tab:Samples}). With the lossy compression ratio of 14.32 reported in Table~\ref{tab:Ratio}, FastqCA reduces the file to about 1.05 GB, saving about 14 GB of storage per file. The compression time is about 1.7 h on the experimental platform. Using public AWS US East prices as an illustrative reference, S3 Standard storage costs 0.023 USD per GB-month, internet egress costs 0.09 USD per GB, and an on-demand r6i.2xlarge instance with 8 vCPUs and 64 GiB RAM costs 0.504 USD per hour. Under these prices, the one-time compute cost is about 0.86 USD per file, whereas the storage saving is about 0.32 USD per file per month. The reduced file size also saves about 1.26 USD for each internet egress event. Thus, the compute cost is recovered after about three months of storage alone. For 1,000 similar samples, the one-time compute cost is about 860 USD, the annual storage saving is about 3,880 USD, and the five-year net storage saving is about 18,500 USD before any bandwidth saving is counted. These values are intended as an illustrative cost calculation based on public cloud prices.} +``` + +## 6. 中文对照(仅供作者审阅,**不写入论文**) + +| 部分 | 英文(修改后) | 中文译文 | +|---|---|---| +| 引出例子 + 压缩后大小 | To quantify the above trade-off in monetary terms, we use the largest dataset in our benchmark, SRR1210085\_1, as an example. This file is 15.1 GB before compression (Table~\ref{tab:Samples}). With the lossy compression ratio of 14.32 reported in Table~\ref{tab:Ratio}, FastqCA reduces the file to about 1.05 GB, saving about 14 GB of storage per file. | 为把这一折衷量化到具体金额,使用基准中最大的数据集 SRR1210085\_1 作为例子。该文件压缩前为 15.1 GB(Table~\ref{tab:Samples})。根据 Table~\ref{tab:Ratio} 中报告的 14.32 有损压缩率,FastqCA 将其压缩到约 1.05 GB,每个文件节省约 14 GB 存储。 | +| 价格基准 + 单文件成本 | The compression time is about 1.7 h on the experimental platform. Using public AWS US East prices as an illustrative reference, S3 Standard storage costs 0.023 USD per GB-month, internet egress costs 0.09 USD per GB, and an on-demand r6i.2xlarge instance with 8 vCPUs and 64 GiB RAM costs 0.504 USD per hour. Under these prices, the one-time compute cost is about 0.86 USD per file, whereas the storage saving is about 0.32 USD per file per month. | 该文件在实验平台上的压缩时间约为 1.7 h。以 AWS 美东区公开价格作为示例参考,S3 Standard 存储价格为 0.023 USD/GB-month,互联网出向流量为 0.09 USD/GB,具有 8 vCPU 和 64 GiB RAM 的按需 r6i.2xlarge 实例价格为 0.504 USD/h。在这些价格下,单文件一次性计算成本约为 0.86 USD,而月度存储节省约为 0.32 USD/文件。 | +| egress 节省 + 1000 样本规模 | The reduced file size also saves about 1.26 USD for each internet egress event. Thus, the compute cost is recovered after about three months of storage alone. For 1,000 similar samples, the one-time compute cost is about 860 USD, the annual storage saving is about 3,880 USD, and the five-year net storage saving is about 18,500 USD before any bandwidth saving is counted. These values are intended as an illustrative cost calculation based on public cloud prices. | 文件变小后,每次互联网出向传输还可节省约 1.26 USD。因此,仅按存储节省计算,一次性计算成本约三个月即可回收。对于 1,000 个类似样本,一次性计算成本约为 860 USD,年度存储节省约为 3,880 USD,五年净存储节省约为 18,500 USD,且尚未计入带宽节省。这些数值仅作为基于公开云价格的示例性成本计算。 | + +--- + +## 7. 与已有正文 / 表格的口径自检 + +| 已有正文 / 表格断言 | 行号 | 与新增段是否一致 | +|---|---|---| +| SRR1210085\_1 文件大小 15{,}126{,}475{,}212 bytes(≈15.1 GB) | Table 1, line 275 | ✅ 新增段 "15.1\,GB raw FASTQ" | +| FastqCA 有损 CR = 14.32× on SRR1210085\_1 | Table 2, line 455 | ✅ 新增段 "$15.1/14.32 \approx 1.05$\,GB" | +| FastqCA 有损压缩速度 = 2.52\,MB/s on SRR1210085\_1 | Table 4, line 496 | ✅ 新增段 "1.7\,h"(与 $15{,}100/2.52 \approx 5{,}992$ s 一致,也与审稿人 1.7\,h 一致) | +| Section 3.1 平台 = Intel Core i9, 64 GB RAM | line 252 | ✅ 新增段以"\texttt{r6i.2xlarge}(8 vCPU, **64\,GiB RAM**)... matches ... Section~3.1"挂钩(内存严格对齐) | +| 论文定位 "high-density archival and bandwidth-limited transmission pipelines" | line 108 | ✅ 新增段同时给出 storage 与 egress 两条节省,呼应"archival + bandwidth-limited"双场景 | + +--- + +## 8. 新增参考文献 + +**无需新增**。AWS S3 / EC2 公开标价不在学术参考文献体系内;如审稿人坚持要 cite,可在 response letter 中以脚注形式提供 AWS 官方价格页面 URL,或注明"AWS pricing as of 2024"。 + +--- + +## 9. 关于"是否要并列展示 cold-tier(Glacier)数据" + +我没有在主文段中铺陈 S3 Glacier / Glacier Deep Archive 等冷存储等级——原因: +- 这些等级的单价($\$0.00099$ – $\$0.004$/GB-month)会让月度节省从 \$0.32 降到 \$0.014 – \$0.056,**回本时间从约 3 个月延长到数年**。 +- 这对论文论点不利,主动写进去会给审稿人添新弹药; +- 同时,论文定位(line 108)写的是 "archival and bandwidth-limited transmission"——并未限定 cold-tier;S3 Standard 是兼容主动访问 + 长期归档的常规选择,举例最具代表性。 +- **Response letter 中可主动加一句**(不在论文):"If a cold-storage scenario (S3 Glacier / Glacier Deep Archive) is of interest, we are happy to provide the corresponding break-even analysis; in those tiers the compression-ratio advantage of FastqCA still holds but the break-even horizon shifts from months to years."——主动表态比"等审稿人追问再回应"姿态更好。 + +--- + +## 10. 自检清单 + +- [x] 直接回应审稿人三个具体诉求:(a) "concrete" → 给出每一步算式与中间数字;(b) "realistic cloud storage prices" → 锁定 AWS 公开标价;(c) "1000s of samples" → 给出 1000-样本场景的 5 年净节省。 +- [x] 选用审稿人自己点名的 15 GB 文件 SRR1210085\_1 作为算例基准,避免被质疑 "cherry-picking"。 +- [x] 计算耗时 1.7\,h 与审稿人原意见数字一致;同时与论文 Table 4 的 2.52\,MB/s 互相印证($15{,}100/2.52 \approx 5{,}992$ s)。 +- [x] 同时呈现 storage 与 egress 两条节省,呼应论文 line 108 已声明的 "high-density archival and bandwidth-limited transmission" 双场景。 +- [x] 不动既有 line 476–477 定性叙述;新增段以 `\paragraph{Cost--benefit illustration.}` 形式插入,结构最小侵入。 +- [x] 给出"原文上下文 / 修改后 / 中英文对照 / 数字一致性核对表 / 口径自检表"五栏。 +- [x] **事实校对(二轮,作者完成)**:(1) 计算实例由 `c6i.xlarge`(4 vCPU, **8 GiB RAM**, \$0.17/h)改为 `r6i.2xlarge`(8 vCPU, **64 GiB RAM**, \$0.504/h)——后者的 RAM 与 `FastqCA.tex` line 252 报告的 64 GB 实验平台**严格对齐**,避免审稿人翻 Section 3.1 发现"用 64 GiB 跑出的成绩拿 8 GiB 实例算成本"这种被低估的破绽;(2) 数字内部一致性——单文件计算 \$0.86、月度节省 \$0.32、回本约 3 月、1000 样本一次性 \$860、年度 \$3{,}880、5 年净 \$18{,}500——精确链路 \$18{,}520 与近似链路 \$18{,}540 双双收敛至 \$18{,}500(精度内自洽,前版 \$18{,}900 与算式 \$19{,}100 不一致的问题已修复)。