Fig. 2
From: ParaMed: a parallel corpus for English–Chinese translation in the biomedical domain
The overall pipeline to construct the ParaMed dataset. NEJM webpages were crawled using Selenium. Various preprocessing steps were carried out to standardize punctuation and remove boilerplate text. We tested two methods for splitting paragraphs into sentences and three methods for aligning English and Chinese sentences into translated pairs. Duplicate sentence pairs were removed at the end.
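
Two of the steps named in the caption, punctuation standardization and duplicate removal, can be illustrated with a minimal Python sketch. It is not taken from the ParaMed code: the punctuation table and the function names `normalize_english` and `deduplicate_pairs` are hypothetical, and it assumes sentence pairs are already aligned as (English, Chinese) tuples.

```python
# Hypothetical table: full-width punctuation that may leak into English text,
# mapped to ASCII equivalents. The real pipeline's rules may differ.
FULLWIDTH_TO_ASCII = str.maketrans({
    "，": ",",
    "。": ".",
    "：": ":",
    "；": ";",
    "（": "(",
    "）": ")",
    "“": '"',
    "”": '"',
})


def normalize_english(text: str) -> str:
    """Replace full-width punctuation with ASCII and collapse whitespace."""
    return " ".join(text.translate(FULLWIDTH_TO_ASCII).split())


def deduplicate_pairs(pairs):
    """Drop repeated (English, Chinese) pairs, keeping first occurrences in order.

    Pairs are compared after normalization, so two pairs that differ only in
    punctuation width or spacing count as duplicates.
    """
    seen = set()
    unique = []
    for en, zh in pairs:
        key = (normalize_english(en), " ".join(zh.split()))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    sample = [
        ("Hello， world。", "你好，世界。"),
        ("Hello, world.", "你好，世界。"),
        ("Next.", "下一个。"),
    ]
    for en, zh in deduplicate_pairs(sample):
        print(en, "|", zh)
```

Normalizing before comparing is a design choice made for this sketch: it lets near-identical pairs collapse into one, at the cost of returning the normalized English text rather than the raw input.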