Fig. 2 | BMC Medical Informatics and Decision Making

Fig. 2

From: ParaMed: a parallel corpus for English–Chinese translation in the biomedical domain

The overall pipeline used to construct the ParaMed dataset. NEJM webpages were crawled using Selenium. Several preprocessing steps were carried out to standardize punctuation and remove boilerplate text. We tested two methods for splitting paragraphs into sentences and three methods for aligning English and Chinese sentences into translated pairs. Duplicate sentence pairs were removed at the end.
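
The punctuation standardization and final deduplication steps named in the caption could look roughly like the sketch below. It is illustrative only and is not the authors' code: the function names `normalize_punctuation` and `deduplicate_pairs` are hypothetical, and Unicode NFKC normalization is an assumed stand-in for whatever punctuation rules the pipeline actually applied.

```python
import unicodedata


def normalize_punctuation(text: str) -> str:
    """Fold full-width characters to their ASCII forms and collapse whitespace.

    Assumption: NFKC normalization is used here as a simple stand-in for the
    punctuation standardization step; the original pipeline may differ.
    """
    folded = unicodedata.normalize("NFKC", text)
    return " ".join(folded.split())


def deduplicate_pairs(pairs):
    """Drop repeated (English, Chinese) sentence pairs, keeping first occurrences.

    Pairs are compared on a normalized key, but the original strings are
    returned unchanged.
    """
    seen = set()
    unique = []
    for en, zh in pairs:
        key = (normalize_punctuation(en).lower(), normalize_punctuation(zh))
        if key not in seen:
            seen.add(key)
            unique.append((en, zh))
    return unique
```

Comparing on a normalized key while returning the original strings keeps the deduplication tolerant of spacing and full-width variants without altering the text that ends up in the corpus.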
