
Improving the quality of Persian clinical text with a novel spelling correction system

Abstract

Background

The accuracy of spelling in Electronic Health Records (EHRs) is a critical factor for efficient clinical care, research, and ensuring patient safety. The Persian language, with its abundant vocabulary and complex characteristics, poses unique challenges for real-word error correction. This research aimed to develop an innovative approach for detecting and correcting spelling errors in Persian clinical text.

Methods

Our strategy employs a state-of-the-art pre-trained model that has been meticulously fine-tuned specifically for the task of spelling correction in the Persian clinical domain. This model is complemented by an innovative orthographic similarity matching algorithm, PERTO, which uses visual similarity of characters for ranking correction candidates.

Results

The evaluation of our approach demonstrated its robustness and precision in detecting and rectifying word errors in Persian clinical text. In terms of non-word error correction, our model achieved an F1-Score of 90.0% when the PERTO algorithm was employed. For real-word error detection, our model demonstrated its highest performance, achieving an F1-Score of 90.6%. Furthermore, the model reached its highest F1-Score of 91.5% for real-word error correction when the PERTO algorithm was employed.

Conclusions

Despite certain limitations, our method represents a substantial advancement in the field of spelling error detection and correction for Persian clinical text. By effectively addressing the unique challenges posed by the Persian language, our approach paves the way for more accurate and efficient clinical documentation, contributing to improved patient care and safety. Future research could explore its use in other areas of the Persian medical domain, enhancing its impact and utility.


Introduction

Spelling correction is a vital task in all text processing environments, and its importance is amplified for languages with intricate morphology and syntax, such as Persian. This significance is further heightened in the realm of clinical text, where precise documentation is a cornerstone of effective patient care, research, and patient safety. The written text of medical findings remains the essential source of information for clinical decision making. Clinicians prefer to write unstructured text rather than fill out structured forms when documenting progress notes, due to time and efficiency constraints [1]. The quality and safety of health care depend on the accuracy of clinical documentation [2]. However, misspellings often occur in clinical texts because they are written under time pressure [3].

The process of spelling correction primarily tackles two types of errors: non-word errors, which are nonsensical strings not found in a dictionary, and real-word errors, which are correctly spelled words used inappropriately in context. These errors can stem from various sources, including typographical mistakes, confusion between words with similar sounds or meanings [4], incorrect replacements by automated systems such as AutoCorrect features [5], and misinterpretation of input by ASR and OCR systems [6,7,8,9].

The Persian language, with its rich vocabulary and complex properties, presents unique challenges for real-word error correction. Features unique to Persian such as homophony (words that are pronounced identically yet carry distinct meanings), polysemy (words with multiple meanings), heterography (words that share identical spelling but their meanings vary based on how they are pronounced), and word boundary issues contribute to this complexity.

Despite these challenges, numerous efforts have been made to develop both statistical and rule-based approaches for identifying and rectifying both classes of errors in the general Persian text domain; however, the work in the Persian medical domain and specifically the Persian clinical text is very limited. Moreover, these methods have attained only limited success. In this study, we introduce an innovative method to detect and correct word errors in Persian clinical text, aiming to significantly improve the accuracy and reliability of healthcare documentation. Our key contributions include:

  • Language Representation Model: We showcase a pre-trained language representation model that has undergone meticulous fine-tuning, specifically for the task of spelling correction in the Persian clinical domain.

  • PERTO Algorithm: We introduce an innovative orthographic similarity matching algorithm that leverages the visual resemblance of characters to prioritize correction candidates.

We utilize the F1-score metric to evaluate and contrast our methodology with established approaches for detecting and rectifying both non-word and real-word errors within the context of Persian clinical text.

The rest of this paper is structured as follows: We commence with a review of prior research in the field. Following this, we delve into the challenges faced in Persian language text processing. Subsequently, we outline our proposed approach. Evaluation and experiment results are then presented and discussed. In the final segment, we summarize our findings.

Related works

Automatic word error correction is a crucial component in NLP systems, particularly in the context of EHR and clinical reports. Early techniques were based on edit distance and phonetic algorithms [10,11,12,13]. The incorporation of context information has been demonstrated to be effective in boosting the efficiency of auto-correction systems [14]. Contextual measures like semantic distance and noisy channel models based on N-grams have been employed across numerous NLP applications [4, 5, 15,16,17]. A novel approach was also developed to correct multiple context-sensitive errors in excessively noisy situations [18]. Dashti developed a model that addressed the identification and automatic correction of context-sensitive errors in cases where more than one error existed in a given word sequence [19].

Cutting-edge methods in NLP systems utilize context information through neural word or sense embeddings for spelling correction [20]. Pretrained contextual embeddings have been used to detect and rectify context-sensitive errors [21]. The issue of spelling correction has been addressed using deep learning techniques for various languages in recent years. For example, a study in 2020 proposed a deep learning method to correct context-sensitive spelling errors in English documents [22]. Another work developed a BERT-Based model for the same purpose [23]. NeuSpell is a user-friendly neural spelling correction toolkit that offers a variety of pre-trained models [24]. SpellBERT is a lightweight pre-trained model for Chinese spelling check [25]. A disentangled phonetic representation approach for Chinese spelling correction was proposed [26]. Other approaches for Chinese spelling correction utilized phonetic pre-training [27]. An innovative approach was devised specifically for the purpose of contextual spelling correction within comprehensive speech recognition systems [28]. A dual-function framework for detecting and correcting spelling errors in Chinese was proposed [29]. Liu and colleagues proposed a method, known as CRASpell, which is resilient to contextual typos and has been developed to enhance the process of correcting spelling errors in Chinese [30]. AraSpell is an Arabic spelling correction approach that utilized a Transformer model to understand the connections between words and their typographical errors in Arabic [31].

In the realm of healthcare, the application of spelling correction techniques has been instrumental in expanding acronyms and abbreviations, truncating, and rectifying misspellings. It has been observed that such instances constitute up to 30% of clinical content [32]. In the last twenty years, a significant amount of research has been conducted on spelling correction methods specifically designed for clinical texts [1]. The majority of these studies have primarily focused on EHR [33], while a few have explored consumer-generated texts in healthcare [34, 35].

Several noteworthy contributions in this field include the French clinical record spell checker introduced by Ruch and colleagues, which boasts a correction rate of up to 95% [36]. Siklósi and his associates devised a system that is aware of context for Hungarian clinical text, which is grounded on statistical machine translation, and it attained an accuracy rate of 87.23% [37]. Grigonyte and her research team introduced a system tailored for Swedish clinical text, achieving a precision of 83.9% and a recall rate of 76.2% [38].

Zhou and colleagues leveraged the Google spell checker to develop a system capable of accurately correcting 86% of typographical and linguistic inaccuracies found in routine medical terminologies [35]. Another study deliberated on a spelling correction system that was referenced in reports concerning the safety of vaccines, with recall and precision rates of 74% and 47%, respectively [39]. Wong and his team have designed a system that operates in real-time to rectify spelling errors in clinical reports, achieving an accuracy of 88.73%. This system leverages the power of semantic and statistical analysis applied to web data for the purpose of automatic correction [1]. Doan and his research team presented a system, specifically designed for the rectification of misspellings in drug names. This system, which is based on the Aspell algorithm, reported a commendable precision rate of 80% [40].

Among the recent contributions is an article by Lai and colleagues proposing a system for automatic spelling correction in medical texts, employing a noisy channel model to achieve significant accuracy [41]. Similarly, unsupervised, context-aware models have shown promise in correcting spelling errors in English and Dutch clinical unstructured texts [42, 43].

While these advancements have significantly improved spelling correction across languages and domains, recent innovations in BCIs, eye-tracking, VR/AR, and non-invasive EEG technologies open new avenues for further enhancing human–computer interaction and the accuracy of medical documentation [44,45,46,47]. These technologies, through their unique capabilities to interact directly with the user's cognitive states and attention, offer potential solutions to some of the inherent limitations of current NLP systems in understanding and correcting complex, context-sensitive errors in clinical texts. As the field continues to evolve, integrating these cutting-edge technologies into spelling correction tools for medical documentation could revolutionize the way healthcare professionals interact with digital text, making the process more efficient, accurate, and tailored to their specific needs.

In addition, the emergence and application of OCR technology in the healthcare sector over the past twenty years has led to the creation of several systems designed to detect and correct OCR errors automatically. One such system, described in [48], identifies and rectifies typographical errors in French clinical documents. In a more recent study, Tran and colleagues proposed a context-sensitive model for spelling correction in clinical text [49].

Despite the complexities inherent in the Persian language, substantial progress has been made in the field of spelling correction. The strategies employed range from statistical or rule-based methods to more contemporary systems, such as the Vafa spellchecker, which is capable of detecting a wide variety of errors. Mosavi and Miangah have addressed spelling issues in the Persian language using N-grams, a monolingual corpus, and a measure of string distance [50,51,52,53,54,55,56,57]. Among these methodologies, only one focuses on correcting typographical errors in clinical text, utilizing a four-gram language model; the need for a Persian spell-checking tool in specialized domains such as healthcare therefore remains clear.

Given the variety of methodologies and their targeted applications in spelling correction, we provide Table 1 below to efficiently summarize the key contributions within the medical domain and Persian language spelling correction models. This comparative analysis not only illuminates the range of strategies employed to address spelling correction challenges across diverse languages and contexts but also underlines the distinctive features of each method. In doing so, it enhances our understanding of the current research landscape in this field, spotlighting the innovative approaches and shedding light on the potential avenues for future exploration.

Table 1 Comparative analysis of spelling correction models across languages with a focus on the medical domain

Persian spelling challenges

Persian, alternatively referred to as Farsi, belongs to the Indo-Iranian subgroup of the Indo-European family of languages. It holds official language status in countries such as Iran, Tajikistan, and Afghanistan. Over time, Persian has incorporated elements from other languages such as Arabic, thereby enriching its vocabulary. Despite these influences, the fundamental structure of the language has largely remained intact for centuries [55, 59].

While Persian is a vibrant and expressive language, it presents several challenges for language processing:

  1. Character Ambiguity: Persian characters like “ی” and “ي” are often used interchangeably but represent different sounds [60].

  2. Rich Morphology: New words can be created by adding prefixes and suffixes to a base word, like “دست” (hand) to “دست‌ها” (hands) [61].

  3. Orthography: Persian involves a combination of spaces and semi-spaces, which can lead to inconsistencies [62].

  4. Co-articulation: The pronunciation of a consonant like “ب” can be affected by the subsequent vowel [63].

  5. Dialectal Variation: Persian has several standard varieties such as Farsi, Dari, and Tajik [64].

  6. Cultural Factors: The phenomenon of persianization can shape the way Persian is used and interpreted.

  7. Lack of Resources: Persian is often classified as a low-resource language, given the scarcity of accessible data and tools for Natural Language Processing [61].

  8. Free Word Order: Persian allows for the rearrangement of words within a sentence without significantly altering its meaning [65].

  9. Homophony: Different words have identical pronunciation but different meanings, like (“گذار” /gʊzɑr/ ‘transition’) (Footnote 1) and (“گزار” /gʊzɑr/ ‘predicate’) [66].

  10. Diacritics: They are frequently left out in writing, leading to ambiguity in word recognition [67].

  11. Rapidly Changing Vocabulary: Persian’s vocabulary is evolving rapidly due to factors such as technology and globalization [68].

  12. Lack of Standardization: There is no single standard for Persian text, which complicates the development of language processing models capable of handling a variety of dialects and styles [69].

A significant issue is the treatment of internal word boundaries, often represented by a zero-width non-joiner space or “pseudo-space”. Ignoring these can lead to text processing errors. Pre-processing steps can help resolve these issues by correcting pseudo and white spaces according to internal word boundaries and addressing tokenization problems.

These challenges highlight the need for robust computational models and resources that can handle the intricacies of the Persian language while ensuring accurate language processing.

Material and methods

Our methodology detects and corrects two categories of mistakes in Persian clinical text: Non-word and Real-word errors. The architecture of the proposed system is depicted in Fig. 1. The system design is composed of five distinct modules that communicate via a databus.

Fig. 1 Architecture of the proposed system for detecting and correcting Persian word errors

The INPUT module accepts raw test corpora. The pre-processing component normalizes the text and addresses word boundary issues. The contextual analyzer module assesses the contextual similarity within desired word sequences.

For error detection, we implement a dictionary reference technique to pinpoint non-word errors and use contextual similarity matching to detect real-word errors. The error correction module rectifies both classes of errors using context information from a fine-tuned contextual embeddings model, in conjunction with orthographic and edit-distance similarity measures.

The corrected corpora or word sequence is then delivered through the OUTPUT module.

Pre-processing step

Text pre-processing is a crucial step in numerous NLP applications, which includes the segmentation of sentences, tokenization, normalization, and the removal of stop-words. The segmentation of sentences involves determining the boundaries of a sentence, usually marked by punctuation such as full stops, exclamation marks, or question marks. Tokenization is the process of decomposing a sentence into a set of terms that capture the sentence's meaning and are utilized for feature extraction. Normalization is the procedure of converting text into its standard forms and is particularly important in NLP applications for Persian, as it is for many other languages. A key task in normalizing Persian text is the conversion of pseudo and white spaces into regular forms, replacing whitespaces with zero-width non-joiners when necessary.

For example, (‘می شود’ /mi ʃævæd/ ‘is becoming’) is replaced with (‘می‌شود’ /miʃævæd/ ‘is becoming’). Persian and Arabic share numerous similarities, and certain Persian letters are frequently written with their Arabic variants. It is often advantageous to normalize these discrepancies by substituting Arabic characters (ي ‘Y’ /j/; ک ‘k’ /k/; ه ‘h’ /h/) with their corresponding Persian forms. For instance, (‘براي’ /bærɑy/ ‘for’) is transformed to (‘برای’ /bærɑy/ ‘for’). Normalization also includes removing diacritics from Persian words; e.g., (‘ذرّه’ /zærre/ ‘particle’) is changed to (‘ذره’ /zære/ ‘particle’). Additionally, kashidas are removed from words; for instance, (‘بــــــاند’ /bɑnd/ ‘band’) is transformed to (‘باند’ /bɑnd/ ‘band’).

In order to accomplish the goal of normalization, a dictionary named Dehkhoda, which includes the correct typographic form of all Persian words, is utilized to determine the standard form of words that have multiple shapes [70].
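A minimal Python sketch of these normalization steps is given below. The character map and the list of verbal prefixes are illustrative assumptions that cover only the examples above; they are not the complete rule set used in our system.

```python
import re

ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # Arabic ي -> Persian ی
    "\u0643": "\u06A9",  # Arabic ك -> Persian ک
    "\u0629": "\u0647",  # Arabic ة -> Persian ه (assumed mapping)
}
DIACRITICS = re.compile("[\u064B-\u0652]")  # fatha, kasra, damma, shadda, sukun, ...
KASHIDA = "\u0640"                          # tatweel / kashida
ZWNJ = "\u200C"                             # zero-width non-joiner ("pseudo-space")

def normalize(text: str, verbal_prefixes=("می", "نمی")) -> str:
    # 1) Map Arabic letter variants to their Persian equivalents.
    for arabic, persian in ARABIC_TO_PERSIAN.items():
        text = text.replace(arabic, persian)
    # 2) Strip diacritics and kashida.
    text = DIACRITICS.sub("", text).replace(KASHIDA, "")
    # 3) Replace the space after verbal prefixes with a ZWNJ, e.g. "می شود" -> "می‌شود".
    for prefix in verbal_prefixes:
        text = re.sub(rf"\b{prefix} (?=\S)", prefix + ZWNJ, text)
    return text

print(normalize("براي بــــاند می شود"))  # -> "برای باند می‌شود"
```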

Damerau-Levenshtein distance and candidate generation

Our methodology employs the Damerau-Levenshtein distance metric to generate potential rectifications for both non-word and real-word errors [11]. This measure considers insertion, deletion, substitution, and transposition of characters. For instance, the Damerau-Levenshtein distance between "KC" and "CKE" equals 2. Around 80% of human-generated spelling errors are found to involve these four error types [71]. Studies indicate that context-sensitive errors constitute approximately 25% to 40% of all typographical errors in English documents [72, 73].

Our model utilizes an extensive dictionary to pinpoint misspellings. This dictionary is divided into two segments: general and specialized terms. For the general segment, we employ the Vafa spell-checker dictionary, a highly respected spell checker for the Persian language; it encompasses 1,095,959 general terms but excludes specialized medical terminology. For the specialized segment, we used the training texts to build a custom dictionary that integrates the specialized terminology found in breast, head and neck, and abdominal and pelvic ultrasonography reports. It was further enriched with translations from the Radiological Sciences Dictionary by David J. Dowsett to pinpoint misspellings of specialized terms [74]. This dictionary comprises 10,332 specialized terms in breast, head and neck, and abdominal and pelvic ultrasound, but does not include general terms.

To circumvent duplication of specialized terms, we juxtaposed our comprehensive dictionary with the Radiological Sciences Dictionary using a custom software developed by the researchers of this study. This ensured that no term was included more than once in the dictionary, as some terms might be present in both dictionaries.

Upon our analysis of the test data, we concluded that an edit distance of up to 2 between the candidate corrections and error would be ideal. With an edit distance set to one, an average of three candidates are generated as potential replacements for a target context word. However, when the edit distance is increased to 2, the average number of generated candidates rises to 15. Correspondingly, the computation time also increases. We ensure that the generated candidates are validated against the reference lexicon.
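The sketch below illustrates how replacement candidates can be generated with the Damerau-Levenshtein measure and validated against the reference lexicon. The function names and the brute-force scan over the lexicon are our own simplifications for illustration, not the exact implementation used in our system.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal-string-alignment variant: insertions, deletions, substitutions,
    and transpositions of adjacent characters each cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def generate_candidates(error: str, lexicon: set, max_distance: int = 2) -> list:
    """Return dictionary entries within the allowed edit distance of the misspelled word."""
    return [word for word in lexicon
            if abs(len(word) - len(error)) <= max_distance
            and damerau_levenshtein(error, word) <= max_distance]

print(damerau_levenshtein("KC", "CKE"))  # 2, as in the example above
```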

Contextual embeddings

Word embeddings, which analyze vast amounts of text data to encapsulate word meanings into low-dimensional vectors [75, 76], retain valuable syntactic and semantic information [77] and are advantageous for numerous NLP applications [78]. However, they grapple with the issue of meaning conflation deficiency, which is the inability to differentiate between multiple meanings of a word.

To tackle this, cutting-edge approaches represent specific word senses, referred to as contextual embeddings or sense representations. Context-sensitive word embedding techniques such as ELMo consider the context of the input sequence [65]. There are two main strategies for pre-training language representation models: feature-oriented methods and fine-tuning methods [79]. Fine-tuning techniques train a language model on large datasets of unlabeled plain text; the parameters of these models are later fine-tuned using data pertinent to the task at hand [79,80,81]. However, pre-training an efficient language model demands substantial data and computational resources [82,83,84,85]. Multilingual models have been formulated for languages that share morphological and syntactic structures. However, languages that do not use the Latin script deviate significantly from those that do, thereby requiring a language-specific approach [86]. This challenge also applies to Persian. Although some multilingual models encompass Persian, their performance may not match that of monolingual models, which are specifically trained on a language-specific lexicon with more extensive volumes of Persian text data. As far as we are aware, ParsBert [69] and SinaBERT [87] are the sole efforts to pre-train a Bidirectional Encoder Representations from Transformers (BERT) model explicitly for the Persian language.

Pre-trained language representation model

Persian is often recognized as an under-resourced language. Despite the existence of language models that support Persian, only two, namely ParsBert [69] and SinaBERT [87], have been pre-trained on large Persian corpora. ParsBERT was pre-trained on data from the general domain, which includes a substantial amount of informal documents such as user reviews and comments, many of which contain misspelled words.

Conversely, SinaBERT was pre-trained on unprocessed text from the overarching medical field. The data for SinaBERT was compiled from a diverse set of sources such as websites that provide health and medical news, websites that disseminate scientific information about health, nutrition, lifestyle, and more, journals (encompassing both abstracts and complete papers) and conference proceedings, scholarly written materials, medical reference books and dissertations, online forums centered around health, medical and health-related Instagram pages, along with medical channels and groups on Telegram.

The data primarily consisted of general medical domain data, a portion of which was informal and contained misspellings. These factors make these pre-trained models unsuitable for Persian clinical domain spelling correction tasks. The lack of an efficient language model in this domain poses a considerable hurdle. In the subsequent section, we will explore our Persian Clinical Corpus and the procedure of pre-training our language representation model.

Data

While numerous formal general domain Persian medical texts are freely accessible, they may not be ideal for spelling correction in clinical texts. Conversely, Persian clinical texts are not widely available to the public. Nevertheless, the use of Persian clinical text is essential for pre-training a language representation model specifically for spelling correction in Persian clinical text. Consequently, we assembled a substantial collection of Persian Clinical texts to train an effective model for spelling correction in Persian.

Our data comprises a total of 78,643 ultrasonography reports, which were obtained from three distinct datasets. These datasets were generously provided by the Department of Imaging's HIS at Tehran's Imam Khomeini Hospital. For a detailed breakdown of these datasets, please refer to Table 2.

Table 2 Details of the datasets

Each dataset comprised three different types of medical reports: breast ultrasonography, head and neck ultrasonography, and abdominal and pelvic ultrasound reports. The first dataset, spanning from January 2011 to February 2015, included 22,504 reports with a total of 7,538,840 words. The average report length in this dataset was 335 words. The second dataset contained 15,888 reports and 4,782,288 words, encompassing all texts entered by medical typists from March 2015 to July 2018. The average length of sonography reports in this dataset was 301 words. The third dataset, which covers the period from August 2018 to June 2023, comprises 40,251 reports and a total of 14,007,348 words. All of these reports were inputted by medical typists. The average word count for the sonography reports in this dataset is 348 words. Upon analyzing the corpus, we found that 1.2% of the words in the corpora represent instances of errors, which can be classified into two types: non-word errors and real-word errors. Further scrutiny revealed that out of this 1.2% segment, non-word errors constitute 1%, while the remaining 0.2% are real-word errors.

We employed a random selection process to ensure a fair representation of the entire corpora in both the testing and training datasets. Specifically, 10% of the sentences from the corpora, amounting to 188,963 sentences, were randomly chosen for testing and evaluation. The remaining 90% of the sentences, which equates to 1,700,668 sentences, were allocated for the fine-tuning and pre-training of the model. Of these, 10% were used for fine-tuning and the rest, 90%, for pre-training. This process encompassed several steps including normalization, pre-processing, and the removal of punctuation marks, tags, and so forth. In addition, we addressed both real-word and non-word errors present in the training corpus. This meticulous approach ensures the robustness and accuracy of our model.

Model architecture

The structure of our suggested model is founded on the original \(\text{BERT}_{\text{BASE}}\) setup, which comprises 12 hidden layers, 12 attention heads, a hidden size of 768, and a total of 110M parameters. Our model is designed to handle a maximum of 512 tokens. The architecture of the model is depicted in Fig. 2. BERT's success is often attributed to its MLM pre-training task, in which it randomly masks or replaces tokens before predicting the original tokens [80]. This feature makes BERT particularly suitable for a spelling checker, as it interprets the masked and altered tokens as misspellings. In the embedding layer of BERT, each input token, denoted as \(T_i\), is mapped to its corresponding embedding representation, \(\mathrm{ER}_i\). This \(\mathrm{ER}_i\) is then forwarded to BERT's encoder layers to obtain the subsequent representation, \(\mathrm{HR}_i\).

Fig. 2 Architecture of Pre-trained Language Representation Model for Persian Clinical Text Spelling Correction

$$\mathrm{ER}_i = \mathrm{BERT\text{-}Embedding}(T_i)$$
(1)
$$\mathrm{HR}_i = \mathrm{BERT\text{-}Encoder}(\mathrm{ER}_i)$$
(2)

In this context, both \(\mathrm{ER}_i\) and \(\mathrm{HR}_i\) belong to the real number space \(\mathbb{R}^{1 \times d}\), where \(d\) represents the hidden dimension. Subsequently, the similarities between \(\mathrm{HR}_i\) and all token embeddings are calculated to predict the distribution of \(Y_i\) over the existing vocabulary.

$$Y_i = \mathrm{Softmax}(\mathrm{HR}_i \, E^{T})$$
(3)

where \(E \in \mathbb{R}^{V \times d}\) and \(Y_i \in \mathbb{R}^{1 \times V}\); here \(V\) signifies the size of the vocabulary and \(E\) represents the BERT embedding layer. The \(i\)-th row of \(E\) aligns with \(\mathrm{ER}_i\) in accordance with Eq. 1. The ultimate rectification outcome for \(T_i\) is the token \(T_k\) whose corresponding \(\mathrm{ER}_k\) exhibits the greatest similarity to \(\mathrm{HR}_i\).
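The snippet below sketches Eqs. 1-3 with the Hugging Face transformers library. Since our fine-tuned clinical checkpoint is not publicly released, the general-domain ParsBERT checkpoint and the sample sentence serve only as stand-ins for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "HooshvareLab/bert-base-parsbert-uncased"  # stand-in, not our clinical model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

sentence = "در سونوگرافی [MASK] راست ضایعه‌ای مشاهده شد"
inputs = tokenizer(sentence, return_tensors="pt")        # token ids T_i
with torch.no_grad():
    logits = model(**inputs).logits                      # encoder output projected onto the vocabulary
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = torch.softmax(logits[0, mask_pos], dim=-1)       # Y_i over the vocabulary (Eq. 3)
top = torch.topk(probs, k=5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()), top.values)
```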

Fine-tuning for spelling correction task

We fine-tuned the pre-trained model specifically for the task of spelling correction in Persian clinical text, aiming to achieve optimal performance. For this fine-tuning process, we utilized 10% of the reserved sentences from the training corpus, amounting to 170,066 sentences. Each input to the model was a single sentence ending with a full stop, as our primary focus was on training the model for spelling correction. Upon examining the test set, we found that many sentences were short, and masking a few tokens would significantly reduce the context. Consequently, we excluded sentences with fewer than 20 words from the corpus. In the end, we selected 122,162 sentences, each with a minimum length of 20 words. However, since the input was a list of sentences that couldn't be directly fed into the model, we tokenized the text. The objective of the error correction task is to predict target or masked words by gaining context from adjacent words. Essentially, the model tries to reconstruct the original sentence from the masked sentence received in the input at the output. Therefore, the target labels are the actual input_ids of the tokenizer.

In the original \(\text{BERT}_{\text{BASE}}\) model, 15% of the input tokens were selected for masking, with 80% of those replaced with [MASK] tokens, 10% replaced with random tokens, and the remaining 10% left unchanged. In our fine-tuning task, however, we replaced 15% of the input tokens with [MASK], excluding special tokens; we did not mask [SEP] and [CLS] tokens. We also avoided random replacement of tokens, which yielded better results. We used TensorFlow [88] for training with Keras [89], together with the Adam optimizer and a learning rate of 1E-4. The batch size was 32 and each model was run for 4 epochs.
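A simplified sketch of this masking scheme is shown below. The token ids and the special-token set are hypothetical, and the actual TensorFlow/Keras training pipeline is omitted.

```python
import numpy as np

def mask_inputs(input_ids, special_ids, mask_id, mask_prob=0.15, seed=0):
    """15% of non-special tokens are replaced with [MASK]; unlike original BERT,
    no random-token replacement is used. Labels are the original token ids."""
    rng = np.random.default_rng(seed)
    input_ids = np.asarray(input_ids)
    labels = input_ids.copy()
    maskable = ~np.isin(input_ids, list(special_ids))             # never mask [CLS], [SEP], [PAD]
    selected = (rng.random(input_ids.shape) < mask_prob) & maskable
    masked = input_ids.copy()
    masked[selected] = mask_id
    return masked, labels

# Hypothetical token ids for illustration: 2=[CLS], 3=[SEP], 0=[PAD], 4=[MASK]
ids = [2, 1250, 8732, 415, 996, 5310, 3]
print(mask_inputs(ids, special_ids={2, 3, 0}, mask_id=4))
```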

PERTO algorithm

We have designed an algorithm called PERTO, which stands for Persian Orthography Matching. This algorithm ranks the most likely candidate words derived from the output of a pre-trained model, based on shape similarity. In this algorithm, every character in the Persian script is given a distinct code. Characters that share similar forms or glyphs are classified under the same code, enabling words with similar shape characters to be identified, even if there are slight spelling variations. Our pioneering hybrid model classifies characters with the same shapes into identical groups, as depicted in Table 3.

Table 3 PERTO code for the Persian language alphabet

In order to identify shape similarity in Persian, a PERTO code is generated for the incorrectly spelled word. This code is subsequently matched with the PERTO codes of all potential words generated via edit distance. Our model distinctively merges PERTO with a contextual score ranking system. PERTO is solely utilized for substitution errors. In cases of insertion or deletion type errors, where the PERTO codes of all potential words do not correspond to the PERTO code of the misspelled word, our model depends entirely on contextual scores derived from the pre-trained model. Pseudocode 1 outlines the implementation details of the PERTO algorithm.

To illustrate the PERTO code generation process, let us consider the word "پرگاز," which translates to "a stomach full of gas" in English. The generation of the PERTO code for this word, as per the method outlined in Pseudocode 1, is as follows:

  1) We begin with the first character on the right side of the word and find its hash code from Table 3. The code for "پ" is 1, which we store in an empty string.

  2) Moving one unit to the left, we retrieve the hash code for the character "ر," which is 4, and add this digit to the string.

  3) This process continues for each character in the word until no characters are left.

  4) For "گاز," the respective codes are "9," "0" and "4," following the same lookup and concatenation procedure.

  5) In the end, we obtain the PERTO code "14904" for the given word, which has the same length as the original word.


Pseudocode 1 PERTO code generation algorithm
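A compact Python sketch of this procedure is given below. Table 3 is not reproduced here, so the shape-group mapping covers only the characters needed for the worked example; grouping the additional visually similar letters shown is our assumption, not the full PERTO table.

```python
PERTO_CODES = {
    "ب": "1", "پ": "1", "ت": "1", "ث": "1",
    "ر": "4", "ز": "4", "ژ": "4",
    "ک": "9", "گ": "9",
    "ا": "0", "آ": "0",
}

def perto_code(word: str) -> str:
    """Concatenate the shape-group digit of each character in the word."""
    return "".join(PERTO_CODES.get(ch, "?") for ch in word)

print(perto_code("پرگاز"))  # -> "14904"
```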

In the appraisal segment of our research, we will meticulously scrutinize the impact of the PERTO algorithm on the accuracy of spelling rectification within the healthcare sector. Through a comprehensive examination of the outcomes, our aim is to measure the effectiveness of this algorithm in enhancing the accuracy of spelling rectification, particularly designed for Persian medical text. This endeavor will provide valuable insights into the potential applications and benefits of the PERTO algorithm in real-world scenarios.

Error detection module

The error detection module utilizes two separate strategies based on the nature of the error being identified. For non-word errors, a lexical lookup approach is employed, while real-word errors are addressed through contextual analysis. The initial step in error detection, irrespective of the error type, involves boundary detection and token identification. Upon receiving an input sentence S, the model first demarcates the start and end of the sentence with Beginning of Sentence (\(BoS\)) and End of Sentence (\(EoS\)) markers, respectively, and approximates the word count in the sentence:

$$\langle BoS \rangle \; W_{i}\, W_{i+1}\, W_{i+2} \dots W_{n} \; \langle EoS \rangle$$

It’s crucial to note that the word count corresponds to the maximum number of iterations the model will undertake to identify an error in the sentence.

Non-word error detection

Spell checkers predominantly employ the lexical lookup method to detect spelling errors. This technique involves comparing each word in the input sentence with a reference dictionary in real-time, which is usually built using a hash table. Beginning with the \(BoS\) marker, the model scrutinizes every token in the sentence for its correctness based on its sequence. This process continues until the \(EoS\) marker is reached. However, if a word is identified as misspelled, the error detection cycle halts and the error correction phase commences. Here's an illustration of non-word error detection:


In the given example, the word intended to be typed was (“مایع” /mɑye/ ‘fluid’), but it was mistakenly typed as ‘مایغ’. This error results from a substitution operation and is a single edit away from the correct word. The model promptly identified this error.
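A minimal sketch of this dictionary scan is shown below; the toy lexicon and sentence are illustrative only.

```python
def find_nonword_error(tokens, lexicon):
    """Scan the sentence from BoS to EoS; stop at the first out-of-dictionary token."""
    for position, token in enumerate(tokens):
        if token not in lexicon:       # hash-table membership test
            return position, token     # hand over to the correction module
    return None                        # no non-word error detected

lexicon = {"مایع", "آزاد", "در", "لگن", "رویت", "نشد"}
print(find_nonword_error(["مایع", "آزاد", "در", "لگن", "رویت", "نشد"], lexicon))  # None
print(find_nonword_error(["مایغ", "آزاد", "در", "لگن", "رویت", "نشد"], lexicon))  # (0, 'مایغ')
```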

Real-word error detection

In this study, we employ contextual analysis for the detection of real-word errors. Traditional statistical models relied on n-gram language models to examine the frequency of a word's occurrence and assess the word's context by considering the frequency of the word appearing with "n" preceding terms. However, contemporary approaches use neural embeddings to evaluate the semantic fit of words within a given sentence. In our proposed methodology, we utilize the mask feature and leverage contextual scores derived from the fine-tuned bidirectional language model to detect and correct word errors. The process of real-word error detection is explained as follows:

  1) The model begins with the BoS marker and attempts to encode each word as a masked word, starting with the first word.

  2) A list of potential replacements for the masked word is derived from the output of the pre-trained model.

  3) Based on the candidate generation scenario, replacement candidates are generated within edit-distances of 1 and 2 from the masked word.

  4) The list of candidates, along with the original token, is cross-verified against the pre-trained model’s output for the masked token.

  5) If a candidate demonstrates a probability value that surpasses that of the masked word, the initial word is considered erroneous, thus bringing the procedure to a close.

  6) However, if no error is detected, the model shifts one unit to the left, and the same steps are reiterated for all words within the sentence until the EoS marker is encountered.

Therefore, the moment an error is identified, the correction process is initiated immediately; subsequently, the model advances to the next sentence. Pseudocode 2 offers an in-depth exploration of the real-word error detection process.


Pseudocode 2 Real-word error detection algorithm
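The sketch below mirrors the steps of Pseudocode 2 at a high level. The helpers mlm_probs and candidate_fn are hypothetical interfaces standing in for the fine-tuned masked language model and the Damerau-Levenshtein candidate generator described earlier.

```python
def detect_real_word_error(tokens, mlm_probs, candidate_fn, max_edit_distance=2):
    """High-level sketch of Pseudocode 2. mlm_probs(tokens, i) is assumed to return a dict
    of vocabulary probabilities for position i with tokens[i] masked; candidate_fn returns
    in-lexicon words within the given edit distance of a word."""
    for i, word in enumerate(tokens):
        probs = mlm_probs(tokens, i)                        # steps 1-2: mask and score
        candidates = candidate_fn(word, max_edit_distance)  # step 3: edit-distance candidates
        best = max(candidates, key=lambda c: probs.get(c, 0.0), default=None)
        # step 5: if a candidate fits the context better than the written word, flag an error
        if best is not None and probs.get(best, 0.0) > probs.get(word, 0.0):
            return i, word, best
    return None                                             # step 6: no real-word error found
```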

Here's an illustration of successful real-word error detection:


In the given example, the term ( “اینترارکتال” /intrarectɑl/ ‘intrarectal’) is identified as a real-word error. The word that the user intended to type was (“اینتراداکتال” /intrɑductɑl/ ‘intraductal’). Initially, the model encodes the masked token and feeds it into the pre-trained model, which subsequently generates a list of contextually appropriate tokens. Following this, a roster of potential replacement candidates is created using the Damerau-Levenshtein distance measure. In this instance, the edit-distance is 2. The model then juxtaposes the context similarity score of each replacement candidate with the output list derived from the pre-trained model. Table 4 showcases the context similarity scores of the top two replacement candidates.

Table 4 Contextual scores of the top five replacement candidates

Error correction module

The error correction phase is initiated when an error is identified in the input. In this stage, we devise a ranking algorithm that primarily relies on the contextual scores obtained from the fine-tuned pre-trained model and the corresponding PERTO codes between potential candidates and the errors.

Non-word error correction process

In the non-word error correction process, the following steps are undertaken:

  1) The model initially employs the Damerau-Levenshtein edit distance measure to generate a set of replacement candidates within 1 or 2 edits.

  2) The misspelled word is subsequently encoded as a “mask” and input into the fine-tuned model.

  3) The model extracts all probable words from the output and matches them against the candidate list.

  4) The model then retains a certain number of candidates with the highest contextual scores. Based on our observations, the optimal number is 10.

  5) The method proceeds to compare the PERTO similarity between the erroneous word and the remaining replacement candidates. If the error and candidate share the same code, that candidate is considered the most suitable word. However, if two or more probable candidates carry the same PERTO code as the erroneous word, then the candidate with the highest contextual score is selected as the replacement for the error.

Pseudocode 3 delivers a comprehensive exploration of the non-word error correction mechanism.


Pseudocode 3 Non-word error correction algorithm
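Below is a condensed sketch of this ranking scheme; mlm_probs is assumed to be a dict of contextual scores for the masked position, and perto_code is the shape-coding function sketched earlier. The same ranking logic applies to the real-word correction process described next.

```python
def correct_nonword(error, candidates, mlm_probs, perto_code, keep_top=10):
    """Sketch of Pseudocode 3: keep the 10 candidates with the highest contextual score,
    then prefer the candidate whose PERTO code matches that of the misspelled word."""
    ranked = sorted(candidates, key=lambda c: mlm_probs.get(c, 0.0), reverse=True)[:keep_top]
    if not ranked:
        return None
    same_shape = [c for c in ranked if perto_code(c) == perto_code(error)]
    pool = same_shape or ranked   # fall back to contextual ranking for non-substitution errors
    return max(pool, key=lambda c: mlm_probs.get(c, 0.0))
```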

Real-word error correction process

In the scenario of real-word error correction, the process is as follows:

  1) The contextual scores of potential candidates are retrieved from the fine-tuned model.

  2) The model retains a certain number of candidates with the highest contextual score. Based on our observations, the optimal number is 10.

  3) The method then compares the PERTO similarity between the erroneous word and the replacement candidates. If the error and the candidate share the same code, that candidate is deemed the most suitable word.

  4) However, if two or more probable candidates carry the same PERTO code as the erroneous word, then the candidate with the highest contextual score is selected as the replacement for the error.

Pseudocode 4 delivers a comprehensive exploration of the real-word error correction mechanism.


Pseudocode 4 Real-word error correction algorithm

Evaluation and results

In this section, we first conduct an analysis of the test data. Following this, we evaluate our method's performance and compare it with various baseline models in the task of spelling correction. This comparison will offer valuable insights into the efficacy and precision of our approach in identifying and rectifying spelling errors.

Test dataset

Our test datasets consist of 188,963 reserved sentences derived from the Persian clinical corpus. Upon scrutinizing the errors present in the test dataset, we found that 1.20% of sentences exhibited instances of non-word errors, which equates to 120 errors in every 10,000 sentences. In addition, 0.29% of sentences contained a real-word error, corresponding to 29 errors in every 10,000 sentences. We examined all the erroneous words to categorize them into one of the predefined classes of errors, such as substitution, transposition, insertion, and deletion. The frequency of these errors, based on the error type, is illustrated in Table 5. When addressing both real-word and non-word errors, substitution errors are more prevalent than other types of errors. Furthermore, insertion errors are quite common when dealing with both classes of error, while deletion and transposition errors are the least common.

Table 5 Distribution of different error types in the test corpus

We also analyzed the test dataset for the number of edit distances required for spell correction, the results of which are presented in Table 6. In dealing with both real-word and non-word errors, 86.1% of misspellings required an edit distance of 1 to correct the incorrect word. 13.7% of errors were rectified with an edit distance of 2, and a mere 2.1% of errors fell within an edit-distance of 3 or more. Due to the combinatorial explosion when generating and examining candidates within distance 3, these classes of error were excluded from the dataset.

Table 6 Minimum edit distances required to convert misspelled words into correct words in the test dataset

Upon conducting a more thorough analysis of the data, we found that 0.8% of sentences contained more than one error. As our method is designed to handle only one-error-per-sentence, we removed these sentences from the test dataset.

Evaluation metrics

The principal metrics for evaluating the effectiveness of models on tasks related to non-word and real-word error identification and rectification are precision (P), recall (R), and the F-measure (F1-Score). Precision (P) quantifies the model's accuracy, whereas recall evaluates its comprehensiveness or sensitivity. The F1-Score, a weighted harmonic average of these two metrics, can be computed by integrating them. In F1, both precision and recall are given equal weight. Equation 4 describes the F1-Score evaluation measure.

$$F1\text{-}Score = 2 \times \frac{P \times R}{P + R}$$
(4)
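As a quick worked example, with purely illustrative precision and recall values rather than our reported results:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (Eq. 4)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.92, 0.88), 3))  # illustrative values only -> 0.9
```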

Baseline models

In our research, we implemented two baseline models for non-word correction in Persian clinical text to ensure a comprehensive comparison. These models include the four-gram model introduced by [57], and a Persian Continuous Bag-of-Words (CBOW) model [90]. Both models were developed using Python and trained on the same dataset as the pre-trained model. Our aim is to understand the strengths and weaknesses of these models, and leverage this understanding to enhance error correction in Persian language processing. Unfortunately, for real-word error correction in the Persian medical domain, no prior work has been introduced. Therefore, a meaningful comparison is not achievable at this time. This highlights the novelty and importance of our research in this specific area.

Yazdani et al.

The statistical methodology pioneered by Yazdani and colleagues stands out as a promising approach for rectifying non-word errors. It is meticulously crafted to address typographical inaccuracies prevalent in Persian healthcare text, thereby enhancing the quality and reliability of the information [57]. This method leverages a weighted bi-directional four-gram language model to pinpoint the most appropriate substitution for a given error. It incorporates a quadripartite equation that assigns priority to n-grams based on their sequence, thereby enhancing the precision of error correction.

CBOW Model

The CBOW model operates by learning the semantics of words through the analysis of their surrounding context, and then uses this information as input to predict suitable words for the given context [90]. The architecture of the CBOW model is designed to identify the target word (the center word) based on the context words provided. This model was specifically trained to tackle the task of non-word error rectification. It employs two matrices to calculate the hidden layer (H): the input matrix (IM) and the output matrix (OM). The CBOW model was trained on the same corpus of 1.4 million documents as the pre-trained model, which facilitated the generation of the input and output matrices. The training parameters incorporated a context window size of 10 and a dimension size of 300.
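A hedged sketch of such a CBOW baseline using gensim is shown below; gensim and the toy corpus are our assumptions for illustration, as the baseline implementation is only specified as Python with a context window of 10 and 300 dimensions.

```python
from gensim.models import Word2Vec

# Toy corpus for illustration; the baseline was actually trained on the full clinical corpus.
sentences = [["سونوگرافی", "کبد", "و", "مجاری", "صفراوی", "طبیعی", "است"]]
cbow = Word2Vec(sentences, sg=0, window=10, vector_size=300, min_count=1, epochs=5)

# predict_output_word ranks vocabulary items that fit the given context words, mirroring
# how the baseline proposes in-context replacements for a detected non-word error.
print(cbow.predict_output_word(["سونوگرافی", "کبد", "مجاری"], topn=3))
```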

Non-word error correction evaluation

In the initial phase of assessment, we juxtapose the effectiveness of our suggested methodology with that of the previously mentioned baseline models concerning non-word error rectification. It's crucial to highlight that all models employ a dictionary look-up method for identifying typos, resulting in an F1-score of 100% for typo detection. Table 7 presents the results of the non-word error correction task, providing a detailed comparison of the effectiveness of our approach and the baseline models.

Table 7 Comparison of various models’ performance on non-word error correction task

Table 7 provides a detailed comparison of the performance of various models on the non-word error correction task. It compares two configurations of our proposed approach with statistical baselines and the CBOW model. To gauge their effectiveness in practical scenarios, all models were subjected to an extensive array of test instances. The results clearly indicate that both configurations of our approach outperform the other models, demonstrating superior performance. The model achieves its best performance, with an F1-Score of 90.0%, when the PERTO algorithm is employed. The combination of contextual similarity with the PERTO algorithm proves to be the most robust scheme, offering a 1.1% increase in correcting non-word errors compared to using only contextual scores.

The authors of [57] reported achieving an F1-Score of 90.2% for non-word error correction. However, our attempts to reproduce this result in our evaluations were unsuccessful.

In fact, the approach by Yazdani et al. shows the lowest performance, with an F1-Score of 74.6%. The Contextual Scores + PERTO scheme outperforms Yazdani et al.'s approach by 15.4%, further demonstrating the robustness of our method. In terms of the proposed approach, the results of the scheme that combines contextual scores and PERTO are significantly superior to those achieved using only contextual scores. The most effective results are achieved when the pretrained model is used in conjunction with the PERTO orthographic similarity algorithm. Our observations confirm that the PERTO algorithm significantly enhances results, as substitution errors, which are predominantly either visually or phonetically similar, account for 49.1% of all non-word errors in the test corpus. This is in comparison to insertion, deletion, and transposition errors. This underscores the effectiveness of our approach in handling substitution errors.

Real-word error detection and correction evaluations

We performed a comprehensive evaluation of our proposed model for detecting and correcting real-word errors in Persian clinical text. The results of these evaluations are summarized in Table 8. Our model demonstrated its highest performance in real-word error detection, achieving an F1-Score of 90.6%.

Table 8 Performance evaluation on real-word error correction task

We further evaluated our model’s ability to correct real-word errors. As depicted in Table 8, our suggested approach, particularly when enhanced with the PERTO algorithm, exhibits outstanding performance in correcting real-word errors across a range of edit distances. The model reached its highest F1-Score of 91.5% when the Persian orthographic similarity algorithm was employed, indicating an approximate enhancement of 1.5% in the correction F1-Score. It is noteworthy that the PERTO algorithm significantly enhances the results, as substitution errors constitute 47.8% of all real-word errors in the test corpus, compared to insertion, deletion, and transposition errors. Furthermore, a significant portion of these substitution errors bear a visual resemblance to their intended words.

We also conducted a comprehensive analysis of the errors made by our model. We discovered that in a few cases, real-word errors were missed when the erroneous word had a strong semantic connection to the context words. For instance, in the original word sequence “روده اطراف تستیس راست رویت شد” (The presence of the intestine was observed around the right testis.), the medical typist mistakenly replaced the intended word (“روده” / rʊdeh/ ‘intestine’) with the erroneous word (“توده” /tʊdeh/ ‘mass’), which is within an edit distance of 1. This resulted in the word sequence “توده اطراف تستیس راست رویت شد” (A mass was observed surrounding the right testicle), which had a higher context similarity score than the original word sequence. Consequently, this word sequence was overlooked by the model.

While this issue has not been highlighted in previous research on Persian spelling correction, we believe it poses a significant challenge in addressing real-word errors in Persian clinical texts. To prevent such errors from being overlooked, we could present a list of the most probable candidates along with their context scores to a human expert, allowing them to select the most appropriate replacement. This emphasizes that, despite the advancements in state-of-the-art models, human expertise remains indispensable in certain situations.

In summary, the results indicate that our proposed method exhibits robustness and precision in detecting and rectifying context-sensitive errors in Persian clinical text, thereby affirming its potential for practical application in the field.

Discussion

Typographical errors, a frequent occurrence in radiology reports often attributed to incessant interruptions and a dynamic work environment, have the potential to endanger patient health, introduce ambiguity, and undermine the reputation of radiologists [91]. The cardinal goal of our research was to pioneer an avant-garde technique for pinpointing and rectifying spelling inaccuracies in Persian clinical text. The elaborate morphology and syntax of the Persian language, intertwined with the pivotal role of meticulous documentation in fostering effective patient care, facilitating research, and safeguarding patient safety, accentuate the gravity of this undertaking. Within the confines of the Imaging Department at Imam Khomeini Hospital, the formulation of radiology reports is an intricate multi-step endeavor that averages around 30 min in duration.

This process includes dictation by radiologists, transcription by medical typists, and a review and editing process before the final report is stored in the HIS. However, this process includes non-value-added activities, known as ‘Muda’, particularly the time spent between transcription and final confirmation [92]. Our newly developed software addresses this inefficiency by quickly correcting misspelled words during transcription, reducing the time between initial writing and final confirmation, and thereby decreasing ‘Muda’.

Our approach leverages a pre-trained language representation model, fine-tuned specifically for the task of spelling correction in the clinical domain. This model is complemented by an innovative orthographic similarity matching algorithm, PERTO, which uses visual similarity of characters for ranking correction candidates. This unique combination of techniques distinguishes our approach from existing methods, enabling our model to effectively address both non-word and real-word errors. The evaluation of our approach demonstrated its robustness and precision in detecting and rectifying word errors in Persian clinical text. In terms of non-word error correction, our model achieved an F1-Score of 90.0% when the PERTO algorithm was employed. This represents a 1.1% increase in correcting non-word errors compared to using only contextual scores. For real-word error detection, our model demonstrated its highest performance, achieving an F1-Score of 90.6%. Furthermore, the model reached its highest F1-Score of 91.5% for real-word error correction when the PERTO algorithm was employed, indicating an approximate enhancement of 1.5% in the correction F1-Score.

Despite these promising results, our model has certain limitations. For instance, in a few cases, real-word errors were missed when the erroneous word had a strong semantic connection to the context words. Additionally, while our model is effective in handling non-word and real-word errors, it is not equipped to deal with grammatical errors. Moreover, our model was set up to handle one error per sentence and cannot handle multiple errors in a single sentence, although a few sentences in the corpus contained more than one error.

Building upon our current achievements, the integration of emerging technologies such as BCIs, eye-tracking, VR/AR, and EEG offers a promising frontier for further enhancing our spelling correction system. These technologies present unique opportunities to address some of the inherent limitations identified in our study. For example, BCIs could offer intuitive, direct error correction interfaces, while eye-tracking might refine error detection based on user interaction patterns. VR/AR could provide immersive training environments, improving proficiency with correction tools, and EEG monitoring could lead to spelling correction interfaces that adapt to user stress levels and cognitive states, ultimately making the correction process less taxing and more efficient.

While prevailing spelling correction mechanisms for the Persian language cater to a broad spectrum and are not tailored to the medical sphere, our innovative system is specifically architected to autonomously pinpoint and amend misspellings prevalent in Persian radiology and ultrasound reports. The seamless integration of automatic spell-checking systems, notably in critical facets for patient safety such as allergy entries, medication details, diagnoses, and problem listings, can substantially bolster the quality and exactness of electronic medical records. Our system, which can be seamlessly integrated as an auxiliary program on platforms like Microsoft Office Word, web-browsers, or employed as an API in the HIS system, expands the potential applications of our model transcending the boundaries of the clinical domain.

In summary, the results of this study affirm the potential of our proposed method in transforming Persian clinical text processing. By effectively addressing the unique challenges posed by the Persian language and integrating cutting-edge technologies, our approach paves the way for more accurate and efficient clinical documentation, contributing to improved patient care and safety.

Conclusions

This study presents a novel method for detecting and correcting spelling errors in Persian clinical texts, leveraging a pre-trained model fine-tuned for this specific domain. Our approach outperformed existing models, achieving F1-Scores above 90% for both real-word and non-word error correction, and proved robust across error types, including substitution, insertion, deletion, and transposition. By combining our orthographic similarity algorithm, PERTO, with contextual information, we significantly increased the correction success rate, marking a substantial improvement in spelling error correction for Persian clinical texts.

The potential of our methodology extends beyond medical documentation, offering applications in the engineering sciences. The NLP and machine learning techniques employed here could improve error detection and correction in engineering documents and software code, strengthening review processes, technical documentation accuracy, and software development efficiency. Our findings could also inform intelligent diagnostic systems for predictive maintenance and quality control, where similar error correction mechanisms could enhance precision and reliability.

Looking ahead, we aim to refine our model further to tackle multiple errors within a sentence and address grammatical inaccuracies, broadening our method's comprehensiveness for the Persian medical domain. Additionally, we plan to explore the integration of emerging technologies like BCI, eye-tracking, VR/AR, and EEG, aiming to create more intuitive correction interfaces and immersive training environments. These efforts will not only advance spelling correction tools technically but also amplify their practical impact in medical documentation, contributing to improved patient care and safety.

Availability of data and materials

The data that support the findings of this study are held by Imam Khomeini Hospital. They are not publicly accessible due to privacy restrictions. However, they may be available from the authors upon reasonable request and with permission from the hospital.

Notes

  1. All pronunciations have been provided in the International Phonetic Alphabet (IPA).

Abbreviations

EHR: Electronic health record
OCR: Optical character recognition
ASR: Automatic speech recognition
NLP: Natural language processing
HIS: Hospital information system
BCI: Brain-computer interface
VR: Virtual reality
AR: Augmented reality
EEG: Electroencephalography
MLM: Masked language model

References

  1. Wong W, Glance D. Statistical semantic and clinician confidence analysis for correcting abbreviations and spelling errors in clinical progress notes. Artif Intell Med. 2011;53(3):171–80.


  2. Zhou L, et al. Analysis of errors in dictated clinical documents assisted by speech recognition software and professional transcriptionists. JAMA Netw Open. 2018;1(3):e180530–e180530.


  3. Turchin A, et al. Identification of misspelled words without a comprehensive dictionary using prevalence analysis. AMIA Annu Symp Proc. 2007;2007:751–5.


  4. Wilcox-O’Hearn A, Hirst G, Budanitsky A. Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model. In: International conference on intelligent text processing and computational linguistics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. p. 605–16.

  5. Hirst G, Budanitsky A. Correcting real-word spelling errors by restoring lexical cohesion. Nat Lang Eng. 2005;11(1):87–111.


  6. Bassil Y, Alwani M. OCR context-sensitive error correction based on Google web 1t 5-gram data set. Am J Sci Res. 2012;50.

  7. Deng L, Huang X. Challenges in adopting speech recognition. Commun ACM. 2004;47(1):69–75.


  8. Hartley RT, Crumpton K. Quality of OCR for degraded text images. In: Proceedings of the fourth ACM conference on Digital libraries. 1999. p. 228–9.

  9. Jurafsky D, Martin JH. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd ed. New Jersey: Prentice-Hall; 2008.


  10. Atkinson K. GNU Aspell 0.60.4. 2006. Retrieved from http://aspell.net

  11. Damerau FJ. A technique for computer detection and correction of spelling errors. Commun ACM. 1964;7(3):171–6.


  12. Idzelis M, Galbraith B. Jazzy: the Java open source spell checker. 2005. Retrieved 10 Oct 2019 from http://jazzy.sourceforge.net

  13. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys Doklady. 1966;10(8).


  14. Dashti SMS, et al. Toward a thesis in automatic context-sensitive spelling correction. Int J Artif Intell Mechatron. 2014;3(1):19–24.


  15. Mays E, Damerau FJ, Mercer RL. Context based spelling correction. Inf Process Manage. 1991;27(5):517–22.


  16. Samanta P, Chaudhuri BB. A simple real-word error detection and correction using local word bigram and trigram. In: Proceedings of the 25th conference on computational linguistics and speech processing (ROCLING 2013). 2013.


  17. Wilcox-O'Hearn LA. Detection is the central problem in real-word spelling correction. 2014. arXiv preprint arXiv:1408.3153.

  18. Dashti SM, Khatibi Bardsiri A, Khatibi Bardsiri V. Correcting real-word spelling errors: A new hybrid approach. Digital Sch Humanit. 2018;33(3):488–99.


  19. Dashti SM. Real-word error correction with trigrams: correcting multiple errors in a sentence. Lang Resour Eval. 2018;52(2):485–502.


  20. Pande H. Effective search space reduction for spell correction using character neural embeddings. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 2017.


  21. Hu Y, Jing X, Ko Y, Rayz JT. Misspelling Correction with Pre-trained Contextual Language Model. 2020 IEEE 19th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC). IEEE: Beijing; 2020. p. 144–49. https://doi.org/10.1109/ICCICC50026.2020.9450253.

  22. Lee J-H, Kim M, Kwon H-C. Deep learning-based context-sensitive spelling typing error correction. IEEE Access. 2020;8:152565–78.


  23. Sun R, Wu X, Wu Y. An Error-Guided Correction Model for Chinese Spelling Error Correction. In: Findings of the Association for Computational Linguistics: EMNLP 2022. 2022. p. 3800–10.

  24. Jayanthi SM, Pruthi D, Neubig G. NeuSpell: A Neural Spelling Correction Toolkit. EMNLP 2020. 2020:158.

  25. Ji T, Yan H, Qiu X. SpellBERT: A lightweight pretrained model for Chinese spelling check. In: Proceedings of the 2021 conference on empirical methods in natural language processing. 2021.


  26. Liu S, et al. PLOME: Pre-training with misspelled knowledge for Chinese spelling correction. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.


  27. Zhang R, et al. Correcting Chinese spelling errors with phonetic pre-training. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021.


  28. Wang X, et al. Towards contextual spelling correction for customization of end-to-end speech recognition systems. IEEE/ACM Trans Audio, Speech Lang Proc. 2022;30:3089–97.


  29. Zhu C, et al. MDCSpell: A multi-task detector-corrector framework for Chinese spelling correction. In: Findings of the Association for Computational Linguistics: ACL 2022. 2022.


  30. Liu S, et al. CRASpell: A contextual typo robust approach to improve Chinese spelling correction. In: Findings of the Association for Computational Linguistics: ACL 2022. 2022.


  31. Salhab M, Abu-Khzam F. AraSpell: A Deep Learning Approach for Arabic Spelling Correction. 2023.


  32. Dalianis H. Characteristics of patient records and clinical corpora. In: Clinical Text Mining: Secondary Use of Electronic Patient Records. 2018. p. 21–34.


  33. Hussain F, Qamar U. Identification and correction of misspelled drugs’ names in electronic medical records (EMR). In: International Conference on Enterprise Information Systems, vol. 3. SCITEPRESS; 2016. p. 333–8.

  34. Kilicoglu H, et al. An ensemble method for spelling correction in consumer health questions. AMIA Annu Symp Proc. 2015;2015:727.


  35. Zhou X, et al. Context-sensitive spelling correction of consumer-generated content on health care. JMIR Med Inform. 2015;3(3): e4211.


  36. Ruch P, Baud R, Geissbühler A. Using lexical disambiguation and named-entity recognition to improve spelling correction in the electronic patient record. Artif Intell Med. 2003;29(1–2):169–84.


  37. Siklósi B, Novák A, Prószéky G. Context-aware correction of spelling errors in Hungarian medical documents. In: Statistical Language and Speech Processing: First International Conference, SLSP 2013. Proceedings 1 2013. Tarragona: Springer Berlin Heidelberg; 2013. p. 248–59.

  38. Grigonyte G, et al. Improving readability of Swedish electronic health records through lexical simplification: First results. In: European Chapter of ACL (EACL), 26–30 April, 2014. Gothenburg: Association for Computational Linguistics; 2014.


  39. Tolentino HD, et al. A UMLS-based spell checker for natural language processing in vaccine safety. BMC Med Inform Decis Mak. 2007;7:1–13.


  40. Doan S, et al. Integrating existing natural language processing tools for medication extraction from discharge summaries. J Am Med Inform Assoc. 2010;17(5):528–31.


  41. Lai KH, et al. Automated misspelling detection and correction in clinical free-text records. J Biomed Inform. 2015;55:188–95.


  42. Fivez P, Šuster S, Daelemans W. Unsupervised context-sensitive spelling correction of English and Dutch clinical free-text with word and character n-gram embeddings. 2017. arXiv preprint arXiv:1710.07045.

  43. Pérez A, et al. Inferred joint multigram models for medical term normalization according to ICD. Int J Med Informatics. 2018;110:111–7.


  44. Khan MF, et al. Augmented reality based spelling assistance to dysgraphia students. J Basic Appl Sci. 2017;13:500–7.


  45. Li Y, et al. Exploring text revision with backspace and caret in virtual reality. In: Proceedings of the 2021 CHI conference on human factors in computing systems. 2021.


  46. Lim J-H, et al. Development of a hybrid mental spelling system combining SSVEP-based brain–computer interface and webcam-based eye tracking. Biomed Signal Process Control. 2015;21:99–104.


  47. Mora-Cortes A, et al. Language model applications to spelling with brain-computer interfaces. Sensors. 2014;14(4):5967–93.


  48. D’hondt E, Grouin C, Grau B. Low-resource OCR error detection and correction in French Clinical Texts. In: Proceedings of the seventh international workshop on health text mining and information analysis. 2016.


  49. Tran K, Nguyen A, Vo C, Nguyen P. Vietnamese Electronic Medical Record Management with Text Preprocessing for Spelling Errors. 2022 9th NAFOSTED Conference on Information and Computer Science (NICS), Ho Chi Minh City: IEEE; 2022. p. 223–9. https://doi.org/10.1109/NICS56915.2022.10013386.

  50. Dastgheib MB, Fakhrahmad SM, Jahromi MZ. Perspell: a new Persian semantic-based spelling correction system. Digit Sch Humanit. 2017;32(3):543–53.


  51. Ghayoomi M, Assi SM. Word prediction in a running text: A statistical language modeling for the Persian language. In: Proceedings of the Australasian Language Technology Workshop 2005. 2005.


  52. Kashefi O, Sharifi M, Minaie B. A novel string distance metric for ranking Persian respelling suggestions. Nat Lang Eng. 2013;19(2):259–84.


  53. MosaviMiangah T. FarsiSpell: a spell-checking system for Persian using a large monolingual corpus. Literary Linguist Comput. 2014;29(1):56–73.


  54. Naseem T, Hussain S. A novel approach for ranking spelling error corrections for Urdu. Lang Resour Eval. 2007;41(2):117–28.


  55. Shamsfard M. Challenges and open problems in Persian text processing. Proceedings of LTC. 2011;11:65–9.


  56. Shamsfard M, Jafari HS, Ilbeygi M. STeP-1: A set of fundamental tools for Persian text processing. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). 2010.


  57. Yazdani A, et al. Automated misspelling detection and correction in Persian clinical text. J Digit Imaging. 2020;33:555–62.


  58. Faili H, Ehsan N, Montazery M, Pilehvar MT. Vafa spell-checker for detecting spelling, grammatical, and real-word errors of Persian language. Digital Scholarsh Humanit. 2016;31(1):95–117.


  59. Ghayoomi M, Momtazi S, Bijankhan M. A Study of Corpus Development for Persian. Int J Asian Lang Process. 2010;20(1):17–34.


  60. Farshbafian A, Asl ES. A metafunctional approach to word order in Persian language. J Lang Linguist Stud. 2021;17(S2):773–93.


  61. Seraji M, Megyesi B, Nivre J. A basic language resource kit for Persian. In: Eight International Conference on Language Resources and Evaluation (LREC 2012), 23–25 May 2012. Istanbul: European Language Resources Association; 2012.


  62. Miangah TM, Vulanović R. The Ambiguity of the Relations between Graphemes and Phonemes in the Persian Orthographic System. Glottometrics. 2021;50:9–26.


  63. Modarresi Ghavami G. Vowel Harmony and Vowel-to-Vowel Coarticulation in Persian. Language and Linguistics. 2010;6(11):69–86.


  64. Sedighi A. Persian in use: An Elementary Textbook of Language and Culture. 1st ed. Leiden University Press; 2015. https://www.muse.jhu.edu/book/46336.

  65. Mozafari J, et al. PerAnSel: a novel deep neural network-based system for Persian question answering. Comput Intell Neurosci. 2022;2022:3661286.


  66. Ghomeshi J. The additive particle in Persian: A case of morphological homophony between syntax and pragmatics. Adv Iran Linguist. 2020;1:57–84.


  67. Bonyani M, Jahangard S, Daneshmand M. Persian handwritten digit, character and word recognition using deep learning. Int J Doc Anal Recognit. 2021;24(1–2):133–43.


  68. Rasooli MS, et al. Automatic standardization of colloquial Persian. 2020. arXiv preprint arXiv:2012.05879.

  69. Farahani M, et al. Parsbert: Transformer-based model for persian language understanding. Neural Process Lett. 2021;53:3831–47.


  70. Dehkhoda AA. Dehkhoda dictionary. Tehran: Tehran University; 1998. p. 1377.


  71. Peterson JL. A note on undetected typing errors. Commun ACM. 1986;29(7):633–7.


  72. Huang Y, Murphey YL, Ge Y. Automotive diagnosis typo correction using domain knowledge and machine learning. 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Singapore: IEEE; 2013. p. 267–74. https://doi.org/10.1109/CIDM.2013.6597246.

  73. Kukich K. Techniques for automatically correcting words in text. ACM Comput Surv (CSUR). 1992;24(4):377–439.


  74. Dowsett DJ. Radiological sciences dictionary: keywords, names and definitions. 1st ed. Hodder Arnold; 2009. https://doi.org/10.1201/b13300.

  75. Pennington J, Socher R, Manning CD. Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.


  76. Mikolov T, et al. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Proc Syst. 2013;26:3111–9.


  77. Mikolov T, Yih WT, Zweig G. Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies. 2013.


  78. Goldberg Y. A primer on neural network models for natural language processing. J Artif Intell Res. 2016;57:345–420.


  79. Radford A, et al. Improving language understanding by generative pre-training. 2018.


  80. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. p. 4171–86.

  81. Sarzynska-Wawer J, et al. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Res. 2021;304: 114135.


  82. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave É, Ott M, Zettlemoyer L, Stoyanov V. Unsupervised Cross-lingual Representation Learning at Scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. p. 8440–51.

  83. Raffel C, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(140):1–67.


  84. Yang Z, et al. Xlnet: Generalized autoregressive pretraining for language understanding. Adv Neural Inf Proc Syst. 2019;32:1–11.


  85. Liu Y, et al. Roberta: a robustly optimized bert pretraining approach; 2019. arXiv preprint arXiv:1907.11692.

  86. Wang W, Bao F, Gao G. Learning morpheme representation for mongolian named entity recognition. Neural Process Lett. 2019;50(3):2647–64.


  87. Taghizadeh N, et al. SINA-BERT: a pre-trained language model for analysis of medical texts in Persian. 2021. arXiv preprint arXiv:2104.07613.

  88. Abadi M, et al. TensorFlow: a system for large-scale machine learning. Savannah: OSDI; 2016.


  89. Ketkar N. Introduction to Keras. In: Deep learning with Python: a hands-on introduction. 2017. p. 97–111.


  90. Mikolov T, et al. Efficient estimation of word representations in vector space. 2013. arXiv preprint arXiv:1301.3781.

  91. Minn MJ, Zandieh AR, Filice RW. Improving radiology report quality by rapidly notifying radiologist of report errors. J Digit Imaging. 2015;28:492–8.


  92. Kruskal JB, et al. Quality initiatives: lean approach to improving performance and efficiency in a radiology department. Radiographics. 2012;32(2):573–87.



Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Contributions

Seyed Mohammad Sadegh Dashti and Seyedeh Fatemeh Dashti conceptualized and designed the study. Seyed Mohammad Sadegh Dashti developed the model and performed the experiments. Seyedeh Fatemeh Dashti collected and analyzed the data. Both authors contributed to writing the manuscript and approved the final version for publication.

Corresponding author

Correspondence to Seyed Mohammad Sadegh Dashti.

Ethics declarations

Ethics approval and consent to participate

The study was conducted in accordance with ethical standards and received approval from the Institutional Review Board of the Islamic Azad University, Kerman Branch (approval ID: IR.IAU.KERMAN.REC.1402.124). As the study did not involve any human trials, the requirement for informed consent was waived by the same Institutional Review Board.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Dashti, S.M.S., Dashti, S.F. Improving the quality of Persian clinical text with a novel spelling correction system. BMC Med Inform Decis Mak 24, 220 (2024). https://doi.org/10.1186/s12911-024-02613-0

