Composite CDE: modeling composite relationships between common data elements for representing complex clinical data

Background Semantic interoperability is essential for improving data quality and sharing. The ISO/IEC 11179 Metadata Registry (MDR) standard has been highlighted as a solution for standardizing and registering clinical data elements (DEs). However, the standard model has both structural and semantic limitations, and the number of DEs continues to increase due to poor term reusability. Semantic types and constraints are lacking for comprehensively describing and evaluating DEs on real-world clinical documents. Methods We addressed these limitations by defining three new types of semantic relationship (dependency, composite, and variable) in our previous studies. The present study created new and further extended existing semantic types (hybrid atomic and repeated and dictionary composite common data elements [CDEs]) with four constraints: ordered, operated, required, and dependent. For evaluation, we extracted all atomic and composite CDEs from five major clinical documents from five teaching hospitals in Korea, 14 Fast Healthcare Interoperability Resources (FHIR) resources from FHIR bulk sample data, and the MIMIC-III (Medical Information Mart for Intensive Care) demo dataset. Metadata reusability and semantic interoperability in real clinical settings were comprehensively evaluated by applying the CDEs with our extended semantic types and constraints. Results All of the CDEs (n = 1142) extracted from the 25 clinical documents were successfully integrated with a very high CDE reuse ratio (46.9%) into 586 CDEs (259 atomic and 20 unique composite CDEs), and all of the CDEs (n = 238) extracted from the 14 FHIR resources of FHIR bulk sample data were successfully integrated with a high CDE reuse ratio (59.7%) into 96 CDEs (21 atomic and 28 unique composite CDEs), which improved the semantic integrity and interoperability without any semantic loss.
Moreover, the most complex data structures from two CDE projects were successfully encoded with rich semantics and semantic integrity. Conclusion MDR-based extended semantic types and constraints can facilitate comprehensive representation of clinical documents with rich semantics, and improved semantic interoperability without semantic loss.


Background
Data harmonization and interoperability are essential for advancing biomedical research. These features can be achieved by representing clinical data in a standard format, and they are crucial for facilitating understanding and sharing data across diverse translational studies [1,2]. A data element (DE) is defined as the fundamental unit of data which contains information with a clear conceptualized meaning, together with its representation, and is considered as the correct approach for standardizing data and improving data quality (DQ) and efficiency.
The ISO/IEC 11179 Metadata Registry (MDR) standard describes a method of standardizing and registering DEs to make them understandable and shareable between studies and institutions. MDR-based DEs provide data uniformity and interoperability between clinical studies and institutions because they are specified according to a standard metadata model consisting of a set of attributes that delineate the definition, identification, representation, classification, and permissible values [3][4][5].
The terms DE and common data element (CDE) have often been used interchangeably. However, the two terms can be clearly distinguished as follows. A DE is an atomic unit of data that has precise meaning and precise semantics in metadata. A CDE is a data element that is common to multiple data sets across different studies [6]. In this paper, we use the term DE specifically to describe the concept of metadata, and we use the term CDE in all other cases.
CDEs are increasingly being used by clinical researchers in trials for harmonizing data collected across diverse studies. The use of standardized CDEs provides various benefits to investigators including (1) rapid and efficient study start-up by enabling access to defined CDEs and case report forms (CRFs) and (2) enriched data sharing and aggregation using standard definitions and forms [7].
The use of CDEs has been extended to clinical practice by using standardized CDEs for representing the clinical information in electronic health records (EHRs). For example, Newton et al. included phenotype data in EHRs using CDEs to facilitate EHR-driven genomic studies [8]. The National Institutes of Health have developed ISO/IEC 11179 MDR-based CDEs providing a controlled terminology for data descriptors. They also encouraged clinical researchers to use CDEs to facilitate data harmonization [5]. CDEs have been adopted in numerous clinical domains including cancer, stroke, epilepsy, rare disease, emergency medicine, and radiology for patient care and research. Utilizing CDEs will facilitate secondary data use (i.e., 'collect once and use many times'), which is an approach to data standardization for spanning silos in primary and secondary data use [9].
However, ISO/IEC 11179 MDR focuses only on the representation of individual and independent CDEs and provides no means of describing constraints on a CDE or relationships among different CDEs, both of which are essential for fully describing, semantically composing, and correctly interpreting the CDEs of clinical documents [10][11][12][13]. Although the ISO/IEC 11179 MDR standard describes the Derived Data Element (DDE) [14], which details the relationship between a CDE and another CDE from which it is derived together with the rule controlling its derivation, this approach is inherently limited because it requires one or more input CDEs and the DDE becomes the output DE. For example, while CDEs for describing systolic blood pressure (SBP) and diastolic blood pressure (DBP) can be easily defined as two separate CDEs annotated with standardized metadata conforming to the ISO/IEC 11179 MDR standard, these two CDEs become mere input CDEs, and a separate output CDE must be created as the DDE. Also, a constraint between the two CDEs such as 'the SBP must be greater than the DBP' is usually described outside of the CDEs, since there is no designated reason for the CDEs to carry constraint information.
To address these challenges, in our previous study [10] we proposed three types of semantic relationships (i.e., variable, dependency, and composite relationships) representing semantic constraints or rules among multiple CDEs. These relationships can be described as follows. First, CDEs are in a variable relationship when they can be systematically derived from a base CDE by applying a standardized concept from a controlled vocabulary as the variable. For example, the meanings of two CDEs for 'normal value range of laboratory test, Albumin' and 'normal value range of laboratory test, Homocysteine' are closely related, differing only in the laboratory test names of 'Albumin' and 'Homocysteine.' This means that many laboratory-test-related CDEs can be assigned to one variable CDE. The variable relationship can systematically represent all these variations as a single CDE, 'DE: Normal value range of lab test x,' by specifying a controlled vocabulary such as LOINC. The variable relationship can therefore systematically reduce the number of required CDEs. Second, a CDE is in a dependency relationship when its value space is determined or influenced by the value(s) of one or more other CDEs. For example, the value of a certain CDE may be defined as the sum of the values of a set of CDEs in a questionnaire. Third, the composite relationship can be conveniently applied to integrate several interrelated CDEs into a composite CDE. For example, the medical history of a patient is likely to be more informative when body parts are correctly assigned, which can be achieved by grouping 'DE: Body System for Medical History' and 'DE: Medical History Specify' into the composite CDE of 'DE: Medical History.' However, we realized that our previous work supports only relatively simple semantic relationships among CDEs and is not robust enough to cover many other specific challenges associated with CDEs used in real-world clinical forms.
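As an illustration of the variable relationship described above, the following minimal Python sketch shows how one base CDE might stand in for many laboratory-test CDEs. The class name `VariableCDE`, the `instantiate` method, and the vocabulary codes are hypothetical and introduced only for this example; the codes are placeholders, not real LOINC codes.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a controlled vocabulary such as LOINC.
# The keys are placeholder codes, NOT real LOINC codes.
LAB_TEST_VOCABULARY = {"LAB-0001": "Albumin", "LAB-0002": "Homocysteine"}


@dataclass
class VariableCDE:
    """One base CDE whose meaning is completed by a vocabulary term."""
    name_template: str
    vocabulary: dict

    def instantiate(self, code: str) -> str:
        # Terms outside the controlled vocabulary are rejected.
        if code not in self.vocabulary:
            raise KeyError(f"{code!r} is not in the controlled vocabulary")
        return self.name_template.format(x=self.vocabulary[code])


normal_range = VariableCDE("Normal value range of lab test {x}", LAB_TEST_VOCABULARY)
names = [normal_range.instantiate(code) for code in LAB_TEST_VOCABULARY]
print(names)
```

Under these assumptions, a single registered CDE covers every term in the vocabulary, which is the mechanism by which the variable relationship reduces the number of required CDEs.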
The present study further proposes extended semantic types (hybrid atomic CDE (aCDE) and repeated and dictionary composite CDEs (cCDEs)) and four semantic constraints (ordered, operated, required, and dependent) for correctly representing even more complex but essential semantic relationships between CDEs that are found in real-world clinical documents (Fig. 1). We found useful patterns characterizing challenging cases that required further semantic definitions and descriptions. These fall into the following four cases.

Data entries with multiple data types
A data type determines the type of data that can be entered and stored in a CDE, and each CDE contains only one data type [15]. However, we found that free-text-based data entry in many clinical documents stored in EHRs often allows multiple data types to be entered and stored in the same attribute. For example, a laboratory result for syphilis normally has a numeric data type that allows numeric values (e.g., '0.8') as input. However, it often also requires the entry of string or logical data such as 'negative' or 'false' as input. Creating two strictly separate CDEs for the same laboratory result for syphilis (i.e., numeric and string) may sometimes cause more confusion than it resolves. We found that it is sometimes better to allow either numeric or string data types for the same value domain. We created a value property (hybrid) that makes multiple conventional data types available in the same CDE by explicitly defining a hybrid data type for the CDE.

Dictionary data entries
Data may refer to a controlled biomedical vocabulary for several reasons such as adherence to standards, semantic enrichment for better understanding, and input validation for improving semantic integrity. A CDE referring to a controlled biomedical vocabulary was defined as being in a variable relationship in our previous study [10]. We extended the concept of the variable relationship to dictionary data entries in order to tightly link a set of CDEs via a 'foreign key' between a real-world dictionary database and a controlled biomedical vocabulary. This also ensures that a set of CDEs and tuples with rich attributes provided by the dictionary are linked with their proper data type definitions and value domains.
Tabular data entries with repeated data entry
Clinical data are frequently described in tabular formats. A tabular data entry is an enclosed structure in which a composed set of CDEs is repetitively listed for repeated observations. For example, body weight and height may be measured for each patient when she/he visits for treatment. The set of data items such as body weight, height, and date of measurement should be collected both together and repeatedly. We created a value property (repeat) to ensure that the values that belong to the same set of CDEs are identified as such.

Data constraints
Highly interrelated CDEs in a clinical document need to be defined by semantic constraints for better interchange of semantics and context. By specifying constraints on an aCDE, users can further narrow down the definition of what a valid value really means. For example, a derived value such as BMI (body mass index) can be automatically calculated from the values of the two aCDEs for body weight and height. Because the values of the body weight and height aCDEs should not be null, a required constraint should be applied to each of the two aCDEs to make the BMI aCDE valid. The calculation formula to obtain BMI is described by an operated constraint of the BMI aCDE.

Fig. 1 Overview of the formal relationship between aCDE and cCDEs with extended semantic types and CDE-type specific constraints
For another example, for an aCDE related to the question of whether any drug side effect has happened with permissible answers of 'Yes' and 'No', the following aCDE, "Specify the drug side effect", holds only when the value was 'Yes'. These two aCDEs are in a dependent relationship with each other and the sequence of the two has an order. The dependent and sequence relationships can be defined by dependent and ordered constraints.
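The dependent constraint in the drug side effect example can be sketched as a small validation function. This is only an illustration: the function name `validate_side_effect` and the field names are hypothetical, and treating the follow-up answer as mandatory when the first answer is 'Yes' is an assumption of this sketch rather than something stated in the text.

```python
def validate_side_effect(answers: dict) -> list:
    """Check the dependent constraint between the two side-effect aCDEs.

    The field names are hypothetical; the first aCDE is asked before the second.
    """
    errors = []
    occurred = answers.get("side_effect_occurred")
    detail = answers.get("side_effect_specify")
    if occurred not in ("Yes", "No"):
        errors.append("side_effect_occurred must be 'Yes' or 'No'")
    elif occurred == "Yes" and not detail:
        # Assumption of this sketch: the follow-up is mandatory after 'Yes'.
        errors.append("side_effect_specify is expected when the answer is 'Yes'")
    elif occurred == "No" and detail:
        # The follow-up aCDE holds only when the first answer is 'Yes'.
        errors.append("side_effect_specify applies only when the answer is 'Yes'")
    return errors


print(validate_side_effect({"side_effect_occurred": "Yes", "side_effect_specify": "rash"}))
print(validate_side_effect({"side_effect_occurred": "No", "side_effect_specify": "rash"}))
```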

Data sources: 2 CDE projects
The National Institute of Neurological Disorders and Stroke (NINDS) CDE Project [16] is an ongoing effort to develop data standards for use in clinical research in neuroscience. It was initiated in 2006 to standardize data collection across neurological-disorder-related clinical studies funded by the NINDS. As of October 2016, the NINDS CDE project included 20 studies with 11,296 distinct CDEs. The NINDS CDEs are not fully compliant with ISO/IEC 11179. Instead, they are provided with only simple CDE descriptions and definitions. However, the subset of NINDS CDEs that is registered in the National Cancer Institute (NCI) cancer Data Standards Registry (caDSR) and was reviewed by the NCI cancer Biomedical Informatics Grid project manager conforms fully with the ISO/IEC 11179 MDR standard. In the present study, we used the part of the NINDS CDEs that is registered in the caDSR, namely 308 (3.1%) stroke and general CDEs of the NINDS in 57 CRFs (Supplementary Tables S1). Selected CDEs within the context of their CRFs were explored for challenging cases requiring the new semantic relationships that we have defined.
The DialysisNet and Avatar Beans Project is a tablet- and phone-based mobile application developed by the Health Avatar Initiative [17]. The project started in 2013, and it has established clinical data standards for managing and harmonizing hemodialysis data across multiple medical institutions in Korea [18,19]. This project aims to improve the management of chronic kidney disease and end-stage renal disease by using an integrated mobile application for data collection and documentation. The DialysisNet application was initially built upon 122 distinct hemodialysis-associated CDEs based on CRFs from four major hemodialysis centers (Supplementary Tables S2). We used 11,428 CDEs from the above two projects for comprehensively defining and evaluating new CDE relationships and constraints.

Designating key concepts
The CRFs and clinical documents from the two CDE projects incorporate all the data collection items with CDEs. We first examined the CDEs to formalize the four challenging cases mentioned above. Figure 1 depicts the formal relationships between atomic (aCDE) and composite (cCDE) CDEs with type-specific constraints. Since the core structure of a CDE is a name-value pair augmented by DE concept-domain and value-domain details, an aCDE is a single unambiguously described data item [19]. Our previous and simple-minded definition of cCDE as a set of interrelated aCDEs [9] was extended to include two new semantic relationships: dictionary and repeated cCDEs.
For example, a drug side effect is regarded as an undesirable secondary effect that occurs in addition to the desired therapeutic effect of a medication. To correctly represent 'a drug side effect', at least three types of information need to be presented: 'drug name', 'drug dosage', and 'drug side effect'. One can define the three types of information as aCDEs and then combine them to compose a cCDE.
We extracted aCDEs and cCDEs from the two DE projects mentioned above (NINDS and DialysisNet CDE Projects) and applied the extended semantic types and constraints. We then mapped and integrated the CDEs in order to comprehensively evaluate the metadata reusability and semantic interoperability in the clinical-practice setting.
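The composition of aCDEs into a cCDE in the drug side effect example can be pictured with a minimal Python sketch. The class names, identifiers, and data types below are hypothetical and chosen only for illustration; they do not reflect the registry's actual metadata attributes.

```python
from dataclasses import dataclass, field


@dataclass
class AtomicCDE:
    identifier: str  # hypothetical identifier, not a registry ID
    name: str
    data_type: str


@dataclass
class CompositeCDE:
    """A cCDE with its own identifier, grouping interrelated aCDEs."""
    identifier: str
    name: str
    members: list = field(default_factory=list)

    def member_names(self) -> list:
        return [member.name for member in self.members]


drug_side_effect = CompositeCDE(
    "cCDE-example",
    "drug side effect",
    [
        AtomicCDE("aCDE-1", "drug name", "string"),
        AtomicCDE("aCDE-2", "drug dosage", "numeric"),
        AtomicCDE("aCDE-3", "drug side effect", "string"),
    ],
)
print(drug_side_effect.member_names())
```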

Evaluation scheme
For the purpose of evaluating the utility of the newly proposed semantic types and constraints, we used three different data sources: (1) deriving CDEs from clinical documents, (2) Fast Healthcare Interoperability Resources (FHIR) based structured data, and (3) practical clinical dataset from MIMIC-III (Medical Information Mart for Intensive Care).
For deriving CDEs from clinical documents, we collected 25 clinical documents in real-world clinical practice, comprising five documents (Admission Note, Initial Medical Examination Note, Discharge Summary, Emergency Note, and Operation Note) from each of five major teaching hospitals in Korea: Seoul National University Hospital, Ajou University Medical Center, Pusan National University Hospital, Gachon University Gil Hospital, and Chonnam National University Hospital. These documents contain Patient, PastHistory, AdmissionInformation, Operation, FamilyHistory, SocialHistory, LabResult, Medication, VitalSign, Treatments, and PhysicalExam [18]. We chose these 25 clinical documents since they are used in common by all five hospitals and are essential in the process from patient admission to discharge, for representing the specificity of the data. A limitation of these 25 clinical documents is that they do not provide sufficient depth and detail concerning the levels of clinical data. Thus, we added two different structured data sources: the FHIR bulk sample data and the MIMIC-III demo dataset.
FHIR is promoted as an open standard describing data formats and elements, known as 'resources', and an application programming interface (API) for exchanging EHRs. FHIR's clinical resource definitions are concrete, intuitive concepts such as MedicationPrescription, AdverseReaction, Procedure, and Condition. The standard was created by the Health Level Seven International (HL7) healthcare standards organization. We downloaded FHIR bulk sample data, which is exported from a FHIR server to a preauthorized client by using the FHIR bulk Downloader sample app [20][21][22]. Among the 145 resources of FHIR version 4 [23], the FHIR bulk sample data contains 14 resources: AllergyIntolerance, CarePlan, Claim, Condition, Goal, Encounter, Observation, DiagnosticReport, Immunization, MedicationRequest, ImagingStudy, Organization, Patient, and Procedure. Although we could analyze the metadata of all FHIR resources through the structural information provided by HL7, it was necessary to review the actual sample data with metadata to confirm the relationships and constraints among the data. Thus, we chose 14 out of the 145 FHIR resources.
The MIMIC-III clinical database contains comprehensive clinical data relating to tens of thousands of Intensive Care Unit patients. MIMIC-III is a large, freely available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The dataset has 26 tables, which include vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. We downloaded the MIMIC-III demo dataset, which is limited to 100 patients. While the number of patients was limited, the metadata and data schema were identical [24,25].
The evaluation process consisted of the following three steps: CDE extraction, CDE integration, and construction of semantic relationships among the CDEs. We counted the numbers of CDEs generated in each step as an evaluation measure of the structural efficiency for the 25 clinical documents and FHIR bulk sample data. However, the MIMIC-III demo dataset is provided as a relational database, containing tables of data relating to patients. A table is a data storage structure which is similar to a spreadsheet: each column contains consistent information (e.g., patient identifiers), and each row contains an instantiation of that information (e.g. a row could contain the integer 340 in the patient identifier column which would imply that the row's patient identifier is 340) [25]. We manually reviewed the relationships among the columns of each table, whether there were cases which were covered by our proposed CDE relationships and constraints.

Overview of all types of semantic relationships
To address the semantic challenges described above, we defined atomic and composite CDEs using three newly proposed semantic types, i.e., hybrid, dictionary, and repeated, and three constraints, i.e., ordered, operated, and required, in addition to the existing two semantic relationship constraints, i.e., dependent and variable relationships, defined in our previous study. The newly defined composite semantic type replaced the old composite relationship constraint that we defined previously [10]. Figure 1 depicts atomic and composite CDEs along with their specific relationships and constraints. An aCDE can be constrained using variable and hybrid relationships by classifying them as variable and hybrid aCDEs, respectively. The definition of cCDE as a set of interrelated aCDEs in our previous study [10] was extended to include a clear definition, a separate identifier for reuse, and constraints among aCDEs included in a cCDE. A cCDE can be classified as a dictionary or a repeated cCDE. The dependent relationship was the only relationship constraint in our previous study. We extended it to four constraints: ordered, operated, required, and dependent. As shown in the lower left panel in Fig. 1, the ordered constraint does not apply to an aCDE.

Data entries with multiple data types: Hybrid aCDE
A hybrid aCDE is a particular type of aCDE that allows a value domain with multiple (or hybrid) data types. Technically it includes several aCDEs having the same CDE concept but different value domains. Figure 2a shows a part of a hemodialysis CRF from the DialysisNet and Avatar Beans Project. A time-tagged hybrid aCDE was applied to the Time attribute in a tabular data-entry format (Fig. 2a). Time is defined as a hybrid aCDE, Hemodialysis_Time_Hybrid_DE (DE:47616). Time is derived from two aCDEs, i.e., Hemodialysis_Time_DE (DE:43239) and Hemodialysis_Time_String_DE (DE:47614), which allow the 'time' data type (e.g., '08:00') and the 'enumerated string' data type (e.g., 'Finish' and 'Start'), respectively (Fig. 2b). The hybrid aCDE Time (or Hemodialysis_Time_Hybrid_DE (DE:47616)) can capture either a time or an enumerated string value as input.
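A hybrid value domain of this kind can be checked by trying each component value domain in turn. The sketch below is an illustration only: the function name `classify_hybrid_time` is hypothetical, and the 24-hour 'HH:MM' pattern and the two permitted strings are assumptions drawn from the examples in the text ('08:00', 'Start', 'Finish').

```python
import re

# Assumed 24-hour HH:MM format, based on the example value '08:00'.
TIME_PATTERN = re.compile(r"^([01]\d|2[0-3]):[0-5]\d$")
# Assumed permissible strings, based on the examples 'Start' and 'Finish'.
ENUMERATED_VALUES = {"Start", "Finish"}


def classify_hybrid_time(value: str) -> str:
    """Return which component value domain of the hybrid aCDE accepts the value."""
    if TIME_PATTERN.match(value):
        return "time"
    if value in ENUMERATED_VALUES:
        return "enumerated string"
    raise ValueError(f"{value!r} matches neither value domain of the hybrid aCDE")


for entry in ("08:00", "Finish"):
    print(entry, "->", classify_hybrid_time(entry))
```

Keeping the two component domains explicit means that a single data-entry field can accept both kinds of input while each value remains attributable to exactly one underlying aCDE.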

Tabular data entries: Repeated cCDE
A repeated cCDE is a cCDE that captures data input multiple times in a tabular format. The definition of the repeated cCDE prevents the unnecessary creation of redundant CDEs and captures input data in a tabular format.
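A repeated cCDE can be pictured as one set of member aCDEs plus any number of rows that share that set. The following minimal Python sketch uses the body weight, height, and measurement date example from the Background; the class name `RepeatedCCDE`, the member names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RepeatedCCDE:
    """A composite CDE whose member aCDEs are recorded together, once per row."""
    name: str
    member_names: tuple
    rows: list = field(default_factory=list)

    def add_row(self, **values) -> None:
        # Every repetition must supply exactly the member aCDEs of the set.
        if set(values) != set(self.member_names):
            raise ValueError(f"expected values for {sorted(self.member_names)}")
        self.rows.append(values)


body_measurement = RepeatedCCDE(
    "body measurement", ("body_weight_kg", "height_cm", "measurement_date")
)
# Illustrative values only.
body_measurement.add_row(body_weight_kg=61.5, height_cm=168.0, measurement_date="2020-03-02")
body_measurement.add_row(body_weight_kg=60.9, height_cm=168.0, measurement_date="2020-04-06")
print(len(body_measurement.rows))
```

In this sketch each visit adds a row rather than a new set of CDEs, which is how the repeated property avoids redundant CDE definitions.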

Dictionary data entries: Dictionary cCDE
Our previous study [10] defined a variable CDE as a CDE that contains a controlled biomedical vocabulary variable. Similarly, a cCDE containing a variable aCDE as the primary key of a dictionary table can be defined as a dictionary cCDE. This approach provides a way to encode an entire dictionary table as well as a controlled vocabulary into a single dictionary cCDE, and thereby capture comprehensive biomedical knowledge from a database. A dictionary cCDE provides a useful means to apply relevant attributes of a dictionary database to constrain and validate input values to the dictionary cCDE. Figure 4a displays a typical data-entry document for laboratory test results in a tabular format. The 'Electrolyte Laboratory Tests' form from 'Recommended Labs for Stroke' of the NINDS CDE project [26] consists of six attributes including the laboratory test name, laboratory test result, unit of the laboratory test result, an indicator for whether the laboratory test result was abnormal, and another indicator for whether the laboratory test result was clinically significant when the laboratory test result was abnormal. Figure 4b shows a part of the structured NINDS 'Electrolyte Laboratory Tests Dictionary' reference table. The Unit of Result attribute supports multiple units that are delimited by '^'. The Normal Range attribute is also separated according to the Unit of Result and is represented in JSON (Javascript object notation)-type encoding.
A dictionary cCDE can systematically capture the entire 'Electrolyte Laboratory Tests' data-entry document as 'DE: 47571 Laboratory_Test_NINDS_Composite_DE,' which is composed of six aCDEs (Fig. 4c, Relation). The Relation Rule in Fig. 4c shows how a dictionary cCDE accompanied by its constraint rules is defined. For the two evaluation cases listed in Fig. 4b, both a Dictionary Rule and a Dependent Rule are defined by symbolic logic (or pseudocode) with the accompanying Descriptions. The Dictionary Rule defines how to use the biomedical knowledge contained in a dictionary table, and the Dependent Rule defines the interrelatedness of aCDEs in a cCDE by using the dependent constraint relationship.
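The way a dictionary table can constrain and validate input values might be sketched as follows. This is not the Dictionary Rule of Fig. 4c: the table content, the numeric ranges, and the function name `check_lab_entry` are hypothetical. Only the two encoding conventions, units delimited by '^' and normal ranges stored as JSON keyed by unit, are taken from the description above.

```python
import json

# Hypothetical dictionary table; the test names, units, and ranges are
# illustrative only and are not taken from the NINDS reference table.
DICTIONARY = {
    "Sodium": {
        "units": "mmol/L^mEq/L",
        "normal_range": '{"mmol/L": [135, 145], "mEq/L": [135, 145]}',
    },
}


def check_lab_entry(test_name: str, result: float, unit: str) -> dict:
    """Validate one lab entry against the dictionary and flag abnormal results."""
    entry = DICTIONARY.get(test_name)
    if entry is None:
        raise KeyError(f"{test_name!r} is not in the dictionary")
    allowed_units = entry["units"].split("^")  # units are '^'-delimited
    if unit not in allowed_units:
        raise ValueError(f"unit {unit!r} is not allowed for {test_name}")
    low, high = json.loads(entry["normal_range"])[unit]  # range depends on unit
    return {"abnormal": not (low <= result <= high)}


print(check_lab_entry("Sodium", 140, "mmol/L"))
print(check_lab_entry("Sodium", 150, "mEq/L"))
```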

Semantic restriction: Constraints
We defined four constraints that support the creation of a robust clinical document by specifying the interrelationships among many aCDEs. We also defined four classes of operators: assignment, arithmetic, logical, and relational. The ordered constraint can only be applied to aCDEs contained in a cCDE. However, the other three constraints (operated, required, and dependent) can be applied to both independent aCDEs and those contained in cCDEs (Fig. 1). We created symbolic logic with prefix notation [27] (Table 1) to describe the order of operations and to formulate constraints. More practical examples are shown in Fig. 5 to demonstrate how constraints are applied to a repeated cCDE as well. The four constraints are described as follows: [...] Table 1C can be ordered by a constraint statement such as (Ordered CDE20 CDE21 CDE22).
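To make the idea of prefix-notation constraints concrete, the sketch below parses and evaluates a small parenthesized prefix expression in Python. It is an illustration under stated assumptions and not the notation of Table 1: the operator symbols, the restriction to binary operators, and the names `parse`, `evaluate`, WEIGHT, HEIGHT, SBP, and DBP are all hypothetical.

```python
import operator

# Assumed operator symbols; only binary arithmetic and relational operators.
OPERATORS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    "/": operator.truediv, ">": operator.gt, "<": operator.lt,
}


def parse(text: str):
    """Turn '(op a b)' text into nested Python lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1  # skip the closing parenthesis
        return tokens[pos], pos + 1

    expression, _ = read(0)
    return expression


def evaluate(expression, values: dict):
    """Evaluate a parsed expression against the current CDE values."""
    if isinstance(expression, list):
        symbol, left, right = expression
        return OPERATORS[symbol](evaluate(left, values), evaluate(right, values))
    if expression in values:
        return values[expression]
    return float(expression)  # a numeric literal


# An operated-style rule: BMI from weight (kg) and height (m).
bmi = evaluate(parse("(/ WEIGHT (* HEIGHT HEIGHT))"), {"WEIGHT": 70.0, "HEIGHT": 1.75})
# A relational rule between two CDEs: SBP must exceed DBP.
sbp_ok = evaluate(parse("(> SBP DBP)"), {"SBP": 120, "DBP": 80})
print(round(bmi, 2), sbp_ok)
```

Prefix notation keeps the operator first in every group, so the order of operations is fixed by the nesting and needs no precedence rules.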

Evaluation study
To evaluate the usefulness of our newly extended composite semantic relationships, we applied them to CDEs that were systematically extracted from the 25 clinical documents of five teaching hospitals in Korea and from FHIR bulk sample data. At first, we focused on deriving CDEs from clinical documents, which provided many explicit cases that clearly demonstrated the relationships between CDEs. We then wanted to show that our proposed relationships and constraints were also valid in structured clinical datasets. This is why we chose two different types of source data: unstructured and structured data. The evaluation process consisted of the following steps: CDE extraction, and CDE integration by using the newly proposed atomic and composite CDEs with semantic enrichments. We examined how the number of CDEs had been reduced from CDE extraction to CDE integration, measuring the structural and semantic efficiency of CDEs for clinical data elements. Although HL7 FHIR mainly supports structured data, it also provides a document-related resource, FHIR Questionnaire. To see whether our proposed semantic types can cover FHIR Questionnaire, we matched elements of the FHIR Questionnaire resource to our developed relationships and constraints for further evaluation.
Table 1 Encoding operated, required, dependent, and ordered constraints for CDEs with prefix notation. Examples of (A) an operated constraint for calculating BMI, (B) a required constraint for demography information, (C) a dependent constraint for smoking history, and (D) an ordered constraint

For evaluating derived CDEs from clinical documents, we first extracted 84, 48, 70, 83, and 37 CDEs from the [...] (Table 2). In the CDE extraction step, we found that applying CDE is an effective way to reduce redundant CDEs (22.23 7.9%) at each hospital. This means that there were many CDEs shared across the five different clinical documents at each hospital. We found that an even higher CDE reduction rate of 48.7% was achieved by integrating the information for all five hospitals, which indicated that various CDEs were commonly used across the five different teaching hospitals. The CDE integration step involved integrating aCDEs into clinically relevant cCDEs to further structure the clinical documents and then integrating the cCDEs across different clinical documents. For example, when a vital sign-related cCDE contained three aCDEs ('body weight,' 'body temperature,' and 'blood pressure') and another vital sign-related cCDE contained an additional aCDE ('description the reason of unstable vital sign'), we integrated them into a vital-sign cCDE comprising four aCDEs. The application of these three steps constantly [...] (Table 2) but also showed greatly improved semantic accuracy and interoperability, which was also supported by the review of the documents by the authors.
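The vital-sign example of the integration step amounts to an order-preserving union of the member aCDEs of the two cCDEs. The following Python sketch illustrates that one operation; the function name `integrate_ccdes` is hypothetical, and matching aCDEs by exact name is a simplifying assumption (the study mapped CDEs by their semantics).

```python
def integrate_ccdes(*member_lists):
    """Merge the member aCDEs of several cCDEs, keeping first-seen order.

    Assumption of this sketch: two aCDEs are the same when their names match.
    """
    merged = []
    for members in member_lists:
        for name in members:
            if name not in merged:
                merged.append(name)
    return merged


vital_sign_a = ["body weight", "body temperature", "blood pressure"]
vital_sign_b = [
    "body weight", "body temperature", "blood pressure",
    "description the reason of unstable vital sign",
]
vital_sign = integrate_ccdes(vital_sign_a, vital_sign_b)
print(len(vital_sign))
```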
We found that the compositions of the clinical documents differed quite markedly across the five hospitals. The clinical documents at Hospitals P and S contained the largest (n = 266) and smallest (n = 31) numbers of independent CDEs, respectively. We also found that even the same clinical documents showed huge variations in CDE numbers. The number of CDEs in Admission Notes varied from 12 at Hospital S to 204 at Hospital P. Hospital P also had the largest number of aCDEs for Initial Medical Examination Note (n = 123) while Hospital A had the largest number of aCDEs for Emergency Note (n = 83) and Operation Note (n = 37). We also applied constraint rules for the five clinical documents of the five hospitals (Table 3). We could not determine if a CDE was a hybrid aCDE partly due to the lack of sufficient input values and partly due to poor descriptions of the response values for the clinical documents. We designated the cCDEs as basic cCDEs to distinguish them from repeated and dictionary cCDEs. A cCDE was on average reused twice among the five documents by the hospitals. We also found that the clinical documents at Hospital A were the best structured and contained the greatest detail with more cCDEs and constraint rules compared to the documents of the other hospitals.
We evaluated the DE relationships and constraints by applying the same method to a different data source, namely the 14 FHIR resources from FHIR bulk sample data. We first extracted 238 CDEs and found that 142 CDEs (59.7%) were reused in at least 2 of the 14 FHIR resources, resulting in 96 unique aCDEs. We then created clinically relevant cCDEs and applied semantic relationships to them. In total, 48 cCDEs successfully captured 194 (81.5%) of the 238 CDEs. Finally, 28 cCDEs successfully captured 75 of the 96 unique CDEs, such that 49 (= 28 + 21) CDEs were enough to represent the initial 238 CDEs extracted from the 14 FHIR resources (Table 4). Supplementary Tables S6-S7 list the cCDEs and how they were distributed in each FHIR resource. The fact that more than half of the CDEs have been reused shows that the FHIR data are relatively well standardized and structured. Half of the FHIR resources, i.e., AllergyIntolerance, Condition, Encounter, Goal, MedicationRequest, Organization, and Procedure, were represented by repeated cCDEs, which means that all extracted CDEs of each of these FHIR resources became component aCDEs of the repeated cCDEs. These structured data have been reused frequently among different FHIR resources.
While mapping our proposed semantic types and constraints to FHIR resources, we found that the hybrid aCDE and the operated and dependent constraints were not applicable to FHIR resources. For the case of the hybrid aCDE, although only one datatype is allowed for each data item in the FHIR specification, we found no restriction on the datatype in the FHIR bulk sample data since the data were represented in JSON and XML. While the required and ordered constraints were explicitly indicated, the operated and dependent constraints were not valid in FHIR resources because the rule by which two or more data values were related could not be applied (Table 5).
Another evaluation was the mapping between our semantic types and constraints and the document-associated FHIR resource, Questionnaire. Figure 6 represents the mapping, with extracts of the FHIR structure on the left side linked via arrows to the corresponding developed CDE relationships and constraints. The relevant elements in the FHIR Questionnaire resource were group and question, which represent composite and atomic CDEs (the data model of a single question), respectively. Among our three CDE relationships and four constraints, the repeated cCDE relationship and the required and operated constraints were straightforwardly mapped. The FHIR Questionnaire resource is intended to define data collection forms, surveys, and other structures that can be filled out, together with their context. It had a certain structure to represent relationships among CDEs, but value-related constraints could not be modelled. For instance, it could not be represented whether the value allows for multiple data types (Hybrid aCDE) or whether one value can be changed depending upon another element's value (Constraint: Dependent).
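The counts reported for the FHIR evaluation are mutually consistent, as the following short Python check shows. The variable names are introduced here for illustration, and computing each percentage as a share of the 238 extracted CDEs is an interpretation that reproduces the reported figures.

```python
extracted = 238           # CDEs extracted from the 14 FHIR resources
reused = 142              # CDEs reused in at least 2 of the 14 resources
unique = extracted - reused                       # unique aCDEs
reuse_ratio = round(100 * reused / extracted, 1)  # percent of extracted CDEs

captured_by_48 = 194      # CDEs captured by the 48 cCDEs
capture_ratio = round(100 * captured_by_48 / extracted, 1)

captured_unique = 75      # unique CDEs captured by the 28 cCDEs
remaining_atomic = unique - captured_unique       # aCDEs left outside any cCDE
final_cdes = 28 + remaining_atomic                # cCDEs plus remaining aCDEs

print(unique, reuse_ratio, capture_ratio, remaining_atomic, final_cdes)
```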
For evaluation with a real dataset, we analyzed 26 tables of the MIMIC-III demo database. These tables were divided into three categories according to their data characteristics: (1) 14 tables for hospital data, (2) three tables for online definitions, and (3) 19 tables for CareVue and MetaVision ICU-related data (Supplementary Table S8). We first manually reviewed the relationships among the columns of each table. The evaluation was conducted only for cases in which a relationship was found, through the following steps: CDE extraction, CDE integration using atomic and composite CDEs, and construction of semantic relationships among the CDEs.
We found four hybrid aCDEs that allow both numeric and text data. For example, VALUE in LABEVENTS allows both string and numeric data. If the value is numeric, VALUENUM holds the same data in numeric format, with the appropriate unit given in VALUEUOM, so that it can be used in calculations. The four general cCDEs in Table 5 are the cCDEs that include a hybrid aCDE. We also found three variable aCDEs, each associated with its own dictionary cCDE. For example, ICD9_CODE in DIAGNOSES_ICD is matched to the same value of ICD9_CODE in D_ICD_DIAGNOSES. Each table became a repeated cCDE because it is composed of a set of related items. All tables have a required constraint, and two tables have an operated constraint. Because the MIMIC data are provided as a relational database, the dependent and ordered constraints are not applicable: a relational table treats the value of each column independently and without ordering, based on set inclusion theory (Table 5). Supplementary Tables S8-S9 list in detail how the MIMIC-III metadata matched our proposed relationships and constraints.
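The two MIMIC-III patterns described above can be sketched in a few lines. The rows below are made up and only shaped like LABEVENTS, DIAGNOSES_ICD, and D_ICD_DIAGNOSES; they are not real MIMIC records, and the helper names `to_valuenum` and `resolve_variable` are hypothetical.

```python
from typing import Dict, Optional


def to_valuenum(value: Optional[str]) -> Optional[float]:
    """Hybrid aCDE: return the numeric form of VALUE, or None if it is text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def resolve_variable(code: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Variable aCDE: look the code up in its dictionary cCDE."""
    return dictionary.get(code)


# Made-up rows shaped like LABEVENTS.
lab_rows = [
    {"VALUE": "4.2", "VALUEUOM": "mEq/L"},
    {"VALUE": "NEG", "VALUEUOM": None},
]
for row in lab_rows:
    row["VALUENUM"] = to_valuenum(row["VALUE"])
    print(row)

# Made-up dictionary shaped like D_ICD_DIAGNOSES (ICD9_CODE -> title).
d_icd_diagnoses = {"4019": "Unspecified essential hypertension"}

# Made-up rows shaped like DIAGNOSES_ICD.
for diagnosis in [{"ICD9_CODE": "4019"}, {"ICD9_CODE": "0000"}]:
    title = resolve_variable(diagnosis["ICD9_CODE"], d_icd_diagnoses)
    print(diagnosis["ICD9_CODE"], "->", title)
```

A value that cannot be parsed as a number is kept as text with an empty VALUENUM, and a code missing from the dictionary resolves to None, which a validator could report as an invalid variable aCDE.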

Comparison with related studies
Standardizing clinical data using CDEs based on the ISO/IEC 11179 metadata standard is clearly one of the most effective ways to harmonize data collected from various clinical institutions and studies. The advantages of this approach are (1) providing a consistent data collection tool, (2) improving study data quality, and (3) reducing the cost of data entry and cleansing through uniform data. However, ISO/IEC 11179 does not provide a data structure for representing interrelationships among CDEs, which has resulted in a gap between the development of CDEs and their utilization in clinical forms. ISO/IEC 11179 does provide DDEs to address this limitation for interrelated CDEs. A DDE is a DE whose values are derived through a transformation of the values of one or more source CDEs. For example, the DDE 'length of stay in a hospital' counts the number of days between two independent input CDEs: 'Admission date' and 'Discharge date.' However, this strategy is far from sufficient to cover all the use cases of interrelated CDEs that we described in the Background section. Table 6 compares the DDE with our CDE semantic relationships. The value of a DDE is derived from its input DE(s). Our CDE semantic relationships provide rich semantics for creating atomic and composite CDEs that feature repeat and dictionary properties and support references to outside biomedical resources, as described in Table 6. The relatively simple concept of the DDE may be insufficient to cover the various CDE semantic relationships, since a DDE covers only two constraints: operated and ordered.
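The length-of-stay example can be written as a small function. This is a minimal sketch with an illustrative function name; it assumes that the stay is counted as the difference in calendar days (so a same-day discharge yields 0), which is one possible convention rather than one prescribed by ISO/IEC 11179.

```python
from datetime import date


def length_of_stay(admission: date, discharge: date) -> int:
    """Derive the length of stay in days from two source data elements."""
    # Ordered: the admission date must not come after the discharge date.
    if discharge < admission:
        raise ValueError("discharge date precedes admission date")
    # Operated: the derived value is computed from the two input values.
    return (discharge - admission).days


print(length_of_stay(date(2021, 3, 1), date(2021, 3, 8)))
```

The check and the subtraction correspond to the two constraints that a DDE covers, ordered and operated; nothing in this pattern expresses the required or dependent constraints or a reference to an external dictionary.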
There have also been efforts to address the issues of interrelated CDEs by applying external data models. The CDISC (Clinical Data Interchange Standards Consortium) ODM (Operational Data Model), an XML-based standardized data model that supports the acquisition and exchange of metadata specifically related to clinical studies, can also be used to overcome the limitations of ISO/IEC 11179. However, it is not sufficiently comprehensive to generate CRFs by importing elements directly [28,29]. Lin et al. also suggested using the openEHR approach for modeling CDEs [30]. Although this approach provides a comprehensive structure with two-level modeling, various studies have identified several limitations in implementing openEHR, such as the immaturity of archetype modification operations, insufficient support for hierarchical archetypes due to their granularity [31,32], and the cost burden of development and adoption due to the complexity of defining openEHR models. Therefore, instead of utilizing external data models, we propose improving and extending the existing composite relationship by specifying two subtypes of aCDE, three subtypes of cCDE, and four constraints, in order to take advantage of CDEs and related technologies.
The newly released version of HL7 FHIR provides the ElementDefinition type, which is the core of the FHIR metadata layer and is closely (conceptually) aligned to ISO/IEC 11179. It also includes mappings to other standards to help implementers and clinical researchers understand the content and use it correctly. However, the principles of the two standards were found to be totally different. FHIR does not differentiate between a CDE and a CDE value, and the FHIR specification is heavily type dependent. For instance, HL7 FHIR provides the Questionnaire and QuestionnaireResponse resources as one pair and the Appointment and AppointmentResponse resources as another. In addition, the FHIR specification includes constraints and other concerns that are outside the scope of ISO/IEC 11179. Thus, HL7 acknowledged that the connection between HL7 FHIR and ISO/IEC 11179 was still insufficient. The FHIR Infrastructure work group is reportedly considering rolling the DataElement resource into the StructureDefinition resource. If this is done, the DataElement resource will be treated as a type of logical model (whether there will be a distinct 'type' for it is unclear) [33].
Since the FHIR specification includes concepts for groups and constraints, these were matched with our proposed concept of the composite and with part of our constraints (ordered and operated). However, some of the semantic types and constraints that we have proposed are not provided by FHIR. We detailed whether each of our proposed semantic types and constraints was covered by FHIR. Because the FHIR Questionnaire is the only resource related to clinical forms or documents, we distinguished it from the other FHIR resources (Table 7).

Overcoming the challenges of understanding semantic relationships of form-level data
This paper has presented an in-depth evaluation of CDE semantic interrelationships based on the ISO/IEC 11179 MDR standard in the context of formalizing clinical document structures. For converting form-level data into DE-level data, two cCDEs (repeated and dictionary cCDEs) and their related constraints were developed, which provide the following benefits:
1) Repeated cCDEs support the management of clinical data presented in a tabular format in a clinical document. Since multiple value sets can be represented in a unified tabular format, a repeated cCDE is useful for managing sequential data entry in a tabular format and for analyzing how the values change over time. A repeated cCDE enables standard MDR-based CDE-level descriptions and evaluations of clinical data entry in a tabular format.
2) Dictionary cCDEs enable biomedical knowledge to be brought from a dictionary database via a variable aCDE. Data items referencing a certain standard terminology appear frequently on clinical forms. A dictionary cCDE can help to include rich semantics from externally managed biomedical terminologies and/or dictionaries, with rich attributes being applied for input data validation.
3) Four different types of constraints enable rich evaluations of input values. A prefix notation with functional logic programming can be applied for evaluating user-defined constraints in order to ensure contextual correctness and interrelationships among data items on clinical documents.
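To make the idea of evaluating constraints in prefix notation concrete, the sketch below evaluates nested expressions of the form (operator, operand, operand). The tuple syntax, the operator table, the `approx` tolerance, and the body mass index rule are all assumptions made for illustration; they do not reproduce the notation or the rules used in this study.

```python
import math
import operator
from typing import Any, Callable, Dict, Mapping

# Binary operators available to constraint expressions (illustrative set).
OPS: Dict[str, Callable[[Any, Any], Any]] = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "<=": operator.le,
    "approx": lambda a, b: math.isclose(a, b, rel_tol=1e-3),
}


def evaluate(expr: Any, values: Mapping[str, Any]) -> Any:
    """Evaluate a prefix expression against the values entered on a form."""
    if isinstance(expr, tuple):      # (operator, left, right)
        op, left, right = expr
        return OPS[op](evaluate(left, values), evaluate(right, values))
    if isinstance(expr, str):        # name of a CDE on the form
        return values[expr]
    return expr                      # literal value


# Operated constraint: bmi must equal weight / height^2 (within tolerance).
bmi_rule = ("approx", "bmi", ("/", "weight_kg", ("*", "height_m", "height_m")))
# Ordered constraint: the admission day must not follow the discharge day.
order_rule = ("<=", "admission_day", "discharge_day")

form = {"weight_kg": 70.0, "height_m": 1.75, "bmi": 22.86,
        "admission_day": 3, "discharge_day": 10}

print(evaluate(bmi_rule, form))
print(evaluate(order_rule, form))
```

Because a rule is plain data, it could be stored alongside the cCDE it constrains and evaluated whenever a form is filled in; a production evaluator would also need error handling for unknown operators and missing values.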

Advantages of using CDEs and CDE relationships for building clinical documents
According to the ISO/IEC 11179 MDR standard, the data element is the atomic unit of data and is associated with a data element concept (DEC, an abstract unit of knowledge for representing semantics) and a value domain (the representation of data, including the data type and permissible values). The DEC is the combination of an object class (a set of entities) and a property (a peculiarity common to all members of an object class). Because these two components of the DEC are matched to standard medical terminologies, the semantics of a CDE are strengthened, which is an advantage of using CDEs. Our proposed new semantic types and constraints comply with this part of the ISO/IEC 11179 standard. As verified in the evaluation part of this study, building clinical documents with CDEs provides three major advantages. First, it prevents the generation of redundant data by facilitating the reuse of CDEs that are predefined and registered in the MDR. Second, it ensures semantic data integrity, since an MDR-based CDE has comprehensive and standardized metadata attributes for data description and the proposed cCDE provides a means to encode rich constraints for inter-CDE relationships. The health data of a patient that are fragmented, dispersed, and duplicated in a variety of clinical documents across different medical centers should be integrated, and mapping data items to CDEs facilitates data integration and semantic interoperability across different clinical documents. Third, clinical data exchange and sharing can be greatly facilitated by this approach.
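The structure described above can be sketched with a few small classes. This is a simplified illustration rather than a full ISO/IEC 11179 model: the class and field names are our own, the 'Patient sex' element and its permissible values are hypothetical, and identifiers, versions, and terminology bindings are omitted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataElementConcept:
    object_class: str     # a set of entities, e.g. "Patient"
    property_name: str    # a peculiarity common to its members, e.g. "Sex"


@dataclass(frozen=True)
class ValueDomain:
    datatype: str
    permissible_values: Optional[Tuple[str, ...]] = None  # None = unrestricted

    def accepts(self, value: str) -> bool:
        """Check a value against the permissible values, if any are defined."""
        return self.permissible_values is None or value in self.permissible_values


@dataclass(frozen=True)
class DataElement:
    name: str
    concept: DataElementConcept
    value_domain: ValueDomain


def patient_sex() -> DataElement:
    """Build a hypothetical 'Patient sex' data element."""
    return DataElement(
        name="Patient sex",
        concept=DataElementConcept("Patient", "Sex"),
        value_domain=ValueDomain("code", ("male", "female", "other", "unknown")),
    )


# Two documents that define the same element collapse to one registry entry.
registry = {patient_sex(), patient_sex()}
print(len(registry))
print(patient_sex().value_domain.accepts("female"))
print(patient_sex().value_domain.accepts("F"))
```

Making the classes immutable and comparable by value is what lets identical definitions from different documents be recognized as one registered CDE, which mirrors the first advantage listed above.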

Limitations and future work
Real-life clinical documents provide reasonable examples of reality, but particular instances do not necessarily provide good representative examples. For instance, we found that the quality of data in the clinical documents depended on whether the clinicians who wrote them were sufficiently well trained in terminology representation to write correct and valid clinical documents. If the documents provide poor examples, then the outcome of the evaluation will also be poor. This problem is not limited to clinical documents; it also applies when a clinical researcher creates data in the FHIR model or a physician enters clinical data into an EHR system. Thus, we should measure the DQ, which is one of the aspects of interoperability that reveals the process of standardizing EHRs, to ensure that the selected clinical documents are good representatives for the evaluation.
We also found that one essential issue was whether our proposed semantic types and constraints ensure semantic consistency when standard biomedical terminologies are used. For data transfer and interoperability, it is important to examine how well our proposed semantic types and constraints correspond to the standard biomedical terminologies and how we can address terminology variations. Although the DEC part of ISO/IEC 11179 is matched to standard medical terminologies, this issue can occur when multiple standard biomedical vocabularies are used in complicated CDEs. A similar issue can occur when we utilize the dictionary cCDE, since it includes a biomedical vocabulary. For instance, the dictionary cCDE may encounter different 'versions' of a particular laboratory test with different time stamps, which could result in different normal ranges. In other words, even if we reference the same standard vocabulary for the dictionary cCDE, the result could be different. As future work, we will measure the DQ with respect to semantic consistency for the two issues mentioned above.
To measure DQ in future work, we will consider five dimensions of DQ: completeness, correctness, concordance, plausibility, and currency. The strategies used to assess these dimensions fall into seven DQ methods, including gold standard, data element agreement, element presence, data source agreement, distribution comparison, and validity check [34].

| Category | Type | FHIR Questionnaire | Other FHIR resources |
| --- | --- | --- | --- |
| aCDE | Hybrid | | Not applicable, there is no restriction on the datatype as it is represented in JSON and XML. |
| aCDE | Variable | Yes, it is supported by "coding". | Yes, it is supported by "coding". |
| cCDE | General | Yes, it is supported because FHIR follows a structured model. | Yes, it is supported because FHIR follows a structured model. |
| cCDE | Repeated | Yes, it is supported by "repeats". | Yes, it is supported because FHIR allows repeated representation of a group of items. |
| cCDE | Dictionary | Not applicable, it does not support any value-related rule. | Not applicable, it does not support any value-related rule. |
| Constraint | Operated | Allows only logical operations. | Only resources that have the "operator" are supported (e.g., the Observation resource). |
| Constraint | Required | Yes, it is supported by "required". | Yes, it is supported by "required". |
| Constraint | Dependent | Not applicable, it does not support any value-related rule. | Not applicable, it does not support any value-related rule. |
| Constraint | Ordered | Although not explicit, it is included in the structure. | Only resources that have "sequences" are supported (e.g., the Claim resource). |

Conclusion
The sharing and understanding of data from multiple different domains can be facilitated by standardization. An MDR-based CDE is considered a type of standardized data with specified concept and value domains. However, ISO/IEC 11179 MDR-based CDEs do not provide the ability to describe constraints on a CDE or relationships among different CDEs; they focus merely on single independent CDEs, which makes it difficult to correctly compose or interpret CDEs on clinical documents. We developed MDR-based extended semantic types and constraints, which can facilitate comprehensive representation of clinical documents with rich semantics and improved semantic interoperability.