ORIGINAL RESEARCH

Developing an artificial intelligence-based system for medical prediction

Sakhibgareeva MV, Zaozersky AYu
About authors

COMTEK LLC, Ufa, Russia

Correspondence should be addressed: Margarita Sakhibgareeva
ul. Bekhtereva 16, kv. 48, Ufa, Russia, 450047; moc.liamg@1102lv.atiragram

About paper

Contribution of the authors to this work: Sakhibgareeva MV — data processing and analysis, analysis of literature, research, drafting of a manuscript; Zaozersky AYu — research planning, data interpretation, drafting of a manuscript.

Received: 2017-11-23 Accepted: 2017-12-13 Published online: 2018-01-23

Development of information technologies aimed at facilitating the efficient delivery of medical care is one of the priority goals set for the Russian healthcare system. Increasing effort is being made to improve the quality of healthcare through the use of information systems, to expedite the transition from paper files to electronic medical records, and to employ data mining for the analysis of huge arrays of medical data [1, 2].

Collection of medical data still presents a problem, as noted in a number of works [3, 4]; this seriously impedes the digitization needed for machine learning and delays the development of analytical software. Our close collaboration with the Siberian Center for Information Protection and the deployment of the original Zdravookhranenie software in a few regional medical centers allowed us to build a vast database of medical records and obtain authorization to process these data. It was a perfect opportunity to perform data mining using machine learning techniques.

The use of diagnostic information systems in clinical practice can be very beneficial for patients. A high workload or a lack of expertise affects the clinical decisions doctors make. Besides, accurate diagnosis, prediction of disease progression and treatment planning all depend on taking the full set of information about the patient into account; without it, clinical decisions are mere approximations [5].

According to A. Chuchalin’s report presented at the Second National Congress of GPs, every third case in Russia is misdiagnosed [6]. Likewise, we have discovered a considerable number of diagnostic errors while analyzing the records of a few healthcare facilities that use our software. In the course of our analysis, we calculated the discrepancies between the definitive and preliminary diagnoses. Results are presented in tab. 1 which features distribution of erroneous diagnoses across different departments of healthcare facilities and tab. 2 showing the percentage of erroneous diagnoses in different nosological categories. Names of the healthcare facilities are not provided in this article for ethical reasons.

Wrong preliminary diagnoses harm not only patients, who receive useless treatments, but also medical clinics, which incur considerable expenses: the Fund of Compulsory Health Insurance only subsidizes treatments based on a definitive diagnosis.

In view of this, we decided that prediction of nosological diagnosis should be a priority task in the development of an artificial intelligence-based system. The aim of this work was to test the feasibility of medical data mining using machine learning, to assess prediction accuracy that makes a machine learning model useful, and to enhance our Zdravookhranenie platform.

METHODS

Initial dataset

Medical decisions can be based on a medical history, physical examinations, and the results of laboratory or complex functional tests. Lab tests provide the most objective information about a patient’s condition and are often used when other methods have failed to identify or confirm a pathology. These tests are especially useful in patients with anemia, lipidemia, hepatitis, seropositive rheumatoid arthritis, etc.

The source dataset consisted of disease cases with established definitive diagnoses. The feature space included patients’ sex and age and the results of laboratory tests obtained from the data of prophylactic medical examination. The data were collected using our Zdravookhranenie software solution [7]. We chose 4 nosologies for the analysis, including D50, E11, E74, and E78, that can be suspected and diagnosed based on laboratory tests. The initial dataset was as follows:

  • iron deficiency anemia (D50) — 778 cases (10 %);
  • non-insulin-dependent diabetes mellitus (E11) — 1,392 cases (17 %);
  • other disorders of carbohydrate metabolism (E74) — 163 cases (2 %);
  • disorders of lipoprotein metabolism and other lipidemias (E78) — 5,585 cases (71 %).

In total, the dataset included 7,918 cases with the results of 200 laboratory tests (blood and urine tests, cytologic examinations, etc.). The cases occurred between 2005 and 2017 in patients aged 18 to 99 years, of whom 71 % were females and 29 % were males. In some cases, the results of laboratory tests were recorded as “normal”, “below the norm” or “above the norm”.
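The class sizes listed above can be checked with a few lines of Python, and the categorical test results need some numeric encoding before they can be fed to a model. The sketch below is only an illustration: the class counts come from the text, while the ordinal mapping (-1, 0, 1) and the function names are assumptions, since the article does not state how such results were encoded.

```python
# Class sizes as reported in the text above (ICD-10 code -> number of cases).
CASES = {"D50": 778, "E11": 1392, "E74": 163, "E78": 5585}


def class_shares(cases):
    """Return the fraction of the whole dataset that each class makes up."""
    total = sum(cases.values())
    return {code: n / total for code, n in cases.items()}


# Hypothetical ordinal encoding of categorical lab results (an assumption,
# not the authors' documented preprocessing).
ORDINAL = {"below the norm": -1, "normal": 0, "above the norm": 1}


def encode_category(value):
    """Map a categorical lab result to -1/0/1; return None if it is not one."""
    if value is None:
        return None
    return ORDINAL.get(value.strip().lower())


shares = class_shares(CASES)
total_cases = sum(CASES.values())
```

The shares make the imbalance explicit: the largest class (E78) is roughly 34 times the size of the smallest one (E74), which is why the choice of performance metric matters later on.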

Choosing a method of machine learning and performance metrics

Prediction of diagnosis based on the results of laboratory tests is a multiclass classification problem.

The data were analyzed using Scikit-learn [8], a Python-based open-source library for machine learning. We carried out a few preliminary tests involving such methods of machine learning as neural networks, decision trees, and gradient boosting. The last one showed the best results for our problem. It is a technique in which an ensemble of predictors is built sequentially, with every subsequent predictor compensating for the mistakes of the previous ones [9]. Gradient boosting over decision trees is widely considered one of the most effective general-purpose methods of machine learning. Decision trees also perform very well in classification tasks.

Considering the specifics of the problem and the fact that the initial dataset was imbalanced, we selected performance metrics with special care. The metrics are described below in terms of a confusion matrix [9, 10] with respect to multiclass classification using the one-against-all approach. This approach reduces a multiclass classification problem to a set of binary tasks in which a picked class is labeled as 1 and the remaining classes are labeled as 0. For every picked class i the following parameters are determined:

  • TP (True Positive) — the number of instances correctly assigned to class i;
  • TN (True Negative) — the number of instances correctly not assigned to class i and therefore assigned to some class j ≠ i;
  • FP (False Positive) — the number of instances incorrectly assigned to class i;
  • FN (False Negative) — the number of instances incorrectly assigned to a class j ≠ i that should have been assigned to class i.
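The four counts defined above can be obtained directly from lists of true and predicted labels. The following is a minimal sketch in plain Python; the function name and the toy labels are invented for illustration and are not taken from the study.

```python
def one_vs_all_counts(y_true, y_pred, cls):
    """Count TP, TN, FP, FN with `cls` treated as 1 and every other class as 0."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == cls and actual == cls:
            tp += 1  # correctly assigned to cls
        elif predicted == cls:
            fp += 1  # assigned to cls, but belongs to another class
        elif actual == cls:
            fn += 1  # belongs to cls, but assigned to another class
        else:
            tn += 1  # neither belongs to cls nor was assigned to it
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Toy labels, invented for illustration only.
y_true = ["D50", "E11", "E78", "E78", "E74", "E78"]
y_pred = ["D50", "E78", "E78", "E78", "E78", "E11"]
counts = one_vs_all_counts(y_true, y_pred, "E78")
```

Repeating the call for each of the four classes gives one binary confusion matrix per class, which is what the metrics below are computed from.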

Accuracy is the most intuitive performance metric, showing the fraction of correct responses; however, it is not suitable for imbalanced datasets (form. 1): $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.

Therefore, other metrics are often used instead, including:

  • precision — the fraction of true positive instances among all predicted positives. In other words, it shows how many positive predictions were really positive (form. 2): $\mathrm{Precision} = \frac{TP}{TP + FP}$;
  • recall — the fraction of true positive instances among all actual positives, i.e. true positives and false negatives. It is also known as the true positive rate (TPR) (form. 3): $\mathrm{Recall} = \frac{TP}{TP + FN}$.

Recall is used to evaluate the performance of a machine learning model when there is a need to reduce the number of false negatives (FN) and capture all positives [10]. This metric is preferred for medical diagnostic tasks, where it is important not to miss a diagnosis. Although it is quite intuitive, it is not always good for imbalanced datasets.
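A short calculation shows why accuracy can mislead on this dataset. The sketch below uses the class sizes reported in the Methods section and a hypothetical baseline that assigns every case to E78; the baseline and the function names are assumptions made for illustration, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Convention chosen here: 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Convention chosen here: 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Class sizes as reported in the text.
CASES = {"D50": 778, "E11": 1392, "E74": 163, "E78": 5585}
total = sum(CASES.values())

# Hypothetical baseline: every case is labeled E78.
# One-against-all counts for D50 under that baseline:
tp, fp = 0, 0                # nothing is ever assigned to D50
fn = CASES["D50"]            # every D50 case is missed
tn = total - CASES["D50"]    # all other cases are correctly not assigned to D50

d50_accuracy = accuracy(tp, tn, fp, fn)  # about 0.90
d50_precision = precision(tp, fp)        # 0.0 by the convention above
d50_recall = recall(tp, fn)              # 0.0: every D50 case is missed
```

The baseline reaches about 90 % one-against-all accuracy for D50 while finding none of the anemia cases, which recall exposes immediately.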

Another metric used in our study was ROC AUC, recommended in [10] for the evaluation of model performance on imbalanced datasets. AUC stands for area under the [ROC] curve, and ROC for receiver operating characteristic. The curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold varies, and it runs from (0, 0) to (1, 1): form. 4 .

The higher the ROC AUC value, the better the performance of the classifier. ROC AUC of 0.5 means the classifier makes random guesses. ROC AUC below 0.5 means that the classifier does the opposite of what is expected of it: if its positive and negative predictions were swapped, it would perform better than random guessing.
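The values 1, 0.5 and below 0.5 can be illustrated with a small computation. The area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way for a single one-against-all task; the function name and the toy scores are illustrative assumptions, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """ROC AUC for binary labels (1/0) via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])    # every positive outranks every negative
inverted = roc_auc(labels, [0.9, 0.8, 0.2, 0.1])   # the ranking is exactly reversed
constant = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])   # scores carry no information
```

For a perfect ranking the function returns 1.0, for a fully reversed ranking 0.0, and for uninformative scores 0.5, matching the interpretation given above.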

Considering the above, we used ROC AUC as the primary metric, but also accounted for recall.

RESULTS

The diagnoses and the results of laboratory tests were divided into two sets: the training set (75 % of cases) and the test set (25 % of cases). The model was built for 4 nosological categories (D50, E11, E74, E78) using gradient boosting. For the test set ROC AUC was above 89 % (tab. 3). Mean certainty in correct diagnoses included in a test sample was 92 %.
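The workflow described above (a 75/25 split, gradient boosting, per-class ROC AUC and recall) can be sketched with Scikit-learn as follows. This is not the authors' code: the data are synthetic, generated with `make_classification` to roughly mimic the reported class imbalance, and the stratified split, the hyperparameters, the random seed and the function name `run_sketch` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def run_sketch(seed=0):
    # Synthetic stand-in for the lab-test features; 4 imbalanced classes.
    X, y = make_classification(
        n_samples=1000, n_features=10, n_informative=5, n_classes=4,
        n_clusters_per_class=1, weights=[0.10, 0.17, 0.02, 0.71],
        random_state=seed,
    )
    # 75 % training set, 25 % test set; stratification is an assumption here.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)

    proba = model.predict_proba(X_test)   # one column per class
    pred = model.predict(X_test)
    classes = list(model.classes_)

    # One-against-all ROC AUC: class k is 1, all other classes are 0.
    auc = {
        int(k): roc_auc_score((y_test == k).astype(int), proba[:, i])
        for i, k in enumerate(classes)
    }
    rec = recall_score(y_test, pred, labels=classes, average=None)
    return {"auc": auc, "recall": dict(zip(map(int, classes), rec)),
            "n_test": len(y_test)}


result = run_sketch()
```

The numbers this sketch produces describe the synthetic data only and say nothing about the values reported in tab. 3.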

DISCUSSION

High ROC AUC values falling between 89 % and 98 % indicate that our model is feasible for the prediction of the studied diagnoses. Importantly, our dataset consisted of various data types, including the results of 200 different laboratory tests and such parameters as patients’ sex and age. Another strength of the study is the use of a sufficiently large dataset accumulated over the course of a few years. For example, in [11] the analysis was carried out on data collected over a period of just 3 months in a Boston hospital. The authors of that study attempted to predict ferritin blood levels. They also used ROC AUC as a quality metric, which turned out to be as high as 97 %. However, it should be noted that according to a number of research works [12, 13, 14] a focus on nosological categories may increase prediction accuracy. According to [15, 16], performance can be enhanced through the use of different methods for medical data preprocessing.

CONCLUSIONS

Our study has proved the feasibility of machine learning techniques for the analysis of our medical records. Currently, we are incorporating this model into our Zdravookhranenie software. We are working on a web service which will accumulate and analyze the results of all laboratory tests specified in a patient’s medical history. The web service will report the results of the analysis to the Zdravookhranenie platform and return the most probable diagnoses, which a doctor may take into account when prescribing a treatment regimen.

We are planning to include more nosologies in our model and improve its quality by designing separate models for each diagnosis. These models will account for the laboratory tests that affect the prediction outcome the most. Thus, we will be able to start developing a tool that can recommend the most relevant lab tests for the diagnosis of a particular condition.

We hope that our work will expedite the transition to personalized medicine [17, 18] based on the analysis of each patient’s unique medical records, not limited to the results of laboratory tests. This task can be solved by using artificial intelligence for diagnostic prediction and for generating personalized treatment recommendations. This will help to reduce the number of medical errors and increase the clinical significance of preventive measures through monitoring of the patient’s records.
