A predictive model for COVID-19 outcomes

As of March 9, 2022, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of the coronavirus disease 2019 (COVID-19) pandemic, has infected more than 446 million people and caused over 6 million deaths worldwide, with an estimated mortality rate of 1.5%.

Study: A predictive model of COVID-19 hospitalization and survival in a population-based retrospective study. Image Credit: Cryptographer / Shutterstock.com


COVID-19 vaccines have been shown to reduce hospitalization and death rates. Periods of extended SARS-CoV-2 transmission during the ongoing pandemic, also known as “waves,” have strained hospital resources. This strain is largely due to the unprecedented number of COVID-19 cases requiring intensive care, which has exceeded the capacity of intensive care units (ICUs).

During these peaks in SARS-CoV-2 transmission, death rates were particularly high due to overburdened healthcare systems. Thus, going forward, rapid risk stratification, planned clinical management, and optimized use of resources are essential in managing the COVID-19 pandemic.

Electronic health records (EHRs) serve as comprehensive guides for the accurate triage and consistent management of COVID-19 patients. Artificial intelligence has been used to predict the prognosis of patients infected with SARS-CoV-2 using health and demographic data collected from health systems.

However, in most cases, these data are imbalanced, as the proportion of patients with severe episodes of the disease remains low. Therefore, standard supervised machine learning may not produce a balanced model for predicting COVID-19 outcomes.
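
To see why a low proportion of severe cases is a problem, consider the following minimal Python sketch. The 940/60 class split and the trivial "always predict the majority class" rule are illustrative assumptions, not figures or methods from the study.

```python
# Illustrative labels: 0 = mild course (majority), 1 = severe course (minority).
# The 940/60 split is an assumed example, not data from the study.
y_true = [0] * 940 + [1] * 60

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Plain accuracy: share of all patients labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Sensitivity: share of severe cases the model actually flags.
severe_total = sum(t == 1 for t in y_true)
severe_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
sensitivity = severe_found / severe_total

print(f"accuracy = {accuracy:.2f}, sensitivity = {sensitivity:.2f}")
# prints "accuracy = 0.94, sensitivity = 0.00"
```

The rule looks accurate overall yet identifies none of the severe cases, which are exactly the patients a triage model needs to find.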

About the study

A recent study published on the medRxiv* preprint server presents a new technique to address class imbalance problems for the effective management of patients infected with SARS-CoV-2 according to comorbidities, age, and sex, based on data available from a regional health system in Spain.

Additionally, this technique involved using machine learning to develop models to determine whether a newly diagnosed COVID-19 patient would require hospitalization and predict their prognosis.

As the data were highly imbalanced, with far fewer deceased patients and inpatients than survivors and outpatients, a new ensemble-based, imbalance-sensitive machine learning method called Identical Partitions for Imbalance Problems (IPIP) was proposed. For each question of interest, two IPIP models were created and evaluated with fivefold cross-validation.
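
The general idea behind imbalance-sensitive ensembles can be sketched as follows. This is a generic balanced-partition ensemble with majority voting, not the authors' IPIP algorithm, whose details are not given here. The function names, the single-feature threshold learner, and the synthetic data are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import random

def balanced_partitions(majority, minority, n_parts, seed=0):
    """Split the majority class into n_parts disjoint chunks and pair each
    chunk with the full minority class, so every partition is balanced."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    chunks = [shuffled[i::n_parts] for i in range(n_parts)]
    return [(chunk, minority) for chunk in chunks]

def fit_stump(neg, pos):
    """Hypothetical base learner: a threshold halfway between the class means
    of one numeric feature. Assumes the positive class tends to score higher."""
    mean_neg = sum(neg) / len(neg)
    mean_pos = sum(pos) / len(pos)
    return (mean_neg + mean_pos) / 2

def predict_vote(thresholds, x):
    """Majority vote across the per-partition threshold learners."""
    votes = sum(x > t for t in thresholds)
    return int(votes * 2 > len(thresholds))

# Synthetic one-feature data (assumed for illustration, not from the study).
rng = random.Random(42)
majority = [rng.gauss(40, 12) for _ in range(900)]  # majority class, e.g. survivors
minority = [rng.gauss(80, 8) for _ in range(60)]    # minority class, e.g. deceased

n_parts = len(majority) // len(minority)  # 15 partitions of 60 vs. 60
parts = balanced_partitions(majority, minority, n_parts)
thresholds = [fit_stump(neg, pos) for neg, pos in parts]

print(len(parts), len(parts[0][0]), len(parts[0][1]))
print(predict_vote(thresholds, 85), predict_vote(thresholds, 30))
```

Each base learner sees a balanced subset, so none of them can score well simply by favoring the majority class, while the ensemble as a whole still uses every majority-class record once.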

Classification of COVID-19 patient subtypes

The present study was conducted between January 4, 2020 and February 4, 2021 and included patients diagnosed with COVID-19, as confirmed by a positive antigen test or reverse transcriptase polymerase chain reaction (RT-PCR) test of pharyngeal or nasal swab specimens.

An exploratory analysis of 86,867 SARS-CoV-2-positive patients showed that 93.7% were outpatients, 5.4% were hospitalized outside of intensive care, and 0.85% were admitted to intensive care. The most common symptoms were cough in 49.9% of cases, headache in 38.3%, and myalgia in 36%.

The participants were classified into three types. The outpatient prototype was a 38-year-old woman with two affected systems and two chronic conditions, with common comorbidities such as high blood pressure, obesity, asthma, and depression. In comparison, the typical non-ICU hospitalized patient was a 62-year-old male with four affected systems and five chronic pathologies, with more frequent comorbidities.

The prototype ICU patient was a 62-year-old man with three affected systems and five chronic conditions, with the most common comorbidities including high blood pressure, diabetes mellitus, obesity and osteoarthritis. Patients in intensive care had a mortality rate twice as high as hospitalized patients not in intensive care.

Further analyses were conducted to differentiate the survivors from the deceased. The prototypical survivor was a 39-year-old female with two affected systems and two chronic pathologies, with comorbidities similar to those of outpatients.

By comparison, the deceased prototype was an 83-year-old man with five affected systems and eight chronic pathologies, of which the most frequent pathology was arterial hypertension (75.64%). Additional comorbidities for this type of patient included diabetes mellitus, depression, osteoarthritis, and obesity.

Three variables, namely age, number of comorbidities, and number of affected systems, were relevant to a patient’s end state. Increasing age, number of comorbidities, and number of affected organ systems each increased the likelihood of death. A similar relationship could be inferred between outpatients, ICU patients, and non-ICU inpatients.

Males had a higher mortality rate than females. Higher-risk comorbidities included kidney failure, heart failure, stroke, dementia, and ischemic cardiomyopathy. Deaths related to COVID-19 could not be correlated with the presence of asthma, osteoporosis, or osteoarthritis.

The accuracy of machine learning models

Several machine learning models were generated to predict patients’ need for hospitalization and their final condition. To deal with the imbalanced data, two machine learning algorithms, logistic regression and random forest, were evaluated with and without IPIP.

The model using logistic regression with IPIP (LR-IPIP) provided the best result, as measured by balanced accuracy, for predicting a patient’s end state. The ROC-AUC obtained by this model on the unbalanced dataset was 0.937.
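
The two metrics reported here can be computed as in the sketch below. The helper functions are written out in plain Python for illustration, and the labels and scores are toy values assumed for the example, not study data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    pos = sum(t == 1 for t in y_true)
    neg = len(y_true) - pos
    return (tp / pos + tn / neg) / 2

def roc_auc(y_true, scores):
    """Probability that a randomly chosen positive case scores higher than a
    randomly chosen negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with assumed labels and model scores (not study data).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.5, 0.9]
y_pred = [int(s >= 0.5) for s in scores]  # classify at a 0.5 cut-off

print(f"balanced accuracy = {balanced_accuracy(y_true, y_pred):.3f}")
print(f"ROC-AUC = {roc_auc(y_true, scores):.3f}")
```

Balanced accuracy weights both classes equally regardless of their size, and ROC-AUC summarizes ranking quality across all possible cut-offs, which is why both are preferred over plain accuracy when one class is rare.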

The most important determinants of the end state of patients using the LR-IPIP model included age, sex, obesity, osteoarthritis, and number of affected systems.

A training dataset was used to develop a hospitalization need assessment model and a test dataset was used for its evaluation. The LR-IPIP model gave the best results.

Using the LR-IPIP model, the need for hospitalization was predicted with a balanced accuracy of 0.72 for the balanced dataset and between 0.71 and 0.73 for the unbalanced datasets. The ROC-AUC obtained by this model on the unbalanced dataset was 0.746. Age, sex, kidney failure, depression, and number of chronic diseases were the relevant features identified by the LR-IPIP model.

Importance of features for final models. NSA is the number of affected systems and NCD is the number of chronic diseases.


The current study developed and analyzed machine learning-based models that could predict the end state of patients infected with SARS-CoV-2 with high accuracy, as well as assess the need for hospitalization of patients with reasonable accuracy. Additionally, the class imbalance was resolved by developing a new algorithm called IPIP.

The proposed LR-IPIP model could be used to help manage COVID-19 patients when access to healthcare resources is limited. The predictive models, along with corresponding web apps, are accessible on GitHub for use in future waves of COVID-19 or other viral respiratory illnesses.

*Important Notice

medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behaviors, or treated as established information.

About Hector Hedgepeth
