Extracting, Cleaning and Analysing eICU and MIMIC-IV data using Huawei Cloud, PostgreSQL and Python

Ziyu
6 min read · Jun 19, 2021

Introduction

After much delay, I finally managed to allocate some time to write about my experience in the annual Healthcare AI Datathon (11-13 Dec 2020) organised by the NUS Saw Swee Hock School of Public Health (SSHSPH). Through this datathon, I had the privilege to work with multiple clinicians, medical students, researchers, and data professionals from all around the world.

The scope of this article focuses mainly on the two databases we based our project on, MIMIC-IV and eICU, as well as the Extract, Transform, Load (ETL) / wrangling process. There is also a short segment on our team's strategy for making efficient use of the 48 hours that we had. You can find some of the code that my teammates and I wrote over at my GitHub: https://github.com/hello-bob!

Our problem statement: What is the optimal blood pressure control within first 6h of ICU Admission for acute ischaemic stroke patients?

Resources

In this datathon, we were given access to a buffet of databases, many of which stored images. We stuck with tabular data from MIMIC-IV and eICU as they suited our needs. These databases are consolidated on PhysioNet, and if you want to access the datasets there for your own personal projects, you will have to apply for Credentialed Access. You will also need some form of reference.

A brief description of the databases, as per PhysioNet:

  • eICU Collaborative Research Database: A multi-centre database comprising deidentified health data associated with over 200,000 admissions to ICUs across the United States between 2014 and 2015.
  • MIMIC-IV: Database of patients admitted to the Beth Israel Deaconess Medical Center. This contains Electronic Health Records (EHR) and ICU data.

These were some of the variables we consolidated from both databases, with blood pressure (B.P.) as our target variable. We attempted modelling using a few different variations of B.P., e.g. mean arterial B.P., lowest/mean systolic B.P., etc. Both databases are really rich, and this just scratches the surface of what they have. So definitely do check out the schemas and tutorials they provide if you're interested!

Figure 1. Variables extracted from both MIMIC-IV and eICU. Low-density Lipoprotein (LDL), Glasgow Coma Scale (GCS), Hypertension (HTN), Ischemic Heart Disease (IHD), Peripheral Vascular Disease (PVD), Atrial Fibrillation (AF), Transient Ischaemic Attack (TIA), Hyperlipidemia (HLD), Coronary Artery Disease (CAD).
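As a rough illustration of how these B.P. variations can be derived (a sketch with made-up column names, not our actual pipeline), mean arterial pressure (MAP) can be approximated from systolic and diastolic readings and then aggregated per ICU stay:

import pandas as pd

# Hypothetical charted B.P. readings within the first 6h of an ICU stay;
# the column names here are illustrative, not the actual MIMIC-IV/eICU fields.
bp = pd.DataFrame({
    "stay_id": [1, 1, 2, 2],
    "sbp": [150.0, 142.0, 168.0, 160.0],  # systolic B.P. (mmHg)
    "dbp": [85.0, 80.0, 95.0, 90.0],      # diastolic B.P. (mmHg)
})

# Common approximation: MAP = DBP + (SBP - DBP) / 3
bp["map"] = bp["dbp"] + (bp["sbp"] - bp["dbp"]) / 3

# Candidate target variables per ICU stay: mean MAP, lowest and mean systolic B.P.
targets = bp.groupby("stay_id").agg(
    mean_map=("map", "mean"),
    lowest_sbp=("sbp", "min"),
    mean_sbp=("sbp", "mean"),
)
print(targets)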

With respect to the data, something I did not mentally prepare myself for was the stark difference in structure between the two databases. It was difficult to wrap my head around where to find which data, especially on little sleep. It gets especially confusing when you have to switch between the two databases, so preparing beforehand to familiarise ourselves with them would have helped.

Important nugget gained: Every database has its strengths and weaknesses. Always be alert to where the ice grows thin; there can be exceptions at the table level, in the relations across tables, or even at the column level.

Method for ETL

Firstly, I have to raise a disclaimer that this was the first time I used cloud technology. If I get any concepts wrong, do let me know in the comments section below!

This is a rough schematic of what the ETL process was like (at least for me):

  1. Each team was allocated a powerful Elastic Cloud Server (ECS) on Huawei Cloud, with both remote CPU and GPU resources. The ECS had a conda environment set up within it, and we could use JupyterLab to code collaboratively.
  2. Both databases were stored on Huawei's Relational Database Service (RDS), which we queried with PostgreSQL. The inbuilt Data Admin Service allowed the data team to run and store scripts, keeping our team on the same page and with access to the same data. A very nifty feature!
  3. As the platform faced high latency at times, I drafted queries in the remote JupyterLab on the ECS. SQLAlchemy and psycopg2 were used to connect to the RDS, and pd.read_sql() loaded the results into a pandas dataframe for some basic wrangling and, more importantly, to check that the query worked.
# Sample: connecting to the MIMIC-IV database on the RDS server
import sqlalchemy

conn_mimiciv = sqlalchemy.create_engine('postgresql://team12:notapassword@119.8.167.24/mimiciv')
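From there, a draft query could be pulled straight into pandas to sanity-check it before handing it over. A minimal sketch of that step (the mimiciv_icu.icustays schema/table name is an assumption about this particular install; adjust to your own setup):

import pandas as pd

# Draft a query, run it through the engine, and eyeball the result.
query = """
SELECT stay_id, intime, outtime
FROM mimiciv_icu.icustays
LIMIT 10;
"""

df = pd.read_sql(query, conn_mimiciv)  # quick check that the query actually works
print(df.head())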

Team strategy

Our team of 9 first scoped our research question and the inclusion / exclusion criteria. Given the team’s makeup of technical and domain experts, we split into 2 teams, one for each database.

Each team had a balanced ratio of clinicians to data wranglers. The clinicians would conduct research and determine what data was important for answering the research question. They also played a pivotal role in compiling the lists of codes (e.g. DRG, ICD-9/10, drugs), so the data people could perform the extraction smoothly, and in a timely fashion at that!
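For instance, pulling out acute ischaemic stroke admissions largely came down to filtering the diagnosis tables against the clinician-curated code list. A minimal sketch in that spirit (the mimiciv_hosp.diagnoses_icd table name and the ICD code prefixes shown are assumptions for illustration, not our exact list):

import pandas as pd

# Sketch: select admissions whose diagnoses match a curated ICD code list.
# ICD-10 'I63' (cerebral infarction) and ICD-9 '433'/'434' prefixes are illustrative only.
stroke_query = """
SELECT DISTINCT subject_id, hadm_id
FROM mimiciv_hosp.diagnoses_icd
WHERE (icd_version = 10 AND icd_code LIKE 'I63%')
   OR (icd_version = 9  AND (icd_code LIKE '433%' OR icd_code LIKE '434%'));
"""

stroke_admissions = pd.read_sql(stroke_query, conn_mimiciv)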

Tailoring the data to our use-case was an eye-opening experience, and nowhere near easy, given that the heavy SQL-lifting had to be done concurrently with discussions between the data and clinical teams. Multi-tasking abilities were definitely tested that day… Do drop a comment below if you think there could have been a better strategy.

A bit on Pandas-Profiling

During this datathon, I also wanted to experiment a bit with the auto-EDA tools available out there. Pandas-Profiling happened to be the one I liked most at that point in time, allowing very comprehensive deep dives. I used this package after producing our final dataset. Below are snapshots of the HTML report it produces, which took only ~10 lines of code to generate. The best thing about it is that it's so portable, you can literally do your EDA on the go.

Figure 2. Pandas Profiling at work for quick, dirty and interactive EDA
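For reference, generating that report really is short. A minimal sketch, assuming the consolidated dataset is sitting in a dataframe called df and using the package under its then-current name pandas_profiling:

from pandas_profiling import ProfileReport

# 'df' stands in for our consolidated dataset; any pandas dataframe works.
profile = ProfileReport(df, title="Stroke cohort EDA", minimal=True)  # minimal=True keeps it snappy on larger data
profile.to_file("eda_report.html")  # a single, portable HTML file you can open anywhere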

Lessons learnt

  1. Effective communication is essential. Every discussion between the clinical and data teams was so, so precious, given the time pressure.
  2. The question and direction can change quickly during those 48 hours. Something I want to work on for the next datathon is to keep my head above water, instead of getting too buried in the code.
  3. Each database has its quirks. It takes a lot of time and effort to understand one database well, which was probably impossible in the given timeframe. Prep-work, whenever possible, is recommended!
  4. Conducting efficient queries given the size and scale of the data was important too. Some datasets in eICU had data at a resolution of 5-second intervals. Thankfully, we had the magic of powerful cloud computing (yay Huawei); a sketch of the kind of server-side aggregation that helps here follows after this list.
  5. It might have been better to focus on one database, since the population captured in each database is quite different (single centre vs multi-centre). Splitting our attention turned out to be very resource-costly, especially with limited data manpower.
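On point 4, one way to keep queries over high-resolution vitals manageable is to let the database aggregate before anything reaches pandas. A minimal sketch under assumed names (the eicu_crd.vitalperiodic table and its columns are taken from the public eICU schema as I recall it, and conn_eicu is an engine for the eICU database, analogous to conn_mimiciv):

import pandas as pd

# Bucket high-frequency vitals into 15-minute windows on the PostgreSQL side.
vitals_query = """
SELECT patientunitstayid,
       FLOOR(observationoffset / 15.0) * 15 AS window_start_min,
       AVG(systemicsystolic) AS mean_sbp,
       MIN(systemicsystolic) AS lowest_sbp,
       AVG(systemicmean)     AS mean_map
FROM eicu_crd.vitalperiodic
WHERE observationoffset BETWEEN 0 AND 360  -- first 6h after unit admission
GROUP BY patientunitstayid, window_start_min;
"""

vitals_6h = pd.read_sql(vitals_query, conn_eicu)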

Closing Statement

It was truly a memorable 48 hours of intense learning and data crunching. Looking back, I think the experience was extremely valuable and humbling. There is still so much more I do not know, and I am already excited to learn more in the next Healthcare AI Datathon, which I think is on 3 Dec 2021!

Thank you to SSHSPH for hosting the Healthcare AI Datathon, and to the panel of coaches for providing us a wonderful space to learn about healthcare data. I would also like to extend thanks to the clinicians on my team for taking precious time off to work on this competition. Lastly, I am grateful to the data scientists/analysts for guiding me along this project and not letting me drown in my own scripts!

Team mates: Emma Toh (PI), Timothy, Wang Han, Dr Bryce, Dr Martin, Dr Dorin, Dr Amber, Dr Toshikazu
