
inacio-luma-patient-recommender

v0.9.0

Published

This library recommends a list of patients most likely to pick up a call from the hospital, based on the hospital's location and other optional parameters.

Downloads

4

Readme

Introduction

This is my submission to the Luma Health backend interview assignment. It is:

  • a simple library you can import and use to create a list of top-priority patients for a hospital call;
  • a RESTful API that, given a hospital's coordinates (latitude, longitude), returns a waitlist of the patients most likely to pick up a call from the hospital.

Table of contents

  1. Quickstart
  2. Endpoints
  3. Implementation

Quickstart

There are three ways to play with this submission:

  1. The REST API is deployed on AWS as an ECS instance; feel free to access it here: http://luma.inacio.codes/docs
  2. Or you can install and run the API locally with:

     npm i
     npm run dev

  3. Finally, you can also clone the repo and import the library with:

     git clone https://github.com/inacioMattos/luma-interview.git
     cd luma-interview
     npm i
     npm run dev

Endpoints

GET /docs

The Swagger endpoint.

GET /health

A healthcheck route. Health checks are worth having for three reasons:

  1. Monitoring and Availability: Health checks allow monitoring systems to verify that an API is up and running correctly. This helps in ensuring high availability of the service, as it allows for quick detection and response to issues.
  2. Load Balancing: In environments with multiple instances of a service, load balancers use health checks to decide which instances are capable of handling requests. Instances that fail health checks can be automatically removed from the pool, ensuring that traffic is only directed to healthy instances.
  3. Auto-scaling and Orchestration: In cloud environments, health checks are crucial for auto-scaling and orchestration. Systems like Kubernetes and other orchestration tools rely on health check endpoints to manage the lifecycle of containers and services, such as scaling up or down and performing rolling updates without downtime.

GET /patients/recommend

This endpoint recommends a list of patients most likely to pick up a call from the hospital, based on the hospital's location and other optional parameters.

Query Parameters:

  • lat (required): Hospital's latitude

    • Type: number
    • Range: -90 to 90
    • Description: The latitude coordinate of the hospital's location
  • long (required): Hospital's longitude

    • Type: number
    • Range: -180 to 180
    • Description: The longitude coordinate of the hospital's location
  • limit (optional):

    • Type: number
    • Default: 10
    • Minimum: 1
    • Description: Number of patients to recommend
  • include_details (optional):

    • Type: boolean
    • Default: false
    • Description: If set to true, the response will include all patient data plus individual scores for each historical feature (ageScore, replyTimeScore, etc.). This is useful for debugging purposes. All scores are weighted by their respective weights.
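For illustration, the documented parameters can be captured as a TypeScript type with a small validator. The names mirror the query parameters above; the validator itself is a hypothetical sketch, not the service's actual code.

```typescript
// Query parameters for GET /patients/recommend, as documented above.
interface RecommendQuery {
  lat: number;               // required, -90 to 90
  long: number;              // required, -180 to 180
  limit?: number;            // optional, default 10, minimum 1
  include_details?: boolean; // optional, default false
}

// Illustrative validation sketch: returns a list of error messages.
function validateQuery(q: RecommendQuery): string[] {
  const errors: string[] = [];
  if (q.lat < -90 || q.lat > 90) errors.push("lat must be in [-90, 90]");
  if (q.long < -180 || q.long > 180) errors.push("long must be in [-180, 180]");
  if (q.limit !== undefined && q.limit < 1) errors.push("limit must be >= 1");
  return errors;
}
```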

Response:

The endpoint returns a list of recommended patients based on their likelihood to pick up a call from the hospital. The exact structure of the response depends on the include_details parameter.

Success Response (200 OK):
  • When include_details is false (default):
    [
      {
        "id": "string",
        "name": "string",
        "score": 9.67
      }
    ]
  • When include_details is true:
    [
      {
        "id": "string",
        "name": "string",
        "age": 67,
        "acceptedOffers": 22,
        "canceledOffers": 7,
        "averageReplyTime": 1230,
        "location": {
          "latitude": 87.0444,
          "longitude": 155.0585
        },
        "ageScore": 0.7,
        "replyTimeScore": 1.95,
        "offersScore": 4.4,
        "locationScore": 0.3,
        "score": 7.35
      }
    ]

Example usage

GET /patients/recommend?lat=37.7749&long=-122.4194&limit=5&include_details=true

This request would return a list of 5 recommended patients for a hospital located in San Francisco, including detailed scores for each patient.

Time complexity

This endpoint has an $O(log(n))$ time complexity, where $n$ is the total number of patients. (Explained in depth below.)

Implementation

The implementation is divided into two major components:

  • Scoring: How to score each individual patient;
  • Traversing: Given a hospital's coordinates (latitude, longitude) and the patients' scores, how to traverse the patients in order to find the top 10.

Since we can score all patients beforehand, traversing is the critical part: it determines the app's performance, because we must traverse the data for every new hospital search.

So, let's first dive into how traversing is implemented.

Traversing

A O(log n) implementation

To optimize our algorithm, I decided to use a data structure known as a K-d tree (k-dimensional tree): a binary search tree generalized to points in arbitrary $d$ dimensions, partitioning space along one axis per level. Its main selling point is efficient nearest-neighbor searches in $O(log(n))$ time on average.

By leveraging this data structure, we can significantly improve our algorithm's speed. Here's our algorithm's outline:

  1. Preprocessing: Construct a K-d tree using the patients' location data:
    • latitude;
    • longitude;
    • precomputed age score;
    • precomputed offer acceptance rate score;
    • precomputed time to reply score.
  2. Query: When a hospital request comes in, use the K-d tree to efficiently find the nearest neighbors (potential patients) based on location.
  3. Result: Return the k-nearest-neighbors.
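The outline above can be sketched in TypeScript. This is a minimal 2-d tree doing a single nearest-neighbor lookup on raw lat/long with squared Euclidean distance; the real service would use the haversine distance and return the k nearest neighbors. All names are illustrative, not the library's actual API.

```typescript
// A patient's location plus an identifier.
type Point = { lat: number; long: number; id: string };
type KDNode = { point: Point; left: KDNode | null; right: KDNode | null; axis: 0 | 1 };

// Build a balanced 2-d tree by splitting on the median, alternating axes.
function build(points: Point[], depth = 0): KDNode | null {
  if (points.length === 0) return null;
  const axis = (depth % 2) as 0 | 1;
  const key = axis === 0 ? "lat" : "long";
  const sorted = [...points].sort((a, b) => a[key] - b[key]);
  const mid = Math.floor(sorted.length / 2);
  return {
    point: sorted[mid],
    axis,
    left: build(sorted.slice(0, mid), depth + 1),
    right: build(sorted.slice(mid + 1), depth + 1),
  };
}

// Squared Euclidean distance (stand-in for haversine in this sketch).
function dist2(a: Point, b: { lat: number; long: number }): number {
  return (a.lat - b.lat) ** 2 + (a.long - b.long) ** 2;
}

// Standard k-d tree nearest-neighbor search: descend toward the target,
// then only visit the far branch if the splitting plane could hide a
// closer point. O(log n) on average.
function nearest(
  node: KDNode | null,
  target: { lat: number; long: number },
  best: { point: Point; d: number } | null = null
): { point: Point; d: number } | null {
  if (!node) return best;
  const d = dist2(node.point, target);
  if (!best || d < best.d) best = { point: node.point, d };
  const key = node.axis === 0 ? "lat" : "long";
  const diff = target[key] - node.point[key];
  const [near, far] = diff < 0 ? [node.left, node.right] : [node.right, node.left];
  best = nearest(near, target, best);
  if (diff ** 2 < best!.d) best = nearest(far, target, best);
  return best;
}
```

Extending this from one nearest neighbor to the k nearest is a matter of keeping a bounded max-heap of candidates instead of a single `best`.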

The standard approach to traversing is to compute the (computationally expensive) haversine distance to every patient, which has a time complexity of $O(n)$: processing time grows linearly as the total number of patients grows. The K-d tree approach instead answers each hospital query in $O(log(n))$ time on average.

Scoring

Now let's turn our attention to scoring — i.e. how to score an individual patient.

Given this patient:

{
  "name": "Mr. Carmella VonRueden",
  "age": 43,
  "acceptedOffers": 98,
  "canceledOffers": 9,
  "averageReplyTime": 3170,
  "location": {
    "latitude": 87.0444,
    "longitude": 155.0585
  }
}

There are three features we can score statically (i.e. before knowing the hospital's coordinates):

  1. age: I assumed the higher the age the better the score;
  2. averageReplyTime: I assumed the lower the averageReplyTime the better the score;
  3. offers: This one is a bit tricky — I'll go in depth below.

To score one of the static features we simply:

  1. Standardize it (also known as z-score):
    • $z = \frac{(x - \mu)}{\sigma}$
    • Where $x$ is the value to be standardized, and $\mu$ and $\sigma$ are the mean and standard deviation of the $X$ set, respectively.
  2. Normalize it:
    • $normalized = \frac{(z - Z_{\text{min}})}{(Z_{\text{max}} - Z_{\text{min}})}$
    • Where $z$ is the value to be normalized, $Z_{\min}$ and $Z_{\max}$ are the minimum and maximum values in the $Z$ set respectively.
  3. Apply its weight:
    • $weighted = normalized * W_x$
    • Where $normalized$ is the normalized value you wish to weight & $W_x$ is the weight of the respective feature (age, average reply time, etc.).
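The three steps can be sketched as pure functions. The weight value below is an illustrative placeholder, not the one the library actually uses.

```typescript
// Step 1: z-score standardization over the whole feature set.
function standardize(xs: number[]): number[] {
  const mean = xs.reduce((s, x) => s + x, 0) / xs.length;
  const sd = Math.sqrt(xs.reduce((s, x) => s + (x - mean) ** 2, 0) / xs.length);
  return xs.map(x => (x - mean) / sd);
}

// Step 2: min-max normalization of the z-scores into [0, 1].
function normalize(zs: number[]): number[] {
  const min = Math.min(...zs);
  const max = Math.max(...zs);
  return zs.map(z => (z - min) / (max - min));
}

// Step 3: apply the feature's weight.
function weight(normalized: number[], w: number): number[] {
  return normalized.map(n => n * w);
}

// Example: scoring the age feature with a hypothetical weight of 0.1.
const ages = [43, 67, 25, 81, 52];
const ageScores = weight(normalize(standardize(ages)), 0.1);
```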

The standardization step is important because it mitigates the effect of outliers in the patient set.

Probably the most straightforward way to score those features would be to simply normalize them, putting them in the range of [0, 1] and then multiplying by their weights.

This is not great. Why?

Because normalization is vulnerable to outliers — if an outlier is extremely high, the range of the data becomes unusually large, making the normalized values of other data points inordinately small and tightly clustered.

Normalization adjusts the data based on the minimum and maximum values, if outliers are present, they'll significantly skew these minima and maxima, thus distorting the normalized values.

Since many real-world measurements approximate a normal distribution (per the central limit theorem), outliers will almost certainly be present.

We can fix this using standardization!

Standardization (or z-scores)

Standardization — also known as z-scores — mitigates the impact of outliers more effectively because it is based on the mean and standard deviation. It ensures that:

  • The unit of measurement for variances and covariances is consistent across variables, which is particularly important in models that weigh inputs equally (like many machine learning algorithms).
  • It maintains the relative distances between and within data points, preserving outliers in a way that does not disproportionately influence the overall data structure as much as normalization might.
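A quick numeric sketch of the contrast: one extreme averageReplyTime outlier compresses the min-max-normalized values of every other patient into a sliver of the [0, 1] range, while z-scores keep their relative distances in units of the standard deviation. The numbers are made up for illustration.

```typescript
const replyTimes = [100, 200, 300, 400, 100000]; // last value is an outlier

// Min-max normalization: the outlier claims the whole range, so the
// first four patients end up clustered between 0 and ~0.003.
const min = Math.min(...replyTimes);
const max = Math.max(...replyTimes);
const minMax = replyTimes.map(x => (x - min) / (max - min));

// Z-scores: centered on the mean, spread measured in standard
// deviations, so relative distances between patients are preserved.
const mean = replyTimes.reduce((s, x) => s + x, 0) / replyTimes.length;
const sd = Math.sqrt(replyTimes.reduce((s, x) => s + (x - mean) ** 2, 0) / replyTimes.length);
const zScores = replyTimes.map(x => (x - mean) / sd);
```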

Thus, to compute the score for the age & averageReplyTime features one should simply follow the three steps above. Computing acceptedOffers and canceledOffers, however, requires one extra-step:

Computing offers

I've decided to use an Empirical Bayes Estimator, since it makes a ton of sense here. Here's its formulation:

$\text{Offer Score} = \frac{C \times m + \text{Accepted Offers}}{C + \text{Total Offers}}$

Where $C$ is the average number of total offers per patient and $m$ is the median offer acceptance rate.

Why did I choose this approach?

Bayes versus additive scoring

Additive scoring is the idea of adding a 'point' for each accepted offer and subtracting a 'point' for each canceled offer. This simplifies to $\text{Offer Score} = \text{acceptedOffers} - \text{canceledOffers}$.

This can be problematic in the following scenario: patient1 has { acceptedOffers: 100, canceledOffers: 80 } and patient2 has { acceptedOffers: 18, canceledOffers: 0 }

  • patient1 additive score: $100 - 80 = 20$
  • patient2 additive score: $18 - 0 = 18$

Despite canceling 80 of their 180 offers, patient1 outranks patient2, who has a flawless record. That's clearly not the ranking we want.

Bayes versus simple offer acceptancy rate

Suppose patient1 has { acceptedOffers: 1, canceledOffers: 0 } and patient2 has { acceptedOffers: 92, canceledOffers: 2 }. A naive acceptance ratio rates patient1 (100%) above patient2 (~98%), even though patient2's record rests on 94 offers rather than one — which isn't ideal.

Using an empirical Bayes estimator solves both issues.
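The estimator can be sketched and checked against both scenarios above. In practice $C$ and $m$ would be computed over the whole patient set; the values below are illustrative assumptions.

```typescript
// Empirical Bayes offer score: shrinks each patient's acceptance rate
// toward the population median m, with strength proportional to C.
function offerScore(accepted: number, canceled: number, C: number, m: number): number {
  const total = accepted + canceled;
  return (C * m + accepted) / (C + total);
}

const C = 50;  // hypothetical average number of total offers per patient
const m = 0.7; // hypothetical median offer acceptance rate

// Additive-scoring scenario: patient1 (100/80) vs patient2 (18/0)
const p1 = offerScore(100, 80, C, m);
const p2 = offerScore(18, 0, C, m);

// Naive-ratio scenario: a single accepted offer vs 92 accepted, 2 canceled
const q1 = offerScore(1, 0, C, m);
const q2 = offerScore(92, 2, C, m);
```

In both cases the estimator prefers the patient with the stronger record at scale, because small samples are pulled toward the prior.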


Running the Application

npm run dev

This will simply run the server in development mode at port 3000.

npm run dev:debug

This will simply run the server in debug mode (additional logs) at port 3000.

npm run build

This will transpile the TypeScript code into JavaScript.

npm run start

This will start the server in production mode — only available after building with npm run build

npm run build:docker

This will build a docker container for this service. Useful for deploying to the cloud.

npm run start:docker

This will start the server in production mode using the container previously built with npm run build:docker

Assumptions

techs

Stress test

npm run stress-test

  • /api/patients/recommend: ~31k req/s
  • /health: ~43k req/s

Docker