---
title: "fhircrackr: Download FHIR resources"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{fhircrackr: Download FHIR resources}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#"
)
hook_output = knitr::knit_hooks$get('output')
knitr::knit_hooks$set(output = function(x, options) {
  if (!is.null(n <- options$out.lines)){
    if (any(nchar(x) > n)){
      index <- seq(1, nchar(x), n)
      x = substring(x, index, c(index[2:length(index)] - 1, nchar(x)))
    }
    x = paste(x, collapse = '\n# ')
  }
  hook_output(x, options)
})
# hook_warning = knitr::knit_hooks$get('warning')
# knitr::knit_hooks$set(warning = function(x, options) {
#   n <- 90
#   x = knitr:::split_lines(x)
#   # any lines wider than n should be wrapped
#   if (any(nchar(x) > n)) x = strwrap(x, width = n)
#   x = paste(x, collapse = '\n ')
#   hook_warning(x, options)
# })
```
This vignette covers all topics concerned with downloading resources from a server in some depth. If you are interested in a quick overview, please have a look at the fhircrackr:intro vignette.
Before running any of the following code, you need to load the `fhircrackr` package:
```{r}
library(fhircrackr)
```
## FHIR search requests
To download data from a FHIR server, you need to specify which resources you want to get with a *FHIR search request*. You can define your search request as a simple string and provide it to `fhir_search()` directly. In that case, however, neither spell checking of resource types nor URL encoding will be done for you. If you are comfortable with this, you can skip the following paragraph, as the first part of this vignette introduces the basics of FHIR search and some functions to build valid FHIR search requests with the `fhircrackr`.
A FHIR search request will mostly have the form `[base]/[type]?parameter(s)`, where `[base]` is the base URL to the FHIR server you are trying to access, `[type]` refers to the type of resource you are looking for and `parameter(s)` characterize specific properties those resources should have. The function `fhir_url()` offers a solution to bring those three components together correctly, taking care of proper formatting.
In the simplest case, `fhir_url()` takes only the base URL and the resource type you are looking for, like this:
```{r}
fhir_url(url = "http://hapi.fhir.org/baseR4", resource = "Patient")
```
Internally, the function `fhir_resource_type()` is called to check the type you provided against the list of all currently available resource types, which can be found at https://hl7.org/FHIR/resourcelist.html. Case errors are corrected automatically and the function throws a warning if the resource type doesn't match the list under hl7.org:
```{r,warning=FALSE}
fhir_resource_type(string = "Patient") #correct
fhir_resource_type(string = "medicationstatement") #fixed
fhir_resource_type(string = "medicationstatement", fix_capitalization = FALSE) #not fixed
fhir_resource_type(string = "Hospital") #an unknown resource type, a warning is issued
# Warning:
# In fhir_resource_type("Hospital") :
# You gave "Hospital" as the resource type.
# This doesn't match any of the resource types defined under
# https://hl7.org/FHIR/resourcelist.html.
# If you are sure the resource type is correct anyway, you can ignore this warning.
```
Besides telling the server which resource type to give back, the resource type also determines the kinds of search parameters that are allowed. Search parameters are used to further qualify the resources you want to download, e.g. by restricting the search result to Patient resources of female patients only.
You can add several parameters to the search request. If you don't give any parameters, the search will return all resources of the specified type from the server (unless explicitly limited by the argument `max_bundles`). Search parameters generally come in the form `key = value`. There are a number of resource independent parameters that can be found under https://www.hl7.org/fhir/search.html#Summary. These parameters usually start with a `_`. For example, `"_sort" = "status"` sorts the results by their status and `"_include" = "Observation:patient"` includes the linked Patient resources in a search for Observation resources.
Apart from the resource independent parameters, there are also resource dependent parameters referring to elements specific to that resource type. These parameters come without a `_` and you can find a list of them at the end of every resource page, e.g. at https://www.hl7.org/fhir/patient.html#search for the Patient resource. An example of such a parameter would be `"birthdate" = "lt2000-01-01"` for patients born before the year 2000 or `"gender" = "female"` to get female patients only.
You can add search parameters to your request via a named list or a named character vector:
```{r, out.lines=110}
request <- fhir_url(
  url = "http://hapi.fhir.org/baseR4",
  resource = "Patient",
  parameters = list(
    "birthdate" = "lt2000-01-01",
    "code" = "http://loinc.org|1751-1"))
request
```
As you can see, `fhir_url()` performs automatic url encoding and the `|` is transformed to `%7C`.
### Accessing the current request
Whenever you call `fhir_url()` or `fhir_search()`, the corresponding FHIR search request will be saved implicitly and can be accessed with `fhir_current_request()`.
If you call `fhir_search()` without providing an explicit request, the function will automatically call `fhir_current_request()`.
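For example, after building the request above, you could inspect the currently saved request like this (the output of course depends on your most recent call):
```{r, eval=FALSE}
fhir_current_request()
```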
## Download FHIR resources from a server
To download resources from a server, you use the function `fhir_search()` and provide a FHIR search request.
### Basic request
We will start with a very simple example and use `fhir_search()` to download Patient resources from a public HAPI server:
```{r, eval=F}
request <- fhir_url(url = "https://hapi.fhir.org/baseR4", resource = "Patient")
patient_bundles <- fhir_search(request = request, max_bundles = 2, verbose = 0)
```
```{r, include=F}
#unserialize example bundles provided with the package, because the request above
#is not actually evaluated when this vignette is built
patient_bundles <- fhir_unserialize(bundles = patient_bundles)
```
In general, a FHIR search request returns a *bundle* of the resources you requested. If there are a lot of resources matching your request, the search result isn't returned in one big bundle but distributed over several of them, also called *pages*, the size of which is determined by the FHIR server.
If the argument `max_bundles` is not set, its default `Inf` will be applied. `fhir_search()` will then return all available bundles/pages, meaning all resources matching your request. If you set it to `2` as in the example above, the download will stop after the second bundle. Note that in this case, the result *may not contain all* the resources from the server matching your request, but it can be useful to first look at the first couple of search results before you download all of them.
If you want to connect to a FHIR server that uses basic authentication, you can supply the arguments `username` and `password`. If the server uses some bearer token authentication, you can provide the token in the argument `token`. See below for more information on authentication.
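A minimal sketch of both variants (the server URL and credentials here are purely hypothetical):
```{r, eval=FALSE}
#basic authentication with username and password
bundles <- fhir_search(
  request  = fhir_url(url = "https://secure.example.org/fhir", resource = "Patient"),
  username = "my_username",
  password = "my_password")

#bearer token authentication
bundles <- fhir_search(
  request = fhir_url(url = "https://secure.example.org/fhir", resource = "Patient"),
  token   = "my_token")
```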
Because servers can sometimes be hard to reach, `fhir_search()` will make five attempts to connect to the server before it gives up. With the argument `delay_between_attempts` you can control the number of attempts as well as the time interval between them.
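For instance, the following sketch would make `fhir_search()` try three times, waiting 2, 5 and 10 seconds between attempts (assuming `delay_between_attempts` accepts a vector of waiting times, one element per attempt, as described above):
```{r, eval=FALSE}
#one waiting time per attempt (vector form assumed)
bundles <- fhir_search(
  request = request,
  max_bundles = 2,
  delay_between_attempts = c(2, 5, 10))
```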
As you can see in the next block of code, `fhir_search()` returns an object of class `fhir_bundle_list` where each element represents one bundle of resources, so a list of two in our case:
```{r,results='hide'}
patient_bundles
# An object of class "fhir_bundle_list"
# [[1]]
# A fhir_bundle_xml object
# No. of entries : 20
# Self Link: http://hapi.fhir.org/baseR4/Patient
# Next Link: http://hapi.fhir.org/baseR4?_getpages=ce958386-53d0-4042-888c-cad53bf5d5a1 ...
#
# {xml_node}
# <Bundle>
#  ...
```
### Searching via POST
Usually, the FHIR search request is sent to the server as a GET request. Some servers, however, also allow searching via POST, and in some situations it makes sense to use a POST request here instead. This is mostly the case when the URL of your FHIR search request gets long enough to exceed the allowed URL length. A common scenario for this would be a request querying an explicit list of identifiers. Let's for example say you are looking for the following list of patient identifiers:
```{r}
ids <- c("72622884-0a09-4ea9-9a91-685bce3b0fe3",
"2ca48b68-a641-4be7-a39d-9ffe2691a29a",
"8bcdd92d-5f96-4e07-9f6a-e22a3591ee30",
"2067558f-c9ed-489a-9c2f-7387bb3426a2",
"5077b4b0-07c9-4d03-b9ec-1f9f218f8239")
```
You can use them comma separated in the value of the `identifier` search parameter like this:
```{r}
id_strings <- paste(ids, collapse = ",")
```
But this string would make the FHIR search request URL very long, especially if it is combined with other search parameters.
In a search via POST, the search parameters (everything that would usually follow the resource type after the `?`) are transferred to a body of type `application/x-www-form-urlencoded` and sent via POST. A body of this kind can be created the same way the parameters are usually given to the `parameters` argument of `fhir_url()`, i.e. as a named list or character vector:
```{r}
#note the list()-expression
body <- fhir_body(content = list(
  "identifier" = id_strings,
  "_revinclude" = "Observation:patient"))
```
The body will then automatically be assigned the content type `application/x-www-form-urlencoded`. If you provide a body like this in `fhir_search()`, the URL in `request` should **only** contain the base URL and the resource type. The function will automatically amend it with the suffix `_search` and perform a POST:
```{r, eval = F}
url <- fhir_url(url = "https://hapi.fhir.org/baseR4/", resource = "Patient")
bundles <- fhir_search(request = url, body = body)
```
## Deal with HTTP Errors
`fhir_search()` internally sends a `GET` or `POST` request to the server. If anything goes wrong, e.g. because your request wasn't valid or the server caused an error, the result of your request will be an HTTP error. `fhir_search()` will print the error code along with some suggestions for the most common errors to the console.
To get more detailed information on the error response, you can either call `fhir_recent_http_error()` to print more information into the console or you can pass a string with a file name to the argument `log_errors`. This will write a log with error information to the specified file:
```{r, eval=F}
medication_bundles <- fhir_search(
  request = request,
  max_bundles = 3,
  log_errors = "myErrorFile")
```
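If a request like the one above fails, you could then inspect the most recent error in more detail like this:
```{r, eval=FALSE}
fhir_recent_http_error()
```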
## Save the downloaded bundles
There are two ways of saving the FHIR bundles you downloaded: Either you save them as R objects, or you write them to an xml file. This is possible while downloading the bundles or after all bundles have been downloaded. The following section covers saving after downloading. See the **Dealing with large data sets** section for how to save bundles during downloading.
### Save bundles as R objects
If you want to save the list of downloaded bundles as an `.rda` or `.RData` file, you can't just use R's `save()` or `save.image()` on it, because this will break the external pointers in the xml objects representing your bundles. Instead, you have to serialize the bundles before saving and unserialize them after loading. For single xml objects the package `xml2` provides serialization functions. For convenience, however, `fhircrackr` provides the functions `fhir_serialize()` and `fhir_unserialize()` that can be used directly on the bundles returned by `fhir_search()`:
```{r}
#serialize bundles
serialized_bundles <- fhir_serialize(bundles = patient_bundles)
#have a look at them
head(serialized_bundles[[1]])
```
```{r}
#create temporary directory for saving
temp_dir <- tempdir()
#save
save(serialized_bundles, file = paste0(temp_dir, "/bundles.rda"))
```
If you load this bundle again, you have to unserialize it before you can work with it:
```{r}
#load bundles
load(paste0(temp_dir, "/bundles.rda"))
```
```{r,results='hide'}
#unserialize
bundles <- fhir_unserialize(bundles = serialized_bundles)
#have a look
bundles
# An object of class "fhir_bundle_list"
# [[1]]
# A fhir_bundle_xml object
# No. of entries : 20
# Self Link: http://hapi.fhir.org/baseR4/Patient
# Next Link: http://hapi.fhir.org/baseR4?_getpages=ce958386-53d0-4042-888c-cad53bf5d5a1 ...
#
# {xml_node}
# <Bundle>
#  ...
```
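### Save bundles as xml files
Alternatively, you can write the bundles to xml files on disk. `fhir_save()` writes each bundle of a bundle list into a separate numbered xml file, and `fhir_load()` reads all xml files from a directory back into a `fhir_bundle_list`. A minimal sketch, reusing the temporary directory from above:
```{r, eval=FALSE}
#save each bundle as a numbered xml file
fhir_save(bundles = patient_bundles, directory = temp_dir)
#read them back in
bundles <- fhir_load(directory = temp_dir)
```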
### Creating bundle lists from xml strings
If your bundles are available as xml strings, e.g. because you downloaded them with a different tool, you can convert them into a `fhir_bundle_list` with `as_fhir()`:
```{r}
#minimal example strings, each representing one bundle
bundle_strings <- c(
  "<Bundle><entry><resource><Patient><id value='1'/></Patient></resource></entry></Bundle>",
  "<Bundle><entry><resource><Patient><id value='2'/></Patient></resource></entry></Bundle>"
)
#convert to FHIR bundle list
bundles <- as_fhir(bundle_strings)
```
## Dealing with large data sets
If you need to download a particularly large data set from a FHIR server, this can lead to challenges in two areas: computation time and memory usage. Downloading the FHIR bundles is time consuming because the paging mechanism leading from one bundle to the next is not optimized for speed in most FHIR server implementations, and neither is the execution of complex search queries. Keeping a lot of bundles in the working memory is memory consuming because the xml structures contain a lot of overhead that is only removed once the relevant bits of information are transferred into a table.
There are several options to alleviate these problems, a couple of which will be shown in the following.
### Saving memory
#### 1. Download trimmed resources using `_elements`
If you know you are just going to need a few elements from each resource, you can restrict the downloaded resources to those elements, which will result in much smaller resources and thus much smaller bundles. The following example downloads the first bundle of Patient resources that are trimmed down to the `name`, `gender` and `birthDate` elements, which are specified in the `_elements` parameter. `_elements` takes a comma separated list of base level elements for the resource and will make sure that the downloaded resources only contain those elements plus the mandatory elements `id` and `meta`. The `_count` parameter in the examples restricts the number of resources in the bundle to 2, which is just done to make the result more printable for this vignette.
```{r, eval=FALSE}
request <- fhir_url(url = "http://hapi.fhir.org/baseR4",
resource = "Patient",
parameters = c("_elements" = "name,gender,birthDate",
"_count"= "2"))
bundles <- fhir_search(request, max_bundles = 1)
cat(toString(bundles[[1]]))
```
```{r}
# <Bundle>
#   ...
#   <entry>
#     <resource>
#       <Patient>
#         <id value="..."/>
#         <meta>...</meta>
#         <name>...</name>
#         <gender value="..."/>
#         <birthDate value="..."/>
#       </Patient>
#     </resource>
#   </entry>
#   ...
# </Bundle>
```
As you can see, the resulting bundle is much smaller than it would be if the full resources were downloaded.
#### 2. Batch process bundles by saving them to hard drive
You can save working memory by writing the bundles to your hard drive during the download instead of keeping them all in the working memory of your R session at once. If you pass the name of a directory to the argument `save_to_disc` in your call to `fhir_search()`, the bundles will not be combined in a bundle list that is returned when the download is done, but will instead be saved as xml files into that directory one by one. If the directory you specified doesn't exist yet, `fhir_search()` will create it for you. This way, the R session only has to keep one bundle at a time in the working memory. You can later load the bundles using `fhir_load()` and crack them one after another:
```{r, eval=F}
request <- fhir_url(url = "http://hapi.fhir.org/baseR4", resource = "Patient")
fhir_search(
  request = request,
  max_bundles = 10,
  save_to_disc = "MyProject/downloadedBundles"
)
bundles <- fhir_load(directory = "MyProject/downloadedBundles")
```
#### 3. Batch process bundles by downloading them piece by piece
Alternatively, you can also use `fhir_next_bundle_url()`. This function returns the URL to the next bundle from your most recent call to `fhir_search()`:
```{r, include=F}
assign(x = "last_next_link", value = fhir_url( "http://hapi.fhir.org/baseR4?_getpages=0be4d713-a4db-4c27-b384-b772deabcbc4&_getpagesoffset=200&_count=20&_pretty=true&_bundletype=searchset"), envir = fhircrackr:::fhircrackr_env)
```
To get a better overview, we can split this very long link along the `&`:
```{r}
strsplit(fhir_next_bundle_url(), "&")
```
You can see two interesting numbers: `_count=20` tells you that the queried HAPI server has a default bundle size of 20. `_getpagesoffset=200` tells you that the bundle referred to in this link starts after resource no. 200, which makes sense since the `fhir_search()` request above downloaded 10 bundles with 20 resources each, i.e. 200 resources. If you use this link in a new call to `fhir_search()`, the download will start from this bundle (i.e. the 11th bundle with resources 201-220) and will go on to the following bundles from there.
When there is no next bundle (because all available resources have been downloaded), `fhir_next_bundle_url()` returns `NULL`.
If a download with `fhir_search()` is interrupted due to a server error somewhere in between, you can use `fhir_next_bundle_url()` to see where the download was interrupted.
You can also use this function to avoid memory issues. The following block of code utilizes `fhir_next_bundle_url()` to download all available Observation resources in small batches of 10 bundles that are immediately cracked and saved before the next batch of bundles is downloaded. Note that this example can be very time consuming if there are a lot of resources on the server. To limit the number of iterations uncomment the `if` statement at the end of the `while` loop:
```{r, eval=F}
#Starting fhir search request
url <- fhir_url(
  url = "http://hapi.fhir.org/baseR4",
  resource = "Observation",
  parameters = list("_count" = "500"))
count <- 0
table_description <- fhir_table_description(resource = "Observation")
while(!is.null(url)){
  #load 10 bundles
  bundles <- fhir_search(request = url, max_bundles = 10)
  #crack bundles
  table <- fhir_crack(bundles = bundles, design = table_description)
  #save cracked bundles to an RData-file (can be exchanged for another file type)
  save(table, file = paste0(tempdir(), "/table_", count, ".RData"))
  #retrieve starting point for next 10 bundles
  url <- fhir_next_bundle_url()
  count <- count + 1
  # if(count >= 20) {break}
}
```
### Saving download time
In most cases the bottleneck in your analysis will be the download time from the server, because most FHIR servers are optimized for handling a lot of simultaneous small requests instead of a single big one. You can gain time by splitting up your request into chunks and sending them to the server in parallel using a parallelized version of `lapply()`, but there are a couple of issues to keep in mind.
#### Operating system
The easiest-to-use version of parallelization is the function `parallel::mclapply()`, which uses forking to process the list elements of an `lapply()` call in parallel. As Windows doesn't support forking, this solution can only be used on macOS or Linux operating systems.
If you want to achieve similar results on a windows machine, you can either run the fhircrackr in an R installation/RStudio Server that you set up in WSL2 (see [here](https://support.posit.co/hc/en-us/articles/360049776974-Using-RStudio-Server-in-Windows-WSL2) for an installation guide) or you can try out the [windows mclapply hack](https://github.com/nathanvan/parallelsugar) written by Nathan vanHoudnos.
#### Breaking pointers
The xml objects that represent the FHIR bundles contain external pointers that will break when they are exported to/from a cluster. This means that objects of type `fhir_bundle` or `fhir_bundle_list` always have to be serialized using `fhir_serialize()` when they are downloaded in parallel.
#### Splitting up requests
Splitting up a FHIR request isn't always trivial. We'll show you two scenarios where you can split up a request into smaller chunks.
a) You have a list of resource ids or a list of identifiers (e.g. patient identifiers) for which you intend to download the corresponding resources. This is the simplest case, because you just have to split the vector of ids into smaller chunks and then send one FHIR search request per chunk. You can do that with `fhir_search()`, but there is also a convenience function for exactly this use case called `fhir_get_resources_by_ids()`. The following minimal example of course only works if the ids defined here are actually found on the server:
```{r, eval = FALSE}
# define list of Patient resource ids
ids <- c("4b7736c3-c005-4383-bf7c-99710811efd9", "bef39d3a-62bb-48c0-83ff-3bb70b51d831",
"f371ed2f-5cb0-4093-a491-9df6e6bfcdf2", "277c4631-955e-4b52-bd40-78ddcde333b1",
"72173a13-d32f-4489-a7b4-dfc301df087f", "4a97acec-028e-4b45-a72f-2b7e08cf80ba")
#split into smaller chunks of 2
id_list <- split(ids, ceiling(seq_along(ids)/2))
#Define function that downloads one chunk of patients and serializes the result
extract_and_serialize <- function(x){
b <- fhir_get_resources_by_ids(base_url = "http://hapi.fhir.org/baseR4",
resource = "Patient",
ids = x)
fhir_serialize(b)
}
#Download using 2 cores on linux:
bundles_serialized <- parallel::mclapply(
X = pat_list,
FUN = extract_and_serialize,
mc.cores = 2
)
#Unserialize the resulting list and create one fhir_bundle_list object from it
bundles_unserialized <- lapply(bundles_serialized, fhir_unserialize)
result <- fhir_bundle_list(unlist(bundles_unserialized, recursive = FALSE))
```
b) You have a request that downloads multiple resource types, like `"http://hapi.fhir.org/baseR4/Encounter?_include=Encounter:patient"`, which downloads all Encounters as well as the Patient resources the Encounter is referencing. This type of request will often take a lot of time and can (depending on your system) be sped up if you only load the encounters in a first step, extract the ids of the referenced Patient resources and download those in parallel in a second step:
```{r, eval=FALSE}
#Download all Encounters
encounter_bundles <- fhir_search(request = "http://hapi.fhir.org/baseR4/Encounter")
#Flatten
encounter_table <- fhir_crack(
  bundles = encounter_bundles,
  design = fhir_table_description(resource = "Encounter")
)
#Extract Patient ids
pat_ids <- sub("Patient/", "", encounter_table$subject.reference)
#Split into chunks of 20
pat_id_list <- split(pat_ids, ceiling(seq_along(pat_ids)/20))
#Define function that downloads one chunk and serializes the result
extract_and_serialize <- function(x){
  b <- fhir_get_resources_by_ids(
    base_url = "http://hapi.fhir.org/baseR4",
    resource = "Patient",
    ids = x)
  fhir_serialize(b)
}
#Download using 4 cores on linux:
bundles_serialized <- parallel::mclapply(
  X = pat_id_list,
  FUN = extract_and_serialize,
  mc.cores = 4
)
#Unserialize the resulting list and create one fhir_bundle_list object from it
bundles_unserialized <- lapply(bundles_serialized, fhir_unserialize)
result <- fhir_bundle_list(unlist(bundles_unserialized, recursive = FALSE))
```
## Download random samples from a server
Sometimes it can be useful to download a random sample of resources from a server. The fhircrackr offers the function `fhir_sample_resources()`, which takes a base URL, a resource type and (optionally) some FHIR search parameters and returns a random sample of a given size of those resources.
For example you could download 10 random Patient resources of all female patients born before 1960 like this:
```{r, eval=F}
bundle <- fhir_sample_resources(
  base_url = "http://hapi.fhir.org/baseR4",
  resource = "Patient",
  parameters = c(gender = "female", birthdate = "lt1960-01-01"),
  sample_size = 10
)
```
```{r, include=F}
bundle <- fhir_unserialize(fhircrackr:::female_pat_bundle)
```
This request may take some time because in a first step, the resource (aka logical) IDs of all resources matching the request (i.e. all Patient resources of females born before 1960) are downloaded. This is necessary because the sampling is actually done on this vector of resource IDs.
The following code shows that the result indeed consists of 10 Patient resources that are female and born before 1960. If you want to know more about how to extract information from the resources like this, please see the vignette on flattening resources.
```{r}
pat <- fhir_table_description(resource = "Patient",
cols = c("id", "gender", "birthDate"))
fhir_crack(bundles = bundle, design = pat)
```
Internally `fhir_sample_resources()` performs the following steps:
1) Extract the logical IDs of all resources matching the resource type and search parameters given in `resource` and `parameters` with the function `fhir_get_resource_ids()`. This function uses the `_elements` parameter of FHIR search to avoid downloading all resources in full. You can also use it as a standalone function (see `?fhir_get_resource_ids` and the sketch below this list).
2) Draw a random sample (without replacement) from the vector of IDs created in 1).
3) Download the resources belonging to the sampled IDs using `fhir_get_resources_by_ids()`.
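For illustration, a minimal sketch of using `fhir_get_resource_ids()` on its own (the parameter values are just examples):
```{r, eval=FALSE}
#get the logical IDs of all female patients born before 1960
ids <- fhir_get_resource_ids(
  base_url   = "http://hapi.fhir.org/baseR4",
  resource   = "Patient",
  parameters = c(gender = "female", birthdate = "lt1960-01-01"))
head(ids)
```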
If you want to sample resources based on another element than the logical ID, e.g. based on an identifier value or based on a reference, you can use the function `fhir_sample_resources_by_ids()`, provided you have a vector of identifiers/references you want to sample from. Note that in this case the number of actually returned resources won't necessarily match the number in `sample_size`, because as opposed to the logical ID, an identifier or reference doesn't have to be unique for each resource.
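A sketch of this variant, assuming the argument names mirror `fhir_get_resources_by_ids()` and `fhir_sample_resources()` (the identifier values are purely illustrative):
```{r, eval=FALSE}
#vector of known identifiers to sample from (illustrative values)
identifiers <- c("id-001", "id-002", "id-003", "id-004", "id-005")
#argument names assumed to mirror fhir_get_resources_by_ids()
sample_bundle <- fhir_sample_resources_by_ids(
  base_url    = "http://hapi.fhir.org/baseR4",
  resource    = "Patient",
  ids         = identifiers,
  sample_size = 3)
```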
## Download Capability Statement
The capability statement documents a set of capabilities (behaviors) of a FHIR Server for a particular version of FHIR. You can download this statement using the function `fhir_capability_statement()`:
```{r, eval=F}
cap <- fhir_capability_statement(url = "http://hapi.fhir.org/baseR4")
```
`fhir_capability_statement()` takes the base URL of a FHIR server and returns a list of three data frames containing all information from the capability statement of this server. The first one is called `Meta` and contains some general server information. The second is called `Rest` and contains information on the operations the server implements. The third is called `Resources` and gives information on the resource types and associated parameters the server supports. This information can be useful to determine, for example, which FHIR search parameters are implemented in your FHIR server.
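Assuming the download above succeeded, you could then inspect the individual tables like this:
```{r, eval=FALSE}
#general server information
cap$Meta
#supported resource types and their search parameters
cap$Resources
```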
## A note on HTML in resources
FHIR resources can contain a considerable amount of HTML code (e.g. in a narrative object), which is often created by the server, for example to provide a human-readable summary of the resource. This data is usually not the aim of structured statistical analysis, so in the default setting `fhir_search()` will remove the html parts immediately after download to reduce memory usage (on a HAPI server typically by around 30%, see `fhir_rm_div()`). The memory gain is paid for with a runtime increase of 10%-20%. The html removal can be disabled by setting `rm_tag = NULL` to increase speed at the cost of increased memory usage.
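For example (a sketch; `request` is any FHIR search request as above):
```{r, eval=FALSE}
#keep the html/narrative parts of the downloaded resources
bundles <- fhir_search(request = request, max_bundles = 2, rm_tag = NULL)
```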
## Next steps
To learn about how `fhircrackr` allows you to convert the downloaded FHIR resources into data.frames/data.tables, see the vignette on flattening FHIR resources.