Using Azure Function Apps in a near-real-time system

I recently architected and built a system with real-time requirements, where devices needed to trigger an event on a backend that would orchestrate different workflows based on the event it received.

Having heard about Azure Functions, and in particular Azure Durable Functions, I thought they would be a great candidate to solve the problem at hand. This article focuses on some of the finer details of Azure Functions and how they played with the system design.

Why Azure Functions

At a high level, serverless functions, or Function as a Service (FaaS), provide a way to provision and run a single on-demand feature as a single function (https://en.wikipedia.org/wiki/Function_as_a_service Oct:2020).

Now there may also be questions about why I would look at FaaS instead of a full-fledged application (like .NET Core or Spring Boot). Essentially it came down to the fact that I didn't see that there would be much code to write, and I needed to build, deploy and test very quickly. I didn't want to set up all the boilerplate but rather get cracking quickly to prototype several design problems from the get-go.

Which FaaS is best? While there are many FaaS solutions out there, from AWS Lambda to GCP Cloud Functions, the organisation I worked in was primarily a .NET/Azure organisation. Being closer to the other data sources on the same platform made it more compelling than going multi-platform (where different components are hosted in different cloud providers and connected together through either network peering or API gateways).

Further to this, Azure Functions also provides features around state management, reusable activities and orchestration of activities that are not available out of the box with the other providers. In Azure these are classed as Durable Functions. While non-durable functions are executed within a set timeframe and should complete a single task, Durable Functions provide stateful orchestration of function execution.

What does stateful orchestration mean? If you're familiar with the Actor Pattern (https://en.wikipedia.org/wiki/Actor_model: 2020), it's similar: the idea behind a durable function is to send a payload to a function that manages the work by calling other functions and coordinating the flow and execution of the results. The good thing about durable functions is that they can last longer than the execution of non-durable functions (behind the scenes the state of the durable function is stored in an Azure Storage Table using the Event Sourcing pattern: https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-orchestrations?tabs=csharp: 2019).

They are composed of the following function types (https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-types-features-overview: 2019):

  • Client Functions:
    These are functions called by external entities and triggered by an event. Examples include an http(s) call that triggers the function, or writing a record to a table that triggers a function.
  • Orchestration Functions:
    These functions coordinate calls to other functions. They can be check-pointed when yielding to other function executions, which gives them the advantage of being long-lived: an execution can run for minutes, days, weeks, months, or forever.
  • Activity Functions:
    These functions are used as building blocks for the coordinated work. They are asynchronous functions that can be called from one or many orchestrations. They should be idempotent in their execution, meaning that if the same input is executed multiple times it gives the same result.
  • Entity Functions:
    These are a way of storing and updating small pieces of state. If you remember EJBs, it's a similar concept: with Entity Functions you define a class that wraps an object with some stateful data. Entity Functions are used to update this data and make it usable from other functions (a minimal sketch follows this list).
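
As a quick taste, here is a minimal class-based durable entity, modelled on the Counter example in the Microsoft docs; the names are illustrative, and we'll see how entities were used in practice later in this article:

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

public class Counter
{
    [JsonProperty("value")]
    public int Value { get; set; }

    // operations are invoked by name from other functions, e.g. "Add" or "Get"
    public void Add(int amount) => Value += amount;
    public int Get() => Value;

    // the entity trigger dispatches incoming operations to this class
    [FunctionName(nameof(Counter))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<Counter>();
}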

So, given we now know a little more about why not other FaaS options, and a little more about durable orchestration, what were the other benefits I saw in using Azure Functions?

  • Durable orchestration:
    With HTTP triggers slated to fire at roughly one-second intervals, and processing expected within around 100 ms, and given that I had multiple devices with different scenarios, durable orchestration seemed a good way of managing the different state processes.
  • Simplicity:
    The device that sent events was able to send an https POST when it detected a trigger event, and HTTP trigger events for Azure Functions seemed easy.
  • Scalability/Availability:
    Elasticity to scale out or up. There is a potential issue with cold starts (where there is a delay in provisioning and starting a function to execute), but more about that later.
  • Flexibility:
    I like coding and not worrying about infrastructure, so I have been a big fan of serverless technologies.

Problem 1) Handling events sent by the device

The Azure Functions had to distinguish between different events sent by the device, each encapsulating a trigger event of some sort. Based on a detection event there is essentially a finite set of states to manage with a bunch of orchestrations.

This was the basic orchestration and state management needed.

The main parts of Azure Functions used were:

  • http trigger
  • orchestration logic and
  • activities

Let's look at some example code:

[FunctionName("HttpNPDetectedTrigger")]
public async Task<IActionResult> Run(
//defining the Function to be trigger by http call
[HttpTrigger(AuthorizationLevel.Function,
"post",
Route = "detection-event")] HttpRequest req,

//injecting the durable client used to trigger the orchestration
[DurableClient] IDurableOrchestrationClient starter,

ILogger log) {
//this is just some dummy code to transform req to data
DataFromRequest data = DataFromRequest.Builder(req);

// create a id that will uniquely identify the orchestration
string orcId = $"{data.obs}-{data.eventType}-{Guid.NewGuid()}";
// kick off the specific orchestration based on the event
// by the name of the orchestration
switch (data.DetectionEventEnum)
{
case DetectionEvent.EVENT_TYPE1:
await starter.StartNewAsync("Event1Orc", orcId, data);
break;
case DetectionEvent.EVENT_TYPE2:
await starter.StartNewAsync("Event2Orc", orcId, data);
break;
}

What we see is a normal Azure Function with an HTTP trigger. A POST to the endpoint reads the data from the request object and bundles it into a DTO that we can pass to the orchestration function.

What is important is that you define a unique id for the orcId of the orchestration:

string orcId = $"{data.obs}-{data.eventType}-{Guid.NewGuid()}";

If the orcId is not unique you will get collisions and a 500 error is thrown. Depending on the design of your application, the instance id can be used as a guard to prevent the same execution from being created twice for the same orcId (think unique constraint).
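
If you want that guard, a minimal sketch (reusing the starter client and names from the trigger above) is to check the instance status before starting. Note that in this article's trigger the orcId embeds a Guid, so it is always unique; the guard matters when the id is derived purely from the event data:

// only start a new instance if none with this id is active
var existing = await starter.GetStatusAsync(orcId);
if (existing == null
    || existing.RuntimeStatus == OrchestrationRuntimeStatus.Completed
    || existing.RuntimeStatus == OrchestrationRuntimeStatus.Failed)
{
    await starter.StartNewAsync("Event1Orc", orcId, data);
}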

Based on the eventType sent by the camera, we can kick off different orchestrations. The orchestration function is identified by its name.

await starter.StartNewAsync("Event1Orc", orcId, data);

Notes about the orchestration, as per https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-orchestrations?tabs=csharp:

Orchestrator functions define function workflows using procedural code. No declarative schemas or designers are needed.

Orchestrator functions can call other durable functions synchronously and asynchronously. Output from called functions can be reliably saved to local variables.

Orchestrator functions are durable and reliable. Execution progress is automatically check-pointed when the function “awaits” or “yields”. Local state is never lost when the process recycles or the VM reboots.

Orchestrator functions can be long-running. The total lifespan of an orchestration instance can be seconds, days, months, or never-ending.

Orchestrator functions use event sourcing to ensure reliable execution and to maintain local variable state. The replay behavior of orchestrator code creates constraints on the type of code that you can write in an orchestrator function. For example, orchestrator functions must be deterministic: an orchestrator function will be replayed multiple times, and it must produce the same result each time.

The last point is important. When you run the orchestration code and set a break point, you may see multiple executions of the same function, so the code must be deterministic. This actually caused interesting side effects when we used, for example, a date time as part of the processing: you'll get a warning in the logs about the function not being deterministic.
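
For the date-time case specifically, the replay-safe approach is to take the time from the orchestration context rather than the system clock:

// inside an orchestrator: DateTime.UtcNow changes on every replay,
// whereas CurrentUtcDateTime is checkpointed and replay-safe
DateTime timestamp = ctx.CurrentUtcDateTime;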

The purpose of an orchestration function is to coordinate a bunch of deterministic units of work into a seamless whole. Each unit of work is called an activity. It is important to note that activities can be called at any point in time, and they should be idempotent to avoid side effects when two different orchestrations call the same activity.

Building the example orchestrator:

//the name of the function is used for identifying the function
//when calling it from another function
[FunctionName("Event1Orc")]
public async Task<TaskStatus> Run(
    // the context is used to get the input and has other
    // useful functions when doing orchestration coordination
    [OrchestrationTrigger] IDurableOrchestrationContext ctx,
    ILogger log)
{
    // this will retrieve data passed from the trigger
    DataFromRequest data = ctx.GetInput<DataFromRequest>();

    // call an activity function by name
    Observation obsData = await ctx.CallActivityAsync<Observation>(
        "GetMeSomeData", data.obs);

    // branching your activity calls
    if (obsData != null)
    {
        await ctx.CallActivityAsync<bool>("CheckObsData", obsData);
        await ctx.CallActivityAsync<bool>("UpdateObsWf", obsData);
    }
    else
    {
        await ctx.CallActivityAsync<bool>("ValidateNewObsData", data);
        await ctx.CallActivityAsync<bool>("AddObsData", data);
    }

    return TaskStatus.RanToCompletion;
}

The above code uses the data passed to the orchestration from the trigger function. The orchestration itself is just a method on a class, identified by the FunctionName attribute. The OrchestrationTrigger attribute marks the function as triggered by a call to an orchestration. The context object provides the context information for the orchestration; it has a bunch of useful methods, like getting the data passed from the trigger and calling an activity function. In the code above, different activities are called, each performing a specific interaction.

Defining an activity function:

[FunctionName("CheckObsData")]
public async Task<Observation> CheckHasAnObservation(
[ActivityTrigger] string obs,
ILogger log){
return await _obsService.CheckObs(obs);
}

The activity above shows how to define a method on a class that implements an activity: the FunctionName attribute ("CheckObsData") identifies it, and the ActivityTrigger attribute marks the parameter carrying the data used by the activity.

Key learnings:

  1. Use a unique Instance Id: The Instance Id for the trigger is used to uniquely identify a specific orchestration, so that you can pause and restart that orchestration. It can also be used as a primary key to stop the same key from executing more than once: if a key is seen more than once, an error is raised.
  2. Avoid fine-grained activity functions: Activities, like orchestrators, should be deterministic and asynchronous. One important design decision is how fine- or coarse-grained the activities should be. Fine-grained can be good for ease of reuse; however, if an activity is too fine-grained and does an update, you can run into issues with critical code blocks, causing side effects when multiple orchestrations call the same activity.
  3. Use services to encapsulate the domain to make it easy to test: I could not find a way to write tests against the activity or orchestration functions themselves. In Spring you can write E2E and container tests where Spring objects are executed in the context of the container; I couldn't work out a way of doing the same for Azure Functions. The best approach I found was to ensure that activities delegate to a service and that the service is testable (see the test sketch below).
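
For example, here is a minimal unit-test sketch under these assumptions: an IObsService interface behind the activity, an ObsActivities class receiving it via constructor injection, and xUnit/Moq as the test stack (all of these names are illustrative):

using System.Threading.Tasks;
using Microsoft.Extensions.Logging.Abstractions;
using Moq;
using Xunit;

// hypothetical domain service encapsulating the logic behind the activity
public interface IObsService
{
    Task<bool> CheckObs(Observation obs);
}

public class CheckObsDataTests
{
    [Fact]
    public async Task ReturnsResultFromService()
    {
        var obs = new Observation();
        var service = new Mock<IObsService>();
        service.Setup(s => s.CheckObs(obs)).ReturnsAsync(true);

        // hypothetical class hosting the CheckObsData activity function
        var activities = new ObsActivities(service.Object);
        var result = await activities.CheckHasAnObservation(obs, NullLogger.Instance);

        Assert.True(result);
    }
}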

Problem 2) Slow and steady does not win the race

In a system design that relies on real-time events firing, we discovered a few issues around ensuring timeliness:

  • Cold starts:
    Cold starts are the time taken to provision or start a function when it hasn't been executed for a while (https://azure.microsoft.com/ru-ru/blog/understanding-serverless-cold-start/:2018). Even though there was a timer function used to create a heartbeat (an endpoint that can be called to provide status information that a system is available; see the timer sketch after this list), there were still times when executions took a little longer. For higher environments we moved to the Premium Plan to ensure we had dedicated resources.
  • Critical code blocks:
    Some of the activities were very fine-grained and could be executed multiple times from different orchestrations. This caused issues with critical code blocks being executed by multiple orchestrators, which could cause side effects due to the same data being updated by multiple orchestrators at once.
  • Random 30 sec waits:
    There were random 30-second waits between triggering and execution: we could see the trigger event occurring in our App Insights logs, but the orchestration would not start until up to 30 seconds later.
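
A minimal sketch of such a keep-warm timer function (the schedule and URL here are illustrative, not the original implementation):

using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class Heartbeat
{
    private static readonly HttpClient Http = new HttpClient();

    // fires every 5 minutes so the app never sits idle long enough to unload
    [FunctionName("Heartbeat")]
    public static async Task Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,
        ILogger log)
    {
        // hit a lightweight status endpoint and record the result
        var response = await Http.GetAsync("https://<your-app>/api/status");
        log.LogInformation("Heartbeat returned {Status}", response.StatusCode);
    }
}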

How did we fix the Cold starts?

We enabled the Premium Plan in higher environments. This solved the cold start issues; however, we used the Consumption Plan for lower environments. This introduced an interesting consideration when designing the ARM template for your DevOps deployment.

Within our DevOps release pipeline we defined a flag, isPremiumPlan, that can be overridden by the release pipeline with a value of true or false. Within the ARM template you can then use the parameter to specify which serverFarm to provision.

Here's an example of how you could use the isPremiumPlan flag:

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "isPremiumPlan": {
      "type": "bool",
      "metadata": {
        "description": "flag to indicate whether to use premium plan or not"
      }
    },
    ...
  },
  "variables": {
    "premiumHostingSku": {
      "name": "EP1",
      "tier": "ElasticPremium"
    },
    "consumptionHostingSku": {
      "name": "Y1",
      "tier": "Dynamic"
    },
    ...
  },
  "resources": [
    {
      "type": "Microsoft.Web/serverfarms",
      "apiVersion": "2019-08-01",
      "name": "[parameters('hostingPlanName')]",
      "location": "[parameters('location')]",
      "sku": "[if(parameters('isPremiumPlan'), variables('premiumHostingSku'), variables('consumptionHostingSku'))]",
      ...
    }
  ]
}

By setting a flag via a parameter injected as part of the release pipeline, you can define specific variables for the Consumption versus Premium plan. Using the [if(...)] function in the ARM definition, you can then inject the right variable or string value for the Consumption or Premium plan when creating your serverfarms.

How did we fix the critical code blocks?

Within our orchestration there were activities that needed to be chained to create a cohesive unit of work. Activities, as we have discovered, are asynchronous and can be called by more than one function at the same time. In a multithreaded environment, when multiple threads access the same function and change the state of a variable used within the method, we can have side effects. Such a region is called a critical section, or critical code block (https://en.wikipedia.org/wiki/Critical_section: 2020). Much like in a multithreaded environment, our asynchronous functions were returning values and updating the database at the same time. With activity functions, if your activities are very fine-grained you can find yourself in a situation where a few activities update the database while another activity is reading the same row; this creates side effects and breaks idempotency.

In multithreaded environments you often create a synchronised code block or a semaphore to allow only one thread to execute at a time. Alternatively, if you have multiple unrelated executions, you could also look at implementing a thread barrier (https://en.wikipedia.org/wiki/Barrier_(computer_science): 2020).

To avoid side effects with our activities, we instead leaned on another part of the Azure Functions framework: Entities.

For those familiar with the bad old J2EE days, entity beans were in-memory objects that lived within a container and could be annotated with a transaction paradigm (https://en.wikipedia.org/wiki/Jakarta_Enterprise_Beans: 2020).

In Azure Functions:

“Entity functions define operations for reading and updating small pieces of state. We often refer to these stateful entities as durable entities.” (https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-entities?tabs=csharp: 2019).

One great aspect of Entities is that you can lock an Entity to stop it being updated by multiple orchestrations at once. Essentially, we used Entities as a semaphore.

The following code demonstrates how you could use Entities to manage updates to a small piece of state that is important to your application state.

Firstly define an Entity class:

public class ObsEntity
{
    public ObsModel CurrentValue { get; set; }

    // this is the trigger for the entity to be called
    [FunctionName(nameof(ObsEntity))]
    public static Task Run(
        [EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<ObsEntity>();
}

The above code defines an Entity trigger function: the FunctionName attribute defines the name of the Entity so that it can be called, and the EntityTrigger attribute allows the Entity to be triggered.

Using the Entity as a semaphore can be seen below:

// within your orchestration
EntityId obsEntId = new EntityId(nameof(ObsEntity),
    $"{req.ObsId}");

// anything wanting to access the same obsEntId is blocked
using (await context.LockAsync(obsEntId))
{
    // critical section: only this orchestration holds the lock here
}
// after the using block, the lock is released and disposed

An Entity is created or returned based on the id used to create an EntityId; this bootstraps a new Entity with an id, and from here you can reuse the same Entity based on the EntityId. The entity behind an EntityId is a singleton within the Function App and references a small piece of state that can be updated. To avoid multiple orchestrations updating the same entity, you can lock the EntityId as seen above; as soon as the using block completes, the lock is removed. By defining an EntityId from the variables/data/state that are critical to a code block, you create a semaphore that stops multiple executions updating the same state at the same time.
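
To round this out, here is a hedged sketch of reading and updating the entity while holding the lock. It assumes ObsEntity is extended with Get and Set methods (entity operations); those names are illustrative, not code from the original system:

// hypothetical operations added to ObsEntity:
//   public ObsModel Get() => CurrentValue;
//   public void Set(ObsModel value) => CurrentValue = value;

using (await context.LockAsync(obsEntId))
{
    // read the current state; no other orchestration can change it here
    ObsModel current = await context.CallEntityAsync<ObsModel>(obsEntId, "Get");

    // ... apply your changes to current ...

    // write the updated state back before the lock is released
    await context.CallEntityAsync(obsEntId, "Set", current);
}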

How did we fix the random 30 sec wait after an orchestration was triggered?

One quirky thing we noticed was that, at random, there would be a 30-second wait between a trigger being fired and the orchestration starting.

The answer was hidden in the Durable Functions performance and scale documentation (https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-perf-and-scale#queue-polling). Long story short: the task hub queues are polled with an exponential back-off, and the default maximum polling interval is 30 seconds, so an apparently arbitrary delay of up to 30 seconds can occur. You can set maxQueuePollingInterval to a lower value in your host.json file:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyTaskHub",
      "storageProvider": {
        "connectionStringName": "AzureWebJobsStorage",
        "maxQueuePollingInterval": "00:00:05"
      }
    }
  }
}

Problem 3) What just happened!?!?!

To be able to support the system and understand what had happened, we employed a number of important patterns:

  • App Insights, and logging important code blocks with JSON payloads
  • An event-sourced analytics table
  • The Azure Function App history table

Application Insights

Application Insights is powerful, with its SQL-like query language (much better than trawling through CloudWatch or Splunk). It was an important component used to create different insights into what was happening between the camera and the Azure Function App, and to see the payloads and timings between on-prem and the cloud. The use of JSON in the logs was useful for pulling out certain non-PII but stateful data to understand the running state of the system.

Some useful functions include:

  • parseurl: this can be used to parse a URL. It's quite useful, for example, if you have query parameters that you wish to inspect, e.g.

    requests
    | extend parsed_url = parseurl(url)
    | extend param1 = parsed_url["Query Parameters"]["param1"]

  • parse_json: if you standardize your logs to have a specific JSON format, using parse_json makes it easy to check specific logs within your application. Having a correlation id that can be checked against is a good way to inspect your system. The orchestration id, for example, is a great way to link your logs with a useful correlation id, and it also lets you see the execution of functions in the backing Azure Function App history table (more about this later). Here's a simple example of using parse_json:

    traces
    | extend json_msg = parse_json(message)
    | extend attr1 = json_msg["attr1"]

Event Analytics table

All the events we generated in the Azure Function App were logged to our own custom Azure Table, which used an event-sourcing paradigm. Each row recorded the event, the date and time it occurred, and some key data points observed at the time: whether the event was associated with an order, the key statuses of the order at the time, and other useful debugging information. It was a good way to help stitch together the picture of what was going on. A single arrival event id made it easy to see, and link, all the events that happened for a particular camera detection event.
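
As an illustration, here is one way such an event record might be written with the Azure.Data.Tables SDK; the schema and names here are illustrative, not the original system's:

using System;
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

public class DetectionEventRecord : ITableEntity
{
    // partition on the arrival event id so one detection's events group together
    public string PartitionKey { get; set; }
    // row key orders events by time within the partition
    public string RowKey { get; set; }
    public DateTimeOffset? Timestamp { get; set; }
    public ETag ETag { get; set; }

    public string EventType { get; set; }
    public string OrderStatus { get; set; }
}

public static class EventAnalytics
{
    public static Task LogAsync(
        TableClient table, string arrivalEventId,
        string eventType, string orderStatus)
        => table.AddEntityAsync(new DetectionEventRecord
        {
            PartitionKey = arrivalEventId,
            RowKey = $"{DateTime.UtcNow:o}-{eventType}",
            EventType = eventType,
            OrderStatus = orderStatus
        });
}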

Azure Function App table

The Azure Function App storage tables hold the trigger id for each trigger event, which is really useful for working out what activities ran as part of an orchestration. Within the storage account you will find a history table named after your task hub (with the host.json above, MyTaskHubHistory). This table has the orchestration id as its PartitionKey. The EventType column shows the events and activities that executed; use the Name field to see the activity that ran for a given orchestration id. This is really useful for working out which activities ran and which did not, giving you a clue as to where your orchestration could have bombed out.
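
Here is a hedged sketch of pulling that history with the Azure.Data.Tables SDK (the table name follows your task hub name, and the connection string and instance id are placeholders):

using System;
using Azure.Data.Tables;

string connectionString = "<storage-connection-string>";
string orchestrationId = "<orchestration-instance-id>";

var history = new TableClient(connectionString, "MyTaskHubHistory");

// all history rows for one orchestration instance
var rows = history.QueryAsync<TableEntity>(
    e => e.PartitionKey == orchestrationId);

await foreach (TableEntity row in rows)
{
    // EventType shows what happened; Name holds the activity name (when set)
    Console.WriteLine($"{row.GetString("EventType")}: {row.GetString("Name")}");
}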

Problem 4) Show me the money

With cloud providers it's important to understand how your architecture will run and how it is charged. Azure Function Apps provide Consumption and Premium plans. With Consumption you only pay for what you use; however, if your Azure Function App goes idle, the resources can be released, leading to cold starts.

To run the solution in a near-real-time manner we did, in the end, turn on the Premium Plan, which did increase the cost. To be fair, it shows that an Azure Function App can scale up and out nicely if you don't mind the cost.

Conclusion

Having played with both Google Cloud Functions and AWS Lambda, I can fully appreciate the orchestration and Entity capabilities of Azure Functions. Sure, you can hook Lambdas to SNS for pub/sub or Kinesis for real-time streaming analytics, but for a simple POC I found Azure Functions pretty quick for setting up and orchestrating the problem space I was given. It provides a great deal of flexibility when designing around various different constraints in semi-real time with complex orchestrations.

While there are many benefits to Azure Functions using Durable Functions:

  • you're not constrained to a single execution and can have long-running orchestrations
  • you can reuse and coordinate work streams and branch your workload
  • the state is maintained in an event-sourcing table, so you have a way of checking what was actually executed
  • you spend more time writing code than setting up boilerplate or worrying about infrastructure
  • you can easily run Azure Functions locally without having to deploy;

there are a few traps, as I’ve shown in this article:

  • Give some thought to how fine-grained your activities are
  • Consider critical code blocks that break idempotency
  • Manage cold starts
  • Watch for unexplained 30-second waits caused by Azure Table storage queue polling
  • Watch out for the cost
