Building a completely serverless, secure, end-to-end VOD platform in AWS in 5 days.

Joshua Toth
19 min read · Jan 30, 2021

1 day of planning and 4 days of coding.

Last year, while on the lookout for my next contract opportunity, I realised that I was a little out of practice with AWS. My last couple of positions involved mostly web development with an aspect of infrastructure management, but nothing too in-depth when it came to the AWS ecosystem or any of its moving parts. I took this time off as an opportunity to get myself back up to scratch, as well as to practice some implementations that I'd theorised about but hadn't had the chance to actualise yet.

This article isn’t a tutorial on building this platform but rather my experience building it and the explanation behind some of the decisions I made along the way. Linked here is the Github repository if you are curious about how it all sits together and what the final code ended up looking like. The Terraform script will probably be the most interesting part; I know I’ll be referencing it in the future for sure.

Starting out, I had a few goals:

  1. The platform would have a user login and signup process. For this, Amazon Cognito was chosen.
  2. DynamoDB would be the database of choice. The streams functionality works perfectly for this use case.
  3. The API would be a Node.js Lambda attached to API Gateway; the shape of this wasn’t entirely known yet.
  4. There would be a React frontend for both ‘Creators’ and ‘Users’.

Day 1:

Planning

I wanted to conduct this project in hackathon fashion: I would set a time boundary (5 days) and a target end-to-end MVP to reach. There would also be stretch goals that could be met if time allowed. There were a couple of challenges that I had come across before with this sort of architecture that I also wanted to address, to see how much of a problem they would still pose.

The first day would be spent planning the architecture of the platform: highlighting the necessary components, DB tables and workflows. This was still time-boxed and I knew not everything would be covered, but it did give me a general idea of what needed to be built. In reality this step could take a very long time to conclude, but I wasn’t setting out to design the perfect system right away.

This was the original design for the platform, with some stretch goals in mind.

The components here are broken up into logical steps, as well as the order the development would be done in.

The signup workflow would be the first day of coding, with the goal at the end of the day being a user who can sign up, log in and be given a session token.

The video creation workflow would use the user session token. An authoriser Lambda would validate the token and add user information to the request. Ideally this would have a green path completed where the user could create a video, update the metadata and acquire a URL they could use to upload their video file.

The video processing workflow was the most unknown in terms of AWS interactions. The core parts were S3, MediaConvert and Cloudfront. Once a video was uploaded into the bucket it would be processed by MediaConvert and placed into an S3 bucket with a publicly accessible Cloudfront distribution on top of it.

A light React frontend would be used to facilitate the user interactions and viewership portion of the platform. A stretch goal for this would be to use Elasticsearch to store the published videos and a user being able to search against it.

Serverless Framework Experimentation

Last time I attempted to build a serverless platform it was recommended to me to use the Serverless Framework. I had actually created TypeScript templates for it in the past, as it was something I had played around with, albeit not to any huge extent. Suffice it to say, I was a bit unimpressed with the approach here. I spent a few hours struggling with the config and looking up example after example. All I wanted to do was add a Cognito resource and add a user to it.

This already seemed to fall outside where the strength of the framework lies. I’ll admit that getting a Lambda hooked up to an API Gateway endpoint was easy-peasy. Anything after that was a pain. The documentation was fragmented and advice elsewhere didn’t seem to align. There were also a few points that really bothered me.

  1. Adding the outputs from resources to a Lambda’s environment variables had a race condition: there first needed to be a run to set up the resources and then a second run to use the variables I wanted from them, in this case the CognitoID and ClientID. There didn’t seem to be an easy way to force the creation of the function to wait until either was created.
  2. Tearing down the stack at the end of the day with ‘serverless remove’ did nothing. Most annoying. It gave me a ‘Stack all done’ sort of message even though all the resources stayed up. This wasn’t the case before I added the Serverless dashboard integration, but adding the integration seems to have broken this functionality. Now I had to delete the resources manually.

I’m glad that the issues I had presented themselves immediately. This way I could pivot right away instead of wasting time further down the line. I decided to swap to using Terraform for all my AWS resourcing alongside extra scripts for the bundling and deployment of the Lambda functions. There are plenty of resources for Terraform and I’m way more comfortable with the functionality.

Summing up Day 1

The exercise of planning out the platform was really enjoyable; it’s not something you do very often as a dev, and building a platform like this is quite the warmup, as there are a lot of angles to consider. Even if it’s a little wrong in the end, having a complete overview is very valuable.

Day 2 — Sign up workflow

The objective for the day was to have a user sign up and log in, receiving a session token. The design of this part of the platform was an improvement on a previous design that I had used a couple of years ago. I also wanted the infrastructure and deployment to be completely scripted. I forwent using an official CI/CD platform to run my deployments in favour of much more rapid deployments from my local machine. The full deployment script can be found here.

The premise is that the user signs up through the signup Lambda, and a record is created in both the Cognito pool and the users table. Cognito has all the benefits of signup security, integrations, 2FA and all those bells and whistles. I added a custom field in Cognito to store the UserID of the Users table, so when you log in, that’s how the link is made to get the rest of the user info. I had forgotten about the confirmation step that Cognito has. I opted to manually confirm my test users, but it is possible to programmatically confirm users instead. Users need to be confirmed before they can officially log in.
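As a rough sketch of that flow (the table name, custom attribute name and the programmatic confirmation shown here are assumptions on my part, not the repo's exact values):

```typescript
// Hedged sketch of the signup Lambda's core: create the Cognito user, confirm it
// programmatically, and store a matching record in the users table.
import { CognitoIdentityServiceProvider, DynamoDB } from 'aws-sdk';
import { v4 as uuidv4 } from 'uuid';

const cognito = new CognitoIdentityServiceProvider();
const db = new DynamoDB.DocumentClient();

export const signUpUser = async (username: string, password: string, email: string) => {
  const userId = uuidv4();

  await cognito
    .signUp({
      ClientId: process.env.COGNITO_CLIENT_ID!,
      Username: username,
      Password: password,
      UserAttributes: [
        { Name: 'email', Value: email },
        { Name: 'custom:userId', Value: userId }, // links the Cognito user to the users table
      ],
    })
    .promise();

  // Programmatic confirmation, instead of confirming test users by hand.
  await cognito
    .adminConfirmSignUp({ UserPoolId: process.env.COGNITO_POOL_ID!, Username: username })
    .promise();

  await db
    .put({
      TableName: 'Users', // assumed table name
      Item: { UserId: userId, Username: username, SignupDate: new Date().toISOString() },
    })
    .promise();

  return userId;
};
```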

I also assumed that Cognito would automatically send confirmation emails as part of its default configuration, but that didn’t seem to be the case. I didn’t get back to that.

Deployments and Terraform

Terraform was going to be my way forward for all my infrastructure. It’s powerful, has a huge community and is the standard in many workplaces. I had experience with it before, although there were definitely some things I learned throughout this process. Following this tutorial I was able to get my end-to-end API Gateway -> Lambda instance stood up. It was useful for refreshing my memory as well as giving me a baseline to start with.

API Implementation

Originally I thought I would use a single Lambda file per API endpoint, or at least one API Gateway per user type. Neither of these ended up being the case.

First pass at API gateway

Instead I ended up using a serverless-express instance for each API type. I also modified the original API implementation from the tutorial to have the {proxy+} route for the API nested under a /users route rather than the root. This way I could reuse the same API resource for the multiple APIs I intended to use. Note that authorisation is not present on this endpoint either, as users won’t yet have a session token to use. A future improvement here would be to lock the route down to POST only, as that’s all this API needs.
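Roughly, each API Lambda hangs together like this (a sketch assuming the aws-serverless-express package; route bodies are omitted and paths mirror the /users nesting described above):

```typescript
// Sketch of one API Lambda: an Express app served through aws-serverless-express,
// with every route nested under /users to match the {proxy+} resource.
import * as awsServerlessExpress from 'aws-serverless-express';
import { APIGatewayProxyEvent, Context } from 'aws-lambda';
import express from 'express';

const app = express();
app.use(express.json());

app.get('/users/healthcheck', (_req, res) => res.json({ ok: true }));
app.post('/users/signup', (_req, res) => {
  /* create the Cognito user and the users table record */
  res.status(201).send();
});
app.post('/users/login', (_req, res) => {
  /* verify credentials with Cognito and return a session JWT */
  res.send();
});

const server = awsServerlessExpress.createServer(app);

export const handler = (event: APIGatewayProxyEvent, context: Context) =>
  awsServerlessExpress.proxy(server, event, context);
```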

Lambdas file structure

I went with a fairly straightforward approach for the API routes: one file per API type (users/viewers/creators). Each one contains its own API setup and its own routes. Above you can see the three routes that were used: Healthcheck for smoke testing and Signup/Login for the signup process.

Amazon Cognito

In my last iteration of using Cognito I had made a mistake: I tried using the token and IAM information that Cognito passes back when a user is authorised. I chose a different route this time, instead opting to use Cognito only to handle signup and login verification. Once the user was verified, a JWT was passed back that I generated within the login Lambda and encoded using jsonwebtoken. The JWT can be verified later (within the custom authoriser) and Cognito no longer has to be part of the system.
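In essence, something like the following (a minimal sketch; the secret source and claim names are my assumptions, and as described below the project ended up hard-coding a certificate string):

```typescript
// Minimal sketch of issuing and verifying the session JWT with jsonwebtoken.
import jwt from 'jsonwebtoken';

const SECRET = process.env.JWT_SECRET ?? 'local-dev-only-secret'; // assumed secret handling

export interface SessionClaims {
  userId: string;
  username: string;
}

// Called by the login Lambda once Cognito has verified the credentials.
export const createSessionToken = (claims: SessionClaims): string =>
  jwt.sign(claims, SECRET, { expiresIn: '2h' }); // 2 hour expiry, matching the refresh flow

// Called later by the custom authoriser; throws if the token is invalid or expired.
export const verifySessionToken = (token: string): SessionClaims =>
  jwt.verify(token, SECRET) as SessionClaims;
```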

A stretch target of this component would have been to use a DynamoDB table to store certificates that could then have their DB key embedded in the JWT. When the credentials were initially encoded the latest certificate in the table would be used, a certificate that could be created hourly using a timed Lambda. When the token was decoded the key could be used to retrieve the certificate to verify the JWT. As well as having a rolling certificate, other security benefits are granted via this method, such as being able to entirely delete a certificate to instantly invalidate a small window of session tokens. In the end I hard coded a certificate string to use instead. After all, this is a hackathon.
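Purely to illustrate that idea (none of this exists in the repo, and every name here is hypothetical), the rolling-certificate approach might look something like:

```typescript
// Hypothetical sketch of the rolling-certificate stretch goal: signing secrets live in a
// DynamoDB table and the record's key is embedded in the JWT so it can be looked up again.
import jwt from 'jsonwebtoken';
import { DynamoDB } from 'aws-sdk';

const db = new DynamoDB.DocumentClient();
const CERTS_TABLE = 'Certificates'; // hypothetical table, topped up hourly by a timed Lambda

// The caller would pass the ID of the latest certificate in the table.
export const signWithCert = async (claims: object, certId: string): Promise<string> => {
  const { Item: cert } = await db.get({ TableName: CERTS_TABLE, Key: { CertId: certId } }).promise();
  // Embed the certificate's key in the token so verification knows which secret to fetch.
  return jwt.sign({ ...claims, certId }, cert!.Secret, { expiresIn: '2h' });
};

export const verifyWithCert = async (token: string): Promise<object> => {
  const { certId } = jwt.decode(token) as { certId: string };
  const { Item: cert } = await db.get({ TableName: CERTS_TABLE, Key: { CertId: certId } }).promise();
  // Deleting a certificate row instantly invalidates every token signed with it.
  if (!cert) throw new Error('Certificate revoked');
  return jwt.verify(token, cert.Secret) as object;
};
```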

Users table

When building out the users table there were a few design decisions I made where there wasn’t an immediate use for the values, but there was a plan.

A basic record in the users table
  • ‘LastLogin’ and ‘SignupDate’: For auditing.
  • ‘AccountStatus’: For user management. Not used yet.
  • ‘CertificateAgeLimit’: The oldest a session token can be and still be valid. Used for invalidating logins issued before a given date.
  • ‘Username’: So I could retrieve a value back out of Dynamo when logging in, to see if it all worked.

Cloudwatch

From my experience, the usual API Gateway -> Lambda setup doesn’t include Cloudwatch logging out of the box. Having logging on your Lambdas is invaluable and it’s very easy to configure within Terraform. I strongly recommend spending the tiny amount of time to set this up for all Lambdas.

Cloudwatch Logs

Tip: Because API gateway and Terraform are a little funky together, you need to taint the Gateway resource to force a new deployment if you change something.

Summing up Day 2

Today the goal was achieved. I was able to sign up a user, authorise them and then use the same credentials to log in. I was very happy with how the approach of using a JWT turned out compared to using the IAM credentials. While the stretch goal wasn’t reached, I had confidence in its design.

Day 3 — Video creation workflow

The goal of day 3 was to create an authenticated API where logged in users could create a video record, edit the video record and then retrieve a URL they could use to upload their video to the RawVideo S3 bucket.

API Gateway custom authoriser

The Creators API itself was almost a complete copy of the Users API. One key difference though is the custom authoriser.

Terraform portion for the proxy+ route

For the authoriser I created a small Lambda. It would intercept the request, decode and validate the JWT from the Authorization header, and retrieve and validate the user details from the users table. This would be the last Lambda that actually accesses the users table. The user details would then be forwarded on to the Lambda proxy handler. From then on, any request that reached the API would have the user’s details attached, as well as confidence that it was a secure request. I also went ahead and added an extra ‘refreshed’ JWT to be returned with the rest of the data. This way, if the frontend needed to keep its session alive (2 hour expiry), it could do so while the user was logged in.
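A stripped-down sketch of that authoriser (table name, context field names and the module path of the session helpers sketched earlier are all assumptions):

```typescript
// Sketch of the custom authoriser: validate the JWT, load the user, and return an
// IAM policy plus a context object that API Gateway forwards to the proxy Lambda.
import { APIGatewayTokenAuthorizerEvent, APIGatewayAuthorizerResult } from 'aws-lambda';
import { DynamoDB } from 'aws-sdk';
import { createSessionToken, verifySessionToken } from './session'; // hypothetical module path

const db = new DynamoDB.DocumentClient();

export const handler = async (
  event: APIGatewayTokenAuthorizerEvent
): Promise<APIGatewayAuthorizerResult> => {
  // Throws (and so denies the request) if the token is invalid or expired.
  const claims = verifySessionToken(event.authorizationToken.replace('Bearer ', ''));

  const { Item: user } = await db
    .get({ TableName: 'Users', Key: { UserId: claims.userId } })
    .promise();
  if (!user) throw new Error('Unauthorized');

  return {
    principalId: claims.userId,
    policyDocument: {
      Version: '2012-10-17',
      Statement: [{ Action: 'execute-api:Invoke', Effect: 'Allow', Resource: event.methodArn }],
    },
    // Everything in context is available to the proxy handler via the request context.
    context: {
      userId: claims.userId,
      username: user.Username,
      refreshedToken: createSessionToken(claims), // keeps the 2 hour session alive
    },
  };
};
```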

The current API state. Including Authoriser.

The new view of the API shows the authoriser attached to the {proxy+} resource on the Creators API. However, there was a bug in this implementation. One of the quirks of the authoriser is that it can cache your request; this has the benefit of being faster for subsequent requests, but it has the drawback of not being compatible with the {proxy+} method at all. What happens is the first request comes in under POST: creators/video, which authorises fine. But now the authoriser has a 300 second cache ONLY for that endpoint. So if a second request were to come in for GET: creators/videos, it would return a 403, because the cache only respects the first endpoint it encounters.

The 300 second time-to-live for the cache is the default within Terraform and must be explicitly set to zero. This fixes the issue altogether.

DynamoDB

There were two tables in this portion of the implementation: Videos for tracking the actual video data and RawVideos for tracking the upload process. When a user requests to upload a video, a RawVideo record is created. The Videos table was also given a global secondary index so the table could be queried by UserID rather than video ID (i.e. to get all videos with UserID X).

Terraform example for the Videos global secondary index.

I also tried a new version of the DynamoDB library today. Previously I had been using the new AWS.DynamoDB() client, where you need to query and retrieve your data by explicitly declaring each attribute’s ‘type’. This time I tried new AWS.DynamoDB.DocumentClient(), which overall simplifies the querying and data-storage side of the library.

Left: DocumentClient syntax. Right: The other client

While it doesn’t look like a huge difference, it makes a big impact in the long run.
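Roughly, the difference looks like this (the index name is an assumption; the table and attribute names follow the ones described above):

```typescript
// Side-by-side sketch: the low-level client needs explicit attribute types, while the
// DocumentClient works with plain JavaScript values.
import { DynamoDB } from 'aws-sdk';

const lowLevel = new DynamoDB();
const docClient = new DynamoDB.DocumentClient();

const demo = async () => {
  // Low-level client: every value is wrapped in its DynamoDB type ('S', 'N', ...).
  await lowLevel
    .getItem({ TableName: 'Videos', Key: { VideoId: { S: 'abc-123' } } })
    .promise();

  // DocumentClient: the same lookup with plain values.
  await docClient.get({ TableName: 'Videos', Key: { VideoId: 'abc-123' } }).promise();

  // DocumentClient query against the UserId global secondary index: "all videos for user X".
  await docClient
    .query({
      TableName: 'Videos',
      IndexName: 'UserIdIndex', // assumed GSI name
      KeyConditionExpression: 'UserId = :u',
      ExpressionAttributeValues: { ':u': 'user-456' },
    })
    .promise();
};
```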

Cleanup

I also took some time today to refactor a little. I began by adding TypeScript to the Lambdas. I’m a big fan of TypeScript and I was feeling the pain of not having it in the solution. Being able to add it so I had type checking and typings was great. The actual TypeScript code I wrote wasn’t my best, but that wasn’t my main focus. I also changed some of the naming within Terraform, as a lot of the generic API names were ‘users’, which was beginning to get confusing. I also decided on a name for the platform, ‘Valvid’, and renamed my assets accordingly.

Summary of Day 3

Overall the progress today was slightly slower than anticipated, although I did have external distractions so I didn’t get as much focus on it as I wanted. The S3 upload URL was able to be returned so I did achieve my goal!

Day 4 — The Frontend

The focus of day 4 was initially supposed to be the upload and processing of videos, but I spent some time throwing together a light React frontend because it would be required to upload the videos easily. There isn’t a huge focus on this part of the platform and it was slapped together quite roughly; for my personal goals in this project it wasn’t the practice I needed.

I used https://github.com/facebook/create-react-app#create-react-app-- to get up and running: npx create-react-app valvid --template typescript

I did polish the frontend up more than I intended to. I ended up with a signup/login form, state management, and the ability to create, view and edit past videos as well. I finished just before wiring up the S3 upload URL.

API Gateway and CORS

One thing I was particularly wary of was the CORS issue you get with API Gateway. I recall that a couple of years ago, with a similar stack, there were problems with CORS on the frontend and with the Terraform required to enable it across all endpoints.

This is all that was required in the end to get CORS working on the Creators route. I expected much more of a problem but this was easy peasy!

Summary of Day 4

All in all I only got around 3 hours to work on the platform today; however, the frontend ended up in a better spot than I intended. Originally it was just going to be a couple of different forms in a grid, but it ended up having a more interactive flow to it.

Day 5 — Video processing, the finale

Today was ambitious. I was going to have videos uploaded to S3, processed by MediaConvert and then placed into an S3 bucket, ready to be served via Cloudfront. The video records would be updated to ‘Processed’ and then published by the creator. Published videos would be searchable via a third public API and streamed through the frontend.

This section definitely had the most moving parts in theory, although some of the queueing logic was left out. The stretch goal of this section was to have logic surrounding red routes and healing the system: if videos failed at any stage, it would clean up and update the RawVideos table. Only the green route ended up getting finished, though.

Uploading to S3 and triggering a Lambda

For uploading to S3, I created a POST endpoint in the Creators API that generates a presigned URL the user can then use to upload directly to S3. This is a handy approach, as you don’t even have to handle files being uploaded to your server (Lambda). It saves on bandwidth and prevents the need for pesky local file system storage. Initially I used the getSignedUrl function from the S3 library, but in the end I used createPresignedPost, which worked well for this use case. I added an expiry of 600 seconds and a content length of ~1 GB to stop users uploading terabyte-sized files.
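A minimal sketch of that createPresignedPost call (bucket env var and key convention are assumptions; the real handler lives in the Creators API):

```typescript
// Generate a presigned POST the browser can use to upload the raw video directly to S3.
import { S3 } from 'aws-sdk';

const s3 = new S3();

export const getUploadUrl = (rawVideoId: string): Promise<S3.PresignedPost> =>
  new Promise((resolve, reject) => {
    s3.createPresignedPost(
      {
        Bucket: process.env.RAW_VIDEOS_BUCKET!, // assumed env var
        Fields: { key: `${rawVideoId}.mp4` },   // assumed key convention
        Expires: 600,                           // URL is only valid for 10 minutes
        Conditions: [['content-length-range', 0, 1024 * 1024 * 1024]], // cap uploads at ~1 GB
      },
      (err, data) => (err ? reject(err) : resolve(data)) // data.url + data.fields drive the form POST
    );
  });
```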

S3 has a notification trigger that can be attached to in various ways. Initially I intended to create an SNS topic that published to an SQS queue, and if I were to production-harden this system I would still use them. Instead of the pub/sub model, though, I attached a simple Lambda triggered directly by the file upload. This Lambda (RawVideoUploaded) would handle the video processing.

MediaConvert

This was my first hands-on experience with AWS MediaConvert. It was a little bit of a strange service to use and the documentation wasn’t fantastic but I did manage to find an example to get my first videos processing. The actual config parameters for the video were 120 lines! This included things like audio, codecs, file locations and keys.

The first step of this process is to add a Job queue into Terraform. The Job queue is pretty much what it sounds like, a queue for all the jobs that MediaConvert will be attempting to process.

Terraform for a job queue

The Job queue ARN is passed into the RawVideoUploaded Lambda and used to add items into MediaConvert. Once a file has been processed, the information given back is somewhat useful. In my case it took a couple of tries to get my first video to process successfully.

There was nothing particularly difficult about this step in the process. I found an example of a config that would work for the job queue and, apart from the audio track, I didn’t deviate much from it. The errors were mostly helpful; the example above indicated there weren’t multiple audio tracks on my video, so I just removed the boilerplate audio track. The video processing itself was very fast, in most cases finishing after a few seconds. I did struggle a little bit with the policy forwarding from the Lambda, but it was mostly config I had missed.
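For context, the createJob call from the RawVideoUploaded Lambda looks roughly like this (env var names are assumptions, and the Settings object is heavily truncated compared to the real ~120 lines of codec and audio config):

```typescript
// Sketch of submitting a MediaConvert job for a freshly uploaded raw video.
import { S3Event } from 'aws-lambda';
import { MediaConvert } from 'aws-sdk';

// MediaConvert needs an account-specific endpoint (discoverable via describeEndpoints).
const mediaConvert = new MediaConvert({ endpoint: process.env.MEDIACONVERT_ENDPOINT });

export const handler = async (event: S3Event): Promise<void> => {
  const record = event.Records[0];
  const inputFile = `s3://${record.s3.bucket.name}/${record.s3.object.key}`;

  await mediaConvert
    .createJob({
      Queue: process.env.JOB_QUEUE_ARN!,        // Terraform output passed in as an env var
      Role: process.env.MEDIACONVERT_ROLE_ARN!, // IAM role MediaConvert assumes to read/write S3
      Settings: {
        Inputs: [{ FileInput: inputFile /* plus audio/video selectors */ }],
        OutputGroups: [
          {
            // File group writing into the ProcessedVideos bucket; codec settings omitted here.
            OutputGroupSettings: {
              Type: 'FILE_GROUP_SETTINGS',
              FileGroupSettings: { Destination: `s3://${process.env.PROCESSED_VIDEOS_BUCKET}/` },
            },
            Outputs: [{ /* container, video and audio descriptions go here */ }],
          },
        ],
      },
    })
    .promise();
};
```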

The first successfully processed video

Once the files are placed into the ProcessedVideos S3 bucket, another Lambda is triggered. The job of this Lambda was originally going to be updating the RawVideo record, with another Lambda triggered off the DynamoDB stream to update the Video record. I ended up querying the RawVideo record for the VideoID and directly updating the Video record, bypassing the RawVideo update altogether.
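Sketched out, that shortcut looks something like this (the key convention, key names and status values are assumptions):

```typescript
// Sketch of the ProcessedVideos-triggered Lambda: find the RawVideo record for the
// uploaded key, then mark the corresponding Video record as processed.
import { S3Event } from 'aws-lambda';
import { DynamoDB } from 'aws-sdk';

const db = new DynamoDB.DocumentClient();

export const handler = async (event: S3Event): Promise<void> => {
  const rawVideoId = event.Records[0].s3.object.key.split('.')[0]; // assumed key convention

  const { Item: rawVideo } = await db
    .get({ TableName: 'RawVideos', Key: { RawVideoId: rawVideoId } })
    .promise();

  await db
    .update({
      TableName: 'Videos',
      Key: { VideoId: rawVideo!.VideoId },
      UpdateExpression: 'SET UploadStatus = :s',
      ExpressionAttributeValues: { ':s': 'PROCESSED' }, // assumed status value
    })
    .promise();
};
```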

Cloudfront

Once the file is processed, it’s placed into a target S3 bucket. The videos in here are all ready to be served, but we don’t want to be streaming straight out of the bucket itself; S3 is slow and expensive, and accessing files en masse without a CDN is a bad idea. I ended up creating a Cloudfront distribution on top of this S3 bucket and gave it read access to the files.

You do this by creating an aws_cloudfront_origin_access_identity and adding a bucket policy with the access identity as the principal. That way only Cloudfront has GET access and can cache the videos itself. The Cloudfront distribution URL I used was the automatically generated endpoint, which I embedded as an environment variable in the Viewers API.

Creators API changes

For the Creators API I needed to add some extra state tracking for each video, eventually landing on the attribute UploadStatus to track a raw video’s progress: ‘UPLOADED’, ‘PROCESSED’. Once a video has been flagged as processed, the user can call the /publish endpoint to publish their video, ready for viewers to find.
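A minimal sketch of what that /publish route might look like inside the Creators Express app (the route path, condition and field values are assumptions based on the description above):

```typescript
// Flip a processed video to published so the Viewers API can find it.
import express from 'express';
import { DynamoDB } from 'aws-sdk';

const db = new DynamoDB.DocumentClient();
export const creatorsRouter = express.Router();

creatorsRouter.post('/creators/videos/:videoId/publish', async (req, res) => {
  await db
    .update({
      TableName: 'Videos',
      Key: { VideoId: req.params.videoId },
      // Only allow publishing once MediaConvert has finished with the upload.
      ConditionExpression: 'UploadStatus = :processed',
      UpdateExpression: 'SET VideoStatus = :published',
      ExpressionAttributeValues: { ':processed': 'PROCESSED', ':published': 'PUBLISHED' },
    })
    .promise();
  res.json({ published: true });
});
```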

Viewers API

The final API of the platform is now introduced. Almost identical to the original Users API, it just doesn’t use an authoriser. I was able to copy the config exactly, with only the route and handler changing. This section was quickly thrown together with a single endpoint: /viewers/videos. The endpoint does a query on the Videos table (in reality something like Elasticsearch, or anything other than DynamoDB, would be better) to find all published videos. I added a third global secondary index (using VideoStatus + VideoID) to the Videos table to facilitate this.

Tip: “Status” is a reserved keyword in DynamoDB and updates cannot be done via the DocumentClient library using it directly (without aliasing the attribute name), so I changed the field to “VideoStatus” instead.

Once the videos were retrieved, the Cloudfront URL was prepended to each video’s filename and they were ready to be served!
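Put together, the single viewers endpoint amounts to something like this (the GSI name, env var and attribute names are assumptions):

```typescript
// Sketch of the /viewers/videos handler: query the VideoStatus GSI for published videos
// and prefix each filename with the Cloudfront distribution URL so it can be streamed.
import { DynamoDB } from 'aws-sdk';

const db = new DynamoDB.DocumentClient();

export const getPublishedVideos = async () => {
  const { Items } = await db
    .query({
      TableName: 'Videos',
      IndexName: 'VideoStatusIndex', // assumed GSI name: VideoStatus (hash) + VideoId (range)
      KeyConditionExpression: 'VideoStatus = :published',
      ExpressionAttributeValues: { ':published': 'PUBLISHED' },
    })
    .promise();

  return (Items ?? []).map((video) => ({
    ...video,
    Url: `${process.env.CLOUDFRONT_URL}/${video.FileName}`, // assumed env var and attribute
  }));
};
```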

The first end to end processed and published video.

This view is the culmination of all my work during this hackathon. I added a viewers mode to the original frontend so a separate one didn’t need to get built just to display a list of videos. The frontend could use some styling, but I really wanted to get all the other moving parts working first.

What you see here is the result of:

  • A user creation process with anonymous users who can then be approved
  • A user login process which grants secure access to the ‘Creator’ API
  • Videos that can be created and updated in draft mode
  • Upload functionality for video files
  • Video processing using AWS MediaConvert
  • Video process tracking with event updates to keep the status of the video in sync
  • A Cloudfront distribution to serve the videos that are processed.
  • Video publishing action
  • A public mode where you can access the ‘Viewers’ API to get a list of all published videos that can then be viewed
  • Videos displayed in an HTML5-compatible video format that can be streamed right in the browser
  • All while using a completely serverless architecture with the entire platform as code! Terraform file length: 777 lines!

Summary

This was a great exercise to take on. There were a lot of elements that not everyone gets much practice with, and enough moving parts and scope that the task didn’t seem frivolous. A few times during the process I had to take shortcuts that I didn’t initially intend to: things like smoke testing my deployments, CI/CD, automated tests and forgoing the pub/sub model with SNS + SQS.

This process has proved out a serverless end-to-end in the ‘green path’ sense: there are no recovery patterns or backups, and there are a few broken states where videos could fail to process and the draft would be stuck. This is a hackathon project, and there would still be a fair bit of work to harden the system for actual use.

I was really happy using Terraform for all the infrastructure; it worked brilliantly and supported everything I needed to do with little to no struggle. There were some gotchas with default values, like the cache TTL on the authoriser, but nothing major. The final solution ended up looking more like this:

The approximate final architecture of the exercise

This is a little easier to read than the initial design, although it does lack some of the healing patterns I mentioned earlier. Overall, seeing the end-to-end working was amazing, and when that first video was able to be streamed I was ecstatic! In the end I probably spent about 27 hours coding the solution and researching the implementations I needed. I would have liked a little more time during the week so I could have finished some of the stretch goals.

This entire project was a delight to work on and I’m happy with the result. It was a great refresher for the AWS ecosystem and every step I learned something new. Thanks for reading!

The repository for all of this code can be found here: https://github.com/JoshuaToth/ServerlessVODPlatform
