import React from 'react';

const marginBottom = "1rem";
const myStyle = {marginBottom};

export default function LambdaLessons(props?: any) {
    return <div>
        <h1 style={myStyle}>Guidelines for using Lambdas in Production</h1>
        <h2 style={myStyle}>
            Introduction
        </h2>
        <p style={myStyle}>
            AWS Lambda functions are a great way to improve the horizontal scalability of your software
            infrastructure while simultaneously keeping costs low.  They are a great fit for
            most event-based architectures, especially those with inconsistent rates of traffic.
        </p>
        <p style={myStyle}>
            In this post, I'm going to go into some of the less-advertised nuances of lambdas
            that crop up when a company takes lambdas to scale in production.
            If you're just starting out with lambdas, start by reading <a href={"https://docs.aws.amazon.com/lambda/latest/dg/welcome.html"} target={"_blank"}>AWS' documentation</a>.
            The intended audience for this article is people who have already been using lambdas and
            are trying to make sure they've dotted all their i's and crossed all their t's before
            subjecting their systems to larger workloads in production.  Most of the lessons presented here
            are things I learned while building out <a href={"https://swiftmissive.com"}  target={"_blank"}>swiftmissive.com</a>, which
            is an extremely low-cost email service provider that runs mostly on serverless tech.
        </p>
        <p style={myStyle}>
            In general, the key here is to balance all of the concerns below to ensure reasonable, semi-predictable, scalable throughput.
        </p>
        <h2 style={myStyle}>
            Duplicate messages
        </h2>
        <p style={myStyle}>
            A lesser-known caveat of AWS-controlled asynchronous event sources (SNS, SQS, EventBridge) is that they
            generally only guarantee at-least-once delivery, so a message will occasionally be delivered more than once.  In my
            experience this happens roughly once in every 100,000 or so lambda invocations.  You will need to account for this in your
            code somehow.  The docs often say, "Just make your code idempotent! It's easy!"
            But realistically you are probably going to need to keep track of the message ids provided in each message and de-duplicate
            using Redis or some other store.  I've actually found it pretty easy (and weirdly fast) to use a simple
            postgres table to store these ids, up to some tens of millions of rows without any slowdown.  Your mileage
            may vary, though, so make sure to load test whatever solution you use.  If you go the id-deduplication route, make sure to:
        </p>
        <ul>
            <li style={myStyle}>Provide a worker name/id of some sort along with the message id that you are de-duplicating.  This is for
                situations where you might want to have a lambda invoke a second lambda to have it do additional processing
                on the same input message.  If you don't put the worker id as part of the unique identifier of the input
                in this case, then your second lambda might end up rejecting the message because it thinks it's already
                fully processed.</li>
            <li style={myStyle}>Make sure to age off your message ids.  They are GUIDs, which are incredibly hard to accidentally duplicate,
                but the stored ids will eventually slow down whatever storage system you are using anyway, so you'll need to clear out the old ones.</li>
        </ul>
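        <p style={myStyle}>
            Here is a minimal TypeScript sketch of the two points above.  Everything in it is hypothetical: the names
            (dedupKey, SeenMessages, shouldProcess) are made up for illustration, and an in-memory Map stands in for
            the Redis or postgres store you would actually use, since a Map only lives as long as a single lambda container.
        </p>
        <pre><code>{`
// Hypothetical sketch: de-duplicate on (worker id, message id) and age off old ids.
// A Map stands in for Redis/postgres here; it is NOT shared between lambda containers.

const dedupKey = (workerId: string, messageId: string): string =>
    workerId + ":" + messageId;

class SeenMessages {
    private seenAt = new Map<string, number>();

    constructor(private ttlMs: number) {}

    // Returns true the first time a (worker, message) pair is seen, false for duplicates.
    shouldProcess(workerId: string, messageId: string, now: number = Date.now()): boolean {
        this.ageOff(now);
        const key = dedupKey(workerId, messageId);
        if (this.seenAt.has(key)) {
            return false;
        }
        this.seenAt.set(key, now);
        return true;
    }

    // Drop ids older than the TTL so the store does not grow forever.
    private ageOff(now: number): void {
        for (const [key, seen] of this.seenAt) {
            if (now - seen > this.ttlMs) {
                this.seenAt.delete(key);
            }
        }
    }
}

const seen = new SeenMessages(60_000);
seen.shouldProcess("mailer", "abc-123", 0);      // true: first delivery
seen.shouldProcess("mailer", "abc-123", 1);      // false: duplicate delivery
seen.shouldProcess("bouncer", "abc-123", 2);     // true: different worker, same message
seen.shouldProcess("mailer", "abc-123", 70_000); // true: the old id has aged off
`}</code></pre>
        <p style={myStyle}>
            With a real database you would get the same effect from a unique constraint on the (worker id, message id) pair
            plus a periodic delete of old rows.
        </p>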
        <h2 style={myStyle}>
            Retries
        </h2>
        <p style={myStyle}>
            This is not really a lambda-specific suggestion, but occasionally things outside of your control are
            going to fail.  You can avoid a lot of the bad consequences of this by using retry logic in your code.
            Additionally, you should be aware of how AWS redelivers messages when your lambda returns an error, and
            make sure that you are not reprocessing the parts of an input batch that already succeeded.
            Basically, I suggest trying to:
        </p>
        <ul>
            <li style={myStyle}>Wrap all network calls in retry logic, including calls to databases, microservices,
                and everything else.  Occasionally, a network request inside AWS just fails for no obvious reason.  If you don't
                have retry logic built in then your lambda could die in the middle of a job and leave the work in an
                undetermined state.  Retry logic goes a long way to help you avoid those types of issues.</li>
            <li style={myStyle}>Have a dead letter queue (DLQ) for messages that your lambda is having trouble processing.  You can allow up to
                1,000 or so failed deliveries (throttles count too) before a message is moved to the DLQ.</li>
            <li style={myStyle}>If you are allowing batches of SQS messages to be pushed to your lambdas, then you should
                explicitly delete each SQS message in the batch once you are done processing it.  This is because
                you might not be able to process all of the messages in a batch, in which case your lambda
                returns an error and AWS leaves the whole batch in the queue to be redelivered, including the messages you already processed.<br /><br />
                Technically, if you followed my other advice on de-duplication via ids then this really shouldn't have any effect
                on your system, but I'd probably add in this extra precaution anyway.</li>
        </ul>
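        <p style={myStyle}>
            As a rough illustration of the first point, here is a small TypeScript retry wrapper with exponential backoff.
            It is a sketch under my own assumptions, not a recommendation of specific values: withRetry, maxAttempts and baseDelayMs are
            hypothetical names, and the db.query call in the usage comment is a placeholder for whatever client you use.
        </p>
        <pre><code>{`
// Hypothetical helper: retry an async call with exponential backoff.
const sleep = (ms: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
    call: () => Promise<T>,
    maxAttempts: number = 3,
    baseDelayMs: number = 100,
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await call();
        } catch (err) {
            lastError = err;
            if (attempt < maxAttempts) {
                // With the defaults this waits 100ms, then 200ms, between attempts.
                await sleep(baseDelayMs * 2 ** (attempt - 1));
            }
        }
    }
    // Out of attempts: rethrow so the invocation fails and the message can be redelivered.
    throw lastError;
}

// Usage: wrap any network call, e.g. a (placeholder) database query.
// const rows = await withRetry(() => db.query("SELECT 1"));
`}</code></pre>
        <p style={myStyle}>
            Keep the total retry time well under your lambda's timeout, otherwise the retries themselves can push you into a timeout.
        </p>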
        <h2 style={myStyle}>
            Memory & vCPU limits
        </h2>
        <p style={myStyle}>
            Figuring out what memory & vCPU to give your lambdas seems to be more of an art than a science
            (which might also just mean I don't know a good formula for figuring out what to set haha).
            Keep in mind that memory is the only knob you get: AWS allocates CPU in proportion to the memory you configure.
            The main thing I can really say about this is that adding more power to your lambdas may
            in fact reduce costs, because some processes run more than twice as fast when given twice as much
            memory.  Other than that, my only advice is to try a few different values if
            your lambdas are nearing their memory limit, and to check whether the lambda's runtime drops by a higher
            factor than the one you increased the memory by.
        </p>
        <h2 style={myStyle}>
            Timeouts
        </h2>
        <p style={myStyle}>
            Timeouts occur when your lambdas take longer to process their input than the time allotted to them in their configurations.
            A lambda stops executing immediately when it hits its timeout, often leaving its work in an unknown state.
            For that reason it is important to take timeouts very seriously and to avoid them as much as possible.
        </p>
        <p style={myStyle}>
            That is not to say that timeouts are all bad. There is literally no way to externally halt a lambda
            once it starts running.  So, if you accidentally put in an infinite loop or something like that, the
            timeout value is the only thing preventing a single lambda from reducing your database to a smoldering
            pile of ash.  The best you can do when you add an infinite loop (or similar disaster) to your lambdas is to
            prevent further lambda instances from running. You can do this by setting the reserved concurrency on the
            lambda to 0 via the AWS console.  Hopefully with
            proper testing you will be able to avoid running into a run-away lambda situation in production.
        </p>
        <p style={myStyle}>
            Here are some other things to consider when setting your timeout limits that are probably not obvious until you start using them:
        </p>
        <ul>
            <li style={myStyle}>Input batching
                <ul>
                    <li style={myStyle}>
                        Some input/event sources will batch up messages and will send the entire batch to your lambdas.
                        For instance, SQS will deliver between 1 and 10 messages in a batch to each lambda invocation.
                    </li>
                    <li style={myStyle}>
                        Make sure to adjust your timeouts and/or your input sources, where applicable, to achieve a good balance.
                        If your SQS inputs take 7 minutes each to process, it would probably be
                        best to set your timeout to 15 minutes and to configure SQS to only deliver 1 message at a time to each lambda.
                        On the other hand, if they take 100ms each, then you are safe to set a lower timeout and/or to allow
                        the maximum number of messages (10) to be delivered by SQS in a batch, which should save you a bit of money.
                    </li>
                </ul>
            </li>
            <li style={myStyle}>What will happen if the current lambda doesn't finish before the new one starts?
                <ul>
                    <li style={myStyle}>By setting reserved concurrency to 1 for your lambda, you can ensure that only 1 will run at a time where needed.</li>
                </ul>
            </li>
            <li style={myStyle}>If a lambda absolutely must not leave its work half-finished, consider some kind
            of job-tracking table where the lambda can record how far it got and pick up where it left off on its next run.
            (Keep in mind AWS will automatically redeliver the input message if you are using an AWS-controlled input source like SQS, etc.)</li>
        </ul>
        <h2 style={myStyle}>
            Setting the right concurrency limits
        </h2>
        <p style={myStyle}>
            Concurrency limits determine how many instances of each lambda can run at once.  By keeping these numbers
            low, you prevent your lambdas from destroying your database and other services (including
            other lambdas).  You'll want to set sane concurrency limits for every one of your lambdas to avoid unleashing
            a monster you can't control in your own production architecture.
        </p>
        <p style={myStyle}>
            An easy way to think of lambda concurrency is to imagine the plumbing in a house as water flows through the
            pipes, say after someone just washed their hands.  The water enters the drain and goes into the pipes, then those pipes plug into a larger
            pipe that combines their water with the water from the shower or wherever, and so on until eventually all these
            pipes connect together and all the water flows into the sewer through one big pipe.  It's not a perfect analogy, but some of the
            principles are the same:
        </p>
        <ul>
            <li style={myStyle}>
                Pipes should get bigger instead of smaller to avoid backups in the case of multiple sources
                triggering lambdas at once, and to keep it obvious where issues are occurring if such backups do occur.
                In the case of lambdas, this means the lambdas that live later in your pipeline should have higher relative throughput/concurrency.
                This helps to avoid having input messages back up behind them and cause unexpected processing delays ("why did that one email just take 5
                hours to send even though I sent it through the 'send immediately' pipeline? Oh, because it got stuck behind
                5,000 'send-whenever' emails...", for instance).  It also helps to avoid throttles.
            </li>
            <li style={myStyle}>
                The final output pipe is only so large, i.e. there isn't any point in trying to combine 10 large pipes into 1 equally large pipe.<br /><br />
                In the event processing world, the parallel here is that the services handling the output from your
                event pipeline are going to have limits of their own, so it doesn't really make sense to produce output
                a billion times faster than it could ever be used in the end. <br /><br /> For instance, at swiftmissive we are sending emails at
                a certain rate (the default is 14/second).  If we were to queue up emails at a rate of 100,000 emails per second,
                then we would still only be sending 14 emails per second, and the queue would grow by 99,986 emails every second.
                Each second's worth of queued emails would then take about 7,000 seconds (100,000 / 14) to send.  Since there is no additional gain to be had by
                leaving a bunch of messages waiting in a queue, we can instead pick a maximum reserved concurrency for our intermediate
                lambdas that produces an overall output of about 14 or so messages per second (as measured at the end of the pipeline).
                That way we don't waste any overall account concurrency, don't kill our database by running more intermediate lambdas
                than we need, and can have monitoring on our queues that will help us detect when something is starting to back up our event processing pipeline.
            </li>
            <li style={myStyle}>
                You can only fit so many pipes in the house, just like you can only have so many lambdas running concurrently
                in an AWS account (1,000 by default).<br /><br/>When you are setting up your system, I suggest
                that you use only up to this default 1,000 concurrency limit at first, even if your pipeline doesn't
                produce output at the desired rate.  The idea is to get a rough idea of the <i>relative</i> concurrency
                that you want between your lambdas, which you can do by monitoring for throttles and timeouts, and to
                determine how beefy you want each of your lambdas to be as well (e.g. 128 MB of memory vs. 1024 MB).
                Then, when you have the relative concurrencies and memory/vCPUs figured out, you can
                request a limit increase from AWS for your overall account and scale up the reserved
                concurrencies across your entire pipeline by some percentage.<br /><br /> Keep in mind that not everything
                will scale linearly, though.  For instance, the load on the database might grow much faster than linearly even if
                the number of heavy-querying lambdas is only increased linearly.  Make sure to test how things
                are scaling once you get your overall account concurrency increased and start to bump the numbers
                on each of your lambdas.
            </li>
        </ul>
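        <p style={myStyle}>
            To make the second point a bit more concrete, here is a back-of-the-envelope TypeScript helper for picking a starting
            reserved concurrency from a target output rate.  The formula (rate times average duration, divided by items handled per invocation)
            is my own rule of thumb rather than anything official, concurrencyFor is a hypothetical name, and the durations below are made-up example numbers.
            Treat the result as a first guess to refine with monitoring.
        </p>
        <pre><code>{`
// Rough sizing rule of thumb (an assumption, not an AWS formula):
// concurrency ~= target items per second * average seconds per invocation / items per invocation
function concurrencyFor(
    targetPerSecond: number,
    avgSecondsPerInvocation: number,
    itemsPerInvocation: number = 1,
): number {
    return Math.ceil((targetPerSecond * avgSecondsPerInvocation) / itemsPerInvocation);
}

// 14 emails/second, each invocation sends 1 email and takes about half a second:
concurrencyFor(14, 0.5);   // 7

// 14 emails/second, each invocation handles a batch of 10 and takes about 2 seconds:
concurrencyFor(14, 2, 10); // 3
`}</code></pre>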
        <p style={myStyle}>
            And finally, one last suggestion that the plumbing analogy doesn't work very well for...  I've read somewhere
            that AWS delivers messages to lambdas from SQS via 5 parallel threads, so you should
            ideally aim for a minimum reserved concurrency of 5 in your SQS-sourced lambdas, if possible, to reduce throttles.
            I think this only applies to SQS-sourced lambdas, though, and I think the worst that can happen if you ignore it is some extra throttling.
            It's really just a general guideline.  In the case of cloudwatch-triggered crons, I would actually recommend not
            having a concurrency of 5 in most cases.
        </p>
        <h2 style={myStyle}>
            Throttles
        </h2>
        <p style={myStyle}>
            Throttling occurs when AWS' behind-the-scenes input-delivery system tries to deliver inputs to lambdas
            that are already operating at their maximum scale/capacity.  For example, if you have some lambdas already
            running at their maximum concurrency, but you still have inputs in your queue, then AWS will try to
            deliver those inputs anyway, and each failed delivery will show up as a "throttle" in the Lambda console.
        </p>
        <p style={myStyle}>
            Throttling is really not the end of the world, but it may signal that you should adjust the earlier steps
            in your pipeline so that they produce less output (by reducing reserved concurrency for example),
            which will result in fewer throttles. Here are a couple of other things to be aware of:
        </p>
        <ul>
            <li style={myStyle}>
                Most input sources will try repeatedly to deliver a message in the event of a throttle.  SNS
                will retry for <a href={"https://docs.aws.amazon.com/sns/latest/dg/sns-message-delivery-retries.html"} target={"_blank"}>up to 23 days</a>,
                while SQS will retry for <a href={"https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-basic-architecture.html"} target={"_blank"}>
                up to 14 days.</a>&nbsp;
                I'm not currently sure what happens if a cloudwatch cron-event hits a throttle.
            </li>
            <li style={myStyle}>
                You should set up a DLQ if you are worried about messages not being processed fast enough.
                This will extend the amount of time you have to discover the problem and do something about it by about 14
                days.  I recommend setting up alerts on your DLQs if you go this route.
            </li>
            <li style={myStyle}>
                Throttles count towards a message's <a href={"https://aws.amazon.com/blogs/aws/amazon-sqs-new-dead-letter-queue/"} target={"_blank"}>
                max-receive-count.</a>  So, if you have a DLQ attached to an SQS queue,
                and the lambda processing the SQS queue's messages is struggling to keep up with its input, then you might end up
                with messages in your DLQ even if they were never touched by your lambda.  Keep this in mind when setting the
                max-receive-count on your queues so that you don't end up skipping messages that your lambda is just taking
                a long time to get around to.
            </li>
        </ul>
        <h2 style={myStyle}>
            Logging
        </h2>
        <p style={myStyle}>
            Lambda logging is, unfortunately, a nightmare.
        </p>
        <p style={myStyle}>
            I have two main problems with it.  First, log groups are super annoying to navigate, and nearly impossible to combine in any sane
            way in order to tell how an event is flowing through all parts of a system.  Second, cloudwatch and log insights (which allows you to
            search text across log groups) are expensive.  I tried quite a few different ways to get around this
            but didn't find anything good that was also cheap, so I stuck with the default cloudwatch log collector and
            cloudwatch insights in the end.  The alternatives I tried were:
        </p>
        <ul>
            <li style={myStyle}>
                Creating an ec2 box with a simple logging server on there.  This worked pretty well, but realistically there
                is a limit to the number of calls you can make to a box like this per second.  Once you start scaling up a
                server just so that you can do logging, you start to realize you are pretty much just creating a version
                of New Relic by yourself.  This did end up being probably the cheapest thing I tried, though.  Plus, you can stream
                from an ec2 box to watch things happening in your system in real time until you hit the networking limit and it
                starts dropping messages.
            </li>
            <li style={myStyle}>Creating an SQS queue and writing all log messages to it.  I'd then have a script that would pull the messages
                from SQS, sort them (using microsecond-granularity timestamps), and output them to a file.  I could
                also get a good idea of what was going on in the entire system this way, but even non-FIFO queues
                ended up being quite expensive using this method.
            </li>
            <li style={myStyle}>
                Setting up kinesis and pushing all the logs to a kinesis stream.  This was pretty effective and
                pretty cheap.  As I ran longer tests while streaming the data via a custom script, though, after
                a day or so the connection would break and I'd end up having to seek to the current spot from
                the beginning of the stream.  This took a very long time and if I remember correctly sometimes
                never even completed.  To be fair I didn't spend much time on this method so I'm sure there
                are ways to work around most of the issues.  But I was looking for something
                that would save me time and money over the other methods (including the default of cloudwatch) and
                it didn't seem like the learning curve here was worth it.  I'd probably recommend this approach if
                you do go for a roll-your-own logging solution, though.
            </li>
            <li style={myStyle}>
                Piping logs to New Relic.  This ended up being probably the easiest logging method to set up but was also quite
                expensive for my poor little startup.  Remember to set up some kind of age-off for your logs in cloudwatch if you go this route,
                or you'll end up paying for their storage even though you're not using them.
            </li>
        </ul>
        <p style={myStyle}>
            Using the alternatives, I found quite a few bugs that I would otherwise have missed, because I could finally look at my entire test-level
            system all at once.  But in the end the roll-your-own solutions were just too much maintenance and the
            prebuilt systems were too expensive, so I ended up just outputting to STDOUT and letting things end up
            in cloudwatch.  I begrudgingly recommend either doing that as well,
            or just using a prebuilt logging system like New Relic or Datadog if you have the money flowing in.
        </p>
        <p style={myStyle}>
            Outside of logging systems, I was able to whip up a quick
            custom logging library that helped a lot with reading the logs.  Some of the tactics I used
            might help you as well.  In my library I could:
        </p>
        <ul>
            <li style={myStyle}>Set a global prefix for the lambda being run.  A mailer lambda might have a prefix of [M] for example.
                This helped dig through logs where a bunch of lambdas were outputting simultaneously.</li>
            <li style={myStyle}>Set additional thread/message prefixes on the fly that last the remainder of the thread.
                This helped with tracing through batches of inputs or threads that I spun off within the context of processing each input.</li>
            <li style={myStyle}>Colorize output using <a href={"https://www.lihaoyi.com/post/BuildyourownCommandLinewithANSIescapecodes.html#colors"} target={"_blank"}>
                ANSI color codes.</a>  In my case, a simple wrapper for the default logger that provided Errorf(), Warnf(), and
                Infof() functions that would colorize messages before output turned out to be invaluable.  It is especially useful
                when running code/tests locally, but you can also view colorized output in
                cloudwatch log-groups using <a href={"https://github.com/ilhan-mstf/colorize_cloudwatch_logs"} target={"_blank"}>this chrome plugin</a>.
                Unfortunately the plugin doesn't let you view colorized output in the cloudwatch insights area, but you can
                modify it somewhat easily to do so.  I sort of just hacked at the plugin until it colorized the insights area (and broke the log group
                version in the process) so I'm too embarrassed by my modified solution to provide it online, sorry ;p
            </li>
        </ul>
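        <p style={myStyle}>
            For the curious, here is a small TypeScript sketch of those three tactics.  It is not my actual library, just an
            illustration: createLogger, withPrefix and the prefix strings are hypothetical, and the escape character is built with
            String.fromCharCode(27) so the snippet contains no backslash escapes.
        </p>
        <pre><code>{`
// Hypothetical sketch of a prefixing, colorizing logger.
const ESC = String.fromCharCode(27); // the ANSI escape character
const RESET = ESC + "[0m";
const COLORS = {
    error: ESC + "[31m", // red
    warn: ESC + "[33m",  // yellow
    info: ESC + "[32m",  // green
};

type Level = keyof typeof COLORS;

function createLogger(globalPrefix: string, write: (line: string) => void = console.log) {
    let messagePrefix = "";
    const log = (level: Level, msg: string): void =>
        write(COLORS[level] + globalPrefix + messagePrefix + " " + msg + RESET);
    return {
        // Set a prefix that applies to every later line, e.g. the id of the message being processed.
        withPrefix(prefix: string): void {
            messagePrefix = prefix;
        },
        error: (msg: string): void => log("error", msg),
        warn: (msg: string): void => log("warn", msg),
        info: (msg: string): void => log("info", msg),
    };
}

const logger = createLogger("[M]");
logger.withPrefix("[msg abc-123]");
logger.info("sending email"); // green line: [M][msg abc-123] sending email
logger.error("smtp timeout"); // red line:   [M][msg abc-123] smtp timeout
`}</code></pre>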
        <h2 style={myStyle}>
            For larger-scope tests, "test in prod"
        </h2>
        <p style={myStyle}>
            I'm honestly not sure how far the serverless framework and similar run-cloud-stuff-locally frameworks have
            come since I tried them, but last I checked (maybe 2018ish) they were not very good at doing everything that the
            cloud does in the way that the cloud does it.  To get close, you basically had to create an entire separate
            set of scripts and docker containers that you have to maintain but never actually use in production, which
            almost defeats the purpose. <br /><br />
            So, if you really want to test your code I recommend "testing
            in prod" instead.  What this means is that, as long as you are using an infrastructure-as-code framework
            like <a href={"https://www.terraform.io/"} target={"_blank"}>Terraform</a> or
            <a href={"https://aws.amazon.com/cloudformation/"} target={"_blank"}>CloudFormation</a>, it should
            be easy to spin up an exact copy of your production infrastructure in a new account of your own.  Then, you can
            go into the account and run whatever tests you want.  When you are done testing, you can spin down your
            architecture to save costs.  Or, if you are using all lambdas, it's almost free to just leave them
            sitting there unused (but not exactly $0, because of the way they check input queues every 20 seconds).<br /><br />
            Be very careful if you go with this approach:
        </p>
        <ul>
            <li style={myStyle}>You might accidentally spin down your prod architecture.  If you are someone with production-level access,
            you should find a way to be positive this doesn't happen, as it could destroy your entire company ;p  It
            can help to:<br/><br/>
                <ul>
                    <li style={myStyle}>
                        Create scripts that do the calls for you (since apply and destroy calls will decide which environment to
                        alter based on which directory you're in and you might not be in the right directory when you just run destroy).
                        Think spinuptestenv.sh and spindowntestenv.sh for example.
                    </li>
                    <li style={myStyle}>
                        For your critical resources, try to add code that will prevent the resource (e.g. a database) from
                        being replaced or destroyed.  This is a good idea regardless of how you're testing, really.<br /><br />

                        In terraform you can use the prevent_destroy lifecycle setting like this.  It will
                        not only stop accidental destruction but will also make terraform fail with an error, instead of doing something stupid
                        like bouncing an ec2 server because there is a new AMI available for ec2, whenever it
                        thinks it needs to recreate the resource.<br /><br />
                        <code>
                            resource "aws_db_instance" "example" {"{"} <br />
                            &nbsp;&nbsp;# ...<br />
                            &nbsp;&nbsp;<br />
                            &nbsp;&nbsp;lifecycle {"{"}<br />
                            &nbsp;&nbsp;&nbsp;&nbsp;prevent_destroy = true<br />
                            &nbsp;&nbsp;{"}"}<br />
                            {"}"}<br />
                        </code>
                    </li>
                </ul>
            </li>
        </ul>
        <h2 style={myStyle}>
            Use lambdas like little ec2 boxes via lambda layers
        </h2>
        <p style={myStyle}>
            Lambda layers are a pretty interesting addition to the lambda toolset.  Basically what they allow you to do
            is to package Linux executables into an archive and then to have its contents available on disk (under /opt) to your
            lambda code whenever it is running.  If you need to scp a file somewhere but your language
            of choice doesn't have a very good scp library, for example, then you could potentially create a lambda
            layer with scp on there, and then call scp like a shell command in order to move the file.
        </p>
        <p style={myStyle}>
            From what I've seen this can basically turn your lambdas into little nano-sized ec2 servers where you can
            build and push code artifacts, do imagemagick image manipulation, etc, all from within a lambda that
            costs you $0 whenever you aren't using it.  It's a pretty great way to host Linux-based functionality that you would otherwise
            default to an ec2 server for, as long as whatever the lambda is doing takes less than 15 minutes, anyway.
        </p>

        <h2 style={myStyle}>
            Conclusion
        </h2>
        <p style={myStyle}>
            Hopefully this blog post will help some of you get up and running at scale with your lambdas!  Good luck!
        </p>
    </div>
}