Cloud Strategy, Practical examples

Adding New Relic observability to Java (micro)services

January 6, 2021

Our old monolithic applications used to be very simple, and keeping everything in sight was easy. We had one or two databases, one or more app servers, and that’s it! Everything ready to turn into chaos. Modern architecture patterns have one big tradeoff: they require a plethora of components, which makes it harder to keep an eye on such a big environment.

For that reason, we need to plug in applications that help with observability, and even develop our own tools, so we can understand what is going on with our apps. Otherwise we can easily fall into a rabbit hole, inspecting every edge to find the one with the root problem that is slowing our clients down.

The Market Options

Talking about tools, I compared Dynatrace, New Relic, Elastic and Splunk. For now, New Relic is the chosen one simply due to budget. Elastic seems to lack some features but is catching up fast. The AI-powered features of Splunk, Dynatrace and New Relic are amazing.

Adding New Relic to your microservice

In this article I’m gonna cover the addition of New Relic observability to Java microservices.

  1. Create your free account and grab your account ID – The free account will allow you to upload up to 100 GB of data per month.
  2. Then you’ll need the yaml with configurations for your account and app:
    • Once you are logged in, click on “APM” on the top bar, then click “Add more” on the top right area. This is gonna open a tab with several options. Just click Java and you will see something like this:
  3. Also use the command with the black background to download the New Relic Java agent you’re gonna need in a later step.
  4. Once you have downloaded the zip file, unzip it and grab the newrelic.jar.
  5. Then place the generated yaml file along with the jar in a folder that is accessible by your project.
  6. Finally, change your Dockerfile to include the -javaagent option.
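
To make step 6 more concrete, here is a minimal sketch of the start command your Dockerfile's ENTRYPOINT would end up running. The folder /opt/newrelic and the jar name service.jar are hypothetical placeholders, not taken from the original project. As far as I know, the agent looks for newrelic.yml in the same folder as newrelic.jar, which is why both files are kept together.

```shell
#!/usr/bin/env bash
# Builds the JVM start command with the New Relic agent attached.
# Both arguments are hypothetical paths; adjust them to your image layout.
newrelic_java_cmd() {
  local agent_dir="$1"   # folder that holds newrelic.jar and newrelic.yml
  local app_jar="$2"     # your microservice's runnable jar
  printf 'java -javaagent:%s/newrelic.jar -jar %s\n' "$agent_dir" "$app_jar"
}

# Print the command for an example layout.
newrelic_java_cmd /opt/newrelic /app/service.jar
```

In a Dockerfile, the same arguments would go into the ENTRYPOINT line, after copying the jar and the yaml into the image.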

PS: in this example I’m adding a specific version of the newrelic jar. For production purposes, I recommend storing the jar in your own library server, such as Nexus or JFrog.

Watching your app

Minutes after deploying your pod, you’ll be able to log in to New Relic and start seeing the dashboards built for your application.

Important metrics that can be seen right away, without any additional configuration, are:

  • Throughput (how many requests per minute your app is handling).
  • App server response time (the lower the better).
  • Most called URLs.
  • Time spent processing your requests, broken down by layer: app server, database, response time, etc.
  • Hosts where your app is running (pods).
Cloud Strategy, IT is business, Practical examples

Make your Jenkins as code and gain speed

November 8, 2020

TL;DR: example of Jenkins as code. There’s a step-by-step guide to configure your Jenkins as code using Ansible and the configuration-as-code plug-in at the end of the article. The final version of the OS will have Docker, Kubectl, Terraform, Liquibase, HAProxy (for TLS), Google SSO instructions, and Java installed for running the pipelines.

Why have our Jenkins coded?

One key benefit of having infrastructure and OS level coded is the safety it gives to the software administrators. Think with me: what happens if your Jenkins stops working suddenly? What if something happens and nobody can log into it anymore? If these questions give you chills, let’s code our Jenkins!

What we will cover

  1. This article covers the tools presented in the image above:
    • Vagrant for local tests.
    • Packer for creating your OS image with your Jenkins ready to use.
    • Ansible for installing everything you need on your OS image (Jenkins, Kubectl, Terraform, etc).
    • JCasC (Jenkins Configuration as Code) to configure your Jenkins after it is installed.
    • You can also find some useful content for the Terraform part here and here.

See all the code for this article here:

Special thanks to the authors of the many Ansible roles I was able to find on GitHub, and to geerlingguy for many of the playbooks we’re using here.

1. How to run it

Running locally with Vagrant to test your configuration

The Vagrantfile is used for local tests only, and it is a pre-step before creating the image on your cloud with Packer.

Vagrant commands:

  1. Have (1) Vagrant installed (sudo apt install vagrant) and (2) Oracle’s VirtualBox.
  2. How to run: navigate to the root of this repo and run sudo vagrant up. This will create a virtual machine and install everything listed on the Vagrantfile. After everything is complete, Jenkins will be accessible from your host machine at localhost:5555 and localhost:6666.
  3. How to SSH into the created machine: run sudo vagrant ssh
  4. How to destroy the VM: run sudo vagrant destroy

Using Packer to build your AMI or Azure VM image

Packer is a tool to create an OS image (VM image on Azure or AMI on AWS).

Running Packer:

  1. packer build -var 'client_id=<client_id>' -var 'client_secret=<client_secret>' -var 'subscription_id=<subscription_id>' -var 'tenant_id=<tenant_id>' packer_config.json
  2. Once you have your AMI or Azure VM image created, go to your cloud console and create a new machine pointing to the newly created image.

Check out the file packer_config.json to see how Packer will create your OS image, along with the Azure instructions for it.

PS: This specific packer_config.json file is configured to create an image on Azure. You can change it to run on AWS if you have to.

2. Let’s configure our Jenkins as Code!

I’m listing here a few key configurations among the several you will find in each of these Ansible playbooks:

  1. Java version: on ansible_config/site.yml
  2. Liquibase version: on ansible_config/roles/ansible-role-liquibase/defaults/main.yml
  3. Docker edition and version
  4. Terraform version
  5. Kubectl packages (adding kubeadm or minikube as an example) on ansible_config/roles/ansible-role-kubectl/tasks/main.yml
  6. Jenkins configs (I will comment further)
  7. HAProxy for handling TLS (https) (will comment further)

3. Configuring your Jenkins

Jenkins pipelines and credentials files

This Jenkins is configured automatically using the Jenkins Configuration as Code plugin. All the configuration is listed in the file jenkins.yaml in the repo root. In that file, you can add your pipelines and the credentials for those pipelines to consume. Full documentation and possibilities can be found here:

Below is the example you will find on the main repo:

  1. You can define your credentials on block one. There are a few possible credential types here. Check them all on the plugin’s docs
  2. With this, we create a folder
  3. Item 3 creates one pipeline job as example fetching it from a private GitLab repo that uses the credentials defined in item 1

Jenkins configuration

The plugins that this Jenkins will have installed can be found at: ansible_config/roles/ansible-role-jenkins/defaults/main.yml. If you need to get your current installed plugins, you can find how-to here:

On the image below we can see:

  1. Your hostname: change it to a permanent hostname instead of localhost once you are configuring TLS
  2. The plugins list you want to have installed on your Jenkins

You can change Jenkins default admin password on file ansible_config/roles/ansible-role-jenkins/defaults/main.yml attribute “jenkins_admin_password”. Check the image below:

  1. You can change admin user and password
  2. Another configuration you will change when activating TLS (https)

Jenkins’ configuration-as-code plug-in:

For JCasC to work properly, the file jenkins.yml in the project root must be added to Jenkins’ home (default /var/lib/jenkins/). This example has the keys to be used on pipelines and the pipelines as well. There are a few more options on JCasC docs.

Activating TLS (https) and Google SSO

  1. As shown in the images of the “Jenkins configuration” step: go to ansible_config/roles/ansible-role-jenkins/defaults/main.yml. Uncomment line 15 and change it to your final URL. Comment out line 16.
  2. Go to ansible_config/roles/ansible-role-haproxy/templates/haproxy.cfg. Change line 33 to use your organization’s final URL.
  3. Rebuild your image with Packer (IMPORTANT! Your new image won’t work locally because you changed the Jenkins configuration).
  4. Go to your cloud console and deploy a new instance using the image you just created.

3.1 – TLS: Once you have your machine up and running, connect through SSH to perform the last manual steps, TLS and Google SSO authentication:

  1. Generate the .pem certificate file with the command cat > fullkey.pem. Remember to remove the empty row that is kept inside the generated fullkey.pem between the two certificates. To look at the file, use cat fullkey.pem.
  2. Move the generated file to your running instance’s folder /home/ubuntu/jenkins/.
  3. Restart HAProxy with sudo service haproxy restart.

Done! Your Jenkins is ready to run under https with valid certificates. Just point your DNS to the running machine and you’re done.
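
A minimal sketch of the certificate step above, assuming you have your certificate and your private key in two separate files (the file names in the example call are hypothetical). It joins them into one .pem file and drops any empty rows on the way, so there is nothing left to clean up by hand.

```shell
#!/usr/bin/env bash
# Joins a certificate file and a key file into a single PEM file,
# removing empty (or whitespace-only) rows in between.
build_fullkey_pem() {
  local cert_file="$1"   # certificate (hypothetical name, e.g. mydomain.crt)
  local key_file="$2"    # private key (hypothetical name, e.g. mydomain.key)
  local out_file="$3"    # resulting file, e.g. fullkey.pem
  cat "$cert_file" "$key_file" | grep -v '^[[:space:]]*$' > "$out_file"
}

# Example call (commented out because the input files are placeholders):
# build_fullkey_pem mydomain.crt mydomain.key fullkey.pem
```

After that, the generated file can be moved to /home/ubuntu/jenkins/ as described in step 2.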

3.2 – Google SSO:

  1. Log in to Jenkins using regular admin credentials. Go to “Manage Jenkins” > “Global Security”. Under “Authentication” select “Login with Google” and fill in like below:
  • Client id = client_id generated on your G Suite account.
  • Client secret = client_secret
  • Google Apps Domain =

PS: More information on how to generate a client ID and client secret on the plugin’s page:

Cloud Strategy, Practical examples

Build Azure Service Bus Queues using Terraform

September 12, 2020

TL;DR: 7 resources will be added to your Azure account. 1 – Configure Terraform to save state lock files on Azure Blob Storage. 2 – Use Terraform to create and keep track of your Service Bus Queues

You can find all the source code for this project on this GitHub repo:

Azure Service Bus has two ways of interacting with it: Queues and Topics (SQS and SNS on AWS respectively). Take a look at the docs on the difference between them and check which one fits your needs. This article covers Queues only.

What are we creating?

The GRAY area on the image above shows what this Terraform repo will create. The retry queue automation on item 4 is also created by this Terraform. Below is how the information should flow in this infrastructure:

  1. Microservice 1 generates messages and posts them to the messagesQueue.
  2. Microservice 2 listens to messages from the Queue and processes them. If it fails to process a message, it posts the message back to the same queue (up to 5 times).
  3. If it fails more than 5 times, it posts the message to the Error Messages Queue.
  4. The Error Messages Queue automatically posts back the errored messages to the regular queue after one hour (this parameter can be changed on file modules/queue/
  5. Whether there’s an error or success, Microservice 2 should always post log information to Logging Microservice
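
The retry rule in items 2 and 3 can be sketched as a small decision function. This is only an illustration of the flow, not code from the repo: the function name and the errorMessagesQueue name are made up, and a real Microservice 2 would do this with the Service Bus SDK of its own language.

```shell
#!/usr/bin/env bash
# Decides where a message goes after processing failed.
# MAX_RETRIES mirrors the "up to 5 times" rule described above.
MAX_RETRIES=5

route_failed_message() {
  local failed_attempts="$1"   # how many times this message has failed so far
  if [ "$failed_attempts" -le "$MAX_RETRIES" ]; then
    echo "messagesQueue"        # post back to the regular queue for a retry
  else
    echo "errorMessagesQueue"   # hypothetical name for the Error Messages Queue
  fi
}

route_failed_message 3
route_failed_message 6
```

The failure count has to travel with the message (for example as a custom property), otherwise Microservice 2 has no way to know which attempt it is on.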

Starting Terraform locally

To keep track of your Infrastructure with Terraform, you will have to let Terraform store your tfstate file in a safe place. The command below will start Terraform and store your tfstate in Azure Blob Storage. Use the following command to start your Terraform repo:

terraform init \
    -backend-config "container_name=<your folder inside Azure Blob Storage>" \
    -backend-config "storage_account_name=<your Azure Storage Name>" \
    -backend-config "key=<file name to be stored>" \
    -backend-config "subscription_id=<subscription ID of your account>" \
    -backend-config "client_id=<your username>" \
    -backend-config "client_secret=<your password>" \
    -backend-config "tenant_id=<tenant id>" \
    -backend-config "resource_group_name=<resource group name to find your Blob Storage>"

If you don’t have the information for the variables above, take a look at this post to create your user for your Terraform+Azure interaction.

If everything goes well, you should get a screen similar to the one below, and we are ready to plan our infrastructure deployment!

Planning your Service Bus deploy

The next step is to plan your deployment. Use the following command so Terraform can prepare to deploy your resources:

terraform plan \
     -var 'client_id=<client id>' \
     -var 'client_secret=<client secret>' \
     -var 'subscription_id=<subscription id>' \
     -var 'tenant_id=<tenant id>' \
     -var-file="rootVars.tfvars" \
     -var-file="rootVars-<environment>.tfvars" \
     -out tfout.log

Some of the values above are the same ones we used in terraform init, so go ahead and copy them. The rest are:

  • -VAR-FILE – The first var file has common variables for all our environments.
  • -VAR-FILE – The second var file has values specific to the current environment. Take a look at the rootVars-<all>.tfvars files.
  • TFOUT.LOG – This is the name of the file in which Terraform will store the plan to achieve your Terraform configuration.
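
If you switch between environments often, a tiny helper can pick the right var files for you. This is a sketch, assuming the file naming used in the command above (rootVars.tfvars plus rootVars-<environment>.tfvars); the function name is made up for illustration. Terraform loads -var-file options in the order they are given, so values in the environment file override the common ones.

```shell
#!/usr/bin/env bash
# Prints the two -var-file options for a given environment.
# The common file comes first so the environment file can override it.
plan_var_files() {
  local environment="$1"   # e.g. dev, qa, prod
  printf '%s %s\n' \
    "-var-file=rootVars.tfvars" \
    "-var-file=rootVars-${environment}.tfvars"
}

plan_var_files dev
```

The output can be appended to the terraform plan call, together with the -var options and -out tfout.log.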

If everything goes well, you’ll have a screen close to the one below, and we’ll be ready to finally create your Service Bus Queues!

Take a look at the “outputs” section. This is the information Terraform is gonna return so our DEV team can use it.

Deploying your Service Bus infrastructure

All the hard work is done. Just run the command below, wait for it to finish, and your Service Bus Queues will be created.

terraform apply tfout.log

Once the deployment is done you should see a screen like this:

Once you are done you have the connection strings so the DEV team can configure the microservices to use your Queue.

To read more on how to integrate your applications to the Queues, Microsoft has the docs for Java, Node, PHP, and a few others.

Cloud Strategy, Practical examples

Build and configure an AKS on Azure using Terraform

September 9, 2020

TL;DR: 3 resources will be added to your Azure account. 1 – Configure Terraform to save state lock files on Azure Blob Storage. 2 – Use Terraform to create and keep track of your AKS. 3 – How to configure kubectl locally to set up your Kubernetes.

This article follows the best practices and benefits of infrastructure automation described here. Infrastructure as code, immutable infrastructure, more speed, reliability, auditing and documentation are the concepts this article will help you achieve.

You can find all the source code for this project on this GitHub repo:

Creating a user for your Azure account

Terraform has a good how-to for you to authenticate. In this link you’ll find how to retrieve the following needed authentication data:

subscription_id, tenant_id, client_id, and client_secret.

To find the remaining container_name, storage_account_name, key and resource_group_name, create your own Blob Storage container in Azure and use the names as suggested below:

  • The top red mark is your storage_account_name.
  • In the middle you have your container_name.
  • The last one is your key (file name).

Starting Terraform locally

To keep track of your Infrastructure with Terraform, you will have to let Terraform store your tfstate file in a safe place. The command below will start Terraform and store your tfstate in Azure Blob Storage. So navigate to folder tf_infrastructure and use the following command to start your Terraform repo:

terraform init \
    -backend-config "container_name=<your folder inside Azure Blob Storage>" \
    -backend-config "storage_account_name=<your Azure Storage Name>" \
    -backend-config "key=<file name to be stored>" \
    -backend-config "subscription_id=<subscription ID of your account>" \
    -backend-config "client_id=<your username>" \
    -backend-config "client_secret=<your password>" \
    -backend-config "tenant_id=<tenant id>" \
    -backend-config "resource_group_name=<resource group name to find your Blob Storage>"

If everything goes well, you should get a screen similar to the one below, and we are ready to plan our infrastructure deployment!

Planning your deploy – Terraform plan

The next step is to plan your deploy. Use the following command so Terraform can prepare to deploy your resources:

terraform plan \
    -var 'client_id=<client_id>' \
    -var 'client_secret=<client_secret>' \
    -var 'subscription_id=<subscription_id>' \
    -var 'tenant_id=<tenant_id>' \
    -var 'timestamp=<timestamp>' \
    -var 'acr_reader_user_client_id=<User client ID to read ACR>' \
    -var 'acr_reader_user_secret_key=<User secret to read ACR>' \
    -var-file="<your additional vars file name. Suggestion: rootVars-dev.tfvars>" \
    -out tfout.log

Some of the values above are the same ones we used in terraform init, so go ahead and copy them. The rest are:

  • TIMESTAMP – this is the timestamp of when you are running this terraform plan. It is intended to help with the blue/green deployment strategy. The timestamp is a simple string that will be added to the end of your resource group name. The resource group name will have the following format: “fixedRadical-environment-timestamp”. You can check how it’s built on file tf_infrastructure/modules/common/
  • ACR_READER_USER_CLIENT_ID – This is the client_id used by your Kubernetes to go and read the ACR (Azure Container Registry) to retrieve your docker images for deployment. You should use a new one with fewer privileges than the main client_id we’re using.
  • ACR_READER_USER_SECRET_KEY – This is the client secret (password) of the above client_id.
  • -VAR-FILE – Terraform allows us to add variables in a file instead of on the command line like we’ve been using. Do not store sensitive information inside this file. You have an example on tf_infrastructure/rootVars-dev.tfvars file
  • TFOUT.LOG – This is the name of the file in which Terraform will store the plan to achieve your Terraform configuration.
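
To make the TIMESTAMP item more tangible, here is a sketch of how the “fixedRadical-environment-timestamp” name comes together. It only mimics the format in the shell; the real name is built inside the Terraform module, and the date format used for the timestamp is my own assumption.

```shell
#!/usr/bin/env bash
# Mimics the resource group naming format "fixedRadical-environment-timestamp".
resource_group_name() {
  local radical="$1"       # fixed part of the name
  local environment="$2"   # e.g. dev
  local timestamp="$3"     # simple string passed as -var 'timestamp=...'
  printf '%s-%s-%s\n' "$radical" "$environment" "$timestamp"
}

# Assumed timestamp format: year, month, day, hour, minute, second.
timestamp="$(date +%Y%m%d%H%M%S)"
resource_group_name fixedRadical dev "$timestamp"
```

Because every plan gets a new timestamp, a blue/green deployment ends up with two resource groups side by side, and the old one can be destroyed once the new one is validated.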

If everything goes well, you’ll have a screen close to the one below, and we’ll be ready to finally create your AKS!

Take a look at the “node_labels” tag on AKS and also on the additional node pool. We will use this in the Kubernetes config file below to tell Kubernetes in which node pool to deploy our Pods.

Deploying the infrastructure – Terraform apply

All the hard work is done. Just run the command below and wait for about 10 minutes and your AKS will be running

terraform apply tfout.log

Once the deployment is done you should see a screen like this:

Configuring kubectl to work connected to AKS

Azure CLI does the heavy lifting on this part. Run the command below to point your kubectl command-line tool to the newly deployed AKS:

az aks get-credentials --name $(terraform output aks_name) --resource-group $(terraform output resource_group_name)

If you don’t have the Azure CLI configured yet, follow the instructions here.

Applying our configuration to Kubernetes

Now navigate back on your terminal to the folder kubernetes_deployment. Let’s apply the commands and then run through the files to understand what’s going on:

1. PROFILE=dev
2. kubectl apply -f k8s_deployment-dev.yaml
3. kubectl apply -f


PROFILE=dev – this sets an environment variable on your terminal to be read by kubectl and applied to the docker containers. I used a Spring application, so you can see it being used on k8s_deployment-dev.yaml here:

  1. Kubernetes will grab our PROFILE=dev environment variable and pass on to Spring Boot.
  2. The path where Kubernetes will pull our images from using ACR credentials.
  3. Liveness probe teaches Kubernetes how to understand if that container is running or not.
  4. NodeSelector tells Kubernetes in which node pool (using the node_labels we highlighted above) the Pods should run.

Configure K8S

kubectl apply -f k8s_deployment-dev.yaml

Kubernetes allows us to store all our configuration in a single file. This is the file. You will see two deployments (pods instructions): company and customer. Also, you will see one service that exposes each of them: company-service and customer-service.

  • The services (example below) use the ClusterIP strategy. It tells Kubernetes to create an internal Load Balancer to balance requests to your pods. The port tells which port on the service receives requests, and the targetPort tells which port on the pods will handle them. More info here.
Services example
  • Ingress strategy is the most important part:
  1. nginx is the class for your ingress strategy. It uses nginx implementation to load balance requests internally.
  2. /$1$2$3 is what Kubernetes should forward as the request URL to our pods. $1 means (api/company) highlighted in item 5. $2 means (/|$) and $3 means (.*)
  3. /$1/swagger-ui.html is the default app root for our Pods.
  4. Redirect from www – true – self-explanatory
  5. Path is the URL structure to pass on as variables to item 2
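
To see what $1, $2 and $3 from item 2 actually capture, you can replay the same regular expression locally with sed. This is just a sketch to play with the groups: the leading "/" in the pattern is my assumption about how the path is written in the ingress file, and the request paths are made-up examples.

```shell
#!/usr/bin/env bash
# Splits a request path into the three capture groups used by the rewrite rule.
show_rewrite_groups() {
  local request_path="$1"
  printf '%s\n' "$request_path" |
    sed -E 's#^/(api/company)(/|$)(.*)#1=[\1] 2=[\2] 3=[\3]#'
}

show_rewrite_groups /api/company/list
# prints: 1=[api/company] 2=[/] 3=[list]
```

Putting the three groups back together as /$1$2$3 gives the original path, which is why the Pods receive the same URL the client asked for.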
  • To add TLS to our Kubernetes you have to generate your certificate and paste the key and crt on the highlighted areas below in base64 format. An example on Linux is shown in the first image below. When adding the info to the file, remember to paste it as a single row without spaces, line breaks or other characters. The second image shows where to put the crt and key respectively.
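
A small sketch for the base64 part of the TLS bullet above, assuming your certificate and key are regular files on disk (the file names in the example calls are hypothetical). The tr call removes the line breaks some base64 versions add, so the result is already a single row.

```shell
#!/usr/bin/env bash
# Encodes a file as base64 on a single row, ready to paste into the yaml.
to_single_line_base64() {
  local input_file="$1"
  base64 < "$input_file" | tr -d '\n'
}

# Example calls (commented out because the files are placeholders):
# to_single_line_base64 mydomain.crt
# to_single_line_base64 mydomain.key
```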

Apply nginx Load Balancer

kubectl apply -f 

This will apply nginx version 0.34.1 to handle our ingress strategy.

Testing our Kubernetes deployment

After all this configuration run the command below to wait for Kubernetes to assign an IP to our ingress strategy:

kubectl get ingress --watch

You will get an output like this:

Once you have the IP, you can paste it to Chrome, add the path to your specific service and you will get your application output.

Cloud Strategy, Practical examples

App scaling with operational excellence

June 14, 2020

This is a continuation of the app scaling series showing motivations, clear ways, trade-offs, and pitfalls for a cloud strategy. The last post is an overview of why to prepare and how to start. It can be found here.

As soon as your application starts scaling and more actions are needed every day to evolve the app for new scenarios, automation will come in handy and will soon become a need.

Benefits of automating infrastructure

  • Fastest-possible solution for deploying a new workflow environment. It saves you time for deploying multiple environments like PROD, DEV, and QA.
  • Ensures PROD, QA, and DEV are exactly the same. This will help your engineers to narrow problems and solve issues faster.
  • Immutable infrastructure – the old dark days when nobody knew how a server was still working are gone. With immutable infrastructure, stop using human interference to fix things, and use it only for hot-fixes.
  • Define your workflows as code. Code is more reliable than anyone’s memory.
  • Easily track changes over time (you also achieve more coverage for auditing with this step).
  • Your infrastructure-as-code is documentation you can review and ask for support if needed.

Automating infrastructure

There are a few tools to automate infrastructure creation and each cloud provider has its own: CloudFormation on AWS, Resource Manager templates on Azure, and Cloud Deployment Manager on Google. But you may be in an organization that wants as little lock-in as possible due to past experience. Then HashiCorp’s stack, specifically Terraform here, comes in handy.

Use cases + Tactics

  1. Saving costs with environments – have your DEV and QA environments shut down at the end of every day to save costs on cloud.
  2. Watching infrastructure – since your infrastructure may change during execution (upscaling, downscaling, termination, etc), you can have a job watching specific parts of your app that should be kept in a certain configuration. Example: for scaling systems, you can use Terraform to always have one instance pre-heated and ready to be attached to an auto-scaling (AS) group when the application requires it, instead of waiting for every instance’s configuration to warm up. Once your app needs to scale, that instance will be added to the AS group, and some time after that Terraform will proactively provision a new instance for when the load suffers another spike.
  3. Configuration management – applying the immutable infrastructure concept here, you will have one single source of truth for your environment. Example: you had to apply one hot-fix in production to prevent an error from happening. Right after the hot-fix, update your infrastructure-as-code to include that fix so you won’t forget to replicate it to new environments.
  4. Orchestration – let’s say you have your infrastructure primarily on AWS but are using Google for ML. Terraform will orchestrate the creation of them all for you. It saves you the time of going to each cloud and activating CloudFormation, Cloud Deployment Manager, and so on.
  5. Security and Compliance – having your infrastructure as code will make it easier for your team to ensure they are following all the security and compliance definitions. The code is also versionable, ensuring auditing capabilities.

Example with Terraform

The code found here will deploy the above infrastructure in a matter of a few minutes. It is an example of Terraform provisioning infrastructure according to AWS best practices when still using EC2 instances as your option for computing.

Do not forget to add CloudFront and Route 53 to your stack if you are going to use it in a real environment.

IT is business, Practical examples

The bad products that Covid-19 highlighted and how to not be one of them

May 30, 2020

Amid Covid-19 news, there is one single thing that is driving business leaders nowadays: how can I do business digitally? We are living through a historic evolution in our pattern of consumption. The retail industry (using Via Varejo’s example here) has already changed. Via Varejo managed to keep 70% of its regular revenue even with its more than 1 thousand physical stores closed. And they kept that number through their single store that never closes: the e-commerce.

COVID-19-highlighted pros and cons

How did Via Varejo (and Zoom, to mention a non-retail example) hold all of the infrastructure needed to maintain their results during this crisis? They certainly invested differently in technology and in their product’s software architecture than a few bad examples did. A friend of mine reported a 12-hour virtual line to shop at a supermarket in Lisbon. Even after making his purchase, he had to wait for about a week to receive his groceries at home. Also, Netflix and YouTube reduced their bandwidth consumption in Europe to not overload the bandwidth available in that region. And a worst-case scenario happened to Clear Corretora’s system. It went down a couple of times and now their clients are asking for refunds based on operations they could not do because the software was unavailable.

Software is present in every organization that wants to grow at some scale. It helps the company’s employees be more productive and reduces manual and repetitive work. The software can have the best possible interface, 100% focused on people’s productivity, but as a basic requirement, it must work when it is needed. When software that is vital to an organization fails, the difference between regular software and good software is discovered. And here I am talking about its architecture.

When the architecture makes the difference

To talk about this subject, I’ll cite two real cases of two clients in the media area. They are close to me and had their business growth held for some while due to bad quality software.

Please, no more load

The software in question was a portal giving the Brazilian population access to a well-known TV show. Under normal circumstances, the portal handled everything. But when one of its TV presenters mentioned something about the portal while live on the national network, it was deadly. We just had to count the time, and a few minutes later the portal was down. It could be simple stuff like saying live “access our website for a chance of winning a prize”. Or “access our website to talk to the actress X”. The problem that repeatedly took the portal down was a technical one related to its architecture: it was not prepared to receive as much traffic as it actually did. In this example, we see a powerful trigger to an entire population wasted because of a software malfunction. The population (customers), already convinced to look for something, lost interest automatically. The TV show’s image was negatively affected: it failed to enhance its brand and to increase its engaged audience, and eventually lost some revenue.

Let’s integrate!

The second example is about a scenario where the software was a big aggregator of information coming from many different sources. It was responsible for some important transactions related to the company’s invoices. It operated fine for many months after its first release. But when the database went over a certain load, it started to behave slower than it used to. Also, a big project of renewing the entire UI was underway. But it’s a big failure if something beautiful is released without actually working. The loading screens were too slow and were forcing the user to wait many seconds for some feedback. The impact on business was quite relevant because the selling process for a few of their products was also affected.

Solutions and business relief

For both scenarios, a new architecture is live now. The first one focused heavily on caching. The second focused on shortening the time to get information. But both went through a deep architecture remodeling.

Nowadays the first one has millions of users reaching their portal to interact with the brand. They also have a stable portal, which allows the decision-makers to make wiser decisions than when they were under pressure. The second was able to release the new UI, which enhanced the relationship with demanding B2B customers, and no more transactions are lost. Now the roadmap is welcome again.

Cloud Strategy, IT is business, Practical examples

App scaling with fast and reliable architecture

May 9, 2020

The pandemic reality gave a huge push on digital businesses. The companies succeeding right now have strong strategies for digital interaction. More on: McKinsey, McKinsey, The Economist, and CNN Business. But how are they scaling their business (and their applications) so fast in a reliable manner? Prepare your applications (aka our Digital Products) with a fast and reliable architecture. This is a paradigm shift for most companies.

Why should we prepare to scale

Costs. An application not prepared to scale according to its demand will cost more to keep than others. The cloud advantage must be used to the fullest in this scenario. An example: websites that sell concert tickets spend a huge portion of their time working under low or regular demand. But when a well-known group announces a new concert, thousands of people rush to their environment to get a ticket (find a similar reading here). A useful analogy: if you don’t prepare your application to scale according to demand, it’s like always driving an RV even if you are just going to the supermarket instead of going on vacation. You don’t need to carry 5 or 6 people with you to go to the supermarket, just as your app doesn’t need to be fully armed at 3 am.

As we can see in the image below, people use the internet with different intensity according to time each day.

Seattle internet consumption on the first week of January 2020 (from CloudFlare)

Peaks of usage. Maintaining a portal with around 100 visits per day (like my own) is fine. But a different approach will be needed for one with a million views in the same timeframe. More important than that, be prepared for peaks of usage to maintain the brand’s reliability and the company’s growth. Zoom is an excellent example of successful application scaling. But they are the minority amid hundreds of bad examples that are impacting our lives (for example, check New Jersey’s government asking for help with a very, very old application).

How to prepare a fast and reliable architecture

Architecting for scalability

Use the advantages of the cloud’s existing tools. All cloud players have efficient tools for load balancing the application’s requests. Microsoft’s Load Balancer, Google Load Balancing, and AWS Elastic Load Balancing are very easy to set up. Once the load-balancing rules are defined, auto-scaling groups increase the application’s capacity to handle requests. With auto-scaling groups you can set different behaviors for your app, based both on user demand and on patterns you already know exist (driving an RV at 3 am). If all of this is new to you, keep in mind that new solutions bring new challenges. Listed below are a few things to look at when setting up auto-scaling behavior:

  • Speed to start up a new server – When you need to scale, you will probably need to scale FAST. To help with that, have pre-built images (AWS AMIs) to speed up your new servers’ boot time. Having Kubernetes orchestrate your app will also help with this.
  • Keeping database consistency – Luckily, the big cloud players have solutions that keep databases synchronized between your availability zones almost seamlessly. But once you start working with multiple regions, this becomes one more thing to plan for and handle.
  • Keeping latency low between different regions – Multiple regions can reduce latency for your users, but they introduce latency inside your own infrastructure. This applies whether you are building a disaster recovery plan or just moving infrastructure closer to your users. The latency between regions has to be mitigated both in your databases and in your app’s communications.

The attention points above pay off. Once everything is set up, the cloud can largely look after itself: watching for alerts on CPU, memory, network, and other usage metrics, and triggering self-healing actions, will be part of its day.
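To make the two kinds of auto-scaling behavior more concrete, here is a minimal sketch in Python. It is not any cloud provider’s API: the function names, the 60% CPU target, the overnight window, and the server limits are all assumptions chosen for illustration. It combines a demand-based rule (scale proportionally so average CPU moves toward a target) with a schedule-based floor (the known 3 am pattern).

```python
import math


def scheduled_minimum(hour: int) -> int:
    """Hypothetical known pattern: a small fleet overnight, a larger one by day."""
    return 1 if 0 <= hour < 6 else 3


def desired_servers(current: int, avg_cpu: float, hour: int,
                    target_cpu: float = 60.0, max_servers: int = 20) -> int:
    """Combine demand-based and schedule-based scaling (illustrative only)."""
    # Demand-based: scale the fleet proportionally so that the average CPU
    # usage moves toward the target value.
    by_demand = math.ceil(current * avg_cpu / target_cpu)
    # Schedule-based: never drop below the floor defined for this hour,
    # and never grow beyond the configured maximum.
    floor = scheduled_minimum(hour)
    return max(floor, min(by_demand, max_servers))


if __name__ == "__main__":
    # Busy afternoon: 4 servers at 90% CPU should grow.
    print(desired_servers(current=4, avg_cpu=90.0, hour=14))
    # Quiet night: 4 servers at 15% CPU can shrink to the overnight floor.
    print(desired_servers(current=4, avg_cpu=15.0, hour=3))
```

In a real setup these rules would live in the cloud provider’s auto-scaling configuration rather than in application code; the sketch only shows how the two signals interact.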

Architecting for reliability

To increase your app’s reliability, I list three good strategies to apply:

  • Working at both the infrastructure and the app level. Adding several layers of tests and health checks is the most basic action for reliability.
  • Architecting for multi-region. Failover and disaster recovery plans can use architectures ranging from pilot light (slower to recover), through warm standby, to active/active multi-region (faster). The fastest one (active/active) requires exactly the same infrastructure to be deployed in two regions. An intelligent DNS routing rule also has to be set.
  • Reducing risk with infrastructure deployment automation. Examples are services like CloudFormation (AWS), Azure Resource Manager templates (Microsoft), and Cloud Deployment Manager (Google). They help you create a single infrastructure description to be used across multiple regions.
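As an illustration of what an intelligent routing rule for an active/active setup has to decide, here is a minimal Python sketch. It is not a DNS service’s API: the `Region` type, the region names, and the latency figures are hypothetical. The idea is simply to send each user to the healthy region with the lowest latency and to fail over when a health check reports a region as down.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Region:
    name: str          # hypothetical region identifier
    healthy: bool      # result of the latest health check
    latency_ms: float  # measured latency from the user to this region


def pick_region(regions: Sequence[Region]) -> Optional[str]:
    """Return the healthy region with the lowest latency, or None if all are down."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.latency_ms).name


if __name__ == "__main__":
    regions = [
        Region("us-east", healthy=True, latency_ms=20.0),
        Region("eu-west", healthy=True, latency_ms=90.0),
    ]
    print(pick_region(regions))  # closest healthy region
    regions[0].healthy = False   # simulate a failed health check
    print(pick_region(regions))  # traffic fails over to the other region
```

Managed DNS services provide latency-based and failover routing out of the box, so in practice this logic is configured rather than written by hand.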

Architecture is a living subject, just like digital products are. Pursuing scalability and reliability in the same environment is what gets you to a fast and reliable architecture.

Entrepreneurship, Practical examples

Expanding a company to another country: Language

April 7, 2020

In 2018 I started the biggest challenge I’ve ever faced in my career: expanding the company I work for to the most business-mature country in the world, the United States. Two years have passed, and I want to start sharing the knowledge I have gathered so far. Maybe I’ll shorten the path for the people who read these articles. Since my roles so far have revolved primarily around sales activities, this is the perspective I’ll be applying to these articles. And in this first one I’ll start by talking about the most essential aspect I can think of:

This article at a glance:

Poor communication skills lead to a lack of trust between your team and prospects/partners/any other important stakeholder outside your company.

Cultural differences can create big issues because of how you understand phrases in each culture.

The language

Pff Guilherme, this is obvious. Yes, it is. And yet it’s underestimated.

A simple exploratory conversation

I attend several events during the year to showcase our services and products. At those fancy places, I have the chance to talk to many people. Once I’m there, I interact with people from other countries who are doing the same thing I’m doing: selling and expanding their companies. At one of the events, I saw this guy. He was from Romania. I don’t know how to say “Hi” in Romanian, but I remember his accent was something close to Russian. Since he and his potential client were close to me, I heard two or three sentences of the conversation:

  • Client: “(…) and do you think your team can deliver the POC within two weeks?”
  • Romanian guy: “Oh yes, about the POC, we will develop using part of the data you sent me and also the data we have from our research, like the trends for investing in 2020”

The Romanian guy gave more details of the POC (Proof of Concept) instead of answering the question straight. I could notice at that time that he didn’t understand the question. He just got the acronym “POC” and started to say the first thing that came to his mind. It was not a bad intention. He didn’t want to dodge the question. He just didn’t understand. But yet…

Lack of trust

The main error here (and this is one that I’ve made myself) is losing the opportunity to create trust with the other person. If the salesperson can’t understand what the client is asking, it’s hard for the client to trust that the rest of the company will do any better. And even if the rest of the company understands better, the handover from sales to the technical team will already be compromised. And a problem will be created.

From my experience: it doesn’t matter if it’s a B2B relationship: people buy from people. If you can’t talk to somebody with the same eloquence you have talking to a friend in a bar about the most random subject, you have serious problems. You won’t be able to create trust simply because there will always be a doubt if you and your client are understanding each other 100% and as a consequence, being trustworthy.

Your eloquence will be needed when questions different from the ones you are expecting come up. And they will come. Describing a product or service you’ve been providing for years will be easy even in a different language. But what happens when you have to offer help amid the Coronavirus situation?

Cultural differences

One simple example: in Brazil, if you finish a conversation with somebody saying “let’s talk about it next week”, it doesn’t really mean you want to talk about the subject next week. There’s just a possibility. It might happen or not. When you are in the States and say something like that, you are both automatically setting a real appointment. And you will definitely catch up the following week.

Another example: Eastern cultures (my experience covers just India and China) tend to never say “no”. Even when they want to say no, they will say “yes, but let’s take a look at more alternatives”. Many years ago I had a client from India interviewing a colleague on my team. After the interview, I called the sponsor and asked:

  • Me: “Hey, so what did you think about John (the person just interviewed)? Can we go ahead and set him to start working on your project?”
  • The Indian client: “Yes. And I want to know more people from your team that we can evaluate”

I took John out of his project and told him to start working with the new client ASAP. When I called the client again to ask about John’s credentials, the client asked me: “But what about the other people we were to interview?” That gave me one of the biggest headaches I’ve ever had when managing people. John had already built up big expectations about the new client, was excited, and had also gotten a raise. After a few days, we solved it and John did start working with the new client. But the whole headache could have been prevented with just one more question from my side during that same call, had I known the other side’s communication style:

  • The question that I should have asked: “So, Mr Client, will John actually start working with you guys right away, or do you want to take a look at everybody and then decide on the full team at once?”

Practical examples

3 quick tips and more for managing a remote DEV team

March 31, 2020

Amid the Coronavirus reality we’ve been living in since February, many remote work guides and best practices have emerged. In this article I hope to help software development teams and industries by sharing the experience I have gathered over the past years. The intention here is to start from scratch in case you are not used to working remotely at all, but feel free to head straight to the point you find most interesting.

Tip 1: Training

Never underestimate the effectiveness of a training session before starting with a new methodology. Working remotely might look very simple to you, but your team might see it differently. Skipping training is not an option, and neither is stopping everybody for one week and putting them in a room for 8h/day of training. If you are in a rush, grab the first morning you have and do some training. (This isn’t the focus of this article, but you will find some useful tips here, and here, and also some training tips here). Training on the go is the key to this quick turn. Since developing software is an activity that can easily be done anywhere, you probably won’t find many barriers to your team embracing it. But you do have to take care of productivity…

Tip 2: Pick the ceremonies

The quickest takeaway from this article: hold daily meetings. I even suggest twice a day if 100% remote work is new for you. You should run the daily meetings as they were designed: 15 minutes of people saying what they did, what they will do, and what is blocking them. Pay extra attention to the blockers listed. If none come up, ask some questions to encourage people to raise them. There are always blockers. The retrospectives might be challenging at first, but I know they can happen remotely. So give them more than one shot. The ceremonies, specifically the dailies, will give you input about the team’s performance, which leads us to the next item…

Tip 3: Act

Be proactive. You must be an example of the posture you want to see in your team. Bad performance is easy to perceive in the daily checkpoints. Once you feel somebody is not producing as they could (and by producing I mean both performance and the will to do the job), which will happen, you must act, and do it quickly. If you don’t, it will sound like a message to the whole team to slow down, because nobody cares.

This kind of action is not an inquisition. Clear your mind of judgments about people and base your questions only on the facts you already know. (This article, from step 3 on, will help you with these conversations). Always point out what happened and ask for an opinion. In 99% of cases people know they’re doing something different from what is expected. If the person doesn’t see anything wrong with what they did, you will have to show the consequences and then reach a common understanding. I do recommend you read the article mentioned for further clarification about feedback sessions.

Mindset: be always open and transparent

Keeping the same no-judgment mindset from the item above, be open to questions at this moment. Your team may be afraid of the changes, not only in the work methodology but also in the whole economic environment. If they have questions, be prepared to spend more than just 5 minutes talking to them. They might want to talk about personal stuff before getting to the real point. So keep your ears open and don’t avoid moments of sharing thoughts, even about your personal feelings about the moment. Keeping the feedback routines you already have (or should have) is key at this moment. They are waiting for the right moment to ask you tough questions, so be prepared to answer.

We are all living with uncertainty and a new situation we have never faced before. If you are calm, your team will be calm. Always communicate to them in a simple way about every new step your company is taking to respond to the pandemic. Sharing information, even if they can’t vote on tough decisions, will make them feel part of the moment. You will also be remembered as someone who takes care of your people.

Career, Entrepreneurship, IT is business, Practical examples

Two main errors your agile project may have right now and how to solve them

March 12, 2020

Doing agile for software development goes way beyond leaving heavy documentation behind and producing more. According to a 2018 HBR study, 25% of enterprise companies are already using agile and 55% are in the process of adopting it. The data doesn’t lie: the masters of DevOps and Agile grow and are 60% more profitable than the rest. Agile brings many benefits, but it also comes with new challenges built in. Simply adopting agile is already a major challenge. But after that, new challenges remain, and you might not have noticed them.

1st – Involuntary blindness from POs

A day in the shoes of a PO (Product Owner) is tough. They own the product! And they are responsible for its evolution. Basically, they are responsible for translating the company’s macro strategy for that product into the actual final product. It sounds pretty simple, but this process is where things commonly get fuzzy. A PO also:

  • Answers questions from the development team about their tasks;
  • Keeps feeding the backlog;
  • Helps the Scrum Masters with prioritization decisions;
  • Eventually, changes prioritization and delivery plans because of special requests from the management;
  • Monitors metrics of the product;
  • Is responsible for meetings with high management to discuss the metrics she’s been monitoring.

Many of the items above would require a full-time person to fulfill them successfully. Given this workload, the PO often leaves behind one important thing: hearing what the market is saying while staying aligned with the macro strategy. By that I mean the PO frequently doesn’t have time to think, judge, and decide appropriately after a new demand comes from a BA (Business Analyst) or high management. Due to lack of time, she leaves behind the most important task of her job, which is enhancing the product. Let’s go over an example to illustrate:

A page never used in a hotel website

Years ago I was the SM (Scrum Master) leading a team developing a hotel website. I remember it as if it were today. We had a big delivery at the end of 3 months and, in the last week, my PO brought me the news that we had missed one important page to showcase the hotel’s partnership with an entertainment channel. Guess what? We worked several hours more than planned and delivered just a small part of that page. We also kept working on the same page even after the deadline. We released the final page a few weeks later, and when I checked Google Analytics, I found that the page received less than 5 percent of the website’s visits.

The summary: we spent at least one entire month of work on a page that less than 5% of our target audience was actually interested in. We wasted a month of budget for a team of 4 dedicated people. There was no regulatory requirement and no binding contract forcing them to have that page live. It was just a request from a board member. In that situation, the PO SHOULD have argued against doing it and for prioritizing the e-commerce system instead. That was something the users were actually asking for strongly.

But why did the PO let it stay that way? The answer: she didn’t know the page would get so few visits. She was so deep in deadlines, reports, monitoring, and tests that she just accepted the request without questioning it. If she had had time to think about it, with the appropriate information from her BAs, she would have made a better decision.

2nd – Deficient hand-off of tasks from the PO/BAs to the delivery team

The second thing that often breaks plans is when the planning meeting is already going on and the technical team finds out a task is much bigger than the PO/BAs thought at first. When the PO is defining the next tasks for the backlog, she must be well aligned with the software architects. When a major feature (an epic) comes to the table, the technical team has to adapt the software to develop that new task accordingly. Let’s go over one more example:

The “just one more report”

This one happened during a project for a company in the manufacturing industry. The software had been running for months and was stable. We were in a phase of working on bugs, security, and small features. We were also generating some reports to gather new metrics. When the planning started, our PO explained the task: adding more columns and creating a new report with charts. The charts were fine, but the data for those new columns was stored in other systems that we had no control over. We had to ask the people responsible for the other software to create an API for us to consume the data, and the simple report took more than a month to finish.

The interesting part of this example is that the report was promised within one week and took almost 2 months. The management had to wait for the new report, and the important information in it, before making new decisions. The change from 1 week to ~2 months created an unnecessary dispute between the project team and the management. None of that would have happened if somebody with even a brief technical view of the project had been involved in grooming/prioritizing the backlog and had communicated properly with management.

If a much bigger task arrives with no prior preparation, it generates delay. The way to solve it is to have senior technical people check the backlog periodically and stay closer to strategic decisions. This way they will be able to anticipate such moves and tackle big tasks little by little.

At last

Agile is not an exact science. You will have to find the set of practices that creates your own agile. These are the two main challenges that I find are often hidden. People don’t actively tackle them because they don’t hurt at first, but they do have side effects that can turn into a big mess. And what are your main challenges?