Automating Internal Databases Operations at OVHcloud with Ansible

CfgMgmtCamp 2024

Julien RIOU

February 6, 2024



Speaker


Summary


Who are we?


All products rely on internal databases


Managed infrastructure


Cluster example


Mutualized environment


Management tools


Infrastructure as Code

Using Terraform (Enterprise).

Providers:


Configuration management

Using Puppet.

Operating system security hardening:


One-shot operations

ansible


Operation examples


Automation


Deep dive into Ansible


Code base

Architecture of a playbook


Reusable tasks


Real-world examples


Schema migrations


Schema migrations

sql-migrate

-- +migrate Up
create table author (
    id   bigserial primary key,
    name text not null
);

create table talk (
    id        bigserial primary key,
    title     text not null,
    author_id bigint not null references author(id)
);

-- +migrate Down
drop table author, talk;

Schema migrations


Playbook overview

- name: check arguments
  hosts: all
  run_once: true
  delegate_to: localhost
  tasks:
    - name: check variable schema_url    # fail fast
    - name: check variable database_name # fail fast
- name: update database to the latest schema migration
  hosts: "{{ database_name }}:&subrole_primary"
  tasks:
    - name: create sql-migrate directories
    - name: create sql-migrate configuration file
    - name: clone schema repository
    - name: run migrations
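
The "fail fast" checks in the overview could look like the sketch below. It assumes ansible.builtin.assert is used (the slides do not say which module performs the check), and the error message is illustrative. Here run_once and delegate_to are set on the task rather than on the play.

```yaml
- name: check arguments
  hosts: all
  gather_facts: false
  tasks:
    - name: check variable schema_url
      # Assumption: assert is one way to fail fast; the real check may differ.
      ansible.builtin.assert:
        that:
          - schema_url is defined
          - schema_url | length > 0
        fail_msg: "schema_url is required (pass it with --extra-vars)"
      run_once: true
      delegate_to: localhost
```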

Playbook tasks

- name: create sql-migrate directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
  loop:
    - /etc/sqlmigrate
    - /var/lib/sqlmigrate
- name: create sql-migrate configuration file
  ansible.builtin.template:
    src: sqlmigrate/database.yml.j2
    dest: "/etc/sqlmigrate/{{ database_name }}.yml"
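
The template sqlmigrate/database.yml.j2 is not shown in the slides. A minimal sketch of what it might contain, assuming a PostgreSQL database reached through the local socket and the "development" environment that sql-migrate uses when no -env flag is given:

```yaml
# Hypothetical content of sqlmigrate/database.yml.j2 (not from the talk).
development:
  dialect: postgres
  # Assumption: local socket connection, no TLS needed.
  datasource: host=/var/run/postgresql dbname={{ database_name }} sslmode=disable
  # Directory where the schema repository is cloned by the next task.
  dir: /var/lib/sqlmigrate/{{ database_name }}
```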

Playbook tasks

- name: clone schema repository
  ansible.builtin.git:
    repo: "{{ schema_url }}"
    dest: "/var/lib/sqlmigrate/{{ database_name }}"
    version: "{{ branch|default('master') }}" # branch or tag
    force: true
  environment:
    TMPDIR: /run
- name: run migrations
  ansible.builtin.command:
    cmd: sql-migrate up -config /etc/sqlmigrate/{{ database_name }}.yml

Database creation

Just run CREATE DATABASE.

Easy, right?

Well…


Database creation

  1. Check arguments
  2. Select an available cluster
  3. Create git repository
  4. Run CREATE DATABASE (using a module)
  5. Create secrets
  6. Create roles and users (for applications, humans)
  7. Link the database to the git repository
  8. Run schema migrations
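
Steps 4 and 6 might be sketched as follows. This assumes the community.postgresql collection; the slides only say "using a module". The role naming scheme and the app_password variable are hypothetical.

```yaml
- name: create database
  community.postgresql.postgresql_db:
    name: "{{ database_name }}"
    state: present
  become: true
  become_user: postgres

- name: create application role
  community.postgresql.postgresql_user:
    name: "{{ database_name }}_app"   # hypothetical naming scheme
    password: "{{ app_password }}"    # hypothetical variable, created in step 5
    state: present
  become: true
  become_user: postgres
  no_log: true                        # keep the password out of job output
```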

Minor upgrades

Ensure software is up to date:


Minor upgrades


Minor upgrade (1/2)


Minor upgrade (2/2)


Database migration


Database migration

Move one or more databases from one cluster to another

  1. Set up logical replication
  2. Promote
    • Check
    • Migrate
    • Rollback
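
Step 1 could be sketched with the publication and subscription modules of community.postgresql. This is an assumption, not the playbook used in the talk: the groups source_cluster and target_cluster and the variables source_host and replication_password are hypothetical. Logical replication does not copy DDL, so the schema must already exist on the target.

```yaml
- name: publish the database on the source primary
  hosts: "source_cluster:&subrole_primary"    # hypothetical group
  become: true
  become_user: postgres
  tasks:
    - name: create publication for all tables
      community.postgresql.postgresql_publication:
        db: "{{ database_name }}"
        name: "{{ database_name }}_migration"
        state: present

- name: subscribe from the target primary
  hosts: "target_cluster:&subrole_primary"    # hypothetical group
  become: true
  become_user: postgres
  tasks:
    - name: create subscription
      community.postgresql.postgresql_subscription:
        db: "{{ database_name }}"
        name: "{{ database_name }}_migration"
        state: present
        publications:
          - "{{ database_name }}_migration"
        connparams:
          host: "{{ source_host }}"               # hypothetical variable
          dbname: "{{ database_name }}"
          user: replication                       # hypothetical role
          password: "{{ replication_password }}"  # hypothetical variable
      no_log: true
```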

Database migration


External collections


Internal collections


Implementation

How we use Ansible


Secure Shell (SSH)

How can we securely connect to remote hosts to perform actions?


The Bastion

Ansible + The Bastion

“Ansible Wrapper”

[ssh_connection]
pipelining = True
private_key_file = ~/.ssh/id_ed25519
ssh_executable = /usr/share/ansible/plugins/bastion/sshwrapper.py
sftp_executable = /usr/share/ansible/plugins/bastion/sftpbastion.sh
transfer_method = sftp
retries = 3

https://github.com/ovh/the-bastion-ansible-wrapper

SCP is deprecated; use SFTP instead.


Inventory

Where can we find our hosts to perform operations?


Consul


Consul

Consul service discovery


Consul


Static configuration


Dynamic configuration


Where is my database?

Consul service


Ansible + Consul


How to use the inventory?

With a limit option

ansible server_type_postgresql -m ping

ansible-playbook -l server_type_postgresql playbook.yml

Group combination

ansible-playbook -l 'test:&subrole_primary' playbook.yml
ansible-playbook -l 'server_type_postgresql:server_type_mysql' playbook.yml
ansible-playbook -l 'server_type_postgresql:!cluster_99' playbook.yml
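
The same patterns work in the hosts field of a play. A minimal sketch combining the groups from the commands above (intersection with &, exclusion with !):

```yaml
- name: ping PostgreSQL primaries outside cluster_99
  # Members of server_type_postgresql that are also in subrole_primary,
  # minus any host in cluster_99.
  hosts: "server_type_postgresql:&subrole_primary:!cluster_99"
  gather_facts: false
  tasks:
    - name: check connectivity
      ansible.builtin.ping:
```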

Execution environments

Where does Ansible run?


Admin server


AWX


Concepts


AWX UI



AWX CLI

awx -f human job_templates launch --monitor --extra_vars \
    '{"database_name": "***", "branch": "master", "schema_url": "ssh://***.git"}' \
    database-primary-schema-update


Configuration


Components on Kubernetes


Disclaimer

Some of the issues we encountered are probably related to our internal implementation (internal services, internal Kubernetes).


Quota on pods

Component  Type     cpu    memory  ephemeral-storage  Quantity
web        request  500m   1Gi     -                  1
           limit    2000m  2Gi     -
task       request  1000m  2Gi     -                  1
           limit    1500m  4Gi     -
ee         request  1000m  256Mi   -                  n
           limit    2000m  2Gi     1G
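
For the ee row, the values map to a Kubernetes resources section. The sketch below follows the shape of an AWX container group pod spec; the namespace and image are assumptions, not the configuration used in the talk.

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: awx                            # hypothetical namespace
spec:
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest  # assumption: upstream EE image
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        requests:
          cpu: 1000m
          memory: 256Mi
        limits:
          cpu: 2000m
          memory: 2Gi
          ephemeral-storage: 1G
```

Every job starts one such pod (quantity n), so the namespace quota has to cover the number of simultaneous jobs.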

Job execution time

  1. Source Control Update
  2. Inventory Sync
  3. Pod scheduling time (quotas, simultaneous jobs)
  4. Containers starting time (init containers)
  5. Playbook execution time

Job execution time

PING

1 min 45 secs


Solutions

  • Enable SCM update cache
    • scm_update_on_launch (bool)
    • scm_update_cache_timeout (int)
  • Enable inventory cache
    • update_on_launch (bool)
    • update_cache_timeout (int)
  • Check quotas on Kubernetes namespace
  • Analyze playbook performance
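
The cache settings above can be managed as code. A minimal sketch, assuming the awx.awx collection; the object names, the organization, the source path and the timeout values are hypothetical.

```yaml
- name: cache SCM updates for the project
  awx.awx.project:
    name: databases-playbooks         # hypothetical project name
    organization: Default             # hypothetical organization
    scm_type: git
    scm_url: "{{ playbooks_url }}"    # hypothetical variable
    scm_update_on_launch: true
    scm_update_cache_timeout: 600     # seconds; illustrative value

- name: cache inventory syncs
  awx.awx.inventory_source:
    name: consul                      # hypothetical source name
    inventory: databases              # hypothetical inventory name
    source: scm
    source_project: databases-playbooks
    source_path: inventory/           # hypothetical path
    update_on_launch: true
    update_cache_timeout: 600         # seconds; illustrative value
```

With a non-zero timeout, a job launched within that window reuses the previous update instead of triggering a new one.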

Fixed


Custom Vault


Custom Vault and database migrations


Custom Vault


Custom Vault with application key


Custom Vault with basic auth


Solution

  • Init Container to pull all secrets locally once
  • Lookup vault_secret to read locally (application key)
  • Lookup vault_secret_with_user to bypass the cache (basic auth)
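
From a playbook, the two lookups might be called as in the sketch below. The plugin names come from the list above; the argument format (a single secret path) and the paths themselves are assumptions, since the plugins are internal and their interface is not shown.

```yaml
- name: read secrets through the two lookup plugins
  ansible.builtin.set_fact:
    # Reads the local copy pulled by the init container (application key).
    app_password: "{{ lookup('vault_secret', 'databases/test/app') }}"
    # Queries the vault directly with basic auth, bypassing the local cache.
    admin_password: "{{ lookup('vault_secret_with_user', 'databases/test/admin') }}"
  no_log: true
```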

Fixed

Two plugins to avoid breaking changes on the first one.


Network unreachable on Kubernetes

configstore:
provider '***':
Post "https://***/auth/app":
dial tcp:
lookup *** on ***:53:
read udp ***->***:53:
i/o timeout

Breaking the job

Job failed with no output


Cascading break

List of failed jobs


Solution

Replace iptables with nftables on Kubernetes workers

Fixed


Consul Federation


But