GitSync overview

GitSync is an ADO service that uploads and synchronizes Airflow DAGs from remote Git repositories. The service enables Git-based management of DAGs and integrates directly into Airflow environments.

GitSync main features:

  • Repository synchronization — cloning and updating files from one or multiple Git repositories.

  • Automated delivery — synchronization of DAG files to the target Airflow DAG directory.

  • Flexible filtering — selection of files using pattern-based filters.

  • Parallel processing — handling multiple repositories simultaneously using workers.

  • Cleanup support — optional removal of outdated files from target directories.

  • SSH key management — centralized handling of SSH credentials through service actions in ADCM.

Workflow

GitSync operates as a standalone service and has only one component (gitsync).

The synchronization process consists of the following steps:

  1. DAG source code is stored in one or more Git repositories.

  2. GitSync clones or updates repositories.

  3. Files are filtered based on GitSync’s configuration.

  4. DAG files are copied to the target directory.

  5. Optional cleanup removes outdated files.

  6. Logs and metrics are generated.

Airflow automatically discovers updated DAGs by scanning the configured DAG directory. It recursively scans all subdirectories inside the DAG folder (for example, /opt/airflow/dags).
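
Steps 3 to 5 of the process above can be illustrated with a minimal Python sketch. It is not GitSync's actual implementation: the function name sync_directory and its parameters are hypothetical, and the sketch assumes that the repository has already been cloned to a local source directory and that the filter pattern is matched against file names.

```python
import fnmatch
import shutil
from pathlib import Path


def sync_directory(source, target, pattern="*.py", delete_old_files=False):
    """Copy files matching `pattern` from source to target; optionally remove extras.

    Hypothetical helper for illustration only, not part of GitSync.
    """
    source, target = Path(source), Path(target)
    target.mkdir(parents=True, exist_ok=True)

    # Step 3: select files whose name matches the pattern.
    selected = {
        p.relative_to(source).as_posix()
        for p in source.rglob("*")
        if p.is_file() and fnmatch.fnmatch(p.name, pattern)
    }

    # Step 4: copy the selected files, preserving the relative layout.
    for rel in selected:
        dest = target / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source / rel, dest)

    # Step 5: optional cleanup of target files that are not in the selection.
    if delete_old_files:
        for p in list(target.rglob("*")):
            if p.is_file() and p.relative_to(target).as_posix() not in selected:
                p.unlink()

    return selected
```

Note that the cleanup step in this sketch removes every file in the target directory that is absent from the selection, including files created at runtime by other tools.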

Configuration

GitSync configuration consists of two levels: service-level and repository-level.

Service-level configuration

Service-level parameters define global behavior of the GitSync service. They are defined in the gitsync-env.sh option in ADCM.

Key parameters include:

  • number of parallel workers;

  • synchronization interval and timeout;

  • logging configuration.
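
The effect of the worker count can be pictured with a small Python sketch of a worker pool. How GitSync schedules work internally is not described here, so this is only an assumption-based illustration: sync_all, sync_one, and max_workers are hypothetical names, and sync_one stands for whatever function synchronizes a single repository.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def sync_all(repos, sync_one, max_workers=4):
    """Run sync_one(repo) for every repository in parallel.

    Returns a dict mapping each repository URL to ("ok", result)
    or ("error", message). Hypothetical helper for illustration only.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(sync_one, repo): repo["url"] for repo in repos}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = ("ok", future.result())
            except Exception as exc:
                # A failure in one repository does not stop the others.
                results[url] = ("error", str(exc))
    return results
```

With a pool like this, at most max_workers repositories are processed at the same time, and the remaining ones wait for a free worker.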

Repository-level configuration

Repository settings are defined in the config.json option in ADCM. It contains the repository connection parameters and the options that select which DAG files to synchronize.

Example repository configuration:

{
  "url": "git@ssh.gitlab.example.io:org/repo.git", (1)
  "branch": "main", (2)
  "directory": "./dags",
  "files": "*.py", (3)
  "sync_interval": 60, (4)
  "sync_timeout": 120,
  "ssh_key": "my-git-key", (5)
  "target_folder": "/opt/airflow/dags/project", (6)
  "delete_old_files": true (7)
}
1 Git repository URL.
2 Branch and directory.
3 File filtering rules.
4 Synchronization interval and timeout.
5 SSH key name (for SSH repositories).
6 Target directory for DAGs in Airflow.
7 Optional cleanup behavior.

GitSync supports synchronization of multiple repositories simultaneously. Each repository is processed independently by the worker pool and must use a unique target_folder to avoid conflicts.

Example configuration for multiple repositories:
[
  {
    "url": "git@ssh.gitlab.example.io:org/marketing-dags.git",
    "sync_interval": 60,
    "target_folder": "/opt/airflow/dags/marketing",
    "branch": "main",
    "tag": null,
    "directory": "./dags",
    "files": "*.py",
    "sync_requirements": false,
    "requirements_path": null,
    "sync_timeout": 120,
    "ssh_key": "ssh_key_marketing",
    "delete_old_files": true
  },
  {
    "url": "git@ssh.gitlab.example.io:org/finance-dags.git",
    "sync_interval": 120,
    "target_folder": "/opt/airflow/dags/finance",
    "branch": "main",
    "tag": null,
    "directory": "./dags",
    "files": "*.py",
    "sync_requirements": false,
    "requirements_path": null,
    "sync_timeout": 300,
    "ssh_key": "ssh_key_finance",
    "delete_old_files": true
  },
  {
    "url": "git@ssh.gitlab.example.io:org/sales-dags.git",
    "sync_interval": 180,
    "target_folder": "/opt/airflow/dags/sales",
    "branch": "main",
    "tag": null,
    "directory": "./dags",
    "files": "*.py",
    "sync_requirements": false,
    "requirements_path": null,
    "sync_timeout": 300,
    "ssh_key": "ssh_key_sales",
    "delete_old_files": true
  },
  {
    "url": "https://github.com/org/shared-dags.git",
    "sync_interval": 300,
    "target_folder": "/opt/airflow/dags/shared",
    "branch": "main",
    "tag": null,
    "directory": "./",
    "files": "*.py",
    "sync_requirements": false,
    "requirements_path": null,
    "sync_timeout": 300,
    "access_token": "******",
    "https_username": "oauth2",
    "delete_old_files": false
  }
]
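
Before applying a multi-repository configuration, the uniqueness of target_folder can be checked with a short script. The following is a minimal sketch, assuming the configuration has been loaded as a list of dictionaries shaped like the example above. The function duplicate_target_folders is hypothetical and not part of GitSync.

```python
from collections import Counter
from pathlib import PurePosixPath


def duplicate_target_folders(repos):
    """Return the sorted target_folder values used by more than one repository.

    Paths are normalized first, so "/opt/airflow/dags/sales" and
    "/opt/airflow/dags/sales/" count as the same folder.
    """
    counts = Counter(str(PurePosixPath(repo["target_folder"])) for repo in repos)
    return sorted(folder for folder, n in counts.items() if n > 1)
```

An empty result means that every repository writes to its own directory. The check does not detect nested folders (for example, one target inside another).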

SSH authentication

For SSH-based repositories, GitSync provides built-in key management:

  1. SSH keys are uploaded via the Upload private key action.

  2. Keys are stored and managed by GitSync according to the service configuration.

  3. Repository configuration references keys by name.

  4. Keys are injected at runtime.

The same SSH key can be reused across multiple repositories.
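
One common way to make Git use a specific private key at runtime is the GIT_SSH_COMMAND environment variable. The sketch below shows how a key name from the repository configuration could be resolved to such an environment. It is an assumption about the general technique, not a description of GitSync internals: the function git_env_for_key and the key directory /etc/gitsync/keys are hypothetical.

```python
import os
import shlex
from pathlib import PurePosixPath


def git_env_for_key(key_name, key_dir="/etc/gitsync/keys"):
    """Build an environment that makes git use the named private key.

    key_dir is a hypothetical storage location used for illustration.
    """
    # Reject empty names and names with path components,
    # so that a key name cannot point outside key_dir.
    if (
        not key_name
        or key_name in (".", "..")
        or PurePosixPath(key_name).name != key_name
    ):
        raise ValueError(f"invalid key name: {key_name!r}")

    key_path = PurePosixPath(key_dir) / key_name
    env = dict(os.environ)
    # IdentitiesOnly=yes makes ssh offer only the key given with -i.
    env["GIT_SSH_COMMAND"] = (
        f"ssh -i {shlex.quote(str(key_path))} -o IdentitiesOnly=yes"
    )
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when calling git clone or git fetch. Because the key is referenced only by name, several repositories can share it.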

Usage

To start using GitSync:

  1. Add and install the GitSync service in ADO.

  2. Configure the service parameters.

  3. Upload SSH keys (if required) via the Upload private key GitSync action.

  4. Define repository configurations.

  5. Ensure that target DAG directories are accessible by Airflow.

After all steps are completed, GitSync automatically keeps DAGs synchronized according to the defined intervals.

Limitations

Consider the following limitations when configuring GitSync:

  • Python environment. TARGET_PYTHON is defined at the service level and shared across all repositories. Using separate Python environments for different repositories is not supported.

  • Repository configuration flexibility. The following cases are not supported and may cause undefined behavior:

    • synchronization of multiple directories from the same repository and branch;

    • synchronization of the same repository from multiple branches.

  • dbt project support. This is not a primary use case and has not been fully validated. Runtime-generated artifacts (for example, target/, logs/) may be removed if GitSync is not configured correctly. Additionally, using dbt requires the following configuration:

    • files = "*";

    • delete_old_files = false.

  • File synchronization behavior. delete_old_files removes files based on repository state and does not distinguish between outdated and runtime-generated files.

  • General limitations.

    • Requires network access to Git repositories.

    • SSH requires correct key configuration.

    • Duplicate dag_id across repositories leads to conflicts.
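
Both unsupported repository cases listed above involve the same repository URL appearing in more than one configuration entry, so they can be detected with a simple check. The following is a minimal sketch, assuming the configuration is loaded as a list of dictionaries with url, branch, and directory keys. The function repeated_repositories is hypothetical and not part of GitSync.

```python
from collections import defaultdict


def repeated_repositories(repos):
    """Map each URL used by more than one entry to its (branch, directory) pairs."""
    seen = defaultdict(list)
    for repo in repos:
        seen[repo["url"]].append((repo.get("branch"), repo.get("directory")))
    return {url: pairs for url, pairs in seen.items() if len(pairs) > 1}
```

A non-empty result points to entries that synchronize the same repository from several branches or directories. The comparison is a plain string match on the URL, so two different spellings of the same repository address are not detected.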
