Getting Started with Ansible from Scratch: Ops Automation Is No Longer Just a Dream

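Honestly, when I first ran into Ansible I was completely lost. Playbook, inventory, module... the terminology alone gave me a headache. After using it for a while, though, I realized it really is a gift to ops engineers. Today I want to walk through this tool so that you can get hands-on by the end, and I'll share the pitfalls I've hit and the production experience I've built up over the past few years.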

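What is Ansible, and why choose it?

Put simply, Ansible is an automation tool. Think of it as a remote control: with it you can make hundreds or thousands of servers do the same thing at the same time.

I remember when I was new to the job, the company had a project that needed an application deployed to more than 200 servers. I naively wrote a shell script and used a for loop to ssh into each machine and run it. Halfway through, the network dropped, and I had no idea which machines had succeeded and which had failed. Later a colleague introduced me to Ansible, and it felt like stepping out of the Stone Age into modern society.

There are a few things about Ansible that I especially like:

Agentless: this is really nice. All you need is SSH access to the target servers (plus Python on them for most modules). I had used Puppet and Chef before, and both need an agent installed on every machine, which is a real pain to maintain. Once our Puppet master went down, no node could reach it, and the whole configuration management system was paralyzed. Ansible does not have that problem: it is push-based, so if the control node goes down, the servers that are already configured keep running unaffected.

SSH-based: it rides on your existing SSH connections, so security is well covered. SSH already comes with plenty of security mechanisms, such as key authentication and jump hosts, and Ansible works well with all of them.

Idempotency: the word sounds fancy, but it just means the result is the same no matter how many times you run it. For example, if you create a user with Ansible and the user already exists, it will not be created again. This matters a lot in production, because you may need to run the same playbook many times to make sure the configuration is correct. Keep in mind that this holds for the declarative modules; tasks that run arbitrary commands through shell or command are only idempotent if you write them that way.

YAML syntax: configuration files are written in YAML, which is much easier to read than JSON or XML. I've written XML config files before, and the indentation and nesting made my eyes glaze over.

A large module library: Ansible ships with thousands of modules covering almost every ops scenario. From basic file operations to complex cloud resource management, nearly any operation you can think of has a module.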
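
To make the idempotency point above concrete, here is a minimal Python sketch of the "check the current state, change only if needed" pattern. It is only an illustration of the idea, not how any Ansible module is implemented; `ensure_line` and `demo` are hypothetical names.

```python
import tempfile
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` is in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already holds, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


def demo():
    """Apply the same change twice and report what happened each time."""
    with tempfile.TemporaryDirectory() as tmp:
        hosts = Path(tmp) / "hosts"
        first = ensure_line(hosts, "192.168.1.100 app.example.com")
        second = ensure_line(hosts, "192.168.1.100 app.example.com")
        return first, second, hosts.read_text()
```

The first call reports a change and the second does not, which is the same "changed" versus "ok" distinction you see in Ansible's output.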

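Installation and environment setup

Installing Ansible is simple, but for production I recommend some extra configuration.

Basic installation

I usually install with pip, which gets me the latest release:

# Install pip first
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py

# Install ansible
pip3 install ansible

# Or pin a specific version
pip3 install ansible==6.7.0

If your environment is more complicated, I recommend a virtual environment: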

python3 -m venv ansible-env
source ansible-env/bin/activate
pip install ansible

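Tuning the configuration file

Ansible's configuration file is ansible.cfg. It is looked up in the following order, and the first file found wins (settings are not merged across files):

The path in the ANSIBLE_CONFIG environment variable, if set
ansible.cfg in the current directory
~/.ansible.cfg
/etc/ansible/ansible.cfg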
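
The lookup can be pictured as a "first match wins" search. Ansible also checks the ANSIBLE_CONFIG environment variable before any of the file locations. The sketch below is a simplified model under that assumption; `find_ansible_cfg` is a hypothetical helper, and the real lookup has extra rules (for example, it skips an ansible.cfg in a world-writable current directory).

```python
import os


def find_ansible_cfg(environ, cwd, home, exists=os.path.isfile):
    """Return the first config file that exists, in precedence order, or None."""
    candidates = [
        environ.get("ANSIBLE_CONFIG"),        # explicit override
        os.path.join(cwd, "ansible.cfg"),     # project-local config
        os.path.join(home, ".ansible.cfg"),   # per-user config
        "/etc/ansible/ansible.cfg",           # system-wide config
    ]
    for path in candidates:
        if path and exists(path):
            return path
    return None
```

Passing `exists` as a parameter keeps the function easy to try out without touching the real filesystem.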

I usually put an ansible.cfg in the project root so that each project has its own configuration:

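[defaults]
# Location of the inventory file
inventory = inventories/production/hosts

# Private key file
private_key_file = ~/.ssh/ansible_key

# Remote user
remote_user = ansible

# Disable host key checking (this removes protection against
# man-in-the-middle attacks, so think twice before using it in production)
host_key_checking = False

# Number of parallel forks; tune it for your network and control node
forks = 50

# Connection timeout in seconds
timeout = 30

# Log file (the user running ansible needs write access to this path)
log_path = /var/log/ansible.log

# Retry files and where to save them
retry_files_enabled = True
retry_files_save_path = ~/.ansible-retry

[ssh_connection]
# Pipelining reduces the number of SSH operations per task. The setting
# belongs in this section, and requiretty must not be enabled in sudoers
# on the targets
pipelining = True

# SSH connection multiplexing for better performance
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes

# Use SCP instead of SFTP for file transfers
scp_if_ssh = True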

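All of these settings come from my production experience, especially forks: the default is 5, which is clearly not enough when managing a large number of servers. I usually set it to 50 or higher, but do not go too high, or you may open too many SSH connections at once and overload the control node.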

SSH key setup

In production I strongly recommend SSH key authentication instead of passwords. First generate a key pair:

ssh-keygen -t rsa -b 4096 -C "ansible@company.com" -f ~/.ssh/ansible_key

Then distribute the public key to all target servers:

# One host at a time
ssh-copy-id -i ~/.ssh/ansible_key.pub user@target_server

# Or in bulk with a small loop
for host in $(cat server_list.txt); do
    ssh-copy-id -i ~/.ssh/ansible_key.pub user@$host
done

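For safety, I recommend creating a dedicated user for Ansible rather than using root directly:

# Create the ansible user on the target server (run as root)
useradd -m -s /bin/bash ansible
echo "ansible ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ansible
chmod 0440 /etc/sudoers.d/ansible
visudo -cf /etc/sudoers.d/ansible  # syntax-check the new sudoers file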

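Understanding Inventory in depth

Inventory is one of Ansible's core concepts: it defines the servers you manage. In production, how you design the inventory directly affects how efficiently you can operate.

Static inventory

The simplest inventory is just a text file, but in real projects I organize it by environment and role:

# inventories/production/hosts

# Web servers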
[webservers]
web01 ansible_host=10.0.1.10 ansible_port=22
web02 ansible_host=10.0.1.11 ansible_port=22
web03 ansible_host=10.0.1.12 ansible_port=22

# Database servers
[databases]
db01 ansible_host=10.0.2.10 mysql_role=master
db02 ansible_host=10.0.2.11 mysql_role=slave
db03 ansible_host=10.0.2.12 mysql_role=slave

# Load balancers
[loadbalancers]
lb01 ansible_host=10.0.3.10
lb02 ansible_host=10.0.3.11

# Cache servers
[cache]
redis01 ansible_host=10.0.4.10 redis_port=6379
redis02 ansible_host=10.0.4.11 redis_port=6379

# Groups of groups
[frontend:children]
webservers
loadbalancers

[backend:children]
databases
cache

# Global variables
[all:vars]
ansible_user=ansible
ansible_ssh_private_key_file=~/.ssh/ansible_key
ansible_python_interpreter=/usr/bin/python3

# Environment-specific variables
[production:children]
frontend
backend

[production:vars]
env=production
domain=example.com
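
To see what the `:children` sections above mean, here is a small Python sketch that resolves a group to its hosts by walking the child groups. The data mirrors the inventory file; `expand` is a hypothetical helper for illustration, not part of Ansible.

```python
# Direct host membership, copied from the inventory above
HOSTS = {
    "webservers": ["web01", "web02", "web03"],
    "databases": ["db01", "db02", "db03"],
    "loadbalancers": ["lb01", "lb02"],
    "cache": ["redis01", "redis02"],
}

# Parent group -> child groups (the [name:children] sections)
CHILDREN = {
    "frontend": ["webservers", "loadbalancers"],
    "backend": ["databases", "cache"],
    "production": ["frontend", "backend"],
}


def expand(group, hosts=HOSTS, children=CHILDREN):
    """Return the hosts of `group`, including hosts of all nested child groups."""
    members = list(hosts.get(group, []))
    for child in children.get(group, []):
        for host in expand(child, hosts, children):
            if host not in members:
                members.append(host)
    return members
```

Targeting `frontend` therefore reaches the three web servers and both load balancers, and `production` reaches all ten hosts.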

I also create matching group_vars and host_vars directories:

inventories/production/
├── hosts
├── group_vars/
│   ├── all.yml
│   ├── webservers.yml
│   ├── databases.yml
│   └── production.yml
└── host_vars/
    ├── web01.yml
    └── db01.yml

group_vars/webservers.yml:

# Web server variables (versions are quoted so YAML keeps them as strings)
nginx_version: "1.20.2"
php_version: "8.0"
document_root: /var/www/html
max_connections: 1024

# Firewall rules
firewall_rules:
  - port: 80
    protocol: tcp
    source: 0.0.0.0/0
  - port: 443
    protocol: tcp
    source: 0.0.0.0/0

group_vars/databases.yml:

# MySQL settings
mysql_version: "8.0"
mysql_root_password: "{{ vault_mysql_root_password }}"
mysql_databases:
  - name: webapp
    encoding: utf8mb4
    collation: utf8mb4_unicode_ci

mysql_users:
  - name: webapp_user
    password: "{{ vault_webapp_db_password }}"
    priv: "webapp.*:ALL"
    host: "10.0.1.%"

# MySQL server parameters
mysql_config:
  innodb_buffer_pool_size: "1G"
  max_connections: 200
  # query_cache_size is left out: the query cache was removed in MySQL 8.0

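Dynamic inventory

In cloud environments server IP addresses can change often, and that is where dynamic inventory comes in. I once wrote a script that pulls the server list from Alibaba Cloud ECS: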

#!/usr/bin/env python3
# dynamic_inventory.py

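import json
import os
from aliyunsdkcore.client import AcsClient
from aliyunsdkecs.request.v20140526 import DescribeInstancesRequest

class AlicloudInventory:
    def __init__(self):
        # Read credentials from the environment instead of hardcoding them
        # (the variable names are this script's own convention)
        self.client = AcsClient(
            os.environ['ALIYUN_ACCESS_KEY_ID'],
            os.environ['ALIYUN_ACCESS_KEY_SECRET'],
            os.environ.get('ALIYUN_REGION_ID', 'cn-hangzhou')
        )
        self.inventory = {
            '_meta': {
                'hostvars': {}
            }
        }

    def get_instances(self):
        # DescribeInstances is paginated (10 per page by default),
        # so walk through every page
        page = 1
        while True:
            request = DescribeInstancesRequest.DescribeInstancesRequest()
            request.set_PageSize(100)
            request.set_PageNumber(page)
            response = json.loads(self.client.do_action_with_exception(request))
            instances = response['Instances']['Instance']
            yield from instances
            if not instances or page * 100 >= response['TotalCount']:
                break
            page += 1

    def build_inventory(self):
        for instance in self.get_instances():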
            hostname = instance['InstanceName']
            private_ip = instance['NetworkInterfaces']['NetworkInterface'][0]['PrimaryIpAddress']
          
            # Group by tags
            tags = {tag['TagKey']: tag['TagValue'] for tag in instance.get('Tags', {}).get('Tag', [])}
          
            # Add the host to the matching group
            if 'Role' in tags:
                role = tags['Role']
                if role not in self.inventory:
                    self.inventory[role] = {'hosts': []}
                self.inventory[role]['hosts'].append(hostname)
          
            # Add host variables
            self.inventory['_meta']['hostvars'][hostname] = {
                'ansible_host': private_ip,
                'instance_id': instance['InstanceId'],
                'instance_type': instance['InstanceType'],
                'tags': tags
            }
  
    def run(self):
        self.build_inventory()
        print(json.dumps(self.inventory, indent=2))

if __name__ == '__main__':
    inventory = AlicloudInventory()
    inventory.run()

Using the dynamic inventory (the script has to be executable):

chmod +x dynamic_inventory.py
ansible all -i ./dynamic_inventory.py -m ping
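
Whatever the data source, an inventory script has to answer `--list` with a JSON document of groups plus a `_meta.hostvars` section, and ideally `--host <name>` with that host's variables. Here is a self-contained sketch of that contract using made-up sample data instead of a cloud API; `SAMPLE_INSTANCES`, `build_inventory`, and `respond` are hypothetical names.

```python
import json

# Hypothetical sample data standing in for a cloud API response
SAMPLE_INSTANCES = [
    {"name": "web01", "ip": "10.0.1.10", "role": "webservers"},
    {"name": "web02", "ip": "10.0.1.11", "role": "webservers"},
    {"name": "db01", "ip": "10.0.2.10", "role": "databases"},
]


def build_inventory(instances):
    """Group hosts by role and collect per-host variables under _meta."""
    inventory = {"_meta": {"hostvars": {}}}
    for inst in instances:
        group = inventory.setdefault(inst["role"], {"hosts": []})
        group["hosts"].append(inst["name"])
        inventory["_meta"]["hostvars"][inst["name"]] = {"ansible_host": inst["ip"]}
    return inventory


def respond(argv, instances):
    """Return the JSON text for `--list` (the default) or `--host <name>`."""
    inventory = build_inventory(instances)
    if len(argv) == 2 and argv[0] == "--host":
        return json.dumps(inventory["_meta"]["hostvars"].get(argv[1], {}))
    return json.dumps(inventory, indent=2)
```

In a real script you would print `respond(sys.argv[1:], ...)` to stdout. Because `_meta` is included in the `--list` output, Ansible does not need to call `--host` once per machine.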

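Core modules in depth

Ansible has thousands of modules, but in day-to-day work you will probably use only a few dozen. Here are the most important ones, along with tips for using them in production.

File modules

The copy module is the most basic one. Let's start from the command line:

# Basic file copy
ansible webservers -m copy -a "src=/etc/hosts dest=/tmp/hosts"

# Copy a file and set ownership and permissions
ansible webservers -m copy -a "src=nginx.conf dest=/etc/nginx/nginx.conf owner=root group=root mode=0644"

# Copy a file and back up the original
ansible webservers -m copy -a "src=nginx.conf dest=/etc/nginx/nginx.conf backup=yes"

# Write content directly to a file
ansible webservers -m copy -a "content='Hello World' dest=/tmp/hello.txt"

# Copy a file and validate it first (%s is replaced by the temporary file path)
ansible webservers -m copy -a "src=nginx.conf dest=/etc/nginx/nginx.conf validate='nginx -t -c %s'"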

In production I often need to copy a lot of configuration files. Once we had to update the SSL certificate on 200 web servers, and the copy module handled it with one command per file:

# Roll out the new SSL certificate
ansible webservers -m copy -a "src=/etc/ssl/certs/new-cert.pem dest=/etc/ssl/certs/server.crt backup=yes" --become

# Update the private key as well
ansible webservers -m copy -a "src=/etc/ssl/private/new-key.pem dest=/etc/ssl/private/server.key mode=0600 backup=yes" --become

The file module handles file and directory operations. I use it all the time:

# Create a directory
ansible webservers -m file -a "path=/opt/app state=directory owner=app group=app mode=0755"

# Create nested directories (state=directory creates missing parents on its own)
ansible webservers -m file -a "path=/opt/app/logs/nginx state=directory mode=0755"

# Create a symlink (src is what the link points to, dest is the link itself)
ansible webservers -m file -a "src=/opt/app/releases/v1.2.3 dest=/opt/app/current state=link"

# Delete a file or directory
ansible webservers -m file -a "path=/tmp/old_files state=absent"

# Change file ownership and permissions
ansible webservers -m file -a "path=/etc/app/config.ini owner=app group=app mode=0600"

# Create an empty file
ansible webservers -m file -a "path=/var/log/app.log state=touch owner=app group=app"

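I remember that after one system upgrade, the log directory permissions were wrong on every server and the application could not write its logs. I fixed it in bulk with the file module:

# Recursively fix ownership and permissions (capital X adds execute only to directories, so log files do not become executable)
ansible webservers -m file -a "path=/var/log/app state=directory owner=app group=app mode='u=rwX,g=rX,o=rX' recurse=yes" --become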

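Package management modules

Package management is an important topic in production. Different operating systems use different package managers, but the package module detects the right one automatically (package names can still differ between distributions):

# Install a single package
ansible webservers -m package -a "name=nginx state=present" --become

# Install several packages
ansible webservers -m package -a "name=nginx,mysql-server,git state=present" --become

# Install a specific version (yum-specific syntax)
ansible centos_servers -m yum -a "name=nginx-1.20.2 state=present" --become

# Remove a package
ansible webservers -m package -a "name=apache2 state=absent" --become

# Upgrade all packages
ansible webservers -m package -a "name='*' state=latest" --become

# Refresh the package cache
ansible ubuntu_servers -m apt -a "update_cache=yes cache_valid_time=3600" --become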

I once ran a project that required installing Docker on more than 100 servers. This is how I did it:

# Install the dependencies first
ansible centos_servers -m yum -a "name=yum-utils,device-mapper-persistent-data,lvm2 state=present" --become

# Add the Docker repository
ansible centos_servers -m yum_repository -a "name=docker-ce description='Docker CE Repository' baseurl=https://download.docker.com/linux/centos/7/x86_64/stable gpgcheck=yes gpgkey=https://download.docker.com/linux/centos/gpg enabled=yes" --become

# Install Docker CE
ansible centos_servers -m yum -a "name=docker-ce,docker-ce-cli,containerd.io state=present" --become

Sometimes you need to install software from a local RPM package:

# Copy the RPM to the target servers first
ansible webservers -m copy -a "src=/tmp/custom-app-1.0.0.rpm dest=/tmp/"

# Install the local RPM
ansible webservers -m yum -a "name=/tmp/custom-app-1.0.0.rpm state=present" --become

Service management modules

The service module can behave differently across systems, so on systemd hosts I prefer the systemd module:

# Start a service
ansible webservers -m systemd -a "name=nginx state=started" --become

# Stop a service
ansible webservers -m systemd -a "name=nginx state=stopped" --become

# Restart a service
ansible webservers -m systemd -a "name=nginx state=restarted" --become

# Reload the service configuration
ansible webservers -m systemd -a "name=nginx state=reloaded" --become

# Enable the service at boot
ansible webservers -m systemd -a "name=nginx enabled=yes" --become

# Disable the service at boot
ansible webservers -m systemd -a "name=nginx enabled=no" --become

# Reload the systemd unit files
ansible webservers -m systemd -a "daemon_reload=yes" --become

# Check service status (the systemd module expects state, enabled, or a similar option, so query systemctl directly)
ansible webservers -m command -a "systemctl is-active nginx"

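In production I often need to restart or reload services in bulk, for example after updating the nginx configuration:

# Validate the configuration syntax first
ansible webservers -m shell -a "nginx -t" --become

# If the syntax is fine, reload the configuration
ansible webservers -m systemd -a "name=nginx state=reloaded" --become

Sometimes a service needs a little while after starting before it works properly. Combine it with the wait_for module:

# Start the service
ansible webservers -m systemd -a "name=webapp state=started" --become

# Wait for the service port to open
ansible webservers -m wait_for -a "port=8080 delay=5 timeout=30"

# Check that the service responds correctly
ansible webservers -m uri -a "url=http://localhost:8080/health method=GET status_code=200"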

User management modules

In production, user management is an important security topic:

# Create a user
ansible webservers -m user -a "name=webapp shell=/bin/bash home=/opt/webapp create_home=yes" --become

# Create a system user
ansible webservers -m user -a "name=nginx system=yes shell=/sbin/nologin home=/var/lib/nginx create_home=no" --become

# Set a user's password (pass a crypt hash, not plain text; the outer single quotes stop the local shell from expanding the $ signs)
ansible webservers -m user -a 'name=webapp password="$6$rounds=656000$salt$hash"' --become

# Add a user to groups
ansible webservers -m user -a "name=webapp groups=docker,sudo append=yes" --become

# Delete a user
ansible webservers -m user -a "name=olduser state=absent remove=yes" --become

# Change a user's shell
ansible webservers -m user -a "name=webapp shell=/bin/zsh" --become

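Creating a dedicated application user is standard practice for me in production. Create the group first, because the user command refers to it:

# Create the group
ansible webservers -m group -a "name=webapp gid=1001 system=no" --become

# Create the application user
ansible webservers -m user -a "name=webapp uid=1001 group=webapp shell=/bin/bash home=/opt/webapp create_home=yes system=no" --become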

SSH key management

# Add an SSH public key to a user
ansible webservers -m authorized_key -a "user=webapp key='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAB...'" --become

# Read the public key from a file on the control node (escaped double quotes avoid clashing with the single quotes inside the lookup)
ansible webservers -m authorized_key -a "user=webapp key=\"{{ lookup('file', '~/.ssh/id_rsa.pub') }}\"" --become

# Remove an SSH key
ansible webservers -m authorized_key -a "user=webapp key='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAB...' state=absent" --become

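Command execution modules

The difference between the shell and command modules matters, so let me spell it out:

# The command module does not go through a shell, so pipes and redirects are unavailable; that makes it safer
ansible webservers -m command -a "df -h /"

# Get the system load
ansible webservers -m command -a "uptime"

# Check memory usage
ansible webservers -m command -a "free -m"

# The shell module supports pipes, redirects, and other shell features
ansible webservers -m shell -a "ps aux | grep nginx | grep -v grep | wc -l"

# Run a compound command through the shell
ansible webservers -m shell -a "cd /opt/app && ./deploy.sh"

# Run a command with an environment variable set
ansible webservers -m shell -a "APP_ENV=production /opt/app/start.sh"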
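
The command versus shell distinction is the same one Python's subprocess module draws between running an argument list directly and handing a string to a shell. The sketch below is an analogy under that assumption, not Ansible's implementation; the function names are hypothetical, and it assumes a POSIX system with `echo` and `wc` available.

```python
import subprocess


def run_like_command(argv):
    """Run a program directly from an argument list, with no shell involved."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_like_shell(line):
    """Hand the whole line to /bin/sh, so pipes and redirects are interpreted."""
    result = subprocess.run(line, shell=True, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Without a shell, "|" is just another argument passed to echo
literal = run_like_command(["echo", "hello", "|", "wc", "-c"])

# With a shell, the pipe is real and wc counts the bytes of "hello\n"
piped = run_like_shell("echo hello | wc -c")
```

That is also why command is the safer default: nothing in the arguments can be reinterpreted as shell syntax.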

In production I often need to run more involved ops one-liners:

# Check disk usage
ansible webservers -m shell -a "df -h | grep -v tmpfs | awk '{print \$5 \" \" \$6}' | grep -v Use"

# Find large files
ansible webservers -m shell -a "find /var/log -type f -size +100M -exec ls -lh {} \;"

# Clean up old log files
ansible webservers -m shell -a "find /var/log -name '*.log' -mtime +7 -delete"

# Get basic server information
ansible webservers -m shell -a "echo 'CPU:' && nproc && echo 'Memory:' && free -h | grep Mem && echo 'Disk:' && df -h /"

Network and HTTP modules

The uri module is very handy for API calls and health checks:

# Simple HTTP GET request
ansible webservers -m uri -a "url=http://localhost/health method=GET"

# Check the API response status
ansible webservers -m uri -a "url=http://localhost:8080/api/status method=GET status_code=200"

# Send data with a POST request
ansible webservers -m uri -a "url=http://localhost:8080/api/restart method=POST body='{}' body_format=json"

# Download a file
ansible webservers -m get_url -a "url=https://releases.example.com/app-1.2.3.tar.gz dest=/tmp/"

# HTTP request with authentication (headers is a dictionary, so pass the arguments as JSON)
ansible webservers -m uri -a '{"url": "https://api.example.com/status", "method": "GET", "headers": {"Authorization": "Bearer token123"}}'

I often use the uri module for health checks:

# Check that the web service is healthy (ad-hoc commands do not gather facts, so use an inventory variable such as ansible_host rather than ansible_default_ipv4)
ansible webservers -m uri -a "url=http://{{ ansible_host }}/health method=GET status_code=200 timeout=10"

# Check an API endpoint
ansible webservers -m uri -a "url=http://localhost:8080/api/ping method=GET return_content=yes" --one-line

Database modules

For MySQL database management (passwords typed on the command line end up in your shell history, so treat these as examples only):

# Create a database
ansible db_servers -m mysql_db -a "name=webapp_db state=present login_user=root login_password=password" --become

# Drop a database
ansible db_servers -m mysql_db -a "name=old_db state=absent login_user=root login_password=password" --become

# Create a database user
ansible db_servers -m mysql_user -a "name=webapp_user password=secret_password priv='webapp_db.*:ALL' host='%' login_user=root login_password=password" --become

# Drop a database user
ansible db_servers -m mysql_user -a "name=old_user state=absent login_user=root login_password=password" --become

Text processing modules

The lineinfile module is especially useful for editing configuration files:

# Replace a line in a configuration file
ansible webservers -m lineinfile -a "path=/etc/nginx/nginx.conf regexp='^worker_processes' line='worker_processes auto;'" --become

# Append a line to the end of a file
ansible webservers -m lineinfile -a "path=/etc/hosts line='192.168.1.100 app.example.com'" --become

# Delete matching lines
ansible webservers -m lineinfile -a "path=/etc/hosts regexp='old.example.com' state=absent" --become

# Insert a line at a specific position
ansible webservers -m lineinfile -a "path=/etc/ssh/sshd_config line='PermitRootLogin no' insertafter='^#PermitRootLogin'" --become

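The blockinfile module manages whole blocks of configuration:

# Add a block (give it its own marker so it can be found again later)
ansible webservers -m blockinfile -a "path=/etc/nginx/nginx.conf block='upstream backend { server 192.168.1.10:8080; server 192.168.1.11:8080; }' insertbefore='server {' marker='# {mark} ANSIBLE MANAGED BLOCK - backend'" --become

# Remove the block (the marker must match the one used when the block was added)
ansible webservers -m blockinfile -a "path=/etc/nginx/nginx.conf block='' state=absent marker='# {mark} ANSIBLE MANAGED BLOCK - backend'" --become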
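
The marker lines are what make a managed block findable on later runs. Here is a simplified Python sketch of that idea: text between a BEGIN and an END marker is replaced or removed, and everything else is left alone. `set_block` is a hypothetical helper, and unlike the real module it always appends new blocks at the end of the text.

```python
def set_block(text, block, label, present=True):
    """Insert, replace, or remove the marker-delimited block named `label`."""
    begin = f"# BEGIN ANSIBLE MANAGED BLOCK - {label}"
    end = f"# END ANSIBLE MANAGED BLOCK - {label}"
    lines = text.splitlines()
    if begin in lines and end in lines:
        start, stop = lines.index(begin), lines.index(end)
        del lines[start:stop + 1]      # drop the old block, markers included
        insert_at = start
    else:
        insert_at = len(lines)         # no existing block: append at the end
    if present:
        lines[insert_at:insert_at] = [begin, *block.splitlines(), end]
    return "\n".join(lines) + "\n"
```

Running it twice with the same block leaves the text unchanged, and removing with the same label restores the original, which is why the marker used for removal has to match the one used for insertion.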

Gathering system information

The setup module collects system facts, which is especially useful when writing conditionals:

# Gather all facts
ansible webservers -m setup

# Gather network facts only
ansible webservers -m setup -a "gather_subset=network"

# Gather hardware facts only
ansible webservers -m setup -a "gather_subset=hardware"

# Filter for a specific fact
ansible webservers -m setup -a "filter=ansible_default_ipv4"

# Gather custom facts
ansible webservers -m setup -a "fact_path=/etc/ansible/facts.d"

I often use the setup module to pull basic information about a server:

# Get the server's IP address
ansible webservers -m setup -a "filter=ansible_default_ipv4" | grep address

# Get memory information
ansible webservers -m setup -a "filter=ansible_memtotal_mb"

# Get the number of CPU cores
ansible webservers -m setup -a "filter=ansible_processor_cores"

Advanced usage and combining modules

In real work I often need to chain several modules together, for example when deploying an application:

# 1. Stop the service
ansible webservers -m systemd -a "name=webapp state=stopped" --become

# 2. Back up the current version (the $(date ...) is expanded by the local shell, so every host gets the same timestamp)
ansible webservers -m shell -a "cp -r /opt/webapp/current /opt/webapp/backup-$(date +%Y%m%d-%H%M%S)" --become

# 3. Download the new version
ansible webservers -m get_url -a "url=https://releases.example.com/webapp-1.2.3.tar.gz dest=/tmp/" --become

# 4. Unpack the new version
ansible webservers -m unarchive -a "src=/tmp/webapp-1.2.3.tar.gz dest=/opt/webapp/releases/ remote_src=yes owner=webapp group=webapp" --become

# 5. Update the symlink
ansible webservers -m file -a "src=/opt/webapp/releases/webapp-1.2.3 dest=/opt/webapp/current state=link force=yes" --become

# 6. Start the service
ansible webservers -m systemd -a "name=webapp state=started" --become

# 7. Check the service status
ansible webservers -m uri -a "url=http://localhost:8080/health method=GET status_code=200"
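
Step 5 is the heart of this release layout: every version lives in its own directory, and `current` is just a symlink that gets repointed. As a rough illustration in plain Python (this is not how the file module is implemented, and `switch_release` and `demo` are hypothetical names), the switch can be done by building a new link under a temporary name and renaming it over the old one:

```python
import os
import tempfile


def switch_release(app_home, version):
    """Point <app_home>/current at releases/<version>; return the new link target."""
    target = os.path.join(app_home, "releases", version)
    current = os.path.join(app_home, "current")
    tmp_link = current + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)          # clear a leftover from an interrupted run
    os.symlink(target, tmp_link)     # build the new link under a temporary name
    os.replace(tmp_link, current)    # rename it over the old link in one step
    return os.readlink(current)


def demo():
    """Create two fake releases in a temp dir and switch from one to the other."""
    with tempfile.TemporaryDirectory() as app_home:
        for version in ("webapp-1.2.2", "webapp-1.2.3"):
            os.makedirs(os.path.join(app_home, "releases", version))
        switch_release(app_home, "webapp-1.2.2")
        return os.path.basename(switch_release(app_home, "webapp-1.2.3"))
```

Rolling back is the same operation with the previous version name, which is what makes this layout attractive for deployments.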

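Capturing output without register

You cannot use register directly in ad-hoc commands, but a few tricks get you the command output anyway:

# Show command output in a compact one-line format
ansible webservers -m shell -a "df -h /" --one-line

# Check whether a file exists
ansible webservers -m stat -a "path=/etc/nginx/nginx.conf"

# Get the service status
ansible webservers -m command -a "systemctl is-active nginx"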

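Bulk operation tips

In production I often need to run different operations against different servers:

# Run on specific hosts only
ansible webservers -l "web01,web02" -m systemd -a "name=nginx state=restarted" --become

# Exclude specific hosts (single quotes keep an interactive bash from treating ! as history expansion)
ansible webservers -l '!web03' -m package -a "name=nginx state=latest" --become

# Use pattern matching
ansible "web*" -m shell -a "uptime"

# Run different operations on different groups
ansible databases -m systemd -a "name=mysql state=restarted" --become
ansible webservers -m systemd -a "name=nginx state=reloaded" --become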
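
The pattern and `-l` options above boil down to "select by wildcard, then narrow down or exclude". Here is a deliberately simplified Python model of that behaviour. `select_hosts` is a hypothetical helper, and real Ansible patterns support much more (group names, intersections, regular expressions).

```python
from fnmatch import fnmatch


def select_hosts(hosts, pattern, limit=None):
    """Pick hosts matching `pattern`, then apply a comma-separated limit list."""
    selected = [h for h in hosts if fnmatch(h, pattern)]
    if not limit:
        return selected
    terms = limit.split(",")
    include = [t for t in terms if not t.startswith("!")]
    exclude = [t[1:] for t in terms if t.startswith("!")]
    if include:
        # keep only hosts named by at least one positive term
        selected = [h for h in selected if any(fnmatch(h, t) for t in include)]
    # then drop anything matched by a "!term"
    return [h for h in selected if not any(fnmatch(h, t) for t in exclude)]


ALL_HOSTS = ["web01", "web02", "web03", "db01"]
```

For example, `select_hosts(ALL_HOSTS, "web*", "!web03")` keeps the first two web servers and skips web03.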

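Error handling and debugging

When a command fails, a few options help with debugging:

# Show verbose output
ansible webservers -m shell -a "nginx -t" -vvv --become

# Tolerate a failing command (ad-hoc mode has no --ignore-errors flag; a failure on one host does not stop the others, and "|| true" makes the command itself report success)
ansible webservers -m shell -a "some_command_that_might_fail || true"

# Check mode (nothing is actually changed)
ansible webservers -m copy -a "src=test.txt dest=/tmp/" --check

# Show the differences
ansible webservers -m copy -a "src=nginx.conf dest=/etc/nginx/nginx.conf" --check --diff --become

These ad-hoc commands are very useful in day-to-day operations, especially when you need to run a simple task quickly. For anything more complex, though, I still recommend writing a playbook, which is easier to maintain and reuse.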

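Writing high-quality playbooks

A good playbook does more than get the job done: it should also be easy to read, maintain, and debug. Here are some of the techniques I rely on.

Basic structure and naming conventions

---
# Header comment describing what this playbook does
# Deploy web application to production servers
# Author: Your Name
# Date: 2024-01-01

- name: Deploy web application  # a clear play name
  hosts: webservers
  become: yes  # whether to escalate privileges with sudo
  gather_facts: yes  # whether to gather system facts

  vars:
    # variables defined in the playbook
    app_name: webapp
    app_version: "1.2.3"
  
  vars_files:
    # external variable files
    - vars/common.yml
    - vars/{{ env }}.yml
  
  pre_tasks:
    # tasks that run before the main tasks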
    - name: Check system requirements
      assert:
        that:
          - ansible_memtotal_mb >= 2048
          - ansible_architecture == "x86_64"
        fail_msg: "System does not meet minimum requirements"
      
  tasks:
    # main task list
    - name: Install required packages
      package:
        name: "{{ item }}"
        state: present
      loop: "{{ required_packages }}"
      tags: 
        - packages
        - install
      
  post_tasks:
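    # tasks that run after the main tasks
    - name: Verify application is running
      uri:
        url: "http://{{ ansible_default_ipv4.address }}/health"
        method: GET
        status_code: 200
      register: health_check
      until: health_check is succeeded  # older releases ignore retries when until is missing
      retries: 5
      delay: 10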
    
  handlers:
    # handlers, triggered by notify
    - name: restart webapp
      systemd:
        name: webapp
        state: restarted

Error handling and debugging

In production, error handling is extremely important:

- name: Download application package
  get_url:
    url: "{{ app_download_url }}"
    dest: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
    timeout: 300
  register: download_result
  retries: 3
  delay: 10
  until: download_result is succeeded

- name: Extract application package
  unarchive:
    src: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
    dest: /opt/app/releases/
    remote_src: yes
    creates: "/opt/app/releases/{{ app_version }}"  # skip if the directory already exists
  register: extract_result
  ignore_errors: yes  # keep going so the cleanup task below can run

- name: Clean up after a failed extraction
  file:
    path: "/opt/app/releases/{{ app_version }}"
    state: absent
  when: extract_result is failed

- name: Verify extraction
  stat:
    path: "/opt/app/releases/{{ app_version }}/app.py"
  register: app_file
  failed_when: not app_file.stat.exists
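
The register / until / retries / delay combination used for the download task is a retry loop. Here is a simplified Python model of it. `run_until` is a hypothetical helper, and the exact attempt counting in Ansible may differ from this sketch, which makes one initial attempt plus up to `retries` retries.

```python
import time


def run_until(task, succeeded, retries=3, delay=0.0, sleep=time.sleep):
    """Call `task` until `succeeded(result)` is true or the retries run out.

    Returns the last result and the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        result = task()
        if succeeded(result) or attempts > retries:
            return result, attempts
        sleep(delay)  # wait before the next attempt, like `delay:` in a task


def flaky(outcomes):
    """Build a fake task that returns the given outcomes one call at a time."""
    remaining = iter(outcomes)
    return lambda: next(remaining)
```

A download that fails twice and then succeeds comes back as a success after three attempts, while one that never succeeds returns its last failure once the retries are used up.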

Error handling with block

- name: Deploy application with rollback
  block:
    - name: Stop application
      systemd:
        name: webapp
        state: stopped
      
    - name: Update symlink
      file:
        src: "/opt/app/releases/{{ app_version }}"
        dest: /opt/app/current
        state: link
      
    - name: Start application
      systemd:
        name: webapp
        state: started
      
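    - name: Verify application health
      uri:
        url: "http://localhost:8080/health"
        status_code: 200
      register: health_result
      until: health_result is succeeded  # older releases ignore retries when until is missing
      retries: 5
      delay: 10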
    
  rescue:
    # runs only if a task in the block fails
    - name: Rollback to previous version
      file:
        src: "/opt/app/releases/{{ previous_version }}"
        dest: /opt/app/current
        state: link
      
    - name: Restart application
      systemd:
        name: webapp
        state: restarted
      
    - name: Send failure notification
      mail:
        to: ops@company.com
        subject: "Deployment failed on {{ inventory_hostname }}"
        body: "Application deployment failed, rolled back to {{ previous_version }}"
      
  always:
    # runs whether the block succeeded or failed
    - name: Clean up temporary files
      file:
        path: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
        state: absent
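
If you know try / except / finally from a programming language, block / rescue / always maps onto it almost one to one. Here is a small Python analogy; `deploy` and the step names are hypothetical.

```python
def deploy(steps, rollback, cleanup):
    """Run steps in order; on failure roll back; always clean up. Returns a log."""
    log = []
    try:                                  # block:
        for name, action in steps:
            action()
            log.append(f"ok: {name}")
    except Exception as exc:              # rescue:
        log.append(f"failed: {exc}")
        rollback()
        log.append("rescued: rollback")
    finally:                              # always:
        cleanup()
        log.append("always: cleanup")
    return log


def noop():
    pass


def failing_health_check():
    raise RuntimeError("health check failed")
```

As with an exception that has been caught, a host whose rescue section completes is no longer treated as failed, so the rest of the play carries on for it.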

Advanced conditionals and loops

# Compound conditions
- name: Install database server
  package:
    name: "{{ db_package }}"
    state: present
  vars:
    db_package: "{% if ansible_os_family == 'RedHat' %}mysql-server{% elif ansible_os_family == 'Debian' %}mysql-server-8.0{% endif %}"
  when: 
    - inventory_hostname in groups['databases']
    - install_database | default(true)
    - ansible_memtotal_mb > 2048

# Conditions inside a loop
- name: Create users
  user:
    name: "{{ item.name }}"
    groups: "{{ item.groups | default([]) }}"
    shell: "{{ item.shell | default('/bin/bash') }}"
    state: "{{ item.state | default('present') }}"
  loop: "{{ users }}"
  when: 
    - item.name is defined
    - item.state | default('present') == 'present'

# Looping over a dictionary
- name: Configure virtual hosts
  template:
    src: vhost.conf.j2
    dest: "/etc/nginx/sites-available/{{ item.key }}"
  loop: "{{ vhosts | dict2items }}"
  notify: restart nginx

# Nested loops
- name: Install packages for each environment
  package:
    name: "{{ item.1 }}"
    state: present
  loop: "{{ environments | subelements('packages') }}"
  vars:
    environments:
      - name: development
        packages: ['git', 'vim', 'curl']
      - name: production
        packages: ['nginx', 'mysql-server']
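
The subelements filter in the nested-loop task pairs each parent item with each entry of one of its lists, which is why the task reads `item.1` for the package name (and could read `item.0.name` for the environment). A rough Python equivalent, for illustration only:

```python
def subelements(items, key):
    """Pair every item with each element of its `key` list, in order."""
    return [(item, element) for item in items for element in item[key]]


environments = [
    {"name": "development", "packages": ["git", "vim", "curl"]},
    {"name": "production", "packages": ["nginx", "mysql-server"]},
]

pairs = subelements(environments, "packages")
# Each pair is (environment dict, package name), matching item.0 and item.1
package_names = [package for _, package in pairs]
```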

Advanced variable and template techniques

# Set a variable dynamically with set_fact
- name: Determine database master
  set_fact:
    db_master: "{{ item }}"
  loop: "{{ groups['databases'] }}"
  when: hostvars[item]['mysql_role'] == 'master'

# Read external data with a lookup plugin
- name: Read password from file
  set_fact:
    db_password: "{{ lookup('file', '/etc/ansible/secrets/db_password') }}"

# Process data with filters
- name: Show formatted information
  debug:
    msg: |
      Server: {{ inventory_hostname }}
      IP: {{ ansible_default_ipv4.address }}
      Memory: {{ (ansible_memtotal_mb / 1024) | round(1) }} GB
      Disk: {{ ansible_mounts | selectattr('mount', 'equalto', '/') | map(attribute='size_total') | first | filesizeformat }}
      Uptime: {{ (ansible_uptime_seconds | int / 86400) | round(1) }} days

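Case study: a complete web application deployment

Let me share a complete production deployment. This is the setup for an e-commerce site I was responsible for, covering load balancers, web servers, databases, and caches.

Project directory layout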

webapp-deploy/
├── ansible.cfg
├── inventories/
│   ├── production/
│   │   ├── hosts
│   │   ├── group_vars/
│   │   │   ├── all.yml
│   │   │   ├── webservers.yml
│   │   │   ├── databases.yml
│   │   │   └── loadbalancers.yml
│   │   └── host_vars/
│   └── staging/
├── roles/
│   ├── common/
│   ├── nginx/
│   ├── mysql/
│   ├── redis/
│   └── webapp/
├── playbooks/
│   ├── site.yml
│   ├── deploy.yml
│   └── maintenance.yml
├── group_vars/
│   └── all.yml
└── vars/
    ├── secrets.yml
    └── common.yml

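The main site.yml file

---
# Full site deployment playbook.
# All plays are defined inline below. If you would rather keep one file per
# tier (common, databases, cache, webservers, loadbalancers, monitoring),
# replace them with import_playbook lines such as
# "- import_playbook: databases.yml", resolved relative to this file.
# Do one or the other, otherwise every play runs twice.
# This file lives in playbooks/ while roles/ sits at the project root,
# so set roles_path = roles in the project's ansible.cfg.

# Common configuration play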
- name: Configure all servers
  hosts: all
  become: yes
  roles:
    - common
  tags:
    - common
    - base

# Database server configuration
- name: Configure database servers
  hosts: databases
  become: yes
  serial: 1  # one host at a time, so the databases are never all down at once
  roles:
    - mysql
  tags:
    - database
    - mysql

# Web server configuration
- name: Configure web servers
  hosts: webservers
  become: yes
  roles:
    - nginx
    - webapp
  tags:
    - web
    - nginx
    - app

# Load balancer configuration
- name: Configure load balancers
  hosts: loadbalancers
  become: yes
  roles:
    - nginx
  tags:
    - lb
    - nginx
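
The `serial: 1` setting on the database play means the hosts are processed in batches of one, and each batch finishes the play before the next one starts. Here is a tiny Python sketch of how a host list is cut into batches for a fixed batch size; `batches` is a hypothetical helper, and it ignores the percentage and list forms that `serial` also accepts.

```python
def batches(hosts, serial):
    """Split the host list into consecutive batches of at most `serial` hosts."""
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


db_hosts = ["db01", "db02", "db03"]
one_by_one = batches(db_hosts, 1)   # what serial: 1 does to the databases group
in_pairs = batches(["web01", "web02", "web03", "web04", "web05"], 2)
```

With `serial: 1`, if something goes wrong on db01 the play can stop before it ever touches db02 and db03.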

Common role (roles/common/tasks/main.yml)

---
- name: Update system packages
  package:
    name: "*"
    state: latest
  when: update_packages | default(false)
  tags: packages

- name: Install essential packages
  package:
    name: "{{ essential_packages }}"
    state: present
  tags: packages

- name: Configure timezone
  timezone:
    name: "{{ system_timezone | default('Asia/Shanghai') }}"
  notify: restart rsyslog

- name: Configure NTP
  template:
    src: chrony.conf.j2
    dest: /etc/chrony.conf
    backup: yes
  notify: restart chronyd
  when: ansible_os_family == "RedHat"

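- name: Start and enable chronyd
  systemd:
    name: chronyd
    state: started
    enabled: yes
  when: ansible_os_family == "RedHat"  # same condition as the config task; Debian names the unit chrony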

- name: Create common directories
  file:
    path: "{{ item }}"
    state: directory
    owner: root
    group: root
    mode: '0755'
  loop:
    - /opt/scripts
    - /var/log/ansible
    - /etc/ansible/facts.d

- name: Configure system limits
  pam_limits:
    domain: "{{ item.domain }}"
    limit_type: "{{ item.type }}"
    limit_item: "{{ item.item }}"
    value: "{{ item.value }}"
  loop:
    - { domain: '*', type: 'soft', item: 'nofile', value: '65536' }
    - { domain: '*', type: 'hard', item: 'nofile', value: '65536' }
    - { domain: '*', type: 'soft', item: 'nproc', value: '32768' }
    - { domain: '*', type: 'hard', item: 'nproc', value: '32768' }

- name: Configure kernel parameters
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    reload: yes
  loop:
    - { name: 'net.core.somaxconn', value: '32768' }
    - { name: 'net.ipv4.tcp_max_syn_backlog', value: '32768' }
    - { name: 'net.ipv4.tcp_fin_timeout', value: '10' }
    - { name: 'net.ipv4.tcp_keepalive_time', value: '1200' }
    - { name: 'vm.swappiness', value: '10' }

- name: Configure SSH hardening
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: "{{ item.regexp }}"
    line: "{{ item.line }}"
    backup: yes
    validate: /usr/sbin/sshd -t -f %s  # refuse to write a config that sshd cannot parse
  loop:
    - { regexp: '^#?PermitRootLogin', line: 'PermitRootLogin no' }
    - { regexp: '^#?PasswordAuthentication', line: 'PasswordAuthentication no' }
    - { regexp: '^#?MaxAuthTries', line: 'MaxAuthTries 3' }
    - { regexp: '^#?ClientAliveInterval', line: 'ClientAliveInterval 300' }
    - { regexp: '^#?ClientAliveCountMax', line: 'ClientAliveCountMax 2' }
  notify: restart sshd

- name: Setup logrotate for application logs
  template:
    src: app-logrotate.j2
    dest: /etc/logrotate.d/webapp
    owner: root
    group: root
    mode: '0644'

- name: Install custom monitoring script
  template:
    src: system_monitor.sh.j2
    dest: /opt/scripts/system_monitor.sh
    owner: root
    group: root
    mode: '0755'

- name: Setup cron job for monitoring
  cron:
    name: "System monitoring"
    minute: "*/5"
    job: "/opt/scripts/system_monitor.sh"
    user: root

MySQL role (roles/mysql/tasks/main.yml)

---
- name: Install MySQL packages
  package:
    name: "{{ mysql_packages }}"
    state: present

- name: Start and enable MySQL
  systemd:
    name: "{{ mysql_service }}"
    state: started
    enabled: yes

- name: Check if MySQL root password is set
  shell: mysql -u root -e "SELECT 1"
  register: mysql_root_check
  failed_when: false
  changed_when: false
  no_log: true

- name: Set MySQL root password
  mysql_user:
    name: root
    password: "{{ mysql_root_password }}"
    login_unix_socket: /var/lib/mysql/mysql.sock
  when: mysql_root_check.rc == 0  # passwordless root login still works, so no password is set yet
  no_log: true

- name: Create MySQL configuration file
  template:
    src: my.cnf.j2
    dest: /etc/my.cnf
    backup: yes
    owner: root
    group: root
    mode: '0644'
  notify: restart mysql

- name: Create MySQL databases
  mysql_db:
    name: "{{ item.name }}"
    encoding: "{{ item.encoding | default('utf8mb4') }}"
    collation: "{{ item.collation | default('utf8mb4_unicode_ci') }}"
    login_user: root
    login_password: "{{ mysql_root_password }}"
    state: present
  loop: "{{ mysql_databases }}"
  no_log: true

- name: Create MySQL users
  mysql_user:
    name: "{{ item.name }}"
    password: "{{ item.password }}"
    priv: "{{ item.priv }}"
    host: "{{ item.host | default('localhost') }}"
    login_user: root
    login_password: "{{ mysql_root_password }}"
    state: present
  loop: "{{ mysql_users }}"
  no_log: true

# The paths below follow the RedHat layout used by /etc/my.cnf above;
# Debian-family systems use /etc/mysql/conf.d/ instead
- name: Configure MySQL replication (master)
  template:
    src: master.cnf.j2
    dest: /etc/my.cnf.d/master.cnf
  when: mysql_role == 'master'
  notify: restart mysql

- name: Configure MySQL replication (slave)
  template:
    src: slave.cnf.j2
    dest: /etc/my.cnf.d/slave.cnf
  when: mysql_role == 'slave'
  notify: restart mysql

- name: Create replication user (master only)
  mysql_user:
    name: "{{ mysql_replication_user }}"
    password: "{{ mysql_replication_password }}"
    priv: "*.*:REPLICATION SLAVE"
    host: "%"
    login_user: root
    login_password: "{{ mysql_root_password }}"
    state: present
  when: mysql_role == 'master'
  no_log: true

- name: Setup MySQL backup script
  template:
    src: mysql_backup.sh.j2
    dest: /opt/scripts/mysql_backup.sh
    owner: root
    group: root
    mode: '0700'

- name: Setup MySQL backup cron job
  cron:
    name: "MySQL backup"
    minute: "0"
    hour: "2"
    job: "/opt/scripts/mysql_backup.sh"
    user: root
  when: mysql_role == 'master'

Application deployment role (roles/webapp/tasks/main.yml)

---
- name: Create application user
  user:
    name: "{{ app_user }}"
    system: yes
    shell: /bin/bash
    home: "{{ app_home }}"
    create_home: yes

- name: Create application directories
  file:
    path: "{{ item }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: '0755'
  loop:
    - "{{ app_home }}"
    - "{{ app_home }}/releases"
    - "{{ app_home }}/shared"
    - "{{ app_home }}/shared/logs"
    - "{{ app_home }}/shared/config"
    - "{{ app_home }}/shared/uploads"

- name: Install Python and pip
  package:
    name:
      - python3
      - python3-pip
      - python3-venv
      - git
    state: present

- name: Check if application is already deployed
  stat:
    path: "{{ app_home }}/current"
  register: current_release

- name: Get current release version
  shell: readlink {{ app_home }}/current | xargs basename
  register: current_version
  when: current_release.stat.exists
  changed_when: false

- name: Set previous version fact
  set_fact:
    previous_version: "{{ current_version.stdout | default('none') }}"

- name: Create release directory
  file:
    path: "{{ app_home }}/releases/{{ app_version }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: '0755'

- name: Download application code
  get_url:
    url: "{{ app_download_url }}"
    dest: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
    timeout: 300
  register: download_result
  until: download_result is succeeded
  retries: 3
  delay: 10

- name: Extract application code
  unarchive:
    src: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
    dest: "{{ app_home }}/releases/{{ app_version }}"
    remote_src: yes
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    extra_opts: [--strip-components=1]

- name: Create virtual environment
  pip:
    requirements: "{{ app_home }}/releases/{{ app_version }}/requirements.txt"
    virtualenv: "{{ app_home }}/releases/{{ app_version }}/venv"
    virtualenv_command: python3 -m venv  # uses the python3-venv package installed above
  become_user: "{{ app_user }}"

- name: Generate application config
  template:
    src: "{{ item.src }}"
    dest: "{{ app_home }}/shared/config/{{ item.dest }}"
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: '0600'
  loop:
    - { src: 'app_config.py.j2', dest: 'config.py' }
    - { src: 'database.ini.j2', dest: 'database.ini' }
  notify: restart webapp

- name: Create symlinks to shared resources
  file:
    src: "{{ app_home }}/shared/{{ item.src }}"
    dest: "{{ app_home }}/releases/{{ app_version }}/{{ item.dest }}"
    state: link
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    force: yes
  loop:
    - { src: 'logs', dest: 'logs' }
    - { src: 'config/config.py', dest: 'config.py' }
    - { src: 'config/database.ini', dest: 'database.ini' }
    - { src: 'uploads', dest: 'static/uploads' }

- name: Run database migrations
  shell: |
    source {{ app_home }}/releases/{{ app_version }}/venv/bin/activate
    python manage.py migrate
  args:
    chdir: "{{ app_home }}/releases/{{ app_version }}"
    # `source` is a bash builtin; the default /bin/sh may not support it
    executable: /bin/bash
  become_user: "{{ app_user }}"
  register: migration_result
  when: run_migrations | default(true)

- name: Collect static files
  shell: |
    source {{ app_home }}/releases/{{ app_version }}/venv/bin/activate
    python manage.py collectstatic --noinput
  args:
    chdir: "{{ app_home }}/releases/{{ app_version }}"
    executable: /bin/bash
  become_user: "{{ app_user }}"
  when: collect_static | default(true)

- name: Create systemd service file
  template:
    src: webapp.service.j2
    dest: "/etc/systemd/system/{{ app_name }}.service"
    owner: root
    group: root
    mode: '0644'
  notify:
    - reload systemd
    - restart webapp

- name: Update current symlink
  file:
    src: "{{ app_home }}/releases/{{ app_version }}"
    dest: "{{ app_home }}/current"
    state: link
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    force: yes
  notify: restart webapp

- name: Start and enable application service
  systemd:
    name: "{{ app_name }}"
    state: started
    enabled: yes
    daemon_reload: yes

- name: Wait for application to start
  wait_for:
    port: "{{ app_port }}"
    host: 127.0.0.1
    delay: 5
    timeout: 60

- name: Verify application health
  uri:
    url: "http://127.0.0.1:{{ app_port }}/health"
    method: GET
    status_code: 200
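  register: health_check
  # Before ansible-core 2.16, retries only take effect together with until
  until: health_check is succeeded
  retries: 5
  delay: 10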

- name: Clean up old releases
  shell: |
    cd {{ app_home }}/releases
    ls -t | tail -n +{{ keep_releases | default(5) + 1 }} | xargs -r rm -rf
  become_user: "{{ app_user }}"
  when: cleanup_releases | default(true)

- name: Clean up downloaded archive
  file:
    path: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
    state: absent
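
The "Clean up old releases" task above boils down to: sort the release directories by modification time, keep the newest N, and delete the rest. Here is a minimal Python sketch of just that selection logic, in case you want to reason about it or test it outside Ansible. The function name `releases_to_delete` and the dict-of-mtimes input are my own assumptions for illustration; they are not part of the role.

```python
def releases_to_delete(releases, keep=5):
    """Return the release names to remove, keeping the `keep` newest.

    `releases` maps a release name to its modification time (epoch seconds).
    Both the name and the input shape are hypothetical, for illustration only.
    """
    # Newest first, mirroring `ls -t`
    newest_first = sorted(releases, key=lambda name: releases[name], reverse=True)
    # Everything after the first `keep` entries, mirroring `tail -n +(keep + 1)`
    return newest_first[keep:]


if __name__ == "__main__":
    example = {"1.0.0": 100, "1.1.0": 200, "1.2.0": 300}
    print(releases_to_delete(example, keep=2))
```

If fewer than `keep` releases exist, the slice is empty and nothing gets deleted, which matches what the shell pipeline does thanks to `xargs -r`.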

Nginx configuration template (roles/nginx/templates/webapp.conf.j2)

# {{ ansible_managed }}
# Nginx configuration for {{ app_name }}

upstream {{ app_name }}_backend {
    {% for host in groups['webservers'] %}
    server {{ hostvars[host]['ansible_default_ipv4']['address'] }}:{{ app_port }} weight={{ hostvars[host]['nginx_weight'] | default(1) }};
    {% endfor %}
  
    # Keep idle connections to the backends open for reuse
    keepalive 32;
    keepalive_requests 100;
    keepalive_timeout 60s;
}

# Redirect HTTP to HTTPS
server {
    listen 80;
    server_name {{ app_domain }} www.{{ app_domain }};

    # Let's Encrypt validation
    location /.well-known/acme-challenge/ {
        root /var/www/letsencrypt;
    }
  
    location / {
        return 301 https://$server_name$request_uri;
    }
}

# HTTPS server configuration
server {
    listen 443 ssl http2;
    server_name {{ app_domain }} www.{{ app_domain }};

    # SSL certificate configuration
    ssl_certificate {{ ssl_cert_path }};
    ssl_certificate_key {{ ssl_key_path }};
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384;
    ssl_prefer_server_ciphers on;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;
  
    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Frame-Options DENY always;
    add_header X-Content-Type-Options nosniff always;
    add_header X-XSS-Protection "1; mode=block" always;
  
    # Access and error logs
    access_log /var/log/nginx/{{ app_name }}_access.log combined;
    error_log /var/log/nginx/{{ app_name }}_error.log;
  
    # Client upload limits
    client_max_body_size {{ max_upload_size | default('10M') }};
    client_body_timeout 60s;
    client_header_timeout 60s;
  
    # Gzip compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_types text/plain text/css text/xml text/javascript application/javascript application/xml+rss application/json;
  
    # Static file caching
    location /static/ {
        alias {{ app_home }}/shared/uploads/;
        expires 30d;
        add_header Cache-Control "public, immutable";

        # Never serve script files from this path
        location ~* \.(php|py|pl|sh)$ {
            deny all;
        }
    }

    # Media files
    location /media/ {
        alias {{ app_home }}/shared/uploads/;
        expires 7d;
        add_header Cache-Control "public";
    }
  
    # Proxy to the application
    location / {
        proxy_pass http://{{ app_name }}_backend;
        # Upstream keepalive needs HTTP/1.1 and an empty Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Proxy timeouts
        proxy_connect_timeout 30s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;

        # Buffering
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
        proxy_busy_buffers_size 8k;

        # Retry the next upstream server on errors and timeouts
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 30s;
    }
  
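    # Health check endpoint
    location /nginx-health {
        access_log off;
        # Set the type via default_type; add_header would emit a second Content-Type
        default_type text/plain;
        return 200 "healthy\n";
    }

    # Deny access to sensitive files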
    location ~ /\. {
        deny all;
    }
  
    location ~ \.(sql|log|conf)$ {
        deny all;
    }
}

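Advanced features and best practices

Running Ansible in production means there are quite a few advanced features and best practices to get comfortable with.

Managing secrets with Ansible Vault

In production, passwords, keys, and other sensitive data must never be stored in plain text. Ansible Vault handles the encryption: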

# Create an encrypted file
ansible-vault create vars/secrets.yml

# Edit an encrypted file
ansible-vault edit vars/secrets.yml

# Encrypt an existing file
ansible-vault encrypt vars/passwords.yml

# Decrypt a file
ansible-vault decrypt vars/passwords.yml

# View the contents of an encrypted file
ansible-vault view vars/secrets.yml

Contents of secrets.yml (shown decrypted):

# Database passwords
mysql_root_password: "SuperSecretPassword123!"
mysql_replication_password: "ReplicationPass456!"

# Application secrets
app_secret_key: "django-secret-key-very-long-and-random"
jwt_secret: "jwt-signing-key-also-very-secret"

# API keys
aws_access_key: "AKIAIOSFODNN7EXAMPLE"
aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

# SSL private key password
ssl_key_password: "ssl-private-key-password"

Using it in a playbook:

---
- name: Deploy with secrets
  hosts: webservers
  vars_files:
    - vars/secrets.yml
  tasks:
    - name: Configure database connection
      template:
        src: database.conf.j2
        dest: /etc/app/database.conf
        mode: '0600'
      vars:
        db_password: "{{ mysql_root_password }}"

You need to supply the vault password at run time:

ansible-playbook --ask-vault-pass deploy.yml

# Or use a password file
echo "vault_password" > ~/.vault_pass
chmod 600 ~/.vault_pass
ansible-playbook --vault-password-file ~/.vault_pass deploy.yml
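
One habit that has saved me more than once is checking that a secrets file really is encrypted before it gets committed. Vault-encrypted files start with a header line such as `$ANSIBLE_VAULT;1.1;AES256`, so a rough check can simply look for that prefix. The sketch below assumes exactly that; the helper names `is_vault_encrypted` and `find_unencrypted` are hypothetical, and the check does not validate the ciphertext in any way.

```python
# Every vault-encrypted file begins with this marker, followed by the
# format version and cipher name (for example "1.1;AES256").
VAULT_HEADER_PREFIX = "$ANSIBLE_VAULT;"


def is_vault_encrypted(text):
    """Return True if the content starts with an Ansible Vault header."""
    return text.startswith(VAULT_HEADER_PREFIX)


def find_unencrypted(paths):
    """Return the paths whose content does not start with a vault header.

    Hypothetical helper: could be called from a pre-commit hook with the
    list of files that are supposed to be encrypted.
    """
    plain = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            if not is_vault_encrypted(handle.read()):
                plain.append(path)
    return plain
```

Calling `find_unencrypted(["vars/secrets.yml"])` from a hook and refusing the commit when the list is non-empty is one way to wire it up.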

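Developing custom modules

Sometimes the built-in modules don't cover what you need, and you can write your own. Here is a module I wrote to check application health: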

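#!/usr/bin/python
# library/health_check.py

import time
import traceback

from ansible.module_utils.basic import AnsibleModule, missing_required_lib

# requests is third-party and has to be installed on the managed host,
# so report a clean error instead of a traceback when it is missing
try:
    import requests
    REQUESTS_IMPORT_ERROR = None
except ImportError:
    requests = None
    REQUESTS_IMPORT_ERROR = traceback.format_exc()

def main():
    module = AnsibleModule(
        argument_spec=dict(
            url=dict(type='str', required=True),
            timeout=dict(type='int', default=30),
            retries=dict(type='int', default=3),
            delay=dict(type='int', default=5),
            expected_status=dict(type='int', default=200),
            expected_content=dict(type='str', default=None)
        ),
        supports_check_mode=True
    )

    if requests is None:
        module.fail_json(msg=missing_required_lib('requests'),
                         exception=REQUESTS_IMPORT_ERROR)
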
    url = module.params['url']
    timeout = module.params['timeout']
    retries = module.params['retries']
    delay = module.params['delay']
    expected_status = module.params['expected_status']
    expected_content = module.params['expected_content']
  
    if module.check_mode:
        module.exit_json(changed=False, msg="Check mode - would check health")
  
    last_error = None
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
          
            if response.status_code != expected_status:
                raise Exception(f"Expected status {expected_status}, got {response.status_code}")
          
            if expected_content and expected_content not in response.text:
                raise Exception(f"Expected content '{expected_content}' not found in response")
          
            module.exit_json(
                changed=False,
                msg="Health check passed",
                status_code=response.status_code,
                response_time=response.elapsed.total_seconds(),
                attempt=attempt + 1
            )
          
        except Exception as e:
            last_error = str(e)
            if attempt < retries - 1:
                time.sleep(delay)
            continue
  
    module.fail_json(msg=f"Health check failed after {retries} attempts: {last_error}")

if __name__ == '__main__':
    main()

Using it in a playbook:

- name: Check application health
  health_check:
    url: "http://{{ ansible_default_ipv4.address }}:8080/health"
    timeout: 10
    retries: 5
    delay: 3
    expected_content: "healthy"
  register: health_result

- name: Show health check result
  debug:
    var: health_result
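
The heart of that module is the retry loop, and it is easier to test when it is separated from both Ansible and the HTTP call. Below is a minimal sketch of the same pattern with the check and the sleep function injected. The name `check_with_retries` and its return tuple are assumptions of mine, not something the module above exposes.

```python
import time


def check_with_retries(check, retries=3, delay=5, sleep=time.sleep):
    """Call `check()` up to `retries` times.

    `check` signals failure by raising an exception. Returns a tuple
    (ok, attempts, last_error). `sleep` is injectable so tests need not wait.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            check()
            return True, attempt, None
        except Exception as exc:
            last_error = str(exc)
            # Only wait if another attempt is coming
            if attempt < retries:
                sleep(delay)
    return False, retries, last_error
```

Passing a fake `sleep` keeps the tests instant, and the function never waits after the final failed attempt, just like the module.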

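Performance tuning tips

Performance tuning matters when you manage a large number of servers:

# ansible.cfg performance settings
[defaults]
# Raise the number of parallel forks
forks = 100

# Gather only the fact subsets you need
gather_subset = !all,!any,network,hardware,virtual

# Enable the timing callbacks so you can spot slow tasks
# (this option was called callback_whitelist before ansible-core 2.11)
callbacks_enabled = timer, profile_tasks

[ssh_connection]
# Reuse SSH connections
ssh_args = -C -o ControlMaster=auto -o ControlPersist=3600s
pipelining = True

[inventory]
# Load only the inventory plugins you actually use
enable_plugins = host_list, script, yaml, ini, auto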

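You can also tune the playbook itself:

- name: Optimized playbook
  hosts: webservers
  gather_facts: no  # Skip fact gathering if you don't need system information
  strategy: free    # With the free strategy each host runs through the tasks independently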

  tasks:
    - name: Gather minimal facts
      setup:
        gather_subset:
          - '!all'
          - '!any'
          - network
      when: need_network_info | default(false)
  
    - name: Batch operations
      package:
        name: "{{ packages }}"
        state: present
      # Passing the whole list at once is faster than looping over packages
    
    - name: Use async for long running tasks
      shell: /opt/scripts/long_running_task.sh
      async: 3600  # Maximum run time in seconds
      poll: 0      # Fire and forget; check on it later
      register: long_task
    
    - name: Check async task status
      async_status:
        jid: "{{ long_task.ansible_job_id }}"
      register: task_result
      until: task_result.finished
      retries: 60
      delay: 10

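Rolling and blue-green deployments

Zero-downtime deployment matters a lot in production. Here's a rolling deployment example:

---
- name: Rolling deployment
  hosts: webservers
  serial: "25%"  # Deploy to 25% of the servers at a time
  max_fail_percentage: 10  # Abort the play if more than 10% of a batch fails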

  pre_tasks:
    - name: Remove server from load balancer
      uri:
        url: "http://{{ loadbalancer_host }}/api/servers/{{ inventory_hostname }}/disable"
        method: POST
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost
    
    - name: Wait for connections to drain
      wait_for:
        port: 80
        host: "{{ ansible_default_ipv4.address }}"
        state: drained
        timeout: 300

  tasks:
    - name: Deploy new version
      include_role:
        name: webapp
      vars:
        app_version: "{{ new_version }}"
  
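    - name: Verify deployment
      uri:
        url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
        method: GET
        status_code: 200
      register: verify_result
      # Before ansible-core 2.16, retries only take effect together with until
      until: verify_result is succeeded
      retries: 10
      delay: 5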

  post_tasks:
    - name: Add server back to load balancer
      uri:
        url: "http://{{ loadbalancer_host }}/api/servers/{{ inventory_hostname }}/enable"
        method: POST
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost
    
    - name: Wait for server to be healthy in LB
      uri:
        url: "http://{{ loadbalancer_host }}/api/servers/{{ inventory_hostname }}/status"
        method: GET
      register: lb_status
      until: lb_status.json.status == "healthy"
      retries: 20
      delay: 10
      delegate_to: localhost
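
It helps to work out what `serial: "25%"` and `max_fail_percentage: 10` actually mean for your host count before you run this. The sketch below reflects my understanding of the behavior: a percentage is rounded down with a minimum of one host per batch, and the play aborts only when the failed share of a batch exceeds the limit. Treat both rules as assumptions and verify them against the Ansible documentation for your version. The function names are made up for illustration.

```python
def serial_batches(hosts, serial):
    """Split hosts into batches for an integer or percentage `serial` value."""
    if isinstance(serial, str) and serial.endswith("%"):
        pct = int(serial[:-1])
        # Assumed rule: round down, but never below one host per batch
        size = int(pct / 100.0 * len(hosts)) or 1
    else:
        size = int(serial)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def batch_aborts(failed, batch_size, max_fail_percentage):
    """True when the failed share of a batch exceeds the allowed percentage."""
    return failed / batch_size * 100 > max_fail_percentage


if __name__ == "__main__":
    hosts = [f"web{i}" for i in range(10)]
    for batch in serial_batches(hosts, "25%"):
        print(batch)
```

With 10 hosts, 25% works out to batches of 2, so a single failure is already 50% of the batch and stops the rollout. That is usually what you want, but it is worth knowing in advance.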

A blue-green deployment looks like this:

---
- name: Blue-Green Deployment
  hosts: localhost
  vars:
    # active_environment is the pool that is currently serving traffic
    current_env: "{{ active_environment }}"
    target_env: "{{ 'green' if active_environment == 'blue' else 'blue' }}"

  tasks:
    - name: Deploy to inactive environment
      include_tasks: deploy-to-environment.yml
      vars:
        # `environment` is a reserved keyword in Ansible, so use another name
        target_environment: "{{ target_env }}"
        version: "{{ new_version }}"

    - name: Run smoke tests on target environment
      uri:
        url: "http://{{ target_env }}.internal.example.com/health"
        method: GET
        status_code: 200
      register: smoke_test
      until: smoke_test is succeeded
      retries: 10
      delay: 30
  
    - name: Switch load balancer to new environment
      template:
        src: nginx-upstream.conf.j2
        dest: /etc/nginx/conf.d/upstream.conf
      vars:
        active_pool: "{{ target_env }}"
      delegate_to: "{{ item }}"
      loop: "{{ groups['loadbalancers'] }}"
      notify: reload nginx
  
    - name: Update active environment marker
      lineinfile:
        path: /etc/ansible/facts.d/deployment.fact
        regexp: '^active_environment='
        line: "active_environment={{ target_env }}"
      delegate_to: "{{ item }}"
      loop: "{{ groups['all'] }}"
  
    - name: Keep old environment for rollback
      debug:
        msg: "Deployment complete. Old environment {{ current_env }} kept for rollback if needed."
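
The part of blue-green that is easiest to get backwards is deciding which pool is current and which is the target. A tiny sketch of that decision in Python, assuming only two pools named `blue` and `green`; the function name `pick_environments` is hypothetical:

```python
ENVIRONMENTS = ("blue", "green")


def pick_environments(active_environment):
    """Return (current_env, target_env) for a blue-green switch.

    The current environment is whatever is serving traffic right now;
    the target is always the other pool.
    """
    if active_environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {active_environment!r}")
    target = "green" if active_environment == "blue" else "blue"
    return active_environment, target
```

Rejecting unknown values up front is a deliberate choice here: a typo in the active marker should stop the deployment rather than silently pick a pool.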

Monitoring and alerting integration

I often wire monitoring and alerting into the deployment process:

- name: Deploy with monitoring
  hosts: webservers
  tasks:
    - name: Disable monitoring alerts during deployment
      uri:
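        # The v1 API was removed in Alertmanager 0.27; use v2
        url: "{{ prometheus_alertmanager_url }}/api/v2/silences"
        method: POST
        body_format: json
        body:
          matchers:
            - name: instance
              value: "{{ ansible_default_ipv4.address }}:9100"
              isRegex: false
          startsAt: "{{ ansible_date_time.iso8601 }}"
          # strftime takes the format as input and the epoch as its argument;
          # utc=true needs ansible-core 2.14 or newer
          endsAt: "{{ '%Y-%m-%dT%H:%M:%S.000Z' | strftime(ansible_date_time.epoch | int + 1800, utc=true) }}"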
          createdBy: "ansible-deployment"
          comment: "Deployment in progress"
      delegate_to: localhost
      register: silence_id
  
    - name: Deploy application
      include_role:
        name: webapp
    
    - name: Send deployment notification to Slack
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "🚀 Deployment completed on {{ inventory_hostname }}"
          attachments:
            - color: "good"
              fields:
                - title: "Version"
                  value: "{{ app_version }}"
                  short: true
                - title: "Environment"
                  value: "{{ env }}"
                  short: true
                - title: "Server"
                  value: "{{ inventory_hostname }}"
                  short: true
      delegate_to: localhost
      when: notify_slack | default(true)
  
    - name: Update deployment tracking
      uri:
        url: "{{ deployment_api_url }}/deployments"
        method: POST
        body_format: json
        headers:
          Authorization: "Bearer {{ deployment_api_token }}"
        body:
          application: "{{ app_name }}"
          version: "{{ app_version }}"
          environment: "{{ env }}"
          server: "{{ inventory_hostname }}"
          timestamp: "{{ ansible_date_time.iso8601 }}"
          status: "success"
      delegate_to: localhost
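
The fiddly part of the silence request is getting the two timestamps right: both need to be UTC, and the end time is simply the start plus the silence window. Here is a small Python sketch that builds the same payload, which I find handy for checking the format. The function name `build_silence` is hypothetical, and the `isRegex` field reflects my understanding of the Alertmanager matcher format, so double-check it against your Alertmanager version.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.000Z"


def build_silence(instance, minutes=30, now=None):
    """Build a silence payload covering `minutes` from `now` (UTC)."""
    # Normalize to UTC so the trailing "Z" in the format is truthful
    start = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    end = start + timedelta(minutes=minutes)
    return {
        "matchers": [{"name": "instance", "value": instance, "isRegex": False}],
        "startsAt": start.strftime(TIME_FORMAT),
        "endsAt": end.strftime(TIME_FORMAT),
        "createdBy": "ansible-deployment",
        "comment": "Deployment in progress",
    }
```

Passing `now` explicitly makes the output reproducible, which is useful when comparing it with what the playbook sends.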

Error handling and rollback

Solid error handling is critical in production:

---
- name: Deployment with automatic rollback
  hosts: webservers
  vars:
    rollback_on_failure: true
    health_check_retries: 5
  
  tasks:
    - name: Get current version for rollback
      shell: readlink {{ app_home }}/current | xargs basename
      register: current_version
      changed_when: false
      failed_when: false
  
    - name: Deployment block
      block:
        - name: Deploy new version
          include_role:
            name: webapp
          vars:
            app_version: "{{ new_version }}"
      
        - name: Health check after deployment
          uri:
            url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
            method: GET
            status_code: 200
          register: health_check_result
          until: health_check_result is succeeded
          retries: "{{ health_check_retries }}"
          delay: 10

        - name: Performance test
          # Print only the total request time and fail if it exceeds 2 seconds
          shell: |
            curl -s -o /dev/null -w '%{time_total}' "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/"
          register: perf_test
          changed_when: false
          failed_when: perf_test.rc != 0 or (perf_test.stdout | float) > 2.0
      
      rescue:
        - name: Log deployment failure
          debug:
            msg: "Deployment failed: {{ ansible_failed_result.msg | default('Unknown error') }}"
      
        - name: Rollback to previous version
          block:
            - name: Stop failed service
              systemd:
                name: "{{ app_name }}"
                state: stopped
              ignore_errors: yes
          
            - name: Restore previous version symlink
              file:
                src: "{{ app_home }}/releases/{{ current_version.stdout }}"
                dest: "{{ app_home }}/current"
                state: link
                force: yes
              when: 
                - rollback_on_failure
                - current_version.stdout != ""
          
            - name: Start service with previous version
              systemd:
                name: "{{ app_name }}"
                state: started
              when: rollback_on_failure
          
            - name: Verify rollback success
              uri:
                url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
                method: GET
                status_code: 200
              register: rollback_check
              until: rollback_check is succeeded
              retries: 3
              delay: 5
              when: rollback_on_failure
          
            - name: Send rollback notification
              mail:
                to: "{{ ops_email }}"
                subject: "URGENT: Deployment failed and rolled back on {{ inventory_hostname }}"
                body: |
                  Deployment of {{ app_name }} version {{ new_version }} failed on {{ inventory_hostname }}.
                
                  Error: {{ ansible_failed_result.msg | default('Unknown error') }}
                
                  System has been rolled back to version {{ current_version.stdout }}.
                
                  Please investigate immediately.
              delegate_to: localhost
              when: rollback_on_failure
        
          rescue:
            - name: Rollback also failed
              fail:
                msg: "Both deployment and rollback failed. Manual intervention required!"
      
        - name: Fail the play
          fail:
            msg: "Deployment failed and rollback completed"
          when: rollback_on_failure
    
      always:
        - name: Clean up temporary files
          file:
            path: "{{ item }}"
            state: absent
          loop:
            - "/tmp/{{ app_name }}-{{ new_version }}.tar.gz"
            - "/tmp/deployment-{{ ansible_date_time.epoch }}.log"
          ignore_errors: yes
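
If block/rescue/always feels unfamiliar, it maps quite closely onto try/except/finally. The sketch below mirrors the control flow of the playbook above with plain Python callables. It is only an analogy to help read the YAML: all four callables and the function name `deploy_with_rollback` are hypothetical.

```python
def deploy_with_rollback(deploy, verify, rollback, cleanup):
    """Run deploy and verify; roll back on failure; always clean up.

    Returns "deployed" on success. Raises RuntimeError after a rollback,
    so the caller still sees the run as failed.
    """
    try:
        # block: the normal deployment path
        deploy()
        verify()
        return "deployed"
    except Exception as exc:
        # rescue: try to restore the previous version
        try:
            rollback()
        except Exception as rollback_exc:
            raise RuntimeError(
                "Both deployment and rollback failed"
            ) from rollback_exc
        raise RuntimeError(
            f"Deployment failed and rollback completed: {exc}"
        ) from exc
    finally:
        # always: runs on success and on failure alike
        cleanup()
```

As in the playbook, a successful rollback still ends in a failure, because a host that was rolled back should not be reported as deployed.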

Managing multiple environments

In real work you usually have several environments to manage. My directory layout looks like this:

project/
├── inventories/
│   ├── development/
│   │   ├── hosts
│   │   └── group_vars/
│   ├── staging/
│   │   ├── hosts
│   │   └── group_vars/
│   └── production/
│       ├── hosts
│       └── group_vars/
├── playbooks/
├── roles/
└── scripts/

Each environment has its own configuration:

inventories/production/group_vars/all.yml:

# Production settings
env: production
domain: example.com

# Database settings
mysql_config:
  innodb_buffer_pool_size: "4G"
  max_connections: 500
  query_cache_size: "256M"

# Application settings
app_workers: 8
app_timeout: 30
max_upload_size: "50M"

# Cache settings
redis_maxmemory: "2gb"
redis_maxmemory_policy: "allkeys-lru"

# Monitoring settings
monitoring_enabled: true
log_level: "warning"

# Security settings
firewall_enabled: true
ssl_required: true

inventories/development/group_vars/all.yml:

# Development settings
env: development
domain: dev.example.com

# Database settings
mysql_config:
  innodb_buffer_pool_size: "512M"
  max_connections: 100
  query_cache_size: "64M"

# Application settings
app_workers: 2
app_timeout: 60
max_upload_size: "10M"

# Cache settings
redis_maxmemory: "256mb"
redis_maxmemory_policy: "noeviction"

# Monitoring settings
monitoring_enabled: false
log_level: "debug"

# Security settings
firewall_enabled: false
ssl_required: false
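
With this layout, switching environments is just a matter of pointing `-i` at a different inventory. To avoid typing the wrong path against production, a thin wrapper that only accepts known environment names can help. This is a minimal sketch, assuming the `project/` layout above; the function `playbook_command` and the default `project_root` are my own inventions.

```python
from pathlib import PurePosixPath

ENVIRONMENTS = ("development", "staging", "production")


def playbook_command(env, playbook, project_root="project"):
    """Build the ansible-playbook argument list for one environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    root = PurePosixPath(project_root)
    inventory = root / "inventories" / env / "hosts"
    return [
        "ansible-playbook",
        "-i", str(inventory),
        str(root / "playbooks" / playbook),
    ]


if __name__ == "__main__":
    print(" ".join(playbook_command("staging", "deploy.yml")))
```

The list form can be handed straight to `subprocess.run` without going through a shell.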

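Common problems and solutions

After years of running Ansible in production I've hit all kinds of problems. Here are some of the common ones:

SSH connection problems

# Fix SSH connection timeouts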
- name: Configure SSH settings
  blockinfile:
    path: ~/.ssh/config
    create: yes
    block: |
      Host *
          ServerAliveInterval 60
          ServerAliveCountMax 3
          TCPKeepAlive yes
          ControlMaster auto
          ControlPath ~/.ssh/control-%r@%h:%p
          ControlPersist 3600

Permission problems

# Handle tasks that need different privileges
- name: Tasks requiring different privileges
  block:
    - name: Install system packages
      package:
        name: nginx
        state: present
      become: yes
      become_user: root
  
    - name: Configure application
      template:
        src: app.conf.j2
        dest: /opt/app/app.conf
      become: yes
      become_user: app

Large file transfers

# For large files, use a long timeout and retry failed downloads
- name: Download large file
  get_url:
    url: "{{ large_file_url }}"
    dest: "/tmp/large_file.tar.gz"
    timeout: 1800
    force: yes
  register: download_result
  retries: 3
  delay: 60
  until: download_result is succeeded

# Or use rsync
- name: Sync large directory
  synchronize:
    src: /local/large_directory/
    dest: /remote/large_directory/
    delete: yes
    compress: yes
    recursive: yes

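Running out of memory

# Process in batches on servers with little memory
- name: Process large dataset in batches
  # seq works in any POSIX shell; {1..10} is bash-only and the default shell is /bin/sh
  shell: |
    for i in $(seq 1 10); do
      python process_batch.py --batch $i
      sleep 5
    done
  when: ansible_memtotal_mb < 4096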

Monitoring and log management

In production it's important to keep an eye on what Ansible is doing:

Enabling detailed logging

# ansible.cfg
[defaults]
log_path = /var/log/ansible/ansible.log
display_skipped_hosts = false
display_ok_hosts = true

# Enable callback plugins (this option was called callback_whitelist before ansible-core 2.11)
callbacks_enabled = timer, profile_tasks, log_plays

[callback_profile_tasks]
task_output_limit = 100

Custom log format

# callback_plugins/custom_logger.py
from ansible.plugins.callback import CallbackBase
import json
import time

class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'notification'
    CALLBACK_NAME = 'custom_logger'
  
    def __init__(self):
        super(CallbackModule, self).__init__()
        self.start_time = time.time()
  
    def v2_playbook_on_start(self, playbook):
        self._display.display(f"Playbook started: {playbook._file_name}")
      
    def v2_runner_on_ok(self, result):
        host = result._host.get_name()
        task = result._task.get_name()
      
        log_entry = {
            'timestamp': time.time(),
            'host': host,
            'task': task,
            'status': 'ok',
            'changed': result._result.get('changed', False)
        }
      
        with open('/var/log/ansible/task_results.json', 'a') as f:
            f.write(json.dumps(log_entry) + '\n')
  
    def v2_runner_on_failed(self, result, ignore_errors=False):
        host = result._host.get_name()
        task = result._task.get_name()
      
        log_entry = {
            'timestamp': time.time(),
            'host': host,
            'task': task,
            'status': 'failed',
            'error': result._result.get('msg', 'Unknown error')
        }
      
        with open('/var/log/ansible/task_results.json', 'a') as f:
            f.write(json.dumps(log_entry) + '\n')
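
Since the plugin writes one JSON object per line, the log is easy to post-process. Here is a minimal sketch that counts results per host and status. It assumes the `host` and `status` keys written by the plugin above, and the function name `summarize` is just for illustration.

```python
import json
from collections import Counter


def summarize(lines):
    """Count log entries per (host, status) from JSON-lines input."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            # Tolerate blank lines in the log
            continue
        entry = json.loads(line)
        counts[(entry["host"], entry["status"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '{"timestamp": 1.0, "host": "web1", "task": "a", "status": "ok"}',
        '{"timestamp": 2.0, "host": "web1", "task": "b", "status": "failed"}',
    ]
    for (host, status), count in sorted(summarize(sample).items()):
        print(host, status, count)
```

To run it against the real log, pass an open file handle, for example `summarize(open('/var/log/ansible/task_results.json'))`.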

Integrating with a monitoring system

- name: Send metrics to monitoring system
  uri:
    url: "{{ metrics_endpoint }}/api/v1/metrics"
    method: POST
    body_format: json
    body:
      metric_name: "ansible.deployment.duration"
      value: "{{ deployment_duration }}"
      tags:
        environment: "{{ env }}"
        application: "{{ app_name }}"
        version: "{{ app_version }}"
      timestamp: "{{ ansible_date_time.epoch }}"
  delegate_to: localhost
  when: send_metrics | default(true)

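Summary

After all these years of using it, I really do think Ansible is a tool every ops engineer needs to master. It doesn't just make you faster; more importantly, it makes operations work standardized and repeatable.

Going from manual work, to batch processing with shell scripts, to automated management with Ansible taught me how much the right tool matters. The value of Ansible becomes especially obvious once you are managing a few hundred servers.

But remember, a tool is a means, not an end. What really matters is understanding your business requirements and designing a sensible architecture, and then using Ansible to implement and maintain it. I've seen colleagues rely on automation so heavily that when something broke they no longer knew how to fix it by hand, and that's not how it should be.

Also, when you use Ansible in production, keep these points in mind: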

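Test, test, and test again: every playbook should be thoroughly verified in a test environment first
Backups matter: always back up data and configuration before deploying
Monitoring and alerting: you need to be able to spot problems quickly
Keep documentation complete: the rest of the team should be able to understand and maintain it
Security first: encrypt sensitive data and keep permissions under control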

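My current team has moved all of our operations work to Ansible. From server provisioning to application deployment, from configuration management to incident handling, there is a playbook for pretty much everything. That has made us more efficient, reduced mistakes, and made it easier for new colleagues to get up to speed.

One last thing: you don't learn Ansible overnight. It takes practice on real projects and regular reflection on what worked. I'd suggest starting with simple tasks, such as installing packages in bulk or copying configuration files, and then gradually working up to the more complex features.

If you found this article helpful, feel free to like and share it so more fellow ops engineers can see it. If you have questions, leave them in the comments and I'll do my best to answer. The ops road is a long one, but good tools and methods make it easier to walk.

Follow @运维躬行录 and I'll keep sharing practical ops techniques and experience, so we can keep learning and improving together. Remember, the best operations work keeps systems running stably, lets developers focus on building the product, and goes unnoticed by users. That is where our value as ops engineers lies.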
