从零开始玩转Ansible:让运维自动化不再是梦想
其实我刚接触Ansible的时候也是一脸懵逼,什么playbook、inventory、module...这些概念听起来就头大。但是用了一段时间后,我发现这玩意儿真的是运维人员的福音。今天就来跟大家聊聊Ansible这个神器,保证让你看完就能上手,而且我会把这几年踩过的坑和积累的生产经验都分享给大家。
什么是Ansible?为什么选择它?
说白了,Ansible就是一个自动化工具。你可以把它理解为一个"遥控器",通过这个遥控器,你可以同时控制成百上千台服务器做同样的事情。
我记得刚入行那会儿,公司有个项目需要在200多台服务器上部署应用。当时我傻乎乎地写了个shell脚本,然后用for循环ssh到每台机器上执行。结果执行到一半,网络断了,我都不知道哪些机器执行成功了,哪些失败了。后来同事介绍我用Ansible,那种感觉就像是从石器时代进入了现代社会。
Ansible有几个特点让我特别喜欢:
无需安装客户端:这点真的很爽,只要目标服务器能SSH连接就行。我之前用过Puppet和Chef,都需要在每台机器上装agent,维护起来特别麻烦。有一次我们的Puppet master挂了,所有节点都连不上,整个配置管理系统瘫痪。但是Ansible不会有这个问题,它是推送模式,控制节点挂了不影响已经配置好的服务器运行。
基于SSH:利用现有的SSH连接,安全性有保障。而且SSH本身就有很多安全机制,比如密钥认证、跳板机等,Ansible都能很好地支持。
幂等性:这个词听起来很高大上,其实就是说你执行多少次结果都一样。比如你用Ansible创建一个用户,如果用户已经存在了,它不会重复创建。这个特性在生产环境中特别重要,因为你可能需要多次执行同一个playbook来确保配置正确。这一节的特性列表后面我放了一个最小的例子来演示。
YAML语法:配置文件用的是YAML格式,比JSON或者XML好读多了。我之前写过XML格式的配置文件,那个缩进和嵌套看得眼花缭乱。
强大的模块库:Ansible内置了几千个模块,覆盖了几乎所有的运维场景。从基础的文件操作到复杂的云资源管理,基本上你能想到的操作都有对应的模块。
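顺便用一个最小的例子感受一下上面说的幂等性(只是示意,用户名是随便取的,主机组沿用后文的webservers):
# idempotent_demo.yml:这个任务重复执行多少次,结果都一样
# 第一次执行会显示changed,之后再执行只会显示ok,不会重复创建用户
- name: Ensure deploy user exists
  hosts: webservers
  become: yes
  tasks:
    - name: Create deploy user (idempotent)
      user:
        name: deploy
        shell: /bin/bash
        state: present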
安装和环境配置
安装Ansible其实很简单,但是在生产环境中,我建议你做一些额外的配置。
基础安装
我一般都是用pip安装,这样可以安装最新版本:
# 先安装pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
# 安装ansible
pip3 install ansible
# 如果需要特定版本
pip3 install ansible==6.7.0
如果你的环境比较复杂,我建议用虚拟环境:
python3 -m venv ansible-env
source ansible-env/bin/activate
pip install ansible
配置文件优化
Ansible的配置文件是ansible.cfg,它会按照以下顺序查找,找到第一个就停止:
- ANSIBLE_CONFIG环境变量指向的文件
- 当前目录下的ansible.cfg
- ~/.ansible.cfg
- /etc/ansible/ansible.cfg
我一般在项目根目录下放一个ansible.cfg,这样每个项目都有自己的配置:
[defaults]
# 指定inventory文件位置
inventory = inventories/production/hosts
# 指定私钥文件
private_key_file = ~/.ssh/ansible_key
# 设置远程用户
remote_user = ansible
# 禁用host key检查(生产环境谨慎使用)
host_key_checking = False
# 设置并发数,根据你的网络情况调整
forks = 50
# 设置超时时间
timeout = 30
# 开启pipelining,提高性能
pipelining = True
# 日志文件
log_path = /var/log/ansible.log
# 设置重试文件位置
retry_files_enabled = True
retry_files_save_path = ~/.ansible-retry
[ssh_connection]
# SSH连接复用,提高性能
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes
# 使用SCP而不是SFTP传输文件
scp_if_ssh = True
这些配置都是我在生产环境中总结出来的,特别是forks参数,默认是5,在管理大量服务器时明显不够用。我一般设置成50或者更高,但要注意不要设置得太高,否则可能会导致SSH连接数过多。另外pipelining虽然能明显提速,但它要求目标机的sudo配置里没有开启requiretty,启用前最好先确认一下。
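除了在ansible.cfg里调forks这个全局上限,也可以在playbook里按需收紧并发,比如play级别的serial和任务级别的throttle。下面是一个小示意(主机组和服务名沿用本文后面的例子,仅供参考):
- name: Restart app in small batches
  hosts: webservers
  become: yes
  serial: 10            # 每批只处理10台,处理完一批再进下一批
  tasks:
    - name: Restart app service
      systemd:
        name: webapp
        state: restarted
      throttle: 2        # 这个任务同一时刻最多2台并发,适合重启这类敏感操作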
SSH密钥配置
在生产环境中,我强烈建议使用SSH密钥认证,而不是密码认证。首先生成密钥:
ssh-keygen -t rsa -b 4096 -C "ansible@company.com" -f ~/.ssh/ansible_key
然后把公钥分发到所有目标服务器:
# 手动分发
ssh-copy-id -i ~/.ssh/ansible_key.pub user@target_server
# 或者批量分发脚本
for host in $(cat server_list.txt); do
ssh-copy-id -i ~/.ssh/ansible_key.pub user@$host
done
为了安全起见,我建议为Ansible专门创建一个用户,而不是直接使用root:
# 在目标服务器上创建ansible用户
useradd -m -s /bin/bash ansible
echo "ansible ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers.d/ansible深入理解Inventory
Inventory是Ansible的核心概念之一,它定义了你要管理的服务器。在生产环境中,inventory的设计直接影响到你的运维效率。
静态Inventory
最简单的inventory就是一个文本文件,但是在实际项目中,我会按照环境和功能来组织:
# inventories/production/hosts
# Web服务器组
[webservers]
web01 ansible_host=10.0.1.10 ansible_port=22
web02 ansible_host=10.0.1.11 ansible_port=22
web03 ansible_host=10.0.1.12 ansible_port=22
# 数据库服务器组
[databases]
db01 ansible_host=10.0.2.10 mysql_role=master
db02 ansible_host=10.0.2.11 mysql_role=slave
db03 ansible_host=10.0.2.12 mysql_role=slave
# 负载均衡器
[loadbalancers]
lb01 ansible_host=10.0.3.10
lb02 ansible_host=10.0.3.11
# 缓存服务器
[cache]
redis01 ansible_host=10.0.4.10 redis_port=6379
redis02 ansible_host=10.0.4.11 redis_port=6379
# 定义组的组
[frontend:children]
webservers
loadbalancers
[backend:children]
databases
cache
# 全局变量
[all:vars]
ansible_user=ansible
ansible_ssh_private_key_file=~/.ssh/ansible_key
ansible_python_interpreter=/usr/bin/python3
# 环境特定变量
[production:children]
frontend
backend
[production:vars]
env=production
domain=example.com
我还会创建对应的group_vars和host_vars目录:
inventories/production/
├── hosts
├── group_vars/
│ ├── all.yml
│ ├── webservers.yml
│ ├── databases.yml
│ └── production.yml
└── host_vars/
├── web01.yml
└── db01.yml
group_vars/webservers.yml:
# Web服务器特定变量
nginx_version: 1.20.2
php_version: 8.0
document_root: /var/www/html
max_connections: 1024
# 防火墙规则
firewall_rules:
- port: 80
protocol: tcp
source: 0.0.0.0/0
- port: 443
protocol: tcp
source: 0.0.0.0/0
group_vars/databases.yml:
# MySQL配置
mysql_version: 8.0
mysql_root_password: "{{ vault_mysql_root_password }}"
mysql_databases:
- name: webapp
encoding: utf8mb4
collation: utf8mb4_unicode_ci
mysql_users:
- name: webapp_user
password: "{{ vault_webapp_db_password }}"
priv: "webapp.*:ALL"
host: "10.0.1.%"
# MySQL配置参数
mysql_config:
innodb_buffer_pool_size: "1G"
max_connections: 200
query_cache_size: "128M"动态Inventory
在云环境中,服务器的IP地址可能经常变化,这时候动态inventory就很有用了。我写过一个从阿里云ECS获取服务器列表的脚本:
#!/usr/bin/env python3
# dynamic_inventory.py
import json
import sys
from aliyunsdkcore.client import AcsClient
from aliyunsdkecs.request.v20140526 import DescribeInstancesRequest
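# 说明(补充):Ansible会带 --list 参数调用这个脚本;因为输出里已经带了 _meta.hostvars,
# 所以不会再逐台调用 --host。另外 DescribeInstances 默认只返回一页(PageSize=10),
# 生产环境使用时还需要自己处理分页,这里为了演示从简。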
class AlicloudInventory:
def __init__(self):
# AcsClient的前两个参数是AccessKey ID和Secret,按位置传即可
self.client = AcsClient(
'your_access_key',
'your_secret_key',
'cn-hangzhou'
)
self.inventory = {
'_meta': {
'hostvars': {}
}
}
def get_instances(self):
request = DescribeInstancesRequest.DescribeInstancesRequest()
response = self.client.do_action_with_exception(request)
return json.loads(response)
def build_inventory(self):
instances = self.get_instances()
for instance in instances['Instances']['Instance']:
hostname = instance['InstanceName']
private_ip = instance['NetworkInterfaces']['NetworkInterface'][0]['PrimaryIpAddress']
# 根据标签分组
tags = {tag['TagKey']: tag['TagValue'] for tag in instance.get('Tags', {}).get('Tag', [])}
# 添加到对应组
if 'Role' in tags:
role = tags['Role']
if role not in self.inventory:
self.inventory[role] = {'hosts': []}
self.inventory[role]['hosts'].append(hostname)
# 添加主机变量
self.inventory['_meta']['hostvars'][hostname] = {
'ansible_host': private_ip,
'instance_id': instance['InstanceId'],
'instance_type': instance['InstanceType'],
'tags': tags
}
def run(self):
self.build_inventory()
print(json.dumps(self.inventory, indent=2))
if __name__ == '__main__':
inventory = AlicloudInventory()
inventory.run()
使用动态inventory:
ansible all -i dynamic_inventory.py -m ping
核心模块深度解析
Ansible有几千个模块,但是在日常工作中,你可能只会用到几十个。我来详细介绍一些最重要的模块,以及在生产环境中的使用技巧。
文件操作模块
copy模块是最基础的,我们先从命令行开始:
# 基本文件复制
ansible webservers -m copy -a "src=/etc/hosts dest=/tmp/hosts"
# 复制文件并设置权限
ansible webservers -m copy -a "src=nginx.conf dest=/etc/nginx/nginx.conf owner=root group=root mode=0644"
# 复制文件并备份原文件
ansible webservers -m copy -a "src=nginx.conf dest=/etc/nginx/nginx.conf backup=yes"
# 直接写入内容到文件
ansible webservers -m copy -a "content='Hello World' dest=/tmp/hello.txt"
# 复制文件并验证(需要验证命令)
ansible webservers -m copy -a "src=nginx.conf dest=/etc/nginx/nginx.conf validate='nginx -t -c %s'"在生产环境中,我经常遇到需要复制大量配置文件的情况。有一次我们要给200台Web服务器更新SSL证书,用copy模块一条命令就搞定了:
# 批量更新SSL证书
ansible webservers -m copy -a "src=/etc/ssl/certs/new-cert.pem dest=/etc/ssl/certs/server.crt backup=yes" --become
# 同时更新私钥文件
ansible webservers -m copy -a "src=/etc/ssl/private/new-key.pem dest=/etc/ssl/private/server.key mode=0600 backup=yes" --becomefile模块用于文件和目录操作,这个模块我用得特别多:
# 创建目录
ansible webservers -m file -a "path=/opt/app state=directory owner=app group=app mode=0755"
# 创建多级目录
ansible webservers -m file -a "path=/opt/app/logs/nginx state=directory mode=0755 recurse=yes"
# 创建软链接
ansible webservers -m file -a "src=/opt/app/current dest=/opt/app/releases/v1.2.3 state=link"
# 删除文件或目录
ansible webservers -m file -a "path=/tmp/old_files state=absent"
# 修改文件权限
ansible webservers -m file -a "path=/etc/app/config.ini owner=app group=app mode=0600"
# 创建空文件
ansible webservers -m file -a "path=/var/log/app.log state=touch owner=app group=app"我记得有一次系统升级后,发现所有服务器的日志目录权限都不对,应用写不了日志。用file模块批量修复:
# 递归修改目录权限
ansible webservers -m file -a "path=/var/log/app state=directory owner=app group=app mode=0755 recurse=yes" --become包管理模块
在生产环境中,包管理是一个很重要的话题。不同的操作系统有不同的包管理器,但好在Ansible的package模块可以自动识别:
# 安装单个软件包
ansible webservers -m package -a "name=nginx state=present" --become
# 安装多个软件包
ansible webservers -m package -a "name=nginx,mysql-server,git state=present" --become
# 安装特定版本的软件包
ansible centos_servers -m yum -a "name=nginx-1.20.2 state=present" --become
# 卸载软件包
ansible webservers -m package -a "name=apache2 state=absent" --become
# 更新所有软件包
ansible webservers -m package -a "name='*' state=latest" --become
# 更新包缓存
ansible ubuntu_servers -m apt -a "update_cache=yes cache_valid_time=3600" --become
我之前负责一个项目,需要在100多台服务器上安装Docker。当时我是这样操作的:
# 先安装依赖包
ansible centos_servers -m yum -a "name=yum-utils,device-mapper-persistent-data,lvm2 state=present" --become
# 添加Docker仓库
ansible centos_servers -m yum_repository -a "name=docker-ce description='Docker CE Repository' baseurl=https://download.docker.com/linux/centos/7/x86_64/stable gpgcheck=yes gpgkey=https://download.docker.com/linux/centos/gpg enabled=yes" --become
# 安装Docker CE
ansible centos_servers -m yum -a "name=docker-ce,docker-ce-cli,containerd.io state=present" --become有时候需要从本地RPM包安装软件:
# 先复制RPM包到目标服务器
ansible webservers -m copy -a "src=/tmp/custom-app-1.0.0.rpm dest=/tmp/"
# 安装本地RPM包
ansible webservers -m yum -a "name=/tmp/custom-app-1.0.0.rpm state=present" --become服务管理模块
service模块在不同系统上的行为可能不同,我更推荐使用systemd模块:
# 启动服务
ansible webservers -m systemd -a "name=nginx state=started" --become
# 停止服务
ansible webservers -m systemd -a "name=nginx state=stopped" --become
# 重启服务
ansible webservers -m systemd -a "name=nginx state=restarted" --become
# 重新加载服务配置
ansible webservers -m systemd -a "name=nginx state=reloaded" --become
# 启用服务开机自启
ansible webservers -m systemd -a "name=nginx enabled=yes" --become
# 禁用服务开机自启
ansible webservers -m systemd -a "name=nginx enabled=no" --become
# 重新加载systemd配置
ansible webservers -m systemd -a "daemon_reload=yes" --become
# 检查服务状态
ansible webservers -m systemd -a "name=nginx" --become
我在生产环境中经常需要批量重启服务。比如更新了nginx配置后:
# 先验证配置文件语法
ansible webservers -m shell -a "nginx -t" --become
# 如果语法正确,重新加载配置
ansible webservers -m systemd -a "name=nginx state=reloaded" --become
有时候服务启动后需要等待一段时间才能正常工作,可以结合wait_for模块:
# 启动服务
ansible webservers -m systemd -a "name=webapp state=started" --become
# 等待服务端口可用
ansible webservers -m wait_for -a "port=8080 delay=5 timeout=30"
# 检查服务是否正常响应
ansible webservers -m uri -a "url=http://localhost:8080/health method=GET status_code=200"
用户管理模块
在生产环境中,用户管理是一个很重要的安全话题:
# 创建用户
ansible webservers -m user -a "name=webapp shell=/bin/bash home=/opt/webapp create_home=yes" --become
# 创建系统用户
ansible webservers -m user -a "name=nginx system=yes shell=/sbin/nologin home=/var/lib/nginx create_home=no" --become
# 设置用户密码(密码需要先加密)
ansible webservers -m user -a "name=webapp password='$6$rounds=656000$salt$hash'" --become
# 添加用户到组
ansible webservers -m user -a "name=webapp groups=docker,sudo append=yes" --become
# 删除用户
ansible webservers -m user -a "name=olduser state=absent remove=yes" --become
# 修改用户shell
ansible webservers -m user -a "name=webapp shell=/bin/zsh" --become创建应用专用用户是我在生产环境中的标准操作:
# 先创建对应的组(用户要加入的组需要先存在)
ansible webservers -m group -a "name=webapp gid=1001 system=no" --become
# 再创建应用用户
ansible webservers -m user -a "name=webapp uid=1001 group=webapp shell=/bin/bash home=/opt/webapp create_home=yes system=no" --become
SSH密钥管理
# 添加SSH公钥到用户
ansible webservers -m authorized_key -a "user=webapp key='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAB...'" --become
# 从文件读取公钥
ansible webservers -m authorized_key -a "user=webapp key='{{ lookup('file', '~/.ssh/id_rsa.pub') }}'" --become
# 删除SSH密钥
ansible webservers -m authorized_key -a "user=webapp key='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAB...' state=absent" --become
命令执行模块
shell和command模块的区别很重要,我来详细说说:
# command模块不支持shell特性,更安全
ansible webservers -m command -a "df -h /"
# 获取系统负载
ansible webservers -m command -a "uptime"
# 查看内存使用情况
ansible webservers -m command -a "free -m"
# shell模块支持管道、重定向等shell特性
ansible webservers -m shell -a "ps aux | grep nginx | grep -v grep | wc -l"
# 使用shell执行复杂命令
ansible webservers -m shell -a "cd /opt/app && ./deploy.sh"
# 设置环境变量执行命令
ansible webservers -m shell -a "APP_ENV=production /opt/app/start.sh"在生产环境中,我经常需要执行一些复杂的运维脚本:
# 检查磁盘使用情况
ansible webservers -m shell -a "df -h | grep -v tmpfs | awk '{print \$5 \" \" \$6}' | grep -v Use"
# 查找大文件
ansible webservers -m shell -a "find /var/log -type f -size +100M -exec ls -lh {} \;"
# 清理日志文件
ansible webservers -m shell -a "find /var/log -name '*.log' -mtime +7 -delete"
# 获取服务器信息
ansible webservers -m shell -a "echo 'CPU:' && nproc && echo 'Memory:' && free -h | grep Mem && echo 'Disk:' && df -h /"网络和HTTP模块
uri模块对于API调用和健康检查非常有用:
# 简单的HTTP GET请求
ansible webservers -m uri -a "url=http://localhost/health method=GET"
# 检查API响应状态
ansible webservers -m uri -a "url=http://localhost:8080/api/status method=GET status_code=200"
# POST请求发送数据
ansible webservers -m uri -a "url=http://localhost:8080/api/restart method=POST body='{}' body_format=json"
# 下载文件
ansible webservers -m get_url -a "url=https://releases.example.com/app-1.2.3.tar.gz dest=/tmp/"
# 带认证的HTTP请求
ansible webservers -m uri -a "url=https://api.example.com/status method=GET headers='Authorization=Bearer token123'"我经常用uri模块做健康检查:
# 检查Web服务是否正常
ansible webservers -m uri -a "url=http://{{ ansible_default_ipv4.address }}/health method=GET status_code=200 timeout=10"
# 检查API接口
ansible webservers -m uri -a "url=http://localhost:8080/api/ping method=GET return_content=yes" --one-line数据库模块
对于MySQL数据库管理:
# 创建数据库
ansible db_servers -m mysql_db -a "name=webapp_db state=present login_user=root login_password=password" --become
# 删除数据库
ansible db_servers -m mysql_db -a "name=old_db state=absent login_user=root login_password=password" --become
# 创建数据库用户
ansible db_servers -m mysql_user -a "name=webapp_user password=secret_password priv='webapp_db.*:ALL' host='%' login_user=root login_password=password" --become
# 删除数据库用户
ansible db_servers -m mysql_user -a "name=old_user state=absent login_user=root login_password=password" --become
文本处理模块
lineinfile模块对于配置文件修改特别有用:
# 修改配置文件中的行
ansible webservers -m lineinfile -a "path=/etc/nginx/nginx.conf regexp='^worker_processes' line='worker_processes auto;'" --become
# 在文件末尾添加行
ansible webservers -m lineinfile -a "path=/etc/hosts line='192.168.1.100 app.example.com'" --become
# 删除匹配的行
ansible webservers -m lineinfile -a "path=/etc/hosts regexp='old.example.com' state=absent" --become
# 在特定位置插入行
ansible webservers -m lineinfile -a "path=/etc/ssh/sshd_config line='PermitRootLogin no' insertafter='^#PermitRootLogin'" --becomeblockinfile模块用于处理配置块:
# 添加配置块
ansible webservers -m blockinfile -a "path=/etc/nginx/nginx.conf block='upstream backend { server 192.168.1.10:8080; server 192.168.1.11:8080; }' insertbefore='server {'" --become
# 删除配置块
ansible webservers -m blockinfile -a "path=/etc/nginx/nginx.conf block='' state=absent marker='# {mark} ANSIBLE MANAGED BLOCK - backend'" --become
系统信息收集
setup模块用于收集系统信息,这个在写条件判断时特别有用:
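先举一个"写条件判断"的小例子(示意):playbook里可以直接引用收集到的facts来决定装什么包。
- name: Install web server package by OS family
  package:
    name: "{{ 'httpd' if ansible_os_family == 'RedHat' else 'apache2' }}"
    state: present
  become: yes
回到ad-hoc,收集和过滤facts的常见用法有这些: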
# 收集所有系统信息
ansible webservers -m setup
# 只收集网络信息
ansible webservers -m setup -a "gather_subset=network"
# 只收集硬件信息
ansible webservers -m setup -a "gather_subset=hardware"
# 过滤特定信息
ansible webservers -m setup -a "filter=ansible_default_ipv4"
# 收集自定义facts
ansible webservers -m setup -a "fact_path=/etc/ansible/facts.d"我经常用setup模块来获取服务器的基本信息:
# 获取服务器IP地址
ansible webservers -m setup -a "filter=ansible_default_ipv4" | grep address
# 获取内存信息
ansible webservers -m setup -a "filter=ansible_memtotal_mb"
# 获取CPU核心数
ansible webservers -m setup -a "filter=ansible_processor_cores"高级用法和组合技巧
在实际工作中,我经常需要组合使用多个模块。比如部署应用的时候:
# 1. 先停止服务
ansible webservers -m systemd -a "name=webapp state=stopped" --become
# 2. 备份当前版本
ansible webservers -m shell -a "cp -r /opt/webapp/current /opt/webapp/backup-$(date +%Y%m%d-%H%M%S)" --become
# 3. 下载新版本
ansible webservers -m get_url -a "url=https://releases.example.com/webapp-1.2.3.tar.gz dest=/tmp/" --become
# 4. 解压新版本
ansible webservers -m unarchive -a "src=/tmp/webapp-1.2.3.tar.gz dest=/opt/webapp/releases/ remote_src=yes owner=webapp group=webapp" --become
# 5. 更新软链接
ansible webservers -m file -a "src=/opt/webapp/releases/webapp-1.2.3 dest=/opt/webapp/current state=link force=yes" --become
# 6. 启动服务
ansible webservers -m systemd -a "name=webapp state=started" --become
# 7. 检查服务状态
ansible webservers -m uri -a "url=http://localhost:8080/health method=GET status_code=200"
使用register保存输出结果
虽然在ad-hoc命令中不能直接使用register,但可以通过一些技巧获取命令输出:
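先补充一下playbook里的标准做法:用register把结果存进变量,后续任务直接引用(示意,沿用前文的nginx场景):
- name: Check nginx config syntax
  command: nginx -t
  register: nginx_check
  changed_when: false
  failed_when: false
  become: yes
- name: Show check output
  debug:
    msg: "{{ nginx_check.stderr_lines }}"
- name: Reload nginx only if config is valid
  systemd:
    name: nginx
    state: reloaded
  become: yes
  when: nginx_check.rc == 0
而在ad-hoc下,就只能靠下面这些技巧间接拿到输出: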
# 获取命令输出并格式化显示
ansible webservers -m shell -a "df -h /" --one-line
# 检查文件是否存在
ansible webservers -m stat -a "path=/etc/nginx/nginx.conf"
# 获取服务状态
ansible webservers -m systemd -a "name=nginx" --become
批量操作技巧
在生产环境中,我经常需要对不同的服务器执行不同的操作:
# 只对特定主机执行
ansible webservers -l "web01,web02" -m systemd -a "name=nginx state=restarted" --become
# 排除特定主机
ansible webservers -l "!web03" -m package -a "name=nginx state=latest" --become
# 使用模式匹配
ansible "web*" -m shell -a "uptime"
# 对不同组执行不同操作
ansible databases -m systemd -a "name=mysql state=restarted" --become
ansible webservers -m systemd -a "name=nginx state=reloaded" --become
错误处理和调试
当命令执行失败时,可以使用一些参数来调试:
# 显示详细输出
ansible webservers -m shell -a "nginx -t" -vvv --become
# 忽略错误继续执行(注意:ignore_errors是playbook里的任务关键字,ansible命令本身没有--ignore-errors选项,ad-hoc下可以在shell里自己兜底)
ansible webservers -m shell -a "some_command_that_might_fail || true"
# 检查模式(不实际执行)
ansible webservers -m copy -a "src=test.txt dest=/tmp/" --check
# 显示差异
ansible webservers -m copy -a "src=nginx.conf dest=/etc/nginx/nginx.conf" --check --diff --become这些ad-hoc命令在日常运维中非常有用,特别是需要快速执行一些简单任务的时候。不过对于复杂的操作,我还是建议写成playbook,这样更容易维护和重复使用。
编写高质量的Playbook
一个好的Playbook不仅要能完成任务,还要易读、易维护、易调试。我来分享一些编写高质量Playbook的技巧。
基本结构和命名规范
---
# 文件头注释,说明这个playbook的作用
# Deploy web application to production servers
# Author: Your Name
# Date: 2024-01-01
- name: Deploy web application # 清晰的play名称
hosts: webservers
become: yes # 是否需要sudo
gather_facts: yes # 是否收集系统信息
vars:
# 在playbook中定义的变量
app_name: webapp
app_version: "1.2.3"
vars_files:
# 引用外部变量文件
- vars/common.yml
- vars/{{ env }}.yml
pre_tasks:
# 预处理任务
- name: Check system requirements
assert:
that:
- ansible_memtotal_mb >= 2048
- ansible_architecture == "x86_64"
fail_msg: "System does not meet minimum requirements"
tasks:
# 主要任务列表
- name: Install required packages
package:
name: "{{ item }}"
state: present
loop: "{{ required_packages }}"
tags:
- packages
- install
post_tasks:
# 后处理任务
- name: Verify application is running
uri:
url: "http://{{ ansible_default_ipv4.address }}/health"
method: GET
status_code: 200
retries: 5
delay: 10
handlers:
# 处理器,响应notify
- name: restart webapp
systemd:
name: webapp
state: restarted
错误处理和调试
在生产环境中,错误处理是非常重要的:
- name: Download application package
get_url:
url: "{{ app_download_url }}"
dest: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
timeout: 300
register: download_result
retries: 3
delay: 10
until: download_result is succeeded
- name: Extract application package
unarchive:
src: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
dest: /opt/app/releases/
remote_src: yes
creates: "/opt/app/releases/{{ app_version }}" # 如果目录已存在,跳过
register: extract_result
- name: Verify extraction
stat:
path: "/opt/app/releases/{{ app_version }}/app.py"
register: app_file
failed_when: not app_file.stat.exists
- name: Rollback on failure
file:
path: "/opt/app/releases/{{ app_version }}"
state: absent
when: extract_result is failed
使用block进行错误处理
- name: Deploy application with rollback
block:
- name: Stop application
systemd:
name: webapp
state: stopped
- name: Update symlink
file:
src: "/opt/app/releases/{{ app_version }}"
dest: /opt/app/current
state: link
- name: Start application
systemd:
name: webapp
state: started
- name: Verify application health
uri:
url: "http://localhost:8080/health"
status_code: 200
retries: 5
delay: 10
rescue:
# 如果block中的任务失败,执行rescue
- name: Rollback to previous version
file:
src: "/opt/app/releases/{{ previous_version }}"
dest: /opt/app/current
state: link
- name: Restart application
systemd:
name: webapp
state: restarted
- name: Send failure notification
mail:
to: ops@company.com
subject: "Deployment failed on {{ inventory_hostname }}"
body: "Application deployment failed, rolled back to {{ previous_version }}"
always:
# 无论成功还是失败都会执行
- name: Clean up temporary files
file:
path: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
state: absent
条件判断和循环的高级用法
# 复杂条件判断
- name: Install database server
package:
name: "{{ db_package }}"
state: present
vars:
db_package: "{% if ansible_os_family == 'RedHat' %}mysql-server{% elif ansible_os_family == 'Debian' %}mysql-server-8.0{% endif %}"
when:
- inventory_hostname in groups['databases']
- install_database | default(true)
- ansible_memtotal_mb > 2048
# 循环中的条件判断
- name: Create users
user:
name: "{{ item.name }}"
groups: "{{ item.groups | default([]) }}"
shell: "{{ item.shell | default('/bin/bash') }}"
state: "{{ item.state | default('present') }}"
loop: "{{ users }}"
when:
- item.name is defined
- item.state | default('present') == 'present'
# 字典循环
- name: Configure virtual hosts
template:
src: vhost.conf.j2
dest: "/etc/nginx/sites-available/{{ item.key }}"
loop: "{{ vhosts | dict2items }}"
notify: restart nginx
# 嵌套循环
- name: Install packages for each environment
package:
name: "{{ item.1 }}"
state: present
loop: "{{ environments | subelements('packages') }}"
vars:
environments:
- name: development
packages: ['git', 'vim', 'curl']
- name: production
packages: ['nginx', 'mysql-server']
变量和模板的高级技巧
# 使用set_fact动态设置变量
- name: Determine database master
set_fact:
db_master: "{{ item }}"
loop: "{{ groups['databases'] }}"
when: hostvars[item]['mysql_role'] == 'master'
# 使用lookup插件读取外部数据
- name: Read password from file
set_fact:
db_password: "{{ lookup('file', '/etc/ansible/secrets/db_password') }}"
# 使用过滤器处理数据
- name: Show formatted information
debug:
msg: |
Server: {{ inventory_hostname }}
IP: {{ ansible_default_ipv4.address }}
Memory: {{ (ansible_memtotal_mb / 1024) | round(1) }} GB
Disk: {{ ansible_mounts | selectattr('mount', 'equalto', '/') | map(attribute='size_total') | first | filesizeformat }}
Uptime: {{ (ansible_uptime_seconds | int / 3600) | round(1) }} hours
实战案例:完整的Web应用部署
让我分享一个完整的生产环境部署案例。这是我之前负责的一个电商网站的部署方案,包含了负载均衡、Web服务器、数据库、缓存等组件。
项目目录结构
webapp-deploy/
├── ansible.cfg
├── inventories/
│ ├── production/
│ │ ├── hosts
│ │ ├── group_vars/
│ │ │ ├── all.yml
│ │ │ ├── webservers.yml
│ │ │ ├── databases.yml
│ │ │ └── loadbalancers.yml
│ │ └── host_vars/
│ └── staging/
├── roles/
│ ├── common/
│ ├── nginx/
│ ├── mysql/
│ ├── redis/
│ └── webapp/
├── playbooks/
│ ├── site.yml
│ ├── deploy.yml
│ └── maintenance.yml
├── group_vars/
│ └── all.yml
└── vars/
├── secrets.yml
└── common.yml
主要的site.yml文件
---
# 完整的站点部署playbook
- import_playbook: playbooks/common.yml
- import_playbook: playbooks/databases.yml
- import_playbook: playbooks/cache.yml
- import_playbook: playbooks/webservers.yml
- import_playbook: playbooks/loadbalancers.yml
- import_playbook: playbooks/monitoring.yml
# 通用配置playbook
- name: Configure all servers
hosts: all
become: yes
roles:
- common
tags:
- common
- base
# 数据库服务器配置
- name: Configure database servers
hosts: databases
become: yes
serial: 1 # 一台一台配置,避免同时停机
roles:
- mysql
tags:
- database
- mysql
# Web服务器配置
- name: Configure web servers
hosts: webservers
become: yes
roles:
- nginx
- webapp
tags:
- web
- nginx
- app
# 负载均衡器配置
- name: Configure load balancers
hosts: loadbalancers
become: yes
roles:
- nginx
tags:
- lb
- nginx
通用Role(roles/common/tasks/main.yml)
---
- name: Update system packages
package:
name: "*"
state: latest
when: update_packages | default(false)
tags: packages
- name: Install essential packages
package:
name: "{{ essential_packages }}"
state: present
tags: packages
- name: Configure timezone
timezone:
name: "{{ system_timezone | default('Asia/Shanghai') }}"
notify: restart rsyslog
- name: Configure NTP
template:
src: chrony.conf.j2
dest: /etc/chrony.conf
backup: yes
notify: restart chronyd
when: ansible_os_family == "RedHat"
- name: Start and enable chronyd
systemd:
name: chronyd
state: started
enabled: yes
- name: Create common directories
file:
path: "{{ item }}"
state: directory
owner: root
group: root
mode: '0755'
loop:
- /opt/scripts
- /var/log/ansible
- /etc/ansible/facts.d
- name: Configure system limits
pam_limits:
domain: "{{ item.domain }}"
limit_type: "{{ item.type }}"
limit_item: "{{ item.item }}"
value: "{{ item.value }}"
loop:
- { domain: '*', type: 'soft', item: 'nofile', value: '65536' }
- { domain: '*', type: 'hard', item: 'nofile', value: '65536' }
- { domain: '*', type: 'soft', item: 'nproc', value: '32768' }
- { domain: '*', type: 'hard', item: 'nproc', value: '32768' }
- name: Configure kernel parameters
sysctl:
name: "{{ item.name }}"
value: "{{ item.value }}"
state: present
reload: yes
loop:
- { name: 'net.core.somaxconn', value: '32768' }
- { name: 'net.ipv4.tcp_max_syn_backlog', value: '32768' }
- { name: 'net.ipv4.tcp_fin_timeout', value: '10' }
- { name: 'net.ipv4.tcp_keepalive_time', value: '1200' }
- { name: 'vm.swappiness', value: '10' }
- name: Configure SSH hardening
lineinfile:
path: /etc/ssh/sshd_config
regexp: "{{ item.regexp }}"
line: "{{ item.line }}"
backup: yes
loop:
- { regexp: '^#?PermitRootLogin', line: 'PermitRootLogin no' }
- { regexp: '^#?PasswordAuthentication', line: 'PasswordAuthentication no' }
- { regexp: '^#?MaxAuthTries', line: 'MaxAuthTries 3' }
- { regexp: '^#?ClientAliveInterval', line: 'ClientAliveInterval 300' }
- { regexp: '^#?ClientAliveCountMax', line: 'ClientAliveCountMax 2' }
notify: restart sshd
- name: Setup logrotate for application logs
template:
src: app-logrotate.j2
dest: /etc/logrotate.d/webapp
owner: root
group: root
mode: '0644'
- name: Install custom monitoring script
template:
src: system_monitor.sh.j2
dest: /opt/scripts/system_monitor.sh
owner: root
group: root
mode: '0755'
- name: Setup cron job for monitoring
cron:
name: "System monitoring"
minute: "*/5"
job: "/opt/scripts/system_monitor.sh"
user: root
MySQL Role(roles/mysql/tasks/main.yml)
---
- name: Install MySQL packages
package:
name: "{{ mysql_packages }}"
state: present
- name: Start and enable MySQL
systemd:
name: "{{ mysql_service }}"
state: started
enabled: yes
- name: Check if MySQL root password is set
shell: mysql -u root -e "SELECT 1"
register: mysql_root_check
failed_when: false
changed_when: false
no_log: true
- name: Set MySQL root password
mysql_user:
name: root
password: "{{ mysql_root_password }}"
login_unix_socket: /var/lib/mysql/mysql.sock
when: mysql_root_check.rc == 0
- name: Create MySQL configuration file
template:
src: my.cnf.j2
dest: /etc/my.cnf
backup: yes
owner: root
group: root
mode: '0644'
notify: restart mysql
- name: Create MySQL databases
mysql_db:
name: "{{ item.name }}"
encoding: "{{ item.encoding | default('utf8mb4') }}"
collation: "{{ item.collation | default('utf8mb4_unicode_ci') }}"
login_user: root
login_password: "{{ mysql_root_password }}"
state: present
loop: "{{ mysql_databases }}"
no_log: true
- name: Create MySQL users
mysql_user:
name: "{{ item.name }}"
password: "{{ item.password }}"
priv: "{{ item.priv }}"
host: "{{ item.host | default('localhost') }}"
login_user: root
login_password: "{{ mysql_root_password }}"
state: present
loop: "{{ mysql_users }}"
no_log: true
- name: Configure MySQL replication (master)
template:
src: master.cnf.j2
dest: /etc/mysql/conf.d/master.cnf
when: mysql_role == 'master'
notify: restart mysql
- name: Configure MySQL replication (slave)
template:
src: slave.cnf.j2
dest: /etc/mysql/conf.d/slave.cnf
when: mysql_role == 'slave'
notify: restart mysql
- name: Create replication user (master only)
mysql_user:
name: "{{ mysql_replication_user }}"
password: "{{ mysql_replication_password }}"
priv: "*.*:REPLICATION SLAVE"
host: "%"
login_user: root
login_password: "{{ mysql_root_password }}"
state: present
when: mysql_role == 'master'
no_log: true
- name: Setup MySQL backup script
template:
src: mysql_backup.sh.j2
dest: /opt/scripts/mysql_backup.sh
owner: root
group: root
mode: '0700'
- name: Setup MySQL backup cron job
cron:
name: "MySQL backup"
minute: "0"
hour: "2"
job: "/opt/scripts/mysql_backup.sh"
user: root
when: mysql_role == 'master'
应用部署Role(roles/webapp/tasks/main.yml)
---
- name: Create application user
user:
name: "{{ app_user }}"
system: yes
shell: /bin/bash
home: "{{ app_home }}"
create_home: yes
- name: Create application directories
file:
path: "{{ item }}"
state: directory
owner: "{{ app_user }}"
group: "{{ app_user }}"
mode: '0755'
loop:
- "{{ app_home }}"
- "{{ app_home }}/releases"
- "{{ app_home }}/shared"
- "{{ app_home }}/shared/logs"
- "{{ app_home }}/shared/config"
- "{{ app_home }}/shared/uploads"
- name: Install Python and pip
package:
name:
- python3
- python3-pip
- python3-venv
- git
state: present
- name: Check if application is already deployed
stat:
path: "{{ app_home }}/current"
register: current_release
- name: Get current release version
shell: readlink {{ app_home }}/current | xargs basename
register: current_version
when: current_release.stat.exists
changed_when: false
- name: Set previous version fact
set_fact:
previous_version: "{{ current_version.stdout | default('none') }}"
- name: Create release directory
file:
path: "{{ app_home }}/releases/{{ app_version }}"
state: directory
owner: "{{ app_user }}"
group: "{{ app_user }}"
mode: '0755'
- name: Download application code
get_url:
url: "{{ app_download_url }}"
dest: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
timeout: 300
register: download_result
retries: 3
delay: 10
- name: Extract application code
unarchive:
src: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
dest: "{{ app_home }}/releases/{{ app_version }}"
remote_src: yes
owner: "{{ app_user }}"
group: "{{ app_user }}"
extra_opts: [--strip-components=1]
- name: Create virtual environment
pip:
requirements: "{{ app_home }}/releases/{{ app_version }}/requirements.txt"
virtualenv: "{{ app_home }}/releases/{{ app_version }}/venv"
virtualenv_python: python3
become_user: "{{ app_user }}"
- name: Generate application config
template:
src: "{{ item.src }}"
dest: "{{ app_home }}/shared/config/{{ item.dest }}"
owner: "{{ app_user }}"
group: "{{ app_user }}"
mode: '0600'
loop:
- { src: 'app_config.py.j2', dest: 'config.py' }
- { src: 'database.ini.j2', dest: 'database.ini' }
notify: restart webapp
- name: Create symlinks to shared resources
file:
src: "{{ app_home }}/shared/{{ item.src }}"
dest: "{{ app_home }}/releases/{{ app_version }}/{{ item.dest }}"
state: link
owner: "{{ app_user }}"
group: "{{ app_user }}"
force: yes
loop:
- { src: 'logs', dest: 'logs' }
- { src: 'config/config.py', dest: 'config.py' }
- { src: 'config/database.ini', dest: 'database.ini' }
- { src: 'uploads', dest: 'static/uploads' }
- name: Run database migrations
shell: |
source {{ app_home }}/releases/{{ app_version }}/venv/bin/activate
python manage.py migrate
args:
chdir: "{{ app_home }}/releases/{{ app_version }}"
become_user: "{{ app_user }}"
register: migration_result
when: run_migrations | default(true)
- name: Collect static files
shell: |
source {{ app_home }}/releases/{{ app_version }}/venv/bin/activate
python manage.py collectstatic --noinput
args:
chdir: "{{ app_home }}/releases/{{ app_version }}"
become_user: "{{ app_user }}"
when: collect_static | default(true)
- name: Create systemd service file
template:
src: webapp.service.j2
dest: "/etc/systemd/system/{{ app_name }}.service"
owner: root
group: root
mode: '0644'
notify:
- reload systemd
- restart webapp
- name: Update current symlink
file:
src: "{{ app_home }}/releases/{{ app_version }}"
dest: "{{ app_home }}/current"
state: link
owner: "{{ app_user }}"
group: "{{ app_user }}"
force: yes
notify: restart webapp
- name: Start and enable application service
systemd:
name: "{{ app_name }}"
state: started
enabled: yes
daemon_reload: yes
- name: Wait for application to start
wait_for:
port: "{{ app_port }}"
host: 127.0.0.1
delay: 5
timeout: 60
- name: Verify application health
uri:
url: "http://127.0.0.1:{{ app_port }}/health"
method: GET
status_code: 200
retries: 5
delay: 10
register: health_check
- name: Clean up old releases
shell: |
cd {{ app_home }}/releases
ls -t | tail -n +{{ keep_releases | default(5) + 1 }} | xargs -r rm -rf
become_user: "{{ app_user }}"
when: cleanup_releases | default(true)
- name: Clean up downloaded archive
file:
path: "/tmp/{{ app_name }}-{{ app_version }}.tar.gz"
state: absent
Nginx配置模板(roles/nginx/templates/webapp.conf.j2)
# {{ ansible_managed }}
# Nginx configuration for {{ app_name }}
upstream {{ app_name }}_backend {
{% for host in groups['webservers'] %}
server {{ hostvars[host]['ansible_default_ipv4']['address'] }}:{{ app_port }} weight={{ hostvars[host]['nginx_weight'] | default(1) }};
{% endfor %}
# 健康检查和故障转移
keepalive 32;
keepalive_requests 100;
keepalive_timeout 60s;
}
# HTTP重定向到HTTPS
server {
listen 80;
server_name {{ app_domain }} www.{{ app_domain }};
# Let's Encrypt验证
location /.well-known/acme-challenge/ {
root /var/www/letsencrypt;
}
location / {
return 301 https://$server_name$request_uri;
}
}
# HTTPS服务器配置
server {
listen 443 ssl http2;
server_name {{ app_domain }} www.{{ app_domain }};
# SSL证书配置
ssl_certificate {{ ssl_cert_path }};
ssl_certificate_key {{ ssl_key_path }};
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384;
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 10m;
# 安全头
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header X-Frame-Options DENY always;
add_header X-Content-Type-Options nosniff always;
add_header X-XSS-Protection "1; mode=block" always;
# 访问日志
access_log /var/log/nginx/{{ app_name }}_access.log combined;
error_log /var/log/nginx/{{ app_name }}_error.log;
# 客户端上传限制
client_max_body_size {{ max_upload_size | default('10M') }};
client_body_timeout 60s;
client_header_timeout 60s;
# Gzip压缩
gzip on;
gzip_vary on;
gzip_min_length 1024;
gzip_types text/plain text/css text/xml text/javascript application/javascript application/xml+rss application/json;
# 静态文件缓存
location /static/ {
alias {{ app_home }}/shared/uploads/;
expires 30d;
add_header Cache-Control "public, immutable";
# 安全设置
location ~* \.(php|py|pl|sh)$ {
deny all;
}
}
# 媒体文件
location /media/ {
alias {{ app_home }}/shared/uploads/;
expires 7d;
add_header Cache-Control "public";
}
# 应用程序代理
location / {
proxy_pass http://{{ app_name }}_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# 代理超时设置
proxy_connect_timeout 30s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
# 缓冲设置
proxy_buffering on;
proxy_buffer_size 4k;
proxy_buffers 8 4k;
proxy_busy_buffers_size 8k;
# 健康检查
proxy_next_upstream error timeout http_500 http_502 http_503;
proxy_next_upstream_tries 3;
proxy_next_upstream_timeout 30s;
}
# 健康检查端点
location /nginx-health {
access_log off;
return 200 "healthy\n";
add_header Content-Type text/plain;
}
# 禁止访问敏感文件
location ~ /\. {
deny all;
}
location ~ \.(sql|log|conf)$ {
deny all;
}
}
高级功能和最佳实践
在生产环境中使用Ansible,还有很多高级功能和最佳实践需要掌握。
Ansible Vault敏感信息管理
在生产环境中,密码、密钥等敏感信息不能明文存储。Ansible Vault提供了加密功能:
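除了整文件加密,也可以用ansible-vault encrypt_string只加密单个变量,把输出直接粘到group_vars里,格式大致是这样(只是格式示意,密文要用你自己生成的):
# group_vars/all.yml 片段
mysql_root_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  ...这里粘贴 ansible-vault encrypt_string 命令的输出...
整文件级别的常用命令如下: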
# 创建加密文件
ansible-vault create vars/secrets.yml
# 编辑加密文件
ansible-vault edit vars/secrets.yml
# 加密现有文件
ansible-vault encrypt vars/passwords.yml
# 解密文件
ansible-vault decrypt vars/passwords.yml
# 查看加密文件内容
ansible-vault view vars/secrets.yml
secrets.yml文件内容:
# 数据库密码
mysql_root_password: "SuperSecretPassword123!"
mysql_replication_password: "ReplicationPass456!"
# 应用密钥
app_secret_key: "django-secret-key-very-long-and-random"
jwt_secret: "jwt-signing-key-also-very-secret"
# API密钥
aws_access_key: "AKIAIOSFODNN7EXAMPLE"
aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
# SSL证书密码
ssl_key_password: "ssl-private-key-password"
在playbook中使用:
---
- name: Deploy with secrets
hosts: webservers
vars_files:
- vars/secrets.yml
tasks:
- name: Configure database connection
template:
src: database.conf.j2
dest: /etc/app/database.conf
mode: '0600'
vars:
db_password: "{{ mysql_root_password }}"
执行时需要提供密码:
ansible-playbook --ask-vault-pass deploy.yml
# 或者使用密码文件
echo "vault_password" > ~/.vault_pass
chmod 600 ~/.vault_pass
ansible-playbook --vault-password-file ~/.vault_pass deploy.yml
自定义模块开发
有时候内置模块不能满足需求,你可以开发自定义模块。我写过一个检查应用健康状态的模块:
#!/usr/bin/python
# library/health_check.py
from ansible.module_utils.basic import AnsibleModule
import requests
import time
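# 说明(补充):自定义模块是在被管理的目标机上执行的,所以requests需要装在目标机上;
# 如果不想引入第三方依赖,可以改用ansible.module_utils.urls里的fetch_url,这里为了易读用requests。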
def main():
module = AnsibleModule(
argument_spec=dict(
url=dict(type='str', required=True),
timeout=dict(type='int', default=30),
retries=dict(type='int', default=3),
delay=dict(type='int', default=5),
expected_status=dict(type='int', default=200),
expected_content=dict(type='str', default=None)
),
supports_check_mode=True
)
url = module.params['url']
timeout = module.params['timeout']
retries = module.params['retries']
delay = module.params['delay']
expected_status = module.params['expected_status']
expected_content = module.params['expected_content']
if module.check_mode:
module.exit_json(changed=False, msg="Check mode - would check health")
last_error = None
for attempt in range(retries):
try:
response = requests.get(url, timeout=timeout)
if response.status_code != expected_status:
raise Exception(f"Expected status {expected_status}, got {response.status_code}")
if expected_content and expected_content not in response.text:
raise Exception(f"Expected content '{expected_content}' not found in response")
module.exit_json(
changed=False,
msg="Health check passed",
status_code=response.status_code,
response_time=response.elapsed.total_seconds(),
attempt=attempt + 1
)
except Exception as e:
last_error = str(e)
if attempt < retries - 1:
time.sleep(delay)
continue
module.fail_json(msg=f"Health check failed after {retries} attempts: {last_error}")
if __name__ == '__main__':
main()
在playbook中使用:
- name: Check application health
health_check:
url: "http://{{ ansible_default_ipv4.address }}:8080/health"
timeout: 10
retries: 5
delay: 3
expected_content: "healthy"
register: health_result
- name: Show health check result
debug:
var: health_result
性能优化技巧
在管理大量服务器时,性能优化很重要:
# ansible.cfg 性能优化配置
[defaults]
# 增加并发数
forks = 100
# 只收集需要的facts子集,加快事实收集
gather_subset = !all,!any,network,hardware,virtual
# 打开计时类callback,方便定位慢任务
callback_whitelist = timer, profile_tasks
# 启用SSH连接复用和pipelining
[ssh_connection]
ssh_args = -C -o ControlMaster=auto -o ControlPersist=3600s
pipelining = True
# 限制启用的inventory插件,减少解析开销
[inventory]
enable_plugins = host_list, script, yaml, ini, auto
在playbook中也可以做一些优化:
- name: Optimized playbook
hosts: webservers
gather_facts: no # 如果不需要系统信息,禁用事实收集
strategy: free # 使用free策略,服务器可以独立执行
tasks:
- name: Gather minimal facts
setup:
gather_subset:
- '!all'
- '!any'
- network
when: need_network_info | default(false)
- name: Batch operations
package:
name: "{{ packages }}"
state: present
# 批量操作比循环更高效
- name: Use async for long running tasks
shell: /opt/scripts/long_running_task.sh
async: 3600 # 最大运行时间
poll: 0 # 不等待结果
register: long_task
- name: Check async task status
async_status:
jid: "{{ long_task.ansible_job_id }}"
register: task_result
until: task_result.finished
retries: 60
delay: 10
滚动部署和蓝绿部署
在生产环境中,零停机部署是很重要的。我来分享一个滚动部署的例子:
---
- name: Rolling deployment
hosts: webservers
serial: "25%" # 每次部署25%的服务器
max_fail_percentage: 10 # 允许10%的服务器失败
pre_tasks:
- name: Remove server from load balancer
uri:
url: "http://{{ loadbalancer_host }}/api/servers/{{ inventory_hostname }}/disable"
method: POST
headers:
Authorization: "Bearer {{ lb_api_token }}"
delegate_to: localhost
- name: Wait for connections to drain
wait_for:
port: 80
host: "{{ ansible_default_ipv4.address }}"
state: drained
timeout: 300
tasks:
- name: Deploy new version
include_role:
name: webapp
vars:
app_version: "{{ new_version }}"
- name: Verify deployment
uri:
url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
method: GET
status_code: 200
retries: 10
delay: 5
post_tasks:
- name: Add server back to load balancer
uri:
url: "http://{{ loadbalancer_host }}/api/servers/{{ inventory_hostname }}/enable"
method: POST
headers:
Authorization: "Bearer {{ lb_api_token }}"
delegate_to: localhost
- name: Wait for server to be healthy in LB
uri:
url: "http://{{ loadbalancer_host }}/api/servers/{{ inventory_hostname }}/status"
method: GET
register: lb_status
until: lb_status.json.status == "healthy"
retries: 20
delay: 10
delegate_to: localhost
蓝绿部署的实现:
---
- name: Blue-Green Deployment
hosts: localhost
vars:
current_env: "{{ 'blue' if active_environment == 'green' else 'green' }}"
target_env: "{{ 'green' if active_environment == 'blue' else 'blue' }}"
tasks:
- name: Deploy to inactive environment
include_tasks: deploy-to-environment.yml
vars:
deploy_env: "{{ target_env }}"  # 注意:不要用environment做变量名,它是Ansible的保留关键字,这里换成deploy_env(被包含的task文件里也要同步改)
version: "{{ new_version }}"
- name: Run smoke tests on target environment
uri:
url: "http://{{ target_env }}.internal.example.com/health"
method: GET
status_code: 200
retries: 10
delay: 30
- name: Switch load balancer to new environment
template:
src: nginx-upstream.conf.j2
dest: /etc/nginx/conf.d/upstream.conf
vars:
active_pool: "{{ target_env }}"
delegate_to: "{{ item }}"
loop: "{{ groups['loadbalancers'] }}"
notify: reload nginx
- name: Update active environment marker
lineinfile:
path: /etc/ansible/facts.d/deployment.fact
regexp: '^active_environment='
line: "active_environment={{ target_env }}"
delegate_to: "{{ item }}"
loop: "{{ groups['all'] }}"
- name: Keep old environment for rollback
debug:
msg: "Deployment complete. Old environment {{ current_env }} kept for rollback if needed."监控和告警集成
我经常在部署过程中集成监控和告警:
- name: Deploy with monitoring
hosts: webservers
tasks:
- name: Disable monitoring alerts during deployment
uri:
url: "{{ prometheus_alertmanager_url }}/api/v1/silences"
method: POST
body_format: json
body:
matchers:
- name: instance
value: "{{ ansible_default_ipv4.address }}:9100"
startsAt: "{{ ansible_date_time.iso8601 }}"
endsAt: "{{ (ansible_date_time.epoch | int + 1800) | strftime('%Y-%m-%dT%H:%M:%S.000Z') }}"
createdBy: "ansible-deployment"
comment: "Deployment in progress"
delegate_to: localhost
register: silence_id
- name: Deploy application
include_role:
name: webapp
- name: Send deployment notification to Slack
uri:
url: "{{ slack_webhook_url }}"
method: POST
body_format: json
body:
text: "🚀 Deployment completed on {{ inventory_hostname }}"
attachments:
- color: "good"
fields:
- title: "Version"
value: "{{ app_version }}"
short: true
- title: "Environment"
value: "{{ env }}"
short: true
- title: "Server"
value: "{{ inventory_hostname }}"
short: true
delegate_to: localhost
when: notify_slack | default(true)
- name: Update deployment tracking
uri:
url: "{{ deployment_api_url }}/deployments"
method: POST
body_format: json
headers:
Authorization: "Bearer {{ deployment_api_token }}"
body:
application: "{{ app_name }}"
version: "{{ app_version }}"
environment: "{{ env }}"
server: "{{ inventory_hostname }}"
timestamp: "{{ ansible_date_time.iso8601 }}"
status: "success"
delegate_to: localhost
错误处理和回滚机制
完善的错误处理对生产环境至关重要:
---
- name: Deployment with automatic rollback
hosts: webservers
vars:
rollback_on_failure: true
health_check_retries: 5
tasks:
- name: Get current version for rollback
shell: readlink {{ app_home }}/current | xargs basename
register: current_version
changed_when: false
failed_when: false
- name: Deployment block
block:
- name: Deploy new version
include_role:
name: webapp
vars:
app_version: "{{ new_version }}"
- name: Health check after deployment
uri:
url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
method: GET
status_code: 200
retries: "{{ health_check_retries }}"
delay: 10
register: health_check_result
- name: Performance test
shell: |
curl -w "@curl-format.txt" -o /dev/null -s "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/"
register: perf_test
failed_when: perf_test.stdout | regex_search('time_total:\s+([0-9.]+)') | regex_replace('time_total:\s+', '') | float > 2.0
rescue:
- name: Log deployment failure
debug:
msg: "Deployment failed: {{ ansible_failed_result.msg | default('Unknown error') }}"
- name: Rollback to previous version
block:
- name: Stop failed service
systemd:
name: "{{ app_name }}"
state: stopped
ignore_errors: yes
- name: Restore previous version symlink
file:
src: "{{ app_home }}/releases/{{ current_version.stdout }}"
dest: "{{ app_home }}/current"
state: link
force: yes
when:
- rollback_on_failure
- current_version.stdout != ""
- name: Start service with previous version
systemd:
name: "{{ app_name }}"
state: started
when: rollback_on_failure
- name: Verify rollback success
uri:
url: "http://{{ ansible_default_ipv4.address }}:{{ app_port }}/health"
method: GET
status_code: 200
retries: 3
delay: 5
when: rollback_on_failure
- name: Send rollback notification
mail:
to: "{{ ops_email }}"
subject: "URGENT: Deployment failed and rolled back on {{ inventory_hostname }}"
body: |
Deployment of {{ app_name }} version {{ new_version }} failed on {{ inventory_hostname }}.
Error: {{ ansible_failed_result.msg | default('Unknown error') }}
System has been rolled back to version {{ current_version.stdout }}.
Please investigate immediately.
delegate_to: localhost
when: rollback_on_failure
rescue:
- name: Rollback also failed
fail:
msg: "Both deployment and rollback failed. Manual intervention required!"
- name: Fail the play
fail:
msg: "Deployment failed and rollback completed"
when: rollback_on_failure
always:
- name: Clean up temporary files
file:
path: "{{ item }}"
state: absent
loop:
- "/tmp/{{ app_name }}-{{ new_version }}.tar.gz"
- "/tmp/deployment-{{ ansible_date_time.epoch }}.log"
ignore_errors: yes
多环境管理
在实际工作中,通常需要管理多个环境。我的目录结构是这样的:
project/
├── inventories/
│ ├── development/
│ │ ├── hosts
│ │ └── group_vars/
│ ├── staging/
│ │ ├── hosts
│ │ └── group_vars/
│ └── production/
│ ├── hosts
│ └── group_vars/
├── playbooks/
├── roles/
└── scripts/
每个环境有不同的配置:
inventories/production/group_vars/all.yml:
# 生产环境配置
env: production
domain: example.com
# 数据库配置
mysql_config:
innodb_buffer_pool_size: "4G"
max_connections: 500
query_cache_size: "256M"
# 应用配置
app_workers: 8
app_timeout: 30
max_upload_size: "50M"
# 缓存配置
redis_maxmemory: "2gb"
redis_maxmemory_policy: "allkeys-lru"
# 监控配置
monitoring_enabled: true
log_level: "warning"
# 安全配置
firewall_enabled: true
ssl_required: true
inventories/development/group_vars/all.yml:
# 开发环境配置
env: development
domain: dev.example.com
# 数据库配置
mysql_config:
innodb_buffer_pool_size: "512M"
max_connections: 100
query_cache_size: "64M"
# 应用配置
app_workers: 2
app_timeout: 60
max_upload_size: "10M"
# 缓存配置
redis_maxmemory: "256mb"
redis_maxmemory_policy: "noeviction"
# 监控配置
monitoring_enabled: false
log_level: "debug"
# 安全配置
firewall_enabled: false
ssl_required: false
常见问题和解决方案
在生产环境使用Ansible这么多年,我遇到过各种各样的问题,这里分享一些常见的:
SSH连接问题
# 解决SSH连接超时
- name: Configure SSH settings
blockinfile:
path: ~/.ssh/config
create: yes
block: |
Host *
ServerAliveInterval 60
ServerAliveCountMax 3
TCPKeepAlive yes
ControlMaster auto
ControlPath ~/.ssh/control-%r@%h:%p
ControlPersist 3600
权限问题
# 处理sudo权限问题
- name: Tasks requiring different privileges
block:
- name: Install system packages
package:
name: nginx
state: present
become: yes
become_user: root
- name: Configure application
template:
src: app.conf.j2
dest: /opt/app/app.conf
become: yes
become_user: app
大文件传输问题
# 对于大文件,使用分块传输
- name: Download large file
get_url:
url: "{{ large_file_url }}"
dest: "/tmp/large_file.tar.gz"
timeout: 1800
force: yes
register: download_result
retries: 3
delay: 60
until: download_result is succeeded
# 或者使用rsync
- name: Sync large directory
synchronize:
src: /local/large_directory/
dest: /remote/large_directory/
delete: yes
compress: yes
recursive: yes
内存不足问题
# 在内存较小的服务器上分批处理
- name: Process large dataset in batches
shell: |
for i in {1..10}; do
python process_batch.py --batch $i
sleep 5
done
when: ansible_memtotal_mb < 4096
监控和日志管理
在生产环境中,监控Ansible的执行情况很重要:
启用详细日志
# ansible.cfg
[defaults]
log_path = /var/log/ansible/ansible.log
display_skipped_hosts = false
display_ok_hosts = true
# 使用callback插件
callback_whitelist = timer, profile_tasks, log_plays
[callback_profile_tasks]
task_output_limit = 100
自定义日志格式
# callback_plugins/custom_logger.py
from ansible.plugins.callback import CallbackBase
import json
import time
class CallbackModule(CallbackBase):
CALLBACK_VERSION = 2.0
CALLBACK_TYPE = 'notification'
CALLBACK_NAME = 'custom_logger'
def __init__(self):
super(CallbackModule, self).__init__()
self.start_time = time.time()
def v2_playbook_on_start(self, playbook):
self._display.display(f"Playbook started: {playbook._file_name}")
def v2_runner_on_ok(self, result):
host = result._host.get_name()
task = result._task.get_name()
log_entry = {
'timestamp': time.time(),
'host': host,
'task': task,
'status': 'ok',
'changed': result._result.get('changed', False)
}
with open('/var/log/ansible/task_results.json', 'a') as f:
f.write(json.dumps(log_entry) + '\n')
def v2_runner_on_failed(self, result, ignore_errors=False):
host = result._host.get_name()
task = result._task.get_name()
log_entry = {
'timestamp': time.time(),
'host': host,
'task': task,
'status': 'failed',
'error': result._result.get('msg', 'Unknown error')
}
with open('/var/log/ansible/task_results.json', 'a') as f:
f.write(json.dumps(log_entry) + '\n')
集成监控系统
- name: Send metrics to monitoring system
uri:
url: "{{ metrics_endpoint }}/api/v1/metrics"
method: POST
body_format: json
body:
metric_name: "ansible.deployment.duration"
value: "{{ deployment_duration }}"
tags:
environment: "{{ env }}"
application: "{{ app_name }}"
version: "{{ app_version }}"
timestamp: "{{ ansible_date_time.epoch }}"
delegate_to: localhost
when: send_metrics | default(true)
总结
经过这么多年的使用,我觉得Ansible真的是运维工程师必须掌握的工具。它不仅能提高工作效率,更重要的是让运维工作变得更加规范和可重复。
从最初的手工操作,到写shell脚本批量处理,再到使用Ansible进行自动化管理,这个过程让我深刻体会到了工具的重要性。特别是在管理几百台服务器的时候,Ansible的价值就更加明显了。
但是要记住,工具只是手段,不是目的。真正重要的是要理解你的业务需求,设计合理的架构,然后用Ansible来实现和维护。我见过一些同事,过度依赖自动化,结果出问题的时候反而不知道怎么手工处理,这是不对的。
另外,在生产环境中使用Ansible,一定要注意以下几点:
- 测试,测试,再测试:任何playbook都要先在测试环境充分验证
- 备份很重要:部署前一定要备份数据和配置
- 监控和告警:要能及时发现问题
- 文档要完善:让团队其他成员也能理解和维护
- 安全第一:敏感信息要加密,权限要控制好
我现在的团队已经把所有的运维操作都Ansible化了,从服务器初始化到应用部署,从配置管理到故障处理,基本上都有对应的playbook。这样不仅提高了效率,也降低了出错的概率,新同事上手也更容易。
最后想说的是,学习Ansible不是一蹴而就的,需要在实际项目中不断练习和总结。我建议大家从简单的任务开始,比如批量安装软件包、复制配置文件等,然后逐步学习更复杂的功能。
如果你觉得这篇文章对你有帮助,欢迎点赞转发,让更多的运维同行看到。有什么问题也可以在评论区讨论,我会尽力回答。运维这条路很长,但是有了好的工具和方法,会走得更轻松一些。
关注@运维躬行录,我会持续分享更多实用的运维技术和经验,让我们一起在运维的道路上不断学习,不断进步!记住,最好的运维就是让系统稳定运行,让开发同学专注于业务开发,让用户感受不到我们的存在。这就是我们运维工程师的价值所在。