在业务开发过程中,显示的开启事务并且在事务处理过程中对不同的情况进行显示的COMMIT或ROLLBACK,这是一个完整数据库事务处理的闭环过程。
这种在应用开发逻辑层面去handle的事务执行的结果,既确保了事务操作的数据完整性,又遵循了业务处理逻辑。所以显示的提交或回滚事务也是开发规范中的要求,但是也有一些存量的业务系统或开发人员并不能严格按照这一规范执行,进而在实际生产过程中引发故障。这里介绍一个因为开启事务后未显示的回滚导致DDL阻塞进而引发的问题。
登录mysql数据库并创建分区表
CREATE TABLE tt1 (
id int NOT NULL,
sdate date NOT NULL,
c1 varchar(4) NOT NULL,
PRIMARY KEY (id, sdate)
)
PARTITION BY RANGE columns(sdate) (
PARTITION p20240524 VALUES LESS THAN ('2024-05-25'),
PARTITION p20240525 VALUES LESS THAN ('2024-05-26')
);
mysql> begin;
mysql> select * from tango.tt1;
+----+------------+-----+
| id | sdate | c1 |
+----+------------+-----+
| 1 | 2024-05-25 | aaa |
+----+------------+-----+
1 row in set (0.00 sec)
insert into tt1 values(1,'2024-05-25','aaa');
mysql> insert into tt1 values(3,'2024-05-27','ccc');
ERROR 1526 (HY000): Table has no partition for value from column_list
数据库执行报错提示插入的记录分区不存在。
mysql> select * from performance_schema.metadata_locks where object_name='tt1';
+-------------+---------------+-------------+-------------+-----------------------+--------------+---------------+-------------+-------------------+-----------------+----------------+
| OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | COLUMN_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_DURATION | LOCK_STATUS | SOURCE | OWNER_THREAD_ID | OWNER_EVENT_ID |
+-------------+---------------+-------------+-------------+-----------------------+--------------+---------------+-------------+-------------------+-----------------+----------------+
| TABLE | tango | tt1 | NULL | 140712994313232 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:5768 | 85 | 24 |
| TABLE | tango | tt1 | NULL | 140712994947616 | SHARED_WRITE | TRANSACTION | GRANTED | sql_parse.cc:5768 | 85 | 25 |
+-------------+---------------+-------------+-------------+-----------------------+--------------+---------------+-------------+-------------------+-----------------+----------------+
2 rows in set (0.00 sec)
可以看到表持有SHARED_READ和SHARED_WRITE锁,并不因为事务执行失败而释放,这也是mysql系数据库内核机制,事务报错后数据库层面并没有执行rollback操作,而是由应用自己决定是rollback还是commit。
mysql> ALTER table tt1 ADD PARTITION ( PARTITION p20240526 VALUES LESS THAN ('2024-05-27') );
此时这个DDL操作会hang住,查看表的元数据锁情况
mysql> select * from performance_schema.metadata_locks where object_name='tt1';
+-------------+---------------+-------------+-------------+-----------------------+-------------------+---------------+-------------+-------------------+-----------------+----------------+
| OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | COLUMN_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_DURATION | LOCK_STATUS | SOURCE | OWNER_THREAD_ID | OWNER_EVENT_ID |
+-------------+---------------+-------------+-------------+-----------------------+-------------------+---------------+-------------+-------------------+-----------------+----------------+
| TABLE | tango | tt1 | NULL | 140712801139968 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:5768 | 123 | 21 |
| TABLE | tango | tt1 | NULL | 140712793308528 | SHARED_WRITE | TRANSACTION | GRANTED | sql_parse.cc:5768 | 123 | 22 |
| TABLE | tango | tt1 | NULL | 140712926580592 | SHARED_UPGRADABLE | TRANSACTION | GRANTED | sql_parse.cc:5768 | 121 | 20 |
| TABLE | tango | tt1 | NULL | 140712928177104 | EXCLUSIVE | TRANSACTION | PENDING | mdl.cc:3753 | 121 | 20 |
+-------------+---------------+-------------+-------------+-----------------------+-------------------+---------------+-------------+-------------------+-----------------+----------------+
5 rows in set (0.00 sec)
可以看到一个pending状态的锁状态,查看对应的SQL语句,知道是新增分区的DDL操作。
mysql> select THREAD_ID,EVENT_ID,EVENT_NAME,TIMER_START,TIMER_END,TIMER_WAIT,LOCK_TIME,SQL_TEXT,STATEMENT_ID from events_statements_current where thread_id=121;
+-----------+----------+---------------------------+------------------+------------------+----------------+-----------+---------------------------------------------------------------------------------------+--------------+
| THREAD_ID | EVENT_ID | EVENT_NAME | TIMER_START | TIMER_END | TIMER_WAIT | LOCK_TIME | SQL_TEXT | STATEMENT_ID |
+-----------+----------+---------------------------+------------------+------------------+----------------+-----------+---------------------------------------------------------------------------------------+--------------+
| 121 | 20 | statement/sql/alter_table | 2670208499587000 | 26874253576000 | 17216858077000 | 246000000 | ALTER table tt1 ADD PARTITION ( PARTITION p20240526 VALUES LESS THAN ('2024-05-27') ) | 32613 |
+-----------+----------+---------------------------+------------------+------------------+----------------+-----------+---------------------------------------------------------------------------------------+--------------+
1 row in set (0.00 sec)
这里的DDL操作,在mysql数据库中通过参数lock_wait_timeout控制DDL等待超时时间,超过该时间DDL会报错。默认该参数配置为31536000s,实际生产业务系统会设置30~60s,一些核心业务系统会设置为5s。但是在DDL阻塞期间,也会影响新的业务的执行。
mysql> select * from tango.tt1;
该操作也会hang住,查看对应的锁情况,也是处于pending状态。也就是阻塞的DDL操作会影响接下去的业务对该表的访问,直到DDL超时失败后,后续的业务才会正常。
mysql> select * from performance_schema.metadata_locks where object_name='tt1';
+-------------+---------------+-------------+-------------+-----------------------+-------------------+---------------+-------------+-------------------+-----------------+----------------+
| OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | COLUMN_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_DURATION | LOCK_STATUS | SOURCE | OWNER_THREAD_ID | OWNER_EVENT_ID |
+-------------+---------------+-------------+-------------+-----------------------+-------------------+---------------+-------------+-------------------+-----------------+----------------+
| TABLE | tango | tt1 | NULL | 140712801139968 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:5768 | 123 | 21 |
| TABLE | tango | tt1 | NULL | 140712793308528 | SHARED_WRITE | TRANSACTION | GRANTED | sql_parse.cc:5768 | 123 | 22 |
| TABLE | tango | tt1 | NULL | 140712926580592 | SHARED_UPGRADABLE | TRANSACTION | GRANTED | sql_parse.cc:5768 | 121 | 20 |
| TABLE | tango | tt1 | NULL | 140712928177104 | EXCLUSIVE | TRANSACTION | PENDING | mdl.cc:3753 | 121 | 20 |
| TABLE | tango | tt1 | NULL | 140713468045808 | SHARED_READ | TRANSACTION | PENDING | sql_parse.cc:5768 | 120 | 6 |
+-------------+---------------+-------------+-------------+-----------------------+-------------------+---------------+-------------+-------------------+-----------------+----------------+
5 rows in set (0.00 sec)
对于MySQL生态的数据库,事务内执行失败后数据库没有锁资源没有释放本身机制上没有问题,像国产数据库中TiDB、GoldenDB都有类似的现象。对于其它数据库,比如Oracle、PostgreSQL等,针对这个场景是什么样的表现,接下去以openGauss数据库为例进行验证。
gsql -d postgres -p 5432
[opgauss@tango-01 data]$ gsql -d postgres -p 5432
gsql ((openGauss-lite 5.0.2 build 48a25b11) compiled at 2024-05-14 10:41:04 commit 0 last mr release)
openGauss=# create database tango;
tango=# CREATE TABLE tt1 (
tango(# id int NOT NULL,
tango(# sdate date NOT NULL,
tango(# c1 varchar(4) NOT NULL
tango(# )
tango-# PARTITION BY RANGE(sdate) (
tango(# PARTITION p20240524 VALUES LESS THAN ('2024-05-25'),
tango(# PARTITION p20240525 VALUES LESS THAN ('2024-05-26')
tango(# );
CREATE TABLE
tango=# \dt
Schema | Name | Type | Owner | Storage
--------+------+-------+---------+----------------------------------
public | tt1 | table | opgauss | {orientation=row,compression=no}
tango=# begin;
BEGIN
tango=# select * from tt1;
id | sdate | c1
----+---------------------+-----
1 | 2024-05-25 00:00:00 | aaa
(1 row)
tango=# insert into tt1 values(3,'2024-05-28','ccc');
ERROR: inserted partition key does not map to any table partition
提示报错分区不存在
tango=# ALTER table tt1 ADD PARTITION p20240526 VALUES LESS THAN ('2024-05-27');
ALTER TABLE
可以看到分区是新增成功的。
tango=# SELECT l.locktype,l.database,l.relation::regclass,l.page,l.tuple,l.virtualxid,l.transactionid,l.classid,l.objid,l.objsubid,l.pid,l.mode,l.granted FROM pg_locks l JOIN pg_class c ON l.relation = c.oid WHERE c.relname = 'tt1';
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | pid | mode | granted
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+-----------------+-----------------+---------
relation | 16384 | tt1 | | | | | | | | 140405684233984 | AccessShareLock | t
(1 row)
tango=# SELECT datname,pid,sessionid,usename,application_name,backend_start,xact_start,query_start,state,query FROM pg_stat_activity where datname='tango';
datname | pid | sessionid | usename | application_name | backend_start | xact_start | query_start |
state | query
---------+-----------------+-----------+---------+------------------+-------------------------------+-------------------------------+-------------------------------+------
---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------
tango | 140405684233984 | 8 | opgauss | gsql | 2024-05-26 15:45:47.008274+08 | 2024-05-26 15:47:40.481015+08 | 2024-05-26 15:47:45.822262+08 | idle
in transaction | select * from tt1;
当执行失败后,事务处于idle in transaction (aborted)状态,表锁持有的锁也不存在了。
tango=# SELECT l.locktype,l.database,l.relation::regclass,l.page,l.tuple,l.virtualxid,l.transactionid,l.classid,l.objid,l.objsubid,l.pid,l.mode,l.granted FROM pg_locks l JOIN pg_class c ON l.relation = c.oid WHERE c.relname = 'tt1';
locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | pid | mode | granted
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+-----+------+---------
(0 rows)
tango=# SELECT datname,pid,sessionid,usename,application_name,backend_start,xact_start,query_start,state,query FROM pg_stat_activity where datname='tango';
datname | pid | sessionid | usename | application_name | backend_start | xact_start | query_start |
state | query
---------+-----------------+-----------+---------+------------------+-------------------------------+-------------------------------+-------------------------------+------
-------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------
----
tango | 140405684233984 | 8 | opgauss | gsql | 2024-05-26 15:45:47.008274+08 | | 2024-05-26 15:49:09.0485+08 | idle
in transaction (aborted) | insert into tt1 values(3,'2024-05-28','ccc');
可以看到openGauss数据库和MySQL数据库在这种故障场景下的不同表现,对于openGauss数据库而言,当事务内处理失败后,事务已经被数据库rollback了,事务中所持有的表锁也相应的释放了,其它如Oracle、PostgreSQL数据库是有相同的表现。
其它数据库因为时间关系暂时不验证了,总结针对这个场景需要优化的点有:①业务开发时候对事务报错主动处理,并显示的执行commit或rollback操作;②数据库层设置合理的DDL超时时间;③对分区表进行预创建和有效的监控手段;④数据库的DDL操作和业务处理主流程松耦合,尽量在投产窗口执行。
参考资料:
因篇幅问题不能全部显示,请点此查看更多更全内容
Copyright © 2019- axer.cn 版权所有 湘ICP备2023022495号-12
违法及侵权请联系:TEL:199 18 7713 E-MAIL:2724546146@qq.com
本站由北京市万商天勤律师事务所王兴未律师提供法律服务