案例ORA-00600: internal error code, arguments: [qkaffsindex3], [], [], [], [], [], [], [], [], [], [], []

执行更新统计信息语句:

exec dbms_stats.gather_schema_stats(ownname=>’LIVE_KS’,degree=>2,cascade=>true,options=>’GATHER AUTO’,estimate_percent=>DBMS_STATS.AUTO_SAMPLE_SIZE,method_opt=>’FOR ALL COLUMNS SIZE AUTO’);

报错:

ORA-00600: internal error code, arguments: [qkaffsindex3], [], [], [], [], [], [], [], [], [], [], []

使用10046 trace跟踪:

PARSING IN CURSOR #140247979913912 len=501 dep=1 uid=0 oct=3 lid=0 tim=10144694746826 hv=3314025276 ad=’11db7aa78′ sqlid=’8pwjxwg2sh0tw’

select substrb(dump(val,16,0,64),1,240) ep, cnt from (select /*+ no_expand_table(t) index_rs(t)   no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl no_substrb_pad  */mod(“SYS_STSPQ0MN35WUGZJNRTQJ3QX27K”,9999999999) val,count(*) cnt  from “LIVE_KS”.”P_PRD_SERIAL_NUM” t  where mod(“SYS_STSPQ0MN35WUGZJNRTQJ3QX27K”,9999999999) is not null  group by mod(“SYS_STSPQ0MN35WUGZJNRTQJ3QX27K”,9999999999)) order by val

END OF STMT

PARSE #140247979913912:c=2034,e=1830,p=0,cr=6,cu=0,mis=1,r=0,dep=1,og=1,plh=283912825,tim=10144694746826

EXEC #140247979913912:c=16,e=16,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,plh=283912825,tim=10144694746881

WAIT #140247979913912: nam=’db file sequential read’ ela= 2299 file#=20 block#=16230458 blocks=1 obj#=5232618 tim=10144694749235

WAIT #140247979913912: nam=’db file sequential read’ ela= 752 file#=20 block#=16230459 blocks=1 obj#=5232618 tim=10144694750044

FETCH #140247979913912:c=0,e=3195,p=2,cr=4,cu=0,mis=0,r=1,dep=1,og=1,plh=283912825,tim=10144694750112

FETCH #140247979913912:c=0,e=6,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=1,plh=283912825,tim=10144694750167

STAT #140247979913912 id=1 cnt=1 pid=0 pos=1 obj=0 op=’SORT GROUP BY (cr=4 pr=2 pw=0 time=3192 us cost=3 size=24 card=2)’

STAT #140247979913912 id=2 cnt=16 pid=1 pos=1 obj=5232618 op=’INDEX FAST FULL SCAN IDX2_DLIND_ACTI_SALSTTID_PRDID (cr=4 pr=2 pw=0 time=3173 us cost=2 size=132 card=11)’

CLOSE #140247979913912:c=0,e=3,dep=1,type=0,tim=10144694750236

发现对于IDX2_DLIND_ACTI_SALSTTID_PRDID使用了INDEX FAST FULL SCAN 

现象和 Bug 28132202 – ORA-600[QKAFFSINDEX3] OCCURS IN INDEX FAST FULL SCAN 非常类似

进一步诊断查看sql语句,语句中的SYS_STSPQ0MN35WUGZJNRTQJ3QX27K是自动手机的extended statistics,如果语句把这个statistics移除,则不会出错, 

所以,手动重建extended statistics

EXEC DBMS_STATS.DROP_EXTENDED_STATS(‘LIVE_KS’,’P_PRD_SERIAL_NUM’,'(“PRD_ID”,”SALES_STATUS_ID”,”ALLOC_COUNT”)’);

EXEC DBMS_STATS.CREATE_EXTENDED_STATS(‘LIVE_KS’,’P_PRD_SERIAL_NUM’,'(“PRD_ID”,”SALES_STATUS_ID”,”ALLOC_COUNT”)’);

问题解决,未再发生。

针对12cR1,建议disable 多列自动统计信息收集,避免bug:

alter session set “_OPTIMIZER_USE_FEEDBACK”=FALSE;

alter session set “_optimizer_adaptive_plans” =false;

alter session set “_optimizer_dsdir_usage_control”=0;

如果部署了patch

Bug 21171382  Enhancement: AUTO_STAT_EXTENSIONS preference on DBMS_STATS

也可以通过

EXEC DBMS_STATS.SET_GLOBAL_PREFS(‘AUTO_STAT_EXTENSIONS’,’OFF’);

来实现。

具体建议:

https://cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-e4aebca50425/6e9e0d66-08a0-4da0-a2ca-3598bcc28d7a/File/945c53b4121121a5e3054491565e8225/optimizer_for_poc_v4.pdf

SQL Server新老版本CE区别

对比CE7和2014 CE12的区别:

1.表连接中连接列估算方式

老CE对所有参与连接列的统计信息step进行逐个估算。新CE只对于最大和最小step统计信息进行收集估算,在连接列的值分布不均匀的时候,老CE拥有更准确的估算。

2.对于where语句多条件估算

老CE认为各个条件独立计算,不存在相关性,估算结果以选择性最高的条件单独为标准,新CE估算方式是条件1选择性*sqrt(条件2选择性)*sqrt(sqrt(条件3选择性)) ,其中条件1 选择性>条件2>条件3,这种相关估计对于多条件语句,新CE有更准确的估计。

3.where条件超越统计信息边界的估算

老CE针对where条件超越统计信息边界的估算简单的使用统计信息边界step对应的值进行估算, 而新CE采用的行总数*行密度的方式估算,更加准确。

4.连接条件on中含有多条件等值连接的估算

老CE使用单条件列的选择性进行简单相乘的方式,值往往偏小,新CE采用对两个条件列去重复后较小的列乘以连接两个表的平均密度,估算值偏大。

5.多条件非等值连接估算。

老CE简单对于连接条件列的选择率相乘,估算值偏低,新CE假设连接是小表对大表一对多的关系,并以大表的输入估算值计算,估算值偏高。

6.非连接条件列估算:

老CE认为非连接条件列相互独立,选取较大估算值的条件列并认为同时符合连接两端,估计值偏高,新CE认为不同表的非连接条件不相关,先进行连接选择性估算, 之后再应用非连接条件列过滤。

7.连接后去重复操作(distinct,group by)

老CE在去重之前进行多对多的连接估算,估值偏高,而新CE首先进行去重之后再用较小结果进行连接估算。

SQL Server标量函数改写内联表值函数优化案例

问题SQL:

SELECT TOP 1001 ha.HuntApplicationID ,
ha.PartyNumber ,
mht.Name AS MasterHuntTypeName ,
htly.LicenseYear ,
lStatus.[Status] AS DrawTicketStatus ,
isnull(dbo.udf_GetHuntApplicationPartyCount(ha.HuntApplicationID), 0) AS MemberCount ,
count( won.DrawTicketLicenseID) AS DrawnMemberCount ,
won.drawticketid ,
dt.PreDrawNonResidentMemberCount AS NRMemberCount ,
dbo.udf_GetAvgPreferencePoints(dt.DrawTicketID) AS PreferencePointAverage ,
CASE
WHEN ha.Quantity > 1 THEN NULL
ELSE dt.PreDrawRandomNumber
END AS PreDrawRandomNumber ,
dsm.Name AS DrawSelectionMethodName ,
dt.DrawnSequence ,
dt.PreferencePointRank ,
dt.DrawID ,
dt.RandomRank
FROM dbo.HuntApplication ha
JOIN dbo.HuntTypeLicenseYear htly ON ha.HuntTypeLicenseYearID = htly.HuntTypeLicenseYearID
JOIN dbo.MasterHuntType mht ON htly.MasterHuntTypeID = mht.MasterHuntTypeID
LEFT JOIN dbo.HuntApplicationLicense hal ON ha.HuntApplicationID = hal.HuntApplicationID
LEFT JOIN dbo.DrawTicket dt ON ha.HuntApplicationID = dt.HuntApplicationID
LEFT JOIN dbo.DrawTicketLicense won ON dt.DrawTicketID = won.DrawTicketID
AND won.WasDrawn = 1
LEFT JOIN dbo.DrawSelectionMethod dsm ON dt.DrawSelectionMethodID = dsm.DrawSelectionMethodID
LEFT JOIN dbo.StatusCode lStatus ON dt.StatusCodeID = lStatus.StatusCodeID
JOIN dbo.DrawTicketHuntChoice dthc ON dt.DrawTicketID = dthc.DrawTicketID
CROSS APPLY dbo.tvf_GetHuntApplicationPartyCount(ha.HuntApplicationID) hapc
CROSS APPLY dbo.tvf_GetAvgPreferencePoints(dt.DrawTicketID) app
WHERE 1 = 1
AND htly.MasterHuntTypeID = @iMasterHuntTypeID
AND htly.LicenseYear = @iLicenseYear
AND dt.StatusCodeID = @iDrawTicketStatusCodeID
AND dthc.WasDrawn = @iHuntChoiceWasDrawn
GROUP BY ha.HuntApplicationID,
ha.PartyNumber,
mht.[Name],
htly.LicenseYear,
lStatus.[Status],
isnull(dbo.udf_GetHuntApplicationPartyCount(ha.HuntApplicationID), 0),
won.DrawTicketID,
dt.PreDrawNonResidentMemberCount,
dbo.udf_GetAvgPreferencePoints(dt.DrawTicketID),
CASE
WHEN ha.Quantity > 1 THEN NULL
ELSE dt.PreDrawRandomNumber
END,
dsm.[Name],
dt.DrawnSequence,
dt.PreferencePointRank,
dt.DrawID,
dt.RandomRank
ORDER BY htly.LicenseYear DESC,
mht.Name,
lStatus.[Status],
dt.DrawID,
PreferencePointAverage DESC,
PreDrawRandomNumber,
ha.PartyNumber

静态函数:

CREATE FUNCTION [dbo].[udf_GetAvgPreferencePoints]
(@DrawTicketID INT)
RETURNS NUMERIC (18, 3)
AS
BEGIN
RETURN
(
SELECT TOP 1
CONVERT(DECIMAL, dt.PreDrawPreferencePointTotal) / NULLIF(CONVERT(DECIMAL, dt.PreDrawMemberCount),0)
FROM dbo.DrawTicket dt
WHERE dt.DrawTicketID = @DrawTicketID
)
END

执行时间40s

这是典型可以进行静态函数改写内联表值函数的sql:

改写后:

SELECT TOP 1001 ha.HuntApplicationID ,
ha.PartyNumber ,
mht.Name AS MasterHuntTypeName ,
htly.LicenseYear ,
lStatus.[Status] AS DrawTicketStatus ,
--isnull(dbo.udf_GetHuntApplicationPartyCount(ha.HuntApplicationID), 0) AS MemberCount ,
isnull(hapc.MemberCount, 0) AS MemberCount,
count( won.DrawTicketLicenseID) AS DrawnMemberCount ,
won.drawticketid ,
dt.PreDrawNonResidentMemberCount AS NRMemberCount ,
--dbo.udf_GetAvgPreferencePoints(dt.DrawTicketID) AS PreferencePointAverage ,
app.PreferencePointAverage PreferencePointAverage,
CASE
WHEN ha.Quantity > 1 THEN NULL
ELSE dt.PreDrawRandomNumber
END AS PreDrawRandomNumber ,
dsm.Name AS DrawSelectionMethodName ,
dt.DrawnSequence ,
dt.PreferencePointRank ,
dt.DrawID ,
dt.RandomRank
FROM dbo.HuntApplication ha
JOIN dbo.HuntTypeLicenseYear htly ON ha.HuntTypeLicenseYearID = htly.HuntTypeLicenseYearID
JOIN dbo.MasterHuntType mht ON htly.MasterHuntTypeID = mht.MasterHuntTypeID
LEFT JOIN dbo.HuntApplicationLicense hal ON ha.HuntApplicationID = hal.HuntApplicationID
LEFT JOIN dbo.DrawTicket dt ON ha.HuntApplicationID = dt.HuntApplicationID
LEFT JOIN dbo.DrawTicketLicense won ON dt.DrawTicketID = won.DrawTicketID
AND won.WasDrawn = 1
LEFT JOIN dbo.DrawSelectionMethod dsm ON dt.DrawSelectionMethodID = dsm.DrawSelectionMethodID
LEFT JOIN dbo.StatusCode lStatus ON dt.StatusCodeID = lStatus.StatusCodeID
JOIN dbo.DrawTicketHuntChoice dthc ON dt.DrawTicketID = dthc.DrawTicketID
CROSS APPLY dbo.tvf_GetHuntApplicationPartyCount(ha.HuntApplicationID) hapc
CROSS APPLY dbo.tvf_GetAvgPreferencePoints(dt.DrawTicketID) app
WHERE 1 = 1
AND htly.MasterHuntTypeID = @iMasterHuntTypeID
AND htly.LicenseYear = @iLicenseYear
AND dt.StatusCodeID = @iDrawTicketStatusCodeID
AND dthc.WasDrawn = @iHuntChoiceWasDrawn
GROUP BY ha.HuntApplicationID,
ha.PartyNumber,
mht.[Name],
htly.LicenseYear,
lStatus.[Status],
--isnull(dbo.udf_GetHuntApplicationPartyCount(ha.HuntApplicationID), 0),
isnull(hapc.MemberCount, 0),
won.DrawTicketID,
dt.PreDrawNonResidentMemberCount,
--dbo.udf_GetAvgPreferencePoints(dt.DrawTicketID),
app.PreferencePointAverage,
CASE
WHEN ha.Quantity > 1 THEN NULL
ELSE dt.PreDrawRandomNumber
END,
dsm.[Name],
dt.DrawnSequence,
dt.PreferencePointRank,
dt.DrawID,
dt.RandomRank
ORDER BY htly.LicenseYear DESC,
mht.Name,
lStatus.[Status],
dt.DrawID,
PreferencePointAverage DESC,
PreDrawRandomNumber,
ha.PartyNumber

对应的表值函数:

​CREATE FUNCTION [dbo].[tvf_GetAvgPreferencePoints]
(@DrawTicketID INT)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
(
SELECT TOP 1
CONVERT(DECIMAL, dt.PreDrawPreferencePointTotal) / NULLIF(CONVERT(DECIMAL, dt.PreDrawMemberCount),0) as PreferencePointAverage
FROM dbo.DrawTicket dt
WHERE dt.DrawTicketID = @DrawTicketID
)
GO

改写后执行时间从40s降低到16s,对于倾斜列的优化速度更为明显

SQL Server 2016升级迁移过程中性能问题诊断案例

日常运行的批量更新作业,平日是5分钟之内结束,今天出现超过30分钟没结束的情况,实际运行3个小时以上,应用程序超时报错。

数据库版本:SQL Server 2016企业版

问题SQL:

declare @batch integer,
        @min    integer,
        @max   integer,
        @count integer
 
select  @min = 1,
    @count = count(*),
    @batch = 5000,
    @max = 5000
FROM dbo.MarketingRecipientEvents

Update MarketingRecipient
SET DateTrackedURLClicked = mre.EventDate
FROM MarketingRecipient mr
INNER JOIN dbo.MarketingRecipientEvents mre on mr.CustomerID = mre.ExternalRecipientID
WHERE mr.MarketingScheduleID = 364
AND mre.EventType in ('CLICKED')
AND mr.DateTrackedURLClicked IS NULL
AND mre.MarketingRecipientEventsID between @min and @max

问题表上MarketingRecipient上有两个相同覆盖列索引,正常情况下使用出问题的时候使用

ix_MarketingRecipient_MarketingScheduleID_CustomerID 进行Nested Loop 连接,出问题的时候使用

IX_MarketingRecipient_MarketingScheduleID_DateMessageBounced进行Nested Loop连接

CREATE NONCLUSTERED INDEX [ix_MarketingRecipient_MarketingScheduleID_CustomerID] ON [dbo].[MarketingRecipient]
(
[MarketingScheduleID] ASC,
[CustomerID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [DATA]
GO

CREATE NONCLUSTERED INDEX [IX_MarketingRecipient_MarketingScheduleID_DateMessageBounced] ON [dbo].[MarketingRecipient]
(
[MarketingScheduleID] ASC
)
INCLUDE (  [CustomerID],
[DateMessageBounced],
[DateMessageSent],
[DateEmailOpened],
[DateEmailBlocked],
[MarketingRecipientID],
[DateTrackedURLClicked])

问题执行计划:

查看执行计划对应的统计信息:

SELECT  sp.stats_id
,object_name(s.object_id) object_name
,object_schema_name(s.object_id) schema_name
,name
,filter_definition
,last_updated
,rows
,rows_sampled
,rows_sampled*100/rows as [percent]
,steps
,unfiltered_rows
,modification_counter
,sp.persisted_sample_percent
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE 1=1
and modification_counter < rows/5 + 500 and modification_counter > sqrt(rows*1000)
and object_schema_name(s.object_id) = 'dbo' and rows > 500
and object_name(s.object_id)= 'MarketingRecipient';

发现modification_counter列的值不低:

目前数据库刚迁移到2016版本,但是compatibility_level依旧是120并且没有启用trace 2371,所以,自动更新统计信息的逻辑是:

更新行>500+表行数*20% = 500+ 36891356*20%=7378771,  目前行数5718611不满足条件,所以统计信息不完善导致执行计划异常,优化器并没有捕捉到

如果使用新的自动更新统计信息逻辑:

更新行>√表行数*1000=√36891356*1000=192071, 5718611满足条件,验证一下:不改代码的情况下给SQL加plan guide:

OPTION (QUERYTRACEON 2371)

发现统计信息在编译期间自动更新,并且生成正确的执行计划。

本例是数据库从2014企业版到2016企业版升级过程中间状态造成的性能问题,生产环境如果要启用compability_level=130,

在2016环境下建议以下步骤

  1. 先启用querystore建立基线,收集足够的统计数据。
  2. 更改compability_level=130
  3. 对比基线数据,捕获变更的计划进行针对性的query plan force操作,固定到之前的基线。


在没有升级到compability_level=130的情况下,建议建立DBA更新统计信息的job,每晚根据新的算法更新过期的统计信息。

也可以利用https://github.com/olahallengren/sql-server-maintenance-solution

中的IndexOptimize.sql 进行统计信息维护,利用新的参数:@StatisticsModificationLevel来设置更改行数占表总行数百分比。

SQL Server阻塞blocking案例分析

今天在性能测试过程中发现大量阻塞报警,检查whoisactive(https://github.com/amachanic/sp_whoisactive/)数据发现,阻塞blocking头部session当前执行的语句如下:

<?query —

(@p0 int,@p1 datetime,@p2 bigint,@p3 bigint,@p4 bigint)INSERT INTO [LicenseAction]([LicenseActionTypeID], [ActionDate], [LicenseID], [DocumentID], [TransactionDetailID])

VALUES (@p0, @p1, @p2, @p3, @p4)

SELECT CONVERT(BigInt,SCOPE_IDENTITY()) AS [value]

–?>

被block的session有几十个,但是当前执行的语句都一样:

<?query —

UPDATE dbo.HT

        SET IQuantity = IQuantity + 1,

            IRIQuantity = IQuantity + 1,

            QuotaFilledDate = CASE WHEN FilledDate IS NULL

                                        AND IssuedQuantity + 1 >= QuotaQuantity THEN dbo.udf_GetCurrentDateTime()

                                   ELSE QuotaDate

                              END

        FROM dbo.Hunt

        WHERE HID = @HID

–?>

锁等待信息如下:

<Lock resource_type=”KEY” index_name=”PK_Hunt” request_mode=”U” request_status=”WAIT” request_count=”1″ />

可以看出是主键争用,初步分析是不同事务的同时更新相同的主键行造成的,开trace验证

当前执行的虽然是insert语句,但是从whoisactive对于session持有的锁分析得出,事务肯定是有其他对于Pk_hunt的key的争用语句,单个事务的语句信息果然包含对于pk_hunt表的更新存储过程




可以看到确实是不同的transactionid同时更新相同的行造成的阻塞链

分析结果提交给测试人员,检查配置并联系开发人员进行修正。

SQL Server Availability Group Failover 测试

兼容性测试:

测试脚本:

环境:windows failover cluster 

主库执行脚本:

USE [master]
GO
ALTER AVAILABILITY GROUP [test_AG]
MODIFY REPLICA ON N’host1′ WITH (FAILOVER_MODE = AUTOMATIC)
GO

报错:
Msg 35215, Level 16, State 17, Line 3
The Alter operation is not allowed on availability replica ‘UELT1WASFSD01VS’, because automatic failover mode is an invalid configuration on a SQL Server Failover Cluster Instance. Retry the operation by specifying manual failover mode.

结果证明AG 自动failover的属性和windows failover cluster不兼容,如果要启用AG自动failover必须使用单实例模式的sql server.


2.AG自动failover功能测试

测试1:
在sscm中手动停止主库sql service,观察是否自动failover到备库。

测试2:

直接关闭主库所在的主机,观察是否failover到备库

测试1和测试2均实现自动failover。

在failover期间,会有应用程序连接拒绝出现:

11/01/19 06:04:10.368 [0x00003688] [spid 769] SQLState: 08S01, Native Error: 10054, Severity: 0, State: 10, Line: 0
[Microsoft][SQL Server Native Client 11.0]TCP Provider: An existing connection was forcibly closed by the remote host.


3.AG Failover性能测试
table: 5218283 rows clustered.
query: select count(*) from test
command:.\ostress -Suedv1wasfsdba01 -dtest -E -Q”select count(*) from dbo.test” -ooutput -mstress -n500 -r100 -T10 -T88 -T146

Primary Session numberSecondary session numberstFailover time(s)ndFailover time(s)rdFailover time(s)
3000555
0300252523
5000555
0500444631
10000555
01000898891


session 1000的时候,客户端报连接超时错误:
解决方案:
exec sp_configure ‘max worker threads’,4096
go
reconfigure

总结:

1.Avalibility group 自动failover 和 FCI failover不兼容,只能应用于 standalone instance.

2.Avalibility group automatic failover 执行sql service 宕机 failover and 主机宕机 failover.

3. failover 时间取决于备库上只读会话并发数目,可以降低主要备库上的只读负载,而在非主要备库上执行只读负载的方式来降低failover的时间。

SQL Server database mail问题诊断一例

产品环境sql server database的mail发不出邮件,影响客户的业务,在数据库中进行诊断

诊断sql:

EXEC msdb.dbo.sp_send_dbmail

@profile_name = ‘DB Messaging’,

@recipients = ‘jamesz@gmail.com’,

@body = ‘James_test_2019’,

@body_format = ‘HTML’,

@subject = ‘James_test_2019’

执行过后并没有收到邮件,查询event log表,也没有任何记录:

SELECT * FROM msdb.dbo.sysmail_event_log

ORDER BY log_id DESC

检查邮件的状态,都是unsent:

SELECT top 100 * FROM msdb.dbo.sysmail_allitems 

ORDER BY send_request_date DESC

根据以下链接的建议进行诊断:

所有诊断项均正常。

同时检查sql server errorlog, windows event log和windows cluster log,发现周末出现的storage异常和mail功能失效关联端倪

The brief analysis for mail can’t sent problem:

sql errorlog显示sql instance的恢复启动时间是 2019-10-25 11:04:50.050 CDT

但是在 windows cluster log,  clussvc.exe 在时间段 2019/10/25-07:54:02.903 to 2019/10/25-16:04:18.499并没有记录

恰好在这段时间,storage 存储异常

之后sql instance在cluster service有问题的情况下启动工作

并且获取了本地local的信息而不是cluster service的信息.

windows event log显示,database mail通过连接. 到db而出现错误证明上述猜测

Server Name: ., Database Name: msdb

解决方案就是重启备用节点后,做一个干净状态的failover,让sql instance在正常的cluster service基础之上。

 SQL Server Errorlog:

2019-10-25 11:04:50.050  Server  Microsoft SQL Server 2014 (SP3-CU2) (KB4482960) – 12.0.6214.1 (X64)    Feb  2 2019 01:10:18    Copyright (c) Microsoft Corporation   Enterprise Edition: Core-based Licensing (64-bit) on Windows NT 6.3 <X64> (Build 9600: )  

2019-10-25 11:04:50.050  Server  UTC adjustment: -5:00

2019-10-25 11:04:50.050  Server  (c) Microsoft Corporation.

2019-10-25 11:04:50.050  Server  All rights reserved.

2019-10-25 11:04:50.050  Server  Server process ID is 5700.

2019-10-25 11:04:50.050  Server  System Manufacturer: ‘Cisco Systems Inc’, System Model: ‘UCSB-B200-M3’.

2019-10-25 11:04:50.060  Server  Authentication mode is MIXED.

2019-10-25 11:04:50.060  Server  Logging SQL Server messages in file ‘D:\MSSQL12.MSSQLSERVER\MSSQL\Log\ERRORLOG’.

2019-10-25 11:04:50.060  Server  The service account is ‘TAN\svc.DBAHFDN01VS.sql’. This is an informational message; no user action is required.

2019-10-25 11:04:50.060  Server  Registry startup parameters:     -d D:\MSSQL12.MSSQLSERVER\MSSQL\DATA\master.mdf    -e D:\MSSQL12.MSSQLSERVER\MSSQL\Log\ERRORLOG    -l D:\MSSQL12.MSSQLSERVER\MSSQL\DATA\mastlog.ldf    -T 3226

2019-10-25 11:04:50.060  Server  Command Line Startup Parameters:    -s “MSSQLSERVER”

2019-10-25 11:04:50.450  Server  SQL Server detected 2 sockets with 10 cores per socket and 20 logical processors per socket, 40 total logical processors; using 40 logical processors based on SQL Server licensing. This is an informational message; no user action is required.

2019-10-25 11:04:50.450  Server  SQL Server is starting at normal priority base (=7). This is an informational message only. No user action is required.

2019-10-25 11:04:52.180  spid9s  The NETBIOS name of the local node that is running the server is ‘DBAHFDN01N1’. This is an informational message only. No user action is required.

 cluster log:

00002750.0000251c::2019/10/25-07:54:02.903 DBG   [RCM] rcm::RcmResource::Control: (QUORUM_NEW, SWITCH_MONITORS)

00002750.00000ed8::2019/10/25-07:54:02.903 DBG   [RCM] QUORUM_NEW: Ignoring reset because resource flag 2 is not set.

00002750.00000ed8::2019/10/25-07:54:02.903 DBG   [RCM] QUORUM_NEW: Ignoring reset because resource flag 32 is not set.

00002750.0000251c::2019/10/25-07:54:02.903 DBG   [RCM] Switch monitor call for QUORUM_NEW when there were pending controls, enqueuing the switch

00002750.0000251c::2019/10/25-07:54:02.903 DBG   [RCM] Pending controls for QUORUM_NEW are empty now but we’re not in a state that can switch monitors; removing SWITCH_MONITORS control.

00002750.00000ed8::2019/10/25-07:54:02.903 DBG   [RCM] rcm::RcmGroup::ComputeFailoverThreshold=> (Cluster Group, 0, computed)

00002750.00000ed8::2019/10/25-07:54:02.903 DBG   [RCM] rcm::RcmResource::StmPreOnlineCheck(QUORUM_NEW)

00000b4c.00000b50::2019/10/25-16:04:18.499 INFO  —————————–+ LOG BEGIN +—————————–

00000b4c.00000b50::2019/10/25-16:04:18.499 INFO  [CS] Starting clussvc as a service

00000b4c.00000b50::2019/10/25-16:04:18.499 INFO  [CS] cluster service logging level is 5

00000b4c.000009cc::2019/10/25-16:04:18.546 INFO  [CS] Creating cluster node <vector len=’1′>

00000b4c.000009cc::2019/10/25-16:04:18.546 INFO      <item>ClusSvc</item>

00000b4c.000009cc::2019/10/25-16:04:18.546 INFO  </vector>

00000b4c.000009bc::2019/10/25-16:04:18.546 INFO  [StartupConfig]: Initializing.

Windwos event log:

10/30/2019 12:05:03 PM

1) Exception Information

===================

Exception Type: Microsoft.SqlServer.Management.SqlIMail.Server.Common.BaseException

Message: There was an error on the connection. Reason: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server), connection parameters: Server Name: ., Database Name: msdb

Data: System.Collections.ListDictionaryInternal

TargetSite: Void OpenConnection(Microsoft.SqlServer.Management.Common.SqlConnectionInfo)

HelpLink: NULL

Source: DatabaseMailEngine

StackTrace Information

===================

   at Microsoft.SqlServer.Management.SqlIMail.Server.DataAccess.ConnectionManager.OpenConnection(SqlConnectionInfo connectionInfo)

   at Microsoft.SqlServer.Management.SqlIMail.Server.DataAccess.DataAccessAdapter.OpenConnection(String dbServerName, String dbName, String userName, String password, String appName, Int32 connectionTimeout)

   at Microsoft.SqlServer.Management.SqlIMail.Server.DataAccess.DataAccessAdapter.OpenConnection(String dbServerName, String dbName, String userName, String password, Int32 connectionTimeout)

   at Microsoft.SqlServer.Management.SqlIMail.IMailProcess.QueueItemProcesser.ProcessQueueItems(String dbName, String dbServerName, Int32 lifetimeMinimumSec, LogLevel loggingLevel, Byte[] encryptionKey, Int32 connectionTimeout)

2) Exception Information

===================

Exception Type: System.Data.SqlClient.SqlException

Errors: System.Data.SqlClient.SqlErrorCollection

Class: 20

LineNumber: 0

Number: 2

Procedure: NULL

Server: 

State: 0

Source: .Net SqlClient Data Provider

ErrorCode: -2146232060

Message: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server)

Data: System.Collections.ListDictionaryInternal

TargetSite: System.Data.ProviderBase.DbConnectionInternal GetConnection(System.Data.Common.DbConnection)

HelpLink: NULL

StackTrace Information

===================

   at System.Data.ProviderBase.DbConnectionPool.GetConnection(DbConnection owningObject)

   at System.Data.ProviderBase.DbConnectionFactory.GetConnection(DbConnection owningConnection)

   at System.Data.ProviderBase.DbConnectionClosed.OpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory)

   at System.Data.SqlClient.SqlConnection.Open()

   at Microsoft.SqlServer.Management.SqlIMail.Server.DataAccess.ConnectionManager.OpenConnection(SqlConnectionInfo connectionInfo)

SQL Server如何配置cdc进行ETL

企业核心业务系统oltp的数据需要通过ETL同步到数据仓库,原始的ETL流程通过定制化从SQL Server中进行数据抽取,经过生产环境的监控,发现ETL过程的query会对生产系统造成额外负载。于是制定了通过cdc进行增量数据同步的方案:

方案选型中比对了SQL server的trigger,CT,CDC, temporary table各种方式,选型对比如下:

TriggerCTCDCTemporal table
Sync waySynchronousSynchronousAsynchronousSynchronous
Internal workHeavy than indexSame as indexCall sp_replcmds   to collect from tlog no direct workload. 
Table part in transactionYesYesNoYes
Historical Data retentionManual controlNoYesYes 

通过表格对比可以看到,cdc通过异步非侵入式进行增量数据捕获,使用了sp_replcmds,这个过程和sql server的transactional replication中的log reader agent是相同的,缺点就是会对占用原始database的datafile和logfile,造成数据增长。

在cdc的使用过程中,比较重要的是将cdc数据置于单独的filegroup内,这样,在数据管理恢复和性能上会减少对于原始oltp的影响。具体过程如下:

  1. 在db级别启用cdc时需要先将db的默认filegroup改成cdc,这样cdc对应的元数据信息,例如经常变化的表cdc.lsn_time_mapping可以存放到单独的filegroup中
  2. 对于表的cdc数据使用@fileGroup_Name参数指定filegroup

–enable cdc filegroup

USE DB1

ALTER DATABASE DB1 ADD FILEGROUP CDC

GO

ALTER DATABASE DB1 ADD FILE

(

NAME=’CDC’,

FILENAME=’D:\DATA\DB_CDC1.ndf’,

SIZE = 1024MB,

MAXSIZE = unlimited,

FILEGROWTH=256MB

)TO FILEGROUP CDC

GO

USE DB1

GO 

ALTER DATABASE DB1 MODIFY FILEGROUP [CDC] DEFAULT

GO  

EXEC sys.sp_cdc_enable_db  

GO  

ALTER DATABASE DB1 MODIFY FILEGROUP [DATA] DEFAULT

go

EXEC sys.sp_cdc_enable_table @source_schema = N’dbo’,@source_name = ‘T1′,@role_name = N’cdc_Admin’,@fileGroup_Name = N’CDC’ 

CDC建立后有capture job和clean job,当cdc的数据增量非常大的时候,需要适当调整job的参数:

EXEC sys.sp_cdc_change_job  

 @job_type = ‘capture’ 

 ,@maxtrans = 5000      –每个扫描循环可以处理的最多事务数 

 ,@maxscans = 100       –为了从日志中提取所有行而要执行的最大扫描循环次数 

 ,@continuous = 1       –连续运行最多处理(max_trans * max_scans)个事务 

 ,@pollinginterval = 1

记一次分区表update调优过程

原始SQL:

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263CREATE OR REPLACE PROCEDURE sp_upd_suppressed_emails(A_LIMIT_BULK IN PLS_INTEGER DEFAULT 2000,A_COMMIT_AFTER IN PLS_INTEGER DEFAULT 2000isv_operation Varchar2(20) := 'START';v_last         PLS_INTEGER;   v_start        PLS_INTEGER;   v_end          PLS_INTEGER;  l_row          PLS_INTEGER; CURSOR curs_contact IS SELECT id FROM contactWHERE active_ind = 2 and suppress_flag = 'Y' and bounce_flag = 'Y' and consumer_id in (select id from consumers where login_name IN ('ASPIRAFOCUS''CDFW''PSACSFSUSR') ) and  email is not null ORDER BY id;  TYPE contactId IS RECORD ( id contact.id%TYPE);  TYPE v_contactId_tbl IS TABLE OF contactId INDEX BY PLS_INTEGER; v_contact v_contactId_tbl;     BEGIN DBMS_OUTPUT.PUT_LINE ('start reading data..' );v_operation := 'BEFORE OPEN';OPEN curs_contact; LOOP v_operation := 'BEFORE FETCH';FETCH curs_contact BULK COLLECT INTO v_contact LIMIT A_LIMIT_BULK;v_operation := 'AFTER FETCH'; EXIT WHEN v_contact.COUNT = 0;  v_start := 1;   v_last := v_contact.COUNT;  l_row := 0;       LOOP           DBMS_OUTPUT.PUT_LINE ('LOOP 2 top ');EXIT WHEN v_start > v_last;v_end := LEAST (v_start + A_COMMIT_AFTER - 1, v_last);     DBMS_OUTPUT.PUT_LINE ('LOOP 2 v_start ' || v_start || ' end ' || v_end ); BEGIN              v_operation := 'UPDATE_LOAD';FORALL i IN v_start .. v_end UPDATE contact SET active_ind = 1, suppress_flag = 'N', bounce_flag = 'N'  WHERE id = v_contact (i).id;               END;DBMS_OUTPUT.PUT_LINE ('after forall start ' || v_start || ' end ' || v_end );COMMIT;v_start := v_end + 1;         END LOOP;DBMS_OUTPUT.PUT_LINE ('LOOP 2 bottom v_start ' || v_start || ' end ' || v_end );COMMIT;END LOOP;DBMS_OUTPUT.PUT_LINE ('stop reading data..' ); EXCEPTION WHEN OTHERS THEN DBMS_OUTPUT.PUT_LINE ('Exception:Others-> v_operation = ' || v_operation || ', rowcount # ' || l_row || ' SQLCODE ' || SQLCODE || ' ERR_MSG = ' || SUBSTR(SQLERRM, 1, 500) );  END sp_upd_suppressed_emails;

原始SQL在执行过程中经过几小时未完成,查看sql统计信息:

Stat NameStatement TotalPer Execution% Snap Total
Elapsed Time (ms)2,997,435187,339.6853.56
CPU Time (ms)2,896,748181,046.7265.48
Executions16
Buffer Gets521,592,79132,599,549.4469.21
Disk Reads10,974685.881.12
Parse Calls10.060.00
Rows32,0002,000.00
User I/O Wait Time (ms)2,525
Cluster Wait Time (ms)1,146
Application Wait Time (ms)4
Concurrency Wait Time (ms)1
Invalidations0
Version Count1
Sharable Mem(KB)51

Back to Plan 1(PHV: 3689516641)

Execution Plan

IdOperationNameRowsBytesCost (%CPU)TimePstartPstop
0UPDATE STATEMENT16798 (100)
1   UPDATECONTACT
2     PARTITION RANGE ALL15116798 (1)00:00:0111048575
3       PARTITION HASH SINGLE15116798 (1)00:00:01KEYKEY
4         TABLE ACCESS FULLCONTACT15116798 (1)00:00:01

表数据量非常大,根据created_date时间字段分区,ID字段进行子hash分区,上面建立了基于ID字段的global hash分区索引。

从上面情况看出,原始SQL分批2000个ID进行批量update,但执行计划走的是TABLE ACCESS FULL, 并且进行了分区的PRATITION RANGE ALL操作,可以看出2000条记录的批量更新由于不能利用global hash分区索引进行检索,主要原因是由于hash分区适用于等值操作,对于范围或者批量操作,可能遍历很多hash分区,导致结果不如TABLE ACCESS FULL。

针对这种情况,根据业务逻辑,将批量条件从ID改成其他非global hash索引列,执行时间降低为5分钟。

1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556CREATE OR REPLACE PROCEDURE sp_upd_suppressed_emails(A_LIMIT_BULK IN PLS_INTEGER DEFAULT 2000,A_COMMIT_AFTER IN PLS_INTEGER DEFAULT 2000isv_operation Varchar2(20) := 'START';v_last         PLS_INTEGER;v_start        PLS_INTEGER;v_end          PLS_INTEGER;l_row          PLS_INTEGER;CURSOR curs_contact IS SELECT distinct email FROM contactWHERE active_ind = 2 and suppress_flag = 'Y' and bounce_flag = 'Y'and consumer_id in (select id from consumers where login_name IN ('ASPIRAFOCUS''CDFW''PSACSFSUSR') )and  email is not null ORDER BY email;TYPE contactId IS RECORD ( email contact.email%TYPE);TYPE v_contactId_tbl IS TABLE OF contactId INDEX BY PLS_INTEGER;v_contact v_contactId_tbl;v_upd_date date ;BEGINDBMS_OUTPUT.PUT_LINE ('start reading data..' );v_operation := 'BEFORE OPEN';select sysdate into v_upd_date from dual;OPEN curs_contact;LOOPv_operation := 'BEFORE FETCH';FETCH curs_contact BULK COLLECT INTO v_contact LIMIT A_LIMIT_BULK;v_operation := 'AFTER FETCH';EXIT WHEN v_contact.COUNT = 0;v_start := 1;v_last := v_contact.COUNT;l_row := 0;LOOPDBMS_OUTPUT.PUT_LINE ('LOOP 2 top ');EXIT WHEN v_start > v_last;v_end := LEAST (v_start + A_COMMIT_AFTER - 1, v_last);DBMS_OUTPUT.PUT_LINE ('LOOP 2 v_start ' || v_start || ' end ' || v_end );BEGINv_operation := 'UPDATE_LOAD';FORALL i IN v_start .. v_endUPDATE contact SET active_ind = 1, suppress_flag = 'N', bounce_flag = 'N' , modified_date =  v_upd_dateWHERE email = v_contact (i).emailand   active_ind = 2 and suppress_flag = 'Y' and bounce_flag = 'Y'and consumer_id in (select id from consumers where login_name IN ('ASPIRAFOCUS''CDFW''PSACSFSUSR') )  ;END;DBMS_OUTPUT.PUT_LINE ('after forall start ' || v_start || ' end ' || v_end );COMMIT;v_start := v_end + 1;END LOOP;DBMS_OUTPUT.PUT_LINE ('LOOP 2 bottom v_start ' || v_start || ' end ' || v_end );COMMIT;END LOOP;DBMS_OUTPUT.PUT_LINE ('stop reading data..' );EXCEPTIONWHEN OTHERS THENDBMS_OUTPUT.PUT_LINE ('Exception:Others-> v_operation = ' || v_operation || ', rowcount # ' || l_row || ' SQLCODE ' || SQLCODE || ' ERR_MSG = ' || SUBSTR(SQLERRM, 1, 500) );END sp_upd_suppressed_emails;