Skip to content

test: flaky sql-tap/index4.test.lua test hangs #24

Closed
@avtikhon

Description

@avtikhon

Tarantool version:

| Tarantool 2.7.0-74-g6ad720d6d
| Target: Darwin-x86_64-RelWithDebInfo
| Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
| Compiler: /Library/Developer/CommandLineTools/usr/bin/cc /Library/Developer/CommandLineTools/usr/bin/c++
| C_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-gnu-alignof-expression -Werror
| CXX_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Werror

OS version: OSX

Bug description: https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993

artifacts.zip

Results file checksum: d41d8cd98f00b204e9800998ecf8427e

[067] sql-tap/index4.test.lua                         vinyl           Test timeout of 110 secs reached	[ fail ]
[067] Test failed! Output from reject file /tmp/tnt/rejects/sql-tap/index4.reject:
[067] 
[067] Last 15 lines of Tarantool Log file [Instance "app_server"][/tmp/tnt/067_sql-tap/index4.test.lua:vinyl.tarantool.log]:
[067] Worker "067_sql-tap" got failed test; restarted the server

How to reproduce:

  1. use mac host with OSX
  2. create 2 similar repositories tnt and tarantool
  3. enter tnt repository, and patch test-run sub-directory
commit 5f55b46dc56f95ca98e0277ee8cfb75210a8bbfd (origin/avtikhon/tmp_all_flaky_fixes_min_prs)
Author: Alexander V. Tikhonov <[email protected]>
Date:   Thu Dec 3 09:53:02 2020 +0000

    Enable fragile list tests run in parallel

diff --git a/lib/worker.py b/lib/worker.py
index a8643fe..698e9d2 100644
--- a/lib/worker.py
+++ b/lib/worker.py
@@ -89,7 +89,7 @@ def get_task_groups():
             res[key + '_fragile'] = {
                 'gen_worker': gen_worker,
                 'task_ids': fragile_task_ids,
-                'is_parallel': False,
+                'is_parallel': suite.is_parallel(),
                 'show_reproduce_content': suite.show_reproduce_content(),
             }
     return res

  1. and run hard load processes in 100 loop (enough in time to get the issue):
( export PATH=$PATH:~/tnt/src ; export REPLICATION_SYNC_TIMEOUT=500 ; export TEST_TIMEOUT=510 ; export NO_OUTPUT_TIMEOUT=520 ; l=0 ; while ./test-run.py -j400 `for r in {1..100} ; do echo sql-tap/index4.test.lua ; done` `for s in {1..30} ; do echo replication/election_qsync_stress.test.lua ; done` `for r in {1..4} ; do echo vinyl/ ; done` 2>&1 ; do l=$(($l+1)) ; echo ======== $l ============= ; if [[ $l -gt 100 ]]; then break ; fi ; done ) > a.log &

# for Linux host dev1 use debian-stretch image or similar with command
rm -rf atop_all.log ; ( atop -a -R -w atop_all.log 10000 & echo $! >atop.pid & ) ; ( export PATH=$PATH:/tnt/src ; export REPLICATION_SYNC_TIMEOUT=100 ; export TEST_TIMEOUT=110 ; export NO_OUTPUT_TIMEOUT=120 ; date ; time ./test-run.py -j 200 --builddir /tnt --vardir var_hdd_vinyl `for r in {1..200} ; do echo vinyl/gh-5141-invalid-vylog-file ; done` --force 2>&1 ; sleep 1 ; kill -USR2 `cat atop.pid` ; date ) > test_atop.log &
  1. wait a few minutes to give the hard processes heavy load the host, check with:
iostat -c 1000
            disk0       cpu    load average
 KB/t  tps  MB/s  us sy id   1m   5m   15m
 18.11   26  0.46  36 43 20  421.03 199.69 137.88
134.08  602 78.77  60 40  0  421.03 199.69 137.88
142.56  528 73.48  58 42  0  421.03 199.69 137.88
  1. enter tarantool repository, disable 'fragile' option in file 'sql-tap/suite.ini'
  2. check the test fails, run (test needs near to 350 secs):
(export PATH=$PATH:~/tarantool/src ; export REPLICATION_SYNC_TIMEOUT=100 ; export TEST_TIMEOUT=110 ; export NO_OUTPUT_TIMEOUT=120 ; time ./test-run.py sql-tap/index4.test.lua --conf vinyl ) > a.log 2>&1 &
  1. check the test passes, run (test needs near to 350 secs):
(export PATH=$PATH:~/tarantool/src ; export REPLICATION_SYNC_TIMEOUT=500 ; export TEST_TIMEOUT=510 ; export NO_OUTPUT_TIMEOUT=520 ; time ./test-run.py sql-tap/index4.test.lua --conf vinyl ) > a.log 2>&1 &

Checked test on timings with different configurations with hard load in parallel:

# of runs * conf Real User Sys
1*memtx/vinyl 5.16 real 9.35 user 0.92 sys
10*memtx/vinyl 17.52 real 52.15 user 4.44 sys
50*memtx 26.42 real 79.33 user 6.57 sys
50*vinyl 215.60 real 201.73 user 17.52 sys
50*memtx/vinyl 366.82 real 278.42 user 24.38 sys
100*memtx 242.83 real 170.48 user 14.80 sys
100*vinyl 690.04 real 408.50 user 35.93 sys
100*memtx/vinyl 730.13 real 557.03 user 48.60 sys

Also check more inserts and found that test always passes:

diff --git a/test/sql-tap/index4.test.lua b/test/sql-tap/index4.test.lua
index 4ea713a58..c012646d7 100755
--- a/test/sql-tap/index4.test.lua
+++ b/test/sql-tap/index4.test.lua
@@ -41,6 +41,9 @@ test:do_execsql_test(
           INSERT INTO t1 SELECT randomblob(102) FROM t1;     -- 16384
           INSERT INTO t1 SELECT randomblob(102) FROM t1;     -- 32768
           INSERT INTO t1 SELECT randomblob(102) FROM t1;     -- 65536
+          INSERT INTO t1 SELECT randomblob(102) FROM t1;     -- 131072
+          INSERT INTO t1 SELECT randomblob(102) FROM t1;     -- 262144
+          INSERT INTO t1 SELECT randomblob(102) FROM t1;     -- 524288
         COMMIT;
     ]])

Checked w/o hard load, but with multi runs:

100*vinyl      931.07 real      3237.52 user       204.97 sys

Checked with hard load, but with single run:

1*vinyl          8m55.057s real     0m30.794s user     0m2.775s sys

Checked with hard load and with multi runs

100*vinyl    39m0.270s real    54m50.115s user      3m49.988s sys

Seems that the issue can be closed, due to no real exists except timeout must be tuned.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions