Skip to content

test: flaky box/gh-5422-broken_snapshot.test.lua on lack of memory #93

Open
@avtikhon

Description

@avtikhon

Tarantool 2.8.0-114-g9ccd4eab6
Target: Linux-x86_64-Debug
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
Compiler: /usr/bin/cc /usr/bin/c++
C_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-gnu-alignof-expression -fno-gnu89-inline -Wno-cast-function-type -Werror
CXX_FLAGS: -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -fopenmp -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-format-truncation -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror

OS: Linux

Issues:

  1. results file checksum: b74a0965d28b5fbfa3f738abf2d1f3dd
[007] box/gh-5422-broken_snapshot.test.lua
[007] [Instance "gh-5422-broken_snapshot"] Tarantool server failed to start
[007]
[007] Last 15 lines of Tarantool Log file [Instance "gh-5422-broken_snapshot"][/source/test/var_hdd_vinyl/007_box/gh-5422-broken_snapshot.log]:
[007] 2021-03-12 06:14:18.592 [23224] main/103/gh-5422-broken_snapshot C> Tarantool 2.8.0-114-g9ccd4eab6
[007] 2021-03-12 06:14:18.592 [23224] main/103/gh-5422-broken_snapshot C> log level 5
[007] 2021-03-12 06:14:18.595 [23224] main/103/gh-5422-broken_snapshot I> mapping 268435456 bytes for memtx tuple arena...
[007] 2021-03-12 06:14:18.595 [23224] main/103/gh-5422-broken_snapshot I> Actual slab_alloc_factor calculated on the basis of desired slab_alloc_factor = 1.044274
[007] 2021-03-12 06:14:18.595 [23224] main/103/gh-5422-broken_snapshot I> mapping 134217728 bytes for vinyl tuple arena...
[007] 2021-03-12 06:14:18.600 [23224] main/103/gh-5422-broken_snapshot I> instance uuid ef6b0d94-09c1-460b-a290-a195937e9d05
[007] 2021-03-12 06:14:18.601 [23224] main/103/gh-5422-broken_snapshot I> instance vclock {1: 10004}
[007] 2021-03-12 06:14:18.601 [23224] iproto/101/main I> binary: bound to unix/:(socket)
[007] 2021-03-12 06:14:18.601 [23224] main/103/gh-5422-broken_snapshot I> recovery start
[007] 2021-03-12 06:14:18.601 [23224] main/103/gh-5422-broken_snapshot I> recovering from `/source/test/var_hdd_vinyl/007_box/gh-5422-broken_snapshot/00000000000000010004.snap'
[007] 2021-03-12 06:14:19.457 [23224] main/103/gh-5422-broken_snapshot I> cluster uuid 0a65570c-3308-4a6d-90fb-8ac992b2039d
[007] 2021-03-12 06:14:19.514 [23224] main/103/gh-5422-broken_snapshot I> assigned id 1 to replica ef6b0d94-09c1-460b-a290-a195937e9d05
[007] 2021-03-12 06:14:19.544 [23224] main/103/gh-5422-broken_snapshot xlog.c:1945 E> can't open tx: /source/test/var_hdd_vinyl/007_box/gh-5422-broken_snapshot/00000000000000010004.snap: has some data af
ter eof marker at 54775 
[007] 2021-03-12 06:14:19.544 [23224] main/103/gh-5422-broken_snapshot memtx_engine.c:210 E> snapshot `unction' has no EOF marker
[007] 2021-03-12 06:14:19.545 [23224] main/103/gh-5422-broken_snapshot I> recover from `/source/test/var_hdd_vinyl/007_box/gh-5422-broken_snapshot/00000000000000000000.xlog'
[007]
[007] [Instance "box"] Failed to start tarantool instance "gh-5422-broken_snapshot"
[007] [ fail ]
  1. results file checksum: e0876f5e7f3add6bc448347cda6d4899
No output during 490 seconds. Will abort after 520 seconds without output. List of workers not reporting the status:
- 006_box [box/gh-5422-broken_snapshot.test.lua, None] at var_hdd_vinyl/006_box/gh-5422-broken_snapshot.result:127
[006] box/gh-5422-broken_snapshot.test.lua                            Test timeout of 510 secs reached  [ fail ]
[006]
[006] Test failed! Result content mismatch:
[006] --- box/gh-5422-broken_snapshot.result    Mon Mar  8 16:59:08 2021
[006] +++ var_hdd_vinyl/rejects/box/gh-5422-broken_snapshot.reject      Fri Mar 12 06:22:41 2021
[006] @@ -111,189 +111,17 @@
[006]  for i = 1, 10000, 1 do
[006]      assert(val[i] ~= nil)
[006]  end;
[006] ----
[006] -...
[006] +[Lost current connection]
[006]  test_run:cmd("setopt delimiter ''");
[006] ----
[006] -- true
[006] -...
[006] +[Lost current connection]
[006] +-- restore snapshot
[006] +-- truncate
[006]
  1. results file checksum: f18ea16ef3fbfdc357aad77977e7c309
No output during 440 seconds. Will abort after 520 seconds without output. List of workers not reporting the status:
- 013_box [box/gh-5422-broken_snapshot.test.lua, None] at var_hdd_vinyl/013_box/gh-5422-broken_snapshot.result:100
[013] box/gh-5422-broken_snapshot.test.lua                            Test timeout of 510 secs reached  [ fail ]
[013]
[013] Test failed! Result content mismatch:
[013] --- box/gh-5422-broken_snapshot.result    Mon Mar  8 16:59:08 2021
[013] +++ var_hdd_vinyl/rejects/box/gh-5422-broken_snapshot.reject      Fri Mar 12 06:44:42 2021
[013] @@ -98,202 +98,3 @@
[013]  - true
[013]  ...
[013]  test_run:cmd("switch test")
[013] ----
[013] -- true
[013] -...
[013] -test_run:cmd("restart server test with script='box/gh-5422-broken_snapshot.lua'")
[013] -test_run:cmd("setopt delimiter ';'")
[013] ----
[013] -- true
[013] -...
[013] --- check that all data valid
[013] -val = box.space.test:select()
[013] 
  1. results file checksum: 859f5e6651591b0460e1881c95468f99
[2021-03-12 07:37:43.140961] [ fail ]
[2021-03-12 07:37:43.141060] 
[2021-03-12 07:37:43.141060] Test failed! Result content mismatch:
[2021-03-12 07:37:43.144473] --- box/gh-5422-broken_snapshot.result     Mon Mar  8 16:59:08 2021
[2021-03-12 07:37:43.144555] +++ var_hdd_vinyl/rejects/box/gh-5422-broken_snapshot.reject       Fri Mar 12 07:37:43 2021
[2021-03-12 07:37:43.144624] @@ -118,182 +118,3 @@
[2021-03-12 07:37:43.144724]  - true
[2021-03-12 07:37:43.144774]  ...
[2021-03-12 07:37:43.144825]  test_run:cmd("switch default")
[2021-03-12 07:37:43.144886] ----
[2021-03-12 07:37:43.144935] -- true
[2021-03-12 07:37:43.144980] -...
[2021-03-12 07:37:43.145065] --- restore snapshot
[2021-03-12 07:37:43.145118] -os.execute(string.format('cp %s.save %s', snapfile, snapfile))
[2021-03-12 07:37:43.145165] ----
[2021-03-12 07:37:43.145220] -- 0
[2021-03-12 07:37:43.145327] -...
[2021-03-12 07:37:43.145376] --- truncate
[2021-03-12 07:37:43.145422] -os.execute(string.format('dd if=%s.save of=%s bs=%d count=1', snapfile, snapfile, size))
[2021-03-12 07:37:43.145527] [005] box/gh-5422-broken_snapshot.test.lua                            Test timeout of 51 secs reached   [ fail ]
[005]
[005] Test failed! Result content mismatch:
[005] --- box/gh-5422-broken_snapshot.result    Mon Mar  8 16:59:08 2021
[005] +++ var_hdd_vinyl/rejects/box/gh-5422-broken_snapshot.reject      Fri Mar 12 09:35:41 2021
[005] @@ -6,294 +6,3 @@ 
[005]  ---
[005]  ...
[005]  test_run:cmd("restart server default")
[005] -fio = require 'fio'
[005] ----
[005] -...
[005] -test_run:cmd("setopt delimiter ';'")
[005] ----
[005] -- true
[005] -...
[005] -function get_snap_file ()
[005] -    local snapfile = nil
[005] -    local directory = fio.pathjoin(fio.cwd(), 'gh-5422-broken_snapshot')
[005] 

[2021-03-12 07:37:43.145527] Last 15 lines of Tarantool Log file [Instance "box"][/source/test/var_hdd_vinyl/001_box/box.log]:
[2021-03-12 07:37:43.145889] 2021-03-12 07:29:13.418 [31344] main/103/box I> recovering from `/source/test/var_hdd_vinyl/001_box/box/00000000000000000000.snap'
[2021-03-12 07:37:43.145950] 2021-03-12 07:29:13.427 [31344] main/103/box I> cluster uuid 88e6d315-1236-4398-9277-4c58666a98e9
[2021-03-12 07:37:43.146009] 2021-03-12 07:29:13.446 [31344] main/103/box I> assigned id 1 to replica b8a4a6c5-b20f-4777-9dd1-798d9efaff38
[2021-03-12 07:37:43.146087] 2021-03-12 07:29:13.446 [31344] main/103/box I> recover from `/source/test/var_hdd_vinyl/001_box/box/00000000000000000000.xlog'
[2021-03-12 07:37:43.146133] 2021-03-12 07:29:13.446 [31344] main/103/box I> done `/source/test/var_hdd_vinyl/001_box/box/00000000000000000000.xlog'
[2021-03-12 07:37:43.146176] 2021-03-12 07:29:13.447 [31344] main/103/box I> ready to accept requests
[2021-03-12 07:37:43.146281] 2021-03-12 07:29:13.447 [31344] main/103/box C> leaving orphan mode
[2021-03-12 07:37:43.146326] 2021-03-12 07:29:13.447 [31344] main/103/box I> set 'log_level' configuration option to 5
[2021-03-12 07:37:43.146364] 2021-03-12 07:29:13.447 [31344] main/106/checkpoint_daemon I> scheduled next checkpoint for Fri Mar 12 08:53:05 2021
[2021-03-12 07:37:43.146404] 2021-03-12 07:29:13.448 [31344] main/103/box I> set 'memtx_memory' configuration option to 107374182
[2021-03-12 07:37:43.146499] 2021-03-12 07:29:13.448 [31344] main/103/box I> set 'listen' configuration option to "\/source\/test\/var_hdd_vinyl\/001_box\/box.socket-iproto"
[2021-03-12 07:37:43.146599] 2021-03-12 07:29:13.448 [31344] main/103/box I> set 'replication_sync_timeout' configuration option to 500
[2021-03-12 07:37:43.146638] 2021-03-12 07:29:13.448 [31344] main/103/box I> set 'log_format' configuration option to "plain"
[2021-03-12 07:37:43.146714] 2021-03-12 07:29:13.448 [31344] main/119/console/unix/:/source/test/var_hdd_vinyl/001_box/box.socket-admin I> started
[2021-03-12 07:37:43.146753] 2021-03-12 07:29:13.448 [31344] main C> entering the event loop
[2021-03-12 07:37:43.146863] DEBUG: Cleanup non-default servers: ['test', 'proxy']
[2021-03-12 07:37:43.150647] DEBUG: [Instance box] Stopping the server...
[2021-03-12 07:37:43.164516] DEBUG: [Instance box] Installing the server...
  1. (w/o swap and small timeout 50) results file checksum: ce55ff62188abc1e7dea5aa72d10b9f4
[005] box/gh-5422-broken_snapshot.test.lua                            Test timeout of 51 secs reached   [ fail ]
[005]
[005] Test failed! Result content mismatch:
[005] --- box/gh-5422-broken_snapshot.result    Mon Mar  8 16:59:08 2021
[005] +++ var_hdd_vinyl/rejects/box/gh-5422-broken_snapshot.reject      Fri Mar 12 09:35:41 2021
[005] @@ -6,294 +6,3 @@ 
[005]  ---
[005]  ...
[005]  test_run:cmd("restart server default")
[005] -fio = require 'fio'
[005] ----
[005] -...
[005] -test_run:cmd("setopt delimiter ';'")
[005] ----
[005] -- true
[005] -...
[005] -function get_snap_file ()
[005] -    local snapfile = nil
[005] -    local directory = fio.pathjoin(fio.cwd(), 'gh-5422-broken_snapshot')
[005] 

results file checksum: f00be1030a87b60f27003430d16346c1
results file checksum: d519c7674867c715f74f337b7f86d4e0

No output during 20 seconds. Will abort after 52 seconds without output. List of workers not reporting the status:
- 002_box [box/gh-5422-broken_snapshot.test.lua, None] at var_hdd_vinyl/002_box/gh-5422-broken_snapshot.result:67
[002] box/gh-5422-broken_snapshot.test.lua                            Test timeout of 51 secs reached   [ fail ]
[002]
[002] Test failed! Result content mismatch:
[002] --- box/gh-5422-broken_snapshot.result    Mon Mar  8 16:59:08 2021
[002] +++ var_hdd_vinyl/rejects/box/gh-5422-broken_snapshot.reject      Fri Mar 12 10:02:18 2021
[002] @@ -58,242 +58,10 @@
[002]  ---
[002]  ...
[002]  box.snapshot()   
[002] ----
[002] -- ok
[002] -...
[002]  test_run:cmd("switch default")
[002] ----
[002] -- true
[002] -...
[002] +[Lost current connection]
[002]  snapfile = get_snap_file()
[002] ----

Reproduce on dev1:

Try memory overload for container run with:
--cpus=2 --memory=8G --memory-swap=12G --memory-reservation=8G:

  1. run in container:
rm -rf rss_persec.log ; ( ( while date && sleep 1 ; do cat /sys/fs/cgroup/memory/memory.stat ; done ) >>rss_persec.log & echo $! >rss.pid & ) ; ( export PATH=$PATH:/tnt/src ; export REPLICATION_SYNC_TIMEOUT=2500 ; export TEST_TIMEOUT=2510 ; export NO_OUTPUT_TIMEOUT=2520 ; date ; time ./test-run.py -j 1200 --builddir /tnt --vardir var_hdd_vinyl `for r in {1..64} ; do echo box/gh-5422-broken_snapshot. ; done` --force 2>&1 ; sleep 1 ; kill -USR2 `cat rss.pid` ; date ) > test.log &
  1. run on host which runs container:
docker events
2021-03-16T08:21:57.833804548+03:00 container oom 7129eaff03192cfda5896e17ae3506935cc59c2c6310015b47ed28ffc7c41cc0 (image=registry.gitlab.com/tarantool/tarantool/testing/debian-stretch, name=goofy_mcnulty)
2021-03-16T08:22:31.626244711+03:00 container oom 7129eaff03192cfda5896e17ae3506935cc59c2c6310015b47ed28ffc7c41cc0 (image=registry.gitlab.com/tarantool/tarantool/testing/debian-stretch, name=goofy_mcnulty)
2021-03-16T08:22:36.535503324+03:00 container oom 7129eaff03192cfda5896e17ae3506935cc59c2c6310015b47ed28ffc7c41cc0 (image=registry.gitlab.com/tarantool/tarantool/testing/debian-stretch, name=goofy_mcnulty)
  1. check in container RSS maximums with:
grep total_rss\  rss_persec.log | sort

Continue investigation at #98

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions