Crash detector looks broken #252

Open
@Totktonada

Description

System information

Linux.
test-run: e843552.
tarantool: 2.7.0-111-g28f3b2f1e.
python2: 2.7.17.
gevent: 1.5_alpha2 and 20.6.2.
greenlet: 0.4.15-r1 and 0.4.16.

How to observe the problem

  1. Pull and build recent tarantool (2.7.0-111-g28f3b2f1e in my case).

  2. Copy test/app-tap/test-timeout.test.lua from the comment on #244 ("Add test_timeout to limit test run time"). Don't forget to set the executable bit: chmod a+x test/app-tap/test-timeout.test.lua.

  3. Modify test/app-tap/debug/server.lua so that it exits with code 1 after 1 second. Place this code at the end of the file:

    local fiber = require('fiber')
    fiber.create(function()
        fiber.sleep(1)
        os.exit(1)
    end)

  4. Run the test: ./test/test-run.py -j1 app-tap/test-timeout.test.lua.

Expected: the failure of the non-default server is detected and testing fails with an appropriate message after ~1 second.

Got: the run fails only after 120 seconds (the default --no-output-timeout value), with no report about the failure of the non-default server.

Investigation

I observed that self.process.returncode in TarantoolServer.crash_detect() is 0, although the process actually exits with code 1.
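This would explain the missing report. The check that follows the polling loop in crash_detect() (visible in the context lines of the Variant 2 diff) treats 0, -SIGKILL and -SIGTERM as a normal stop and returns silently. A minimal sketch of that classification, where the names EXPECTED_EXIT_STATUSES and looks_like_crash are hypothetical and not part of test-run:

```python
import signal

# Exit statuses that the check quoted in the Variant 2 diff treats as a
# normal stop. The constant and function names here are illustrative only.
EXPECTED_EXIT_STATUSES = (0, -signal.SIGKILL, -signal.SIGTERM)


def looks_like_crash(returncode):
    """Return True when the exit status should be reported as a crash."""
    return returncode not in EXPECTED_EXIT_STATUSES


# The real exit code (1) would be reported; the observed value (0) is not.
real_is_crash = looks_like_crash(1)
observed_is_crash = looks_like_crash(0)
```

So as long as returncode is wrongly reported as 0, the crash detector has nothing to report and the run only ends on the no-output timeout.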

With either of the following two patches applied, the exit code is reported correctly.

Variant 1:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index 481b08f..6624ccf 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -8,7 +8,7 @@ import re
 import shlex
 import shutil
 import signal
-import subprocess
+from gevent import subprocess
 import sys
 import time
 import yaml

Variant 2:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index 481b08f..a26e7c6 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -928,7 +928,7 @@ class TarantoolServer(Server):
         while self.process.returncode is None:
             self.process.poll()
             if self.process.returncode is None:
-                gevent.sleep(0.1)
+                time.sleep(0.1)
 
         if self.process.returncode in [0, -signal.SIGKILL, -signal.SIGTERM]:
             return

With either patch applied, test-run then fails in app_server.py on the check if retval['returncode'] != 0, but never mind, that will be fixed soon. The tarantool instance that executes app-tap/test-timeout.test.lua hangs in its while true do end loop, but that will be fixed within the scope of #65 and #157.

There seems to be a problem in the interaction between Python's subprocess module and gevent, but I failed to create a small reproducer.
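A possible starting point for such a reproducer is the polling loop in isolation. The sketch below uses only the standard library and shows the expected behaviour: a child that exits with code 1 is reported as 1. The helper names wait_for_exit and spawn_failing_child are hypothetical, and the sleep function is a parameter so that another implementation can be swapped in.

```python
import subprocess
import sys
import time


def wait_for_exit(process, sleep=time.sleep, interval=0.1):
    """Poll until the child exits, using the same loop shape as crash_detect()."""
    while process.returncode is None:
        process.poll()
        if process.returncode is None:
            sleep(interval)
    return process.returncode


def spawn_failing_child(delay=0.3, code=1):
    """Start a child that sleeps for `delay` seconds, then exits with `code`."""
    script = 'import sys, time; time.sleep({}); sys.exit({})'.format(delay, code)
    return subprocess.Popen([sys.executable, '-c', script])


# With the stdlib sleep, the reported exit code matches the real one.
returncode = wait_for_exit(spawn_failing_child())
```

To probe the suspected interaction, one could pass sleep=gevent.sleep while keeping the stdlib subprocess import, which corresponds to the unpatched code. Whether this alone reproduces the wrong exit code is not known.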

Labels: bug
