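Go Pitfalls: panic and recover

[About the author] Yi Letian, Overseas Mall Team, Xiaomi Information Technology Department

Preface

Since its release, Go has been known for high performance and strong concurrency support. Because the standard library ships an http package (net/http), even programmers who have only just picked up the language can easily write an HTTP server.

Everything has two sides, though. A language has strengths to be proud of, and it inevitably hides a fair number of pitfalls as well. Newcomers who don't know about them fall in easily. This series, "Go Pitfalls", opens with panic and recover in Go and walks through the pitfalls the author has hit, along with ways to get out of them.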

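A first look at panic and recover

panic
In English, panic means sudden fear or alarm. Taken literally, in Go it stands for an extremely serious problem, the kind programmers dread most. If nothing recovers from it, the program terminates and exits. The built-in function panic is mainly used to raise an exception deliberately, much like the throw keyword in languages such as Java.
recover
In English, recover means to restore or return to a normal state. Taken literally, in Go it stands for bringing the program back from a serious error to normal operation. The built-in function recover is mainly used to catch an exception and return the program to normal execution, much like try ... catch in languages such as Java.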

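The author spent 6 years developing in C on Linux. C has no notion of exception handling: no try ... catch, and no panic or recover. Still, the underlying principle is the same everywhere: the difference between exceptions and the "if error then return" style lies mainly in how many call-stack frames control crosses at once, as shown in the figure below:

Under normal control flow, the call stack unwinds one frame at a time, whereas catching an exception can be thought of as a long-distance jump across the call stack. In C this is done with the two functions setjmp and longjmp.

Mechanisms such as try ... catch, recover, and setjmp save the current program state (mainly the CPU's stack pointer register sp and program counter pc; Go's recover relies on defer to keep track of sp and pc) into memory shared with throw, panic, or longjmp. When an exception occurs, the saved sp and pc values are read back from that memory, the stack is reset directly to the position sp points to, and execution continues at the instruction pc points to, bringing the program from the exceptional state back to normal.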
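To make the contrast concrete, here is a minimal sketch in Go. The function names (level1Err, level1Panic, and so on) are hypothetical and exist only for illustration. In the error-return style every frame checks and forwards the error, while in the panic style the two inner frames contain no handling code at all and control lands directly in the deferred function of the outermost frame.

```go
package main

import (
	"errors"
	"fmt"
)

// Error-return style: every frame checks the error and hands it upward.
func level3Err() error { return errors.New("failed at level 3") }

func level2Err() error {
	if err := level3Err(); err != nil {
		return err
	}
	return nil
}

func level1Err() error {
	if err := level2Err(); err != nil {
		return err
	}
	return nil
}

// Panic style: the inner frames contain no error handling at all.
func level3Panic() { panic("failed at level 3") }

func level2Panic() { level3Panic() }

// level1Panic recovers in a deferred function and reports the panic value
// through its named result.
func level1Panic() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	level2Panic()
	return "not reached"
}

func main() {
	fmt.Println("error return: ", level1Err())
	fmt.Println("panic/recover:", level1Panic())
}
```

Both paths report the same message; what differs is how many frames take part in passing it up.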

A deeper look at panic and recover

Source code

The source code for panic and recover is in src/runtime/panic.go in the Go source tree, in the functions gopanic and gorecover.

// gopanic, at line 454 of src/runtime/panic.go

// The implementation of the predeclared function panic.
func gopanic(e interface{}) {
  gp := getg()
  if gp.m.curg != gp {
    print("panic: ")
    printany(e)
    print("\n")
    throw("panic on system stack")
  }

  if gp.m.mallocing != 0 {
    print("panic: ")
    printany(e)
    print("\n")
    throw("panic during malloc")
  }
  if gp.m.preemptoff != "" {
    print("panic: ")
    printany(e)
    print("\n")
    print("preempt off reason: ")
    print(gp.m.preemptoff)
    print("\n")
    throw("panic during preemptoff")
  }
  if gp.m.locks != 0 {
    print("panic: ")
    printany(e)
    print("\n")
    throw("panic holding locks")
  }

  var p _panic
  p.arg = e
  p.link = gp._panic
  gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

  atomic.Xadd(&runningPanicDefers, 1)

  for {
    d := gp._defer
    if d == nil {
      break
    }

    // If defer was started by earlier panic or Goexit (and, since we're back here, that triggered a new panic),
    // take defer off list. The earlier panic or Goexit will not continue running.
    if d.started {
      if d._panic != nil {
        d._panic.aborted = true
      }
      d._panic = nil
      d.fn = nil
      gp._defer = d.link
      freedefer(d)
      continue
    }

    // Mark defer as started, but keep on list, so that traceback
    // can find and update the defer's argument frame if stack growth
    // or a garbage collection happens before reflectcall starts executing d.fn.
    d.started = true

    // Record the panic that is running the defer.
    // If there is a new panic during the deferred call, that panic
    // will find d in the list and will mark d._panic (this panic) aborted.
    d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

    p.argp = unsafe.Pointer(getargp(0))
    reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
    p.argp = nil

    // reflectcall did not panic. Remove d.
    if gp._defer != d {
      throw("bad defer entry in panic")
    }
    d._panic = nil
    d.fn = nil
    gp._defer = d.link

    // trigger shrinkage to test stack copy. See stack_test.go:TestStackPanic
    //GC()

    pc := d.pc
    sp := unsafe.Pointer(d.sp) // must be pointer so it gets adjusted during stack copy
    freedefer(d)
    if p.recovered {
      atomic.Xadd(&runningPanicDefers, -1)

      gp._panic = p.link
      // Aborted panics are marked but remain on the g.panic list.
      // Remove them from the list.
      for gp._panic != nil && gp._panic.aborted {
        gp._panic = gp._panic.link
      }
      if gp._panic == nil { // must be done with signal
        gp.sig = 0
      }
      // Pass information about recovering frame to recovery.
      gp.sigcode0 = uintptr(sp)
      gp.sigcode1 = pc
      mcall(recovery)
      throw("recovery failed") // mcall should not return
    }
  }

  // ran out of deferred calls - old-school panic now
  // Because it is unsafe to call arbitrary user code after freezing
  // the world, we call preprintpanics to invoke all necessary Error
  // and String methods to prepare the panic strings before startpanic.
  preprintpanics(gp._panic)

  fatalpanic(gp._panic) // should not return
  *(*int)(nil) = 0      // not reached
}

// gorecover, at line 585 of src/runtime/panic.go

// The implementation of the predeclared function recover.
// Cannot split the stack because it needs to reliably
// find the stack segment of its caller.
//
// TODO(rsc): Once we commit to CopyStackAlways,
// this doesn't need to be nosplit.
//go:nosplit
func gorecover(argp uintptr) interface{} {
  // Must be in a function running as part of a deferred call during the panic.
  // Must be called from the topmost function of the call
  // (the function used in the defer statement).
  // p.argp is the argument pointer of that topmost deferred function call.
  // Compare against argp reported by caller.
  // If they match, the caller is the one who can recover.
  gp := getg()
  p := gp._panic
  if p != nil && !p.recovered && argp == uintptr(p.argp) {
    p.recovered = true
    return p.arg
  }
  return nil
}
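The argp comparison in gorecover has a visible consequence in user code: recover only takes effect when the deferred function itself calls it. The sketch below illustrates this under stated assumptions; helper and nested are hypothetical names. The deferred function that runs first calls recover one frame too deep and gets nil, so the panic keeps going until the next deferred function calls recover directly.

```go
package main

import "fmt"

// helper calls recover, but helper is not itself the deferred function,
// so recover returns nil here and the panic is not stopped.
func helper() interface{} {
	return recover()
}

func nested() (msg string) {
	defer func() {
		// Registered first, runs last: recover is called directly by the
		// deferred function, so it stops the panic.
		if r := recover(); r != nil {
			msg = "recovered: " + fmt.Sprint(r)
		}
	}()
	defer func() {
		// Registered last, runs first: recover is called one frame too deep.
		if r := helper(); r != nil {
			msg = "unexpected: recovered inside helper"
		}
	}()
	panic("boom")
}

func main() {
	fmt.Println(nested())
}
```

Deferred calls run in last-in, first-out order, which is why the indirect attempt is made before the direct one.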

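From the code we can see that the main flow inside panic is as follows:

1. Get the g the caller is running on, that is, the current goroutine.
2. Walk the defer functions registered on g and run them.
3. If a defer function calls recover and finds that a panic is in progress, the panic is marked as recovered.
4. While walking the defers, if the panic is found to be marked as recovered, take that defer's sp and pc and store them in two status-code fields of g (gp.sigcode0 and gp.sigcode1).
5. Call runtime.mcall to switch to m->g0 and jump to the recovery function, passing the g obtained earlier as its argument.

The code of runtime.mcall is in src/runtime/asm_xxx.s in the Go source tree, where xxx is the platform, for example amd64. The code is as follows:

// src/runtime/asm_amd64.s, line 274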

// func mcall(fn func(*g))
// Switch to m->g0's stack, call fn(g).
// Fn must never return. It should gogo(&g->sched)
// to keep running g.
TEXT runtime·mcall(SB), NOSPLIT, $0-8
    MOVQ	fn+0(FP), DI

    get_tls(CX)
    MOVQ	g(CX), AX	// save state in g->sched
    MOVQ	0(SP), BX	// caller's PC
    MOVQ	BX, (g_sched+gobuf_pc)(AX)
    LEAQ	fn+0(FP), BX	// caller's SP
    MOVQ	BX, (g_sched+gobuf_sp)(AX)
    MOVQ	AX, (g_sched+gobuf_g)(AX)
    MOVQ	BP, (g_sched+gobuf_bp)(AX)

    // switch to m->g0 & its stack, call fn
    MOVQ	g(CX), BX
    MOVQ	g_m(BX), BX
    MOVQ	m_g0(BX), SI
    CMPQ	SI, AX	// if g == m->g0 call badmcall
    JNE	3(PC)
    MOVQ	$runtime·badmcall(SB), AX
    JMP	AX
    MOVQ	SI, g(CX)	// g = m->g0
    MOVQ	(g_sched+gobuf_sp)(SI), SP	// sp = m->g0->sched.sp
    PUSHQ	AX
    MOVQ	DI, DX
    MOVQ	0(DI), DI
    CALL	DI
    POPQ	AX
    MOVQ	$runtime·badmcall2(SB), AX
    JMP	AX
    RET

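The reason for switching to m->g0 is that the Go runtime has its own stack and goroutine, and recovery runs in the runtime environment, so execution must first be scheduled onto m->g0 to run the recovery function.

In the recovery function, the two status codes stored in g are used to rewind the stack pointer sp and to restore the program counter pc into the scheduling state (gp.sched). It then calls gogo to reschedule g, which resumes at the point recorded for the defer, so the function that registered the deferred call returns to its caller as if it had returned normally, and the goroutine keeps running.
The code is as follows:

// recovery, at line 637 of src/runtime/panic.go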

// Unwind the stack after a deferred function calls recover
// after a panic. Then arrange to continue running as though
// the caller of the deferred function returned normally.
func recovery(gp *g) {
    // Info about defer passed in G struct.
    sp := gp.sigcode0
    pc := gp.sigcode1

    // d's arguments need to be in the stack.
    if sp != 0 && (sp < gp.stack.lo || gp.stack.hi < sp) {
        print("recover: ", hex(sp), " not in [", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n")
        throw("bad recovery")
    }

    // Make the deferproc for this d return again,
    // this time returning 1.  The calling function will
    // jump to the standard return epilogue.
    gp.sched.sp = sp
    gp.sched.pc = pc
    gp.sched.lr = 0
    gp.sched.ret = 1
    gogo(&gp.sched)
}

// gogo, in src/runtime/asm_amd64.s

// func gogo(buf *gobuf)
// restore state from Gobuf; longjmp
TEXT runtime·gogo(SB), NOSPLIT, $16-8
    MOVQ	buf+0(FP), BX		// gobuf
    MOVQ	gobuf_g(BX), DX
    MOVQ	0(DX), CX		// make sure g != nil
    get_tls(CX)
    MOVQ	DX, g(CX)
    MOVQ	gobuf_sp(BX), SP	// restore SP
    MOVQ	gobuf_ret(BX), AX
    MOVQ	gobuf_ctxt(BX), DX
    MOVQ	gobuf_bp(BX), BP
    MOVQ	$0, gobuf_sp(BX)	// clear to help garbage collector
    MOVQ	$0, gobuf_ret(BX)
    MOVQ	$0, gobuf_ctxt(BX)
    MOVQ	$0, gobuf_bp(BX)
    MOVQ	gobuf_pc(BX), BX
    JMP	BX

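That is how Go handles exceptions under the hood. Condensed into three steps:

1. A function registers a defer function that calls recover.
2. A panic is triggered, and execution switches to the runtime environment to obtain the sp and pc of the g whose defer called recover.
3. Execution resumes with the logic that follows recover in the defer.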
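The three steps can be observed from user code with a small sketch. The function risky and its fail parameter are hypothetical and used only for illustration. After recover, the rest of the deferred function runs, the statements that follow the panic point in risky are skipped, and risky returns to its caller normally.

```go
package main

import "fmt"

// risky records the steps it goes through in its named result, so the
// deferred function can keep appending to it after a panic.
func risky(fail bool) (steps []string) {
	defer func() {
		if r := recover(); r != nil {
			steps = append(steps, "recovered: "+fmt.Sprint(r))
		}
		steps = append(steps, "deferred function finished")
	}()
	steps = append(steps, "before panic")
	if fail {
		panic("boom")
	}
	steps = append(steps, "after panic point")
	return steps
}

func main() {
	fmt.Println(risky(true))
	fmt.Println(risky(false))
	fmt.Println("caller continues")
}
```

With fail set to true, "after panic point" never appears in the result, yet the caller continues as usual.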

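What are the pitfalls

As mentioned earlier, the panic function is mainly used to raise an exception deliberately. In business code, if resource initialization fails during program startup, we can call panic to terminate the program immediately. This poses no problem for newcomers and is easy to do.

Reality is harsher, however: many places in the Go runtime call panic themselves, and for newcomers unfamiliar with Go's internals this amounts to a field of deep pits. Without knowing them, it is impossible to write robust Go code.

Below, the author goes through them one by one.

Array (slice) index out of range

This one is easy to understand: Go checks index bounds at run time, and an out-of-range index on an array or slice is a fatal error. The following code demonstrates it: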

package main

import (
    "fmt"
)

func foo() {
    defer func() {
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    var bar = []int{1}
    fmt.Println(bar[1])
}

func main() {
    foo()
    fmt.Println("exit")
}

Output:

runtime error: index out of range
exit

Because the code uses recover, the program recovers and prints exit.

If the recover lines are commented out, the following log is printed instead:

panic: runtime error: index out of range

goroutine 1 [running]:
main.foo()
  /home/letian/work/go/src/test/test.go:14 +0x3e
main.main()
  /home/letian/work/go/src/test/test.go:18 +0x22
exit status 2

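Accessing an uninitialized or nil pointer

For anyone with C/C++ experience this is easy to understand. For newcomers who have never used pointers, though, it is one of the most common mistakes.
The following code demonstrates it: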

package main

import (
    "fmt"
)

func foo() {
    defer func() {
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    var bar *int
    fmt.Println(*bar)
}

func main() {
    foo()
    fmt.Println("exit")
}

Output:

runtime error: invalid memory address or nil pointer dereference
exit

If the recover lines are commented out, the output is:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4869ff]

goroutine 1 [running]:
main.foo()
  /home/letian/work/go/src/test/test.go:14 +0x3f
main.main()
  /home/letian/work/go/src/test/test.go:18 +0x22
exit status 2

Sending on a closed chan

This is another mistake that newcomers to chan make easily. The following code demonstrates it:

package main

import (
    "fmt"
)

func foo() {
    defer func() {
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    var bar = make(chan int, 1)
    close(bar)
    bar <- 1
}

func main() {
    foo()
    fmt.Println("exit")
}

Output:

send on closed channel
exit

If recover is commented out, the output is:

panic: send on closed channel

goroutine 1 [running]:
main.foo()
  /home/letian/work/go/src/test/test.go:15 +0x83
main.main()
  /home/letian/work/go/src/test/test.go:19 +0x22
exit status 2

The logic that handles this is in the chansend function in src/runtime/chan.go, shown below:

// src/runtime/chan.go, line 269

/*
  * generic single channel send/recv
  * If block is not nil,
  * then the protocol will not
  * sleep but return if it could
  * not complete.
  *
  * sleep can wake up with g.param == nil
  * when a channel involved in the sleep has
  * been closed.  it is easiest to loop and re-run
  * the operation; we'll see that it's now closed.
  */
func chansend(c *hchan, ep unsafe.Pointer, block bool, callerpc uintptr) bool {
    if c == nil {
        if !block {
            return false
        }
        gopark(nil, nil, waitReasonChanSendNilChan, traceEvGoStop, 2)
        throw("unreachable")
    }

    if debugChan {
        print("chansend: chan=", c, "\n")
    }

    if raceenabled {
        racereadpc(c.raceaddr(), callerpc, funcPC(chansend))
    }

    // Fast path: check for failed non-blocking operation without acquiring the lock.
    //
    // After observing that the channel is not closed, we observe that the channel is
    // not ready for sending. Each of these observations is a single word-sized read
    // (first c.closed and second c.recvq.first or c.qcount depending on kind of channel).
    // Because a closed channel cannot transition from 'ready for sending' to
    // 'not ready for sending', even if the channel is closed between the two observations,
    // they imply a moment between the two when the channel was both not yet closed
    // and not ready for sending. We behave as if we observed the channel at that moment,
    // and report that the send cannot proceed.
    //
    // It is okay if the reads are reordered here: if we observe that the channel is not
    // ready for sending and then observe that it is not closed, that implies that the
    // channel wasn't closed during the first observation.
    if !block && c.closed == 0 && ((c.dataqsiz == 0 && c.recvq.first == nil) ||
        (c.dataqsiz > 0 && c.qcount == c.dataqsiz)) {
        return false
    }

    var t0 int64
    if blockprofilerate > 0 {
        t0 = cputicks()
    }

    lock(&c.lock)

    if c.closed != 0 {
        unlock(&c.lock)
        panic(plainError("send on closed channel"))
    }

    if sg := c.recvq.dequeue(); sg != nil {
        // Found a waiting receiver. We pass the value we want to send
        // directly to the receiver, bypassing the channel buffer (if any).
        send(c, sg, ep, func() { unlock(&c.lock) }, 3)
        return true
    }

    if c.qcount < c.dataqsiz {
        // Space is available in the channel buffer. Enqueue the element to send.
        qp := chanbuf(c, c.sendx)
        if raceenabled {
            raceacquire(qp)
            racerelease(qp)
        }
        typedmemmove(c.elemtype, qp, ep)
        c.sendx++
        if c.sendx == c.dataqsiz {
            c.sendx = 0
        }
        c.qcount++
        unlock(&c.lock)
        return true
    }

    if !block {
        unlock(&c.lock)
        return false
    }

    // Block on the channel. Some receiver will complete our operation for us.
    gp := getg()
    mysg := acquireSudog()
    mysg.releasetime = 0
    if t0 != 0 {
        mysg.releasetime = -1
    }
    // No stack splits between assigning elem and enqueuing mysg
    // on gp.waiting where copystack can find it.
    mysg.elem = ep
    mysg.waitlink = nil
    mysg.g = gp
    mysg.isSelect = false
    mysg.c = c
    gp.waiting = mysg
    gp.param = nil
    c.sendq.enqueue(mysg)
    goparkunlock(&c.lock, waitReasonChanSend, traceEvGoBlockSend, 3)
    // Ensure the value being sent is kept alive until the
    // receiver copies it out. The sudog has a pointer to the
    // stack object, but sudogs aren't considered as roots of the
    // stack tracer.
    KeepAlive(ep)

    // someone woke us up.
    if mysg != gp.waiting {
        throw("G waiting list is corrupted")
    }
    gp.waiting = nil
    if gp.param == nil {
        if c.closed == 0 {
            throw("chansend: spurious wakeup")
        }
        panic(plainError("send on closed channel"))
    }
    gp.param = nil
    if mysg.releasetime > 0 {
        blockevent(mysg.releasetime-t0, 2)
    }
    mysg.c = nil
    releaseSudog(mysg)
    return true
}
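One common way to stay clear of this panic is to make sure a channel is closed exactly once, and only after every sender has finished. The sketch below is one possible arrangement, not the only one; produce and sum are hypothetical helpers. A sync.WaitGroup tracks the senders, and a separate goroutine closes the channel once all of them are done.

```go
package main

import (
	"fmt"
	"sync"
)

// produce starts n senders and closes the channel only after all of them
// have finished, so no goroutine can send on a closed channel.
func produce(n int) <-chan int {
	ch := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			ch <- v
		}(i)
	}
	go func() {
		wg.Wait() // every sender is done
		close(ch) // closed exactly once
	}()
	return ch
}

// sum drains the channel; the range loop ends when the channel is closed.
func sum(ch <-chan int) int {
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sum(produce(5))) // 0+1+2+3+4
}
```

The receiver never closes the channel; it only observes the close through the end of the range loop.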

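Concurrent reads and writes on the same map

For those new to concurrent programming, reading and writing a map concurrently is another problem that is easy to run into. The following code demonstrates it: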

package main

import (
    "fmt"
)

func foo() {
    defer func() {
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    var bar = make(map[int]int)
    go func() {
        defer func() {
            if err := recover(); err != nil {
                fmt.Println(err)
            }
        }()
        for {
            _ = bar[1]
        }
    }()
    for {
        bar[1] = 1
    }
}

func main() {
    foo()
    fmt.Println("exit")
}

Output:

fatal error: concurrent map read and map write

goroutine 5 [running]:
runtime.throw(0x4bd8b0, 0x21)
  /home/letian/.gvm/gos/go1.12/src/runtime/panic.go:617 +0x72 fp=0xc00004c780 sp=0xc00004c750 pc=0x427f22
runtime.mapaccess1_fast64(0x49eaa0, 0xc000088180, 0x1, 0xc0000260d8)
  /home/letian/.gvm/gos/go1.12/src/runtime/map_fast64.go:21 +0x1a8 fp=0xc00004c7a8 sp=0xc00004c780 pc=0x40eb58
main.foo.func2(0xc000088180)
  /home/letian/work/go/src/test/test.go:21 +0x5c fp=0xc00004c7d8 sp=0xc00004c7a8 pc=0x48708c
runtime.goexit()
  /home/letian/.gvm/gos/go1.12/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00004c7e0 sp=0xc00004c7d8 pc=0x450e51
created by main.foo
  /home/letian/work/go/src/test/test.go:14 +0x68

goroutine 1 [runnable]:
main.foo()
  /home/letian/work/go/src/test/test.go:25 +0x8b
main.main()
  /home/letian/work/go/src/test/test.go:30 +0x22
exit status 2

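Attentive readers will notice that the log does not contain the exit we print at the end of the program; the call stacks are dumped directly instead. Looking at src/runtime/map.go, it is not hard to find these lines: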

if h.flags&hashWriting != 0 {
  throw("concurrent map read and map write")
}

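Unlike the cases above, an exception raised inside the runtime through the throw function cannot be caught by recover in business code, which makes it the most dangerous of all. Wherever a map is read and written concurrently, access to it should therefore be protected with a lock.

Type assertions

Converting an interface value with a type assertion is another place where it is easy to slip, and it is one that even people who have used interface for a while tend to overlook. The following code demonstrates it: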
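As a minimal sketch of the locking approach, the wrapper below guards a map with sync.RWMutex. The type safeMap and its methods are hypothetical names chosen for this example; a plain sync.Mutex would also work, and RWMutex is used here only so that readers do not block one another.

```go
package main

import (
	"fmt"
	"sync"
)

// safeMap guards a plain map with a read-write lock.
type safeMap struct {
	mu sync.RWMutex
	m  map[int]int
}

func newSafeMap() *safeMap {
	return &safeMap{m: make(map[int]int)}
}

// Get takes the read lock, so several readers may proceed together.
func (s *safeMap) Get(k int) (int, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[k]
	return v, ok
}

// Set takes the write lock, excluding readers and other writers.
func (s *safeMap) Set(k, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

// run reads and writes the same key from two goroutines and returns the
// final value stored under key 1.
func run() int {
	sm := newSafeMap()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			sm.Set(1, i)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			sm.Get(1)
		}
	}()
	wg.Wait()
	v, _ := sm.Get(1)
	return v
}

func main() {
	fmt.Println(run())
}
```

Because every access goes through Get or Set, the runtime never sees an unsynchronized read overlapping a write.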

package main

import (
    "fmt"
)

func foo() {
    defer func() {
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    var i interface{} = "abc"
    _ = i.([]string)
}

func main() {
    foo()
    fmt.Println("exit")
}

Output:

interface conversion: interface {} is string, not []string
exit

The source is in src/runtime/iface.go, in the following two functions:

// panicdottypeE is called when doing an e.(T) conversion and the conversion fails.
// have = the dynamic type we have.
// want = the static type we're trying to convert to.
// iface = the static type we're converting from.
func panicdottypeE(have, want, iface *_type) {
    panic(&TypeAssertionError{iface, have, want, ""})
}

// panicdottypeI is called when doing an i.(T) conversion and the conversion fails.
// Same args as panicdottypeE, but "have" is the dynamic itab we have.
func panicdottypeI(have *itab, want, iface *_type) {
    var t *_type
    if have != nil {
        t = have._type
    }
    panicdottypeE(t, want, iface)
}
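The panic only occurs with the single-value form of the assertion. A sketch of the two usual alternatives follows; asStrings and describe are hypothetical helpers. The two-value "comma, ok" form reports failure through a boolean instead of panicking, and a type switch handles several candidate types in one place.

```go
package main

import "fmt"

// asStrings uses the two-value form: on a mismatch it returns the zero
// value and false instead of panicking.
func asStrings(i interface{}) ([]string, bool) {
	s, ok := i.([]string)
	return s, ok
}

// describe uses a type switch to branch on the dynamic type.
func describe(i interface{}) string {
	switch v := i.(type) {
	case string:
		return "string: " + v
	case []string:
		return fmt.Sprintf("[]string with %d elements", len(v))
	default:
		return fmt.Sprintf("unexpected type %T", v)
	}
}

func main() {
	_, ok := asStrings("abc")
	fmt.Println(ok)
	fmt.Println(describe("abc"))
	fmt.Println(describe([]string{"a", "b"}))
	fmt.Println(describe(42))
}
```

Neither form needs recover, since no panic is raised when the types do not match.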

Coming next

Go Pitfalls: channel and goroutine.

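Hiring

The Information Technology Department is the core department for Xiaomi's overall systems planning and construction. It supports the fine-grained management and delivery of the company's online and offline sales and service systems at home and abroad, supply chain systems, ERP systems, intranet OA systems, and data-driven decision systems, serving all of Xiaomi's internal business units as well as 40 ecosystem companies.

The department is also responsible for building the big data platform and the microservice architecture, working in languages including Java and Go. We are always looking for talented people with a deep understanding of and hands-on experience in big data processing, large-scale e-commerce back-end systems, and microservice adoption.

Résumés are welcome at: jin.zhang(a)xiaomi.com (Wuhan)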

Tags: Go