Injecting into zygote through the system's native bridge

canyie

Published on August 18, 2020

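While digging into ART a while ago I came across the native bridge. In short, its main job is to let a device run shared libraries built for a different instruction set (for example, an x86 device running an ARM app), and on ARM devices it is usually disabled. After studying it for a bit I realized it is a great place to tamper with things, and since Riru, which I use myself, was being targeted, I ended up writing this post. The sample code is on GitHub: NbInjection. Now let's talk about this little toy.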

Source code analysis

As everyone knows, the executable behind zygote is app_process. Its main function looks like this (trimmed):

int main(int argc, char* const argv[])
{
    AppRuntime runtime(argv[0], computeArgBlockSize(argc, argv));
    // Process command line arguments
    // ignore argv[0]
    argc--;
    argv++;

    if (zygote) {
        runtime.start("com.android.internal.os.ZygoteInit", args, zygote);
    } else if (className) {
        runtime.start("com.android.internal.os.RuntimeInit", args, zygote);
    } else {
        fprintf(stderr, "Error: no class name or --zygote supplied.\n");
        app_usage();
        LOG_ALWAYS_FATAL("app_process: no class name or --zygote supplied.");
    }
}

AppRuntime inherits from AndroidRuntime, and the relevant AndroidRuntime code looks roughly like this:

/*
 * Start the Android runtime.  This involves starting the virtual machine
 * and calling the "static void main(String[] args)" method in the class
 * named by "className".
 *
 * Passes the main function two arguments, the class name and the specified
 * options string.
 */
void AndroidRuntime::start(const char* className, const Vector<String8>& options, bool zygote)
{
    ALOGD(">>>>>> START %s uid %d <<<<<<\n",
            className != NULL ? className : "(unknown)", getuid());

    /* start the virtual machine */
    JniInvocation jni_invocation;
    jni_invocation.Init(NULL);
    JNIEnv* env;
    if (startVm(&mJavaVM, &env, zygote, primary_zygote) != 0) {
        return;
    }
    onVmCreated(env);

    /*
     * Register android functions.
     */
    if (startReg(env) < 0) {
        ALOGE("Unable to register all android natives\n");
        return;
    }

    // ...
}

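The most important thing this function does is bring up the virtual machine (startVm) and then call the main method of the class that was passed in.
If you follow startVm you will find that it ends up in Runtime::Init, which initializes the runtime. That function is long, so here is only the part that matters most to us: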

bool Runtime::Init(RuntimeArgumentMap&& runtime_options_in) {
  // ...
  // Look for a native bridge.
  //
  // The intended flow here is, in the case of a running system:
  //
  // Runtime::Init() (zygote):
  //   LoadNativeBridge -> dlopen from cmd line parameter.
  //  |
  //  V
  // Runtime::Start() (zygote):
  //   No-op wrt native bridge.
  //  |
  //  | start app
  //  V
  // DidForkFromZygote(action)
  //   action = kUnload -> dlclose native bridge.
  //   action = kInitialize -> initialize library
  //
  //
  // The intended flow here is, in the case of a simple dalvikvm call:
  //
  // Runtime::Init():
  //   LoadNativeBridge -> dlopen from cmd line parameter.
  //  |
  //  V
  // Runtime::Start():
  //   DidForkFromZygote(kInitialize) -> try to initialize any native bridge given.
  //   No-op wrt native bridge.
  {
    std::string native_bridge_file_name = runtime_options.ReleaseOrDefault(Opt::NativeBridge);
    is_native_bridge_loaded_ = LoadNativeBridge(native_bridge_file_name);
  }
  // ...
}

Runtime::Init loads the native bridge. LoadNativeBridge() is implemented like this:

bool LoadNativeBridge(const char* nb_library_filename,
                      const NativeBridgeRuntimeCallbacks* runtime_cbs) {
  // We expect only one place that calls LoadNativeBridge: Runtime::Init. At that point we are not
  // multi-threaded, so we do not need locking here.

  if (nb_library_filename == nullptr || *nb_library_filename == 0) {
    CloseNativeBridge(false);
    return false;
  } else {
    if (!NativeBridgeNameAcceptable(nb_library_filename)) {
      CloseNativeBridge(true);
    } else {
      // Try to open the library.
      void* handle = dlopen(nb_library_filename, RTLD_LAZY);
      if (handle != nullptr) {
        callbacks = reinterpret_cast<NativeBridgeCallbacks*>(dlsym(handle,
                                                                   kNativeBridgeInterfaceSymbol));
        if (callbacks != nullptr) {
          if (isCompatibleWith(NAMESPACE_VERSION)) {
            // Store the handle for later.
            native_bridge_handle = handle;
          } else {
            callbacks = nullptr;
            dlclose(handle);
            ALOGW("Unsupported native bridge interface.");
          }
        } else {
          dlclose(handle);
        }
      }

      // Two failure conditions: could not find library (dlopen failed), or could not find native
      // bridge interface (dlsym failed). Both are an error and close the native bridge.
      if (callbacks == nullptr) {
        CloseNativeBridge(true);
      } else {
        runtime_callbacks = runtime_cbs;
        state = NativeBridgeState::kOpened;
      }
    }
    return state == NativeBridgeState::kOpened;
  }
}

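See it? It's our old friend dlopen! dlopen runs every function in the target library's .init_array, and getting your own function into .init_array only takes an __attribute__((constructor)) declaration. No difficulty at all!
Hey, calm down for a moment. There is still one question we haven't answered: where does this native bridge come from? The answer is simple; look back at AndroidRuntime::startVm() and it becomes clear: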
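To make the .init_array point concrete, here is a minimal sketch of a constructor function. The names OnLibraryLoaded and ConstructorRan are made up for illustration and are not taken from NbInjection.

```cpp
#include <cstdio>

// Flipped by the constructor so that its execution can be observed later.
static bool g_ctor_ran = false;

// The compiler records a pointer to this function in .init_array, so the
// dynamic linker calls it while the library (or executable) is being loaded,
// before dlopen() returns to its caller.
__attribute__((constructor))
static void OnLibraryLoaded() {
    g_ctor_ran = true;
    std::fprintf(stderr, "constructor ran inside the loading process\n");
}

// Hypothetical helper: reports whether the constructor has already run.
bool ConstructorRan() { return g_ctor_ran; }
```

When this is built as a shared library and named as the native bridge, the constructor body would run inside zygote as a side effect of the dlopen call in LoadNativeBridge.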

/*
 * Start the Dalvik Virtual Machine.
 *
 * Various arguments, most determined by system properties, are passed in.
 * The "mOptions" vector is updated.
 *
 * CAUTION: when adding options in here, be careful not to put the
 * char buffer inside a nested scope.  Adding the buffer to the
 * options using mOptions.add() does not copy the buffer, so if the
 * buffer goes out of scope the option may be overwritten.  It's best
 * to put the buffer at the top of the function so that it is more
 * unlikely that someone will surround it in a scope at a later time
 * and thus introduce a bug.
 *
 * Returns 0 on success.
 */
int AndroidRuntime::startVm(JavaVM** pJavaVM, JNIEnv** pEnv, bool zygote, bool primary_zygote)
{
    JavaVMInitArgs initArgs;
    // ...

    // Native bridge library. "0" means that native bridge is disabled.
    //
    // Note: bridging is only enabled for the zygote. Other runs of
    //       app_process may not have the permissions to mount etc.
    property_get("ro.dalvik.vm.native.bridge", propBuf, "");
    if (propBuf[0] == '\0') {
        ALOGW("ro.dalvik.vm.native.bridge is not expected to be empty");
    } else if (zygote && strcmp(propBuf, "0") != 0) {
        snprintf(nativeBridgeLibrary, sizeof("-XX:NativeBridge=") + PROPERTY_VALUE_MAX,
                 "-XX:NativeBridge=%s", propBuf);
        addOption(nativeBridgeLibrary);
    }
    // ...
    initArgs.version = JNI_VERSION_1_4;
    initArgs.options = mOptions.editArray();
    initArgs.nOptions = mOptions.size();
    initArgs.ignoreUnrecognized = JNI_FALSE;

    /*
     * Initialize the VM.
     *
     * The JavaVM* is essentially per-process, and the JNIEnv* is per-thread.
     * If this call succeeds, the VM is ready, and we can start issuing
     * JNI calls.
     */
    if (JNI_CreateJavaVM(pJavaVM, pEnv, &initArgs) < 0) {
        ALOGE("JNI_CreateJavaVM failed\n");
        return -1;
    }

    return 0;
}

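So it reads the system property ro.dalvik.vm.native.bridge. But wait: the property name starts with ro., which means it is read-only and cannot be changed once it has been set... Another problem is that this property is defined in default.prop rather than the usual build.prop. That file can't simply be edited, and it is re-read on every boot, so what's the point? Bye...
Hold on! Who says only the vendor can change this property?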
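The decision startVm() makes about the property can be condensed into a small pure function. This is only a sketch of the checks quoted above; NativeBridgeOption is a hypothetical name, and the real code writes into a fixed buffer instead of returning a string.

```cpp
#include <string>

// Hypothetical helper mirroring the checks in AndroidRuntime::startVm():
// the option is passed to the VM only for zygote, and only when the
// property is neither empty nor "0" (which means "disabled").
std::string NativeBridgeOption(const std::string& prop_value, bool zygote) {
    if (prop_value.empty()) {
        return "";  // startVm() only logs a warning in this case
    }
    if (!zygote || prop_value == "0") {
        return "";  // no native bridge for this process
    }
    return "-XX:NativeBridge=" + prop_value;
}
```

So whatever library name ends up in ro.dalvik.vm.native.bridge is what zygote's runtime will try to dlopen.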

Exploitation

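My test device is a Google Pixel 3 (Android 10, Magisk 20.4). Since Magisk is available I simply packaged everything as a Magisk module. Without Magisk you can consider modifying ramdisk.img (this also works on emulators) and changing ro.dalvik.vm.native.bridge in default.prop to the file name of our .so (note that the file must live under the system lib directories).
Let's assume your environment is set up and move on:
Write a function, put your code in it, add __attribute__((constructor)), compile, drop the result into /system/lib64 and /system/lib, set ro.dalvik.vm.native.bridge to our file name, reboot, success, the end, confetti...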

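Of course it can't be that easy. At this point your code has indeed been injected into the zygote process, but there are still a few problems to deal with. Let's go through them.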

The system's original native bridge gets overridden

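The native bridge is basically useless on ARM devices, but on x86 devices you cannot use ARM-only apps without it, which means you can't even use WeChat...
To solve this we have to go back to the source and see how the system calls the functions in the native bridge: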

void* NativeBridgeGetTrampoline(void* handle, const char* name, const char* shorty,
                                uint32_t len) {
  if (NativeBridgeInitialized()) {
    return callbacks->getTrampoline(handle, name, shorty, len);
  }
  return nullptr;
}

It goes through a global variable called callbacks. Let's see what callbacks is:

// Native bridge interfaces to runtime.
struct NativeBridgeCallbacks {
  // Version number of the interface.
  uint32_t version;

  bool (*initialize)(const struct NativeBridgeRuntimeCallbacks* runtime_cbs,
                     const char* private_dir, const char* instruction_set);

  void* (*loadLibrary)(const char* libpath, int flag);

  void* (*getTrampoline)(void* handle, const char* name, const char* shorty, uint32_t len);
  // ...
};

// Pointer to the callbacks. Available as soon as LoadNativeBridge succeeds, but only initialized
// later.
static const NativeBridgeCallbacks* callbacks = nullptr;

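So it is a pointer to a NativeBridgeCallbacks. The NativeBridgeCallbacks struct contains function pointers, and at run time the system looks up the matching function pointer and calls it.
Where is this variable initialized?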

// The symbol name exposed by native-bridge with the type of NativeBridgeCallbacks.
static constexpr const char* kNativeBridgeInterfaceSymbol = "NativeBridgeItf";

bool LoadNativeBridge(const char* nb_library_filename,
                      const NativeBridgeRuntimeCallbacks* runtime_cbs) {
  // ... (file name checks omitted)
  // Try to open the library.
  void* handle = dlopen(nb_library_filename, RTLD_LAZY);
  if (handle != nullptr) {
    callbacks = reinterpret_cast<NativeBridgeCallbacks*>(dlsym(handle,
                                                               kNativeBridgeInterfaceSymbol));
    if (callbacks != nullptr) {
      if (isCompatibleWith(NAMESPACE_VERSION)) {
        // Store the handle for later.
        native_bridge_handle = handle;
      } else {
        callbacks = nullptr;
        dlclose(handle);
        ALOGW("Unsupported native bridge interface.");
      }
    } else {
      dlclose(handle);
    }
  }
  // ... (state update omitted)
  return state == NativeBridgeState::kOpened;
}

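It is looked up in the native bridge .so, under the symbol NativeBridgeItf.
Since that is how the system does it, we'll play along and swap things out at the right moment.
First, declare a variable named NativeBridgeItf with the matching type: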

__attribute__ ((visibility ("default"))) NativeBridgeCallbacks NativeBridgeItf;

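Note: if you are using C++, remember to add extern "C".
Then, when the system dlopens our library, the functions in .init_array run, and that is where we can tamper with things: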

if (real_nb_filename[0] == '\0') {
    LOGW("ro.dalvik.vm.native.bridge is not expected to be empty");
} else if (strcmp(real_nb_filename, "0") != 0) {
    LOGI("The system has real native bridge support, libname %s", real_nb_filename);
    const char* error_msg;
    void* handle = dlopen(real_nb_filename, RTLD_LAZY);
    if (handle) {
        void* real_nb_itf = dlsym(handle, "NativeBridgeItf");
        if (real_nb_itf) {
            // sizeof(NativeBridgeCallbacks) may differ on other Android versions
            memcpy(&NativeBridgeItf, real_nb_itf, sizeof(NativeBridgeCallbacks));
            // Keep the handle open: the copied function pointers point into the real library.
            return;
        }
        error_msg = dlerror();
        dlclose(handle);
    } else {
        error_msg = dlerror();
    }
    LOGE("Could not setup NativeBridgeItf for real lib %s: %s", real_nb_filename, error_msg);
}

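A quick explanation: the system reads our NativeBridgeItf variable to find the functions it should call. So we can imitate the system: read the same variable from the real native bridge and use it to overwrite the NativeBridgeItf we export. After that, calls go through the real native bridge callbacks.
Note: there is another pitfall here. The size of the NativeBridgeCallbacks struct differs between system versions, so copying a fixed size either copies too little or reads out of bounds. You need to pick the size according to the version.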
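One possible way to handle the size pitfall is to copy only the fields that exist for the interface version the real bridge reports. The sketch below uses a simplified stand-in struct (CallbacksLike) rather than the real NativeBridgeCallbacks, and the per-version pointer counts are my own assumption that should be checked against native_bridge.h for each target release.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified stand-in for NativeBridgeCallbacks: a version field followed by
// function-pointer slots. 16 slots is just enough room for this illustration.
struct CallbacksLike {
    uint32_t version;
    void* slots[16];
};

// ASSUMPTION: number of function pointers defined by each interface version.
// Verify these against native_bridge.h in the source tree you target.
std::size_t SlotCountForVersion(uint32_t version) {
    switch (version) {
        case 1: return 5;
        case 2: return 7;
        case 3: return 14;
        default: return 0;  // unknown version: copy no slots rather than guess
    }
}

// Copies the version field plus only the slots that exist for src->version.
// Returns the number of bytes copied.
std::size_t CopyCallbacks(CallbacksLike* dst, const CallbacksLike* src) {
    std::size_t bytes = offsetof(CallbacksLike, slots) +
                        SlotCountForVersion(src->version) * sizeof(void*);
    std::memcpy(dst, src, bytes);
    return bytes;
}
```

Keying the size on the version field of the source struct avoids reading past the end of an older, smaller struct exported by the real bridge.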

Unable to stay resident in memory

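You happily finish your code, run it, and hit all sorts of strange bugs. After debugging it N times you finally notice that the .so you wrote has vanished from memory at some point??
Let's look at the system's LoadNativeBridge again: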

void* handle = dlopen(nb_library_filename, RTLD_LAZY);
if (handle != nullptr) {
    callbacks = reinterpret_cast<NativeBridgeCallbacks*>(dlsym(handle, kNativeBridgeInterfaceSymbol));
    if (callbacks != nullptr) {
      if (isCompatibleWith(NAMESPACE_VERSION)) {
        // Store the handle for later.
        native_bridge_handle = handle;
      } else {
        callbacks = nullptr;
        dlclose(handle);
        ALOGW("Unsupported native bridge interface.");
      }
    } else {
      dlclose(handle);
    }
}

If isCompatibleWith returns false, our .so gets closed.

// The policy of invoking Nativebridge changed in v3 with/without namespace.
// Suggest Nativebridge implementation not maintain backward-compatible.
static bool isCompatibleWith(const uint32_t version) {
  // Libnativebridge is now designed to be forward-compatible. So only "0" is an unsupported
  // version.
  if (callbacks == nullptr || callbacks->version == 0 || version == 0) {
    return false;
  }

  // If this is a v2+ bridge, it may not be forwards- or backwards-compatible. Check.
  if (callbacks->version >= SIGNAL_VERSION) {
    return callbacks->isCompatibleWith(version);
  }

  return true;
}

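The decision is made from callbacks->version and the callbacks->isCompatibleWith function pointer.
So when the system has no native bridge, we need to set these up ourselves. (If the system does have one, NativeBridgeItf has already been overwritten in the step above.)
You should fill in everything in callbacks to avoid other problems. Fortunately, empty implementations are enough for those functions. The thing to watch is the version: for example, 5.0 only accepts a v1 native bridge, while 7.0 only accepts v3 and above.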
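The compatibility check quoted above can be modeled in a few lines to see which combinations keep the library loaded. MiniCallbacks is a trimmed, hypothetical stand-in for the real struct, and kSignalVersion = 2 is my understanding of the SIGNAL_VERSION constant, so treat both as assumptions.

```cpp
#include <cstdint>

// Trimmed, hypothetical stand-in: only the two members the check looks at.
struct MiniCallbacks {
    uint32_t version;
    bool (*isCompatibleWith)(uint32_t bridge_version);
};

// ASSUMPTION: value of SIGNAL_VERSION in libnativebridge.
constexpr uint32_t kSignalVersion = 2;

// Empty-style stub of the kind the text describes: accepts any version.
bool AcceptAnyVersion(uint32_t /*bridge_version*/) { return true; }

// Mirrors the isCompatibleWith() logic shown above: false means the runtime
// would dlclose() the library right after loading it.
bool WouldStayLoaded(const MiniCallbacks* cb, uint32_t runtime_version) {
    if (cb == nullptr || cb->version == 0 || runtime_version == 0) {
        return false;
    }
    if (cb->version >= kSignalVersion) {
        return cb->isCompatibleWith(runtime_version);
    }
    return true;
}
```

Note that for version 2 and above the function pointer is actually called, which is why the stub has to exist and cannot be left null.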

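With all of that in place, your .so can stay resident in the zygote process. However, you won't find it in app processes. That is because after a new process is forked, the system unloads the native bridge if it isn't needed: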

static void ZygoteHooks_nativePostForkChild(JNIEnv* env,
                                            jclass,
                                            jlong token,
                                            jint runtime_flags,
                                            jboolean is_system_server,
                                            jboolean is_zygote,
                                            jstring instruction_set) {
  // ...
  if (instruction_set != nullptr && !is_system_server) {
    ScopedUtfChars isa_string(env, instruction_set);
    InstructionSet isa = GetInstructionSetFromString(isa_string.c_str());
    Runtime::NativeBridgeAction action = Runtime::NativeBridgeAction::kUnload;
    if (isa != InstructionSet::kNone && isa != kRuntimeISA) {
      action = Runtime::NativeBridgeAction::kInitialize;
    }
    runtime->InitNonZygoteOrPostFork(env, is_system_server, is_zygote, action, isa_string.c_str());
  } else {
    runtime->InitNonZygoteOrPostFork(
        env,
        is_system_server,
        is_zygote,
        Runtime::NativeBridgeAction::kUnload,
        /*isa=*/ nullptr,
        profile_system_server);
  }
}

void Runtime::InitNonZygoteOrPostFork(
    JNIEnv* env,
    bool is_system_server,
    // This is true when we are initializing a child-zygote. It requires
    // native bridge initialization to be able to run guest native code in
    // doPreload().
    bool is_child_zygote,
    NativeBridgeAction action,
    const char* isa,
    bool profile_system_server) {
  if (is_native_bridge_loaded_) {
    switch (action) {
      case NativeBridgeAction::kUnload:
        UnloadNativeBridge();
        is_native_bridge_loaded_ = false;
        break;
      case NativeBridgeAction::kInitialize:
        InitializeNativeBridge(env, isa);
        break;
    }
  }
  // ...
}
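The post-fork decision in the two functions above boils down to the following. This is a simplified model, not ART code: instruction sets are plain strings here, an empty string stands in for InstructionSet::kNone, and ChoosePostForkAction is a made-up name.

```cpp
#include <string>

enum class NativeBridgeAction { kUnload, kInitialize };

// Simplified model of ZygoteHooks_nativePostForkChild: the bridge is only
// initialized when a non-system-server child targets an instruction set
// that differs from the one the runtime itself was built for.
NativeBridgeAction ChoosePostForkAction(const char* app_isa,
                                        const std::string& runtime_isa,
                                        bool is_system_server) {
    if (app_isa == nullptr || is_system_server) {
        return NativeBridgeAction::kUnload;
    }
    const std::string isa(app_isa);
    if (!isa.empty() && isa != runtime_isa) {
        return NativeBridgeAction::kInitialize;
    }
    return NativeBridgeAction::kUnload;
}
```

For an ordinary app whose instruction set matches the runtime, the result is kUnload, which is why the injected library disappears from app processes.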

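This step is hard for us to interfere with, but we can change our approach: if the system wants to unload this .so, let it. We can already run arbitrary code in zygote, so write a new .so that holds the main logic, dlopen() it from the fake native bridge, and let the fake native bridge act as nothing more than a loader. That way we don't even have to implement all those functions. Just set version to an invalid value (such as 0); the system sees the invalid version and closes our fake native bridge by itself, and we don't need to worry about those callbacks ever being called~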
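A minimal sketch of the loader idea follows. It is not the code from NbInjection: the payload name libpayload_example.so, the MinimalCallbacks struct and the LoadPayload helper are all hypothetical, and a real build would use the full NativeBridgeCallbacks type.

```cpp
#include <dlfcn.h>
#include <cstdint>
#include <cstdio>

// Only the first field matters for this trick: version 0 fails the
// compatibility check, so the runtime closes the loader by itself.
struct MinimalCallbacks {
    uint32_t version;
};

extern "C" {
__attribute__((visibility("default")))
MinimalCallbacks NativeBridgeItf = {0};
}

// Hypothetical payload library name, used for illustration only.
static const char* kPayloadLib = "libpayload_example.so";

// Opens the payload and deliberately never dlclose()s the handle, so the
// payload stays mapped after the loader itself has been unloaded.
void* LoadPayload(const char* path) {
    void* handle = dlopen(path, RTLD_NOW);
    if (handle == nullptr) {
        const char* err = dlerror();
        std::fprintf(stderr, "payload load failed: %s\n", err ? err : "unknown error");
    }
    return handle;
}

// Runs from .init_array while the runtime is still inside dlopen().
__attribute__((constructor))
static void LoaderEntry() {
    LoadPayload(kPayloadLib);
}
```

Because the payload has its own handle, the dlclose() that the runtime performs on the fake bridge should not affect it.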

Summary

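The native bridge makes for a fairly simple way to inject into zygote. Getting it to work in practice takes some effort, but it is all grunt work, such as tracking the size of the NativeBridgeCallbacks struct on each version. I may use this in my Dreamland project later.
Here is the sample code link once more: NbInjection
QQ group: 949888394, come and hang out~
This post may have omissions, and there may be better approaches. Discussion is welcome~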
