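While digging into ART a while ago I came across the native bridge. In short, its main job is to let a device run shared libraries (so files) built for a different instruction set (for example, an x86 device running an ARM app), and on ARM devices it is usually turned off. After studying it for a bit I found it is a pretty convenient place to tamper with things, and since Riru, which I use myself, had just started to be targeted, this blog post came out of it. I have uploaded the sample code to GitHub as NbInjection; let's talk about this little toy.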

Source code analysis

As everyone knows, the executable behind zygote is app_process. Its main function looks like this (trimmed):

int main(int argc, char* const argv[])
{
    AppRuntime runtime(argv[0], computeArgBlockSize(argc, argv));
    // Process command line arguments
    // ignore argv[0]
    argc--;
    argv++;

    if (zygote) {
        runtime.start("com.android.internal.os.ZygoteInit", args, zygote);
    } else if (className) {
        runtime.start("com.android.internal.os.RuntimeInit", args, zygote);
    } else {
        fprintf(stderr, "Error: no class name or --zygote supplied.\n");
        app_usage();
        LOG_ALWAYS_FATAL("app_process: no class name or --zygote supplied.");
    }
}

AppRuntime inherits from AndroidRuntime, whose code looks roughly like this:

/*
 * Start the Android runtime. This involves starting the virtual machine
 * and calling the "static void main(String[] args)" method in the class
 * named by "className".
 *
 * Passes the main function two arguments, the class name and the specified
 * options string.
 */
void AndroidRuntime::start(const char* className, const Vector<String8>& options, bool zygote)
{
    ALOGD(">>>>>> START %s uid %d <<<<<<\n",
            className != NULL ? className : "(unknown)", getuid());

    /* start the virtual machine */
    JniInvocation jni_invocation;
    jni_invocation.Init(NULL);
    JNIEnv* env;
    if (startVm(&mJavaVM, &env, zygote, primary_zygote) != 0) {
        return;
    }
    onVmCreated(env);

    /*
     * Register android functions.
     */
    if (startReg(env) < 0) {
        ALOGE("Unable to register all android natives\n");
        return;
    }

    // ...
}

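The most important thing this function does is start the virtual machine (startVm) and then call the main method of the class that was passed in.
Follow startVm and you will find that it ends up in Runtime::Init, which initializes the runtime. That function is long, so here is only the part that matters most to us: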

bool Runtime::Init(RuntimeArgumentMap&& runtime_options_in) {
  // ...
  // Look for a native bridge.
  //
  // The intended flow here is, in the case of a running system:
  //
  // Runtime::Init() (zygote):
  //   LoadNativeBridge -> dlopen from cmd line parameter.
  //  |
  //  V
  // Runtime::Start() (zygote):
  //   No-op wrt native bridge.
  //  |
  //  | start app
  //  V
  // DidForkFromZygote(action)
  //   action = kUnload -> dlclose native bridge.
  //   action = kInitialize -> initialize library
  //
  //
  // The intended flow here is, in the case of a simple dalvikvm call:
  //
  // Runtime::Init():
  //   LoadNativeBridge -> dlopen from cmd line parameter.
  //  |
  //  V
  // Runtime::Start():
  //   DidForkFromZygote(kInitialize) -> try to initialize any native bridge given.
  //   No-op wrt native bridge.
  {
    std::string native_bridge_file_name = runtime_options.ReleaseOrDefault(Opt::NativeBridge);
    is_native_bridge_loaded_ = LoadNativeBridge(native_bridge_file_name);
  }
  // ...
}

Runtime::Init loads the native bridge. LoadNativeBridge() is implemented like this:

bool LoadNativeBridge(const char* nb_library_filename,
                      const NativeBridgeRuntimeCallbacks* runtime_cbs) {
  // We expect only one place that calls LoadNativeBridge: Runtime::Init. At that point we are not
  // multi-threaded, so we do not need locking here.

  if (nb_library_filename == nullptr || *nb_library_filename == 0) {
    CloseNativeBridge(false);
    return false;
  } else {
    if (!NativeBridgeNameAcceptable(nb_library_filename)) {
      CloseNativeBridge(true);
    } else {
      // Try to open the library.
      void* handle = dlopen(nb_library_filename, RTLD_LAZY);
      if (handle != nullptr) {
        callbacks = reinterpret_cast<NativeBridgeCallbacks*>(dlsym(handle,
                                                                   kNativeBridgeInterfaceSymbol));
        if (callbacks != nullptr) {
          if (isCompatibleWith(NAMESPACE_VERSION)) {
            // Store the handle for later.
            native_bridge_handle = handle;
          } else {
            callbacks = nullptr;
            dlclose(handle);
            ALOGW("Unsupported native bridge interface.");
          }
        } else {
          dlclose(handle);
        }
      }

      // Two failure conditions: could not find library (dlopen failed), or could not find native
      // bridge interface (dlsym failed). Both are an error and close the native bridge.
      if (callbacks == nullptr) {
        CloseNativeBridge(true);
      } else {
        runtime_callbacks = runtime_cbs;
        state = NativeBridgeState::kOpened;
      }
    }
    return state == NativeBridgeState::kOpened;
  }
}

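Notice anything? It is our old friend dlopen! dlopen runs every function in the target library's .init_array, and getting your own function into .init_array takes nothing more than declaring it with __attribute__((constructor)). No difficulty at all!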
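
To make the constructor trick concrete, here is a minimal sketch. It is not taken from NbInjection; OnLibraryLoaded and ConstructorRan are names made up for illustration. Built as a shared library, the function runs inside dlopen(); built into an executable, it runs before main().

```cpp
#include <cassert>
#include <cstdio>

// Set by the constructor so that its effect can be observed afterwards.
static bool g_constructor_ran = false;

// The compiler records a pointer to this function in .init_array.
// For a shared library the dynamic linker calls it during dlopen(),
// before dlopen() returns; for an executable it runs before main().
__attribute__((constructor))
static void OnLibraryLoaded() {
    g_constructor_ran = true;
    std::fprintf(stderr, "constructor ran\n");
}

// Hypothetical helper, present only so the flag can be checked from outside.
bool ConstructorRan() { return g_constructor_ran; }
```

Nothing has to call OnLibraryLoaded explicitly, which is why a library that the system dlopens on our behalf is enough to get code running.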
Hey, calm down first. There is still one question we haven't answered: where does the native bridge name come from? The answer is simple; go back and look at AndroidRuntime::startVm():

/*
 * Start the Dalvik Virtual Machine.
 *
 * Various arguments, most determined by system properties, are passed in.
 * The "mOptions" vector is updated.
 *
 * CAUTION: when adding options in here, be careful not to put the
 * char buffer inside a nested scope. Adding the buffer to the
 * options using mOptions.add() does not copy the buffer, so if the
 * buffer goes out of scope the option may be overwritten. It's best
 * to put the buffer at the top of the function so that it is more
 * unlikely that someone will surround it in a scope at a later time
 * and thus introduce a bug.
 *
 * Returns 0 on success.
 */
int AndroidRuntime::startVm(JavaVM** pJavaVM, JNIEnv** pEnv, bool zygote, bool primary_zygote)
{
    JavaVMInitArgs initArgs;
    // ...

    // Native bridge library. "0" means that native bridge is disabled.
    //
    // Note: bridging is only enabled for the zygote. Other runs of
    //       app_process may not have the permissions to mount etc.
    property_get("ro.dalvik.vm.native.bridge", propBuf, "");
    if (propBuf[0] == '\0') {
        ALOGW("ro.dalvik.vm.native.bridge is not expected to be empty");
    } else if (zygote && strcmp(propBuf, "0") != 0) {
        snprintf(nativeBridgeLibrary, sizeof("-XX:NativeBridge=") + PROPERTY_VALUE_MAX,
                 "-XX:NativeBridge=%s", propBuf);
        addOption(nativeBridgeLibrary);
    }
    // ...
    initArgs.version = JNI_VERSION_1_4;
    initArgs.options = mOptions.editArray();
    initArgs.nOptions = mOptions.size();
    initArgs.ignoreUnrecognized = JNI_FALSE;

    /*
     * Initialize the VM.
     *
     * The JavaVM* is essentially per-process, and the JNIEnv* is per-thread.
     * If this call succeeds, the VM is ready, and we can start issuing
     * JNI calls.
     */
    if (JNI_CreateJavaVM(pJavaVM, pEnv, &initArgs) < 0) {
        ALOGE("JNI_CreateJavaVM failed\n");
        return -1;
    }

    return 0;
}

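So it reads the system property ro.dalvik.vm.native.bridge. Hold on: the property name starts with ro., which means it is read-only and cannot be changed once it has been set... Another problem is that it is defined in default.prop rather than the usual build.prop; that file can't be changed in place, and it is read again on every boot. So what is left to play with? Bye...
Wait! Who says only the vendor can change this property?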
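
The decision startVm() makes can be restated as a small pure function. This is only a sketch of the logic in the excerpt above and assumes nothing beyond it; BuildNativeBridgeOption and libnb.so are hypothetical names, not AOSP ones.

```cpp
#include <cassert>
#include <string>

// Returns the VM option that startVm() would add for the given property
// value, or an empty string when no native bridge is requested.
std::string BuildNativeBridgeOption(const std::string& prop_value, bool zygote) {
    if (prop_value.empty()) {
        return std::string();  // startVm() only logs a warning here
    }
    if (zygote && prop_value != "0") {
        return "-XX:NativeBridge=" + prop_value;
    }
    return std::string();  // "0" disables the bridge; non-zygote runs skip it
}
```

For example, BuildNativeBridgeOption("libnb.so", true) gives "-XX:NativeBridge=libnb.so". The part after the = is what Runtime::Init later hands to LoadNativeBridge(), so whoever controls the property controls which library zygote dlopens.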

Exploitation

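My test device is a Google Pixel 3 (Android 10, Magisk 20.4). Since Magisk is available I simply wrote this as a Magisk module; without Magisk you could modify ramdisk.img instead (this also works for emulators) and change ro.dalvik.vm.native.bridge in default.prop to the file name of our so (note that the file must be under the system lib directory).
Let's assume you have the environment set up and carry on:
Write a function, put your code in it, add __attribute__((constructor)), compile, drop the result into /system/lib64 and /system/lib, set ro.dalvik.vm.native.bridge to our file name, reboot, success, the end...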

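Of course it can't be that easy. At this point your code has been injected into the zygote process, but a few problems still need handling. Let's go through them.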

The system's original native bridge gets replaced

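The native bridge is basically useless on ARM devices, but on an x86 device you can't use ARM-only apps without it, which means you can't even run WeChat...
To solve this we again have to read the source and see how the system calls the functions in the native bridge: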

void* NativeBridgeGetTrampoline(void* handle, const char* name, const char* shorty,
                                uint32_t len) {
  if (NativeBridgeInitialized()) {
    return callbacks->getTrampoline(handle, name, shorty, len);
  }
  return nullptr;
}

It goes through a global variable called callbacks. Let's see what callbacks is:

// Native bridge interfaces to runtime.
struct NativeBridgeCallbacks {
  // Version number of the interface.
  uint32_t version;

  bool (*initialize)(const struct NativeBridgeRuntimeCallbacks* runtime_cbs,
                     const char* private_dir, const char* instruction_set);

  void* (*loadLibrary)(const char* libpath, int flag);

  void* (*getTrampoline)(void* handle, const char* name, const char* shorty, uint32_t len);
  // ...
};

// Pointer to the callbacks. Available as soon as LoadNativeBridge succeeds, but only initialized
// later.
static const NativeBridgeCallbacks* callbacks = nullptr;

So it is a pointer to a NativeBridgeCallbacks, a struct that holds function pointers; at run time the system picks the matching function pointer and calls it.
Where is this variable initialized?

// The symbol name exposed by native-bridge with the type of NativeBridgeCallbacks.
static constexpr const char* kNativeBridgeInterfaceSymbol = "NativeBridgeItf";

bool LoadNativeBridge(const char* nb_library_filename,
                      const NativeBridgeRuntimeCallbacks* runtime_cbs) {
  // ...
  // Try to open the library.
  void* handle = dlopen(nb_library_filename, RTLD_LAZY);
  if (handle != nullptr) {
    callbacks = reinterpret_cast<NativeBridgeCallbacks*>(dlsym(handle,
                                                               kNativeBridgeInterfaceSymbol));
    if (callbacks != nullptr) {
      if (isCompatibleWith(NAMESPACE_VERSION)) {
        // Store the handle for later.
        native_bridge_handle = handle;
      } else {
        callbacks = nullptr;
        dlclose(handle);
        ALOGW("Unsupported native bridge interface.");
      }
    } else {
      dlclose(handle);
    }
  }
  // ...
  return state == NativeBridgeState::kOpened;
}

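It is looked up in the native bridge's so, under the symbol NativeBridgeItf.
Since that is how the system does it, we go along with the system and swap things out at the right moment.
First, declare a variable NativeBridgeItf of the matching type: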

__attribute__ ((visibility ("default"))) NativeBridgeCallbacks NativeBridgeItf;

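Note: if you are using C++, remember to add extern "C".
Then, when the system dlopens our library, the functions in .init_array are executed, and that is where we can do our tampering: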

if (real_nb_filename[0] == '\0') {
    LOGW("ro.dalvik.vm.native.bridge is not expected to be empty");
} else if (strcmp(real_nb_filename, "0") != 0) {
    LOGI("The system has real native bridge support, libname %s", real_nb_filename);
    const char* error_msg;
    void* handle = dlopen(real_nb_filename, RTLD_LAZY);
    if (handle) {
        void* real_nb_itf = dlsym(handle, "NativeBridgeItf");
        if (real_nb_itf) {
            // sizeof(NativeBridgeCallbacks) may differ on other Android versions
            memcpy(&NativeBridgeItf, real_nb_itf, sizeof(NativeBridgeCallbacks));
            return;
        }
        error_msg = dlerror();
        dlclose(handle);
    } else {
        error_msg = dlerror();
    }
    LOGE("Could not setup NativeBridgeItf for real lib %s: %s", real_nb_filename, error_msg);
}

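A brief explanation: the system reads our NativeBridgeItf variable to find the functions to execute, so we can imitate the system, read the same variable from the real native bridge, and overwrite the NativeBridgeItf we export. The calls then go to the real native bridge callbacks.
Note: there is one more pitfall. The size of the NativeBridgeCallbacks struct differs between system versions, so copying a fixed size either copies too little or runs out of bounds; the copy size has to be chosen per version.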
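
One way to pick the copy size is to key it on the version field of the real bridge, which is the first member of the struct. The sketch below is not necessarily how NbInjection does it. CallbacksBlob, SlotsForVersion, BytesForVersion and CopyCallbacks are hypothetical names, and the per-version pointer counts are an assumption based on my reading of AOSP's native_bridge.h up to Android 10; check them against the header of every release you target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Capacity of the destination, in function-pointer slots.
constexpr std::size_t kMaxSlots = 16;

// Stand-in for the exported NativeBridgeItf: a version field followed by
// function pointers, kept as opaque slots because we only forward them.
struct CallbacksBlob {
    std::uint32_t version;
    void* slots[kMaxSlots];
};

// ASSUMPTION: cumulative number of function pointers per interface version
// (v1..v5). Verify against native_bridge.h before relying on these numbers.
inline std::size_t SlotsForVersion(std::uint32_t version) {
    switch (version) {
        case 0: return 0;
        case 1: return 5;
        case 2: return 7;
        case 3: return 14;
        case 4: return 15;
        default: return kMaxSlots;  // v5; newer bridges need a larger table
    }
}

// Number of bytes that exist in a bridge struct of the given version.
inline std::size_t BytesForVersion(std::uint32_t version) {
    return offsetof(CallbacksBlob, slots) + SlotsForVersion(version) * sizeof(void*);
}

// Copies only the members the real bridge actually has; the rest stay null.
inline void CopyCallbacks(CallbacksBlob* dst, const void* real_itf) {
    std::uint32_t version;
    std::memcpy(&version, real_itf, sizeof(version));  // version is the first field
    std::memset(dst, 0, sizeof(*dst));
    std::memcpy(dst, real_itf, BytesForVersion(version));
}
```

In the snippet above, the fixed-size memcpy would become a call like CopyCallbacks(&itf, real_nb_itf). Reading the version first means the copy never reads past the end of an older, smaller struct.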

Unable to stay resident in memory

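You excitedly finish the code, run it, and hit all kinds of weird bugs. After N rounds of debugging you finally realize that at some point your so has vanished from memory??
Let's look at the system's LoadNativeBridge again: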

void* handle = dlopen(nb_library_filename, RTLD_LAZY);
if (handle != nullptr) {
  callbacks = reinterpret_cast<NativeBridgeCallbacks*>(dlsym(handle, kNativeBridgeInterfaceSymbol));
  if (callbacks != nullptr) {
    if (isCompatibleWith(NAMESPACE_VERSION)) {
      // Store the handle for later.
      native_bridge_handle = handle;
    } else {
      callbacks = nullptr;
      dlclose(handle);
      ALOGW("Unsupported native bridge interface.");
    }
  } else {
    dlclose(handle);
  }
}

If isCompatibleWith returns false, our so gets closed.

// The policy of invoking Nativebridge changed in v3 with/without namespace.
// Suggest Nativebridge implementation not maintain backward-compatible.
static bool isCompatibleWith(const uint32_t version) {
  // Libnativebridge is now designed to be forward-compatible. So only "0" is an unsupported
  // version.
  if (callbacks == nullptr || callbacks->version == 0 || version == 0) {
    return false;
  }

  // If this is a v2+ bridge, it may not be forwards- or backwards-compatible. Check.
  if (callbacks->version >= SIGNAL_VERSION) {
    return callbacks->isCompatibleWith(version);
  }

  return true;
}

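The decision is based on callbacks->version and the callbacks->isCompatibleWith function pointer.
So when the system has no native bridge we need to set these ourselves. (If the system does have one, NativeBridgeItf has already been overwritten above.)
You should fill in everything in callbacks to avoid other problems. Fortunately, empty implementations are enough for those functions. What needs care is the version: 5.0, for example, accepts only a v1 native bridge, while 7.0 accepts only v3 and above.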
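
The effect of the version field and of an empty isCompatibleWith can be tried out in isolation. This is a sketch under assumptions: MiniCallbacks is a cut-down stand-in holding only the two members the check reads, SystemAccepts restates the isCompatibleWith() excerpt above with the callbacks passed as a parameter, and the values 2 and 3 for SIGNAL_VERSION and NAMESPACE_VERSION are assumed rather than taken from the excerpt.

```cpp
#include <cassert>
#include <cstdint>

// Cut-down stand-in for NativeBridgeCallbacks; the real struct has more members.
struct MiniCallbacks {
    std::uint32_t version;
    bool (*isCompatibleWith)(std::uint32_t bridge_version);
};

constexpr std::uint32_t kSignalVersion = 2;     // SIGNAL_VERSION (assumed value)
constexpr std::uint32_t kNamespaceVersion = 3;  // NAMESPACE_VERSION (assumed value)

// Empty implementation: claim compatibility with whatever the runtime asks for.
static bool StubIsCompatibleWith(std::uint32_t /*bridge_version*/) {
    return true;
}

// Restatement of the system-side check quoted above.
bool SystemAccepts(const MiniCallbacks* cb, std::uint32_t version) {
    if (cb == nullptr || cb->version == 0 || version == 0) {
        return false;
    }
    if (cb->version >= kSignalVersion) {
        return cb->isCompatibleWith(version);
    }
    return true;
}

// A bridge the system keeps loaded, and one it closes immediately.
const MiniCallbacks kResident = {kNamespaceVersion, StubIsCompatibleWith};
const MiniCallbacks kRejected = {0, nullptr};
```

With these definitions SystemAccepts(&kResident, kNamespaceVersion) is true and SystemAccepts(&kRejected, kNamespaceVersion) is false. Note that a version of 0 is rejected before any function pointer is touched.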

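With all that set up, your so stays resident in the zygote process. However, you won't find it in app processes, because after a new process is forked the system unloads the native bridge if it isn't needed: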

static void ZygoteHooks_nativePostForkChild(JNIEnv* env,
                                            jclass,
                                            jlong token,
                                            jint runtime_flags,
                                            jboolean is_system_server,
                                            jboolean is_zygote,
                                            jstring instruction_set) {
  // ...
  if (instruction_set != nullptr && !is_system_server) {
    ScopedUtfChars isa_string(env, instruction_set);
    InstructionSet isa = GetInstructionSetFromString(isa_string.c_str());
    Runtime::NativeBridgeAction action = Runtime::NativeBridgeAction::kUnload;
    if (isa != InstructionSet::kNone && isa != kRuntimeISA) {
      action = Runtime::NativeBridgeAction::kInitialize;
    }
    runtime->InitNonZygoteOrPostFork(env, is_system_server, is_zygote, action, isa_string.c_str());
  } else {
    runtime->InitNonZygoteOrPostFork(
        env,
        is_system_server,
        is_zygote,
        Runtime::NativeBridgeAction::kUnload,
        /*isa=*/ nullptr,
        profile_system_server);
  }
}

void Runtime::InitNonZygoteOrPostFork(
    JNIEnv* env,
    bool is_system_server,
    // This is true when we are initializing a child-zygote. It requires
    // native bridge initialization to be able to run guest native code in
    // doPreload().
    bool is_child_zygote,
    NativeBridgeAction action,
    const char* isa,
    bool profile_system_server) {
  if (is_native_bridge_loaded_) {
    switch (action) {
      case NativeBridgeAction::kUnload:
        UnloadNativeBridge();
        is_native_bridge_loaded_ = false;
        break;
      case NativeBridgeAction::kInitialize:
        InitializeNativeBridge(env, isa);
        break;
    }
  }
  // ...
}

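This step is hard for us to interfere with, but we can change our approach: if the system wants to unload this so, let it. We can already run arbitrary code in zygote, so write a new so that holds the main logic, dlopen() it from the fake native bridge, and let the fake native bridge act purely as a loader! Better still, this way we don't have to implement that pile of functions at all. Just set version to an invalid value (such as 0); the system sees an invalid version and closes our fake native bridge by itself, so there is no need to worry about those callbacks being invoked~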
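
The loader idea can be sketched in a few lines. This is a minimal illustration, not the NbInjection source: LoaderItf is a cut-down stand-in for NativeBridgeCallbacks (only the leading version field, which is all the version check reads when it is 0), and kPayloadPath, LoadPayload and LoaderEntry are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <dlfcn.h>

// Hypothetical location of the library that holds the real logic.
static const char kPayloadPath[] = "/system/lib64/libpayload.so";

// Stand-in for NativeBridgeCallbacks. Version 0 is rejected by
// isCompatibleWith(), so the system dlclose()s this loader right away.
struct LoaderItf {
    std::uint32_t version;
};

extern "C" {
__attribute__((visibility("default"))) LoaderItf NativeBridgeItf = {0};
}

// Loads the payload and deliberately never dlclose()s it, so it stays
// mapped after the loader itself has been unloaded.
bool LoadPayload(const char* path) {
    void* handle = dlopen(path, RTLD_NOW);
    if (handle == nullptr) {
        std::fprintf(stderr, "payload load failed: %s\n", dlerror());
        return false;
    }
    return true;
}

// Runs while the system is still inside dlopen() of the fake native bridge.
__attribute__((constructor))
static void LoaderEntry() {
    LoadPayload(kPayloadPath);
}
```

Because the payload is a separate library with its own handle, closing the fake native bridge does not unmap it, and forked app processes inherit the mapping from zygote.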

Summary

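The native bridge gives a fairly simple way to inject into zygote. Using it for real takes some effort, but it is all grunt work, such as tracking the size of the NativeBridgeCallbacks struct in each version. I may use this in my Dreamland later.
Here is the sample code link once more: NbInjection
QQ group: 949888394, come and play~
The article may have omissions, and there may be better approaches; discussion is welcome~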