Hexagon DSP — CPU offload

Milind Deore
Mar 8, 2017


DISCLAIMER: If you plan to create your own .so, make sure it is signed by the device manufacturer; otherwise it won’t work.

Snapdragon includes a Hexagon DSP to which math-intensive work can be offloaded, so that it runs at low power with good performance. This guide gives details of building Hexagon DSP applications on Android.

Snapdragon has three processors running inside the SoC, namely:

  1. Krait CPU: general-purpose processor that usually runs Android applications.
  2. Adreno GPU: largely used for graphics processing such as rendering.
  3. Hexagon DSP: specially designed for multimedia acceleration. The CPU can offload tasks to the DSP to save energy while keeping good performance.

Hexagon has two instances: the modem DSP (mDSP), dedicated to modem processing, and the application DSP (aDSP), used for multimedia applications. With Hexagon V4 and V5, the aDSP was expanded to support image processing for camera and video. Computer vision is also accelerated through a built-in library called FastCV. The Snapdragon internal architecture is not publicly known, but judging by the diagram below, the north bridge (memory controller) and south bridge (I/O controller) appear to be connected over a fabric, which makes for a very fast interconnect. How DMA operates on the latest 800 series of Snapdragon is unknown, though.

Both instances of Hexagon are connected to the north bridge over the system fabric. The current Linux kernel (Android OS) has a driver for ION buffers, which enables zero-copy transfers when the CPU offloads work to the DSP.

Implementing an application on Android using the FastCV framework

FastCV is an optimized computer vision library created specifically for Snapdragon and the Hexagon DSP. In fact, quite a few ops are hand-crafted for performance. Depending on the implementation, FastCV can run on the CPU or the aDSP.

On CPU alone

On aDSP

To accelerate a computer vision application, the CPU can offload work to the aDSP. Pictorially, it looks like the following:

Before we get into the details of the implementation, there are some key concepts that we need to understand:

FastRPC

RPC gives wings to the application, because the CPU can now offload more complex logic to the aDSP (for example arithmetic calculation, machine learning, image processing, FastCV) while it handles generic tasks. A Remote Procedure Call (RPC) has some overhead because it must marshal and unmarshal parameters, but this is minuscule compared to the logic that can run on the DSP.

  1. Zero Copy — Only when using ION buffers.
  2. Ensures cache coherency for input and output buffers.
  3. All calls are blocking in nature.
  4. Each HLOS thread is handled on its own, unique thread on the aDSP.
  5. Threads are automatically cleaned up on HLOS process exit.

Client : User mode process that initiates the remote invocation

Stub : Auto generated code linked in with the user mode process that takes care of marshaling parameters

ADSPRPC Driver : ADSPRPC kernel driver that receives the remote invocation, queues them up and then waits for the response after signaling the remote side

ADSPRPC Framework : ADSPRPC framework dequeues the messages from the queue and dispatches them for processing

Skel : Auto generated code that takes care of unmarshaling parameters

Object : Method implementation. This is in the form of .so (shared object).

Design Consideration

FastRPC was designed to facilitate remote procedure calls between aDSP and APPS (the application processor).

  • aDSP and APPS share DRAM but not L2/L1 caches
  • aDSP has a small, limited number of physical mappings it can efficiently support

To support high-performance compute applications, memory has to be mapped directly from the application processor to the aDSP. Because the aDSP cannot support a large number of discontiguous physical pages, ION (a contiguous memory allocator for Android) is required.

By separating buffers into “in” and “rout” the driver can overlap cache synchronization between aDSP and APPS, cutting RPC latencies by 50%. “inrout” is not required because it is equivalent to passing the same buffer as in and rout.

The protocol is synchronous for these reasons:

  • It’s trivial for users to create a thread to add asynchronous behavior from userspace.
  • It would add additional complexity for the kernel to manage state between aDSP and APPS in an asynchronous call.
  • It would not get memory to the aDSP any faster; performance is dominated by the cache synchronization operations.

Dynamic Loading

The remote file system is used by the loader on the Hexagon aDSP to read shared object (.so) files, and we can have as many .so files as we want. The shared object files are stored on the HLOS’s file system, which in our case is Android. There is a specific default location where the aDSP looks for all the shared objects, and this path needs to be consistent across the HLOS.

  • On Android builds the remote file system is implicitly implemented by libadsprpc.so
  • It uses the calling process’s context to open files.
  • The default file system root directory is /system/vendor/lib/rfsa/adsp/ or /vendor/lib/rfsa/adsp/
  • The optional environment variable ADSP_LIBRARY_PATH can be used to override the default file system root directory. It contains a list of directories to search when dlw_Open(<library>) is called.
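To make the search order concrete, here is a small shell sketch of how a lookup across a list of directories could behave. It only imitates the idea on the host and is not the aDSP loader's code. The function name `resolve_adsp_lib` is hypothetical, and the `;` separator is an assumption for illustration, so check the Hexagon SDK documentation for the exact format of ADSP_LIBRARY_PATH.

```shell
#!/bin/sh
# Hypothetical helper: print the first match for a shared object name,
# searching ADSP_LIBRARY_PATH or, if it is unset, the default roots
# named in the text above.
# Assumption: directories in ADSP_LIBRARY_PATH are separated by ';'.
resolve_adsp_lib() {
    lib_name=$1
    search_path=${ADSP_LIBRARY_PATH:-"/system/vendor/lib/rfsa/adsp;/vendor/lib/rfsa/adsp"}

    old_ifs=$IFS
    IFS=';'
    # With IFS set to ';', the unquoted expansion splits into directories.
    for dir in $search_path; do
        if [ -f "$dir/$lib_name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$lib_name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

Calling `resolve_adsp_lib libexample_skel.so` (a made-up library name) prints the first matching path and returns non-zero when no directory contains the file, which mirrors the first-match-wins behaviour one would expect from a search path.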

Android Software Components

Application Processor

/system/vendor/lib/libadsprpc.so : Shared object library that needs to be linked with the user space application invoking the remote procedure call. This library interfaces with the kernel driver to initiate the remote invocation to the aDSP.

/system/vendor/lib/libcdsprpc.so : Shared object library that needs to be linked with the user space application invoking the remote procedure call. This library interfaces with the kernel driver to initiate the remote invocation to the cDSP.

aDSP Processor

adsp_proc/platform/fastrpc-smd/build/platform_libs/qdsp6/AAAAAAAA/fastrpc-smd.lib : ADSPRPC framework library that gets linked with the aDSP image. It acts as the transport, accepting remote invocations originating from the applications processor.

Setup Fast RPC

Follow these steps to set up the FastRPC driver.

Note: when making FastRPC calls, the remote file system is also handled by the FastRPC driver, so no additional setup is required.

Check if the adsprpc.ko driver is running

adb shell ls /dev/adsprpc-smd
# If the file does not exist, start the driver as follows:
adb root
adb wait-for-device
adb shell insmod /system/lib/modules/adsprpc.ko

Note: If the module is not found and /dev/adsprpc-smd is not present, this device doesn’t support FastRPC.

Re-check if the adsprpc.ko driver is running

adb shell ls /dev/adsprpc-smd

The FastRPC driver is now set up.
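The check-and-load sequence above can be wrapped in a small script. This is only a sketch under the assumptions used in this guide: the device node is /dev/adsprpc-smd and the module lives at /system/lib/modules/adsprpc.ko. The function names `fastrpc_node_present` and `ensure_fastrpc`, and the `ADB` override variable, are hypothetical conveniences and not part of any SDK.

```shell
#!/bin/sh
# Sketch: make sure the FastRPC device node exists, loading the module if needed.
# ADB is a hypothetical override so a different adb binary can be used.
ADB=${ADB:-adb}
FASTRPC_NODE=/dev/adsprpc-smd
FASTRPC_MODULE=/system/lib/modules/adsprpc.ko

fastrpc_node_present() {
    # Older adb versions do not forward the remote exit status, so compare
    # the output of ls with the node path instead of trusting the status.
    out=$("$ADB" shell ls "$FASTRPC_NODE" 2>/dev/null | tr -d '\r')
    [ "$out" = "$FASTRPC_NODE" ]
}

ensure_fastrpc() {
    if fastrpc_node_present; then
        echo "FastRPC driver already loaded"
        return 0
    fi
    # Same steps as in the text: restart adbd as root, then load the module.
    "$ADB" root
    "$ADB" wait-for-device
    "$ADB" shell insmod "$FASTRPC_MODULE"
    if fastrpc_node_present; then
        echo "FastRPC driver loaded"
        return 0
    fi
    echo "FastRPC not available on this device" >&2
    return 1
}
```

Running `ensure_fastrpc` with a device attached reports which branch was taken. A non-zero return corresponds to the case described in the note above, where neither the module nor the device node is present.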

Optionally, check that adsprpcd is running. This daemon could be located in /system/bin, /system/vendor/bin, or some other vendor-defined location. If it is running and logcat -s adsprpc doesn’t show any errors, FastRPC is enabled and operational. If it is not running, start the daemon and check logcat -s adsprpc for errors. On some older releases, calls into the DSP will block until the daemon starts.
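A quick way to script this optional check is sketched below. It assumes the daemon shows up under the name adsprpcd in the device's process list and that its log tag is adsprpc, as described above. The function names and the `ADB` override variable are hypothetical.

```shell
#!/bin/sh
# Sketch: check whether the adsprpcd daemon is running and dump FastRPC logs.
# ADB is a hypothetical override so a different adb binary can be used.
ADB=${ADB:-adb}

adsprpcd_running() {
    # Newer Android builds need `ps -A` to list every process, while older
    # ones list everything with plain `ps`, so try both and search the output.
    { "$ADB" shell ps -A 2>/dev/null; "$ADB" shell ps 2>/dev/null; } | grep -q 'adsprpcd'
}

fastrpc_logs() {
    # -d dumps the current log buffer and exits instead of streaming;
    # -s silences every tag except the one given.
    "$ADB" logcat -d -s adsprpc
}
```

For example, `adsprpcd_running || echo "adsprpcd is not running"` gives a one-line status, and `fastrpc_logs` prints whatever the adsprpc tag has logged so far for manual inspection.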

Using ION Allocator

The DSP hardware is not well suited for handling discontiguous memory, which is commonly the kind of memory users get from a simple “malloc” call. Android provides a contiguous memory allocator called ION. For more in-depth documentation of ION, see http://lwn.net/Articles/480055.

Different versions of Android provide slightly incompatible implementations of ION. This guide describes some of the differences between versions of the ION APIs.

This is a work in progress, so I will add more in the coming days…
