[LINUX] Processor tracing mechanism to read from perf (perf + Intel PT / ARM CoreSight)

I have been reading the perf source code for various reasons, so here is a brief walkthrough. This time we look at the perf record command, which is responsible for recording events in perf. Recent CPUs include a trace mechanism that perf also takes advantage of, so this article focuses on the part of perf record that cooperates with the CPU's processor trace mechanism. To tell the truth, I am more interested in processor traces such as **Intel Processor Trace (Intel PT)** and **ARM CoreSight** than in perf itself, but since these are implemented as perf events on Linux, I ended up analyzing the implementation of the perf command.

1. Perf architecture

perf originally had a predecessor called Performance Counters for Linux (PCL), which was provided as an interface tool to the performance counters (PMC) of the CPU. It can now access various sources other than performance counters, making it a more versatile tracing tool. The figure below outlines the architecture. In addition to PMC, perf today can access kprobes/uprobes, various events in the Linux kernel, and guest OS information on a hypervisor (perf kvm).

perf.png (quoted from [^4])

The accessible sources can be viewed with sudo perf list.

$ sudo perf list
List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

Inside perf, these sources are categorized by source type, such as `PERF_TYPE_HARDWARE` and `PERF_TYPE_SOFTWARE` for hardware and software sources, and by the events within each type (see [perf_event.h](https://elixir.bootlin.com/linux/v5.5-rc4/source/include/uapi/linux/perf_event.h)). Use the **-e option** to specify the source from the perf command. For example, to record page-fault events in /bin/ls, run the following.

$ sudo perf record -vv -e page-faults /bin/ls
$ perf report

To obtain perf events from a third-party program, call perf_event_open(2) and specify the source type as PERF_TYPE_XXXX.

(Incidentally, perf_event_open(2) once had a vulnerability that allowed privilege escalation.)
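As a rough sketch of what such a third-party program might look like, the C code below fills in a perf_event_attr for the page-faults software event and opens it for the calling thread. The helper names init_page_fault_attr and open_page_fault_counter are made up for this illustration. glibc provides no wrapper for perf_event_open, so it is invoked through syscall(2).

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: describe the "page-faults" software event. */
void init_page_fault_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_SOFTWARE;          /* the source type */
    attr->config = PERF_COUNT_SW_PAGE_FAULTS; /* the event within that type */
    attr->disabled = 1;                       /* start stopped; enable via ioctl */
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}

/* Hypothetical helper: open the counter for the calling thread on any CPU.
 * Returns a file descriptor, or -1 with errno set (for example when
 * perf_event_paranoid forbids access). */
int open_page_fault_counter(void)
{
    struct perf_event_attr attr;

    init_page_fault_attr(&attr);
    /* arguments: attr, pid = 0 (self), cpu = -1 (any), group_fd = -1, flags = 0 */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}
```

With the returned descriptor, a program would typically enable the counter with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) and later read a 64-bit count with read(2).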

2. How Perf record works

First, let's look at the event acquisition loop, which is the main process of perf record. Since perf supports multiple commands such as report and stat in addition to record, the processing for each command is abstracted behind a function pointer in the source code. In practice, run_argv(&argc, &argv) is called from the main function, and the function pointer for each command is called through run_builtin(). The function corresponding to the record command is cmd_record(), and __cmd_record, called from there, is the main loop for event acquisition. Before going into this main loop, let's take a look at the record structure that is passed as an argument to __cmd_record (perf/builtin-record.c#L78).

struct record {
	...
	struct perf_data	data; //Interface with external storage
	struct auxtrace_record	*itr; //Interface to AUX API
	struct evlist	*evlist; //Ring buffer for each event
	struct perf_session	*session;
	...
};

An instance of this structure is defined in the same file. From here on, the record structure plays the central role of holding the context throughout the record command processing. Only the main members are listed above. Events that perf can retrieve are stored in a ring buffer for each event, and perf accesses these ring buffers through struct evlist (tools/perf/util/evlist.h#L51). The data read through evlist is then saved to external storage as perf_data using struct perf_data (util/data.h#L23). The ring buffer is overwritten as subsequent events arrive, so the data has to be written out to a file on external storage at regular intervals.

Event source-->Ring buffer--(evlist)--> perf --(perf_data)-->External storage

However, if the source of the event is the CPU's processor trace mechanism, events are acquired using an interface called [struct auxtrace_record](https://elixir.bootlin.com/linux/v5.5-rc4/source/tools/perf/util/auxtrace.h#L326) in addition to struct evlist. This is covered in detail in the next section.

Processor trace-->AUX buffer--(evlist + auxtrace_record)--> perf --(perf_data)-->External storage

__cmd_record, the main process of perf record, keeps executing the series of steps just described in an infinite loop. The following is the main part of its source code.

static int __cmd_record(struct record *rec, int argc, const char **argv)
{

	struct perf_tool *tool = &rec->tool;
	struct record_opts *opts = &rec->opts;
	struct perf_data *data = &rec->data;
	struct perf_session *session;
	int fd;
        
        ...

	// fd of the file on external storage to which event data (perf_data) is periodically written
	fd = perf_data__fd(data);
	rec->session = session;

        ...

	// Open the events and map their ring buffers
	if (record__open(rec) != 0) {
		err = -1;
		goto out_child;
	}

	trigger_ready(&auxtrace_snapshot_trigger);
	trigger_ready(&switch_output_trigger);
	perf_hooks__invoke_record_start();
	for (;;) {
		unsigned long long hits = rec->samples;

		// Stop recording events to the ring buffer (evlist)
		if (trigger_is_hit(&switch_output_trigger) || done || draining)
			perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_DATA_PENDING);

		// Read data from the ring buffer (evlist) and write it out to external storage
		if (record__mmap_read_all(rec, false) < 0) {
			trigger_error(&auxtrace_snapshot_trigger);
			trigger_error(&switch_output_trigger);
			err = -1;
			goto out_child;
		}

		
		if (trigger_is_hit(&switch_output_trigger)) {
			
			if (rec->evlist->bkw_mmap_state == BKW_MMAP_RUNNING)
				continue;
			trigger_ready(&switch_output_trigger);

			/*
			 * Reenable events in overwrite ring buffer after
			 * record__mmap_read_all(): we should have collected
			 * data from it.
			 */
			// Allow events to be recorded to the ring buffer (evlist) again
			perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_RUNNING);

			
			fd = record__switch_output(rec, false);
			
			/* re-arm the alarm */
			if (rec->switch_output.time)
				alarm(rec->switch_output.time);
		}

		if (hits == rec->samples) {
			if (done || draining)
				break;
			//Wait on poll until the next event arrives and can be read from the ring buffer
			err = evlist__poll(rec->evlist, -1);
			

			if (evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
				draining = true;
		}
	}
}

I added comments to the parts that do the main processing. The main loop is the body of for (;;), and what it does is quite simple. When `trigger_is_hit(&switch_output_trigger)` becomes true, event recording to the overwritable ring buffer is temporarily paused (BKW_MMAP_DATA_PENDING), and in the meantime `record__mmap_read_all(rec, false)` quickly flushes the data in the ring buffer to external storage. Event recording to the ring buffer is then enabled again (BKW_MMAP_RUNNING), and that is all. When a switch-output time is set, this export is repeated at regular intervals by `alarm(rec->switch_output.time)`. As an aside, the arrival of a new event can be detected by polling the source. In the loop above, when an iteration has read nothing new (hits == rec->samples), `evlist__poll(rec->evlist, -1)` waits until the next event arrives.

Note that rec->samples is a counter that perf record advances as it writes data out, not a user setting. The **-c option** of perf record sets the sampling period of an event (one sample every N occurrences), which indirectly affects how often new data shows up here.
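The drain step at the heart of this loop can be illustrated with a toy ring buffer. This is only a sketch of the general producer/consumer idea, not perf's actual mmap layout: struct toy_ring, toy_ring_write and toy_ring_drain are hypothetical names, and the real buffer is shared with the kernel through mmap rather than living in one process.

```c
#include <stddef.h>

#define RING_SIZE 8 /* must be a power of two so that masking works */

struct toy_ring {
    unsigned char data[RING_SIZE];
    size_t head; /* advanced by the producer (the kernel, in perf's case) */
    size_t tail; /* advanced by the consumer (perf record, in perf's case) */
};

/* Producer side: append len bytes, or refuse if there is not enough room. */
int toy_ring_write(struct toy_ring *rb, const void *src, size_t len)
{
    const unsigned char *p = src;

    if (len > RING_SIZE - (rb->head - rb->tail))
        return -1;
    for (size_t i = 0; i < len; i++)
        rb->data[(rb->head + i) & (RING_SIZE - 1)] = p[i];
    rb->head += len;
    return 0;
}

/* Consumer side: copy everything between tail and head into out.
 * Returns the number of bytes drained; 0 means "nothing new, go and poll". */
size_t toy_ring_drain(struct toy_ring *rb, unsigned char *out, size_t out_cap)
{
    size_t n = 0;

    while (rb->tail != rb->head && n < out_cap) {
        out[n++] = rb->data[rb->tail & (RING_SIZE - 1)];
        rb->tail++;
    }
    return n;
}
```

A drain that returns 0 plays the same role as the hits == rec->samples check: there was nothing to read, so the caller should block until the producer has written more.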

3. Perf and AUX buffer

Now let's look at the part that works with the CPU's processor trace, which is the main theme of this article. As mentioned briefly in the previous section, perf record obtains events from the processor trace through an interface called struct auxtrace_record (auxtrace.h#L326).


struct auxtrace_record {
	int (*recording_options)(struct auxtrace_record *itr,
				 struct evlist *evlist,
				 struct record_opts *opts);
	size_t (*info_priv_size)(struct auxtrace_record *itr,
				 struct evlist *evlist);
	int (*info_fill)(struct auxtrace_record *itr,
			 struct perf_session *session,
			 struct perf_record_auxtrace_info *auxtrace_info,
			 size_t priv_size);
	void (*free)(struct auxtrace_record *itr);
	int (*snapshot_start)(struct auxtrace_record *itr);
	int (*snapshot_finish)(struct auxtrace_record *itr);
	int (*find_snapshot)(struct auxtrace_record *itr, int idx,
			     struct auxtrace_mmap *mm, unsigned char *data,
			     u64 *head, u64 *old);
	int (*parse_snapshot_options)(struct auxtrace_record *itr,
				      struct record_opts *opts,
				      const char *str);
	u64 (*reference)(struct auxtrace_record *itr);
	int (*read_finish)(struct auxtrace_record *itr, int idx);
	unsigned int alignment;
	unsigned int default_aux_sample_size;
};

auxtrace_record is the interface for accessing the Auxiliary (AUX) buffer, a buffer prepared for recording processor trace events. It is enabled with the HAVE_AUXTRACE_SUPPORT build option.

3.1 What is an AUX buffer?

The AUX buffer was originally implemented by Intel engineers, as patches to both the **Linux kernel** and the **perf command**, for recording Intel PT events; ARM CoreSight now uses the AUX buffer as well. Details are described later. As an aside, the AUX buffer was added after the patch that first added Intel PT support to Linux, so I suspect it is an interface that was not anticipated at the outset. AUX, a ring buffer independent of the one read through evlist, was needed in order to move the decoding of Intel PT traces to user space. The AUX patch to the Linux kernel carries the following comment.

The single most notable thing is that while PT outputs trace data in a compressed binary format, 
it will still generate hundreds of megabytes of trace data per second per core. 
Decoding this binary stream takes 2-3 orders of magnitude the cpu time that it takes to generate it.
These considerations make it impossible to carry out decoding in kernel space. 
Therefore, the trace data is exported to userspace as a zero-copy mapping that userspace 
can collect and store for later decoding. 
To address this, this patchset extends perf ring buffer with an "AUX space"

Intel PT produces hundreds of megabytes of trace per second per core, so decoding it in kernel space before passing it to the user would be far too much overhead. The kernel therefore concentrates on tracing: it creates an AUX buffer that is shared between user and kernel mode and leaves decoding to user space. On the user-space side, auxtrace_record is the interface through which the perf command accesses the AUX buffer. auxtrace_record abstracts access to the AUX buffer and is initialized with auxtrace_record__init.

static int record__auxtrace_init(struct record *rec)
{
	if (!rec->itr) {
		rec->itr = auxtrace_record__init(rec->evlist, &err);
		if (err)
			return err;
	}
}

auxtrace_record__init is defined with the __weak attribute, and functions with the same name are defined in the code for Intel PT and ARM CoreSight (tools/perf/arch/arm/util/auxtrace.c#L54). The one for the processor trace enabled in the build configuration is linked.

As a result, every operation is abstracted as a function pointer in auxtrace_record, regardless of which processor trace feature is in use.
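The combination of a weak default and a table of function pointers can be sketched as follows. Everything here (struct toy_itr, toy_itr_init, fake_pt_itr) is a hypothetical miniature, not the perf code, and `__attribute__((weak))` is the GCC/Clang spelling of what the kernel tree abbreviates as __weak.

```c
#include <stddef.h>

/* A one-callback miniature of struct auxtrace_record. */
struct toy_itr {
    int (*info_fill)(struct toy_itr *itr, unsigned long long *priv, size_t n);
};

/* Wrapper in the style of auxtrace_record__info_fill():
 * dispatch to the backend if there is one, otherwise report "not supported". */
int toy_itr_info_fill(struct toy_itr *itr, unsigned long long *priv, size_t n)
{
    if (itr && itr->info_fill)
        return itr->info_fill(itr, priv, n);
    return -1;
}

/* Weak default: used only when no architecture file supplies a strong
 * definition with the same name at link time. */
__attribute__((weak)) struct toy_itr *toy_itr_init(int *err)
{
    *err = 0;
    return NULL; /* no processor trace backend available */
}

/* A made-up backend standing in for Intel PT or CoreSight. */
static int fake_pt_info_fill(struct toy_itr *itr, unsigned long long *priv, size_t n)
{
    (void)itr;
    for (size_t i = 0; i < n; i++)
        priv[i] = i + 1;
    return 0;
}

struct toy_itr fake_pt_itr = { .info_fill = fake_pt_info_fill };
```

The caller never needs to know which backend it got: it only ever goes through the wrapper, and a build without any backend still links because the weak default is there.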

3.2 Read AUX buffer and generate perf_data header

The trace recorded in the AUX buffer is read in the perf record main loop: [record__mmap_read_all](https://elixir.bootlin.com/linux/v5.5-rc4/source/tools/perf/builtin-record.c#L1079) calls record__mmap_read_evlist, which in turn reads the trace with [auxtrace_mmap__read](https://elixir.bootlin.com/linux/v5.5-rc4/source/tools/perf/builtin-record.c#L586).

static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
				    bool overwrite, bool synch)
{
		if (map->auxtrace_mmap.base && !rec->opts.auxtrace_snapshot_mode &&
		    !rec->opts.auxtrace_sample_mode &&
		    record__auxtrace_mmap_read(rec, map) != 0) {
			rc = -1;
			goto out;
		}
}
static int record__mmap_read_all(struct record *rec, bool synch)
{
	int err;

	err = record__mmap_read_evlist(rec, rec->evlist, false, synch);
	if (err)
		return err;

	return record__mmap_read_evlist(rec, rec->evlist, true, synch);
}

The address at which the kernel's AUX buffer is mapped is auxtrace_mmap.base, and auxtrace_mmap__read reads from there.

In the event loop, after the read by record__mmap_read_all completes, [record__switch_output](https://elixir.bootlin.com/linux/v5.5-rc4/source/tools/perf/builtin-record.c#L1177) saves the data to external storage as perf_data. Besides the raw trace data, perf_data has a header area in which various metadata is written. For example, Intel PT has various [configurable](https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt#L158) options (config) for tracing, and the combination of options actually used is also saved in the perf_data header. This metadata is used later by tools that parse perf_data, such as perf report. The perf_data header is generated by record__synthesize.

static int record__synthesize(struct record *rec, bool tail)
{

	if (rec->opts.full_auxtrace) {
		err = perf_event__synthesize_auxtrace_info(rec->itr, tool,
					session, process_synthesized_event);
		if (err)
			goto out;
	}

}

When full AUX tracing is enabled (rec->opts.full_auxtrace), perf_event__synthesize_auxtrace_info is called.

int auxtrace_record__info_fill(struct auxtrace_record *itr,
			       struct perf_session *session,
			       struct perf_record_auxtrace_info *auxtrace_info,
			       size_t priv_size)
{
	if (itr)
		return itr->info_fill(itr, session, auxtrace_info, priv_size);
	return auxtrace_not_supported();
}

int perf_event__synthesize_auxtrace_info(struct auxtrace_record *itr,
					 struct perf_tool *tool,
					 struct perf_session *session,
					 perf_event__handler_t process)
{
	
	pr_debug2("Synthesizing auxtrace information\n");
	
	err = auxtrace_record__info_fill(itr, session, &ev->auxtrace_info,
					 priv_size); // Fill in the AUX trace metadata to be stored in perf_data

	err = process(tool, ev, NULL, NULL); //Export to external storage

}

The actual header generation is done by auxtrace_record__info_fill, which delegates to info_fill, the function pointer registered in struct auxtrace_record shown earlier. info_fill is implemented separately for Intel PT and ARM CoreSight. The process argument is process_synthesized_event, which writes the perf event to a file on external storage.

3.3 Intel Processor Trace (PT)

As an example of implementing the function pointers of struct auxtrace_record, let's look at the Intel PT implementation of info_fill, which handles the header generation described in the previous section.

For Intel PT, the auxtrace_record initialization code auxtrace_record__init performs the initialization with intel_pt_recording_init.

struct auxtrace_record *intel_pt_recording_init(int *err)
{
	...
	ptr->itr.info_priv_size = intel_pt_info_priv_size;
	ptr->itr.info_fill = intel_pt_info_fill;
	
	ptr->itr.snapshot_start = intel_pt_snapshot_start;
	ptr->itr.snapshot_finish = intel_pt_snapshot_finish;
	...
	ptr->itr.read_finish = intel_pt_read_finish;
	return &ptr->itr;
}

Let's look at the implementation of intel_pt_info_fill, which serves as info_fill for Intel PT.

static int intel_pt_info_fill(struct auxtrace_record *itr,
			      struct perf_session *session,
			      struct perf_record_auxtrace_info *auxtrace_info,
			      size_t priv_size)
{
	...
	intel_pt_parse_terms(&intel_pt_pmu->format, "tsc", &tsc_bit);
	intel_pt_parse_terms(&intel_pt_pmu->format, "noretcomp",
			     &noretcomp_bit);
	intel_pt_parse_terms(&intel_pt_pmu->format, "mtc", &mtc_bit);
	mtc_freq_bits = perf_pmu__format_bits(&intel_pt_pmu->format,
					      "mtc_period");
	intel_pt_parse_terms(&intel_pt_pmu->format, "cyc", &cyc_bit);

	intel_pt_tsc_ctc_ratio(&tsc_ctc_ratio_n, &tsc_ctc_ratio_d);

	if (perf_pmu__scan_file(intel_pt_pmu, "max_nonturbo_ratio",
				"%lu", &max_non_turbo_ratio) != 1)
		max_non_turbo_ratio = 0;

	filter = intel_pt_find_filter(session->evlist, ptr->intel_pt_pmu);
	filter_str_len = filter ? strlen(filter) : 0;

	...
	
	auxtrace_info->priv[INTEL_PT_TSC_BIT] = tsc_bit;
	auxtrace_info->priv[INTEL_PT_NORETCOMP_BIT] = noretcomp_bit;
	auxtrace_info->priv[INTEL_PT_HAVE_SCHED_SWITCH] = ptr->have_sched_switch;
	auxtrace_info->priv[INTEL_PT_SNAPSHOT_MODE] = ptr->snapshot_mode;
	auxtrace_info->priv[INTEL_PT_PER_CPU_MMAPS] = per_cpu_mmaps;
	auxtrace_info->priv[INTEL_PT_MTC_BIT] = mtc_bit;
	auxtrace_info->priv[INTEL_PT_MTC_FREQ_BITS] = mtc_freq_bits;
	auxtrace_info->priv[INTEL_PT_TSC_CTC_N] = tsc_ctc_ratio_n;
	auxtrace_info->priv[INTEL_PT_TSC_CTC_D] = tsc_ctc_ratio_d;
	auxtrace_info->priv[INTEL_PT_CYC_BIT] = cyc_bit;
	auxtrace_info->priv[INTEL_PT_MAX_NONTURBO_RATIO] = max_non_turbo_ratio;
	auxtrace_info->priv[INTEL_PT_FILTER_STR_LEN] = filter_str_len;

	info = &auxtrace_info->priv[INTEL_PT_FILTER_STR_LEN] + 1;
}

It simply obtains each option with intel_pt_parse_terms and stores it in auxtrace_info->priv, which becomes part of the header.
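The layout used here, a fixed set of 64-bit slots indexed by an enum with a variable-length filter string appended right after the last slot, can be sketched in miniature. The enum values, struct toy_info and toy_info_fill are hypothetical, and the real code sizes and pads the string with its own helpers.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Slot indices, in the spirit of INTEL_PT_TSC_BIT and friends. */
enum { TOY_TSC_BIT, TOY_MTC_BIT, TOY_FILTER_STR_LEN, TOY_PRIV_MAX };

struct toy_info {
    size_t nbytes;   /* size of priv[] in bytes */
    uint64_t priv[]; /* fixed slots first, then the filter string */
};

/* Build the private header area; the caller frees the result. */
struct toy_info *toy_info_fill(uint64_t tsc_bit, uint64_t mtc_bit, const char *filter)
{
    size_t len = filter ? strlen(filter) : 0;
    /* Round the string (plus its NUL) up to whole 64-bit words. */
    size_t str_words = len ? (len + 1 + 7) / 8 : 0;
    size_t nbytes = (TOY_PRIV_MAX + str_words) * sizeof(uint64_t);
    struct toy_info *info = calloc(1, sizeof(*info) + nbytes);

    if (!info)
        return NULL;
    info->nbytes = nbytes;
    info->priv[TOY_TSC_BIT] = tsc_bit;
    info->priv[TOY_MTC_BIT] = mtc_bit;
    info->priv[TOY_FILTER_STR_LEN] = len;
    /* The string starts one slot past the last fixed entry. */
    if (len)
        memcpy(&info->priv[TOY_FILTER_STR_LEN] + 1, filter, len);
    return info;
}
```

Storing the string length in the last fixed slot lets a reader such as perf report find the end of the fixed part and the size of the trailing string without any extra framing.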

4. Linux kernel and AUX buffer (PMU driver)

Intel PT and ARM CoreSight are treated as Performance Monitoring Units (PMUs) in the Linux kernel, and Linux [implements](https://github.com/torvalds/linux/commit/52ca9ced3f70779589e6ecc329baffe69d8f5f7a) dedicated drivers for accessing these PMUs from user space.

4.1. Intel Processor Trace (PT)

The interface to the driver is `/sys/bus/event_source/devices/intel_pt/`. These drivers are registered as PMU drivers in their initialization code by [perf_pmu_register](https://elixir.bootlin.com/linux/v5.5-rc4/source/kernel/events/core.c#L10272). Both the Intel PT and CoreSight drivers are abstracted as function pointers in struct pmu in the Linux kernel.

In the case of Intel PT, the implementation is as follows.

static struct pt_pmu pt_pmu;

static __init int pt_init(void)
{
	pt_pmu.pmu.capabilities	|= PERF_PMU_CAP_EXCLUSIVE | PERF_PMU_CAP_ITRACE;
	pt_pmu.pmu.attr_groups		 = pt_attr_groups;
	pt_pmu.pmu.task_ctx_nr		 = perf_sw_context;
	pt_pmu.pmu.event_init		 = pt_event_init;
	pt_pmu.pmu.add			 = pt_event_add;
	pt_pmu.pmu.del			 = pt_event_del;
	pt_pmu.pmu.start		 = pt_event_start;
	pt_pmu.pmu.stop			 = pt_event_stop;
	pt_pmu.pmu.snapshot_aux		 = pt_event_snapshot_aux;
	pt_pmu.pmu.read			 = pt_event_read;
	pt_pmu.pmu.setup_aux		 = pt_buffer_setup_aux;
	pt_pmu.pmu.free_aux		 = pt_buffer_free_aux;
	pt_pmu.pmu.addr_filters_sync     = pt_event_addr_filters_sync;
	pt_pmu.pmu.addr_filters_validate = pt_event_addr_filters_validate;
	pt_pmu.pmu.nr_addr_filters       =
		intel_pt_validate_hw_cap(PT_CAP_num_address_ranges);

	ret = perf_pmu_register(&pt_pmu.pmu, "intel_pt", -1);

	return ret;
}
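A PMU registered this way appears under /sys/bus/event_source/devices/, and its `type` file holds the dynamically assigned number that user space places in perf_event_attr.type. The sketch below reads that number; parse_pmu_type and read_pmu_type are hypothetical helper names, and on a machine without Intel PT the read simply fails.

```c
#include <stdio.h>

/* Parse the decimal PMU type id from the text of a sysfs "type" file. */
int parse_pmu_type(const char *text, unsigned int *type)
{
    return sscanf(text, "%u", type) == 1 ? 0 : -1;
}

/* Read /sys/bus/event_source/devices/<pmu>/type.
 * Returns 0 and stores the id in *type on success, -1 otherwise. */
int read_pmu_type(const char *pmu, unsigned int *type)
{
    char path[256], buf[32];
    FILE *f;
    int ret = -1;

    snprintf(path, sizeof(path), "/sys/bus/event_source/devices/%s/type", pmu);
    f = fopen(path, "r");
    if (!f)
        return -1; /* PMU not present on this machine */
    if (fgets(buf, sizeof(buf), f))
        ret = parse_pmu_type(buf, type);
    fclose(f);
    return ret;
}
```

Calling read_pmu_type("intel_pt", &type) or read_pmu_type("cs_etm", &type) would give the value to use as attr.type when opening the trace event with perf_event_open(2).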

4.1.1. Securing the AUX buffer

It is intuitive that the start and end of tracing are implemented in pmu.start/stop, but what is interesting is that even the allocation and freeing of the AUX buffer are function pointers, pmu.setup_aux/free_aux. AUX should be just a memory region shared between user and kernel, so why is this operation implemented separately for Intel PT and CoreSight? The answer can be found in [material](https://conference.hitb.org/hitbsecconf2017ams/materials/D1T1%20-%20Richard%20Johnson%20-%20Harnessing%20Intel%20Processor%20Trace%20on%20Windows%20for%20Vulnerability%20Discovery.pdf) about Intel PT. Let's look at pages 13 to 15 of these slides.

(Slide 13p)
• Different kinds of trace filtering:
1. Current Privilege Level (CPL) – used to trace all of user or kernel
2. PML4 Page Table – used to trace a single process
3. Instruction Pointer – used to trace a particular slice of code (or module)
• Two types of output logging:
1. Single Range
2. Table of Physical Addresses 

The first half describes filtering when collecting traces, and the second half describes how the collected trace information is output. In perf, tracing is abstracted as a model that writes to the AUX buffer, but Intel PT actually has two ways to write out trace information: "Single Range" and "Table of Physical Addresses (ToPA)". Single Range (slide p. 14) writes trace information into a contiguous physical address region reserved by the OS in advance. ToPA (slide p. 15) can write trace information even to non-contiguous physical address regions by registering them in a table called the ToPA.

Indeed, the implementation of pt_buffer_setup_aux, which was set in pmu.setup_aux above, shows both output models being used.

static void *
pt_buffer_setup_aux(struct perf_event *event, void **pages,
		    int nr_pages, bool snapshot)
{
	buf = kzalloc_node(sizeof(struct pt_buffer), GFP_KERNEL, node);
	
	INIT_LIST_HEAD(&buf->tables);

	ret = pt_buffer_try_single(buf, nr_pages); //Register buf as a Single Range model
	
	ret = pt_buffer_init_topa(buf, cpu, nr_pages, GFP_KERNEL); //Register buf as a ToPA model

	return buf;
}

4.1.2 Start tracing

pt_event_start, set in pmu.start, is what makes Intel PT write its trace information to the AUX buffer allocated in the previous section (arch/x86/events/intel/pt.c#L1515).

static void pt_event_start(struct perf_event *event, int mode)
{
	struct hw_perf_event *hwc = &event->hw;
	struct pt *pt = this_cpu_ptr(&pt_ctx);
	struct pt_buffer *buf;

	buf = perf_aux_output_begin(&pt->handle, event); // Get the AUX buffer allocated by setup_aux
	
	pt_config_buffer(buf); //Set as the output destination of Intel PT
	pt_config(event); //Other settings such as address filter

	return;
}

The AUX buffer (buf) allocated earlier by setup_aux is simply set as the output destination by pt_config_buffer(buf).

The processing of pt_config_buffer is as follows.

static void pt_config_buffer(struct pt_buffer *buf)
{
	struct pt *pt = this_cpu_ptr(&pt_ctx);
	u64 reg, mask;
	void *base;

	// The AUX buffer uses either the Single Range or the ToPA format
	if (buf->single) {
		base = buf->data_pages[0];
		mask = (buf->nr_pages * PAGE_SIZE - 1) >> 7;
	} else {
		base = topa_to_page(buf->cur)->table;
		mask = (u64)buf->cur_idx;
	}

	//Convert AUX buffer address to physical address
	reg = virt_to_phys(base);
	if (pt->output_base != reg) {
	// Set the physical address of the AUX buffer as the Intel PT output destination (MSR_IA32_RTIT_OUTPUT_BASE)
		pt->output_base = reg;
		wrmsrl(MSR_IA32_RTIT_OUTPUT_BASE, reg);
	}
}

The trace output destination address of Intel PT can be set by MSR_IA32_RTIT_OUTPUT_BASE.
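The excerpt above computes mask but leaves out what happens to it. My understanding is that the kernel packs it, together with the current output offset, into a second register that describes the size of the region or the position in the table. The sketch below shows that packing under this assumption: the function name toy_output_mask, the 4 KiB page size, and the exact bit layout (low 7 bits forced to 1, offset in the upper 32 bits) are my reading and should be checked against pt.c and the Intel SDM.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL /* assumed page size */

/* Pack the value for the output-mask register (assumed layout):
 *   bits  6:0   always 1
 *   bits 31:7   size mask (Single Range) or table index (ToPA)
 *   bits 63:32  current output offset */
uint64_t toy_output_mask(int single, uint64_t nr_pages,
                         uint64_t cur_idx, uint64_t output_off)
{
    uint64_t mask;

    if (single)
        mask = (nr_pages * TOY_PAGE_SIZE - 1) >> 7; /* same formula as the excerpt */
    else
        mask = cur_idx;

    return 0x7f | (mask << 7) | (output_off << 32);
}
```

For a Single Range buffer of 4 pages the result is 0x3fff, that is, buffer size minus one, which is why the region has to be a power-of-two size in this mode.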

4.2. ARM CoreSight

The interface to the driver is `/sys/bus/event_source/devices/cs_etm/`. The CoreSight driver is also covered on Linaro's official blog and slides [^8], so read those if you are interested. Like Intel PT, ARM CoreSight is a tracing mechanism built into the CPU. The following is a schematic diagram.

coresight.png (quoted from [^8])

A dedicated trace unit called the **Embedded Trace Macrocell (ETM)** is attached to each processor, and the trace information obtained from these multiple cores is merged through a component called a **funnel** into a buffer called the **Embedded Trace Buffer (ETB)**. Modern ARM processors use the Trace Memory Controller (TMC), which is an [extension of ETB](http://infocenter.arm.com/help/index.jsp?topic=0com.arm.doc.ddi0461b/Cacddhga.html). ETB is [now defined](http://infocenter.arm.com/help/index.jsp?topic=0com.arm.doc.ddi0461b/CACECIII.html) as one of the trace configurations of the TMC, but this article does not use the TMC and assumes the architecture of earlier ARM processors, in which the trace information is collected directly into an ETB.

As with Intel PT, the Linux kernel driver is registered as a PMU driver by perf_pmu_register in its initialization code.

static int __init etm_perf_init(void)
{
	int ret;

	etm_pmu.capabilities		= (PERF_PMU_CAP_EXCLUSIVE |
					   PERF_PMU_CAP_ITRACE);

	etm_pmu.attr_groups		= etm_pmu_attr_groups;
	etm_pmu.task_ctx_nr		= perf_sw_context;
	etm_pmu.read			= etm_event_read;
	etm_pmu.event_init		= etm_event_init;
	etm_pmu.setup_aux		= etm_setup_aux;
	etm_pmu.free_aux		= etm_free_aux;
	etm_pmu.start			= etm_event_start;
	etm_pmu.stop			= etm_event_stop;
	etm_pmu.add			= etm_event_add;
	etm_pmu.del			= etm_event_del;
	etm_pmu.addr_filters_sync	= etm_addr_filters_sync;
	etm_pmu.addr_filters_validate	= etm_addr_filters_validate;
	etm_pmu.nr_addr_filters		= ETM_ADDR_CMP_MAX;

	ret = perf_pmu_register(&etm_pmu, CORESIGHT_ETM_PMU_NAME, -1);
	if (ret == 0)
		etm_perf_up = true;

	return ret;
}

4.2.1 Securing AUX buffer

As before, let's first look at the implementation of pmu.setup_aux/free_aux, which allocate and free the AUX buffer. etm_setup_aux allocates the AUX buffer through `sink_ops(sink)->alloc_buffer`.

static void *etm_setup_aux(struct perf_event *event, void **pages,
			   int nr_pages, bool overwrite)
{
	if (!sink_ops(sink)->alloc_buffer || !sink_ops(sink)->free_buffer)
		goto err;

	/* Allocate the sink buffer for this session */
	event_data->snk_config =
			sink_ops(sink)->alloc_buffer(sink, event, pages,
						     nr_pages, overwrite);
	
	return event_data;
}

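Sink is the CoreSight term for a component that receives and stores the trace; the component that generates it, such as an ETM, is called a source. You can list the CoreSight devices, including the sinks, with the following command.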

linaro@linaro-nano:~$ ls /sys/bus/coresight/devices/
20010000.etf 20040000.main-funnel 22040000.etm 22140000.etm
230c0000.cluster1-funnel 23240000.etm coresight-replicator 20030000.tpiu
20070000.etr 220c0000.cluster0-funnel 23040000.etm 23140000.etm
23340000.etm

The listing contains ETMs (sources), funnels (links that merge trace streams), and sinks such as ETF and ETR (both types of TMC). You can select a sink from perf as follows.

./perf record -e cs_etm/@20070000.etr/ --per-thread ./main

`sink_ops(sink)` changes depending on which sink is specified, but here we assume that the good old ETB is used as the sink. In this case, sink_ops is etb_sink_ops.

static void *etb_alloc_buffer(struct coresight_device *csdev,
			      struct perf_event *event, void **pages,
			      int nr_pages, bool overwrite)
{
	int node;
	struct cs_buffers *buf;

	node = (event->cpu == -1) ? NUMA_NO_NODE : cpu_to_node(event->cpu);

	buf = kzalloc_node(sizeof(struct cs_buffers), GFP_KERNEL, node);
	if (!buf)
		return NULL;

	buf->snapshot = overwrite;
	buf->nr_pages = nr_pages;
	buf->data_pages = pages;

	return buf;
}
static const struct coresight_ops_sink etb_sink_ops = {
	.enable		= etb_enable,
	.disable	= etb_disable,
	.alloc_buffer	= etb_alloc_buffer,
	.free_buffer	= etb_free_buffer,
	.update_buffer	= etb_update_buffer,
};

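etb_alloc_buffer is registered as the alloc_buffer used to allocate the AUX buffer. Its body is little more than a kzalloc of a bookkeeping structure that records the pages handed over by perf. This is what the CoreSight side then treats as the buffer for the ETB.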

4.2.2. Start tracing

After that, tracing is started by etm_event_start, which is set in etm_pmu.start. As with Intel PT, this is also very simple.

static void etm_event_start(struct perf_event *event, int flags)
{
	/*
	 * Deal with the ring buffer API and get a handle on the
	 * session's information.
	 */
	event_data = perf_aux_output_begin(handle, event); //Get AUX buffer address

	path = etm_event_cpu_path(event_data, cpu);
	/* We need a sink, no need to continue without one */
	sink = coresight_get_sink(path);
	
	/* Finally enable the tracer */
	if (source_ops(csdev)->enable(csdev, event, CS_MODE_PERF)) //Start tracing
		goto fail_disable_path;

out:
	return;
	...
}

It simply obtains the AUX buffer through perf_aux_output_begin and starts tracing.

5. Address filter

Up to the previous section, we have covered how traces are recorded and read using processor traces, but a performance monitoring tool has another important element: **filtering** of traces. A processor trace captures everything executed on the CPU, from applications to the kernel, but in many cases users do not need all of it. They want the trace of a particular process, or statistics on some kernel events. It would be possible to trace everything and then extract the necessary information afterwards with the perf command, but it is not practical to collect trace data that is generated at hundreds of megabytes per second.

5.1. Intel Processor Trace (PT)

Intel PT has a filtering function that can be configured through Model Specific Registers (MSRs) to trace only a specific range. The slide shown in the previous section is reproduced below.

(Slide 13p)
• Different kinds of trace filtering:
1. Current Privilege Level (CPL) – used to trace all of user or kernel
2. PML4 Page Table – used to trace a single process
3. Instruction Pointer – used to trace a particular slice of code (or module)
• Two types of output logging:
1. Single Range
2. Table of Physical Addresses 

As the slide says, there are three kinds of trace filtering. For details, refer to **Intel SDM Vol 3 36.2.4 Trace Filtering**. The address filter discussed here corresponds to the third kind, the Instruction Pointer (IP) filter, which traces only the specified regions and was added to the Linux kernel by a patch.

The IP filter is configured by pt_config_filters (arch/x86/events/intel/pt.c#L442), which is called from pt_config, which in turn is called from pt_event_start shown in the previous section.

static u64 pt_config_filters(struct perf_event *event)
{
	struct pt_filters *filters = event->hw.addr_filters;
	struct pt *pt = this_cpu_ptr(&pt_ctx);
	unsigned int range = 0;
	u64 rtit_ctl = 0;


	perf_event_addr_filters_sync(event); //Set the latest filter information to event

	for (range = 0; range < filters->nr_filters; range++) {
		struct pt_filter *filter = &filters->filter[range];


		/* avoid redundant msr writes */
		if (pt->filters.filter[range].msr_a != filter->msr_a) {
		// Set the start address of the range to be filtered (IA32_RTIT_ADDRn_A)
			wrmsrl(pt_address_ranges[range].msr_a, filter->msr_a);
			pt->filters.filter[range].msr_a = filter->msr_a;
		}

		if (pt->filters.filter[range].msr_b != filter->msr_b) {
		// Set the end address of the range to be filtered (IA32_RTIT_ADDRn_B)
			wrmsrl(pt_address_ranges[range].msr_b, filter->msr_b);
			pt->filters.filter[range].msr_b = filter->msr_b;
		}

		rtit_ctl |= filter->config << pt_address_ranges[range].reg_off;
	}

	return rtit_ctl;
}

Up to four IP filters can be set; the actual number is reported by the CPU (PT_CAP_num_address_ranges). You just set the start address in IA32_RTIT_ADDRn_A and the end address in IA32_RTIT_ADDRn_B (n = 0-3). As with gdb's hardware breakpoints, settings that map onto a limited number of hardware registers follow a common implementation pattern: first sync the user-configured values held in software (which can change dynamically), then write them to the actual registers. It is a pattern I have seen more often than my own parents' faces.
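The "shadow copy, then write only what changed" pattern can be sketched on its own. Nothing below touches real MSRs: struct toy_hw stands in for the hardware, toy_wrmsr for wrmsrl, and the write counter exists only so that the effect of the cache can be observed.

```c
#include <stdint.h>

#define TOY_NR_RANGES 4

struct toy_range {
    uint64_t a; /* start address, like IA32_RTIT_ADDRn_A */
    uint64_t b; /* end address, like IA32_RTIT_ADDRn_B */
};

struct toy_hw {
    struct toy_range regs[TOY_NR_RANGES];   /* stand-in for the register pairs */
    struct toy_range shadow[TOY_NR_RANGES]; /* last values written */
    unsigned int writes;                    /* number of (expensive) register writes */
};

static void toy_wrmsr(struct toy_hw *hw, uint64_t *reg, uint64_t val)
{
    *reg = val;
    hw->writes++;
}

/* Bring the registers in line with the wanted ranges, skipping redundant writes. */
void toy_sync_filters(struct toy_hw *hw, const struct toy_range *want, unsigned int n)
{
    for (unsigned int i = 0; i < n && i < TOY_NR_RANGES; i++) {
        if (hw->shadow[i].a != want[i].a) {
            toy_wrmsr(hw, &hw->regs[i].a, want[i].a);
            hw->shadow[i].a = want[i].a;
        }
        if (hw->shadow[i].b != want[i].b) {
            toy_wrmsr(hw, &hw->regs[i].b, want[i].b);
            hw->shadow[i].b = want[i].b;
        }
    }
}
```

Because this runs every time tracing is (re)started on a CPU, skipping unchanged values keeps the common case, where the filters have not changed since last time, free of register writes.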

5.2. ARM CoreSight

Whereas Intel PT uses MSRs to set the filter range, CoreSight exposes its registers through a memory region (a memory-mapped interface), and all subsequent configuration is done by reading and writing this region.

etm4_enable_hw (drivers/hwtracing/coresight/coresight-etm4x.c#L107), called from etm4_enable_perf, programs the registers in this memory region.

static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
{
	rc = coresight_claim_device_unlocked(drvdata->base);

	/* Disable the trace unit before programming trace registers */
	writel_relaxed(0, drvdata->base + TRCPRGCTLR);

	/* wait for TRCSTATR.IDLE to go up */
	if (coresight_timeout(drvdata->base, TRCSTATR, TRCSTATR_IDLE_BIT, 1))
		dev_err(etm_dev,
			"timeout while waiting for Idle Trace Status\n");

	writel_relaxed(config->pe_sel, drvdata->base + TRCPROCSELR);
	writel_relaxed(config->cfg, drvdata->base + TRCCONFIGR);
	...
}

writel_relaxed(val, addr) is a macro that writes the 32-bit value val to the memory-mapped address addr; on ARM it boils down to a str instruction without the memory barrier that writel adds.
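In user-space terms, a memory-mapped register write is just a store through a volatile pointer at base plus offset. The sketch below imitates the shape of the sequence above against an ordinary array. toy_writel_relaxed and toy_program_unit are hypothetical names, the offsets are assumed for illustration, and the final write that turns the unit back on is my assumption about how such a sequence ends.

```c
#include <stdint.h>

/* Register offsets within the block (assumed for this illustration). */
#define TOY_TRCPRGCTLR 0x004
#define TOY_TRCCONFIGR 0x010

/* A 32-bit store that the compiler may neither drop nor merge. */
static inline void toy_writel_relaxed(uint32_t val, volatile void *addr)
{
    *(volatile uint32_t *)addr = val;
}

/* Disable the unit, program it, then enable it again. */
void toy_program_unit(volatile uint8_t *base, uint32_t cfg)
{
    toy_writel_relaxed(0, base + TOY_TRCPRGCTLR);   /* disable before programming */
    toy_writel_relaxed(cfg, base + TOY_TRCCONFIGR); /* write the configuration */
    toy_writel_relaxed(1, base + TOY_TRCPRGCTLR);   /* enable again */
}
```

In the kernel the base pointer comes from mapping the device's physical register block, whereas here any suitably aligned buffer can play that role, which also makes the sequence easy to exercise without hardware.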

The address filters on CoreSight are called address comparators. Since this feature depends on how the chip is implemented, the way to access it seems to differ from one device to another. I will add more about this later.

The comparator is set in etm4_set_event_filters.

static int etm4_set_event_filters(struct etmv4_drvdata *drvdata,
				  struct perf_event *event)
{
	/* Sync events with what Perf got */
	perf_event_addr_filters_sync(event);

	for (i = 0; i < filters->nr_filters; i++) {
		struct etm_filter *filter = &filters->etm_filter[i];
		enum etm_addr_type type = filter->type;

		/* See if a comparator is free. */
		comparator = etm4_get_next_comparator(drvdata, type);//Get the currently available comparator

		switch (type) { //Set filters for each type of comparator.
		case ETM_ADDR_TYPE_RANGE:
			etm4_set_comparator_filter(config,
						   filter->start_addr,
						   filter->stop_addr,
						   comparator);
			break;
		case ETM_ADDR_TYPE_START:
		case ETM_ADDR_TYPE_STOP:
			/* Get the right start or stop address */
			address = (type == ETM_ADDR_TYPE_START ?
				   filter->start_addr :
				   filter->stop_addr);

			/* Configure comparator */
			etm4_set_start_stop_filter(config, address,
						   comparator, type);

			break;
		default:
			ret = -EINVAL;
			goto out;
		}
	}
}

6. Conclusion

CPU processor traces are widely applied to reverse engineering and vulnerability analysis, especially in the cyber security world, and are a feature that cannot be avoided. This time, I read through the processor trace handling that is built into Linux as perf events. ARM CoreSight in particular is a very new technology, so there is little up-to-date documentation. I have high expectations for the future of the community. My three days ended up being spent writing this article.

Reference

[^1] Linux Perf Tools Overview and Current Developments
[^2] CoreSight, Perf and the OpenCSD Library
[^3] How perf, ftrace works
[^4] perf Examples
[^5] Programming and Performance Visualization Tools
[^6] Enhance performance analysis with Intel Processor Trace.
[^7] Harnessing Intel Processor Trace on Windows for Vulnerability Discovery (HITB17)
[^8] Hardware Assisted Tracing on ARM with CoreSight and OpenCSD
