[LINUX] Machine language embedding in C language

Introduction

Recently I came across a constrained C programming challenge in a tweet: https://twitter.com/reiya_2200/status/1130761526959267841.


The first thing that comes to mind is recursion on main, but that is not very interesting, so I used the challenge as an excuse to try embedding machine language instead.

The prerequisites are x86_64 Linux and gcc; the environment used to reproduce everything here is Ubuntu 18 (WSL on Windows 10). **Even on Linux, the behavior can change significantly between environments**, so please bear that in mind.

Straightforward embedding

First the code

What I tweeted was **the following code, with the machine language embedded in the most straightforward way**.

emb1.c


#include <stdio.h>
int main(int argc,char *argv[]) {
  static char __attribute__((section(".text"))) s[]="1\xc0H\x83\xc7\bH\x8b\27H\x85\xd2t\t\xf\xbe\22\215D\20\xd0\xeb\xeb\xc3";
  return !printf("%d\n",((int(*)(void*))s)(argv));
}

When executed, it correctly prints the sum of the single-digit numbers given as arguments, as shown below.

Compile / execute


$ gcc emb1.c
/tmp/ccfyvQPW.s: Assembler messages:
/tmp/ccfyvQPW.s:3: Warning: ignoring changed section attributes for .text
$ ./a.out 4 6 4 9
23

Readers who are used to this kind of thing will not need it, but I will explain how it works anyway.

Function implementation and machine language

The approach is to implement the part that computes the sum as a function and then turn that function into machine language.

A simple implementation would look like this:

sum.c


int sum(void *pv) {
  char **pc=pv;
  int s=0;
  while ( *++pc ) {
    s+=**pc-'0';
  }
  return s;
}

The argument `pv` is assumed to receive `argv`. The `char *` array that `argv` points to is terminated by a null pointer, so it can be processed without `argc`.

If you compile this and check the machine language, it will look like this:

Compile / disassemble


$ gcc -c -Os -fno-asynchronous-unwind-tables -fno-stack-protector sum.c && objdump -SCr sum.o

sum.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <sum>:
   0:   31 c0                   xor    %eax,%eax
   2:   48 83 c7 08             add    $0x8,%rdi
   6:   48 8b 17                mov    (%rdi),%rdx
   9:   48 85 d2                test   %rdx,%rdx
   c:   74 09                   je     17 <sum+0x17>
   e:   0f be 12                movsbl (%rdx),%edx
  11:   8d 44 10 d0             lea    -0x30(%rax,%rdx,1),%eax
  15:   eb eb                   jmp    2 <sum+0x2>
  17:   c3                      retq

This means that the machine language of the function is the byte sequence 31, c0, 48, 83, ... in hexadecimal. Writing every byte that falls in the printable ASCII range as the character itself, and the remaining bytes as escape sequences, gives "1\xc0H\x83\xc7\bH\x8b\27H\x85\xd2t\t\xf\xbe\22\215D\20\xd0\xeb\xeb\xc3".

Note that gcc's -Os (size optimization) and the options that suppress extra generated output (-fno-asynchronous-unwind-tables, -fno-stack-protector) are specified in order to keep the code short. Ideally I would have liked an option that makes the code as short as possible when written as ASCII characters, but no such option exists, so this will have to do.

Machine language embedding

So now we know the machine language of the function.

It can be embedded in the code as an ordinary string.

All that remains is to treat the address of the string as the address of a function and call it. In the printf call this is `((int(*)(void*))s)(argv)`: `int(*)(void*)` is the type of this function's address (a pointer to a function that takes a `void *` argument and returns an `int`), and `s` is cast to it.

emb1.c(Repost)


#include <stdio.h>
int main(int argc,char *argv[]) {
  static char __attribute__((section(".text"))) s[]="1\xc0H\x83\xc7\bH\x8b\27H\x85\xd2t\t\xf\xbe\22\215D\20\xd0\xeb\xeb\xc3";
  return !printf("%d\n",((int(*)(void*))s)(argv));
}

However, there are two points to note, and both concern where the string `s` is placed: it is declared `static`, and it carries `__attribute__((section(".text")))` so that it ends up in the `.text` section alongside ordinary code.

Modern systems have a mechanism that prevents unexpected memory areas from being executed as code. Without the two specifiers above, trying to execute the machine language in `s` causes a SEGV. This is necessary knowledge for cases like this one, where the compiler cannot tell that the data is actually executable code.

Embedded directly in the main function

Implementation overview

However, I am a little dissatisfied with the code above, because the attribute exposes the ELF section structure.

Is there a smarter way to embed machine language? With that in mind, I wrote the following code.

emb2.c


#include <stdio.h>
int main(int argc,char *argv[]) {
  asm volatile(".string \"H\\x83\\xc6\\bH\\x8b\\x6H\\x85\\xc0t\\xf\\xf\\xbe\\0\\x8d|\\x7\\xcf\\x89|$\\f\\xeb\\xe7\\xb0\"");
  return !printf("%d\n",argc-1);
}

As you can see, it still works.

Compile / execute


$ gcc emb2.c
$ ./a.out 4 6 4 9
23

The trick and how it works

The trick itself is simple. If you put a `.string` directive inside an `asm volatile` statement, which embeds assembly code in C code, **the bytes of that string are emitted as machine language at that very spot**.

Let's disassemble the `a.out` produced by the compilation and check.

Disassemble


$ objdump -SCr a.out | sed -ne '/<main>:$/,+20p'
000000000000064a <main>:
 64a:   55                      push   %rbp
 64b:   48 89 e5                mov    %rsp,%rbp
 64e:   48 83 ec 10             sub    $0x10,%rsp
 652:   89 7d fc                mov    %edi,-0x4(%rbp)
 655:   48 89 75 f0             mov    %rsi,-0x10(%rbp)
 659:   48 83 c6 08             add    $0x8,%rsi
 65d:   48 8b 06                mov    (%rsi),%rax
 660:   48 85 c0                test   %rax,%rax
 663:   74 0f                   je     674 <main+0x2a>
 665:   0f be 00                movsbl (%rax),%eax
 668:   8d 7c 07 cf             lea    -0x31(%rdi,%rax,1),%edi
 66c:   89 7c 24 0c             mov    %edi,0xc(%rsp)
 670:   eb e7                   jmp    659 <main+0xf>
 672:   b0 00                   mov    $0x0,%al
 674:   8b 45 fc                mov    -0x4(%rbp),%eax
 677:   83 e8 01                sub    $0x1,%eax
 67a:   89 c6                   mov    %eax,%esi
 67c:   48 8d 3d a1 00 00 00    lea    0xa1(%rip),%rdi        # 724 <_IO_stdin_used+0x4>
 683:   b8 00 00 00 00          mov    $0x0,%eax
 688:   e8 93 fe ff ff          callq  520 <printf@plt>

Here, the part from 48, 83, c6, 08 at address 659 through b0, 00 at address 672 is what was embedded. The instruction at address 672 is never actually executed. It is there because `.string` appends a NUL character (00) to the embedded string, so the code is arranged to end with an instruction whose last byte is 00.

Original code

Now, about the embedded code itself.

It simply corresponds to the following processing. In other words, the running total is accumulated directly in `argc`.

Original code


#include <stdio.h>
int main(int argc,char *argv[]) {
  while ( *++argv ) { argc+=**argv-'0'-1; }
  return !printf("%d\n",argc-1);
}

By the way, under the x86_64 Linux calling convention, the first argument `argc` is passed in the rdi register (its 32-bit part is edi) and the second argument `argv` is passed in the rsi register. So the loop can simply operate on those registers directly.

Applicable assembly


 659:   add    $0x8,%rsi      #Advance argv by one element
 65d:   mov    (%rsi),%rax    #Load argv elements into rax
 660:   test   %rax,%rax      #Null pointer judgment
 663:   je     674            #Jump to immediately after the embedded part when a NULL pointer is detected
 665:   movsbl (%rax),%eax    #Read characters into eax
 668:   lea    -0x31(%rdi,%rax,1),%edi #Add to argc (edi)
 66c:   mov    %edi,0xc(%rsp) #Save edi to stack area
 670:   jmp    659            #Jump to the beginning of the embedded part
 672:   mov    $0x0,%al       #Dummy instruction

However, without optimization, the `argc` passed to printf is not taken directly from the register but from the copy previously saved on the stack. That is why the instruction at address 66c writes the updated edi register back to the stack (since rsp is rbp - 0x10 here, 0xc(%rsp) is the same slot as -0x4(%rbp)).

Conversely, when optimization is enabled, no stack slot is reserved for saving `argc`, so the instruction at address 66c corrupts the stack. The output is still as intended, but be aware that the program then segfaults on exit from `main`.

Specify optimization options


$ gcc -O3 emb2.c
$ ./a.out 4 6 4 9
23
Segmentation fault (core dumped)

In closing

How was it? I introduced this code because I thought it would give you a taste of roughly level 9 on the "corrupted C programmer" scale of levels 1 to 10.

Finally, I will also leave the main-recursion version here, which seems to be the solution the questioner expected (it does build, at least!).

Expected solution? code


#include <stdio.h>
int main(int argc,char *argv[]) {
  return argc>0 ? !printf("%d\n",main(0,argv+1)) : *argv ? **argv-'0'+main(0,argv+1) : 0;
}
