Migrate a Running Linux Process to Another Machine Using CRIU
Introduction
Process migration — the ability to move a running process from one machine to another — is a powerful capability in Linux. It enables load balancing, minimizes downtime, and enhances fault tolerance. This guide demonstrates how to achieve process migration using CRIU (Checkpoint/Restore In Userspace) with a simple example program written in C.
We’ll walk through installing CRIU, creating a program that updates a counter every second, checkpointing the program’s state, and restoring it on either the same machine or another one. Special attention is given to handling terminal-based processes using the --shell-job option.
Installing CRIU on Ubuntu
CRIU is available via Ubuntu’s repositories, but the latest version can be accessed by adding its official PPA.
Step 1: Install add-apt-repository (if missing)
If add-apt-repository is not installed, install it using:
sudo apt update
sudo apt install software-properties-common -y
Step 2: Add the CRIU PPA
Add the CRIU PPA to ensure you get the latest version:
sudo add-apt-repository ppa:criu/ppa
Step 3: Update the Package List
Refresh the list of available packages:
sudo apt update
Step 4: Install CRIU
Install CRIU using:
sudo apt install criu -y
Step 5: Verify the Installation
Check if CRIU is installed correctly by running:
criu --version
You should see the installed version, e.g., Version: 3.17.
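For scripted setups, the version check can be automated. The sketch below assumes that `criu --version` prints a line of the form `Version: 3.17`, as in the example above, and that GNU `sort -V` is available (it is on Ubuntu). The helper names `parse_criu_version` and `version_at_least` are illustrative and not part of CRIU.

```shell
# Hypothetical helpers for checking the installed CRIU version.

# Extract the version number from text such as "Version: 3.17".
parse_criu_version() {
    printf '%s\n' "$1" |
        sed -n 's/^Version:[[:space:]]*\([0-9][0-9.]*\).*/\1/p' |
        head -n 1
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (the minimum version 3.15 is an arbitrary illustration):
#   installed="$(parse_criu_version "$(criu --version)")"
#   version_at_least "$installed" 3.15 || echo "CRIU $installed is older than expected" >&2
```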
Step 6: Test Compatibility
Verify that your system supports CRIU:
sudo criu check
If everything is compatible, you’ll see:
Looks good.
Creating the Example Program
We’ll create a simple program that updates a counter every second and prints the current time. This program serves as an ideal example for demonstrating CRIU’s checkpoint and restore features.
C Code: Counter Updating Every Second
Save the following code as demo.c:
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <signal.h>
#include <time.h>

// Flags written by the signal handler; sig_atomic_t is safe to modify there
static volatile sig_atomic_t running = 1;
static volatile sig_atomic_t last_signal = 0;

// Signal handler: only record the signal, because printf() is not async-signal-safe
void handle_signal(int sig) {
    last_signal = sig;
    running = 0; // Stop the loop
}

int main(void) {
    // Register signal handler
    signal(SIGTERM, handle_signal);
    signal(SIGINT, handle_signal);
    printf("Process started with PID: %d\n", (int)getpid());
    printf("Updating every second...\n");
    // Counter variable
    unsigned long long counter = 0;
    // Loop until a signal clears the flag, updating the counter every second
    while (running) {
        time_t now = time(NULL);
        printf("Counter: %llu | Time: %s", counter++, ctime(&now));
        fflush(stdout); // Ensure the output is immediately printed
        sleep(1); // Wait for 1 second
    }
    if (last_signal)
        printf("\nProcess received signal %d, exiting gracefully.\n", (int)last_signal);
    printf("Process stopped. Final counter value: %llu\n", counter);
    return 0;
}
Step 1: Compile and Run the Program
1. Compile the program:
gcc demo.c -o demo
2. Run it:
./demo
You’ll see output similar to this:
Process started with PID: 252114
Updating every second...
Counter: 0 | Time: Tue Nov 19 18:50:10 2024
Counter: 1 | Time: Tue Nov 19 18:50:11 2024
Counter: 2 | Time: Tue Nov 19 18:50:12 2024
Counter: 3 | Time: Tue Nov 19 18:50:13 2024
Counter: 4 | Time: Tue Nov 19 18:50:14 2024
Counter: 5 | Time: Tue Nov 19 18:50:15 2024
Step 2: Checkpoint the Process
Processes attached to a terminal, like our example, need the --shell-job option so that CRIU can handle the terminal-related state. The following one-liner creates a directory named after the PID printed in Step 1 (it will be a different number on your machine), has CRIU dump everything related to that process into the directory, and finally kills the process. CRIU generally needs root privileges, so prefix the criu command with sudo if you are not running as root.
mkdir -p 252114 && criu dump -t 252114 -D ./252114 --shell-job
Back in the terminal where the process was running, we can see that the last counter value printed before termination was 9:
Counter: 6 | Time: Tue Nov 19 18:50:16 2024
Counter: 7 | Time: Tue Nov 19 18:50:17 2024
Counter: 8 | Time: Tue Nov 19 18:50:18 2024
Counter: 9 | Time: Tue Nov 19 18:50:19 2024
Killed
Step 3: Restore the Process
1. Restore on the Same Machine: Use CRIU to restore the process:
criu restore -D ./252114/ --shell-job
Counter: 10 | Time: Tue Nov 19 18:55:59 2024
Counter: 11 | Time: Tue Nov 19 18:55:00 2024
Counter: 12 | Time: Tue Nov 19 18:55:01 2024
Counter: 13 | Time: Tue Nov 19 18:55:02 2024
As we can see, the program continued exactly where it left off. Note, however, that the timestamps jump by several minutes, because the process was restored some minutes after the checkpoint. I have not verified this, but it suggests that programs whose calculations depend on elapsed time may misbehave after a restore, or at worst silently produce wrong results, so use checkpointing with caution for such workloads.
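One way a time-sensitive program could cope with being frozen is to compare the wall-clock time that actually passed between two iterations with the time it expected to pass. The sketch below is only an illustration of that idea, assuming a loop that ticks once per second like the demo. The function name `wall_clock_gap` and the 5-second threshold in the usage comment are arbitrary choices.

```shell
# Print how many seconds passed beyond the expected interval between two
# wall-clock readings given in epoch seconds (e.g. from `date +%s`).
# Prints 0 when the loop ran on schedule; a negative value means the
# clock moved backwards (for example, a target machine with a skewed clock).
wall_clock_gap() {
    local prev="$1" now="$2" expected="$3"
    printf '%s\n' "$(( now - prev - expected ))"
}

# Example usage inside a one-second loop:
#   prev=$(date +%s); sleep 1; now=$(date +%s)
#   gap=$(wall_clock_gap "$prev" "$now" 1)
#   [ "$gap" -gt 5 ] && echo "paused for about ${gap}s; resynchronise state" >&2
```

A program that detects such a gap can then decide whether to recompute, resynchronise, or abort instead of continuing with stale timing assumptions.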
2. Restore on a remote machine: Transfer the checkpoint files to the remote machine using scp or rsync, then run the same restore command there. Be aware of the following:
- System Compatibility: Both machines must have compatible architectures, libraries, and kernel versions.
- Open Resources: Processes with active network connections or file dependencies may require additional options (e.g., --tcp-established).
- Terminal Dependencies: Processes interacting with terminals must use the --shell-job option.
- Downtime: The checkpoint and restore processes introduce some downtime, especially for large memory states.
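Put together, the checkpoint, transfer, and restore steps might look like the sketch below. It only prints the commands instead of running them, so they can be reviewed first. The function name `print_migration_plan`, the host `user@remote-host`, and the remote directory name `criu-<pid>` are placeholders. The sketch assumes rsync and ssh access to the target machine, CRIU installed on both sides, and sudo for the criu calls. It also assumes the demo executable exists at the same path on the target, since CRIU expects the files a process uses to be present there.

```shell
# Print (do not execute) the commands for migrating process $1 to host $2.
# All names and paths here are illustrative placeholders.
print_migration_plan() {
    pid="$1"
    remote="$2"
    dir="./$pid"
    echo "mkdir -p $dir"
    echo "sudo criu dump -t $pid -D $dir --shell-job"
    echo "rsync -a $dir/ $remote:criu-$pid/"
    # ssh -t allocates a terminal, which a --shell-job restore needs
    echo "ssh -t $remote sudo criu restore -D criu-$pid --shell-job"
}

# Example usage:
#   print_migration_plan 252114 user@remote-host
```

Printing the plan first is a deliberate choice: the dump step kills the process, so it is worth checking the PID and destination before running anything.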
Conclusion
CRIU simplifies the task of checkpointing and restoring Linux processes, enabling powerful use cases like live migration, fault recovery, and system maintenance. In this guide, we demonstrated how to checkpoint and restore a simple C program, both on the same and on a remote machine.
While CRIU is robust, it requires caution when handling processes that depend on system time or external resources. With proper configuration and understanding of its options, CRIU can be a game-changer for developers and system administrators seeking flexible process management solutions.
Explore CRIU and experience the power of live process migration today!