Anduril (old codename: FIR) uses static causal analysis and a novel feedback-driven
algorithm to quickly search the enormous fault space for the root-cause fault
and timing.
Anduril is developed and tested under Ubuntu 18.04 to 20.04 with JDK 8.
Other systems and newer JDKs may also work.
The basic workflow of Anduril described in this README can be done on a single node.
Our experiment node uses the CloudLab c220g5 node type, which has two
Intel Xeon Silver 4114 10-core CPUs at 2.20 GHz, 192 GB ECC DDR4-2666 memory,
and one 1 TB 7200 RPM 6G SAS HD.
Git (>= 2.16.2, version control)
Apache Maven (>= 3.6.3, for Anduril compilation)
Apache Ant (>= 1.10.9, artifact testing only, for zookeeper compilation)
JDK8 (openjdk recommended)
protobuf (==2.5.0, artifact testing only, for HDFS compilation)
0. Install and configure dependencies
If you do not have root permissions, install the dependencies this way:
Rootless installation
DEP=$HOME/anduril-dep # modify this path to where you want the dependencies installed
mkdir -p $DEP && cd $DEP
wget xzvf jdk-8u301-linux-x64.tar.gz
tar xzvf openlogic-openjdk-8u422-b05-linux-x64.tar.gz
tar xzvf apache-maven-3.9.9-bin.tar.gz
tar xzvf apache-ant-1.10.14-bin.tar.gz
export PATH=$PATH:$DEP/openlogic-openjdk-8u422-b05-linux-x64/bin:$DEP/apache-maven-3.9.9/bin:$DEP/apache-ant-1.10.14/bin:$DEP/protobuf-build/bin
export JAVA_HOME=$DEP/openlogic-openjdk-8u422-b05-linux-x64
echo "export PATH=$DEP/openlogic-openjdk-8u422-b05-linux-x64/bin:$DEP/apache-maven-3.9.9/bin:$DEP/apache-ant-1.10.14/bin:$DEP/protobuf-build/bin:\$PATH" >> ~/.bashrc
echo "export JAVA_HOME=$DEP/openlogic-openjdk-8u422-b05-linux-x64" >> ~/.bashrc
Install protobuf, which is needed for HDFS compilation:
DEP=$HOME/anduril-dep # modify this path to where you want the dependencies installed
cd $DEP
cd protobuf-2.5.0/
autoreconf -f -i -Wall,no-obsolete
./configure --prefix=$DEP/protobuf-build
make -j4
make install
export PATH=$DEP/protobuf-build/bin:$PATH
echo "export PATH=$DEP/protobuf-build/bin:\$PATH" >> ~/.bashrc
protoc --version
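Before moving on, it can help to confirm that every tool listed in the requirements is actually visible on PATH. The following is a minimal sketch, not part of the repository; the function name `check_tool` is made up for illustration, and the tool list simply mirrors the requirements above.

```shell
# Report which of the required tools are visible on PATH.
check_tool() {
    # $1 is the command name; prints a one-line status, returns 0 if found.
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1 -> $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in git mvn ant java protoc; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing tool(s) missing"
```

Once everything is found, `git --version`, `mvn -v`, `ant -version`, `java -version`, and `protoc --version` print the versions to compare against the list above.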
1. Clone the repository
git clone
This repository contains the evaluated systems, so it is a bit large (around 3.5 GB). Make sure you have enough disk space.
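A quick way to check the free space before cloning is sketched below. The helper name `avail_gb` and the 4 GB threshold are assumptions chosen for illustration (a little headroom above the roughly 3.5 GB mentioned above).

```shell
# Print the available space, in whole GB, on the filesystem holding a directory.
avail_gb() {
    # df -P gives one line per filesystem; -k reports 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

need_gb=4   # assumption: slightly more than the repository size
have_gb=$(avail_gb .)
if [ "$have_gb" -lt "$need_gb" ]; then
    echo "only ${have_gb} GB free here, want at least ${need_gb} GB"
else
    echo "${have_gb} GB free, enough for the clone"
fi
```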
2. Run the main experiments
There are 22 cases in total. Even though some cases target the same system
(e.g. there are 4 ZooKeeper cases), their patch versions can differ
substantially, so the compilation, static analysis, and dynamic experiment
configuration differ from case to case.
2.1 Compile the system code
The first step is to compile the system code into classes so that our static
analyzer can use them. The system code is in the directory
systems/case_name. Switch to that directory and run the
compilation commands. Besides the system code, we may also need to compile the
tests in the system code directory, which serve as the workload for that case.
Since the compilation commands differ by case, we provide a script
in each case directory that you can invoke. For example:
cd systems/zookeeper-3006
We also provide a script to compile all cases:
cd systems
2.2 Find important logs
In the second step, the goal is to filter out the important log entries in the failure log.
In experiments/case_name, there is a script that runs the workload to collect the logs. We run it twice.
Then, move the two logs to ground_truth/case_name together with the failure log, named bad-run-log.txt. There is a script to filter out suspicious log entries.
# Assume there are good-run-log.txt, good-run-log-2.txt, and bad-run-log.txt
The outputs are diff_log_original.txt, diff_log_dd.txt, and diff_log_dd_set.txt in the directory ground_truth/case_name. An example of the format:
# First is the class and second is the line number
LeaderRequestProcessor 77
MBeanRegistry 128
ZooKeeperCriticalThread 48
PrepRequestProcessor 965
ClientCnxn$SendThread 1181
AppenderDynamicMBean 209
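To illustrate why two good runs are collected: an entry is only interesting if it shows up in the failure log but in neither good run. The sketch below shows that set difference on files that already hold one "Class line" pair per row. It is a simplified illustration, not the repository's filtering script, and the function name `suspicious_entries` is hypothetical.

```shell
# suspicious_entries BAD GOOD1 GOOD2
# Each input file holds one "Class line" pair per row (the format shown above).
# Prints the pairs that occur in BAD but in neither GOOD file.
suspicious_entries() {
    bad=$1; shift
    tmp=$(mktemp)
    sort -u "$@" > "$tmp"                 # union of entries seen in the good runs
    sort -u "$bad" | comm -23 - "$tmp"    # keep entries unique to the bad run
    rm -f "$tmp"
}
```

`comm -23` suppresses the lines unique to the second input and the lines common to both, so only entries unique to the bad run remain.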
2.3 Perform static analysis
Before running static analysis, compile our toolkit first.
In tool/,
mvn install -DskipTests
The scripts are in the directory tool/bin. For a case case_name, analyzer-${case_name}.sh outputs the causal graph tree.json, in the directory where you run the script, together with the instrumented class files. There is another post-processing step on the generated instrumented class files, performed by the scripts in tool/move.
For the state-of-the-art baselines, static analysis is run separately:
Static analysis of Fate
Static analysis of Crashtuner
2.4 Run dynamic experiments
2.4.1 Preparation of the experiment
All the evaluation should happen in the evaluation/case_name directory.
cd evaluation/case_name
If it is FIR:
2.4.2 Config of the experiment
The configuration file is
(Example from Artifact evaluation) FIR columns in Table II
There is one extra file called config-template. We can make the 6 corresponding configurations from it by attaching extra configuration.
For example, in zookeeper-2247, the configuration for Full Feedback can be generated from config-template.
The configuration for either Fate or Crashtuner can be generated through:
cp config-sota
You can refer to or to see what happens.
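"Attaching extra configuration" to a template amounts to copying the template and appending key=value lines. A minimal sketch of that idea follows. The helper `make_config` and the output file name in the example are hypothetical, and which keys each of the 6 policies actually needs is not specified here, so the flag in the example is only a placeholder.

```shell
# make_config TEMPLATE OUTPUT KEY=VALUE...
# Copies TEMPLATE to OUTPUT and appends each extra KEY=VALUE as its own line.
make_config() {
    template=$1; output=$2; shift 2
    cp "$template" "$output"
    for kv in "$@"; do
        printf '%s\n' "$kv" >> "$output"
    done
}

# Example (output name and chosen flag are illustrative only):
# make_config config-template my-config.properties flaky.timeFeedback=true
```

The sketch assumes the template ends with a newline; otherwise the first appended key would be glued onto the template's last line.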
2.4.3 (Optional) Prepare time table
If your configuration contains flaky.timeFeedback=true or flaky.augFeedback=true, a time table is needed.
./ # If it is in evaluation/case_name
./ > record-inject
java -jar reporter-1.0-SNAPSHOT-jar-with-dependencies.jar -t trials/ -s tree.json
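Whether this optional step applies can be checked mechanically by looking for the two flags in the configuration file. The following sketch assumes a properties-style file passed as an argument; the function name `needs_time_table` is made up for illustration.

```shell
# needs_time_table CONFIG_FILE
# Succeeds (exit status 0) if either feedback flag is set to true in the file.
needs_time_table() {
    grep -Eq '^[[:space:]]*flaky\.(timeFeedback|augFeedback)[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Example use, with the path to your configuration file as the argument:
# needs_time_table path/to/config && echo "prepare the time table first"
```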
2.4.4 Run the experiment
The driver runs the experiments and writes the trials into the trials directory. For the trial with index i, injection-$i.json records the fault injection point and $i.out records the system output.
./ num_trials
./ num_trials
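After a run, the pairing of injection-$i.json and $i.out described above can be inspected with a small loop. This is a sketch only; the function name `list_trials` is hypothetical, and it relies on nothing beyond the two file-name patterns given above.

```shell
# list_trials DIR
# For every injection-<i>.json in DIR, report whether the matching <i>.out exists.
list_trials() {
    dir=$1
    for inj in "$dir"/injection-*.json; do
        [ -e "$inj" ] || continue          # no trials yet: the glob did not match
        i=${inj##*/injection-}             # strip the directory and the prefix
        i=${i%.json}                       # strip the suffix, leaving the index
        if [ -e "$dir/$i.out" ]; then
            echo "trial $i: injection-$i.json + $i.out"
        else
            echo "trial $i: injection-$i.json (no $i.out)"
        fi
    done
}

# Example: list_trials trials
```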
2.4.5 Check the reproduction result
There are two options. If check-${case_name}.sh is in the evaluation directory, use
`check-${case_name}.sh` trials
Otherwise, the check is incorporated into our reporter framework and can be checked with
We will uniformize it soon!
3. Artifact evaluation
Table II
We need three scripts: one is for the first 6 columns, while the other two are for SOTA.
Suppose you want to get the row of case_name: copy the three scripts into the folder evaluation/case_name.
The three scripts can be run on three different machines. Before running the scripts, some fields need to be edited:
Edit the scripts
In the scripts, case_name should be changed to the name of your case. The script for the first 6 columns runs the 6 experiments shown in Table II sequentially, and p1-p6 designate how many trials each experiment lasts. For example, if you set p1 to 20, the first experiment, Full Feedback, lasts 20 trials. A rule of thumb is to set each value to two times the number in Table II. If it exceeds 2000, decrease it to 2000; otherwise the run cannot finish in one day. For the SOTA scripts, there is only one experiment, so only p1 exists.
Also note that for some cases, the three scripts are already there. You can run them directly, and they serve as good examples for running other experiments.
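The rule of thumb for p1-p6 can be written as a one-line calculation: double the Table II value and cap it at 2000. A minimal sketch follows; the function name `trial_budget` is made up for illustration.

```shell
# trial_budget N
# N is the trial count reported in Table II for one experiment.
# Prints the suggested setting: 2 * N, capped at 2000.
trial_budget() {
    budget=$(( $1 * 2 ))
    if [ "$budget" -gt 2000 ]; then
        budget=2000
    fi
    echo "$budget"
}

trial_budget 10     # prints 20
trial_budget 1500   # prints 2000
```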
Run the script
The scripts traverse the entire pipeline in section I, so you can just run them to get the results.
Inspect the result
The index of the first trial in which the case is reproduced will be printed in green.
Table III
Same idea as Table I. Edit and run