CrashScope Evaluation
In this section we describe the studies conducted to evaluate CrashScope and make all of the data available for replication purposes and further research. (Please note that this section contains results not explicitly discussed in our paper due to space limitations.)
Research Questions
Study 1: Crash Detection Capability
- RQ1: What is CrashScope’s effectiveness in terms of detecting application crashes compared to other state-of-the-art Android testing approaches?
- RQ2: Does CrashScope detect different crashes compared to the other tools?
- RQ3: Are some CrashScope execution strategies more effective at detecting crashes or exceptions than others?
- RQ4: Does average application statement coverage correspond to a tool’s ability to detect crashes?
Study 2: Reproducibility & Readability
- RQ5: Are reports generated with CrashScope more reproducible than the original human-written reports?
- RQ6: Are reports generated by CrashScope more readable than the original human-written reports?
Study Descriptions
Study 1: Crash Detection Capability
In order to compare CrashScope against other state-of-the-art automated input generation tools for Android, we utilized a subset of the subject apps and tools available in the Androtest testing suite. Each tool in the suite was allowed to run for one hour on each of the remaining 61 subject apps, five times per app, whereas we ran all 12 combinations of the CrashScope strategies once on each of these apps. It is worth noting that the execution of the tools in the Androtest suite (except for the Android Monkey) cannot be bounded by criteria such as a maximum number of events. In the Androtest VMs, each tool ran on its required Android version; for CrashScope, each subject application was run on an emulator with a 1200x1920 display resolution, 2GB of RAM, a 200MB virtual SD card, and Android version 4.4.2 (KitKat). We ran the tools listed in Table 1, except Monkey, using Vagrant and VirtualBox. The Monkey tool was run for 100-700 event sequences (in increments of 100 events, for seven total configurations) on an emulator with the same settings as above, with a two-second delay between events and trackball events discarded. Each of these seven configurations was executed five times for each of the 61 subject apps, and every execution was instantiated with a different random seed (a sketch of such an invocation is given below). While Monkey is available in Androtest, the Androtest authors chose to set no delay between events, meaning that the number of events Monkey executed over the course of one hour far exceeded the number of events generated by the other tools, which would have resulted in a biased comparison to CrashScope and the other automated testing tools. To facilitate a fair comparison, we limited the number of events generated by Monkey to a range (100-700 events) that corresponds to the average number of events invoked by the other tools.

In order to give a complete picture of the effectiveness of CrashScope compared to the other tools, we report both the statement coverage achieved by each tool and the crashes it detected. Each of the subject applications in the Androtest suite was instrumented with the Emma code coverage tool, and we used this instrumentation to collect statement coverage data for each of the apps.
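To illustrate how Monkey configurations of this kind can be driven, the following minimal sketch launches each event-count/seed combination through adb. The package name, seed scheme, and log file names are hypothetical placeholders and do not reflect the exact scripts used in our evaluation.

    import subprocess

    # Hypothetical driver for the Monkey configurations described above.
    # Package name, seeds, and log paths are illustrative placeholders.
    EVENT_COUNTS = range(100, 800, 100)   # 100-700 events in increments of 100
    RUNS_PER_CONFIG = 5
    THROTTLE_MS = 2000                    # two-second delay between events

    def run_monkey(package, events, seed, log_path):
        """Run one Monkey configuration on the emulator and save its output."""
        cmd = [
            "adb", "shell", "monkey",
            "-p", package,               # restrict events to the subject app
            "-s", str(seed),             # random seed for this execution
            "--throttle", str(THROTTLE_MS),
            "--pct-trackball", "0",      # discard trackball events
            "-v", str(events),
        ]
        with open(log_path, "w") as log:
            subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)

    if __name__ == "__main__":
        for events in EVENT_COUNTS:
            for run in range(RUNS_PER_CONFIG):
                seed = events * 100 + run   # any scheme yielding distinct seeds works
                run_monkey("com.example.subjectapp", events, seed,
                           f"monkey_{events}_{run}.log")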
The underlying purpose of this study is to compare the crash detection capabilities of each of these tools and answer RQ1. However, we cannot make this comparison in a straightforward manner. CrashScope accurately detects app crashes by recognizing the standard Android crash dialog (e.g., a dialog containing the phrase “application name has stopped”). Because the other analyzed tools do not support identifying crashes at runtime, there is no fully reliable automated way to extract instances where the application crashed from the logcat alone. To obtain an approximation of the crashes detected by these tools, we parsed the logcat files generated for each tool in the Androtest VMs and isolated exception entries containing the FATAL EXCEPTION marker that were also associated with the process id (pid) of the app running during logcat collection (see the parsing sketch below). While this filters out unwanted exceptions from the OS and other processes, it unfortunately does not guarantee that each remaining exception signifies a crash caused by incorrect application logic; it could instead reflect, among other things, a crash caused by the instrumentation of the controlling tool. Therefore, in order to conduct a consistent comparison to CrashScope, the authors manually inspected the fatal exception stack traces returned by the logcat parsing, discarded duplicates and those caused by instrumentation problems, and we report the crash results of the other tools based on this pruned list. The issues encountered when parsing the results from these other tools further highlight CrashScope’s utility and the need for an automated tool that can accurately detect and, in turn, effectively report crashes in mobile apps.
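For reference, the sketch below shows one possible way to perform the logcat filtering step described above, assuming the logcat files are in the standard threadtime format (date, time, pid, tid, level, tag, message). The file paths, pid handling, and grouping of stack-trace lines are hypothetical; the exact parsing we performed may differ.

    import re
    import sys

    # Minimal sketch of the FATAL EXCEPTION filtering described above, assuming
    # logcat's "threadtime" format. Paths and pid lookup are illustrative.
    LINE_RE = re.compile(
        r"^\d{2}-\d{2}\s+\S+\s+(?P<pid>\d+)\s+\d+\s+(?P<level>\w)\s+(?P<tag>\S+?):\s(?P<msg>.*)$"
    )

    def fatal_exceptions(logcat_path, app_pid):
        """Yield candidate crash traces emitted by the app's pid."""
        capturing = False
        trace = []
        with open(logcat_path, errors="replace") as f:
            for line in f:
                m = LINE_RE.match(line)
                if not m or int(m.group("pid")) != app_pid:
                    continue
                if "FATAL EXCEPTION" in m.group("msg"):
                    if trace:
                        yield "\n".join(trace)
                    capturing, trace = True, [m.group("msg")]
                elif capturing and m.group("tag") == "AndroidRuntime":
                    trace.append(m.group("msg"))   # accumulate the stack trace
        if trace:
            yield "\n".join(trace)

    if __name__ == "__main__":
        path, pid = sys.argv[1], int(sys.argv[2])
        for i, trace in enumerate(fatal_exceptions(path, pid), 1):
            print(f"--- candidate crash {i} ---\n{trace}\n")

Traces produced this way would still need the manual inspection step described above to discard duplicates and instrumentation-induced failures.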
Table 1: Tools used in the crash detection study