CrashScope: A Practical Automated Android Testing Tool

Team Members: Kevin Moran, Mario Linares-Vásquez, Carlos Bernal-Cárdenas, Christopher Vendome, Denys Poshyvanyk

College of William & Mary --- SEMERU

 
 

Purpose

This project was created by the Software Engineering Maintenance and Evolution Research Unit (SEMERU) at the College of William & Mary, under the supervision of Dr. Denys Poshyvanyk. The major goal of the CrashScope project is to provide developers with a practical automated testing tool that is capable of exploring an application according to different strategies, generating highly detailed crash reports, and enabling automated replay of crashes via captured testing scripts.

Video Demonstration

Publications

Study Data

We provide a dataset for the empirical validation of the CrashScope tool, which includes:

  • An overview of the results presented in the paper.
  • Access to all of the data generated from running the Androtest Benchmark suite, and results of our User Study.

CrashScope

CrashScope Workflow Overview


The overall workflow of CrashScope is illustrated in the figure above. The first step is to obtain the source code of the app, either directly or through decompilation, and detect Activities (by means of static analysis) that are related to contextual features in order to target the testing of such features. In other words, CrashScope will only test certain contextual app features (e.g., wifi off) if it finds instances where they are implemented in the source code. Next, the GUI-Ripping Engine systematically executes the app using various strategies, including enabling and disabling the contextual features (if run on an emulator) at the Activities of the app identified previously. If, during the execution, uncaught exceptions are thrown or the app crashes, dynamic execution information is saved to the CrashScope database, including detailed information regarding each event performed during the systematic exploration. After the execution data has been saved to the CrashScope database, the Natural Language Report Generator parses the database and processes the information for each step of all executions that ended in a crash, generating an HTML-based natural language crash report with expressive steps for reproduction. In addition, the Crash Script Generator parses the database and extracts the relevant information for each step in a crashing execution in order to create a replayable script containing adb input commands and markers for contextual state changes. The Script Replayer is able to replay these scripts by executing the sequence of adb input commands and interpreting the contextual state change signals in order to reproduce the crash.
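To make the idea of a replayable script concrete, the sketch below shows how a script of adb input commands interleaved with contextual-state markers could be interpreted. The CONTEXT marker syntax, the script file name, and the replay_script helper are illustrative assumptions rather than CrashScope's actual script format; the adb input and svc commands themselves are standard Android tooling.

```python
import subprocess

# Hypothetical mapping from contextual-state markers to adb commands; the
# CONTEXT marker syntax is assumed for illustration and is not CrashScope's
# actual script format.
CONTEXT_COMMANDS = {
    "CONTEXT wifi off": ["adb", "shell", "svc", "wifi", "disable"],
    "CONTEXT wifi on":  ["adb", "shell", "svc", "wifi", "enable"],
}

def replay_script(path):
    """Replay a crash script: adb input events interleaved with contextual markers."""
    with open(path) as script:
        for line in script:
            line = line.strip()
            if not line:
                continue
            if line.startswith("CONTEXT"):
                subprocess.run(CONTEXT_COMMANDS[line], check=True)
            else:
                # e.g. "adb shell input tap 540 960" or "adb shell input text hello"
                subprocess.run(line.split(), check=True)

if __name__ == "__main__":
    replay_script("crash_0042.script")  # hypothetical script file name
```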

Tools used for CrashScope Implementation

Below we list the tools used in our implementation of the Contextual Feature Extractor, the GUI-Ripping Engine, and the Report Generator.

Tools used to implement the Contextual Feature Extractor:

  • APKTool: a tool for reverse engineering Android apk files.
  • Dex2jar: A conversion tool for .dex files and .class files.
  • jd-cmd: A command line Java Decompiler.
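As a rough illustration of how these three tools can fit together, the sketch below decompiles an APK and then scans the recovered Activity sources for contextual APIs. The CONTEXT_APIS keyword map, the helper names, and the exact command-line flags (which vary across apktool/dex2jar/jd-cli versions) are assumptions for illustration, not CrashScope's actual implementation.

```python
import pathlib
import subprocess

# Hypothetical map from contextual features to Android APIs whose presence in
# an Activity's decompiled source suggests the feature is exercised there.
CONTEXT_APIS = {
    "network":  ["ConnectivityManager", "WifiManager"],
    "location": ["LocationManager", "LocationListener"],
    "sensors":  ["SensorManager", "SensorEventListener"],
}

def decompile(apk, workdir):
    """Decompile an APK to Java sources via apktool, dex2jar, and jd-cli."""
    subprocess.run(["apktool", "d", apk, "-o", f"{workdir}/apktool", "-f"], check=True)
    subprocess.run(["d2j-dex2jar", apk, "-o", f"{workdir}/app.jar"], check=True)
    # jd-cli ships with jd-cmd; the output-directory flag may differ by version.
    subprocess.run(["jd-cli", f"{workdir}/app.jar", "-od", f"{workdir}/src"], check=True)
    return pathlib.Path(workdir, "src")

def contextual_activities(src_dir):
    """Map each decompiled Activity to the contextual features it appears to use."""
    hits = {}
    for java_file in src_dir.rglob("*Activity.java"):
        text = java_file.read_text(errors="ignore")
        features = [feature for feature, apis in CONTEXT_APIS.items()
                    if any(api in text for api in apis)]
        if features:
            hits[java_file.stem] = features
    return hits
```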

Tools used to implement the GUI-Ripping Engine:

  • Android Debug Bridge (adb): A universal tool for communicating with Android devices and emulators.
  • Hierarchy Viewer: A tool for examining and optimizing Android user interfaces.
  • UIAutomator: A tool that provides a set of APIs to build UI tests for Android applications and aid in interacting with GUI Components. 
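For illustration, a single GUI-ripping step might look like the following sketch, which dumps the current GUI hierarchy with UIAutomator, extracts the clickable components, and taps one of them through adb. The helper names are hypothetical and this is not CrashScope's actual ripping engine; the uiautomator dump and input tap commands are standard.

```python
import re
import subprocess
import xml.etree.ElementTree as ET

def dump_gui_hierarchy(local_path="window_dump.xml"):
    """Dump the current GUI hierarchy with uiautomator and pull it to the host."""
    subprocess.run(["adb", "shell", "uiautomator", "dump", "/sdcard/window_dump.xml"],
                   check=True)
    subprocess.run(["adb", "pull", "/sdcard/window_dump.xml", local_path], check=True)
    return local_path

def clickable_components(xml_path):
    """Yield (class, center_x, center_y) for every clickable node in the dump."""
    tree = ET.parse(xml_path)
    for node in tree.iter("node"):
        if node.get("clickable") == "true":
            x1, y1, x2, y2 = map(int, re.findall(r"\d+", node.get("bounds")))
            yield node.get("class"), (x1 + x2) // 2, (y1 + y2) // 2

def tap(x, y):
    """Send a tap event through adb, as one GUI-ripping step might."""
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)
```

A ripping engine would repeat dump, component selection, and action execution, checking after each step whether the system crash dialog has appeared.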

Tools used to implement the CrashScope Report Generator:

  • Bootstrap: HTML, CSS, and JavaScript framework for developing web applications.
  • MySQL: A robust relational database management system.
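As a sketch only: the report generator could pull the recorded steps of a crashing execution out of MySQL and render them as a Bootstrap-styled HTML list. The execution_steps table, its columns, and the connection settings below are hypothetical stand-ins; CrashScope's actual schema and report templates are not reproduced here.

```python
import mysql.connector  # provided by the mysql-connector-python package

def generate_report(execution_id):
    """Render the steps of a crashing execution as a Bootstrap-styled HTML list.

    The execution_steps table, its columns, and the connection settings are
    hypothetical; CrashScope's real database schema is not shown here.
    """
    db = mysql.connector.connect(user="crashscope", password="changeme",
                                 host="localhost", database="crashscope")
    cursor = db.cursor()
    cursor.execute(
        "SELECT step_num, action, component, activity "
        "FROM execution_steps WHERE execution_id = %s ORDER BY step_num",
        (execution_id,))
    items = "\n".join(
        f'<li class="list-group-item">Step {num}: {action} the "{component}" '
        f'component in {activity}</li>'
        for num, action, component, activity in cursor)
    db.close()
    return ('<html><head><link rel="stylesheet" href="css/bootstrap.min.css">'
            f'</head><body><ol class="list-group">{items}</ol></body></html>')
```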

Using CrashScope

*Please note that the CrashScope tool is currently under active maintenance. Therefore, testing jobs could take several days to complete.

(To register as a user, click the "I want to use CrashScope" link and register an email address and set up a password.  Then use this information to log into the tool interface.)

The tool is currently under construction; an open-source version is coming soon!


CrashScope Evaluation

In this section we provide a description of the studies conducted to evaluate CrashScope and make all of the data available for replication purposes and further research. (Please note that this section contains results not explicitly discussed in our paper due to space limitations.)

Research Questions

Study 1: Crash Detection Capability

  • RQ1: What is CrashScope’s effectiveness in terms of detecting application crashes compared to other state-of-the-art Android testing approaches?
  • RQ2: Does CrashScope detect different crashes compared to the other tools?
  • RQ3: Are some CrashScope execution strategies more effective at detecting crashes or exceptions than others?
  • RQ4: Does average application statement coverage correspond to a tool’s ability to detect crashes?

Study 2: Reproducibility & Readability

  • RQ5: Are reports generated with CrashScope more reproducible than the original human-written reports?
  • RQ6: Are reports generated by CrashScope more readable than the original human-written reports?

Study Descriptions

Study 1: Crash Detection Capability

In order to compare CrashScope against other state-of-the-art automated input generation tools for Android, we utilized a subset of the subject apps and tools available in the Androtest testing suite. Each tool in the suite was allowed to run for one hour on each of the 61 subject apps, five times, whereas we ran all 12 combinations of the CrashScope strategies once on each of these apps. It is worth noting that the execution of the tools in the Androtest suite (except for Monkey) cannot be controlled by a criterion such as a maximum number of events. In the Androtest VMs, each tool ran on its required Android version; for CrashScope, each subject application was run on an emulator with a 1200x1920 display resolution, 2 GB of RAM, a 200 MB virtual SD card, and Android version 4.4.2 (KitKat). We ran the tools listed in Table 1, except Monkey, using Vagrant and VirtualBox. The Monkey tool was run for 100-700 event sequences (in 100-event deltas, for seven total configurations) on an emulator with the same settings as above, with a two-second delay between events and trackball events discarded. Each of these seven configurations was executed five times for each of the 61 subject apps, and every execution was instantiated with a different random seed. While Monkey is an available tool in Androtest, the authors of the suite chose to set no delay between events, meaning the number of events Monkey executed over the course of one hour far exceeded the number of events generated by the other tools, which would have resulted in a biased comparison to CrashScope and the other automated testing tools. To facilitate a fair comparison, we chose to limit the number of events generated by Monkey to a range (100-700 events) that corresponds to the average number of events invoked by the other tools. In order to give a complete picture of the effectiveness of CrashScope as compared to the other tools, we report data on both the statement coverage of the tools and the crashes detected by each tool. Each of the subject applications in the Androtest suite was instrumented with the Emma code coverage tool, and we used this instrumentation to collect statement coverage data for each of the apps.
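For reference, the Monkey configurations described above can be reproduced with standard adb commands; a sketch is shown below, where the package name is a placeholder and the seed-selection scheme is an assumption.

```python
import random
import subprocess

def run_monkey(package, events, seed, throttle_ms=2000):
    """Run Android Monkey with a fixed seed, a delay between events, and no trackball events."""
    subprocess.run(
        ["adb", "shell", "monkey", "-p", package,
         "-s", str(seed),
         "--throttle", str(throttle_ms),
         "--pct-trackball", "0",
         "-v", str(events)],
        check=True)

# Seven configurations (100-700 events in deltas of 100), five runs each,
# every run seeded differently; the package name is a placeholder.
for events in range(100, 800, 100):
    for _ in range(5):
        run_monkey("org.example.subjectapp", events, random.randint(0, 2**31 - 1))
```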


The underlying purpose of this study is to compare the crash detection capabilities of each of these tools and answer RQ1. However, we cannot make this comparison in a straightforward manner. CrashScope is able to accurately detect app crashes by detecting the standard Android crash dialog (e.g., a text box containing the phrase “application name has stopped”). However, because the other analyzed tools do not support identifying crashes at runtime, there is no reliable automated manner to extract instances where the application crashed purely from the logcat. To obtain an approximation of the crashes detected by these tools, we parsed the logcat files generated for each tool in the Androtest VMs. Then, we isolated instances where exceptions occurred containing the FATAL EXCEPTION marker that were also associated with the process id (pid) of the app running during the logcat collection. While this filters out unwanted exceptions from the OS and other processes, it unfortunately does not guarantee that the exceptions signify a crash caused by incorrect application logic; a fatal exception could instead signify, among other things, a crash caused by the instrumentation of the controlling tool. Therefore, in order to conduct a consistent comparison to CrashScope, the authors manually inspected the instances of fatal exception stack traces returned by the logcat parsing, discarding duplicates and those caused by instrumentation problems, and we report the crash results of the other tools from this pruned list. The issues encountered when parsing the results from these other tools further highlight CrashScope’s utility, and the need for an automated tool that can accurately detect and, in turn, effectively report crashes in mobile apps.
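A sketch of this logcat filtering step is given below. It assumes a 'threadtime'-formatted logcat capture (where the pid is the third field); the helper name and the heuristic for grouping stack-trace lines are illustrative assumptions rather than the exact scripts we used.

```python
import sys

def fatal_exceptions_for_pid(logcat_path, app_pid):
    """Group AndroidRuntime 'FATAL EXCEPTION' stack traces belonging to one pid.

    Assumes a 'threadtime'-formatted logcat capture, where the pid is the
    third whitespace-separated field; other logcat formats place it elsewhere.
    """
    app_pid = str(app_pid)
    traces, current = [], None
    with open(logcat_path, errors="ignore") as log:
        for line in log:
            fields = line.split()
            pid = fields[2] if len(fields) > 2 and fields[2].isdigit() else None
            if "FATAL EXCEPTION" in line and pid == app_pid:
                current = [line.rstrip()]       # start of a new crash trace
                traces.append(current)
            elif current is not None and pid == app_pid and "AndroidRuntime" in line:
                current.append(line.rstrip())   # continuation of the trace
            else:
                current = None                  # unrelated line ends the trace
    return traces

if __name__ == "__main__":
    for trace in fatal_exceptions_for_pid(sys.argv[1], sys.argv[2]):
        print("\n".join(trace), end="\n\n")
```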

Table 1: Tools used in Crash Detection study

Tool Name              | Android Version | Tool Type
Monkey                 | any             | Random
A3E Depth First Search | any             | Systematic
GUI Ripper             | any             | Model-Based
Dynodroid              | v2.3            | Random-Based
PUMA                   | v4.1+           | Random-Based

Study 2: Reproducibility & Readability of Reports

To identify the crashes used for this study, we manually inspected the issue trackers of apps hosted on F-Droid, looking for reports that described an app crash. Then, we ran CrashScope on the version of the app against which the crash was reported, to observe whether CrashScope was able to capture the crash on the same emulator configuration as in the previous study. While we chose these bugs manually, the goal of this study is not to measure CrashScope’s effectiveness at discovering bugs (unlike the first study).

In order to answer RQ5 and RQ6, we asked 16 CS graduate students from William & Mary (a proxy for developers) to reproduce the eight crashes (four from the original human-written reports, and four from CrashScope reports). The design matrix of this study was devised in such a way that each crash for each type of report was evaluated by four participants, no crash was evaluated twice by the same participant, eight participants saw the human-written reports first, and eight participants saw the CrashScope reports first, all in the interest of reducing bias. The system names were also anonymized (CrashScope as “System A” and the human-written reports as “System B”). During the study, participants recorded the time it took them to reproduce the crash on a Nexus 7 device for each report, with a time limit of ten minutes for reproduction. If a participant could not reproduce the bug within the ten-minute time frame or gave up trying to reproduce it, that bug was marked as non-reproducible for that participant. To mitigate the “flaky test” problem, where external factors such as network I/O, varying sensor readings, or app delays could make a crash difficult to reproduce, the authors, when manually selecting the crashes and crash reports from the online repositories, ensured that each bug was deterministically reproducible within the confines of the study environment (e.g., using the proper version of the application that contains the bug and confirming that the bug was always reproducible on the Nexus 7 tablet). Therefore, in order to answer RQ5, we measured how many crashes were successfully reproduced by the participants for each type of crash report, as well as the time it took each participant to reproduce each bug.

After the completion of the crash reproductions, we had each participant fill out a brief survey, answering questions regarding user preferences (UP) and usability (UX) for each type of bug report. We also collected information about each participant’s programming experience and familiarity with the Android platform. The UP questions were formulated based on the user experience honeycomb originally developed by Morville and were posed to participants as free-form text entry questions. The UX questions were created using statements based on the SUS usability scale by Brooke and were posed to participants in the form of a 5-point Likert scale. We quantify the user experience of CrashScope and answer RQ6 by presenting the mean and standard deviation of the scores for the responses to the Likert-based questions. The questions regarding programming experience are based on the well-accepted questionnaire developed by Feigenspan et al.
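The Likert responses are summarized exactly as reported in Table 5 below; a trivial sketch of that summary step follows, using made-up response values purely for illustration (not the study data).

```python
from statistics import mean, stdev

# Hypothetical 5-point Likert responses for one UX statement from the 16
# participants -- illustrative values only, not the actual study data.
responses = {"CrashScope reports": [4, 5, 4, 3, 4, 5, 4, 4, 3, 4, 5, 4, 4, 4, 4, 3],
             "Original reports":   [3, 3, 4, 3, 2, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 2]}

for report_type, scores in responses.items():
    print(f"{report_type}: mean={mean(scores):.2f}, std dev={stdev(scores):.2f}")
```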

Table 2: User Preference Questions

Question ID | Question
S2UP1       | What information from this type of Bug Report did you find useful for reproducing the crash?
S2UP2       | What other information (if any) would you like to see in this type of bug report?
S2UP3       | What elements did you like the most from this type of bug report?
S2UP4       | What information did you like least from this type of bug report?

Table 3: Usability Questions

Question ID | Question
S2UX1       | I think that I would like to use this type of bug report frequently.
S2UX2       | I found this type of bug report unnecessarily complex.
S2UX3       | I thought this type of bug report was easy to read/understand.
S2UX4       | I found this type of bug report very cumbersome to read.
S2UX5       | I thought the bug report was really useful for reproducing the crash.

Results

Table 4 (Study 1): Unique crashes discovered (crashes caused by instrumentation given in parentheses)

App          | A3E    | GUI Ripper | Dynodroid | PUMA  | Monkey | CrashScope
A2DP Vol     | 1      | 0          | 0         | 0     | 0      | 0
aagtl        | 0      | 0          | 1         | 0     | 1      | 0
Amazed       | 0      | 0          | 0         | 0     | 1      | 0
HNDroid      | 1      | 1          | 1         | 2     | 1      | 1
BatteryDog   | 0      | 0          | 1         | 0     | 1      | 0
Soundboard   | 0      | 1          | 0         | 0     | 0      | 0
AKA          | 0      | 0          | 0         | 0     | 1      | 0
Bites        | 0      | 0          | 0         | 0     | 1      | 0
Yahtzee      | 1      | 0          | 0         | 0     | 0      | 1
ADSDroid     | 1      | 1          | 1         | 1     | 1      | 1
PassMaker    | 1      | 0          | 0         | 0     | 1      | 1
BlinkBattery | 0      | 0          | 0         | 0     | 1      | 0
D&C          | 0      | 0          | 0         | 0     | 1      | 0
Photostream  | 1      | 1          | 1         | 1     | 1      | 0
AlarmKlock   | 0      | 0          | 1         | 0     | 0      | 0
Sanity       | 1      | 1          | 0         | 0     | 0      | 0
MyExpenses   | 0      | 0          | 1         | 0     | 0      | 0
Zooborns     | 0      | 0          | 0         | 0     | 0      | 2
ACal         | 1      | 2          | 2         | 0     | 1      | 1
Hotdeath     | 0      | 2          | 0         | 0     | 0      | 1
Total        | 8 (21) | 9 (5)      | 9 (6)     | 4 (0) | 12 (1) | 8 (0)

Figure 1 (Study 1): Code Coverage Results

Table 5 (Study 2): User Experience Results

Question                                                                     | CrashScope Mean | CrashScope Std Dev | Original Mean | Original Std Dev
S2UX1: I think I would like to have this type of bug report frequently.      | 4.00            | 0.89               | 3.06          | 0.77
S2UX2: I found this type of bug report unnecessarily complex.                | 2.81            | 1.04               | 2.13          | 0.96
S2UX3: I thought this type of bug report was easy to read/understand.        | 4.00            | 0.82               | 3.00          | 0.97
S2UX4: I found this type of bug report very cumbersome to read.              | 2.50            | 1.10               | 2.44          | 0.81
S2UX5: I thought the bug report was really useful for reproducing the crash. | 4.13            | 0.62               | 3.44          | 0.89

Answers to Research Questions

RQ1: CrashScope is about as effective at detecting crashes as the other tools. Furthermore, our approach reduces the burden on developers by reducing the number of “false” crashes caused by instrumentation and providing detailed crash reports.

RQ2: The varying strategies of CrashScope allow the tool to detect different crashes compared to those detected by other approaches.

RQ3: Different combinations of CrashScope strategies were more effective than others, suggesting the need for multiple testing strategies encompassed within a single tool.

RQ4: Higher statement coverage of an automated mobile app testing tool does not necessarily imply that tool will have effective fault-discovery capabilities.

RQ5: Reports generated by CrashScope are about as reproducible as human-written reports extracted from open-source issue trackers.

RQ6: Reports generated by CrashScope are more readable and useful from a developer’s perspective compared to human-written reports.

CrashScope Crash Reports from Study 2

Study Datasets

Click the button below to download our dataset in .csv and .xlsx format. If you need a viewer for the .xlsx version of the dataset, you can download LibreOffice (free), OpenOffice (free), or Microsoft Excel (paid).