Skip to Main Content
Program input syntactic structure is essential for a wide range of applications such as test case generation, software debugging, and network security. However, such important information is often not available (e.g., most malware programs make use of secret protocols to communicate) or not directly usable by machines (e.g., many programs specify their inputs in plain text or other random formats). Furthermore, many programs claim they accept inputs with a published format, but their implementations actually support a subset or a variant. Based on the observations that input structure is manifested by the way input symbols are used during execution and most programs take input with top-down or bottom-up grammars, we devise two dynamic analyses, one for each grammar category. Our evaluation on a set of real-world programs shows that our technique is able to precisely reverse engineer input syntactic structure from execution. We apply our technique to hierarchical delta debugging (HDD) and network protocol reverse engineering. Our technique enables the complete automation of HDD, in which programmers were originally required to provide input grammars, and improves the runtime performance of HDD. Our client study on network protocol reverse engineering also shows that our technique supersedes existing techniques.