Ultra-Large-Scale Repository Analysis via Graph Compression | IEEE Conference Publication | IEEE Xplore


Abstract:

We consider the problem of mining the development history, as captured by modern version control systems, of ultra-large-scale software archives (e.g., tens of millions of software repositories). We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine corpora of this size. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits, harvested from more than 80 million software projects and encompassing a full mirror of GitHub. The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performance, with edge lookup times close to memory random access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.
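
To give a rough feel for why graph compression can shrink a graph of this kind, the sketch below stores one node's sorted successor list as gaps between consecutive identifiers, each written as a variable-length integer. This is only an illustration of one common building block of compressed graph formats. The abstract does not state which scheme the authors use, and the function names and the sample node identifiers here are hypothetical.

```python
from typing import List


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_successors(successors: List[int]) -> bytes:
    """Gap-encode a successor list (illustrative, not the paper's format)."""
    out = bytearray()
    prev = 0
    for node in sorted(set(successors)):
        out += encode_varint(node - prev)  # small gaps need few bytes
        prev = node
    return bytes(out)


def decompress_successors(data: bytes) -> List[int]:
    """Invert compress_successors by summing the decoded gaps."""
    result: List[int] = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            result.append(prev)
            value = 0
            shift = 0
    return result


# Hypothetical node identifiers that are numerically close to each other.
neighbours = [1_000_000, 1_000_003, 1_000_001, 1_000_010]
packed = compress_successors(neighbours)
restored = decompress_successors(packed)
```

In this toy case the four identifiers take 6 bytes instead of 16 bytes as 32-bit integers, because only the first gap is large. Real compressed graph representations combine ideas like this with further techniques, and the ratios reported in the paper should not be inferred from this sketch.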
Date of Conference: 18-21 February 2020
Date Added to IEEE Xplore: 02 April 2020
ISBN Information:
Print on Demand (PoD) ISSN: 1534-5351
Conference Location: London, ON, Canada
