Understanding regional context of World Wide Web using common crawl corpus


Abstract:

The World Wide Web has emerged as the most important and essential tool for society. Today, people rely heavily on the rich resources available on the web for communication, business, maps, social networking, and more. In addition, people seek web content in their preferred regional language besides English. The global statistics of the World Wide Web are well known; however, its regional context is poorly understood. This paper presents a large-scale web study using the Common Crawl corpus of December 2016. We examine 200+ terabytes of data with Amazon's Elastic MapReduce infrastructure. We analyze 2.87 billion web documents with respect to content type, domain, and content language. Furthermore, we explore multilingual web pages for European and Asian languages. Our results show that 97.8% of the web documents in our data are “text/html”. In addition, 57.2% of web documents contain content in the English language. Moreover, Russian-language content has a 5.7% share, which is more than any other European language. Furthermore, we found that 60.6% of web documents have content exclusively in the English language. Finally, we found that Japanese and Traditional Chinese content dominates Asian web pages, with 1.89% and 1.23% shares, respectively. To the best of our knowledge, this is the first large-scale web study to explore the language mix present in web documents.
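
The kind of tally the abstract describes (share of documents per content type, and share of documents written exclusively in English) can be illustrated with a small map/reduce-style sketch. This is not the authors' pipeline: the record layout, the field names `content_type` and `languages`, and the sample data are all assumptions made for illustration, and the code runs in memory rather than on Elastic MapReduce.

```python
from collections import Counter
from functools import reduce

# Hypothetical per-document records. The field names and values are
# assumptions for illustration, not the schema used in the paper.
SAMPLE_RECORDS = [
    {"url": "http://example.com/", "content_type": "text/html", "languages": ["en"]},
    {"url": "http://example.org/a", "content_type": "text/html", "languages": ["en", "ru"]},
    {"url": "http://example.net/b", "content_type": "text/html", "languages": ["ja"]},
    {"url": "http://example.com/c.pdf", "content_type": "application/pdf", "languages": ["en"]},
]


def map_content_type(record):
    """Map step: emit a partial count of 1 for the record's content type."""
    return Counter({record["content_type"]: 1})


def merge_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def content_type_shares(records):
    """Return the percentage of documents per content type."""
    totals = reduce(merge_counts, map(map_content_type, records), Counter())
    n_docs = sum(totals.values())
    return {ctype: 100.0 * count / n_docs for ctype, count in totals.items()}


def english_only_share(records):
    """Return the percentage of documents whose only detected language is English."""
    english_only = sum(1 for r in records if r["languages"] == ["en"])
    return 100.0 * english_only / len(records)


if __name__ == "__main__":
    print(content_type_shares(SAMPLE_RECORDS))
    print(english_only_share(SAMPLE_RECORDS))
```

Splitting the work into a per-record map step and an associative merge step is what would let the same logic be distributed across many workers; here both steps simply run in one process.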
Date of Conference: 28-30 November 2017
Date Added to IEEE Xplore: 12 March 2018
Conference Location: Johor Bahru, Malaysia

I. Introduction

The World Wide Web (WWW) has remarkably transformed the communication patterns of the modern world since its emergence. Today, daily life is overwhelmingly dependent on the WWW, as evidenced by 3.3 billion global Internet users, 17.1 billion networked devices, and 73.1 exabytes of Internet traffic per month in 2016 [1]. Similarly, the adoption of smart mobile devices and the availability of hyperscale cloud computing have enabled users to enjoy rich communication features over the web.
