Dataset of Coronavirus Content From Instagram With an Exploratory Analysis

The novel coronavirus (COVID-19) pandemic is drastically shaping and reshaping many aspects of our lives, with a huge impact on our social life. In this era of lockdown policies in most major cities around the world, we see a huge increase in the engagement of both the public and professionals in social media. Online Social Networks are playing an important role in news propagation as well as keeping people in contact. At the same time, social media is both a blessing and a curse, as the coronavirus infodemic has become a major concern and is already a topic that needs special attention and further research. In this study, we publish a multilingual coronavirus (COVID-19) Instagram dataset that we collected continuously during the first wave of the pandemic, from 5 January 2020 to 30 May 2020. The dataset contains 25.7K posts, 829K comments, and 3.2M likes on various subjects from different publishers such as ‘public accounts’, ‘fake accounts (bots)’, ‘news agencies’, ‘influencers’, ‘celebrities’, ‘business pages’, etc. In addition to the dataset, this paper provides an analysis of the behaviour of the publishers. We study the behavioural aspects of the users in terms of their engagement, use of hashtags, activities, and reactions, and provide a full analysis of the published content related to COVID-19. We believe this contribution helps the research community to better understand the dynamics behind this phenomenon on Instagram, one of the major social media platforms.


I. INTRODUCTION
A. THE APPEARANCE OF THE CORONAVIRUS
The novel coronavirus (COVID-19) was declared a pandemic by the World Health Organisation (WHO) on 11 March 2020 (https://tinyurl.com/WHOPandemicAnnouncement). Since then, the world has experienced almost 3 million cases. To mitigate its spread, many governments have imposed unprecedented social distancing measures that have led to millions becoming housebound. This has resulted in a flurry of research activity surrounding both understanding and countering the outbreak [1]. As part of this, social media has become a vital tool in disseminating public health information and maintaining connectivity amongst people. Several recent studies have relied on Twitter data to better understand this [2]-[5]. These have primarily focused on health-related (mis)information, but there have also been studies into online hate [6]. Despite this, there has been only limited exploration of other social modalities, such as image content.
The associate editor coordinating the review of this manuscript and approving it for publication was Yassine Maleh.

B. THE FIRST CORONAVIRUS WAVE
The World Health Organization published the first disease outbreak news on 5 January 2020 [7], containing risk assessment and advice [8]. On 9 March 2020, Italy declared a nationwide lockdown, which was the first in Europe. France reported over 10,000 coronavirus cases on 19 March, and then, on 23 March, Boris Johnson announced a UK-wide partial lockdown. On 26 March, the United States officially became the country hardest hit by the pandemic [9]. Table 1 summarizes the important events that happened during the first outbreak of the pandemic between January 2020 and March 2020. In this study, we mark and use these critical events in our analysis. Our work is underpinned by a large-scale dataset from Instagram. We started collecting COVID-19 content on Instagram on 5 January 2020 and continued until 30 May 2020. During these months, we saw a series of important events in the world. The virus emerged in the city of Wuhan in China, followed by high levels of hospitalisation. The virus then spread into Middle Eastern countries such as Iran, and then the outbreak began in Europe. Later, the World Health Organization (WHO) announced a global pandemic, and many countries went into full or partial lockdown. Eventually, the situation eased and countries relaxed their lockdowns, albeit with some then experiencing a second wave. All these events spanned just five months.

C. ROADMAP & CONTRIBUTIONS
In this study, we present and explore an Instagram dataset covering several major keywords and hashtags related to COVID-19 during the first wave of the pandemic (Table 2). We hope that this study can help support a number of use cases. The main contributions of this study are:
• We present a large dataset of COVID-19 related content published on Instagram, including metrics such as post content and reactions across various communities (Section III).
• We perform a preliminary analysis of posts and their publishers with the goal of distinguishing published content by humans and machines (Section IV). We observe that 6.9% of posts are distributed by bots.
• We present interesting observations across hashtags by publishers (Section V). We observe that many geographical hashtags are associated with '#quarantine'. We witness several off-topic and unrelated hashtags co-located with the main COVID-19 hashtags. We also observe 36 isolated islands in the graph of hashtags.

II. RELATED WORK
We break related work into two groups: (i) studies related to collecting COVID-19 datasets from social media, and (ii) studies related to analysing COVID-19 related contents in social media.

A. COVID-19 DATASETS
A range of COVID-19 social media datasets have been released. To date, these predominantly cover textual data (e.g. Twitter). To assist in this, Kazemi et al. [13] provide a toolbox for processing textual data related to COVID-19. In terms of data, the first effort in this direction was from the authors of [2], who provide a large Twitter dataset related to COVID-19 (by crawling major hashtags and trusted accounts). Another similar study [3] provides an Arabic Twitter dataset with a similar data collection methodology. Lopez et al. [4] provide another Twitter dataset, including geolocated tweets. There are some further efforts on providing similar datasets from Twitter [14]-[16].
Melotte et al. [17] present a more controlled and compact dataset that does not require extensive preprocessing or tweet hydration. The proposed dataset comprises tens of thousands of geotagged tweets originally collected over 255 days in 2020 across 10 metropolitan areas (in North America). Sharma et al. [5] also made a public dashboard available, summarising data across more than 5 million real-time tweets. In another study [18], to help evaluate the determinants and impact of COVID-19 at a large scale, the authors present a new dataset with socio-demographic, economic, public policy, health, pollution and environmental factors for the European Union. Akindtande et al. [19] present a dataset that investigates the magnitude of the misinformation content influencing scepticism about the novel COVID-19 pandemic in Africa; the data is collected via an electronic questionnaire from twenty-one African countries. In medicine, Sass et al. introduce the ''German Corona Consensus Dataset'' (GECCO), a uniform dataset that uses international terminologies and health IT standards to improve interoperability of COVID-19 data [20].

B. CONTENT ANALYSIS REVIEW
These Twitter datasets have been used for various lines of analysis. For example, Saire and Navarro [21] use the data to show the epidemiological impact of COVID-19 on press publications. Singh et al. [22] monitor the flow of (mis)information across 2.7M tweets and correlate it with infection rates. They find that misinformation and myths are discussed, but at a lower volume than other conversations.
To the best of our knowledge, the only paper that has covered Instagram is by Cinelli et al. [23], who analyse Twitter, Instagram, YouTube, Reddit and Gab data about COVID-19. We complement this by making a public Instagram dataset available to the community. Kudchadkar et al. [24] observed trends in concurrent #PedsICU and #COVID19 usage on Twitter, which reflect evolving information, knowledge gained, and collaborations among the global pediatric critical care community. In terms of sentiment analysis, Vijay et al. [25] analysed tweets regarding COVID-19 and its effects in India from November 2019 to May 2020. Most people started with negative tweets but, over time, shifted towards positive and neutral comments. We refer readers to [1] for a comprehensive survey of ongoing data science research related to COVID-19. Another study showed that COVID-19 misinformation on Twitter was more likely to come from unverified accounts, i.e., accounts not confirmed to be human [26]. Pennycook et al. [27] showed why people believe and share misinformation related to COVID-19 and point to a suite of interventions based on accuracy nudges that social media platforms could directly implement. They believe such interventions are easily scalable and do not require platforms to make decisions about what content to censor. The role of social media platforms in promoting COVID-19 conspiracy theories is also studied in [28]. Pang et al. [29] performed a comprehensive meta-analysis of publicly available global metabolomics datasets obtained from three countries (the United States, China and Brazil). They implemented a computational pipeline to perform consistent raw spectra processing and conducted meta-analyses at pathway levels instead of individual feature levels. Mahmoudi et al. [30] investigate the relationships between the counts of COVID-19 cases and the resulting deaths in seven countries that are severely affected by the pandemic.
In contrast to these prior works, we focus on image-based social media and explore the role that different account types play in disseminating COVID-19 related material.

III. DATA COLLECTION CAMPAIGN
A. CRAWLER ARCHITECTURE
In order to collect public Instagram content, we develop a crawler that is able to handle various tasks simultaneously. This crawler connects to Instagram via multiple channels, downloads public content concurrently, performs some NLP-based pre-processing steps, and finally stores the results in a NoSQL database. In Instagram, a reaction to a post can be active (comment) or passive (like). The crawler relies on the official Instagram APIs described in [31]. To get the public content that is tagged with a specific hashtag or keyword, we use the Instagram Hashtag Engine, which is available in [32]. This API returns public posts that have been tagged with particular hashtags. Our crawler runs on several virtual machines in parallel 24/7. Note that we do not manually filter any posts and therefore we gather all posts containing the hashtags, regardless of the specific topics discussed within. Figure 1 shows the complete architecture design of our crawler, which contains four major parts to handle data crawling: (i) API Connection Layer, (ii) Proxy Layer, (iii) Main Body, and (iv) Database Layer. The process of receiving data is as follows:
1) The API Connection Layer (Block 1 in Figure 1) connects to the official Instagram platform [31], which is currently using the Graph API. The crawler is registered as an application to be able to perform user authentication [33]. Note that there are certain rate limits on requesting information per hour [31].
2) Between the Connection Layer and the Main Body, the Proxy Layer (Block 2 in Figure 1) is responsible for handling multiple proxy IP addresses and creating multiple connection layers. This helps us to receive data at a faster rate from various IP addresses. Thus, these layers work concurrently.
3) The main body of the crawler (Block 3 in Figure 1) contains several inner modules such as the Post, Reaction, Profile, Social Connection, and Story or Live modules. Each is responsible for retrieving the content associated with its name. For example, the Post module is programmed to get Instagram posts and metadata. These modules are directly connected to a scheduler that handles time management, for example, checking daily stories, updating reactions, looking for new posts, checking highlights, and revising new social connections. Last, the Pre-Processing layer performs some basic pre-processing steps such as text cleaning, data management, language extraction, etc.
4) In the Database Layer (Block 4 in Figure 1), we store our data. We use MongoDB as the primary database and we keep the output of each module in a separate corresponding collection. For example, post content is stored in the post collection.

B. DATA COLLECTION AND PRE-PROCESSING
1) COLLECTION
On 5 January 2020, we prepared an initial list with the '#coronavirus', '#covid19', and '#covid_19' keywords. We list the complete set of tracked keywords/hashtags in Table 2. Whenever a new keyword appears, we add it to our watch list. We continuously check for new hashtags from sources [34] and [35]. For example, on 19 January 2020, we added '#corona' and '#stayhome'. By the end of January and the beginning of the lockdown in Europe, we also began to track the '#quarantine' and '#covid' tags. Using the crawler described above, we continuously iterate over this list to collect associated posts. If any of the keywords exists in a post's caption, hashtags, tagged users, location, or mentions, we consider that post to be COVID-19 related. In order to get post reactions, we revisit posts for two weeks after the initial posting to gather comments and likes.
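The matching rule described above can be illustrated with a minimal sketch. The field names in `SEARCHED_FIELDS`, the dictionary layout of a post, and the use of plain case-insensitive substring matching are assumptions made for illustration; the text does not specify how the crawler performs the comparison.

```python
from typing import Iterable

# Keywords named in the text; the complete watch list is given in Table 2.
WATCH_LIST = {"coronavirus", "covid19", "covid_19", "corona",
              "stayhome", "quarantine", "covid"}

# Post fields the text says are inspected; the key names are hypothetical.
SEARCHED_FIELDS = ("caption", "hashtags", "tagged_users", "location", "mentions")


def _as_text(value) -> str:
    """Flatten a field (string, list of strings, or None) into lowercase text."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value.lower()
    return " ".join(str(v) for v in value).lower()


def is_covid_related(post: dict, keywords: Iterable[str] = WATCH_LIST) -> bool:
    """Return True if any watch-list keyword occurs in any searched field."""
    haystack = " ".join(_as_text(post.get(field)) for field in SEARCHED_FIELDS)
    # Assumption: simple substring matching, ignoring case and a leading '#'.
    return any(keyword.lower().lstrip("#") in haystack for keyword in keywords)
```

Substring matching is deliberately permissive (for instance, 'corona' also matches 'coronavirus'), which is in line with the statement that posts are not filtered by topic.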

2) GRAPHS
We later explore the relationships between hashtags. To achieve this, we induce a graph dataset whereby hashtags that appear in posts are nodes, and edges indicate that two hashtags have appeared in the same post (at least once). We only consider hashtags between 3 and 25 characters long. We set the node weight as the frequency with which a tag is used. We later plot graphs using [36].
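A minimal sketch of this graph construction is shown below. It assumes that each post is available as a list of hashtag strings, that the 3 to 25 character limit is measured without the leading '#', and that a tag is counted once per post; none of these details are stated in the text.

```python
from collections import Counter
from itertools import combinations


def build_hashtag_graph(posts, min_len=3, max_len=25):
    """Return (node_weights, edges) for a hashtag co-occurrence graph.

    posts: iterable of hashtag lists, one list per post.
    node_weights: Counter mapping hashtag -> number of posts using it.
    edges: set of frozenset pairs that co-occur in at least one post.
    """
    node_weights = Counter()
    edges = set()
    for hashtags in posts:
        # Normalise case, drop the leading '#', and keep tags of 3 to 25 characters.
        tags = {t.lower().lstrip("#") for t in hashtags}
        tags = {t for t in tags if min_len <= len(t) <= max_len}
        node_weights.update(tags)
        # Every pair of tags in the same post is connected by an edge.
        for a, b in combinations(sorted(tags), 2):
            edges.add(frozenset((a, b)))
    return node_weights, edges
```

Edges are stored as unordered pairs in a set, which matches the "at least once" rule: repeated co-occurrence does not add further edges.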

3) BOT DETECTION
In order to identify bots, we extract and use features from studies [37]-[39]. Features are a combination of post and publisher metrics:

biography text (text), account url (text), full name (text), number of followers (numeric), number of followees (numeric), account age (numeric), number of posts (numeric), avg. received likes (numeric), avg. received comments (numeric), number of issued likes (numeric), number of issued comments (numeric), following/followee ratio (numeric), followers/post ratio (numeric), biography emoji count (numeric), biography hashtag count (numeric), biography length (numeric), verified (numeric), duplicated comments (numeric), number of followers that are bots (numeric), number of followees that are bots (numeric), post caption (text).
To build a training set, we randomly select 6K posts and manually label the profiles. Based on the metrics mentioned above, we examine each profile by hand and annotate it as a ''bot'' or ''not bot'' identity. Metrics include profile-level features (''full name, profile image, number of followers, verified, account age, etc.'') and post-level features (''received likes, received comments, post caption, etc.''). In the training set, each class has 2.1K validated samples. For all text-based features such as ''biography'', we remove all punctuation marks and stopwords and convert the text to lowercase. Words are stemmed to reduce them to their root forms. Numerical metrics are min-max normalised. Next, we train a Contextual LSTM Neural Network classifier with the same model architecture reported in [40]. In this model, both text and metadata metrics from posts and profiles are considered. First, we tokenize text metrics (e.g. biography) using the Keras Tokenizer class [41], and the result is then fed to the LSTM layer, which outputs a 64-dimension vector. We attach the numerical metadata to this vector and pass it through 2 ReLU-activated layers of sizes 128 and 64. Finally, it connects to an output layer that predicts the label. We use a random split of 80% (training set) and 20% (test set), and to avoid over-fitting we use 10-fold cross-validation. The Contextual model achieved a final accuracy of 88%, precision of 87%, recall of 87%, and F1 of 88%.
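The pre-processing and model shape described above can be sketched as follows. This is an illustration under stated assumptions, not the architecture of [40]: the stopword list is a tiny placeholder, stemming is omitted, and the embedding size, the sigmoid output, and the function and parameter names are all hypothetical. The Keras model is built inside a function so that TensorFlow is only needed when it is called.

```python
import string
from typing import List

# Tiny illustrative stopword list; the list used in the study is not specified.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on"}


def clean_text(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and drop stopwords (no stemming here)."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [tok for tok in tokens if tok not in STOPWORDS]


def min_max_normalise(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_contextual_lstm(vocab_size: int, max_len: int, n_meta: int):
    """Text branch (LSTM, 64 units) joined with numeric metadata, then 128/64 ReLU layers."""
    from tensorflow.keras import Model, layers  # third-party, imported lazily

    text_in = layers.Input(shape=(max_len,), name="text_tokens")
    meta_in = layers.Input(shape=(n_meta,), name="metadata")
    embedded = layers.Embedding(vocab_size, 64)(text_in)  # embedding size is an assumption
    text_vec = layers.LSTM(64)(embedded)                  # 64-dimension text vector
    merged = layers.Concatenate()([text_vec, meta_in])    # attach numerical metadata
    hidden = layers.Dense(128, activation="relu")(merged)
    hidden = layers.Dense(64, activation="relu")(hidden)
    out = layers.Dense(1, activation="sigmoid")(hidden)   # assumed binary bot / not-bot output
    return Model(inputs=[text_in, meta_in], outputs=out)
```

Note that `string.punctuation` covers ASCII punctuation only, so emoji in biographies would pass through `clean_text` unchanged in this sketch.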

C. CHALLENGES AND LIMITATIONS
Note that it is infeasible to collect all reactions. Hence, we set a limit of 500 comments and 500 likes per post. We monitor reactions for up to two weeks to reach this limit. In line with Instagram's Terms, Conditions, and Policies [42] as well as user privacy, we only gather publicly available data that is obtainable from Instagram. We also only rely on Instagram Posts and Reactions. We do not collect other data types such as Stories or Highlights.
In total, we have collected 829K comments and 3.2M likes from 25.7K public posts. Posts are published by 13.3K publishers. Table 3 summarizes the general statistics regarding the posts, profiles, and reactions. Each Instagram part may contain various data types. For example, in a post, there exist 'caption', 'location', 'date', 'hashtags', 'mentions', etc. In Table 4, we summarize and describe all data features. This covers five main data types: 'text', 'numeric', 'boolean', 'date', and 'binary'. We store images in a binary format.
VOLUME 9, 2021

E. ACCESS TO DATASET
This dataset is accessible through: https://github.com/kooshazarei/COVID-19-InstaPostIDs. We publish our dataset in agreement with Instagram's Terms & Conditions [42]. Thus, as it is not permissible to release the post content and reactions, we share the post IDs (known as shortcodes). Researchers can then use tools such as Instaloader [43] to rehydrate the dataset. For any further questions, please contact Koosha Zarei (koosha.zarei@telecom-sudparis.eu).
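As an illustration of how the shared shortcodes could be turned back into post metadata, the sketch below uses Instaloader's `Post.from_shortcode`. It assumes that the shortcodes are available as plain text with one shortcode per line, which is an assumption about the repository layout; the helper names and the selection of returned fields are hypothetical, and requests remain subject to Instagram's rate limits and terms.

```python
def read_shortcodes(lines):
    """Yield non-empty, de-duplicated shortcodes from an iterable of text lines."""
    seen = set()
    for line in lines:
        code = line.strip()
        if code and code not in seen:
            seen.add(code)
            yield code


def hydrate(shortcodes):
    """Yield a small dict of public metadata for each post that can still be fetched."""
    import instaloader  # third-party: pip install instaloader

    loader = instaloader.Instaloader()
    for code in shortcodes:
        try:
            post = instaloader.Post.from_shortcode(loader.context, code)
            record = {
                "shortcode": code,
                "owner": post.owner_username,
                "date_utc": post.date_utc,
                "caption": post.caption,
                "hashtags": post.caption_hashtags,
                "likes": post.likes,
                "comments": post.comments,
            }
        except instaloader.exceptions.InstaloaderException:
            continue  # post deleted, made private, or the request was refused
        yield record
```

A typical call would pass an open text file to `read_shortcodes` and feed the result to `hydrate`; posts that have since been removed are simply skipped.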

IV. CHARACTERIZING PUBLISHERS
In this section, we categorize publishers, before inspecting how their publication rates differ.

A. PUBLISHER CATEGORIES
First, we strive to understand COVID-19 related publishers. We argue that this can offer insight into how this information is generated and distributed [44]. We particularly focus on understanding how much of the generated COVID-19 information can be considered reliable [45].

1) OVERVIEW OF PUBLISHER CATEGORIES
Overall, we identify approximately 13.3K unique publishers. We observe a range of account characteristics. For example, some accounts have a high number of followers, and some represent well-known figures such as celebrities or brands. We categorize publishers into the following groups, as summarized in Table 6:

a: NEWS AGENCIES
To identify News agencies, we make a list of English-speaking agencies on Instagram using two sources [46], [47]. Then, we filter and verify more than twenty News media accounts in our dataset. All these accounts are already verified and categorized as 'media/news companies' by Instagram, and they usually have millions of followers. We list all existing News agencies in Table 5. We find that 12.2% of posts, 0.7% of unique publishers, and 26% of total reactions belong to News agencies.

b: CELEBRITIES
We also witness the existence of posts from popular singers, actors, artists, sports players, and other figures. We compile a list of popular celebrities using [48], [49] and then search for them in our dataset. We find that these celebrity accounts tend to be verified public profiles, usually with millions of followers (avg. 80M) yet few followees (avg. 230). Some of the top figures that we see are '@ladygaga', '@arianagrande', '@jlo', '@oprah', '@leonardodicaprio', '@cristiano', '@leomessi', '@serenawilliams', '@davidbeckham', '@eltonjohn', '@jenniferaniston', '@theellenshow', '@kimkardashian', '@beyonce', etc. The number of celebrity accounts is not as large as in other groups, and they usually publish more Instagram stories or live broadcasts than posts. However, they obtain a large number of reactions, especially comments, which makes them a valuable source (see Figure 5). This group holds 4.3% of all posts, 0.5% of unique publishers, and 45.2% of total reactions.

c: BUSINESS PAGES
These cover the official pages of companies on Instagram. To identify such accounts, we rely on [50], [51], and use the Instagram Category feature (as a company) [52]. Using these two resources, we extract all known business pages. We identify two types of business accounts: (i) profiles that are already verified by Instagram as business profiles [53], such as '@Nike', '@google', '@chanelofficial', etc., with hundreds of followers; and (ii) profiles that represent small businesses, which are not verified and have few followers. Business pages produce the longest captions (628 characters on average) and tag the most people (1.5 on average). Business pages hold 4.7% of total posts, 26% of total reactions, and 2% of unique publishers.

d: INFLUENCERS
Some accounts are known as ''influencers''. These are accounts that specifically attempt to influence public opinion, often in return for financial payments [40]. We filter and extract influencers based on the feature set from [40]. Influencers use the highest number of hashtags within their posts (avg. 18). This group holds 4.8% of total posts, 1.3% of unique publishers, and 0.8% of total reactions.

e: BOTS
We further identify a set of bot accounts. These refer to accounts that are computationally operated [44]. To identify bot accounts, we train a Contextual LSTM Neural Network classifier, which reaches 88% accuracy. The process of training the classifier, the feature set, and the results are explained in detail in Section III-B. Bots generate 6.9% of total posts, 2% of unique publishers, and 0.2% of total reactions.

f: PUBLIC ACCOUNTS
We refer to the rest of the publishers as ''Public Accounts''. In this category, profiles are non-verified public accounts that have from a few to millions of followers. This group holds 67.1% of total posts (the most populated group), 93.5% of unique publishers, and 1.1% of total reactions.
Validation: In order to validate the categories, we manually check each one individually. For News agencies, Business pages, and Celebrities, we examine all samples, and 100% of accounts are identified correctly. 86% of these accounts are already verified and approved by Instagram. For the influencer category, we randomly select 25% of samples and examine each by hand. 94.3% of influencers are identified correctly. To validate influencers, we use the feature set from [40]. In the bot category, we randomly select 25% of samples and examine each manually. 94.3% of bots are identified correctly. To validate bots, we use the metrics from [37]-[39]. The process of bot detection is presented in Section III-B. Note that, for the prior analysis, we remove incorrect samples from groups.

2) COMPARISON OF PUBLISHER CATEGORIES
We observe key differences among categories. Figure 3.a presents the Followers Friends Ratio (FFR) across account groups. This defines the social connectivity of an account [54]. Bot identities have an FFR ≤ 1, which means that they follow many other accounts, yet receive few followers in return. In contrast, News agency, Celebrity, and Business accounts tend to have an FFR ≥ 1. This ratio is considerably greater for News agencies and Influencers as they have millions of followers.
We also inspect the attention generated by the posts of these accounts. To inspect this, Figure 3.b plots the number of comments vs. likes received by each group. Unsurprisingly, groups that have high FFR ratios receive considerably more reactions. The first notable category is Bots, which obtains considerably less attention than the other categories (avg. 26 likes and avg. 1.2 comments). In contrast, Celebrity (avg. 1.5M likes), Business pages (avg. 162K likes), and News agency (avg. 22.4K likes) get the most attention.
Arguably, the above results may be skewed by accounts with a large number of followers, as they are more likely to obtain reactions. To control for this, Table 6 presents the average engagement rate, as a percentage of the follower count. This actually shows that Bot accounts gain the largest engagement rate (20% compared to just 2% for business pages). This may, however, be a product of the type of accounts that follow bots. For instance, bots may follow each other and automatically like posts.
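The two account-level metrics used in this comparison can be written down as a short sketch. The FFR follows the usual followers-to-followees definition; the engagement-rate formula (average likes plus average comments per post, divided by the follower count) is an assumption, since the text only states that the rate is expressed as a percentage of the follower count.

```python
def followers_friends_ratio(followers: int, followees: int) -> float:
    """FFR = followers / followees; infinite when the account follows nobody."""
    if followees == 0:
        return float("inf")
    return followers / followees


def engagement_rate(avg_likes: float, avg_comments: float, followers: int) -> float:
    """Average reactions per post as a percentage of the follower count."""
    if followers == 0:
        return 0.0
    return 100.0 * (avg_likes + avg_comments) / followers
```

For example, a hypothetical account with 50 followers that follows 500 others has an FFR below 1, the pattern described for bots, while an account with 1,000 followers averaging 90 likes and 10 comments per post has an engagement rate of 10%.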

B. PUBLICATION RATE
Next, we explore the number of posts published by each category of account, presented in Figure 4 as a time series. Here, most of the posts are published by the 'Public' category (79% of posts) followed by 'News agencies' (12.2% of posts).
Overall, we see a growing number of weekly posts. Public publishers have the highest rate, thanks to the volume of accounts in this category (79%). Similarly, the Celebrity group publishes the fewest posts (as they constitute just 0.3% of accounts). We find that these trends are also impacted by key events. The main surges occur, first, after Europe announces the pandemic on 14 March 2020 with 18.2%, followed by the USA on 26 March 2020 with 4.5%. News posts tend to be driven by dedicated coverage given to  [59].
Perhaps most interesting is the Bot category. First, they publish a large volume of posts (4.9% of posts). This is the third most active category. Second, they publish an almost fixed number of posts during this pandemic, without the fluctuations seen in other groups. For example, compared to the News Agency group, we do not witness any noticeable peaks. This may be because of the computational manner in which such accounts generate content. The same trend is also reported on Twitter in [60], which shows an uptick in the frequency of bots' tweets referencing COVID-19 in the same period; active bots sent 185K tweets and 1.4K retweets. The Business category also exhibits unique trends. Due to the national lockdowns (largely introduced in April 2020), many businesses released information via Instagram and shifted activities online. Thus, we see noticeable growth in posts during this period. We also find that Influencers, ranging from 'nano' to 'mega' [40], increase their number of posts during this period. Similar trends can be seen amongst Celebrity accounts. Note that most celebrities in our dataset are American English-speaking figures. Therefore, there is almost no content until the start of the pandemic in Europe (March 2020). The authors of [61] also reported the same trends in Twitter and Instagram posts associated with COVID-19 content.
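A weekly time series of this kind can be derived from post dates and publisher categories with a simple counter. The sketch below assumes each post is reduced to a (publication date, category) pair and groups by ISO week; the exact binning used for Figure 4 is not stated in the text.

```python
from collections import Counter
from datetime import date


def weekly_post_counts(posts):
    """Count posts per (ISO year, ISO week, category).

    posts: iterable of (publication_date, category) pairs.
    """
    counts = Counter()
    for published, category in posts:
        iso_year, iso_week, _ = published.isocalendar()
        counts[(iso_year, iso_week, category)] += 1
    return counts
```

A category that posts at a near-constant rate, as described for bots, would show roughly equal counts across consecutive weeks, whereas event-driven categories would show visible peaks.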

C. REACTIONS
We next look at the number of obtained reactions, as measured by comments and likes. Considering both, we witness some notable points: (i) In the Public category, we see a high number of comments (147K) and likes (2.2M), consistently across the whole time period. (ii) The Bot category receives the lowest number of reactions of all categories in both metrics (33K likes, 1.6K comments). Notably, both trends are constant over time, regardless of events. (iii) The News agency category receives the largest number of comments (1.8M) and likes (71M) among all. Both figures follow nearly the same fluctuations. These figures peak in March 2020, when the virus reaches Europe, the UK, and then the USA. (iv) In the Business group, we see user reactions peak at two points: 1) the first outbreak in Iran in February 2020, and 2) after the declaration of the lockdown in Europe and the USA in March 2020. Afterwards, they continue steadily (72M likes vs 32K comments). (v) In the Influencer category, we see fluctuations. However, the overall trends remain steady (73K likes and 8K comments). (vi) For Celebrities, we see a huge surge in March 2020, when the outbreak starts in Europe and the UK (122M likes vs 131K comments). Note that the reaction rate is zero before March 2020, as celebrities published no COVID-19 related posts before then. Most of the celebrities are from English-speaking countries. Another notable point is that, during the first outbreak, three categories obtain the most written comments (a total of 3.5M comments): News agency with 50%, Celebrity with 36%, and Business pages with 1.5% of total comments. Respectively, they also receive 26%, 45%, and 27% of total likes. In other words, these trusted groups potentially attract more people and reach larger audiences in Online Social Networks to distribute information, especially in critical moments such as the COVID-19 health crisis.
More investigation through the comment text can lead us to understand the behaviour of people during the global crisis.

V. CHARACTERIZING THE USE OF HASHTAGS
As a proxy for post content, we next explore the hashtags employed by publishers.

A. DISTRIBUTION
A considerable part of the content (62% of posts) is tagged with '#coronavirus'. We therefore consider this the main hashtag. The '#coronavirus' tag is used by nearly 11K unique accounts and receives more than 7M reactions (comments and likes). We plot the hashtag graph of the most used keywords in this dataset in Figure 6. Note that the process of generating graphs is presented in Section III-B. In order to observe the usage behaviour of hashtags in posts, we randomly select 5K posts (20% of data) that cover all categories. We manually check captions, images, hashtags, and publisher profiles. Among the main hashtags (Table 2), we witness some important hashtags that are connected in numerous posts: (1) We observe many geographical hashtags associated with '#quarantine', such as '#spain', '#italy', '#usa', '#china', '#colombia', '#nyc', '#trip', and '#travel', in 25.2% of posts. These tend to talk about lockdown-related situations, the spreading of the virus throughout the world, health crises in big cities, and the contamination rate in various locations. (2) Similarly, in 31% of posts with the '#staysafe' hashtag, we see the '#selflove', '#selfcare', '#fitness', '#love', '#mentalhealth', '#positivevibes', '#motivation', and '#healthylifestyle' hashtags. These tend to be tagged with general content during the first lockdown.
We next look into their temporal trends during the first wave, presented as a time series in Figure 7. This figure shows what percentage of posts are tagged with the key hashtags (Table 2), as shown in the legend. We also plot the number of deaths and new cases (on a weekly scale) as reported by the WHO. Due to space constraints, we limit ourselves to the top five COVID-19 related hashtags. The '#Coronavirus' hashtag is the most used (14K), as we crawl our dataset based on this keyword. The second most used is '#covid_19' (11.2K), which is a more academic name for the virus. The '#lockdown' (1.4K) and '#pandemic' (1K) hashtags follow nearly the same pattern from the very first days. Both tend to be attached to posts that talk about situations around lockdowns, the economic consequences, inviting people to work and study from home, and how to slow down the contamination rate. '#facemask' is another important hashtag, used to encourage people to wear face masks in 9% of posts. This trend begins to grow from April 2020, when the first outbreak started in Europe and then in the USA. We can contrast these hashtags with the death and case rates. We note that these peak on two dates, in the middle of February 2020 (China, Iran, and Italy) and in April 2020 (Europe, USA, UK, Middle East). We witness all hashtags surge with these fluctuations, especially in mid-March (by 12%), when the virus reached Europe.
Figure 8 presents the graph of all hashtags. We colour code the hashtags based on the category of accounts (each colour represents one category). If a hashtag belongs to two groups, that node gets the colour of the category that is greater in size. In this network, main hashtags such as '#Coronavirus' and '#Covid19' (Table 2) are located at the centre, surrounded by more connected nodes.
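The colouring rule for hashtags shared between categories can be sketched as below. The sketch interprets "the category that is greater in size" as the category that uses the hashtag in the most posts, which is an assumption; the input layout and function name are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_category(usages):
    """Map each hashtag to the category that uses it most often.

    usages: iterable of (hashtag, category) pairs, one per post using the tag.
    """
    per_tag = defaultdict(Counter)
    for hashtag, category in usages:
        per_tag[hashtag][category] += 1
    # most_common(1) returns the single highest-count (category, count) pair.
    return {tag: counts.most_common(1)[0][0] for tag, counts in per_tag.items()}
```

The resulting mapping can then be translated into node colours by any plotting tool, with one colour per category.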

B. HASHTAG ISLANDS
Nodes with larger sizes (frequency) are located closer to each other. We see this characteristic in hashtags used by 'Public', 'Celebrity', 'News agency', and 'Influencer' categories. That is the reason why they are located closer to the main nodes. Furthermore, they are more topic related, and hold higher connection rates with others. In contrast, interestingly, hashtags primarily used by bots (red nodes) are located far from the centre with smaller sizes and fewer node connections. We also see the same behaviour in 'Business' nodes (green nodes).
We witness 36 isolated islands. An island is a set of inter-connected hashtags that is disconnected from the main network of hashtags. Each island contains between 5 and 39 nodes. These islands have several characteristics: (i) In each island, all nodes are well-connected to each other (avg. 21 internal connections), but there is only a weak connection to external nodes (avg. 6). (ii) Island nodes are used together in the same posts, so their node sizes are equal. (iii) Islands connect to the main network node either directly or through a few nodes (max 4 connections). For instance, some islands are connected directly to the '#Coronavirus' node, while others reach it through a few intermediate nodes. (iv) Individual islands are disconnected from each other. In 21 islands (out of 36), we see no direct connection between them. This behaviour can be seen in the bot category (with 36 islands), which is presented in red nodes, and two green islands from business accounts. The same mechanism is also reported by the authors of [44], who investigate the anatomy of online misinformation networks.
In social media, one strategy to gain more visibility is to tag posts with trending topics. This has been widely reported to be used by fake identities, spam, for-sale accounts, markets, some influencers, and impersonators [62]. We witness suspicious behaviours among the categories. For example, bots and some small business pages use the Coronavirus topic to appear on Instagram's Explore page and get more attention. As a result, they appear as isolated islands with no connections to other nodes (or topics).
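The island characteristics listed above suggest one way such groups could be located automatically. The sketch below is only an assumption about a possible procedure, not the method used in this study: it sets the main hashtags aside as hubs, groups the remaining hashtags into connected components with a breadth-first search, and counts each group's internal and external connections. The function name `find_islands`, the choice of hubs, and the toy edge list are hypothetical.

```python
from collections import defaultdict, deque


def find_islands(edges, hubs):
    """Group non-hub hashtags into connected components (hypothetical helper).

    edges: iterable of (tag_a, tag_b) co-occurrence pairs.
    hubs:  set of main hashtags (e.g. '#coronavirus') that are not traversed.
    Returns a list of dicts with the component's nodes, its number of
    internal edges, and its number of edges leaving the component.
    """
    edges = list(edges)
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    seen = set(hubs)  # hubs are marked as visited so the search never enters them
    groups = []
    for start in sorted(adj):
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        internal = sum(1 for a, b in edges if a in component and b in component)
        external = sum(1 for a, b in edges if (a in component) != (b in component))
        groups.append({"nodes": component, "internal": internal, "external": external})
    return groups


# Invented toy edge list, for illustration only.
edges = [
    ("#coronavirus", "#lockdown"),
    ("#coronavirus", "#pandemic"),
    ("#lockdown", "#pandemic"),
    ("#coronavirus", "#sale"),
    ("#sale", "#shoes"),
    ("#shoes", "#sneakers"),
    ("#sale", "#sneakers"),
]
islands = find_islands(edges, hubs={"#coronavirus"})
```

On real data, the largest component would normally be the main network itself, so a further filter (for example on component size or on the ratio of internal to external connections) would be needed before calling a group an island; any such threshold is a modelling choice that the text above does not specify.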

VI. LIMITATIONS
There are some limitations in this study that we would like to highlight. (i) First, despite the large amount of information on Instagram regarding COVID-19, we were only able to collect a small fraction in the first wave. Similarly, it was not possible to collect all reactions, particularly for popular accounts that receive thousands of comments (e.g. BBC).
(ii) Furthermore, many accounts (e.g. News agencies) use Instagram Stories, which we did not collect. This means that we can only offer a lower bound on activity. (iii) We note that the 'Public Accounts' category contains the largest number of posts and accounts. This group contains all accounts that did not fall into one of the other categories. Therefore, it is likely that this group can be further subdivided and may contain a diversity of users that we do not capture. We believe further exploration can be carried out, such as identifying account types, clustering publishers, and tracking misinformation, to understand the behaviour of various entities. (iv) We also believe there are some profiles in the public category that can be considered fake identities (e.g. bots, spammers, fake protective equipment stores for COVID-19, fake health news distributors), which may require more investigation. This may naturally impact our insights.

VII. CONCLUSION & FUTURE WORK
In this study, we have targeted the first wave of the COVID-19 pandemic that occurred between 5 January 2020 and 30 May 2020. We present a multilingual Instagram dataset containing 25.7K posts, 829K comments, and 3.2M likes. We summarize our key findings as follows:
• The majority of bots in this study publish off-topic content (with regards to COVID-19). They exploit the COVID-19 hashtags to spread their content (4.9% of posts). That is why we see many isolated islands in the graph of hashtags (Section V-B). This behaviour is also reported in [26], [39], [63].
• The number of reactions to trusted publishers (Celebrities, News Agencies, and Business Pages) is 110 times higher than to unreliable publishers. This highlights the importance of trustworthy accounts in critical moments (Section IV).
• Celebrities received the greatest attention from people. Noticeably, they published the smallest number of posts (0.3%) but received the most likes (avg. 1.5M per post) and comments (avg. 16.5K per post).
• In contrast to what we expected, Influencers received very little engagement through their posts. 27.3% of influencers stick to the Coronavirus trend to gain more attention.
• In this study, 17 News companies covered the latest news of this health crisis. Despite having a larger number of followers (avg. 4.7M), we observe limited engagement (avg. 22K likes and avg. 582 comments per post).
We further believe there are a number of potential lines of future work. First, COVID-19 has triggered a number of safety measures, such as social distancing and working from home. We believe exploring people's behaviors through the lens of social media may shed light on how people have reacted to these measures. Second, we believe it is important to study the dissemination of misinformation (which our dataset could underpin). Developing techniques to overcome this problem is another potential direction. Finally, we have observed a number of bot accounts discussing and engaging on COVID-19. The identification of invalid profiles is another effective strategy in preventing untrustworthy content.