Palaute: An Online Text Mining Tool for Analyzing Written Student Course Feedback

Collecting student feedback is commonplace in universities. These surveys usually include both open-ended questions and Likert-type scale questions but the answers to open questions tend not to be analysed further than simply reading them. Recent research has shown that text mining and machine learning methods can be utilized to extract useful topics from masses of open student feedback. However, to our knowledge, not many off-the-shelf applications exist for processing open-ended student feedback automatically. Additionally, the use of text mining tools may not be available to all educators, as they require in-depth knowledge of text-mining, data analysis, or programming tools. To address this gap the current study presents a tool (Palaute) for analyzing written student feedback using topic modeling and emotion analysis. The utility of this tool is demonstrated with two real-life use cases: First, we analyze student feedback data collected from courses in a software engineering degree programme, and then feedback from all courses organized in a university. In our experiments, the analysis of open-ended feedback revealed that on certain software engineering course modules the workload is perceived as heavy, and on some programming courses the automatic code grader could be improved. The university-wide analysis produced indicators of good teaching quality, such as interesting courses, but also some concrete improvement points like the time given to complete course assignments. Therefore, the use of the tool resulted in actionable improvement points, which could not have been identified using only numeric feedback metrics. Based on the demonstrated utility, this paper describes the design and implementation of our open-source tool.


I. INTRODUCTION
In universities the most common way to evaluate the quality of teaching is to analyze feedback collected from the students [1]-[9]. However, student evaluations of teaching (SET) are limited at best as a measure of teaching quality. First, education research has shown that SET is not a reliable metric for teaching quality, as student ratings of teaching and student learning are not related [7], [10], [11]. Second, while feedback questionnaires usually comprise both Likert scales and open-ended questions, written feedback is often left unused [12], mostly because manual analysis is laborious [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Biju Issac.
The current study focuses on the added value provided by open-ended, written student feedback. The automatic analysis of open-ended feedback using text mining and machine learning tools is a recent trend in higher education research (see, for example, [13]-[30]). Qualitative open-ended questions have an advantage over Likert-type questions in that they allow the respondent more freedom in their answers, including answers that were not anticipated in the survey design [31]. This is especially useful in student feedback surveys, where the open-ended questions allow the respondent to point out individual pain points or positive aspects of the course. Closed-ended questions give a direction, but they provide only as much detail as is specifically asked for in the question.
The open-ended questions require human interpretation, and especially coding of the answers is a laborious task [31]. This is not an issue with low student numbers, but interpreting the feedback becomes very costly and unreasonable as the course participant count rises to hundreds or even thousands. Similarly, drawing conclusions from student feedback on an institution or organizational unit level can be difficult for the same reason.
In this study, a tool was created (Palaute: plot, analyze, learn, and understand topic emotions¹) to better address the demand for written student feedback analysis. The goal was to create a tool that would improve the workflow of addressing student feedback by summarizing and generating insights from the data. An additional benefit of using Palaute is that it can handle much larger data sets than is feasible with manual coding. This means that multiple data sets from different years of the same course can be combined and analysed easily, and that programme-wide analyses or analyses of large MOOCs can be conducted. Combining the written feedback from all of the courses of a study programme should give new, actionable insights about the health of the programme that are based on qualitative SET data.
This study contributes to the field of SET by creating a novel artefact that combines multiple SET analysis steps into a single tool. The process follows the design science research approach by providing an artefact and evaluating its usefulness. Thus, the following research questions were formulated:
RQ1: What can be learned from the written student feedback with the tool?
RQ2: How does the tool benefit the user?
The rest of this paper is organized as follows: Literature and relevant studies are presented in Section II. The research process and artefact requirements are specified in Section III. The implementation and the analysis methods used are detailed in Section IV, followed by two demonstrations in Section V. The evaluation results are discussed in Section VI. Lastly, the main takeaways from this study are summarized in Section VII.

II. RELATED WORK
Text mining has been used in education analysis as a part of the field of educational text mining [32]. The diverse approaches include online forum [33] and VLE analysis [34], modeling student teamwork [35], MOOC diagnostics [36], and extracting course improvement suggestions [18].
Student written feedback analysis, a specific branch of educational text mining, has seen attention in recent research as well. Multiple techniques have been shown to work with evaluation of teaching data, including sentiment analysis [1], [37]-[40], Latent Dirichlet Allocation (LDA) [22], [33], rule-based classification [18], and key phrase extraction [41]. Diverse tools have been created to automate the listed techniques, including Sobek [42] for text mining, Leximancer [14] for visualization, and a tool for extracting improvement suggestions [18]. Furthermore, workflows to combine text mining with qualitative approaches have been proposed, for example, by Hujala et al. [13].
¹ ''Palaute'' is also Finnish for ''feedback.''
Fewer tools aim to streamline and automate the overall analysis process. While some analysis tools have been published, few to none support an approach that combines LDA text mining with a subsequent thematic analysis and an additional sentiment analysis that adds emotional valence to the discovered themes. The tool introduced in this paper aims to address the research gap in tools that streamline multi-step processes, such as the one proposed by Hujala et al. [13]. The new tool implements a process for combining thematic analysis with LDA for analyzing themes in large student evaluation of teaching datasets, building on a line of research introduced by Finch et al. [43] and adding depth compared to analyses based solely on LDA ([18], [34], [44]).

III. ARTEFACT DESIGN GOALS
The main goal of this paper is to design and evaluate an artefact that solves a task in a problem domain using scientific principles. In the process, we follow the design science research (DSR) process, as defined by Peffers [45]. Design science research is an approach that aims to solve an issue in a specific problem domain using an iterative design process that applies the latest knowledge from related fields of science [46]. During the process, an artefact is produced and evaluated, and its validity is established by its utility in solving the issue [47]. At the same time, the process of applying and testing underlying kernel theories provides new evidence or knowledge to the state-of-the-art scientific knowledge base [48], [49].
The main goal is to address a research gap and a practical solution gap in automated support for evaluating large SET data sets in a manner that would be feasible for lecturers of large courses or directors of degree programmes, following an LDA and thematic analysis-based process originally established in [13]. Furthermore, we investigate what other kinds of analyses, such as sentiment analysis, can add value to the analysis outcomes.
The artefact design is accomplished in two stages: iterative design by researchers and SET analysts, and feedback from practitioners. The utility of the artefact will be evaluated based on the research questions laid out in Section I. Scientific rigour, which separates design science from everyday design, is accomplished by grounding the findings in established scientific literature and methods.
The goal of the artefact is to extract meaningful information from large text corpora for the user. To accomplish this, the tool must first allow the user to input the data. Then, the data must be preprocessed, analyzed and visualized for the user, so that the insights in the data can be highlighted. The varying structure of the survey instrument used in this university means that the tool must be able to handle different kinds of data, as having the user manually structure the data into a specific format would break the workflow. The functional requirements for the artefact are listed in Table 1.
VOLUME 9, 2021

IV. ARTEFACT IMPLEMENTATION
The artefact performs topic modeling, sentiment analysis and emotion analysis on data sets of varying kinds. This core functionality of the artefact is built on two R packages: STM (structural topic model) and Syuzhet. Topic modeling is done using the STM package [50]. Sentiment and emotion analysis is done using the Syuzhet package [51]. The Syuzhet package contains multiple lexicons for sentiment analysis and the NRC lexicon [52] for emotion analysis. Syuzhet also allows using custom lexicons.
The source code of Palaute is licensed under the GNU General Public License v3.0 (GPLv3) and can be found at Zenodo, an open research artefact repository. A Docker file can also be downloaded to run the tool with minimal setup. The workflow of analyzing text feedback is illustrated in Figure 1. Using the tool does not require previous knowledge about text mining. The user only needs to upload a data file (in CSV format) and select which columns should be included in the analysis.

1) TOPIC MODELING
Latent Dirichlet Allocation (LDA) is a probabilistic model frequently used in text mining. LDA assumes that each document in the corpus is a random mixture of different topics, and that a distribution over words characterizes each topic [53], [54]. In other words, the corpus contains unknown topics that are spread out over multiple documents, and a group of words characterizes each topic. Words can also belong to multiple topics with varying probabilities.
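The generative assumption behind LDA can be illustrated with a short sketch. The code below is not part of Palaute; it is a minimal illustration, assuming toy topics and Dirichlet parameters invented for this example, of how a document is viewed as a random mixture of topics and how each word is drawn from the word distribution of a sampled topic.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate one bag-of-words document under the LDA assumptions.

    topic_word: list of {word: probability} dicts, one per topic (hypothetical input).
    alpha: Dirichlet concentration parameters, one per topic.
    """
    theta = sample_dirichlet(alpha, rng)  # document-specific topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        vocabulary = list(topic_word[topic])
        weights = [topic_word[topic][w] for w in vocabulary]
        words.append(rng.choices(vocabulary, weights=weights)[0])
    return theta, words


# Toy topics; note that "course" belongs to both topics with different probabilities.
TOPICS = [
    {"workload": 0.5, "deadline": 0.3, "course": 0.2},
    {"lecture": 0.6, "interesting": 0.3, "course": 0.1},
]
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=random.Random(1))
```

Topic modeling works in the opposite direction: given only the observed words, it estimates the topic mixtures and the word distributions that plausibly generated them.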
The topic count is defined by the user beforehand, meaning LDA always generates as many topics as specified. Solutions have been proposed for finding the best number of topics, such as running LDA multiple times with different topic counts and optimizing the perplexity of the model [54], [55].
Structural topic model (STM) improves upon LDA by including document-level metadata in the analysis. In addition to taking in the bag-of-words representation of the corpus, STM can also take in document-level covariates. This means that, for example, in surveys, quantitative data like gender or age of the respondent can be included as a covariate in the model. Lucas et al. [56] and Roberts et al. [50], [57] demonstrated that including covariate information leads to better results, as the variance in topic prevalence is reduced.
Another improvement of STM over LDA is the explicit estimation of correlation between topics [56]. In other words, STM estimates how different topics relate to each other. This allows for visualization of the topic correlations, which can be helpful in getting a deeper understanding of the corpus-level structure of the topics.
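The role of the covariates and of the topic correlations can be summarized in a few lines of notation. The following is a simplified sketch, assuming the standard logistic-normal formulation of topic prevalence in STM; the symbols ($x_d$, $\Gamma$, $\Sigma$, $\theta_d$, $z_{d,n}$, $w_{d,n}$, $\beta_k$) are notation introduced here for illustration and do not appear elsewhere in this paper.

```latex
\begin{align}
\theta_d \mid x_d &\sim \mathrm{LogisticNormal}\left(\Gamma^{\top} x_d,\ \Sigma\right), \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), \\
w_{d,n} \mid z_{d,n} = k &\sim \mathrm{Multinomial}(\beta_k).
\end{align}
```

Here $x_d$ holds the covariates of document $d$, which shift its expected topic proportions $\theta_d$, and the off-diagonal entries of the covariance matrix $\Sigma$ capture the correlations between topics.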

2) SENTIMENT AND EMOTION ANALYSIS
Sentiment analysis is a text mining method used to understand the feelings or thoughts of the writer from the text [58].
Early methods categorized documents or individual sentences into positive, negative or neutral. More recent aspect-based methods categorize sentiments based on a more fine-grained spectrum [59]. For example, the NRC emotion lexicon [52] distinguishes eight sentiment categories based on the eight basic emotions.
Sentiment analysis can be done on three levels: document, sentence, and entity or aspect [60]. Documents can contain multiple different sentiments. For example, in a course evaluation survey, a student might complain about difficult group work while praising the lecturer for explaining the subject well. In this case, it is hard to assign a positive or negative sentiment to the document. This problem persists at the sentence level, as multiple differing sentiments can also be expressed in a single sentence, for example, ''The lectures were great but too long''. In this case, ''lectures were great'' is a positive sentiment, but ''lectures were too long'' is a negative sentiment, and both sentiments focus on the same target, ''lectures''. Therefore, it makes sense to analyze sentiments on the entity or aspect level; otherwise, not all of the expressed sentiments can be accurately identified [60].
In addition to sentiments, emotions, like sadness, anger and joy, can also be identified from text. Emotion analysis follows the same procedures as sentiment analysis, but emotion analysis has a different classification goal. Identifying sentiments and emotions from text are treated as separate problems, although sentiments can be identified from the emotions [61].
Tabak and Evrim [62] compared emotion lexicons and their effects on emotion analysis. These lexicons included the National Research Council Canada (NRC) word-sentiment association lexicon, EmoSenticNet (ESN), DepecheMood (DPM) and Topic based DepecheMood (TDPM). The lexicons contain different emotions and words based on those emotions. For example, NRC contains the eight emotions from Plutchik's wheel and two sentiments (positive, negative). In contrast, ESN contains six emotions (joy, sadness, disgust, anger, surprise, fear), and DPM and TDPM are built with eight emotions (happy, sad, angry, afraid, annoyed, inspired, amused, don't care). For comparison, matching emotions were selected from NRC and ESN, while DPM and TDPM were mapped to match the emotions of NRC and ESN. Overall, NRC and DPM performed the best in classifying emotions from news headlines.
After reviewing the literature, we used the emotion lexicon created at NRC by Mohammad et al. [52] and Mohammad and Turney [63], which has been translated to over 20 languages, including Finnish. The lexicon contains classifications for positive and negative sentiments and for the eight emotions (joy, trust, sadness, anger, surprise, fear, anticipation, disgust) of Plutchik's wheel [64].

B. THEMATIC ANALYSIS
The last step in the process is lightweight thematic analysis, where the practitioner (educator, administrator) or a researcher reviews the analysis outcomes and assigns one or several themes to them. Thematic analysis is a 'qualitative research method for identifying, analysing and reporting patterns (themes) within the data' [65, p. 79] and has been used widely, including in student feedback analysis [66]. It is essentially an iterative, qualitative method for reviewing data that aims toward increased abstraction.
In the Palaute system, the primary data sources for lightweight qualitative analysis are 1) the keywords of each topic and, as a novel feature, 2) the most characteristic answers for each topic. This approach presents the best of both worlds: full responses are richer in meaning than keywords [67], the analysis is based on topic probabilities as recommended by Finch et al. [43], and the algorithm-based sampling and reading allow for efficient analysis [13].
A lightweight, practitioner-oriented and partially automated thematic analysis process, as shortened from [13], proceeds as follows.
1) Reading ten to twenty most characteristic responses from each topic and topic keywords, as generated by the LDA topic-modelling process
2) Generating initial codes for each row, using either a data grounded or a theory-driven approach
3) Defining and naming themes
The STM package contains a function for calculating the topic model. The topic model can be calculated using only the documents, but there can also be metadata in the form of covariates. The first type of covariates is prevalence covariates [50]. Prevalence covariates are external data that can be used in the calculation of topic prevalence. For example, in the context of course evaluation surveys, a Likert-type question about the workload of the course can be used as a prevalence covariate.
The second type of covariates is content covariates [50]. Content covariates affect the words used in a topic, and in the current implementation of STM, content covariates create strict groups of documents so that each document can only belong to a single group.
Topical content covariates change the STM model substantially, since the documents are forced into groups [50]. In the context of course evaluation surveys, a content covariate could be based on Likert-type questions that significantly affect the vocabulary used in the topics. The survey questions themselves could also be included as a content covariate, as it would make sense that different questions are answered differently.
The artefact supports both covariate types, although, as a limitation of the STM package, there can be only one content covariate, whereas multiple prevalence covariates are supported. Each of the data columns can be mapped as a document, a prevalence covariate, or a content covariate, or be excluded from the analysis. This means that different combinations of covariates and documents can be tested without having to go to Excel or other tools to change the structure of the data manually. This also allows the tool to work without any limitations on how the columns should be ordered or named, or on how many columns there should be. Figure 2 shows what the mapping in Palaute looks like with a short example data set with six questions.
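The column mapping described above can be sketched as a small validation step. The code below is not taken from the Palaute source; it is a minimal sketch in which the role names, the function `group_columns_by_role`, and the survey column names are hypothetical, and it only enforces the constraints stated in the text (at most one content covariate, any number of prevalence covariates), plus the assumption that at least one document column is needed.

```python
ROLES = ("document", "prevalence", "content", "exclude")


def group_columns_by_role(column_roles):
    """Validate a {column name: role} mapping and group the columns by role."""
    grouped = {role: [] for role in ROLES}
    for column, role in column_roles.items():
        if role not in grouped:
            raise ValueError(f"Unknown role {role!r} for column {column!r}")
        grouped[role].append(column)
    if not grouped["document"]:
        raise ValueError("At least one column must be mapped as a document")
    if len(grouped["content"]) > 1:
        raise ValueError("Only one content covariate is supported")
    return grouped


# Hypothetical survey columns; the names are invented for this example.
mapping = {
    "What worked well?": "document",
    "What should be improved?": "document",
    "Workload (1-5)": "prevalence",
    "Overall grade (1-5)": "prevalence",
    "Course code": "content",
    "Timestamp": "exclude",
}
grouped = group_columns_by_role(mapping)
```

Because the roles are attached to column names rather than positions, the order and number of columns in the uploaded file do not matter.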
The STM package also contains tools for selecting the best model and the computationally best number of topics [50] using the semantic coherence metric [68]. Semantic coherence is related to the concept of pointwise mutual information, and it has been shown that the metric correlates well with human judgment of topic quality [68]. The semantic coherence metric is commonly used as the standard evaluation option in popular analysis libraries, including stm [50] and topicmodels [69]. The artefact contains a function that trains multiple models for each number of topics and evaluates them based on semantic coherence and exclusivity. Based on this automatic evaluation, the system automatically proposes the number of topics with the highest quality values to the user.
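To make the semantic coherence metric more concrete, the following sketch computes it for the top words of a single topic from document co-occurrence counts. This is an illustrative re-implementation in Python, not the code used by the stm package or by Palaute, and it assumes that every top word occurs in at least one document. Exclusivity, the second criterion mentioned above, is not covered here.

```python
import math


def semantic_coherence(top_words, documents):
    """Sum of log((D(v_m, v_l) + 1) / D(v_l)) over ordered pairs of top words.

    top_words: the most probable words of one topic, most probable first.
    documents: tokenized documents (lists of words).
    D(.) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_frequency(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_frequency(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_frequency(top_words[l]))
    return score


# Toy corpus: "exam" and "hard" co-occur, "exam" and "good" never do.
docs = [["exam", "hard"], ["exam", "hard"], ["lecture", "good"]]
coherent = semantic_coherence(["exam", "hard"], docs)
incoherent = semantic_coherence(["exam", "good"], docs)
```

A topic whose top words tend to appear in the same documents receives a higher (less negative or more positive) score than a topic whose top words rarely co-occur.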

2) SENTIMENT AND EMOTION ANALYSIS
Sentiment analysis and emotion analysis are performed using the NRC lexicon simply by matching the words in the data to the lexicon words and adding up the sentiment values for each matched word. This analysis does not consider the order of the words, the context of the words, negations, nor emphasis, but it should still yield a general sense of the data.
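The word-matching approach can be sketched in a few lines. The lexicon below is a toy stand-in with invented entries, not the actual NRC lexicon, and the function `score_text` is a hypothetical helper rather than the Syuzhet API; the sketch only illustrates the principle of counting lexicon matches without regard to word order or negation.

```python
from collections import Counter

# Toy lexicon with invented entries; the real NRC lexicon is far larger.
TOY_LEXICON = {
    "great": {"positive", "joy"},
    "interesting": {"positive", "anticipation"},
    "heavy": {"negative", "sadness"},
    "late": {"negative", "anger"},
}


def score_text(text, lexicon):
    """Count sentiment and emotion labels of all lexicon words found in the text."""
    counts = Counter()
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")  # crude punctuation removal
        for label in lexicon.get(word, ()):
            counts[label] += 1
    return counts


counts = score_text("The lectures were great, but the workload was heavy!", TOY_LEXICON)
negated = score_text("The lectures were not great", TOY_LEXICON)
```

In this sketch the negated sentence still counts as positive, which demonstrates the limitation described above: negations and context are ignored.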
The sentiment analysis and emotion analysis are performed on the whole data set as a summary of the corpus. For individual topics, representative documents are selected, and the sentiment and emotion analysis are run with only the selected documents. There are multiple ways of making this selection of documents, but the current implementation is that the artefact selects the documents exclusively, meaning each document is added to the corpus of the topic that has the highest prevalence in that document. Dividing the documents exclusively among the topics makes sure that each document is used in the overall analysis only once, as multiple topics sharing the same documents would make the topics more similar to each other.
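The exclusive selection of documents can be illustrated as follows. This is a minimal sketch, assuming the document-topic proportions are available as a list of rows (one row per document); the function name and the tie-breaking rule (the first topic wins) are assumptions made for this example.

```python
def assign_documents(theta):
    """Assign each document to the single topic with the highest prevalence.

    theta: list of rows, one per document, each holding the topic proportions.
    Returns {topic index: [document indices]}; ties go to the lowest topic index.
    """
    groups = {}
    for doc_index, proportions in enumerate(theta):
        best_topic = max(range(len(proportions)), key=proportions.__getitem__)
        groups.setdefault(best_topic, []).append(doc_index)
    return groups


# Three documents, three topics (toy proportions).
theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
]
groups = assign_documents(theta)
```

The sentiment and emotion analysis of a topic would then be run only on the documents listed for that topic, so that no document is counted twice.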

D. VISUALIZATION
The results of the text analysis can be visualized in multiple ways. LDA topics are visualized using LDAvis by Sievert and Shirley [70]. LDAvis uses the Jensen-Shannon divergence to calculate the inter-topic distances from the word-topic probability matrix, which is then reduced to two dimensions to be shown as a two-dimensional plot. Each topic is displayed as a circle, with the area of the circle being proportional to the topic proportion.
Palaute adds to the LDAvis plot by expressing the sentiment of the topic as a color. The inter-topic distances are calculated from the STM model's beta matrix, which contains the log values of the word probabilities by topic. As STM uses logarithmic values of the word probabilities, the exponential function must be applied to the values in the beta matrix before the inter-topic distances can be calculated.
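The distance calculation can be sketched as follows. This is an illustration in Python rather than the LDAvis or Palaute implementation; it assumes the beta matrix is given as a list of rows of log probabilities (one row per topic), exponentiates them, and computes the pairwise Jensen-Shannon divergence with natural logarithms.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def topic_distances(log_beta):
    """Pairwise divergences between topics, given log word probabilities."""
    beta = [[math.exp(v) for v in row] for row in log_beta]  # undo the logarithm
    n = len(beta)
    return [[js_divergence(beta[i], beta[j]) for j in range(n)] for i in range(n)]


# Toy beta matrix: topics 0 and 2 are identical, topic 1 uses other words.
log_beta = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.2), math.log(0.7)],
    [math.log(0.7), math.log(0.2), math.log(0.1)],
]
distances = topic_distances(log_beta)
```

The resulting matrix is symmetric with zeros on the diagonal, and it is this matrix that is subsequently reduced to two dimensions for plotting.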
The sizes of the circles are proportional to the topic proportions, but this does not mean overlapping circles should be interpreted as sharing similar words proportional to the overlap. Instead, the distance between the topics is the measure of topic similarity, meaning that nearby topics use similar vocabulary. Another important note is that since the plot is a two-dimensional representation of a higher dimensional construct, information is lost as the distances are projected two-dimensionally. Dimensional scaling is done using classical multidimensional scaling. The scaling algorithm tries to preserve the inter-topic distances when reducing dimensions, but some information is lost. So, just because two topics are close to each other, it does not necessarily mean they should be merged into one, although this may be the case. An example of this type of plot can be seen in Figure 3.
The theta matrix of the STM model contains the document-topic proportions. This matrix can be visualized to show which documents belong to which topics and how much of each document belongs to the other topics. In the artefact, this is done by creating a scatter plot of the documents, where the color of the document is based on the topic with the highest prevalence, as is the size of the circle. So, larger circles represent documents that have a larger portion dedicated to a single topic. The Barnes-Hut variant of t-distributed stochastic neighbor embedding (t-SNE) was used to dimensionally scale the data down to two dimensions [71].
Documents that have similar topic proportions cluster together in this plot. When documents are highly cohesive in the sense that they belong mainly to one topic, it causes clear clusters of documents to emerge in the plot to represent the topics. When the documents contain multiple topics more evenly, then the topics are not represented as single clusters. When the documents share similar topic proportions, they tend to share similar vocabulary, meaning semantically similar documents also cluster together. Topic labels are placed on the mathematical means of the document coordinates. The circles can be clicked, which shows that document, in addition to information about the document topic proportions. Figure 4 shows the example of topic-document relation of the data set with 12 topics.
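The placement of the topic labels can be sketched briefly. The code below is an illustration only: it assumes that the two-dimensional document coordinates (for example, from t-SNE) and the dominant topic of each document are already available, and that a label is placed at the mean of the coordinates of the documents whose dominant topic it is. The function name is hypothetical.

```python
def topic_label_positions(coordinates, dominant_topics):
    """Return {topic: (mean x, mean y)} over the documents dominated by that topic."""
    sums = {}
    for (x, y), topic in zip(coordinates, dominant_topics):
        sum_x, sum_y, count = sums.get(topic, (0.0, 0.0, 0))
        sums[topic] = (sum_x + x, sum_y + y, count + 1)
    return {
        topic: (sum_x / count, sum_y / count)
        for topic, (sum_x, sum_y, count) in sums.items()
    }


# Toy layout: two documents dominated by topic 0, one by topic 1.
coords = [(0.0, 0.0), (2.0, 2.0), (10.0, 0.0)]
labels = topic_label_positions(coords, dominant_topics=[0, 0, 1])
```

When the documents of a topic form one tight cluster, the label lands inside it; when they are scattered, the mean can fall between clusters, which is itself a hint that the topic is not cohesive.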
The artefact contains a page with detailed information about each topic. An example of this can be seen in Figure 5. A similar panel to Figure 5 is generated for each topic, and the details page contains all these panels. The user has the option to hide each of the smaller sections inside the panel using a filtering panel. There are also options for sorting the emotion analysis results in descending or alphabetical order, as it can be easier to compare topics when the results are in the same order. The sentiment is shown as a single bar. The number of shown keywords and documents can be changed by the user. Keywords are selected in the same way as in the inter-topic distance plot, and the documents are selected in the order of highest topic prevalence. This information should help the user understand what the topic is about through its vocabulary and example documents. The sentiment and emotions give additional insights about how, in this case, the survey respondents feel about the specific topic. For example, if the examination in the course was too hard, and it is a recurring theme in the survey answers, it should end up as a topic that is negative and has a vocabulary that uses emotionally negative words.

V. APPLYING AND EVALUATING THE ARTEFACT
The two examples are summarized in Table 2. The first example analyzes data collected from a whole degree programme, and in the second example the data is collected from all courses in the whole university.

A. ANALYZING FEEDBACK ON DEGREE PROGRAMME LEVEL
First, Palaute is demonstrated with student feedback data collected from courses in a software engineering degree programme. Only responses that contained answers to open-ended questions written in Finnish were included. The data set comprises 742 responses from 36 courses.
Responses to the open-ended questions were collapsed into a single column. For example, if the respondent answered four open-ended questions, the answers were mapped to four rows, each with their matching Likert-type answers. Only full rows were included, meaning a row is dropped if the document is empty or one or more of the covariates are empty.
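The reshaping step can be sketched as follows. This is a minimal illustration, not the preprocessing code of Palaute; the function `to_long_rows` and the column names (`q_good`, `q_improve`, `workload`) are hypothetical. Each answered open-ended question becomes its own row carrying the respondent's Likert-type answers, and rows with an empty document or an empty covariate are dropped.

```python
def to_long_rows(responses, open_columns, covariate_columns):
    """Collapse open-ended answers into one 'document' column, one row per answer."""
    rows = []
    for response in responses:
        covariates = {c: response.get(c) for c in covariate_columns}
        if any(value in (None, "") for value in covariates.values()):
            continue  # drop the response if any covariate is missing
        for column in open_columns:
            text = (response.get(column) or "").strip()
            if text:  # skip unanswered open-ended questions
                rows.append({"question": column, "document": text, **covariates})
    return rows


# Hypothetical survey responses.
responses = [
    {"q_good": "Great lectures", "q_improve": "", "workload": "4"},
    {"q_good": "Nice", "q_improve": "More time", "workload": ""},
    {"q_good": "Clear slides", "q_improve": "Faster grading", "workload": "2"},
]
rows = to_long_rows(responses, ["q_good", "q_improve"], ["workload"])
```

In this toy example the first respondent contributes one row, the second is dropped because the covariate is empty, and the third contributes two rows.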
The model was run with 11 to 15 topics with 500 maximum iterations. This yielded a model that converged at 414 iterations with 12 topics. The topics, labelled using a machine-supported thematic analysis process introduced in [13], are listed in Table 3.
According to sentiment analysis, the feedback tended to be positive. However, lightweight thematic analysis of the most characteristic responses from each topic highlighted suggestions for improvement instead of praise. Figure 6 shows that trust and anticipation are the most matching emotions.
Going over the topics, topics 1, 3, 5 and 7 include feedback on, for example, quizzes, assignments, exercises and exams. A total of 28% of the feedback falls into these topics. Topic 4 relates to teaching methods, and it is the only completely positive topic. This topic is also relatively large at 13%. The courses and their topics are also deemed interesting, which is shown in topic 2 (10% of the feedback) and its documents.
Topic 6 contains suggestions from students. For example, some instructions could be made clearer and some additional topics could be taught in the lectures. The suggestions are mostly not critical of the current methods, and only suggest ways of further improving the courses. Topics 8, 9 and 10 relate to workload, timing and schedule issues. Topic 8, labelled as ''Low motivation due to heavy workload'', deals with the heavy workload, which affects the students' motivation negatively. Some other reasons for low motivation were also mentioned, like lack of interest in a mandatory course. Topic 9 highlights the time pressure the students face in their studies, with the SE courses perceived as too much work. Topic 10 deals with various timing and schedule issues. For example, students disliked that evaluation and feedback on some exercises were delayed, there was not enough time to do some exercises, and the workload was sometimes too much.
Finally, topic 11 is related to course material (5% of the feedback) and topic 12 (4%) contains other, miscellaneous comments. There is a variety of short positive comments that are not highly connected with each other.

B. ANALYZING FEEDBACK ON INSTITUTION-WIDE LEVEL
Next, we analyze feedback data collected from all courses in the university. The data set contains a total of 6087 student course evaluations.
We ran the analysis similarly to the previous experiment. This time the analysis yielded a model that converged at 26 iterations with 6 topics. The topics are listed in Table 4.
Overall, as the feedback comments have come from students taking courses in different fields, the summary is more generic in comparison to the degree programme wide feedback analysis. Based on the inter-topic distance and emotion analysis presented in Figure 7, most of the feedback is positive or mixed, and no topic is primarily negative. Topics 3 and 6 are very close to each other, while the other topics are more distinct from each other. Topic 1 is the most mixed in terms of sentiment, while the others are mostly positive.
Inspecting the feedback in the different topics, Topic 1 consists of comments and suggestions on course arrangements. In particular, mandatory attendance in classes emerged as a common subject. Overall, 12% of the feedback fell into this topic.
Topics 2 and 5 are close to Topic 1 in the inter-topic distance matrix. Topic 2 consisted of feedback on courses, their workload, and the scope of course arrangements. Topic 5, too, contained feedback on course assignments, and in particular, the workload and time given to complete assignments. Topics 3 and 6 shared the most similarities in their vocabulary. Topic 6 contained feedback about lectures, their content, and course arrangement. Topic 3 contained feedback on similar issues, with the distinction that there were also comments on the students' motivation.
Finally, topic 4 was the most distinct in terms of vocabulary used in the feedback. Comments in this topic generally contained some constructive criticism, and improvement suggestions for the course.

VI. DISCUSSION
Based on our design requirements, we created an online tool for analyzing written course feedback, Palaute. It is an online service, meaning no external software needs to be installed.
Hence, the tool makes the analysis workflow more accessible to educators who are not proficient with data mining tools.
Through creating a design and demonstrating it as an instantiated artefact in a naturalistic environment [72] (also known as 'in the wild' in HCI), we demonstrate the usefulness of the concept and therefore the validity of our design ideas. Two contributions to the field are as follows:
• Feasibility for practitioners. The field of student evaluation of teaching and text mining has discussed the feasibility of speeding up topic modeling [67], since automated machine learning always needs validation [73]. A process that combines thematic analysis and LDA was proposed in [13] and is now demonstrated in this paper. With Palaute, a practitioner can analyze the findings from, for example, an institution-wide feedback survey in less than a quarter of a working day.
• Combined analysis process. As one of the major contributions to the field, combining topic modeling with sentiment and emotion analysis in CSE is a novel combination of methods that has not been widely explored in the literature. Single approaches to LDA and sentiment analysis have been studied extensively (e.g. [21], [22], [25]). However, the practice of how to combine these analyses into a single practical workflow has been less explored.

VII. CONCLUSION
Topic modeling and emotion analysis can be used in the educational context as a way of creating summaries of the data. Palaute is a tool that was created to accomplish that task. While the analysis of feedback requires some understanding of text mining, the tool significantly streamlines the analysis process compared to generic text analytics tools. The goal of this study was to create an artefact that can be used to analyze written course feedback. The evaluation of the tool was done in two experiments: Analyzing feedback within one degree programme in software engineering, and analyzing the feedback collected from all courses arranged in the university.
To answer the first research question, "What can be learned from the written student feedback with the tool?": the tool reveals the points that students raise most often in their written feedback. At the degree programme level these were the heavy workload, frustration with the automatic code checker, and major problems with the UI course, alongside a good deal of praise for the SE courses. Furthermore, the university-wide analysis produced indicators of good teaching quality, such as interesting courses, but also some concrete improvement points, such as the time given to complete course assignments.
To answer the second research question, "How does the tool benefit the user?": the main benefit of Palaute is the user interface it provides to the complex methods used under the hood. Performing topic modeling and emotion analysis and visualizing the results is not trivial, so automating this process is useful. Topic modeling allows thousands of documents to be summarized quickly, which would be very time-consuming if done manually. Palaute helps the user understand the structure of the data, and the graphical user interface makes the analysis much easier than writing the analysis code by hand. Because the tool highlights what students think is wrong with a course, action can be taken to address these issues, which should improve the course. This can have an impact on, for example, course dropout rates. Topic modeling groups similar comments together, so the overall themes reflect what most students consider important. Thus, the tool highlights points from large data sets that can be acted on to improve teaching.
Limitations and future work remain. One of the major remaining issues in establishing the workflow is integration: currently, users of the system still need to download and format the data, analyze it in Palaute, and then download the results for further use. A more comprehensive suite, or alternatively a plugin system, would allow the feedback to be processed in the system where it was originally gathered. The second main limitation of the tool is its support for qualitative and thematic analysis, mainly the lack of support for the researcher's own notes or tags. The current expectation is that the results are exported and the qualitative analysis is finalized on the desktop. As future research, it would be beneficial to investigate how collaborative notes or tags could be added to the dataset through the tool. This was intentionally out of scope in the current study, in order to focus on a specific research issue.

VOLUME 9, 2021

NIKU GRÖNBERG received the master's degree in software engineering. He worked as a Research Assistant in the "Smart learning environments and their content production" research project at LUT University. He is currently working in the software industry.
ANTTI KNUTAS is currently working with the Department of Software Engineering, LUT University, as an Assistant Professor in software construction. He also heads the bachelor's degree programme in software engineering at his institution. In addition to research at LUT, he has previously worked with the Science Foundation Ireland Research Centre for Software with Dublin City University and the PONG Labs, University of Milan. He has published more than 50 papers in his field. His current research interests include human factors in software engineering, civic tech, and computer-supported collaboration.
TIMO HYNNINEN currently works as a Senior Lecturer of information technology with the South-Eastern Finland University of Applied Sciences. He is also the Head of the software engineering degree programme. His main research interests include software testing and quality assurance, computing education, and computing applications for education. Previously, he has also worked on video games research in academia and software development in the industry.
MAIJA HUJALA is currently working with LUT School of Business and Management as an Associate Professor in industry and data analysis. She has published 38 peer-reviewed scientific publications. Her current research interests include data science in education, especially in student evaluation of teaching. Her previous research interests include the acceptability of wind power and structural changes in the global pulp and paper industry. She is experienced in quantitative data analysis and interdisciplinary research.