A Comprehensive Usability Measurement Tool for m-Learning Applications

Contribution: This article describes the process used to create a questionnaire to evaluate the usability of mobile learning applications (CECAM). The questionnaire includes specific questions to assess user interface usability and pedagogical usability. Background: Nowadays, mobile applications are expanding rapidly and are commonly used in educational institutions to support the learning and teaching process. However, deficient usability could decrease the utility of learning activities and the student’s motivation. Therefore, careful planning and design by the developer are required, along with a usability evaluation of the applications. Research Questions: How could an instrument be developed to evaluate the usability of m-learning applications that combine technical and pedagogical aspects? How can the quality of the developed instrument be determined? Methodology: A structured questionnaire was created as a measuring tool to evaluate and design m-learning applications. Different statistical techniques, including reliability and validity assessments, were employed to evaluate the quality of the instrument, which is determined through the calibration of the CECAM survey. Findings: After the validity analysis of the questionnaire, a scale with 56 items was obtained, with an alpha reliability coefficient of 0.911 (an excellent measuring scale). It is intended to be used by teachers to design or evaluate m-learning applications, improve their usability, and enhance the students’ learning experience.


Christian X. Navarro-Cota, Ana I. Molina, Miguel A. Redondo, and Carmen Lacave

I. INTRODUCTION
THE USE of mobile devices (smartphones and especially tablets) to support teaching and learning activities has gained importance and has been consolidated over the last few years [1]. This has led to the popularization of different terms that have emerged to refer to this new scenario [2]: m-learning, ubiquitous learning, seamless learning, blended learning, or smart education.
However, the most widely used and widespread term is mobile learning (or m-learning) [3]. Although the concept of m-learning includes a large number of variants, with no consensus on its definition [4], one of the most widely accepted definitions is that of [5]: "the processes of coming to know through conversations across multiple contexts amongst people and personal interactive technologies." In [3] it is defined as "learning across multiple contexts, through social and content interactions, using personal electronic devices." This last definition is more focused on the students and their learning process.
Since the concept of m-learning has become widespread, several authors have directed their studies to demonstrate its benefits [6], identify the new challenges that arise [7], and discuss its advantages and disadvantages [8]. In their works, these authors highlight several factors that can be considered determinants for the advancement of m-learning: 1) It requires adequate technological and pedagogical integration. 2) It facilitates access to information at any time and in any place, breaking down geographical and time barriers. 3) It allows the adaptation of interfaces, contents, methodologies, and activities to the individual differences of students, in addition to generating personalized analysis and feedback. 4) It promotes communication and knowledge sharing, fostering collaborative learning. 5) It contributes to enhancing student motivation, especially when it is used to implement innovative methodologies, such as those that exploit gamification. A fundamental feature for the above factors to contribute positively is the usability of the tools, since this aspect has a critical impact on the performance of learning activities [9], [10], [11]. In the field of Human-Computer Interaction, in-depth work has been done on the usability of software applications (technical usability), and techniques and tools have been developed to address its requirements in the design phases and in subsequent evaluation [12], [13]. However, the context and support for m-learning [14] define a more complex and dynamic scenario due to the specific characteristics of the devices used (small screens, limited input capabilities, mobility, etc.) [15], [16].
The proposal of guidelines for the specific development of e-learning applications has been a focus of research by the scientific community, even considering specific elements aimed at ensuring usability [17], [18]. An important consequence emerging from research in this area [19] is that pedagogical aspects are not taken into account for usability evaluation [20]. In fact, the concept of pedagogical usability is proposed in [21] and [22], where it is defined as the ability of an educational system or tool to be used effectively and efficiently in teaching and learning processes, facilitating the achievement of educational objectives.
Although students are more likely to have a satisfactory learning experience when using a well-designed learning software application that presents acceptable levels of technical or pedagogical usability from various angles [23], the literature shows clear evidence that the balance is too heavily weighted toward evaluating the usability of the technical aspects exclusively [1]. Consequently, there is a lack of instruments to adequately measure both dimensions: the technical and the pedagogical (see Section II). Moreover, the few that exist do not meet standard quality criteria in terms of validity and reliability of their results [24], [25], so they should not be used [26], since the generation of scientific knowledge with a desirable level of precision and certainty is not guaranteed [27].
Consequently, this work aims to create a properly validated evaluation tool for m-learning applications that considers both pedagogical aspects and usability requirements. This instrument has the potential for dual functionality: an evaluation tool and a checklist or set of heuristics to guide the design processes of m-learning systems. Based on this goal, the following research questions are formulated. RQ1: How could an instrument be developed to evaluate the usability of m-learning applications that combine technical and pedagogical aspects? RQ2: How can the quality of the developed instrument be determined? To answer these questions, the process of constructing an evaluation instrument called CECAM ("Cuestionario de Evaluación de la Calidad de Aplicaciones M-learning," for its initials in Spanish) is described, as is how it has been calibrated with statistical methods to demonstrate its quality [28]. This instrument, unlike other heuristics and guides for designing and evaluating m-learning applications, considers both technological and pedagogical aspects.
The remainder of this article is organized as follows: Section II includes a review of related works in the field of m-learning evaluation; Section III presents the main characteristics of the mobile learning evaluation framework (MOLEF), on which the proposed measurement instrument (CECAM) is based; Section IV describes the process of developing the CECAM questionnaire, while its quality analysis or validation is detailed in Section V. Finally, the Discussion (Section VI) and Conclusions (Section VII) sections are included.

II. RELATED WORKS
The importance that mobile learning, or m-learning, is currently acquiring is unquestionable, given the massive presence of mobile phones in all fields, including education [29]. Their use has grown in recent years, and even more since the COVID-19 pandemic, which revealed the need for more flexible, real-time, and remote access to educational resources [30], [31].
The m-learning approach enables extended learning, exploring the ubiquitous possibilities of technologies such as laptops, smartphones, or tablets to access, record, process, manage, and exchange information anytime, anywhere.
However, the proper design of this type of application continues to be a challenge [32], [33]. Considering usability aspects when creating mobile learning applications is essential to improve user acceptance and satisfaction, increasing their motivation and engagement. The most frequently reported usability aspects of m-learning in the literature have been [11]: learnability, user satisfaction, ease of use, and usefulness. Considering usability aspects will also result in the creation of more motivating, effective, and efficient learning experiences for learners [34].
Interest in the usability of m-learning has grown in recent years, and several systematic literature reviews have examined the current state of this topic [11], [14]. These works conclude that further research is needed in this area, since many of the existing proposals are based on usability methods and standards for nonmobile or noneducational applications. At the support device level, there are notable differences between mobiles and desktops, such as touch interaction, ubiquity, limited screen size, and a greater demand for visual attention, which affect usability and should be taken into account in the design and evaluation of this type of application [35]. On the other hand, it is necessary to consider educational aspects, to ensure that the use of these applications helps students in their learning process, in various contexts of use, and in accordance with the posed learning objectives [22]. Thus, the concept of pedagogical usability emerges, which takes into account the learning process, learning purposes, user needs, learning experience, learning content, and learning outcomes [21].
Several frameworks, taxonomies, and guidelines have been proposed in recent years for the design of m-learning applications and the evaluation of their usability. Table I shows some of the most outstanding proposals. Some of these works are based on standards, such as ISO 9241 [36], ISO/IEC 14598 [37], ISO/IEC 25000 [38], and ISO/IEC 9126 [39]. Others are based on well-established and validated frameworks, such as the theory of reasoned action (TRA) [40] or the DeLone and McLean model of information systems success (DL&ML) [41]. But, undoubtedly, the framework that has been most often taken as a reference is the technology acceptance model (TAM) [42], [43], proposed by Davis [44]. Works based on this framework seek to identify which factors best explain user intentions, as well as the adoption and acceptance of m-learning solutions. Some of these proposals include instruments, in the form of checklists or questionnaires, which allow quantifying the usability of the evaluated m-learning system.
As can be seen in Table I, only three proposals consider pedagogical aspects, with only two specifying instruments for their measurement. However, in one proposal, the instrument has yet to be validated, and in the other, its validation has been limited to the calculation of Cronbach's alpha. Therefore, there is a need to propose a measurement instrument, or questionnaire, that considers aspects of mobile and pedagogical usability, suitably validated and refined, that will allow evaluators to determine the quality of an m-learning system [56].
Having reviewed the main existing proposals for the evaluation of m-learning systems, the following section briefly describes the proposed framework for evaluating these types of applications, called MOLEF, on which the measurement instrument created and described in this article is based.

III. MOLEF-MOBILE LEARNING EVALUATION FRAMEWORK
As indicated in the Introduction section, this article presents the process of developing and validating the CECAM questionnaire, a measuring tool proposed to evaluate and design m-learning applications. It considers the elements in the MOLEF framework described in [33]. This framework was developed after a thorough analysis of the existing evaluation and development frameworks for m-learning applications, as well as of well-known and widespread models of technology adoption, such as TAM [44] or the Unified Theory of Acceptance and Use of Technology (UTAUT) [57], among others. This analysis has led to the identification of a series of factors or quality requirements [58], which comprise the MOLEF framework.
The MOLEF framework considers pedagogical factors (e.g., aligning with learning objectives, adequacy, cognitive load), mobile usability features (e.g., adaptability, consistency, flexibility), as well as technology adoption factors (e.g., usefulness or relevance, ease of use, previous requirements). To this end, it proposes a catalog of usability attributes or characteristics of m-learning systems [58], based on a literature review on the design and evaluation of mobile applications, the quality standard ISO/IEC 25010:2011 [59], adoption factors, and pedagogical usability attributes.
The proposed set of quality attributes is divided into two main blocks or dimensions: 1) pedagogical usability (Fig. 1) and 2) user interface usability (Fig. 2).
Each of the main dimensions (pedagogical and user interface usability) includes a subset of subdimensions that, in turn, are divided into a set of criteria or quality attributes:
1) The pedagogical usability considers educational and pedagogical factors to support learning activities. These factors will provide the appropriate context for educational practice. This dimension establishes five subdimensions: a) content; b) multimedia; c) tasks or activities; d) social interaction; and e) personalization. Each subdimension includes a set of quality criteria (Fig. 1), the definition of which can be found in [58].
2) The user interface usability includes factors that make the software easier to use and that favor the acceptance and satisfaction of the students with the m-learning system. This dimension includes five subdimensions related to the interaction with the interface: a) design; b) navigation; c) customization; d) feedback; and e) motivation. Each subdimension is measured by a set of quality criteria (Fig. 2) [58].

IV. DEVELOPMENT OF THE CECAM MEASUREMENT TOOL
This section presents the process of developing the CECAM questionnaire (RQ1), a measuring tool proposed to evaluate and design m-learning applications. It considers the elements in the MOLEF framework [33] (see Figs. 1 and 2).

A. Questionnaire Design
Questionnaire surveys are helpful tools for gathering information or taking measurements. For instance, questionnaires are commonly used as a measuring tool for educational software [60].
Therefore, a structured questionnaire was created as a tool for evaluating m-learning applications (CECAM). The questions require short, concrete, and closed answers (choosing between a range of five options), which should be prepared beforehand. In particular, each question is a statement that describes an attribute that should be considered in m-learning applications. The evaluator marks an X to register the degree of fulfillment of such attributes, using the following scale: 1) Strongly disagree; 2) Disagree; 3) Neither; 4) Agree; and 5) Strongly agree. These five options have been chosen because they cover all the possible answers (principle of exhaustiveness), avoiding the possibility of evaluators not responding due to a lack of available responses, and because they prevent the evaluator from choosing two answers for the same question (to guarantee the exclusivity of the questionnaire) [61].
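As an illustration of how answers on this scale could be digitized, the following minimal Python sketch encodes the five options and rejects answers that break exhaustiveness or exclusivity (no box marked, or more than one). The names `Likert` and `parse_marks` are hypothetical and are not part of CECAM; the representation of an answer as five tick boxes is an assumption.

```python
from enum import IntEnum
from typing import Optional


class Likert(IntEnum):
    """Five-point agreement scale described in the text."""
    STRONGLY_DISAGREE = 1
    DISAGREE = 2
    NEITHER = 3
    AGREE = 4
    STRONGLY_AGREE = 5


def parse_marks(marks: list[bool]) -> Optional[Likert]:
    """Convert the five tick boxes of one item into a Likert value.

    Returns None when the answer is unusable: no box marked,
    or more than one box marked for the same item.
    """
    if len(marks) != len(Likert):
        raise ValueError("expected one box per scale option")
    if sum(marks) != 1:
        return None
    # Boxes are ordered from "Strongly disagree" (1) to "Strongly agree" (5).
    return Likert(marks.index(True) + 1)
```

Treating invalid answers as missing values, rather than guessing, keeps them out of the later reliability calculations.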

B. Questionnaire Structure
Considering the categories in MOLEF, a questionnaire was developed and divided into two multidimensional subscales (pedagogical usability and user interface usability). The questions are grouped into dimensions with items related to the aspects and constructs mentioned in the framework. Specifically, the questionnaire has some questions or heuristics related to each established criterion. Thus, the pedagogical usability scale includes questions related to content, multimedia, activities, social interaction, and personalization, and the user interface usability scale contains the questions related to interface design, navigation, customization, feedback, and motivation.
Table II shows the preliminary structure of the CECAM questionnaire, which includes the subscales, the constructs, and the number of items of these constructs or factors to be measured.
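The two-subscale structure can be sketched as a simple mapping, together with a helper that summarizes one evaluator's answers per construct. This is only an illustration: the item identifiers used in the usage example are placeholders, and averaging the 1 to 5 scores per construct is an assumption, since the text does not prescribe a scoring rule.

```python
from statistics import mean

# Subscales and constructs as named in the text.
CECAM_STRUCTURE = {
    "pedagogical usability": [
        "content", "multimedia", "activities",
        "social interaction", "personalization",
    ],
    "user interface usability": [
        "design", "navigation", "customization",
        "feedback", "motivation",
    ],
}


def construct_means(responses, item_to_construct):
    """Average one evaluator's 1-5 scores per construct.

    responses: dict mapping item ID -> score (1..5)
    item_to_construct: dict mapping item ID -> construct name
    """
    grouped = {}
    for item_id, score in responses.items():
        grouped.setdefault(item_to_construct[item_id], []).append(score)
    return {name: mean(scores) for name, scores in grouped.items()}
```

For example, `construct_means({"X1": 4, "X2": 2, "Y1": 5}, {"X1": "content", "X2": "content", "Y1": "design"})` yields a mean of 3 for content and 5 for design.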
Once the questionnaire structure has been presented, the process followed in elaborating the initial list of items that compose it is described.

C. Elaborating Initial Items
The items in the questionnaire were written based on the factors of MOLEF [33], focusing on presenting a clear and understandable language and avoiding the use of technical words that would prevent a clear interpretation by the evaluators, thereby ensuring: 1) that the exact question is answered without misinterpreting the statement; 2) that no question is left unanswered due to lack of comprehension; and 3) that the questionnaire is not too complicated for the evaluator.
Table VI (in Appendix) presents a description of the items that belong to each of the constructs in the pedagogical usability subscale and their assigned identifier (ID) for further statistical analysis of the questionnaire. Table VII (in Appendix) presents a description of the items that belong to each of the constructs in the user interface usability subscale and an ID assigned for further statistical analysis of the questionnaire.
The first group of 72 items forms the initial structure of the preliminary questionnaire. This first version is the result of a revision performed by two experts in the area of evaluation questionnaires. The main goal was to review the wording of the items and analyze phrases that could confuse the evaluators. As a result of this revision, some items were modified; some questions changed from negative to positive; some terms were eliminated to avoid confusion; and examples or explanations in parentheses were included to facilitate understanding of each item's questions whenever necessary.

V. QUALITY ANALYSIS AND RESULTS OF THE CECAM SURVEY
The previous section described the process followed to develop the CECAM questionnaire, which provides an affirmative answer to the research question RQ1, on whether it is possible to develop an instrument to evaluate the usability of m-learning applications, combining technical and pedagogical aspects. The answer to the research question RQ2, about how to measure the quality of the developed instrument, is given by the calibration of the CECAM survey in terms of the standard criteria of quality: reliability and validity [24]. The validity of a survey should be interpreted as the degree to which evidence and theory support the interpretations of test scores; the reliability of a test provides the degree of consistency or stability of measures when the measurement process is repeated [28]. The data needed for the application of the method used to calibrate the questionnaire [25] have been obtained through a quasi-experimental study, which is described in the following section. All statistical procedures were performed with IBM SPSS software, version 21.

A. Quasi-Experiment Design to Obtain Calibration Data
With the aim of obtaining the data needed for the calibration of the CECAM questionnaire, a quasi-experiment [62] was designed to be carried out with university students without random assignment. Fig. 3 illustrates the phases involved in the process: recruitment of participants, pre-testing, intervention, and post-testing, which are described below.
The participants were recruited from the first-year students of the computer science (CS) degree taught in the College of Computer Science (CCS) of the University of Castilla-La Mancha (UCLM), Spain. All enrolled students in the compulsory class of Programming I, taught in the first semester of the CS degree, were informed in class of the experience to be carried out (what it consisted of and the estimated time it would take) and that they would receive a reward in the form of some extra points for the final grade of the course. As a result, 37 students voluntarily decided to participate in the experience. The small sample size reflects the usually low attendance at classes by students, which several reasons can explain: 1) class attendance is not compulsory; 2) all the material to follow the subject is available to them through the university's online platform (Moodle); and 3) programming is usually one of the most difficult subjects for CS students [63], which implies that many students drop it.
Then, the pre-test phase involved providing each participant with a paper copy of a survey to gather some personal information about them, which they had to fill out anonymously during the initial 20 min of the class. The first block of questions asked for demographic information, such as age, gender, and level of education. The second block included questions about their experience in using Information and Communication Technology (ICT) and mobile devices for learning, and their attitude toward mobile learning. Then, the intervention consisted of using an m-learning application to learn a particular course topic for a specified period. The chosen application was Learn Java-Free, available through Google Play. The application's goal is to review topics of Java programming, and for that purpose, it contains several concepts included in the subject contents. The students were advised to use it for three weeks to practice the Java concepts explained during laboratory classes and were allowed to use it freely. Finally, during the post-testing, participants had to complete the CECAM survey to measure the application's usability.

B. Reliability Analysis
The analysis of the reliability of a survey involves calculating its internal consistency, which measures whether several items intended to measure the same general construct produce similar scores, and the homogeneity index of each item, which indicates the degree to which each item contributes to the internal consistency of the scale [25]. The internal consistency is usually obtained by Cronbach's alpha (α) coefficient [64], which is based on the average interitem correlation and assumes that the items (measured on the Likert scale) assess the same construct and are highly correlated. The values of this coefficient vary between 0 and 1; the closer the value is to 1, the higher the internal consistency. A general rule considers a coefficient acceptable when its value is equal to or greater than 0.7 [65]. For multidimensional scales, in which each dimension measures a different aspect, the internal consistency is also calculated for each dimension. In addition, it is advisable to evaluate the value of the α coefficient after removing each of the items from the survey in turn: items whose removal makes the coefficient value increase can be disregarded. The homogeneity index of each item is defined as the Pearson correlation coefficient between the scores of the item and the sum of the scores on the remaining items. Items with low homogeneity indices measure something other than what is reflected by the survey, so they can be removed.
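The quantities just described can be sketched in a few lines of Python. This is a minimal illustration using the common variance-based form of Cronbach's alpha, α = k/(k−1) · (1 − Σ var(item) / var(total)); it is not the SPSS procedure used in the study, and the function names are hypothetical. The input is assumed to be a list of respondents, each a list of numeric item scores with no missing values.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_deleted(scores, j):
    """Alpha recomputed after dropping item j from every respondent."""
    reduced = [row[:j] + row[j + 1:] for row in scores]
    return cronbach_alpha(reduced)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))


def homogeneity_index(scores, j):
    """Correlation between item j and the sum of the remaining items."""
    item = [row[j] for row in scores]
    rest = [sum(row) - row[j] for row in scores]
    return pearson(item, rest)
```

An item whose `alpha_if_deleted` value exceeds the overall `cronbach_alpha`, or whose `homogeneity_index` is low, is a candidate for removal under the rules stated above.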
It is usually advisable to remove those items whose homogeneity index is less than 0.35 [66]. In this case, the process is repeated with the remaining items until all have a homogeneity index greater than 0.35, as Fig. 4 shows. In the case of the CECAM questionnaire, α = 0.9, representing an excellent internal consistency. The pedagogical usability subscale also has a good internal consistency (α = 0.892), and the user interface usability subscale has an acceptable internal consistency (α = 0.798), as Tables III and IV, respectively, show in their first row. Regarding the pedagogical usability subscale, Table III shows the α value of each of its dimensions and their factors, highlighting in bold the coefficients below 0.7. That of the Content factor is lower than 0.70 (α = 0.676), but after removing items C5, C7, and C8 (see Table VI of Appendix) it grows to 0.735, which is acceptable. Moreover, the removal makes sense because item C1 can substitute for item C5, item C4 can replace item C7, and item C8 does not explicitly evaluate a pedagogical quality. The other factors have a good internal consistency, although that of the Tasks or activities factor could improve if item A5 were removed. Although its homogeneity index is greater than 0.2, it was removed since it was considered a difficult question to understand, and A4 and S1 can replace it. The final internal consistency of the reduced pedagogical subscale is excellent (α = 0.900).
Concerning the factors of the user interface subscale, Table IV shows that only the Design factor has a good internal consistency (α = 0.817); that of the Navigation factor is close to acceptable (α = 0.699) and it grows (α = 0.799) after removing items N9, N11, N12, and N13 (see Table VII of Appendix); that of the Customization factor is near acceptable (α = 0.673) and it grows (α = 0.741) after removing items C4 and C7: C4 was considered unnecessary, and C7 can be replaced by the feature measured by C6. In the Feedback factor, the consistency is poor (α = 0.489). Moreover, the Motivation factor does not have enough consistency (α = 0.556), and it does not grow after removing any item. A detailed analysis of the Motivation items revealed that they should be included in the Feedback factor. Therefore, items M1 to M5 (see Table VII of Appendix) were moved to that factor and renamed as F8 to F12, respectively. The Cronbach's alpha for the new Feedback factor increased (α = 0.756) after removing items F1, F4, F8, F10, and F11. Finally, the alpha of the subscale stayed at 0.798.
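The iterative refinement rule stated at the start of this subsection (remove items with a homogeneity index below 0.35 and repeat) can be sketched as follows. This is an illustrative simplification, not the authors' procedure: it assumes that the single worst item is dropped in each round, and it ignores the expert judgment that, as described above, also guided which items were kept or removed. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = sqrt(sum((a - mx) ** 2 for a in x) *
                 sum((b - my) ** 2 for b in y))
    return cov / denom if denom else 0.0


def prune_items(scores, item_ids, threshold=0.35):
    """Drop the least homogeneous item until all reach the threshold.

    scores: one row of item scores per respondent
    item_ids: identifiers of the columns, in the same order
    Returns the identifiers of the items that remain.
    """
    ids = list(item_ids)
    data = [list(row) for row in scores]
    while len(ids) > 2:
        indices = [
            pearson([r[j] for r in data], [sum(r) - r[j] for r in data])
            for j in range(len(ids))
        ]
        worst = min(range(len(ids)), key=indices.__getitem__)
        if indices[worst] >= threshold:
            break  # every remaining item is homogeneous enough
        del ids[worst]
        for r in data:
            del r[worst]
    return ids
```

Recomputing the indices after every removal matters because dropping one item changes the "sum of the remaining items" against which all the others are correlated.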

C. Validity Analysis
The study of the validity of a teaching survey involves the following analyses [25].
1) Content Validity, which assesses the understanding of statements, is typically determined through expert judgment, where qualified individuals evaluate the survey [67]. The content of the CECAM questionnaire has been validated by 10 experts in the usability field, from the UCLM, who acted as judges. They were all given the same document that clearly stated the purpose of the survey, its content (Table I), and a rubric (Table II) specifying how they should make their evaluation of the defined dimensions, the items associated with each, and the evaluation scale. Given that 9 of the 10 judges agreed to keep the original 10 dimensions and 72 items, as well as the Likert-type scale, the content of the questionnaire remained unchanged from the original proposal.
2) Construct Validity, which evaluates the degree to which an instrument reflects the theory of the concept that it measures [68]. There are different methods for this type of analysis, although convergent and discriminant validity are commonly used [69].
a) Convergent Validity verifies that the items in the scales are significantly and strongly correlated with the constructs they belong to [68]. Among the different criteria to analyze the convergent validity, the factor loading matrix was chosen [70], since it calculates the Pearson correlation coefficient among the items and their constructs. It is recommended that this value should be higher than 0.4 [71]. This property is satisfied by all items of the pedagogical usability subscale, and by those of the user interface usability subscale except for item N5 (see Table VII of Appendix). Nevertheless, it was decided not to remove it because it is an important characteristic of usability.
b) Discriminant Validity is applied in the case of multidimensional scales, and tests whether the different constructs that form the scale measure different concepts. Each item must be related to its own construct and significantly different from the rest of the constructs, to which it does not belong [68].
[Table VII: ID and description of each item of the user interface usability subscale]
Recommended methods to analyze the discriminant validity are the comparison between indicator correlations and the comparison between the shared and the extracted variance. The comparison between indicator correlations is made through the analysis of the factor cross-loading matrix. It represents the Pearson correlation coefficients of the items and the other constructs. Discriminant validity exists if all correlations between the items of a construct are significant and each of these correlations is higher than all correlations between indicators of the other constructs. Table VIII (in Appendix) shows the cross-loading matrix for the factors, which reveals that each item is strongly associated with a particular construct, as evidenced by its high factorial load (indicated in bold). On the other hand, the comparison between the shared variance and extracted variance suggests that each construct should share more variance with its items than with the other constructs in the scale. Therefore, the correlation coefficients between the constructs and the square roots of the average variance extracted (AVE) were calculated. For good discriminant validity, it is recommended that the square root of AVE in each construct be significantly higher than the correlations with the other constructs [72].
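The comparison between shared and extracted variance can be sketched as follows. The sketch assumes that the AVE of a construct is computed in the usual way, as the mean of the squared standardized loadings of its items, and it uses a plain "greater than" check rather than a significance test (this comparison is often called the Fornell-Larcker criterion). The function names and the input layout are hypothetical.

```python
from math import sqrt


def average_variance_extracted(loadings):
    """AVE of one construct: mean of its squared standardized loadings."""
    return sum(l * l for l in loadings) / len(loadings)


def discriminant_check(loadings_by_construct, construct_corr):
    """Compare sqrt(AVE) of each construct with its correlations.

    loadings_by_construct: dict construct -> list of item loadings
    construct_corr: dict (construct_a, construct_b) -> correlation,
                    one entry per unordered pair
    Returns dict construct -> True when sqrt(AVE) exceeds the absolute
    correlation with every other construct.
    """
    result = {}
    for name, loadings in loadings_by_construct.items():
        root_ave = sqrt(average_variance_extracted(loadings))
        related = [abs(r) for pair, r in construct_corr.items()
                   if name in pair]
        result[name] = all(root_ave > r for r in related)
    return result
```

For instance, with illustrative (not measured) loadings of 0.8, 0.7, and 0.9, the AVE is about 0.647 and its square root about 0.804, so a correlation of 0.6 with another construct would pass the check.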

VI. DISCUSSION
As can be derived from the study of related works (Section II), although in recent years the number of works proposing instruments to measure the usability of m-learning of a quantitative nature has grown, few of them consider the measurement of pedagogical aspects.Moreover, those that do so have not validated the quality of the instrument or have limited it to calculating Cronbach's Alpha.Based on this motivation, and in order to provide a solution to the detected need, in this work a survey instrument (CECAM) has been created, which includes the assessment of technological and pedagogical usability aspects (RQ1), and which has been conveniently validated and refined (RQ2).As a result of the reliability and validity analysis on the preliminary questionnaire CECAM (formed by 72 items and an initial alpha of 0.900), a scale with 56 items was obtained and an alpha reliability coefficient of 0.911.Based on the criteria by [65], the final version of CECAM questionnaire can be considered as an excellent measuring scale.Table V shows the final structure of the CECAM questionnaire, which contains 56 items.
Of the two subscales, one is considered excellent (pedagogical usability, 0.900) and the other acceptable (user interface usability, 0.798). Of the resulting constructs, four are considered acceptable: Content (0.769), Navigation (0.799), Customization (0.741), and Feedback (0.756); three are considered good: Multimedia resources (0.859), Educational activities (0.873), and Design (0.817); and one is considered excellent: Social interaction (0.916). The 56-item CECAM questionnaire, described in Table X (in Appendix A), offers higher quality than the original questionnaire, both at the general level and for each of its specific dimensions and subdimensions. This suggests that, for further statistical analysis, the data provided by these 56 questions are more valid and more reliable than those provided by the 72 questions of the original questionnaire.
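The qualitative labels above can be reproduced with a simple threshold rule. The sketch below assumes the widely used cut-offs of 0.9 (excellent), 0.8 (good), and 0.7 (acceptable); these cut-offs are an assumption that is consistent with the labels reported here, not a restatement of the criteria in [65]. The alpha values are the ones given in the text, while the function name and the "below acceptable" label are illustrative.

```python
from collections import Counter


def reliability_label(alpha):
    """Map an alpha coefficient to a qualitative label (assumed cut-offs)."""
    if alpha >= 0.9:
        return "excellent"
    if alpha >= 0.8:
        return "good"
    if alpha >= 0.7:
        return "acceptable"
    return "below acceptable"


# Alpha values as reported in the text for the final 56-item questionnaire.
subscales = {
    "Pedagogical usability": 0.900,
    "User interface usability": 0.798,
}
constructs = {
    "Content": 0.769,
    "Navigation": 0.799,
    "Customization": 0.741,
    "Feedback": 0.756,
    "Multimedia resources": 0.859,
    "Educational activities": 0.873,
    "Design": 0.817,
    "Social interaction": 0.916,
}

for name, alpha in {**subscales, **constructs}.items():
    print(f"{name}: {alpha:.3f} -> {reliability_label(alpha)}")

# Tally of labels across the eight constructs.
label_counts = Counter(reliability_label(a) for a in constructs.values())
print(dict(label_counts))
```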
In the process of refining the instrument, one subdimension was eliminated: the motivation subdimension. This subdimension was part of the interface usability dimension and considered aspects that increase learner engagement and motivation, related to gamification support [73], [74]. However, the aspects considered in this subdimension proved not to be aligned with the subdimensions of "classic" usability (usually Nielsen-based), which are more commonly considered, such as navigation, feedback, etc. [75]. It is therefore proposed, as future work, to include a separate dimension in which aspects more closely related to the user experience (UX) would be considered [76]. UX is a more generic concept than classic usability and includes it; subdimensions such as value, desirability, engagement, entertainment, etc., have a place in it [77].
The instrument designed and validated was conceived as an evaluation support instrument and, therefore, as a tool for supporting inquiry-type usability evaluation methods (UEM). Its use, combined with testing methods, makes it possible to obtain quantitative measures of the quality (usability) of the m-learning system. However, the dimensions, subdimensions, and criteria included in the questionnaire can also be regarded as heuristics and design guidelines, which can be used both in usability inspection methods and in the design stages of mobile learning applications [78]. It is therefore considered that this questionnaire could be used in the initial phases (as a checklist or heuristic guide) and in the final phases (as an evaluation instrument) of the ADDIE (Analysis, Design, Development, Implementation, and Evaluation) method [79].
In any case, and in relation to its initial objective as a validation instrument, it should be noted that in the area of Human-Computer Interaction and, specifically, in the field of UEM, the ideal approach is considered to be a combination of methods, which provides more complete and comprehensive information on the usability of the interactive system than the isolated application of a single evaluation method [80].

VII. CONCLUSION, LIMITATIONS, AND FUTURE WORK
This article proposes an evaluation tool, called CECAM, for educational applications with mobile support, which considers pedagogical aspects and usability requirements. The CECAM survey can have a dual use: as an evaluation tool, and as a checklist or set of heuristics in m-learning system design processes. Its application can improve the usability of mobile learning applications, as well as the learning experience of students. This proposal is a relevant contribution to the field of m-learning, since there are hardly any properly validated evaluation instruments that cover both technological and pedagogical aspects. Therefore, it is considered that its use can help improve the quality of m-learning systems. Even so, there is still work to be done in this area, particularly in ensuring that mobile learning support tools are accessible, effective, and satisfactory for users, providing them with good learning experiences and, consequently, promoting educational success.
The survey instrument developed, consisting of 56 items grouped into two dimensions (pedagogical usability and usability of the user interface), presents high levels of validity and reliability. One of the main limitations of the work described is the small sample size (37 students) used to calibrate the CECAM questionnaire, since it does not reach the 50 participants recommended for robust statistical analysis [81]. The low number of participants poses a threat to the external validity of the proposed instrument, which implies that the results obtained by using it should be considered with caution, although they are not necessarily erroneous. Furthermore, there is an inherent threat to the internal validity of experimental designs in which the subjective perception of learners is measured: researcher and participant bias. The use of subjective (perception-based) surveys has the drawback that responses may be biased, as participants sometimes say what they think the researcher wants to hear. This threat can be reduced by promising subjects that their responses will be treated anonymously, but this does not guarantee that their answers will be completely truthful and objective. Despite these limitations, the instrument can be useful for researchers and practitioners seeking to improve the design, implementation, and effectiveness of m-learning applications.
Therefore, this work highlights the importance of analyzing the validity and reliability of any measurement instrument before using it to draw conclusions about the data collected. As future work, an additional dimension will be included, covering specific factors related to the UX or, as it is called in the e-learning field, the learner experience. Once this has been done, the validation work described will be replicated with a larger sample of students from several universities in order to increase the precision of the psychometric properties of the proposed survey.

APPENDIX
See Appendix Tables VI-X.

Fig. 4. Process to calculate the reliability for the CECAM questionnaire.

TABLE I RECENT AND NOTEWORTHY PROPOSALS THAT ADDRESS THE EVALUATION OF M-LEARNING SYSTEMS

TABLE II SUBSCALES, CONSTRUCTS, AND NUMBER OF ITEMS OF THE PRELIMINARY QUESTIONNAIRE

TABLE III RELIABILITY OF THE PEDAGOGICAL USABILITY SUBSCALE AND ITS FACTORS

TABLE IV RELIABILITY OF THE USER INTERFACE USABILITY SUBSCALE AND ITS FACTORS

TABLE VI ID AND DESCRIPTION OF EACH ITEM IN THE PEDAGOGICAL USABILITY SUBSCALE

TABLE VIII CORRELATION COEFFICIENTS OF THE ITEMS AND CONSTRUCTS OF THE SCALE (FACTORS CROSS-LOADING MATRIX)

TABLE IX CORRELATION COEFFICIENTS BETWEEN THE SQUARE ROOTS OF THE AVE OF EACH CONSTRUCT

Table IX (in Appendix) shows the values of the square roots of the AVE for each construct (values in bold in the diagonal of the matrix) and the correlation coefficients with the other constructs (values under the diagonal). Accordingly, all the construct values in the scale (see Table IX of Appendix), highlighted in bold, meet the requirements for discriminant validity. Consequently, the quality analysis of the initial version of the CECAM questionnaire (Tables VI and VII of Appendix) has led to a reduced instrument (Table X of Appendix) with better reliability and validity. The next section discusses the main findings obtained during the CECAM calibration.

TABLE X ITEMS INCLUDED IN THE FINAL VERSION OF THE CECAM QUESTIONNAIRE