Development of Automatic Source Code Evaluation Tests Using Grey-Box Methods: A Programming Education Case Study

Increasing the effectiveness of programming education has emerged as an important goal in teaching programming languages in the last decade. Automatic evaluation of the correctness of the student's source code saves teachers time and effort and allows a more comprehensive focus on the preparation of assignments with integrated feedback. The study aims to present an approach that enables effective testing of students' source codes within object-oriented programming courses while minimising the demands on teachers when preparing the assignment. This approach also supports variability in testing and helps prevent student cheating. Based on the principles of different types of testing (black-box, white-box, grey-box), an integrated solution for source code verification was designed and verified. The basic idea is to use a reference class, which is assumed to be part of every assignment, as the correct solution. This reference class is compared to the student solution using the grey-box method. Because the two share an identical interface (defined by the assignment), comparing instance states and method outputs is a matter of basic programming language mechanisms. A significant advantage is that random generation of test cases can be used in such a case, while the rules for their generation can be determined using simple formulas. The proposed procedure was implemented and gradually improved over 4 years on groups of bachelor students of applied informatics with a high level of acceptance.


I. INTRODUCTION
Achieving greater efficiency in programming teaching has become one of the essential goals in the field of programming language teaching in the last decade. The primary reason is the growing need for programmers in the labour market and the increasing demand for many programming languages. The programmer must learn to think in a specific programming language and maintain this skill. Furthermore, the extension of MOOCs and e-learning courses has created the preconditions for the implementation of assignments that allow the automatic evaluation of the correctness of source codes by electronic tools and can be used in large groups of students.
Currently, many researchers are exploring the potential for AI models like ChatGPT to take over the role of programmers [1], [2]. This idea is often elaborated in unscientific sources, often leading to exaggerated claims. However, the reality is that the displacement of programmers by AI is primarily relevant to tasks that involve automated procedures and the creation of basic code snippets [3]. AI models hold significant potential as helpful tools for programmers when dealing with clearly describable and frequently repetitive code segments. However, it is currently challenging to envision them being able to design a complex, efficient, and reliable application.
Therefore, the objectives of training programmers remain consistent compared to the previous decade. Computer programming is important in gaining problem-solving skills and critical thinking development [4]. Acquiring programming skills enhances students' ability to express themselves clearly and solve problems accurately, extending to areas outside of programming [5]. Integrating the complementary capabilities provided by generative AI into programming education is desirable at a certain stage of preparation.
Research-based recommendations usually state that teachers should be available to students while they are solving the automatically evaluated tasks. The teacher's role is to help students and explain the feedback on an incorrect result or perhaps the whole problem that the task should solve. The reality, however, is that these types of systems are primarily used in self-study [6] or flipped classrooms [7], [8].
Based on [9], important reasons to prepare educational environments that support learning programming are:
• To make progress, students need help because learning programming is challenging, especially in the introductory phases of new topics. The environments provide immediate feedback to students and support their learning speed.
• Programming courses are attended by thousands of students worldwide and by hundreds of students at individual universities. Helping students individually and solving their repetitive problems requires enormous time and teachers' commitment. The aim of educational environments is a reduction of the teacher's workload focused on evaluating assignments and the ability to move the saved time to another area of his activity.
The time and effort of the teacher saved by the automatic check of the correctness of the student solutions should be devoted to the thorough preparation of assignments integrating feedback, especially in case of incorrect answers.
The paper aims to describe a universal structure and form for verifying the correctness of the source code so that the student receives clear and precise information about which part of the program is incorrect and why.
The article consists of several parts. The theoretical background aims to provide a comprehensive overview of current testing methodologies. It starts with a brief description of static testing and then moves into dynamic code analysis. The focus then shifts to thoroughly examining black-, white-, and grey-box testing principles. This survey covers various approaches, techniques, advantages, and disadvantages to identify appropriate methodologies applicable to the presented research. In addition, this chapter covers the process of generating test cases.
Related work is devoted to generating feedback and evaluating the code in the virtual educational environment.
The core of the paper is focused on the conception and validation of an approach adapted to facilitate effective testing of student source codes in object-oriented programming.
The final section summarizes the primary findings, contextualizes them, and discusses potential applications of the presented approach.

II. THEORETICAL BACKGROUND
Evaluating the correctness of source code brings several challenges. Some of them have already been solved, and some of them have partial solutions. Source code correctness testing can be implemented at two levels [10], [11]:
• Static analysis is based on source code analysis. It focuses purely on the structure of the program and returns a measure of consistency with the required specifications. It does not automatically follow that a program that returns the expected results is also statically correct - it may use the wrong program structure or some inefficient items.
• Dynamic analysis involves executing code that iteratively runs for different input parameters. The parameters are to be chosen to cover both critical and general values, and the correctness of the program is determined by matching the results obtained by executing the program with the predefined correct values.

A. STATIC ANALYSIS
The information from the static analysis is used to improve code quality, safety, and robustness and, in some cases, to identify cyclic method calls or infinite code loops.
In addition, according to [12], static analysis aims to identify syntax and interface errors, reveal the potential for code reduction, highlight architectural standards, remove potential sources of bugs and inefficiencies, and eliminate actual bugs.
According to [13], typical examples of errors and problems uncovered by static analysis are uninitialized variables, unreachable code, unused variables, type mismatches, and many other mistakes often arising from programmer inattention. Therefore, the authors collected and processed information on applying static analysis to code written by beginners and then quantified and analysed the identified errors.
The results confirm that novice programmers make many errors such as ''uninitialized variable'', ''unused variable'', ''type mismatch'', and ''index out of list scope''. Most of these errors go unnoticed by novices and are often the cause of incorrect solutions, even if the original idea of the algorithm is correct. The study shows that integrating static analysis into the learning process can make a difference, especially since it can help novices find many common mistakes they make before running the code. Furthermore, the mentioned functionalities were often implemented as features in modern development environments as hints for programmers. However, deciding whether the code is correct or incorrect cannot rest on static analysis alone. According to [14], current static analysis tools can provide developers with enough information to assess what to do with the warnings generated, but they rarely offer a relevant fix for what they claim is a problem. Quick fixes offered by static analysis tools provide a potential solution and apply it to the problem, helping developers assess warnings more quickly and ultimately saving time and effort.
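To make the typical findings concrete, the following minimal Java snippet (an illustrative example, not taken from the cited study) contains several of the mistakes listed above that a static analysis tool would usually flag before the code is ever executed.

public class AverageExample {

    public static double average(int[] values) {
        int sum = 0;
        int unused = 42;                              // typically reported as an "unused variable"
        for (int i = 0; i <= values.length; i++) {    // off-by-one bound: "index out of array scope"
            sum += values[i];
        }
        return sum / values.length;                   // integer division: the intended real-valued average is lost
    }
}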

B. DYNAMIC ANALYSIS
Dynamic code analysis has its place in programming education and real-world application development. The principle of data preparation and testing is very similar in both cases. The basic approaches to code testing are built on knowing or not knowing the internal structure of the unit (program or class) under test:
• White-box testing - enables in-depth analysis and bug detection in the internal program structure, assisting users in optimizing the program to its full potential [15].
• Black-box testing - the code being tested cannot be accessed (does not need to be accessed) by the tester; this approach checks the dynamic behaviour of the program and is generally faster and easier to perform than white-box testing.
• Grey-box testing - combines the advantages of both testing methods and creates a kind of intermediate layer between the accessibility and the unavailability of the tested code. In this case, the tester has a description of the interface of the tested code but does not know its exact implementation.

1) BLACK-BOX TESTING
Several perspectives and parameters define black-box and white-box testing categories. According to [16] and [17], the black-box approach assumes that the program is considered a ''big black box'' for the tester, who cannot see inside the box. All a software tester knows is that inputs can be given to a black box, and the black box processes them and sends ''something'' back. The primary objective is to analyse the generated output and evaluate its compliance with the desired specified requirements and/or outputs. This communication usually uses an I/O (input/output) approach to verify the correctness of simple programs. The program reads the values appearing as inputs from the user and returns the corresponding values as output to the console. Both inputs and outputs are redirected so input and output reading can be automated. The validating software then compares whether the expected results follow the outputs obtained based on the execution of the program (Figure 1).
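A minimal sketch of such an I/O check is shown below; it assumes the student's compiled program can be started as an external process and that the correct output for the prepared input is known in advance. The class name, command, and expected value are illustrative only, not part of any specific tool described here.

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.concurrent.TimeUnit;

public class IoChecker {

    // Starts the program, feeds the prepared input to its redirected standard input
    // and returns whatever the program printed to its standard output.
    static String run(String command, String input) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command.split(" "))
                .redirectErrorStream(true)
                .start();
        try (Writer w = new OutputStreamWriter(p.getOutputStream())) {
            w.write(input);                           // redirected standard input
        }
        String output = new String(p.getInputStream().readAllBytes());
        p.waitFor(10, TimeUnit.SECONDS);              // guard against infinite loops
        return output.trim();
    }

    public static void main(String[] args) throws Exception {
        String expected = "8";                        // expected output for the prepared input
        String actual = run("java StudentSolution", "3 5\n");
        System.out.println(expected.equals(actual) ? "PASSED" : "FAILED, got: " + actual);
    }
}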
The most crucial step in black-box testing is the proposal of suitable test cases, emphasising covering all types of program behaviour and potential exceptions. Techniques aimed at streamlining the test preparation process and maximizing input coverage can be categorized into several groups within the context of testing:
• Boundary Value Analysis is based on the finding that many errors tend to occur near extreme values of the input variables. For each input variable, it is therefore desirable to carry out tests containing the minimum value, values just above the minimum value, ''normal'' values, values just below the maximum value and the maximum value [18]. Sometimes, it is necessary to verify the correctness of the program's behaviour with values below the minimum and above the maximum value. Test cases are designed for valid and invalid boundary values [19].
• Fuzzy Testing is a testing technique that focuses on creating random inputs and monitors their execution to identify errors and unusual program behaviour [20].
The main goal is to detect errors, unverified inputs and vulnerabilities by making simple, invalid or unexpected inputs. The fuzzy testing process is automated and generates inputs randomly or based on specified patterns [21], [22].
• Equivalence Partitioning is a technique that aims to reduce the number of test cases, dividing inputs into groups so that each focuses on a specific property or part of the program (e.g., branches in the code, boundary values, etc.) [23]. The technique is based in part on the assumption that all inputs in the group cause similar behaviour (and errors) in the program. That is, if one input value from the group causes a failure, then it is assumed that the remaining inputs from it are also problematic; conversely, for inputs that do not cause an error, the entire group can be considered successfully tested [24].
• Pair-wise Testing (All Pair Testing) is based on the idea that most software errors show up in specific combinations of input values, and all possible combinations of inputs can be reduced by identifying the relationships between them. This type of testing is based on the idea that merging two or more testing parameters could reduce test cases, and a software system can be tested more proficiently and quickly while covering all real test cases. Pair-wise testing has limitations in situations where errors depend on combining three or more input parameters. In that case, other techniques must be used (e.g., random selection of inputs) [25].
• Cause-Effect Graph is a testing method with a heuristic approach [26]. It is the only black-box testing technique that considers combinations of causes for system behaviour [27]. The basic idea behind the cause-effect graph is that it identifies the various causes and effects, or conditions and actions, that influence software behaviour. These causes and effects are represented by a graph showing their relationships. As a result of the process, test scenarios and test cases are created to cover different combinations of causes and effects [28].
• State Transition Testing is an approach where changes in input parameters lead to changes in the operational (internal) states of the tested application. It involves systematically observing how an application responds to different input conditions in a sequence. The tester provides a range of input values, encompassing correct and incorrect input test values, and records how the system behaves during these transitions [29].
• Orthogonal Array Testing is often used when it is necessary to evaluate many parameters and their combinations. This technique helps identify relationships and interactions between parameters that could lead to unexpected errors or problems in the software. The technique selects some representative combinations to perform tests, ensuring increased testing efficiency, reducing the number of tests, and reducing testing time while maintaining sufficient coverage of various input combinations [30]. It is a systematic, statistical way of testing pair-wise interactions by deriving a suitable small set of test cases [16]. In this way, it is possible to identify errors and shortcomings in the software while simultaneously optimising the testing process.
• Error Guessing is a method centred around speculation and inference. Skilled testers suggest various inputs and strive to uncover potential flaws. The efficacy of this approach depends on the testers' expertise; experienced testers can identify probable defect locations and types accurately [19], [31]. Nowadays, it is more of a historical method, as human expertise has been replaced by machine performance and rapid verification of many test cases.
The advantages of black-box testing can be summarised as follows:
• Implementation independence - allows testing without access to the source code or internal details of the program. Testers do not need to know the programming language in which the program is written [32].
• The universality of test cases arises from the fundamental principle of testing. It allows using the same test data across various programming languages [33].
• Objectivity - perspectives of the programmer and tester are separated [18]. The tester focuses on the functionality and correct behaviour of the software without being influenced by the programmer's thought process because he does not know or see the source code.
• Simple preparation and launch of tests -test cases can be designed as soon as the specifications are complete.
They can be scaled to many moderately skilled testers without implementation knowledge [32].
• Early detection of errors - black-box testing can be performed at different development levels, allowing errors to be detected earlier in the development process [34]. This approach can reduce the cost of repairs later.
• Efficient for large code segments - because only the outputs are compared during testing, in principle, extensive code testing places the same demands on the tester as in the case of a simple one. In this way, it is possible to test the correctness of extensive parts of the application [18], [32].
• Variety of test scenarios - testing can be based on different scenarios and inputs, ensuring that different aspects of the software are tested. Variety increases the probability of detecting various errors [35].
• Testing allows valid and invalid inputs to be verified - the program can be run even if the inputs are invalid (e.g. by type) [36].
• User orientation - the behaviour of the software during testing corresponds to how the software works for end users. Testing can be focused on real-world situations and scenarios that users may encounter and thus better uncover problems that may occur in real-world operations like incorrect outputs, unexpected behaviour, or incorrect responses to various inputs [37].
Undoubtedly, black-box software testing is not without drawbacks:
• Code coverage depends on the quality of test scenarios - the effectiveness of black-box testing depends on the quality of test scenarios and the completeness of code coverage [18], [38]. Because testers cannot access the source code, achieving a satisfactory percentage of code coverage can be difficult, and some parts may remain untested.
• Low efficiency when testing complex systems -although tests can verify the functionality of complex systems, it can be challenging to capture all interactions and possible scenarios without knowing the internal structure of the system [39].
• Problems with detecting some types of errors -runtime or low-level errors (memory problems, delayed responses, inconsistent output) may not be detected due to missing inputs causing these errors [15].
• Problematic identification of the cause of the problem -testers may have limited ability to identify the core of the problem because they lack in-depth knowledge of the internal implementation of the software [40].
• Inadequacy for performance testing - black-box testing focuses on functionality but usually fails to test software performance adequately. It can be challenging for performance-optimized algorithms to detect problems based on external inputs alone [41].
• Insufficient emphasis on security -black-box testing is usually insufficient to identify security bugs and vulnerabilities, requiring deep knowledge of the software and its internal mechanisms [42], [43].
• The possibility of redundant tests -in the case of insufficient control, tests with the same parameters may be run more than once [32].

2) WHITE-BOX TESTING
According to [44], white-box testing is a significant issue in software engineering. White-box testing strategies are based on knowledge of and access to the internal logic of software components. Therefore, the preparation of test suites for white-box testing depends on the source code of the software component under test. This type of testing focuses on verifying the functionality and quality of code based on knowledge of its internal workings. Testers typically analyse source code, configuration files, database schemas, and other technical details in white-box testing. Based on the findings, they prepare scenarios that test different code paths, conditions, loops, and functions. They concentrate primarily on code structure, branches, loops, conditions, etc.
White-box testing techniques can be divided according to the goal to which the scenarios are aimed:
• Boundary Value Analysis essentially copies the same technique from the black-box type of testing; it focuses on testing the boundaries of the validity of the input data. The minimum and maximum input values are tested, verifying whether the system handles extreme values correctly [18]. It focuses more on testing at boundaries, including minimum, maximum, just inside/outside boundaries, error, and typical values.
• Statement Coverage ensures that every statement in the code is executed at least once during testing. Test scenarios are created so that each command is executed at least once during the tests. Statement coverage is a weak criterion because it is insensitive to some control structures [45].
• Branch Coverage ensures that every possible branch path (if, switch, etc.) is tested [46]. The process involves formulating test scenarios that encompass affirmative and negative conditional branches. All cases documented in the code are passed through in situations involving multiple branches.
• Loop Testing focuses on testing loops in code. It verifies the validity of the loop construction and whether the desired behaviour is maintained during iteration. Loop errors occur primarily at their initiation or conclusion [47]. This method focuses on ensuring the correctness of loop constructs.
• Path Coverage involves testing every possible path through the code, including all branches and conditions. It is a very demanding technique that tries to cover all possible scenarios [48]. Path Coverage is very challenging because it can require many test scenarios. For efficiency, therefore, Basis Path Testing is used, which deals with identifying key paths derived from the cyclomatic complexity of the code and focuses on testing these paths [40]. Basis Path Testing tries to reduce the effort of Path Coverage testing by focusing on key paths instead of testing all possible scenarios.
• Equivalence Partitioning partitions a set of input data into groups where each group is equivalent in terms of the expected behaviour of the system [47].
Representative values from each group are then tested. The mentioned techniques also include techniques that can be used alone or in combination with the ones mentioned above to monitor the program as a whole: Control Flow and Data Flow.
Control Flow Testing is a fundamental and effective technique for every type of software and is almost always implemented to some extent [47]. It analyses how different statements and structures in the program code are translated and how the program is executed according to different decisions and branches. It focuses on tracking the paths through which program execution moves [49]. It examines how program flow is controlled through branches (if, switch), loops (for, while) and other control structures.
Data Flow Testing focuses on the points where variables receive values and where values are used [19]. Data flow analysis examines how data is moved and changed in different program parts, examining the processing and propagation of variable values in code segments. It focuses on tracking the assignment and change of values to variables and their transfer between code sections. Data flow analysis aims to detect errors in data manipulation, such as uninitialized variables, unused variables, and incorrect value circulation.
In addition to automated techniques, white-box testing also includes static techniques (Code review), which are usually performed without running tests, by direct analysis of the code:
• Desk Checking represents static checking performed by programmers before compilation or execution. The code is compared to the requirements specification or design to verify that it meets the requirements.
• Code Walkthrough (also known as Technical Code Walkthrough) involves high-level employees such as technical leads, database administrators, and one or more colleagues. These people ask questions about the code to the author, and the author explains the code, ideas, and logic. The code is fixed immediately if there is any error in the logic [16].
• Formal Inspections identify defects in artefacts such as code and design documents. They involve a team of members who follow a predefined process, organize review sessions, and take on different roles, such as authors, moderators, reviewers, and recorders. Checklists and guidelines are used to evaluate artefacts, and any issues identified are tracked and documented for resolution [50], [51].
White-box testing techniques apply to the following levels of software testing: unit testing, system testing, and integration testing [18], [40], [47]. System testing and integration testing are not relevant within the context of this article.
Unit testing is a specific element of white-box testing aimed at testing elementary parts of the code. According to [52], unit testing has become an accepted practice, often even mandated by development processes (e.g., test-driven development). Unit testing aims to verify the functionality and correctness of (almost) every elementary part of the program. A typical example is testing the constructor and individual methods of a class in the relevant programming language. Many frameworks that support unit testing for the most popular programming languages are available today. Their occurrence and frequency of use, together with the test patterns applied, are mapped by researchers in [53].
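A minimal JUnit 5 sketch of such a unit test is given below; the Rectangle class and its accessors are hypothetical examples introduced for illustration, not assignments taken from the course described later.

import static org.junit.jupiter.api.Assertions.*;
import org.junit.jupiter.api.Test;

class RectangleTest {

    @Test
    void constructorStoresDimensions() {
        Rectangle r = new Rectangle(3.0, 4.0);
        assertEquals(3.0, r.getWidth(), 0.001);    // expected value, actual value, allowed precision
        assertEquals(4.0, r.getHeight(), 0.001);
    }

    @Test
    void areaIsComputedCorrectly() {
        Rectangle r = new Rectangle(3.0, 4.0);
        assertEquals(12.0, r.getArea(), 0.001);
    }
}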
The advantages of white-box testing focused on unit testing can be summarised as follows:
• Having access to the source code guarantees that testers have access to the internal details of the program, allowing them [16], [18] a better understanding of algorithms and code structure, the ability to focus on specific parts of the code, quickly identify errors in logic, etc. As a result, it is possible to identify even complex problems in the early phases of development and suggest optimizations.
• Achieving high code coverage [19], [47] is accomplished because testers can purposefully design scenarios to ensure the best possible coverage of different branches and paths in the code in the least number of tests.
• Enabling testing from the initial phases of development allows running tests for program components long before the program reaches its final state. This process is independent of the program's graphical user interface or input/output channels of the application [47], [54].
• Compatibility of tests with automation -individual tests and test scenarios are built to be automated and replicated at any stage of software development.Such preparation increases the efficiency of the testing process [19], [54], [55].
• The enhancement of program design and architecture results from the close connection between the tested code and the test process itself. Continuous testing brings accelerated development, faster identification and rectification of errors, increased assurance in software quality, and more efficient use of resources [54]. Moreover, testing can also help reveal flaws in software design or architecture, facilitating early changes to improve the overall design [47].
• Identifying security vulnerabilities and performance issues is based on the analysis of code and testing specific sections to uncover inputs that cause exceptions, expose potential security weaknesses, or lead to performance degradation. The testing process facilitates the identification of specific code fragments that need to be reprogrammed [41], [56].
• Enhanced collaboration between testers and developers defines the identified issues more precisely. Collaboration allows the developer to receive accurate information about the location in the code responsible for the problem [16], [32].
Of course, even white-box testing has disadvantages, which can be summarized as follows:
• The main drawback of white-box testing stems from the demands on testers' knowledge and expertise. Testers must be proficient in the programming language in which they create the tests, including its nuances [40], [47]. Despite the automation of test generation in numerous cases, testers still need to oversee scenario design and certain operations to achieve optimal code coverage [18]. This fact limits the potential for involving testers who lack expertise in the specific language.
• Difficulty in maintenance and the necessity for seamless communication between testers and developers -if the internal implementation of the software changes, the tests can become susceptible to errors and need modifications.Deep collaboration between teams can be challenging to organize [19], [32].
• Isolation of tests from the program as a whole and from external dependencies - because white-box testing focuses on internal details, there is a potential inadequacy in evaluating the comprehensive behaviour of the program. This approach might lead to testing only isolated components without considering their interplay and interaction with external services [32], [57]. Consequently, specific problems might remain latent until the software is put into a production environment, and performance and scalability issues tied to external elements might go undetected.
• Neglecting UX validation -can stem from the inherent nature of white-box testing, which focuses on internal complexities, so UX issues can be overlooked by testers [16], [32].

3) GREY-BOX TESTING
Grey-box testing combines white and black-box testing techniques, taking advantage of both. This approach involves the use of inputs and outputs but provides testers with partial knowledge of the internal structure of the application. Thanks to this knowledge, testers can design test scenarios more effectively, but they are not burdened with a detailed inspection of the logic and content of the code [58]. In grey-box testing, the understanding of the program internals falls between black-box testing, which has minimal internal insight, and white-box testing, which involves comprehensive knowledge of the internal code structure [59]. Grey-box testing can be applied to most testing phases. According to [60], it is suitable mainly for integration testing. Researchers state the following as the most common and specific techniques for grey-box testing [18], [32], [49]:
• Matrix Testing identifies program variables, each linked with technical and business risks, determined by the tester's judgment. The usage frequency of variables in tests aligns with the assessed risk level.
• Orthogonal Array Testing is mainly a black-box testing technique. As mentioned above, this technique selects some representative combinations to perform tests, reducing the number of combinations. This type of testing is used to select an appropriate subset of all possible combinations.
• Pattern Testing is a technique that involves analysing previous errors or defects in the software to identify recurring patterns or common types of issues. Testers study the root causes of past failures and use this analysis to design new tests that target similar types of errors in the code. By identifying patterns in past defects, testers aim to improve testing effectiveness by proactively focusing on code areas prone to specific issues.
• Regression Testing involves testing the software after any changes, updates, or additions are made to ensure that the modifications haven't negatively impacted the existing functionality. It aims to verify that new code or features haven't introduced unintended side effects or broken previously working software parts. While regression testing is not specific to grey-box testing, it's a common practice in software development and is often incorporated into automated testing processes to maintain the stability and reliability of software.
Compared with the white-box and the black-box, grey-box testing brings a test of medium granularity, is moderately demanding, and verifies the application from the point of view of the whole and its parts [60]. As for the advantages and disadvantages, their strength or adequacy depends on shifting the grey test towards white-box or black-box methods. Therefore, the view of advantages and disadvantages, based on the most frequently stated opinions, should be taken with a grain of salt because the presented advantages can be disadvantageous from another point of view and vice versa. They are, therefore, listed as grey-box testing characteristics only:
• Offers combined benefits - the strengths of both black-box and white-box testing can be leveraged [32].
• Unbiased testing - testers and developers typically remain distinct entities. Testers often receive interface definitions and documentation without direct access to the source code [32], [49]. This approach leads to partial code coverage, contingent on the tester's skills in crafting tests. A substantial portion of the code might consequently go untested.
• Testers conducting this type of testing don't require advanced programming skills.
• The test is performed from the user's rather than the designer's perspective [18].
• Intelligent test creation is indeed suitable in situations with limited information. Testers can develop smart test scenarios, particularly focusing on aspects like data types, communication protocols, and exception handling [32].
• Grey-box testing is unsuitable for algorithm testing, and some test cases are difficult to design.
While grey-box testing offers benefits, it's important to acknowledge its drawbacks. Therefore, selecting the optimal approach should be deliberate, considering the project's characteristics and testing goals.

C. TEST CASE GENERATION TECHNIQUES
Testing often accounts for more than 50% of the total cost of software development [61]. Due to the cost needed to achieve sufficient software quality through the implementation of tests, much attention is currently being paid to alternatives supporting simplification and streamlining of the process. In addition, test generation is one of the most intellectually demanding tasks. The use of its principles also accelerates the creation of test cases for school assignments and makes them more sophisticated and secure against student ''hacking''.
Many approaches to test case generation have been proposed and explored in the past decade. They can be categorised based on [61], [62], and [63] as follows:
• Combinatorial testing can detect errors caused by the interaction between several parameters of the tested system. Combinatorial methods are based on the observation that not all parameters contribute to every mistake, and most errors are caused by one parameter or the interaction between a small number of parameters, usually two to six [25]. Test cases or specific program configurations are generated from the list of parameters, selecting a subset based on some coverage criterion of the Cartesian product.
• Random testing requires test cases (inputs) to be generated randomly and independently from the input domain. It is commonly assumed that each element of the input set has an equal probability of being selected as a test case. This type of testing can be used alone or in combination with other testing methods and is a commonly used test method due to its conceptual simplicity and efficiency [65]. Adaptive random testing is based on the idea that failure-causing inputs tend to form contiguous failure regions, and therefore, non-failing inputs should also form contiguous regions. Therefore, if the previous test cases did not reveal a failure, the new test cases should be far away from the already executed test cases that do not cause the failure [61].
• Search-based testing is the combination of automatic test case generation and search techniques. The search algorithms aim to automate the process of prioritising test cases, generating test data, optimising software test oracles, minimising test suites, authorising real-time properties, etc. [66]. A problem-specific fitness function that guides searching for good solutions from a potentially infinite search space in a practical time limit is crucial to optimisation. A typical example is using genetic algorithms to prepare test cases [67].
• Metamorphic testing is a technique that uses some necessary properties of the software under test to create new test cases [63]. New test cases are based on transforming some selected existing test cases. For the test result verification, instead of using an oracle for each test case, the test results from multiple test cases are checked against the corresponding metamorphic relations [68].
Each technique has advantages and limitations that must be carefully considered per project-specific requirements. The effectiveness of the testing process and its ability to detect potential flaws depends on choosing an appropriate test generation approach. Combining these techniques can provide better coverage of test scenarios and improve overall software quality.
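As an illustration of random test case generation with a simple oracle formula, the following sketch draws inputs uniformly from an input domain and derives the expected result from a reference formula; the class name, domain limits, and number of cases are assumptions made for this example.

import java.util.Random;

public class RandomCaseGenerator {

    public static void main(String[] args) {
        Random rnd = new Random();                 // could be seeded for reproducible test runs
        for (int i = 0; i < 20; i++) {
            int a = rnd.nextInt(2001) - 1000;      // input domain: -1000 .. 1000
            int b = rnd.nextInt(2001) - 1000;
            int expected = Math.max(a, b);         // oracle defined by a simple formula
            System.out.printf("case %d: input=(%d,%d) expected=%d%n", i, a, b, expected);
        }
    }
}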

III. RELATED WORK
Integrating software testing into the software engineering curriculum is essential [69]. In addition, when teaching the basics of programming, it serves as a key tool that allows students to receive accurate feedback on the correctness and accuracy of their code. Often used by educators and course creators, this approach makes it easier for students to understand programming principles better and improve their skills.
In practical implementation, this methodology works in such a way that students develop their programs according to the assignment. The programs are then verified using a testing tool. Tests are prepared in advance; evaluation usually involves sending the program and providing feedback [70]. The results of the evaluations provide students with an insight into the correctness of their code and its alignment with the required functionality.
This approach has several advantages. First, it lets students see how programming theory translates into program code. It also offers instant feedback, allowing students to quickly correct mistakes and improve their programming skills. Finally, it teaches students the importance of code quality control and error detection principles.
Overall, using testing in learning the basics of programming helps students better understand the practical aspects of programming and increases the quality of their programs.
Thinking about source code evaluation in learning programming makes sense if tests can be automated, quickly executed, and run anytime and many times. All the presented approaches fulfil the fundamental prerequisites for integrating into students' instructional curriculum. Nevertheless, it is necessary to consider the natural flaws that could potentially weaken their effectiveness:
• It is not easy to design test cases if the student assignments are not clear and concise.
• If the tester is not paying attention, it is possible to create redundant tests, which can give the impression of code correctness [19].
• Black-box (I/O) testing is challenging to define for more composite and complex tasks.
• Black-box (I/O) testing does not apply to testing code segments without changing the evaluated code.
• Although most platforms have tools for creating and executing tests, they are not available for every kind of implementation/platform [19].
• The test code must also be modified if the verified code has been changed (addition, change of parameters/logic).
The shortcomings of the I/O approach are partially eliminated by unit testing, but the limitations resulting from the natural effort of students to simplify the solving of hard tasks should be solved with a more advanced technique.
An example of a student solution exploiting validation through I/O access is in Figure 2. Even though the code for the correct solution is much shorter, the student chose the cheating route.
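The following snippet is an illustrative reconstruction of this cheating pattern (it is not the actual code from Figure 2): instead of implementing the required algorithm, the student maps the known test inputs directly to the published expected outputs.

import java.util.Scanner;

public class CheatingSolution {
    public static void main(String[] args) {
        int n = new Scanner(System.in).nextInt();
        if (n == 5) {
            System.out.println(120);       // hard-coded answer for the known test input 5
        } else if (n == 7) {
            System.out.println(5040);      // hard-coded answer for the known test input 7
        } else {
            System.out.println(1);         // fallback so that at least one remaining test may still pass
        }
    }
}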

A. FEEDBACK GENERATION
Getting feedback is an essential part of any learning process. Based on [9], feedback can be considered a formative method supporting the acquisition of knowledge and skills. Valuable feedback can provide the learner with corrective information, provide alternative solutions, bring information to clarify ideas, provide encouragement, or confirm that their answer is correct [73]. Quality feedback enables an overall increase in the quality of educational content. The types of feedback can be looked at from different angles.
A brief overview of the main characteristics is given in [74], which looks at feedback from the following aspects:
• The feedback source - represents the person or system that provides the feedback. Reference [75] reported that feedback from friends boosts self-confidence and reduces anxiety. However, [76] states that student feedback is often inaccurate compared to tutors. The role of the educational system is between these roles - precise as a teacher and friendly as a classmate.
• From the point of view of motivation, feedback can be formulated in a positive or negative form. Positive feedback is intended to encourage students. It must be consistent, especially in the first phase of behaviour modification. Negative feedback does not contain an explanation or a proposal for a solution. Although it can exceptionally excite a student, it should be avoided in general. It can often cause a drop in self-confidence among students.
• From the point of view of time, feedback can be perceived as quick and delayed. Part of today's educational systems are tools for providing immediate feedback, which should be the main driving force of the entire education. According to [74], immediate feedback provides students with information as soon as the desired behaviour is completed, reinforcing students' knowledge of their strengths and weaknesses. Although delayed feedback is generally not considered adequate during new knowledge acquisition, some sources [77] accept it and define situations for its adaptation.
A more detailed view of feedback categories is provided by [78], which is already focused on feedback provided by digital applications. The author identified the following categories of feedback focused on students: verification, trial-&-error, corrective, elaborated, explanatory and resulting feedback.
Individual types of feedback occur in existing educational environments and can be integrated into the communication model of the environment at different levels (question, course, course messages/notifications, overall communication).The prerequisite is the definition of rules, text of messages, connection to educational elements and user behaviour monitoring.
Generating feedback for programming exercises is based on similar principles to evaluating the source code mentioned above [79]. In addition, feedback is often developed within the evaluation of the solution. Its content is prepared as part of the verification method and usually consists of at least information about the status of the evaluated code.

FIGURE 3. A very brief report of the test case consists of just the name of the test case and a vague description of the bug [36].
In educational environments focused on evaluating source codes, part of the feedback message can be a fragment of the evaluation [80]. Scope and quality depend on the authors, how they formulate the feedback for the incorrect solution, and how they process the categorisation of wrong solutions and provide the student with further recommendations. Figure 3 shows part of the message and how the student can be provided with information about the reason for the error or instructions on how to fix it.
According to [9], educational environments, by default, generate feedback in the summative (number of correct criteria or passed tests from all), binary (correct or incorrect) or visual (progress bar) form. However, these forms provide a minimum of specific and formative information.
A higher level is provided by hints and instructions generated by student code analysis. Typical tools are based on continuously monitoring the programmer's activity or analysing the program code during evaluation. They take the form of recommendations, e.g., use this approach or a specific structure or method. These elements are part of many development environments today and rarely cover more than a few lines of code or a data structure. Code hints tell the student what to do but do not explain why. It can make these hints hard to interpret and decrease students' trust in their helpfulness [81].
Various specifically defined frameworks are used to achieve better and more accurate results. They are often based on creating assignments according to strict rules for writing code or complementing the control steps with messages the user receives when defined conditions are met or not met. For example, Gao et al. [82] describe a framework that uses automated testing tools to detect defects in student code and to provide feedback on those defects in the form of specific examples of incorrect behaviour.
The comprehensive feedback is based on the concolic testing presented in [83]. In the presented framework, the authors combine pre-prepared automated tests with random testing and symbolic execution. If there is any defect inside the student code, a test case leading to that defect is generated. Although this approach is didactically unbeatable, its major disadvantage is the necessity of customisation for each assignment and the tremendous workload on the assignment creation side; the authors prepared and tested its effectiveness on five tasks.
Machine learning has brought a new perspective to problem-solving in many fields. Reference [84] mentions autonomously providing feedback as one of the main challenges for mass education. The authors defined neural networks based on the idea that a program can be represented as a linear mapping between steps (commands) and conducted a learning process with data collected from millions of code.org users.
Despite the efforts proven by current research [6], [85], [86], providing quality feedback at scale remains an open problem. Creating high-quality feedback for small groups of users in an electronic course or a standard study program is exceptionally difficult and inefficient in terms of effort and effect.

B. EVALUATION IN THE EDUCATIONAL ENVIRONMENT
Even though research pushes the boundaries of source code evaluation and partly automates feedback generation, reality lags behind the most modern ideas. The content creator (often a university teacher) must work with time-efficient tools that can be handled when creating assignments and that support the creation of new content, including evaluation mechanisms, in a short time. Dozens of educational systems or plugins are currently available [9]. They support verifying source codes using some forms in a defined set of programming languages.
Dozens of systems with (automatic) program task evaluation use an input-output (black-box) approach. The correctness of the solution is evaluated by checking the student program outputs with prepared input values and expected outputs, which are manually inserted into the evaluation system by the assignment creator. This approach is popular; it allows different levels of penalties according to the ''importance'' of specific inputs in many systems, but it usually only evaluates the program as a whole, not its parts [87]. Figure 4 provides a typical view of a partially correct program.
An alternative, again frequently used solution is integrating unit testing into education systems. The xUnit libraries (e.g., JUnit, CUnit, PHPUnit, etc.) are used to check the correctness of the code using a white-box testing approach. They allow the definition of the test cases and expected correct results at the level of the entire program and the level of individual class methods. The advantage of this approach is the ability to define the accuracy of the result, which is especially useful if the solution results are real numerical values, and the ability to test program parts. However, the result is again compared against specific expected output values. Figure 5 presents the verification of methods for multiplying and adding a pair of numbers with the accuracy of 0.001 required when working with real numbers. Testing frameworks are robust and proven tools that support the automation of the testing process and accelerate the deployment of modified code in production. Code testing for borderline, risky and general inputs is generally sufficient, and it is possible to use the same input values all the time.
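A sketch in the spirit of the tests from Figure 5 is shown below in JUnit 5 syntax; the Calculator class and its method names are assumptions made for illustration, not the exact assignment used in the course.

import static org.junit.jupiter.api.Assertions.*;
import org.junit.jupiter.api.Test;

class CalculatorTest {

    @Test
    void multiplyTwoRealNumbers() {
        Calculator c = new Calculator();
        assertEquals(7.5, c.multiply(2.5, 3.0), 0.001);   // expected, actual, required accuracy
    }

    @Test
    void addTwoRealNumbers() {
        Calculator c = new Calculator();
        assertEquals(5.5, c.add(2.5, 3.0), 0.001);
    }
}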
In all approaches, students receive feedback consisting of input values, student program output, and expected correct output values. However, repeated use of the same inputs without additional teacher control or additional mechanisms focused on source code analysis may lead students to cheat and confidently generate input-based results using a simple condition (Figure 2). Therefore, in the case of using the principles of xUnit frameworks in verifying the correctness of student assignments, it is appropriate to make certain adjustments aimed at simplifying the preparation of projects, preventing cheating, and providing clear and understandable feedback.
Although standard elements of grey-box testing can be applied in educational environments, incorporating tools that aid in test creation can indeed be complex. Various annotation languages, such as JML for Java, Spec# for C#, and ACSL for C, can significantly assist in formulating and executing tests and defining other essential properties within the system. At the validation level, these languages can also contribute to generating black-box test data, ensuring comprehensive testing coverage [88]. However, integrating these tools into educational systems can present challenges due to the technical integration process and the specific requirements of educational environments.
Educational systems often work in web environments and communicate with intricate evaluation systems. Test creators must ensure the swift execution of tests to accommodate the simultaneous usage of the system by numerous users, thereby minimizing processing time and data transfer volume. This problem introduces additional performance, scalability, and efficient resource utilization challenges.

IV. CASE STUDY
The following case study shows the potential of implementing a robust source code verification mechanism within a virtual learning environment. This implementation is based on effective strategies utilising the latest software testing techniques. In educational environments, where the integration of testing tools can often be complex and complicated, this case study highlights the importance of using efficient procedures and simplified approaches based on black-, white-, and grey-box testing techniques. It exemplifies the practical application of current software testing knowledge and demonstrates how it can be used to create an effective source code verification framework in a virtual learning environment.

A. GOAL AND RESEARCH QUESTION
The goal of the study is to design and verify an approach that will enable effective testing of students' source codes in the subject of object-oriented programming. This approach also requires supporting variability in testing and preventing student cheating.
The research question is defined as: ''What is the optimal structure and form for validating source code so that students receive clear and accurate information about specific errors in their program and simultaneously eliminate attempts to solve the problem by cheating?''

B. CONTENT
Based on the framework mentioned in [33], teaching programming at the Faculty of Natural Sciences and Informatics of the CPU in Nitra uses a combination of microlearning and automatically evaluated program code. This combination has been used for three years in teaching Java to students of the applied informatics study program. Java is an introductory programming language for some students who have never written a program; the course concept is not based on an object-first approach.
The Priscilla system [89], developed at the university and thus providing maximum variability, is used as an educational environment. It uses the server part of the Virtual Programming Lab plugin [47], [48] created for LMS Moodle to run the code and verify parts of the exercises.
The content of the first semester is focused on the basic structures of the Java language and working with classes. It contains the following chapters:
1) Java introduction (output)
2) Variables, data input
3) Conditions, loops
4) Data types, with special attention paid to the string type
5) Nested loops and effectivity
6) Multiple conditionals
7) Arrays, 2D arrays
8) Files
9) Introduction to object-oriented programming
10) Methods, encapsulation, constructors
11) Static variables
12) Inheritance
13) Polymorphism
A format combining a flipped classroom and university lectures is used.
The preparation of assignments and the implementation of the automated control of source codes is carried out in the first half of the semester using an input-output approach. Students solve approximately 150 tasks of varying difficulty.
The second part of the semester is focused on solving more complex tasks, which are required to understand the principles of object-oriented programming, the necessity of class encapsulation, inheritance, etc.
Assignments are based on the requirement to design one's own classes and follow the general principles of creating automatically evaluated tasks [90]:
• Ensure programming tasks are concise, clear, and have well-defined inputs and outputs.
• Change the nature of the tasks from the requirement to master the basic principles of the given programming concept through word tasks to tasks requiring the mastery of non-programming issues to force the student to solve problems.
• Avoid tasks that produce direct outputs from a limited set of answers (e.g., true/false, 1/0) to avoid randomly generating an answer.
• Consider the time and computational complexity of creating test case inputs to maintain system efficiency and user availability.
Although the I/O approach can also be used when teaching work with objects, the result is usually only information about the passing of a test or the occurrence of a problem. Considering that testing a class consists of testing all its methods and class logic, any information obtained from black-box testing is insufficient, and it is no longer possible to create appropriate feedback.

C. GREY-BOX TESTING APPROACH
From a technological point of view, object-oriented code testing is usually based on the white-box testing methodology. Access to the complete class with all its methods is allowed in industrial testing scenarios. However, educational testing uses a different approach: assessment requires predefined tests to verify individual functionalities. The tester cannot access the code directly because it does not exist when the tests are written. The test creator is thus limited to interacting with the interface of the created object. This limitation places considerable emphasis on the accuracy and precision of the assignment that defines the specifications of the constructed class [91]. As a result, the test object is treated as a ''grey box'' in which its inner workings are partially hidden.
This approach is used in several educational environments designed to teach programming.
Bouvier et al. [92] emphasize the key role of a wellstructured task in improving students' problem-solving skills.This research considers comprehension, cognitive processing, mathematical understanding, and computer skills.Evaluation of the solutions included using grey and white-box techniques, with students' scores being higher for white-box 106782 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.methods.This difference was identified because grey-box methods require the program to be successfully compiled and run facilitate evaluation, while white-box strategies allow evaluation using static evaluation methods.The result of the study confirms that context can have benefits beyond simply solving a given problem correctly.
Garcia-Magarino et al. [93] present a novel online judge system designed to evaluate long programming practices based on unit testing for small pieces of code (e.g.functions, methods, and classes) and provide specific comments for each failure to guide students.Climent & Arbelaez [94] present a case study focused on teaching OOP principles, using Python as a programming language and unit testing to verify code correctness.
Fischer and Von Gudenberg [95] use their own code to compare the results of the student's and the tutor's class methods. This approach dramatically simplifies the test-writing process, as it enables the utilization of standard object manipulations without necessitating explicit and intricate checks.
Staubitz et al. [96] point out that in the case of massively parallel usage and code verification, response times quickly reach a level where it is no longer possible to provide timely feedback to students.
Based on the knowledge and methodologies used in the available solutions, unit testing stands out as the most practical approach to implementing grey-box testing. This approach is feasible for most frequently used programming languages through xUnit test components based on industrial software testing principles [97]. A typical process of executing an xUnit-based test is presented in Figure 6. To verify class correctness, it is necessary to verify its behaviour in the following situations:
• The functionality of input nodes -constructor, setters with typical, boundary, and prohibited values.
• The functionality of output nodes -getters for identifying the content of attributes or structured output of class state with appropriate output formatting.
• The functionality of methods -method outputs if it returns output, attribute state changes due to the application of methods that do not return results.
• The functionality of sequence of methods -verification of the correctness of the logic of the class, whether the sequence of methods correctly affects or, on the contrary, blocks changes made to attributes.
Tests are isolated from each other by default, so instances created within one test are not affected by the others. As part of preparing xUnit-based test frameworks focused on education, several tools supporting the primary educational aspects were created [80], [99].
The test creator defines a list of values (parameters) for the input to the tested method and adds the expected result values to it (Figure 5). A precision (tolerance) can also be specified if a small difference between the expected and the returned value is acceptable.
Running a test consists of the sequential execution of methods comparing the expected result and the result obtained from the instance of the tested class. The test framework ensures the independence of the tests from each other. According to the configuration, it creates a new instance for each test or performs other defined behaviour.
The result of the testing process is a list of tests with information on which tests were successful and which were not.
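For illustration, a minimal sketch of such a conventional xUnit test is shown below, written with JUnit 5. The Fraction class and the concrete values are illustrative assumptions, not taken from the course materials; each row of the parameter source pairs manually prepared inputs with a manually computed expected result.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

class FractionTruncateTest {

    // Minimal stand-in for the class under test, included only to keep the sketch self-contained.
    static class Fraction {
        private final int numerator;
        private final int denominator;
        Fraction(int numerator, int denominator) {
            this.numerator = numerator;
            this.denominator = denominator;
        }
        int truncate() {
            return numerator / denominator;   // integer division truncates towards zero
        }
    }

    // Each row lists manually prepared inputs and the manually computed expected result.
    @ParameterizedTest
    @CsvSource({ "7, 2, 3", "-7, 2, -3", "5, 5, 1" })
    void truncateReturnsWholePart(int numerator, int denominator, int expected) {
        assertEquals(expected, new Fraction(numerator, denominator).truncate());
    }
}

The drawbacks of exactly this style of fixed, hand-computed expected values are summarised next.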
However, when the standard assertEquals methods are used, the following problems are encountered:
• Writing tests and finding the correct outputs for manually entered inputs is laborious; moreover, in some cases, obtaining the output requires combining the method performing the given operation with a getter.
• Tests use the same values repetitively, which can encourage students to cheat.
• Standard methods built into xUnit library test routines do not provide feedback in the form of recommendations.

To solve the mentioned problems, the grey-box testing approach will be used, combined with random input generation and orthogonal array testing governed by certain rules.

D. PROPOSED TESTING APPROACH
The proposal is based on two requirements: the necessity of using different variable values and informing the student about the reason and location of an error. Because the methods in the tested classes must be verified comprehensively, it is necessary to start from the principles of white- or grey-box testing. The elimination of the shortcomings resulting from the white-box approach is based on the following facts:
• Because code testing is a part of the learning process, it is desirable to have a correct (reference) solution created for each assignment. The reference (and correct) solution can be tested with the same set of tests as the student solution. To verify the correctness of the student's solution, the best approach is therefore to compare the results of the methods of the student's class with the results of the teacher's solution.
• Preventing cheating and making testing more interesting can be achieved by generating random values or selecting a random value from a predefined list. All tests should be monitored and tracked to capture inputs causing unexpected behaviour (following the requirements of identifying and logging problematic cases [100]). Identified problems can be solved by limiting the input conditions or reformulating the assignment for further use.
• Each test can inform students what it tests through its name. Notes, recommendations, or an explanation of the reasons for a wrong test result can enrich this essential information as an additional object or text.

The following inputs are necessary for a successful implementation of the validation:
• a detailed and unambiguous assignment
• a reference class (correct solution)
• test cases focused on:
• constructors
• complex output(s) of class attributes
• setters
• getters
• methods
• a predefined or randomly generated sequence of methods simulating the work of the class in a real environment
• code ensuring the comparison of the reference and student class results.

When creating test cases, it is necessary to incorporate boundary value analysis, equivalence partitioning, and orthogonal array testing. These techniques cannot be automated in the simplest form of a test tool, but the form of random input generation presented below supports them.
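Before the overall structure is described, a minimal, self-contained sketch of the central idea may help: the same randomly generated input is fed to the reference class and to the student class, and only their results are compared. The Counter assignment, both class bodies, and the deliberately introduced student mistake are illustrative assumptions, not taken from the course.

import java.util.Random;

public class ReferenceComparisonSketch {

    static class ReferenceCounter {            // plays the role of the teacher's (reference) solution
        private int value;
        void add(int amount) { if (amount > 0) value += amount; }   // assignment: ignore non-positive values
        int getValue() { return value; }
    }

    static class StudentCounter {              // plays the role of the student's solution, with a typical mistake
        private int value;
        void add(int amount) { value += amount; }                   // forgets to reject non-positive values
        int getValue() { return value; }
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int test = 1; test <= 5; test++) {
            int input = random.nextInt(21) - 10;                    // randomly generated parameter from -10..10
            ReferenceCounter reference = new ReferenceCounter();
            StudentCounter student = new StudentCounter();
            reference.add(input);
            student.add(input);
            if (reference.getValue() == student.getValue()) {
                System.out.println("Test " + test + " OK (input " + input + ")");
            } else {
                System.out.println("Test " + test + " FAILED for input " + input
                        + ": expected " + reference.getValue() + ", got " + student.getValue());
            }
        }
    }
}

No expected values are written by hand anywhere in this sketch; the reference instance produces them on the fly for whatever input was generated.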
The structure presented in Figure 7 was designed and verified during several cycles of educational courses to cover the entire process.
The inputs to the evaluation are the group of classes representing the student's solution and the group of classes representing the teacher's (reference) solution. The solution of an assignment can be a single class, a group of independent classes that communicate with each other, or classes connected by inheritance or another relationship. For simplicity, the student and reference solutions will be referred to as individual classes. The last part of the input is either a configuration file or a class that contains constants with detailed instructions for generating parameters in individual test cases.
Following the description of the testing process for JUnit presented in [101], the complete test run has two phases.
The first phase is the configuration and preparation of test values based on the test case description. Input values can be of different types, or collections that encode the information needed for their random generation during testing. Based on the list of test cases, the creation of tested instances is then prepared in the Main class, where the whole process is started. Finally, the provided test cases are verified in independent tests.
The second phase is test case execution. This phase consists of the following:
1) Start of the process - the Main class instantiates the Test Evaluator class. Depending on requirements, a separate instance can be created for each test (following the requirement of test independence). Alternatively, if subsequent test operations do not rely on the internal state of the Test Evaluator, a single instance can be used for a group of tests, potentially saving time in most scenarios.
2) Test case generation - the Test Evaluator instance receives the corresponding Test case description data. This description guides the test generator in creating specific values to test the student class. The test generator provides parameters in the form of a list or an object (Generated test case data), which are usually used as input to the constructor of the class under test. The generator can be universal, generating random numbers, arrays, strings, etc., or its default functionality can be extended (for example, by inheriting from the universal generator). In this way, it can be tailored for specific tasks or groups of tasks.
3) Instantiating tested classes - the Test Evaluator contains code that defines a distinct sequence of operations for each test. Typically, the initial operation involves instantiating both the student and reference classes. Depending on the test format, the implicit (parameterless) constructor or the constructor specified in the assignment is used. The input parameters for these constructors are prepared by the Test Generator.
4) Test processing - after creating the instances, the process proceeds by calling the method or methods specified in the specific test to verify the relevant behaviour of the tested class. The test might also collect outputs or return values from these methods.
5) Output comparison - each test is completed by comparing the outputs of the student and reference classes. The comparison can include results obtained as return values from methods and getters, or it can be based on comparing the state variables of the instances. The choice of a suitable control method depends on the decision of the test author and the possibilities that arise from the assignment.
6) Feedback generation - feedback generation is an integral part of the comparison. The outcome of this process is the text presented to the user for a successful or unsuccessful test. This text can be composed from input parameters such as the expected and obtained results, or derived from feedback included within the corresponding test case. Similar to generated text results, exceptions can be handled elegantly in Java, and some tests can be focused on verifying errors and exceptions triggered by the programmer (a minimal sketch is shown after this list).
7) Test execution report - encapsulates the generated information, presented in a format accessible to the student once all tests are concluded (Result list).
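The exception handling mentioned in step 6 can be illustrated with a minimal sketch that converts an exception thrown by the student's code into a feedback line instead of aborting the whole test run. The helper name, its signature, and the sample call are assumptions made for illustration only.

import java.util.function.Supplier;

final class SafeCall {

    // Runs the supplied call; on failure, appends a feedback line instead of aborting the test run.
    static String call(Supplier<String> operation, String label, StringBuilder feedback) {
        try {
            return operation.get();
        } catch (RuntimeException e) {
            feedback.append(label)
                    .append(" threw ")
                    .append(e.getClass().getSimpleName())
                    .append(": ")
                    .append(e.getMessage())
                    .append(System.lineSeparator());
            return null;
        }
    }

    public static void main(String[] args) {
        StringBuilder feedback = new StringBuilder();
        // A stand-in for a call to a student method that throws unexpectedly.
        String result = call(() -> { throw new IllegalStateException("negative salary"); },
                             "getSalary()", feedback);
        System.out.println("result: " + result);
        System.out.print(feedback);
    }
}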
In situations where unforeseen behaviour arises due to the generation of random inputs, details regarding the inputs and the encountered issue are logged and readied for assessment by testers.
The following is a description of the individual phases of the process.

1) TEST CONFIGURATION
The configuration class (test case description) stands at the entry point of the testing process. When creating it, it is necessary to consider the list of tested methods, the verification of behaviour at boundary values, and the other groups of cases defined by black-box and white-box techniques. In addition, depending on the type of task, it is necessary to prepare lists of test strings for random selection or to create such lists.
The test case description is defined in Figure 7 as a separate element, but it can be part of the Main class that runs the tests - it represents only a data layer. However, the number of generated tests and their parameters depend on the content of this data layer.
For a better understanding of the principles of creating these classes, an example of an assignment used in teaching is presented (Figure 8). At the same time, this example illustrates the importance of a precise task definition; an assignment of unsatisfactory quality would complicate the verification of the created class.
For the task presented in Figure 8, the definition of the test parameters is as follows:
• name - to obtain a name for a specific test, a list of names must be available; after the test is started, a random value (or a text string at a random position) is selected from the list
• workedYears - a parameter that should lie in the interval 0-30; the class must be able to adjust values outside this range to the required values (this is part of the specification)
• salary - similarly to the previous case, it is necessary to verify acceptable and unacceptable numerical values
• post - the job position is again selected from a list of pre-prepared text strings.

Tests are represented by a list of objects. Each test consists of the name of the test, the significance of the test (its share in the overall evaluation), and a list of parameters for individual test cases. The number of these parameters is neither uniform nor limited.
Figure 9 presents the definition of the test for verifying the functionality of the constructor with a significance of 5%.
The parameters for testing the full constructor are four values, two of which are text strings and two are integers.
A simple syntax was chosen to identify the data type of each parameter - the first character of the parameter determines the data type of the variable:
• e - enum, representing a list stored in a text string with comma-separated values; a random value is selected from this list during generation and enters the test as a string
• i - integer value; the following character specifies the range of the number - P (positive), N (negative), X (exact, followed by the exact value used in the test)
• d - decimal numbers, with the same rules
• I - interval, with additional parameters defining the data type and range of the interval
• a - array, with additional parameters defining the size of the array and the type and interval of values generated into it before testing.

As shown later, the range of parameter types and their behaviour is easily modifiable in the Test Generator class that generates the specific values.
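To make the syntax concrete, the following sketch shows one possible encoding of the constructor test as a plain data constant. The test name, weight, name and post lists, descriptors, and hint identifiers are hypothetical; they are not copied from Figure 9.

final class EmployeeTestCases {
    static final String TEST_NAME = "Constructor - full";
    static final int WEIGHT_PERCENT = 5;                 // significance of the test in the overall score

    // Each row: name list, post list, workedYears descriptor, salary descriptor, optional hint.
    static final String[][] CONSTRUCTOR_CASES = {
        { "eAnna,Boris,Cyril", "eclerk,manager", "iX10", "iX1200", ""     },  // typical values
        { "eDana,Emil",        "eclerk",         "iX0",  "iX0",    "inf1" },  // boundary values
        { "eFilip",            "emanager",       "iN",   "iN",     "inf2" },  // values outside the allowed range
    };
}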
Parameter preparation is essential for testing borderline, confusing, and common values. For each type, the creator of the task should define several tests and, in the case of specific and expected errors, can also provide a hint or explanation to the student. This note is listed in the last position of the test parameter lines as inf1, inf2.
The combination of parameters, the definition of intervals, and the selection of individual values from lists make it possible to apply equivalence partitioning and orthogonal array testing (not automatically, but quite simply).
An example of a method which can be used to generate integer values for test cases is presented in Figure 10.
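Since Figure 10 is not reproduced here, a minimal sketch of such a generator, following the descriptor syntax defined above, could look as follows. The bounds used for the random positive and negative values are assumptions, not values taken from the paper.

import java.util.Random;

final class IntParameterGenerator {
    private static final Random RANDOM = new Random();

    // Generates an int according to a descriptor such as "iP", "iN", or "iX30".
    static int generate(String descriptor) {
        if (descriptor.length() < 2 || descriptor.charAt(0) != 'i') {
            throw new IllegalArgumentException("Not an integer descriptor: " + descriptor);
        }
        char range = descriptor.charAt(1);
        switch (range) {
            case 'P': return RANDOM.nextInt(1000) + 1;                   // positive value
            case 'N': return -(RANDOM.nextInt(1000) + 1);                // negative value
            case 'X': return Integer.parseInt(descriptor.substring(2));  // exact value, e.g. "iX30"
            default:  throw new IllegalArgumentException("Unknown range code: " + range);
        }
    }

    public static void main(String[] args) {
        System.out.println(generate("iP"));    // e.g. 417
        System.out.println(generate("iN"));    // e.g. -258
        System.out.println(generate("iX30"));  // always 30
    }
}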

2) TEST EVALUATION
The defined tests are then gradually transferred to the TestEvaluator class, which ensures the testing of the student class and collects the success rates for individual tests.
At the same time, a list of test results is generated into the string variable output. Finally, the results are provided to the user or sent for further processing together with the assigned score or the success rate over the total number of tests. Each instance of the TestEvaluator class is created separately to maintain the independence of the tests (Figure 11).
The most demanding part of the process is creating a test class that implements the execution of individual tests. A large part of the functionality of this class is universal for all tests and tasks (obtaining or generating parameters for methods). However, each test must define specific steps that verify specific methods or sequences of methods.
The test class contains a sequence of steps for each test. The procedure is always as follows:
1) read and correctly cast the test parameters
2) generate the parameter values
3) create instances of the tested and reference (author's) class
4) perform the method or methods under test
5) obtain the output as data of the relevant type or as a string
6) compare and evaluate the outputs of both classes.

An example of a simple evaluation structure is shown in Figure 12.
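For completeness, a minimal, self-contained sketch of such an evaluation structure is shown below, in the spirit of Figures 11 and 12: it evaluates a single-parameter constructor of a hypothetical Employee-like class and accumulates a weighted score into a text report. The class bodies, weights, and fixed inputs are illustrative assumptions; in the real tool, the inputs would come from the Test Generator.

public class EvaluationSketch {

    static class ReferenceEmployee {                       // stand-in for the teacher's class
        private final int workedYears;
        ReferenceEmployee(int workedYears) { this.workedYears = Math.max(0, Math.min(30, workedYears)); }
        int getWorkedYears() { return workedYears; }
    }

    static class StudentEmployee {                         // stand-in for the student's class
        private final int workedYears;
        StudentEmployee(int workedYears) { this.workedYears = Math.max(0, workedYears); } // missing upper bound
        int getWorkedYears() { return workedYears; }
    }

    public static void main(String[] args) {
        int[][] cases = { { 10, 10 }, { -5, 10 }, { 40, 10 } };  // each row: { input, weight in percent }
        int score = 0;
        StringBuilder output = new StringBuilder();

        for (int[] testCase : cases) {
            int input = testCase[0];
            int weight = testCase[1];
            int expected = new ReferenceEmployee(input).getWorkedYears();   // reference instance
            int actual = new StudentEmployee(input).getWorkedYears();       // student instance
            if (expected == actual) {                                        // output comparison
                score += weight;
                output.append("OK   constructor, workedYears=").append(input).append('\n');
            } else {
                output.append("FAIL constructor, workedYears=").append(input)
                      .append(" expected ").append(expected)
                      .append(" got ").append(actual).append('\n');
            }
        }
        output.append("Score: ").append(score).append(" %");
        System.out.println(output);                                          // test execution report
    }
}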
The files with the code created in Java are sent to the Virtual Programming Lab server (or another execution server).There, they are run, and the obtained results are subsequently transformed in the educational environment into a user-acceptable form that informs the student about valid and problematic parts of the code/class (Figure 13).
The presented approach was also applied to tasks testing inheritance and a feature specific to the Java language - static variables. Although this article emphasises Java, it is only one of the languages to which the approach can be applied. Thanks to the independence of the presented solution from xUnit frameworks, the identical procedure can also be used with other programming languages supporting the principles of object-oriented programming.
The philosophy of generating inputs in text form and then producing the result again in text form enables language-independent preparation of tests and independent evaluation of outputs. Moreover, in connection with the Virtual Programming Lab server environment, it is possible to run code in dozens of different programming languages. Every language has specific features, and in the case of Java, limits can be reached quite early. One reason is the requirement that all methods referenced by the testing code must exist. If the tested method does not exist in the tested class or has an inappropriate parameter list (types, number), testing is interrupted already during compilation, and the user receives only compiler messages instead of customised information.
However, this problem is not so pronounced in practice because students at this stage of the course already have some knowledge and experience with compiler messages and can deal with it.

E. EVALUATION IN THE EDUCATIONAL PROCESS
In the academic year 2022/23, a survey was conducted to assess the educational process and content of the Object-oriented programming in Java course. 58 students enrolled in the second semester of the Applied Informatics study program took part in the research study. The sample consisted of 52 male and 6 female students. During the first semester, these students completed an introductory programming course in the Python programming language. In the second semester, they started learning Java programming. As part of the Java language course, they completed a 4-week structured programming module, followed by a 5-week intensive study of object-oriented programming. All students completed a Java fundamentals course, where programs were evaluated using I/O black-box methods.
After completing the object-oriented programming module, during which grey-box methods were used for code evaluation, the students' perception of this approach was evaluated using a questionnaire with a 7-point Likert scale (0-6). Before the evaluation commenced, informed consent was obtained from all participants. The statistical evaluation of selected questions is presented in Table 1.
The survey results revealed the following findings: • Almost 95% of participating students agreed that the automated program evaluation method used in the courses was beneficial and suitable for their needs (Q1).
• Approximately 90% of students, 52 individuals, responded positively to the shift towards an assessment approach that directly identifies method errors.These students agreed with the statement that they are comfortable with an evaluation method based on checking the outputs of the methods (Q2).
• The new method of evaluation, based on the creation of a detailed evaluation of individual methods, was comprehensible to 74% of students at first glance (Q3).
• A significant majority, representing 79% of the participants, agreed the new output format was generally suitable for verifying the correctness of classes (Q4).
• However, only 55% of students report that with this form of feedback, they were able to identify errors more quickly and accurately compared to their experience in previous courses (Q5).
• Interestingly, 29% of students reported not noticing any change in the feedback format.

To illustrate the assessment and the situation, representative statements of students are given. The positive feedback primarily centres on the ability to identify errors more rapidly:
• ''It is clearly visible in which method there is an error, so there is no need to search the program as long as the other parts are correct.''
• ''Easier debugging and troubleshooting, better clarity.''
• ''It's easier to find out in which specific method I have an error; in the case of random values, I have more confidence that I wrote the code correctly and not just hit the output that the system wanted in a particular case.''

Conversely, some students expressed negative feelings rooted in their familiarity with the I/O-based assessment approach:
• ''I miss being able to get step-by-step feedback. Now, I must finish all constructors and methods before getting feedback.''
• ''I found it challenging to identify bugs in the code because I had to write all the necessary methods to get the code to run.''

The results of the survey are in line with the results of the previous research [102] conducted in the academic year 2019/2020. In that research, 69% of students found that automated assessments helped them understand new content, and 77% believed that automated assessments were effective for practising the content and satisfying for them.

V. DISCUSSION
Based on the principles of the individual types of testing (black-, white-, and grey-box), an integrated solution ensuring source code verification was designed, verified, and presented. In the object-oriented programming module, which is the focus of this study, the solution is a tool for evaluating a class or a group of classes that meet the assignment requirements with their attributes and methods.
The basic concept and the difference from industrial testing lie in the assumption that an object-oriented programming assignment always comes with an author's solution. By default, the author's solution exists to present a model solution to the student or is created by the author to verify the correctness and clarity of the assignment. The fundamental difference between the presented approach and standard automated testing is that the student's class is tested against the results of a reference class and not against specific predefined values.
The reference solution must meet the requirements of the assignment and pass all validation tests.As a result, its outputs (or the results of its methods) can be used as references for testing student classes.This approach eliminates the need to pre-define test parameter values; instead, these values can be generated continuously and randomly for each test.
Based on the existence of two classes that should present the same interface, comparing their behaviour (including outputs) is a matter of basic programming language mechanisms. Evaluating classes through the behaviour of their instances is straightforward, and no specialized mechanism is needed in Java.
Comparing class behaviour reflects the results of individual methods. Access to the state of created instances, represented by their attributes, is possible at any time. The evaluation procedure is based on comparing the results of methods, or sequences of methods, between the tested and the reference class. An alternative or additional way is to compare attributes obtained from getters or to compare a summarised output from some method (appropriately defined already in the assignment).
To create input parameters, it is possible to use random generation of numerical or textual variables or their lists represented by arrays, objects, or other data structures.
The effectiveness of the presented solution lies in the coverage of several elements, each of which contributes to efficiency in a specific area:
• Reducing the content creator's effort for test preparation is a key benefit of the whole concept. Investing effort in creating a reference solution is necessary, but the need to laboriously identify suitable inputs and determine their correct corresponding outputs is eliminated.
• Minimizing dependencies on other libraries or software provides an opportunity for a durable solution that does not require frequent updates due to vulnerabilities or new library versions. Since the evaluation code does not use any specialized libraries, it loses in versatility but gains in longevity.
• Minimizing requirements for transmission capacity is achieved because data transmission from the data server to the execution server is reduced to the transmission of formulas according to which test cases are created.
There is also no need to transfer support libraries (xUnit).
• Cheating prevention was one of the basic requirements when designing the concept. It is based on the random generation of inputs and corresponding outputs. Although a situation may arise where an incorrect student solution is accepted by the system, it is less likely and certainly less consequential than with several predefined static tests: if such a static test were incorrect, it would affect the assessment of all students.
• Student-friendly communication involves enriching the information about expected and obtained outputs typical of black-box testing. Although this information may be part of the evaluation report, it must be supplemented with additional details. The first enrichment is the name of the test, which can correspond to the name of the tested method or a short description (e.g., constructor - boundary values). The optional part of the design is supplementary feedback: if the authors anticipate or have experience with the most common mistakes students make in a specific test case, they can provide precisely targeted feedback.
The most important and labour-intensive task in implementing the model is undoubtedly creating a test case. Randomness is a critical factor in this process, primarily to prevent cheating and to speed up the generation of test cases. Strategies such as using templates for different data types or designing a simple language to specify data structure types and ranges can be adopted to increase the efficiency of creating input parameters (different approaches have been presented above for different data types, such as numbers, texts, and arrays).
In addition, it is possible to generate invalid or completely random inputs using a fuzz testing technique.
It should be noted that although the presented procedure provides an effective tool for creating assignments and testing the created programs, the application of randomness when creating inputs requires a careful approach from the creator. A typical example is finding a random string in a list of random strings. The test case author must prepare the generation of input values so that the specified string is actually present, at least in some cases. Otherwise, the student only needs a simple program printing the hard-coded output ''false'' or ''not found''.
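One simple way to satisfy this requirement is to force the searched value into the generated list with a fixed probability, so that both positive and negative cases occur. The following sketch illustrates the idea; the method name, list size, and probability are assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class SearchTaskInputGenerator {
    private static final Random RANDOM = new Random();

    // Builds a list of candidate strings and, in roughly half of the generated cases,
    // forces the searched value into the list so that a hard-coded "not found" answer cannot pass.
    static List<String> buildInput(String target, List<String> pool) {
        List<String> input = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            input.add(pool.get(RANDOM.nextInt(pool.size())));
        }
        if (RANDOM.nextBoolean()) {
            input.set(RANDOM.nextInt(input.size()), target);   // guarantee a positive case
        }
        return input;
    }

    public static void main(String[] args) {
        List<String> pool = Arrays.asList("alpha", "beta", "gamma", "delta");
        System.out.println(buildInput("omega", pool));
    }
}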
Although the described solution was Java-oriented, the same approach can be used in Python, C, and PHP, or more generally, in any language that supports object-oriented programming.
For each test case and tested method, the student obtains information on whether the given method, constructor, or output corresponds to the expected values, and for which values the test was successful or unsuccessful. In line with white- and grey-box testing, it is desirable to apply the corresponding techniques to the greatest extent possible to achieve maximum code coverage and identify all problematic types of inputs.
Although solutions applying the presented methods are available, their disadvantage is their robustness and complexity. The proposed solution is simple and does not require the implementation of additional libraries on the client or server side. In addition, it is independent of framework updates and is relatively quickly adopted by creators of assignments. Another advantage is the small amount of data transferred between the system intended for writing code and the execution server - only a few kilobytes of files with the code responsible for running the tests.
The presented concept has its disadvantages, which can be summarized as follows:
• Despite the simplicity of the concept, it requires advanced knowledge of the relevant programming language from the creator of the test.
• Each task or class validation requires the creation of a unique code. Although many checks are routine and often similar, no tool is currently available to create templates and speed up the creation of assessment classes. The creation of evaluation classes is, therefore, time-consuming.
• Running and evaluating code that does not contain the tested methods is impossible in Java. In other words, evaluating a partial solution is impossible because every method must at least have a defined header. This limitation is due to pre-run code checks and general language safety rules. In contrast, in some other interpreted (and even some compiled) languages, the existence of a method is checked only when the code is executed. This approach allows the evaluation of the program even in the absence of a corresponding method.

The presented solution was applied during four years of a university course in Java programming. One year took place in LMS Moodle using the Virtual Programming Lab module and three years in the educational system PRISCILLA. The students accepted the solution because it provided more detailed feedback than the introductory courses using the I/O approach. Moreover, although this approach placed high demands on content creators in the initial tasks, they came to appreciate the speed of task creation after becoming familiar with the principles.

VI. CONCLUSION
Teaching programming is currently among the topics of interest to researchers because, despite advances in the field, the labour market is still experiencing a shortage of programmers. The preparation of tasks in educational environments can take several forms, and both the goal and the students' expected current level of knowledge and skills must be considered.
In the case of introductory exercises and the creation of simple examples to familiarise students with the elements of a programming language, it is possible to consider various supporting algorithms and static analyses of the code before sending it for I/O analysis. After moving on to the principles of object-oriented programming, it is possible and probably appropriate to shift code correctness verification to the area of unit testing.
By default, automated tests cannot be used universally with the same template for all assignment types. This is not even desirable when preparing tests used in teaching programming. The teacher or test creator is expected to adapt each test to the task. Although this approach is laborious, the specific preparation of each task means that an explanation of errors typical for the given task can be integrated into the test. In more demanding cases, a solution guide can also be integrated.
The advantage of the presented solution is that it allows testing a class as a whole as well as the interconnection of several classes, verification of inheritance and/or polymorphism relationships, and even a Java-specific feature - static members.

Future work involves improving the presented solution in various ways. One promising approach is to modularize the system into separate layers: an input description layer (containing rules for generating inputs), a process description layer (controlling the sequence of method execution), and a code evaluation layer (comparing method results).
This division opens possibilities for designing templates following orthogonal array testing.These templates can streamline and automate the input generation process for individual or group parameters.
A natural next step is to extend the solution from Java to other programming languages, such as Python and C/C++. The challenge lies in fine-tuning the concept to enable practical and student-friendly verification of classes and their methods while removing non-standard restrictions when working on assignments in the student's part of the environment.
In addition, there is an opportunity to optimize human resources for content preparation.In testing-oriented subjects, it is possible to consider the inclusion of a thematic unit, where students with advanced skills would develop an evaluation code for specific tasks.This code could later be used in object-oriented programming subjects as the testing classes.

FIGURE 1. Example of data for I/O testing (a truncation of a fraction).
FIGURE 2. A typical student fraud attempt for the task ''Write characters in odd positions.''
FIGURE 7. Simplified scheme of test generation and collection of test results.
FIGURE 8. An example of a simple task.
FIGURE 9. Description of test cases with parameters for verifying the functionality of the constructor.
FIGURE 10. Illustration of a method for generating integer values defined by test case descriptions.
FIGURE 11. Test processing: in the cycle, the individual lines of the test description object presented in Figure 9 are sent for testing.
FIGURE 12. Procedure for generating test values and evaluating the correctness of test results for a group of cases defined for one type of test. The constructor with four parameters is evaluated.
FIGURE 13. User view of test results.