A Sample-Efficient OPF Learning Method Based on Annealing Knowledge Distillation

To respond quickly to variations in network load demand, data-driven techniques for predicting optimal power flow (OPF) have emerged in recent years. However, most existing methods depend heavily on large data volumes, which limits their application to newly established or expanded systems. This work therefore proposes a sample-efficient OPF learning method that maximizes the utilization of limited samples. Decomposing the OPF task before knowledge distillation reduces the deep learning complexity; knowledge distillation then reintegrates the decoupled tasks and improves accuracy in low-data setups. Unsupervised pre-training is introduced to reduce the demand for labeled data. Additionally, a focal loss function and a teacher annealing strategy are adopted to achieve higher accuracy without extra samples. Numerical tests on different systems confirm higher accuracy and faster training than competing methods, especially when samples are scarce.

and fast decoupled load flow [2]. Although many advances have been made to simplify the model, computational efficiency remains a bottleneck.

In recent years, deep-learning-based methods have brought significant efficiency improvements to OPF [3], [4]. They use a large amount of historical data to approximate the variable relationships and achieve a real-time response. Compared with traditional solvers, the deep learning approach offers a computation speed improvement of up to 200 times for DC-OPF and 35 times for alternating current OPF (AC-OPF) [5], [6]. In addition, deep learning provides a feasible way to solve OPF online over many state combinations. To address the online efficiency problem of OPF learning, several approaches have been studied based on active constraints [7], [8], warm-start point prediction [9], [10], and so on. However, the high data requirements of these data-intensive methods limit their applications [11].

To reduce the data requirements for training, hybrid model-driven approaches have been adopted to simplify the iterations; such approaches are no longer simple end-to-end mappings. As the sample size is reduced, the model training speed can also be increased. The focal loss function and teacher annealing strategy are designed and used to improve the training performance without extra samples, and pre-training together with teacher annealing achieves higher accuracy in the small-data regime.

The remainder of the paper is organized as follows. Section II discusses the challenges of OPF learning and outlines the proposed solution. Section III details the training process. Section IV describes the overall procedure. Section V presents numerical results. Finally, Section VI summarizes the paper.

Conventional learning approaches take no account of the mapping relationship or the data distribution, so they rely heavily on data volume, which limits their application. In fact, the variable relationships can be simplified by decoupling the target outputs into separate networks. Our solution is therefore based on OPF task decomposition and organized in a knowledge distillation framework.

The OPF determines the most economical generation dispatch while satisfying the load demand and other security constraints. Applying a DC approximation to the traditional AC-OPF yields the following optimization formulation.
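For reference, a standard DC-OPF statement consistent with the symbol definitions below is given here; the generation and flow limits are standard additions not defined in the recovered text, and branch k is taken to connect buses i and j.

```latex
\begin{aligned}
\min_{P_G,\,V_\theta}\quad & \sum_{i \in G}\bigl(c_{1i}\,P_{G,i} + c_{2i}\,P_{G,i}^{2}\bigr) \\
\text{s.t.}\quad & P_{G,i} - P_{D,i} = \sum_{j \in N} B_{ij}\,\bigl(V_{\theta,i} - V_{\theta,j}\bigr), \quad \forall i \in N \\
& P_{F,k} = B_{ij}\,\bigl(V_{\theta,i} - V_{\theta,j}\bigr), \quad \forall k \in br \\
& P_{G,i}^{\min} \le P_{G,i} \le P_{G,i}^{\max}, \quad \forall i \in G \\
& \lvert P_{F,k}\rvert \le P_{F,k}^{\max}, \quad \forall k \in br
\end{aligned}
```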
where P_{G,i} is the power output of the ith generating unit, c_{1i} and c_{2i} are its generation cost coefficients, P_{D,i} is the power demand of the ith bus, P_{F,k} is the transmission power of the kth branch, and V_{θ,i} is the voltage phase angle of the ith bus. G, N, and br are the sets of generating units, buses, and branches, respectively. B_{ij} is the susceptance of the admittance between the ith and jth buses.

The OPF model contains information about the branch parameters and the network topology. This complex model requires iterative computation to reach the optimal solution, so it takes a long time to optimize the power flow for a large number of operating states.

Researchers are currently interested in model-free methods based on deep learning, which automatically seek a function that fits the abstract relationship between power demand and power dispatch. In [4], [5], [6], [7], [8], [9], [10], and [11], load variables are widely used as input features, while power generations and phase angles are taken as the outputs.

The SDAE network, with fewer hyperparameters, is compatible with the proposed method, which combines unsupervised pre-training and fine-tuning. There are three differences between our network and the commonly used SDAE network, as shown in Fig. 3.

First, regarding the hidden layer setting, a traditional SDAE network has its smallest hidden layer in the middle, i.e., the bottleneck layer. In contrast, the middle layer is the widest in our network.
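As a rough illustration of this first difference, the sketch below builds a fully connected network whose middle layer is the widest; the layer widths are placeholders, and the remaining two differences are not reflected here.

```python
import torch.nn as nn

def make_encoder(n_loads: int, n_out: int) -> nn.Sequential:
    # Widths are illustrative; only the "widest middle layer" property
    # is taken from the text (contrast with a bottleneck-style SDAE).
    widths = [n_loads, 64, 128, 64, n_out]
    layers = []
    for w_in, w_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(w_in, w_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # no activation on the output layer
```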

The variable relationships can be categorized into two types, and the OPF task is thus decoupled as

f_T = { f_{θ,t} : P_D → V_θ,  f_{G,t} : P_D → P_G }

where f_T denotes the task of training teacher models, and f_{θ,t} and f_{G,t} are the teacher models that predict V_θ and P_G, respectively.

The teacher model f_{θ,t}(P_D → V_θ) learns the mapping from the load demand to the voltage phase angles.

The pre-training of the proposed model involves only unlabeled load data in a task-agnostic way; such data are readily available in the power system. Pre-training fits most of the parameters using these easily accessible unlabeled samples, which eases the computational burden of the subsequent supervised training.

The unlabeled state data are used in a self-supervised manner based on feature reconstruction, meaning that the original feature can be recovered to its initial form after an encoding-decoding process. Pre-training aims to minimize the distance between the original features and their corresponding transformations: the more faithful the reconstruction, the more valuable the features kept by the encoder.
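A standard denoising-autoencoder objective matching this description, with f and g denoting the encoder and decoder and M the number of unlabeled samples, is

```latex
L_{\mathrm{pre}} = \frac{1}{M}\sum_{m=1}^{M}
\bigl\lVert P_{D}^{(m)} - g\bigl(f\bigl(P_{D,\mathrm{noise}}^{(m)}\bigr)\bigr)\bigr\rVert_{2}^{2}
```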

In a traditional SDAE network, the input is usually corrupted by randomly zeroing entries to give the network noise robustness. However, this random-zero strategy is unsuitable for our input vector because it may erase key features. To deal with this, a random Gaussian noise strategy is proposed to avoid feature loss.
where P_{D,noise} is the input vector P_D with random noise added, η is the noise ratio, which obeys a Gaussian distribution and lies between +5% and −5%, and r is a random vector. The pre-training is agnostic to phase angles and power generation; it only extracts generic features from the state data.
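A minimal sketch of this corruption step, assuming the noise acts multiplicatively on each load entry (the exact combination of η and r is not spelled out above):

```python
import numpy as np

def add_gaussian_noise(p_d: np.ndarray, sigma: float = 0.02,
                       clip: float = 0.05) -> np.ndarray:
    # eta is Gaussian and clipped to +/-5%, as described in the text;
    # the multiplicative form is an assumption.
    eta = np.clip(np.random.normal(0.0, sigma, size=p_d.shape), -clip, clip)
    return p_d * (1.0 + eta)
```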
In (7) and (8), L_{θt} and L_{Gt} are the loss functions of the models f_{θ,t}(P_D → V_θ) and f_{G,t}(P_D → P_G), respectively, and V_{θ,i} and V_{θt,i} are the ith elements of the actual value and the teacher prediction.

In the focal loss, a_i and p_i are the focal weights: a_i is determined by the target value, while p_i is related to the prediction value. a_i is computed from the label distribution, where Y_i is the ith element in the label vector (an angle or a generation value) and P[Y_i] is the proportion of the corresponding category of Y_i. The category is obtained by dividing the entire range of values into 20 intervals, and the proportion is determined by the number of labels whose values fall into the same interval.

From the perspective of deep learning, outputs that are often equal to zero or the maximum indicate features that are more distinctive and easier to learn. For these categories, the corresponding weights p are assigned lower values. The weight p is obtained from the predicted values after a power operation with exponent r, where r is set to 1, Y_{t,i} is the ith value of the teacher model prediction, and the subscript t denotes the teacher model. States whose labels are close to 0 or 1 are easy to learn, so their weight should be smaller.
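A hypothetical sketch of both weights, assuming a_i is the inverse frequency of each label's 20-interval bin and p_i = (1 − |2·Y_t − 1|)^r with labels normalized to [0, 1]; neither functional form is stated explicitly above:

```python
import numpy as np

def focal_weights(labels: np.ndarray, teacher_pred: np.ndarray,
                  n_bins: int = 20, r: float = 1.0):
    # a_i: rare label categories (by 20-interval binning) get larger weights.
    bins = np.linspace(labels.min(), labels.max(), n_bins + 1)
    idx = np.clip(np.digitize(labels, bins) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins).astype(float)
    a = labels.size / counts[idx]          # inverse of P[Y_i] (assumed form)
    # p_i: small near 0 or 1 ("easy" states), largest near the middle.
    p = (1.0 - np.abs(2.0 * teacher_pred - 1.0)) ** r
    return a, p
```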

A gradient descent algorithm is well suited to minimizing the loss function in deep learning models [24]. The gradient descent process can be expressed as

θ_{k+1} = θ_k − α ∇_θ L(θ_k)

where θ_k denotes the network parameters at step k and α is the learning rate. Since the pre-training stage is task-agnostic and unrelated to the downstream task, the result of the pre-training stage can be reused to initialize the student model, so that only fine-tuning is required.
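A brief sketch of this reuse, with the encoder structure and widths purely illustrative:

```python
import copy
import torch.nn as nn

# 'encoder' stands for the SDAE encoder from the task-agnostic
# pre-training stage (widths are placeholders).
encoder = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 128))
# ... unsupervised pre-training of `encoder` happens here ...

student_encoder = copy.deepcopy(encoder)   # reuse pre-trained parameters
student_head = nn.Linear(128, 24)          # new task-specific output layer
student = nn.Sequential(student_encoder, student_head)
```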

The knowledge-distillation-based fine-tuning stage aims to approximate the results of the existing single-task models, which is achieved by minimizing the gap between the predictions of the teacher models and the student model.
where L_{s,t} is the loss function that evaluates the difference between the teacher prediction and the student prediction: L_{θs,θt} is the loss function between V_{θt} and V_{θs}, and L_{Gs,Gt} is the loss function between P_{Gt} and P_{Gs}, where V_{θs} and P_{Gs} are the predictions of the student model. The usage is specified as follows: L is a comprehensive loss function that combines L_{θs,θt}, L_{θs}, L_{Gs,Gt}, and L_{Gs}, and λ_θ and λ_G are the weights of the teacher models, determined by comparison. L_{θt} and L_{Gt} are obtained from (7) and (8), while L_{θs} and L_{Gs} are the loss values of the student model. In (17), the student learns from a teacher prediction only when the teacher's loss is lower than the student's; otherwise, the student is trained on the actual labels.

To avoid being limited by the accuracy of the teacher models, the teacher annealing approach is adopted in our training. A dynamic annealing weight is introduced into the two-objective loss function, which can be expressed as

L = (1 − λ) L_{s,t} + λ L_s,   λ = e / e_max

where e and e_max are the fine-tuning epoch index and the maximum epoch number, respectively, and λ is the dynamic annealing weight, which increases linearly over the iterations.

Equations (19) and (20) indicate that the knowledge distillation process is divided into two stages. In the early stage, the student model learns from the teacher models f_{θ,t}(P_D → V_θ) and f_{G,t}(P_D → P_G). As the fine-tuning epochs increase, the student model gradually transitions to supervised learning on the target labels.
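A compact sketch of this annealed loss (MSE assumed; the per-element focal weighting described earlier is omitted for brevity):

```python
import torch.nn.functional as F

def annealed_loss(student_out, teacher_out, labels, epoch, e_max):
    # lam = e / e_max grows linearly, shifting weight from imitating
    # the teacher to fitting the actual labels.
    lam = epoch / e_max
    l_teacher = F.mse_loss(student_out, teacher_out.detach())
    l_label = F.mse_loss(student_out, labels)
    return (1.0 - lam) * l_teacher + lam * l_label
```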

The proposed approach provides a sample-efficient OPF-solving framework to determine the optimal generation dispatch. The overall process is shown in Fig. 5.

Step 1: Input the historical data or simulation data of the power flow under different system states.

Step 2: Select the unlabeled data for SDAE pre-training and determine the encoder layer parameters.

Step 3: The labeled samples are classified into angle labels and generation labels.

Step 4: For angle labels, a new network is constructed based on the encoder.

Step 5: Fine-tune the network from Step 4 to obtain a teacher network.

Step 6: Calculate the teacher predictions and the focal loss function.

(With the generation labels, the teacher model f_{G,t}(P_D → P_G) is trained in parallel, following steps 4-6.)

Step 7: Construct a new network as the student model based on the pre-trained encoder.

Step 8: Set the maximum epoch and initialize the current epoch counter.

Step 9: Calculate the differences between the student and teacher predictions as in (13)-(14), as well as the loss function for each variable according to (16).

Step 11: Calculate the weighted sum of the loss functions and use it for parameter updating.

Step 12: Repeat steps 10-12 until the epoch counter reaches its limit, at which point the OPF training is finished.

Further test systems [29], [30] are involved to test the scalability of the proposed method. The hardware and software used in the case study include an Intel i5-10600KF CPU, 16 GB RAM, Windows 10, and Python 3.8. The Gurobi toolkit is also used for benchmark calculations when evaluating accuracy.

The accuracy indices are defined as follows: V̂_θ and V_θ are the predicted and actual phase angles, and P̂_G and P_G are the predicted and actual power generation. The judgment thresholds are set to 0.

The various methods in Table 1 are applied to the RTS-79 system, and their computational performance is displayed in Table 2. M2, M5, and M6 share the same network structure, and Fig. 6 compares the relative errors of their predicted node phase angles.
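For concreteness, one plausible form of such an accuracy index is the share of predictions whose relative error stays within a judgment threshold; the exact formula is not reproduced here, so the helper below is hypothetical.

```python
import numpy as np

def accuracy_index(pred: np.ndarray, actual: np.ndarray,
                   tol: float = 0.01) -> float:
    # Share of predictions with relative error within `tol`
    # (threshold value is illustrative).
    rel_err = np.abs(pred - actual) / np.maximum(np.abs(actual), 1e-8)
    return float(np.mean(rel_err <= tol))
```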

M0 is the method that invokes the Matpower toolkit for solving, and its outcomes are used as the benchmark. As shown in Table 2, the solution time of the traditional optimization algorithm is 248.2683 s.

Comparing M1 and M2 shows that the SDAE network is more effective for predicting OPF. This is because the computational effectiveness of random forests depends greatly on manual feature selection. Moreover, the training effectiveness of random forests is limited by the size of the output: the required tree size and depth in M1 grow with the output scale. The resulting demand for a very large number of parameters is difficult to satisfy and is ultimately detrimental to accuracy.

The results of M2 show that task decomposition lets the network concentrate on one particular problem. Decomposition prevents interactions between unrelated features from occupying parameter resources, so the parameters can work together toward an accurate output.

Fig. 7 compares the generation results obtained by M3 and M4. The focal loss function is effective in improving the prediction of variables with unbalanced distributions. It changes the weighting of the data differences, allowing the model to attend to minority samples without adding any data. In the RTS-79 system, generating units 23, 24, and 25-30 are always prioritized in the dispatch due to their low cost; the other units operate only in the few states with high load levels. The generation labels are therefore unevenly distributed, but the focal loss function enhances the network's attention to the minority data, improving the overall effectiveness of the method.

The results of M5 and M6 in Table 2 show that knowledge distillation can integrate multiple high-precision single-task models while maintaining the same level of accuracy, and Fig. 6 shows that the teacher annealing strategy achieves higher accuracy in knowledge distillation.

As shown in Table 3, by reusing the pre-trained model, the proposed training method with the knowledge distillation strategy can achieve high-accuracy results within one minute. Generic features are extracted by the SDAE network in the pre-training phase, reducing the learning burden in the supervised phase. The proposed method maintains a higher accuracy level than the other methods, demonstrating its feasibility for small-sample cases. For example, M2 requires 15,000 samples to roughly attain the accuracy that the proposed method (M6) achieves with 200 samples.

The proposed method is also applied to systems of different scales, and the results are presented in Table 4. The Brazilian system has 242 nodes and 53 generators; specific settings such as the line capacities can be found in [30]. For each test system, the training data contains 200 samples.

The accuracy of the proposed method decreases slightly as the system size increases but remains at an acceptable level. The proposed method is therefore suitable for power systems of different scales.

This paper proposes a sample-efficient method based on DC-OPF task decomposition and knowledge distillation to enable training with limited samples. Numerical results show that the proposed DC-OPF task decomposition improves generalization under limited samples, and that the annealing operation in the knowledge distillation enhances the final accuracy by 10% for phase angles and 8% for power generation. Moreover, the accuracy improvement of the proposed method exceeds 12% and is more significant in lower-data setups. Compared with a plain deep learning method, the proposed method reduces the required sample size by 98.6% while improving phase-angle accuracy by 12% and generation accuracy by 2%. In the future, the method will be studied for AC-OPF with consideration of reactive power and voltage magnitude.