This paper presents a fast parallel method for solving the radiative transport equation in inhomogeneous participating media. We apply a novel approximation scheme to find a good initial guess for both the direct and scattered components. This initial approximation is then used to bootstrap an iterative multiple scattering solver, so that the iteration concentrates only on the residual problem. Bootstrapping in this way makes the volumetric source approximation more uniform, which helps to reduce discretization artifacts and improves the efficiency of the parallel implementation. The iterative refinement is executed on a face-centered cubic grid. The implementation is based on CUDA and runs on the GPU. For large volumes that do not fit into GPU memory, we also consider an implementation on a GPU cluster, where the volume is decomposed into blocks according to the available GPU nodes. We show how the communication bottleneck can be avoided in the cluster implementation by not exchanging the boundary conditions in every iteration step. In addition to light photons, we also discuss the generalization of the method to γ-photons, which are relevant in medical simulation.
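The bootstrapping idea can be illustrated with a minimal sketch. It assumes the discretized problem has the fixed-point form x = b + Sx, where b is the source and S is a scattering operator. The 1D operator `apply_scatter`, the `albedo` parameter, and all function names are hypothetical stand-ins chosen for illustration; they are not the paper's face-centered cubic grid, its approximation scheme, or its CUDA implementation. Writing x = g + d for an initial guess g gives d = (b + Sg - g) + Sd, so the same iteration can be run on the residual source alone.

```python
def apply_scatter(x, albedo=0.8):
    """Toy 1D scattering operator (an assumption for illustration):
    each cell receives albedo/2 of the radiance of each neighbor."""
    n = len(x)
    out = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        out[i] = albedo * 0.5 * (left + right)
    return out


def iterate(source, steps, albedo=0.8):
    """Plain fixed-point iteration x <- source + S x, started from zero."""
    x = [0.0] * len(source)
    for _ in range(steps):
        scattered = apply_scatter(x, albedo)
        x = [s + v for s, v in zip(source, scattered)]
    return x


def bootstrapped_solve(source, guess, steps, albedo=0.8):
    """Iterate only on the residual left by an initial guess.

    The residual source is b + S g - g; the correction d solves
    d = residual + S d, and the final estimate is g + d.
    """
    scattered_guess = apply_scatter(guess, albedo)
    residual = [b + sg - g
                for b, sg, g in zip(source, scattered_guess, guess)]
    correction = iterate(residual, steps, albedo)
    return [g + d for g, d in zip(guess, correction)]


if __name__ == "__main__":
    source = [0.0] * 16
    source[8] = 1.0  # single emitting cell
    # Crude guess (an assumption): treat every cell as an infinite uniform medium.
    guess = [b / (1.0 - 0.8) for b in source]
    print(bootstrapped_solve(source, guess, steps=10))
```

In this toy setting, a guess that is close to the converged solution yields a small residual source, so the same number of iteration steps leaves a correspondingly smaller error than starting from zero.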