Reinforcement learning algorithms are central to the cognition and decision-making of embodied intelligent agents. Bilevel optimization (BO) modeling, together with a host of efficient BO algorithms, has proven to be an effective means of addressing actor-critic (AC) policy optimization problems. In this work, an implicit zeroth-order stochastic algorithm is developed on the basis of a bilevel-structured AC problem model. A locally randomized spherical smoothing technique is introduced that applies to nonsmooth, nonconvex implicit AC formulations and avoids the need for a closed-form lower-level mapping. In the proposed zeroth-order scheme, the gradient of the implicit function is approximated through inexact, practically available lower-level value estimates. Under suitable assumptions, the algorithmic framework designed for the bilevel AC method enjoys convergence guarantees with a fixed stepsize and a fixed smoothing parameter. Moreover, the proposed algorithm attains an overall iteration complexity of ■. The convergence performance of the proposed algorithm is verified through numerical simulations.
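To make the smoothing idea concrete, the sketch below illustrates a generic two-point zeroth-order gradient estimator based on randomized spherical smoothing, driven only by (possibly inexact) function values. This is a minimal illustration of the general technique, not the paper's specific bilevel AC algorithm; the oracle `f` here stands in for the inexact lower-level value estimate, and all names and parameter values are illustrative assumptions.

```python
import numpy as np

def spherical_zo_gradient(f, x, eta, rng):
    """Two-point zeroth-order estimate of the gradient of the spherically
    smoothed surrogate f_eta(x) = E_{u ~ S^{n-1}}[f(x + eta * u)].

    `f` is a (possibly inexact) zeroth-order value oracle; no closed-form
    expression for f or its gradient is required.
    """
    n = x.size
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)  # uniformly random direction on the unit sphere
    # Finite-difference of two function evaluations along the random direction
    return (n / (2.0 * eta)) * (f(x + eta * u) - f(x - eta * u)) * u

def zo_descent(f, x0, eta=1e-2, step=1e-2, iters=500, seed=0):
    """Zeroth-order descent with a fixed stepsize and fixed smoothing
    parameter, mirroring the fixed-parameter setting described above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * spherical_zo_gradient(f, x, eta, rng)
    return x
```

As a usage example, running `zo_descent` on a simple smooth test function such as `f(x) = ||x||^2` drives the iterate toward the minimizer using function values alone; in the bilevel AC setting, each oracle call would instead return an inexact estimate of the lower-level (critic) value.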