<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0" article-type="research-article">
  <front>
    <journal-meta><journal-id journal-id-type="publisher">MS</journal-id><journal-title-group>
    <journal-title>Mechanical Sciences</journal-title>
    <abbrev-journal-title abbrev-type="publisher">MS</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Mech. Sci.</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">2191-916X</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/ms-17-671-2026</article-id><title-group><article-title>Curriculum-learning-driven hierarchical multi-agent deep reinforcement learning for collaborative scheduling in complex supply chain networks</article-title><alt-title>Curriculum-learning-driven hierarchical multi-agent deep reinforcement learning</alt-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Dong</surname><given-names>Jingya</given-names></name>
          
        <ext-link>https://orcid.org/0000-0002-6841-7166</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Zhao</surname><given-names>Han</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Zhao</surname><given-names>Suyi</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Wang</surname><given-names>Yijie</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Guo</surname><given-names>Mengfan</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="yes" rid="aff2">
          <name><surname>Song</surname><given-names>Chunhe</given-names></name>
          <email>chhsong@iaii.ac.cn</email>
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Xu</surname><given-names>Mingliang</given-names></name>
          
        </contrib>
        <aff id="aff1"><label>1</label><institution>School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China</institution>
        </aff>
        <aff id="aff2"><label>2</label><institution>Institute of AI for Industries, Nanjing 211100, China</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Chunhe Song (chhsong@iaii.ac.cn)</corresp></author-notes><pub-date><day>25</day><month>June</month><year>2026</year></pub-date>
      
      <volume>17</volume>
      <issue>1</issue>
      <fpage>671</fpage><lpage>684</lpage>
      <history>
        <date date-type="received"><day>26</day><month>April</month><year>2026</year></date>
           <date date-type="rev-recd"><day>19</day><month>May</month><year>2026</year></date>
           <date date-type="accepted"><day>29</day><month>May</month><year>2026</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2026 Jingya Dong et al.</copyright-statement>
        <copyright-year>2026</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026.html">This article is available from https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026.html</self-uri><self-uri xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026.pdf">The full text article is available as a PDF file from https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026.pdf</self-uri>
      <abstract><title>Abstract</title>

      <p id="d2e138">With the growing scale, heterogeneity, and dynamic uncertainty of modern supply chain networks, collaborative scheduling across order assignment, manufacturer selection, and logistics operations has become increasingly critical and challenging because of strong inter-stage coupling, high decision complexity, and dynamic operational constraints. To address these challenges, this paper investigates the joint optimization problem of order assignment, heterogeneous manufacturer selection, and logistics vehicle scheduling in dynamic supply chain collaborative networks and proposes a curriculum-learning-driven hierarchical multi-agent deep reinforcement learning framework (CH-MADRL) for coordinated scheduling in complex environments. First, the joint optimization problem is formulated as a hierarchical multi-agent Markov decision process to capture the hierarchical dependencies and dynamic interactions among order assignment, heterogeneous manufacturer selection, and logistics vehicle scheduling, which establishes a unified modeling foundation for multi-stage collaborative scheduling. Second, based on this formulation, a hierarchical multi-agent deep reinforcement learning architecture is developed to decompose the tightly coupled high-dimensional joint scheduling problem into three correlated sub-problems, enabling coordinated optimization across different stages of the supply chain. Third, a constraint-progressive adaptive curriculum-learning mechanism is developed to facilitate policy learning under dynamic constraints, where a stage-conditioned dynamic masking mechanism regulates feasible action spaces, and a dual-gated promotion strategy stabilizes transitions across curriculum stages. Simulation experiments demonstrate that the proposed method surpasses baseline approaches in scheduling performance, training efficiency, and cross-scale generalization capability.</p>
  </abstract>
    
<funding-group>
<award-group id="gs1">
<funding-source>National Key Research and Development Program of China</funding-source>
<award-id>2024YFB3311600</award-id>
</award-group>
<award-group id="gs2">
<funding-source>Henan Provincial Science and Technology Research Project</funding-source>
<award-id>252102211074</award-id>
</award-group>
</funding-group>
</article-meta>
  </front>
<body>
      

<sec id="Ch1.S1" sec-type="intro">
  <label>1</label><title>Introduction</title>
      <p id="d2e150">The global supply chain system is undergoing a significant evolution characterized by heightened uncertainty. The superposition of multiple factors, including trade frictions, geopolitical risks, public health emergencies, and demand fluctuations, imposes continuous shocks on the stable operation of cross-regional supply networks <xref ref-type="bibr" rid="bib1.bibx6" id="paren.1"/>. Under macro environments where volatility, uncertainty, complexity, and ambiguity coexist, supply chain management is transitioning from traditional models dominated by informatization and standardized processes to digital-intelligent systems featuring data-driven approaches, intelligent perception, and autonomous decision making <xref ref-type="bibr" rid="bib1.bibx9 bib1.bibx39" id="paren.2"/>. Meanwhile, enterprises impose higher requirements on the response speed, collaborative efficiency, and resilience of supply chain systems. Multi-stage collaborative optimization and real-time scheduling thus emerge as critical challenges in scenarios with dynamically arriving orders, heterogeneous resource capabilities, and strong coupling between manufacturing and logistics, particularly for intelligent decision making in complex supply chains.</p>
      <p id="d2e159">Researchers worldwide have conducted systematic studies on collaborative scheduling in complex supply chains, and existing methods generally fall into three categories: mathematical programming and decomposition optimization methods, heuristic and metaheuristic methods, and data-driven methods based on reinforcement learning <xref ref-type="bibr" rid="bib1.bibx35" id="paren.3"/>. The first two categories formulate order production and distribution as optimization models subject to process flows, capacity limits, delivery deadlines, and transportation constraints and solve them via exact algorithms or approximate search techniques. <xref ref-type="bibr" rid="bib1.bibx32" id="text.4"/> propose a matheuristic framework that combines local search with exact methods to coordinate production and distribution, optimizing total costs of order fulfillment and shipment. Subsequent work has extended production and logistics scheduling from static single-stage settings to complex environments with equipment failures, dynamic order arrivals, and multi-objective trade-offs <xref ref-type="bibr" rid="bib1.bibx2" id="paren.5"/>. These methods remain effective for small- to medium-scale problems or relatively stable environments. However, continuous order arrivals, intensified cross-stage coupling, and frequent disturbances expose critical limitations: model reconstruction is costly, solution times escalate rapidly, and schedule stability deteriorates.</p>
      <p id="d2e171">Reinforcement learning (RL) handles dynamic task arrivals, resource state changes, and frequent disturbances by interacting with the environment, learning from trial and error, and optimizing long-term returns. Unlike traditional methods that require explicit modeling and static solution processes, RL updates decision policies through continuous feedback. <xref ref-type="bibr" rid="bib1.bibx41" id="text.6"/> formulate dynamic flexible scheduling problems with transportation time constraints as multi-agent collaborative decision processes, coordinating machine agents with job agents. <xref ref-type="bibr" rid="bib1.bibx30" id="text.7"/> propose a nested hierarchical deep reinforcement learning method for production and logistics collaborative scheduling in dynamic flexible job shops. <xref ref-type="bibr" rid="bib1.bibx19" id="text.8"/> develop a real-time scheduling method using deep multi-agent reinforcement learning (MARL) for production and logistics coordination under large-scale dynamic order arrivals. For logistics operations, <xref ref-type="bibr" rid="bib1.bibx18" id="text.9"/> propose a hierarchical framework for dynamic conflict-free automated guided vehicle (AGV) scheduling in automated container terminals. The above studies validate RL, particularly its multi-agent and hierarchical variants, as effective for complex scheduling and logistics coordination <xref ref-type="bibr" rid="bib1.bibx30 bib1.bibx19 bib1.bibx18" id="paren.10"/>. However, existing methods primarily address single workshops, single logistics systems, or local resource coordination. Extending these methods to full-chain supply chain scenarios is difficult because strong coupling among order assignment, heterogeneous manufacturer selection, and vehicle scheduling leads to exponential growth of the joint state and action space. 
Furthermore, dynamic task arrivals, resource competition, and spatiotemporal dependencies compound challenges such as sparse rewards, inefficient exploration, and training non-stationarity. Recent studies note that, although RL and MARL have advanced rapidly in dynamic scheduling, practical deployment remains limited by scalability, training stability, and cross-scenario generalization <xref ref-type="bibr" rid="bib1.bibx40 bib1.bibx7" id="paren.11"/>.</p>
      <p id="d2e193">For RL in complex supply chain collaborative scheduling, curriculum learning (CL) provides a feasible way to organize training under conditions of cold-start difficulties, sparse rewards, and training instability. Training tasks are arranged progressively so that agents can first learn basic decision rules and then adapt to larger networks, stronger stage couplings, and tighter operational constraints. Results from related scheduling studies support the usefulness of this training paradigm. <xref ref-type="bibr" rid="bib1.bibx14" id="text.12"/> demonstrate that curriculum-based training performs better than direct training for large-scale instances in job shop scheduling, while <xref ref-type="bibr" rid="bib1.bibx26" id="text.13"/> show that task ordering based on difficulty and transferability improves reinforcement learning performance. A similar situation arises in supply chain collaborative scheduling, where increasing network scale is usually accompanied by stronger interactions among decision stages and more complex constraints. However, existing CL methods rely on predefined difficulty orders or rough task selection heuristics, and only limited attention has been paid to the gradual regulation of feasible action spaces during training. This issue is particularly important in large discrete decision spaces, where invalid action masking can reduce infeasible exploration and improve training stability <xref ref-type="bibr" rid="bib1.bibx13" id="paren.14"/>. Recent work has started to examine automatic curriculum design in sparse-reward cooperative multi-agent reinforcement learning <xref ref-type="bibr" rid="bib1.bibx4" id="paren.15"/>, but research that combines curriculum learning, hierarchical multi-agent coordination, and dynamic feasible-action control for complex supply chain scheduling is still limited.</p>
      <p id="d2e209">To address these limitations, we investigate the joint optimization of order assignment, heterogeneous manufacturer selection, and logistics vehicle scheduling in dynamic supply chain collaborative networks. First, we formulate this problem as a hierarchical multi-agent Markov decision process that captures the coupling among order, manufacturing, and logistics decisions. Then, we propose a curriculum-learning-driven hierarchical multi-agent deep reinforcement learning framework (CH-MADRL), which coordinates scheduling across stages through hierarchical decomposition, curriculum progression, and dynamic constraint handling. Furthermore, we design a stage-conditioned dynamic masking mechanism and a dual-gated promotion strategy to gradually expand feasible action spaces and stabilize curriculum progression. Finally, experiments on dynamic environments validate the proposed method. The main contributions of this paper are as follows: <list list-type="custom"><list-item><label>1.</label>
      <p id="d2e214">A CH-MADRL framework is proposed, which decouples order assignment, heterogeneous manufacturer selection, and logistics vehicle scheduling into three hierarchically correlated sub-problems to alleviate the combinatorial complexity caused by high-dimensional joint decision making.</p></list-item><list-item><label>2.</label>
      <p id="d2e218">A constraint-progressive adaptive curriculum-learning mechanism is designed, which achieves smooth transitions across tasks of different scales and improves policy training stability through a simple-to-complex curriculum path and a dual-gated promotion strategy.</p></list-item><list-item><label>3.</label>
      <p id="d2e222">A stage-conditioned dynamic masking mechanism is constructed, which embeds task dependencies, resource availability, and scale boundaries into the action selection process, achieving progressive unlocking of feasible action spaces and reducing invalid exploration.</p></list-item><list-item><label>4.</label>
      <p id="d2e226">Comparative experiments show that the proposed method outperforms baselines in scheduling performance, training efficiency, and cross-scale generalization.</p></list-item></list></p>
      <p id="d2e229">The remainder of this paper is organized as follows. Section 2 introduces the related literature and techniques. Section 3 provides a formal description and mathematical modeling of the research problem. Section 4 elaborates on the proposed CH-MADRL framework in detail. Section 5 presents experimental results. Section 6 concludes and outlines future research directions.</p>
</sec>
<sec id="Ch1.S2">
  <label>2</label><title>Related work</title>
<sec id="Ch1.S2.SS1">
  <label>2.1</label><title>Collaborative logistics scheduling methods in supply chains</title>
      <p id="d2e247">Coordination between production and logistics stages constitutes the basis of supply chain collaborative scheduling. Expansion from single factories and distribution chains to multi-facility, multi-resource, and multi-node networks has shifted research from separate optimization of orders, production, and distribution toward integrated production and distribution and joint manufacturing and transportation optimization. <xref ref-type="bibr" rid="bib1.bibx20" id="text.16"/> examine collaborative production and material distribution, showing that independent optimization of either stage lowers system-wide efficiency under strong spatiotemporal coupling. On this basis, <xref ref-type="bibr" rid="bib1.bibx10" id="text.17"/> investigate joint production and transportation scheduling problems in flexible job shops, <xref ref-type="bibr" rid="bib1.bibx38" id="text.18"/> incorporate limited AGV resources into flexible job shop models to formalize manufacturing and transportation collaboration, and <xref ref-type="bibr" rid="bib1.bibx31" id="text.19"/> develop batch-centric mixed-integer programming for integrated production and pipeline distribution. These studies demonstrate that production and logistics collaborative scheduling has evolved from traditional single-stage optimization to integrated optimization oriented toward multi-resource coupling.</p>
      <p id="d2e262">Regarding solution methodologies, existing research includes exact algorithms, heuristic algorithms, and metaheuristic algorithms. Exact methods guarantee optimality for small-scale instances, yet computational complexity escalates rapidly with order scale, node quantity, and transportation resources, precluding real-time decision making in dynamic scenarios <xref ref-type="bibr" rid="bib1.bibx38 bib1.bibx31" id="paren.20"/>. Consequently, researchers generally adopt learning-augmented heuristic and metaheuristic methods. <xref ref-type="bibr" rid="bib1.bibx29" id="text.21"/> combine learning-augmented mechanisms with local search for parallel serial batch processing machine scheduling; <xref ref-type="bibr" rid="bib1.bibx1" id="text.22"/> propose parallel heuristic methods for hybrid job shop scheduling with conflict-free AGV path planning; <xref ref-type="bibr" rid="bib1.bibx28" id="text.23"/> fuse GRASP, genetic algorithms, and learning mechanisms for supply chain scheduling; <xref ref-type="bibr" rid="bib1.bibx15" id="text.24"/> introduce energy consumption factors into multi-factory supply chain scheduling, expanding modeling dimensions. However, most of these studies are still centered on single factories or limited collaborative settings, and research on integrated scheduling across order assignment, manufacturer selection, and vehicle dispatching in dynamic supply chain networks remains relatively scarce.</p>
</sec>
<sec id="Ch1.S2.SS2">
  <label>2.2</label><title>Dynamic-scheduling methods based on deep reinforcement learning</title>
      <p id="d2e288">In recent years, deep reinforcement learning (DRL) has emerged as an important direction in dynamic-scheduling research because it learns sequential decision policies end to end in high-dimensional state spaces. <xref ref-type="bibr" rid="bib1.bibx23" id="text.25"/> formulate dynamic flexible job shop scheduling as Markov decision processes, demonstrating DRL's real-time response advantages under random order arrivals. Subsequent work has moved beyond single-agent methods to graph neural networks, multi-agent reinforcement learning, and hierarchical reinforcement learning. <xref ref-type="bibr" rid="bib1.bibx27" id="text.26"/> and <xref ref-type="bibr" rid="bib1.bibx8" id="text.27"/> review DRL applications for dynamic job shop scheduling and supply chain production scheduling, respectively; <xref ref-type="bibr" rid="bib1.bibx22" id="text.28"/> combine graph neural networks with DRL for dynamic job shop scheduling; <xref ref-type="bibr" rid="bib1.bibx34" id="text.29"/> propose attention-enhanced reinforcement learning methods for flexible job shop scheduling with transportation constraints; <xref ref-type="bibr" rid="bib1.bibx17" id="text.30"/> propose an end-to-end decentralized scheduling framework for dynamic distributed heterogeneous flow shops; and <xref ref-type="bibr" rid="bib1.bibx33" id="text.31"/> extend hierarchical multi-agent deep reinforcement learning to flexible job shops with transportation constraints. Meanwhile, multi-agent deep reinforcement learning (MADRL) has been extended from workshop scheduling to broader collaborative decision problems such as online scheduling in assembly systems, distributed hybrid flow shops, and multi-echelon inventory management <xref ref-type="bibr" rid="bib1.bibx16 bib1.bibx5 bib1.bibx36 bib1.bibx24" id="paren.32"/>. These studies demonstrate that DRL has gradually evolved from single-workshop internal scheduling toward complex scenarios involving cross-resource, cross-node, and multi-agent collaboration.</p>
      <p id="d2e316">However, existing DRL research exhibits two limitations in full-chain collaborative scenarios within complex supply chains. First, most methods address single workshops, single logistics systems, or local resource allocation and lack integrated modeling of order decomposition, heterogeneous manufacturer collaboration, and logistics vehicle scheduling. Second, as supply chain scale and constraint complexity increase, the joint state and action space expands rapidly, causing cold-start difficulties, sparse rewards, inefficient exploration, and multi-agent training non-stationarity. To alleviate these difficulties, CL has been introduced into RL training processes. Its fundamental idea is to improve sample utilization efficiency and policy transfer capability by organizing tasks from easy to difficult <xref ref-type="bibr" rid="bib1.bibx21" id="paren.33"/>. Applications span flexible job shop scheduling <xref ref-type="bibr" rid="bib1.bibx25" id="paren.34"/> and hierarchical reinforcement learning for dynamic AGV scheduling, automated terminal task allocation, and port equipment coordination <xref ref-type="bibr" rid="bib1.bibx11 bib1.bibx3 bib1.bibx12 bib1.bibx37" id="paren.35"/>. However, existing CL methods remain limited to single-dimensional difficulty progression or empirical stage switching, and they do not systematically couple the synchronous expansion of order scale, manufacturing nodes, and logistics capacity with hierarchical multi-agent decision making under dynamic action space constraints. On this basis, this paper proposes a hierarchical MADRL framework incorporating constraint-progressive curriculum evolution to address cross-level joint decision problems in complex supply chain collaborative scheduling.</p>
</sec>
</sec>
<sec id="Ch1.S3">
  <label>3</label><title>Problem formulation</title>
      <p id="d2e337">This paper investigates a multi-stage decision optimization problem in collaborative supply chain logistics, covering three interrelated decision dimensions: order task decomposition, manufacturing resource allocation, and transportation resource scheduling. Unlike the classical flexible job shop scheduling problem (FJSP), limited to single-facility operations, our collaborative setting introduces strict cross-node logistics constraints and spatiotemporal coupling. This integration exponentially expands the joint action space, making the proposed adaptive curriculum mechanism computationally essential to overcome the resulting high-dimensional exploration challenges. For formal modeling, the problem is represented as a standardized supply chain instance with scale <inline-formula><mml:math id="M1" display="inline"><mml:mrow><mml:mi>n</mml:mi><mml:mo>×</mml:mo><mml:mi>m</mml:mi><mml:mo>×</mml:mo><mml:mi>l</mml:mi></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M2" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula>, <inline-formula><mml:math id="M3" display="inline"><mml:mi>m</mml:mi></mml:math></inline-formula>, and <inline-formula><mml:math id="M4" display="inline"><mml:mi>l</mml:mi></mml:math></inline-formula> denote the number of orders, heterogeneous manufacturers, and logistics vehicles, respectively. 
Let <inline-formula><mml:math id="M5" display="inline"><mml:mrow><mml:mi mathvariant="script">O</mml:mi><mml:mo>=</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:msub><mml:mi>O</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>O</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula> represent the order set, <inline-formula><mml:math id="M6" display="inline"><mml:mrow><mml:mi mathvariant="script">M</mml:mi><mml:mo>=</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula> represent the manufacturer set, and <inline-formula><mml:math id="M7" display="inline"><mml:mrow><mml:mi mathvariant="script">L</mml:mi><mml:mo>=</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mi>l</mml:mi></mml:msub><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula> represent the vehicle set. 
Each order <inline-formula><mml:math id="M8" display="inline"><mml:mrow><mml:msub><mml:mi>O</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> contains <inline-formula><mml:math id="M9" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> operations with process precedence, denoted as <inline-formula><mml:math id="M10" display="inline"><mml:mrow><mml:msub><mml:mi>K</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula>, where operation <inline-formula><mml:math id="M11" display="inline"><mml:mrow><mml:msub><mml:mi>o</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> must be completed before <inline-formula><mml:math id="M12" display="inline"><mml:mrow><mml:msub><mml:mi>o</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>. 
Each operation <inline-formula><mml:math id="M13" display="inline"><mml:mrow><mml:msub><mml:mi>o</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> can only be assigned to a manufacturer <inline-formula><mml:math id="M14" display="inline"><mml:mrow><mml:mi>j</mml:mi><mml:mo>∈</mml:mo><mml:mi mathvariant="script">M</mml:mi></mml:mrow></mml:math></inline-formula> with the required processing capability, and its processing time is denoted as <inline-formula><mml:math id="M15" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>. When adjacent operations are executed by different manufacturers, a vehicle <inline-formula><mml:math id="M16" display="inline"><mml:mrow><mml:mi>v</mml:mi><mml:mo>∈</mml:mo><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:math></inline-formula> is required to complete inter-node transportation. The transportation time consists of the empty-vehicle dispatch time from the vehicle's current position to the pickup point and the loaded transportation time from the pickup point to the target manufacturing node. An operation can start processing only when its predecessor operation has been completed, its materials have arrived via transportation, and the assigned manufacturer is available. The optimization objective is to minimize the system makespan:

          <disp-formula id="Ch1.E1" content-type="numbered"><label>1</label><mml:math id="M17" display="block"><mml:mrow><mml:mo movablelimits="false">min⁡</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mo>max⁡</mml:mo></mml:msub><mml:mo>=</mml:mo><mml:mo movablelimits="false">min⁡</mml:mo><mml:mo>(</mml:mo><mml:munder><mml:mo movablelimits="false">max⁡</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>∈</mml:mo><mml:mi mathvariant="script">O</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

        where <inline-formula><mml:math id="M18" display="inline"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> is determined by the manufacturing and transportation processes at each stage of the order. For order <inline-formula><mml:math id="M19" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula>, the completion time of the final operation is as follows:

          <disp-formula id="Ch1.E2" content-type="numbered"><label>2</label><mml:math id="M20" display="block"><mml:mtable class="split" rowspacing="0.2ex" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mspace width="0.25em" linebreak="nobreak"/><mml:mo movablelimits="false">max⁡</mml:mo><mml:mo mathsize="2.0em">[</mml:mo><mml:msubsup><mml:mi>T</mml:mi><mml:mi>j</mml:mi><mml:mtext>avail</mml:mtext></mml:msubsup><mml:mo>,</mml:mo><mml:mo mathsize="2.0em">(</mml:mo><mml:mo movablelimits="false">max⁡</mml:mo><mml:mo mathsize="2.0em">(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>T</mml:mi><mml:mi mathvariant="normal">v</mml:mi><mml:mtext>free</mml:mtext></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">v</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>i</mml:mi><mml:mo>∗</mml:mo></mml:msup></mml:mrow><mml:mtext>empty</mml:mtext></mml:msubsup><mml:mo mathsize="2.0em">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mo>+</mml:mo><mml:msubsup><mml:mi>T</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mtext>trans</mml:mtext></mml:msubsup><mml:mo mathsize="2.0em">)</mml:mo><mml:mo mathsize="2.0em">]</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
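      <p>As an illustration only, the completion-time recursion in Eq. (2) and the makespan in Eq. (1) can be sketched in Python. The sketch assumes that Eq. (2) is read as the maximum of the manufacturer availability time and the material arrival time, plus the processing time, and that the same recursion applies to every operation that needs transportation, not only the final one. The names <monospace>OperationContext</monospace>, <monospace>completion_time</monospace>, and <monospace>makespan</monospace> are hypothetical and do not come from the paper.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class OperationContext:
    """Inputs to Eq. (2) for one operation; all field names are illustrative."""
    prev_completion: float     # C_{i,k-1}: completion time of the predecessor
    manufacturer_avail: float  # T_j^avail: time at which manufacturer j is free
    vehicle_free: float        # T_v^free: time at which vehicle v is released
    empty_travel: float        # T_{v,i*}^empty: empty dispatch time to the pickup point
    loaded_travel: float       # T_{i,j}^trans: loaded travel time to manufacturer j
    processing: float          # p_{i,k,j}: processing time on manufacturer j


def completion_time(ctx: OperationContext) -> float:
    """Completion time of one operation under the assumed reading of Eq. (2)."""
    # Pickup happens once the predecessor has finished and the vehicle
    # has driven empty to the pickup point.
    pickup = max(ctx.prev_completion, ctx.vehicle_free + ctx.empty_travel)
    arrival = pickup + ctx.loaded_travel
    # Processing starts when materials have arrived and the manufacturer is free.
    start = max(ctx.manufacturer_avail, arrival)
    return start + ctx.processing


def makespan(final_completions: list[float]) -> float:
    """Eq. (1): latest completion time of the final operation over all orders."""
    return max(final_completions)


if __name__ == "__main__":
    ctx = OperationContext(
        prev_completion=10.0,
        manufacturer_avail=12.0,
        vehicle_free=8.0,
        empty_travel=3.0,
        loaded_travel=4.0,
        processing=5.0,
    )
    print(completion_time(ctx))  # prints 20.0
```
]]></preformat>
      <p>In this example the material arrival time (15) exceeds the manufacturer availability time (12), so transportation rather than manufacturing capacity determines the start of processing.</p>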
      <p id="d2e838">For each operation, the processing start time is constrained by the completion of its predecessor, the arrival of transported materials, and the availability of the assigned manufacturer. The completion time is determined by the start time and the corresponding processing duration. Accordingly, the final order completion time is jointly affected by manufacturing capacity constraints, process precedence constraints, and transportation time constraints. For modeling purposes, the following assumptions are adopted: <list list-type="custom"><list-item><label>1.</label>
      <p id="d2e843">Manufacturer resources are exclusive and non-pre-emptive, meaning that each manufacturer can process only one operation at a time and processing cannot be interrupted.</p></list-item><list-item><label>2.</label>
      <p id="d2e847">Tasks follow a strict serial flow of processing, transportation, and reprocessing, and all operations must be executed in the prescribed order.</p></list-item><list-item><label>3.</label>
      <p id="d2e851">When two adjacent operations are assigned to different manufacturers, inter-node transportation must be completed by a logistics vehicle, and the corresponding transportation time is nonzero.</p></list-item><list-item><label>4.</label>
      <p id="d2e855">Logistics vehicles are subject to exclusivity constraints, meaning that each vehicle can execute only one transportation task at a time.</p></list-item><list-item><label>5.</label>
      <p id="d2e859">Orders arrive randomly over time, and fluctuations in actual operational efficiency are determined by manufacturer fulfillment reputation <inline-formula><mml:math id="M21" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and logistics vehicle service quality <inline-formula><mml:math id="M22" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">v</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>.</p></list-item></list></p>

      <fig id="F1" specific-use="star"><label>Figure 1</label><caption><p id="d2e886">Architecture of the proposed CH-MADRL framework.</p></caption>
        <graphic xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026-f01.png"/>

      </fig>

</sec>
<sec id="Ch1.S4">
  <label>4</label><title>CH-MADRL framework</title>
      <p id="d2e903">To address high-dimensional decision complexity in large-scale supply chain collaborative scheduling, together with cold-start difficulty and inefficient exploration under complex spatiotemporal constraints, this paper proposes the CH-MADRL framework. As shown in Fig. <xref ref-type="fig" rid="F1"/>, the framework combines hierarchical task decomposition, dynamic masking, and adaptive curriculum learning to reduce decision complexity, incorporate multi-stage dependencies and spatiotemporal constraints, and improve convergence efficiency and generalization performance.</p>
<sec id="Ch1.S4.SS1">
  <label>4.1</label><title>Hierarchical multi-agent MDP model construction</title>
      <p id="d2e915">Considering the dynamic, partially observable, and multi-resource-coupled nature of supply chain networks, the scheduling problem is formulated as a hierarchical decentralized partially observable Markov decision process. The model is defined by the tuple <inline-formula><mml:math id="M23" display="inline"><mml:mrow><mml:mo>〈</mml:mo><mml:mi mathvariant="script">F</mml:mi><mml:mo>,</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>P</mml:mi><mml:mo>,</mml:mo><mml:mi>R</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">Ω</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="italic">γ</mml:mi><mml:mo>〉</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M24" display="inline"><mml:mrow><mml:mi mathvariant="script">F</mml:mi><mml:mo>=</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mtext>ord</mml:mtext></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mtext>mfg</mml:mtext></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>F</mml:mi><mml:mtext>log</mml:mtext></mml:msub><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula> denotes the heterogeneous agent set for order assignment, manufacturer selection, and logistics scheduling; <inline-formula><mml:math id="M25" display="inline"><mml:mi>S</mml:mi></mml:math></inline-formula> denotes the global state space describing real-time information on resource nodes and order flows; <inline-formula><mml:math id="M26" display="inline"><mml:mi mathvariant="normal">Ω</mml:mi></mml:math></inline-formula> denotes the joint observation space composed of local observations <inline-formula><mml:math id="M27" display="inline"><mml:mrow><mml:mo 
mathvariant="italic">{</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mtext>job</mml:mtext></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mtext>mfg</mml:mtext></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mtext>log</mml:mtext></mml:msub><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula> at different decision layers; <inline-formula><mml:math id="M28" display="inline"><mml:mi>A</mml:mi></mml:math></inline-formula> denotes the joint action space composed of discrete action subspaces at three hierarchical levels: order assignment, manufacturer selection, and logistics scheduling; <inline-formula><mml:math id="M29" display="inline"><mml:mi>P</mml:mi></mml:math></inline-formula> denotes the state transition function governed by operation precedence, transportation delays, and other physical process constraints; <inline-formula><mml:math id="M30" display="inline"><mml:mi>R</mml:mi></mml:math></inline-formula> denotes the reward function designed to promote global collaborative optimization; and <inline-formula><mml:math id="M31" display="inline"><mml:mrow><mml:mi mathvariant="italic">γ</mml:mi><mml:mo>∈</mml:mo><mml:mo>[</mml:mo><mml:mn mathvariant="normal">0</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the discount factor.</p>

<table-wrap id="T1" specific-use="star"><label>Table 1</label><caption><p id="d2e1076">Definitions of state space features.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="3">
     <oasis:colspec colnum="1" colname="col1" align="justify" colwidth="25mm"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="justify" colwidth="127mm"/>
     <oasis:thead>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1" align="left">State</oasis:entry>

         <oasis:entry colname="col2">Symbol</oasis:entry>

         <oasis:entry colname="col3" align="left">Description</oasis:entry>

       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>

         <oasis:entry colname="col1" morerows="1" align="left">Order agent state</oasis:entry>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M32" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Order index</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M33" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Index of the operation currently pending processing</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M34" display="inline"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Actual completion time of the preceding operation</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M35" display="inline"><mml:mrow><mml:msub><mml:mtext>Loc</mml:mtext><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Current geographical location of the order</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M36" display="inline"><mml:mrow><mml:msub><mml:mtext>Prog</mml:mtext><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Order scheduling progress, computed as <inline-formula><mml:math id="M37" display="inline"><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>)</mml:mo><mml:mo>/</mml:mo><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>×</mml:mo></mml:mrow></mml:math></inline-formula> 100 %, where <inline-formula><mml:math id="M38" display="inline"><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the total number of operations for order <inline-formula><mml:math id="M39" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M40" display="inline"><mml:mrow><mml:msubsup><mml:mi>T</mml:mi><mml:mi>i</mml:mi><mml:mtext>cum</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Cumulative actual processing and logistics time incurred by the order</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M41" display="inline"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>P</mml:mi><mml:mo mathvariant="normal">‾</mml:mo></mml:mover><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Average expected processing time of the current pending operation <inline-formula><mml:math id="M42" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M43" display="inline"><mml:mrow><mml:msubsup><mml:mover accent="true"><mml:mi>T</mml:mi><mml:mo mathvariant="normal">‾</mml:mo></mml:mover><mml:mi>i</mml:mi><mml:mtext>rem</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Estimated average cumulative processing duration of the order's remaining operations</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" morerows="1" align="left">Manufacturer agent state</oasis:entry>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M44" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Manufacturer index</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M45" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Production quality grade of the manufacturer</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M46" display="inline"><mml:mrow><mml:msubsup><mml:mi>N</mml:mi><mml:mi>j</mml:mi><mml:mtext>ops</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Cumulative total number of operations processed by the manufacturer</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M47" display="inline"><mml:mrow><mml:msubsup><mml:mi>T</mml:mi><mml:mi>j</mml:mi><mml:mtext>avail</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Earliest available time of the manufacturer upon completion of preceding tasks</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M48" display="inline"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:msup><mml:mi>i</mml:mi><mml:mo>∗</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Interaction feature: standard processing time of the current operation <inline-formula><mml:math id="M49" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula> of the selected order <inline-formula><mml:math id="M50" display="inline"><mml:mrow><mml:msup><mml:mi>i</mml:mi><mml:mo>∗</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> at manufacturer <inline-formula><mml:math id="M51" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row rowsep="1">

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M52" display="inline"><mml:mrow><mml:msubsup><mml:mi>T</mml:mi><mml:mrow><mml:msup><mml:mi>i</mml:mi><mml:mo>∗</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mtext>trans</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Interaction feature: estimated in-transit transportation time for the selected order <inline-formula><mml:math id="M53" display="inline"><mml:mrow><mml:msup><mml:mi>i</mml:mi><mml:mo>∗</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula> from its current location <inline-formula><mml:math id="M54" display="inline"><mml:mrow><mml:msub><mml:mtext>Loc</mml:mtext><mml:mrow><mml:msup><mml:mi>i</mml:mi><mml:mo>∗</mml:mo></mml:msup></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula> to manufacturer <inline-formula><mml:math id="M55" display="inline"><mml:mi>j</mml:mi></mml:math></inline-formula></oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" morerows="1" align="left">Logistics agent state</oasis:entry>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M56" display="inline"><mml:mi>v</mml:mi></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Logistics vehicle index</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M57" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">v</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Service quality grade of the logistics vehicle</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M58" display="inline"><mml:mrow><mml:msubsup><mml:mi>N</mml:mi><mml:mi mathvariant="normal">v</mml:mi><mml:mtext>tasks</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Cumulative number of transportation tasks executed by the vehicle</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M59" display="inline"><mml:mrow><mml:msubsup><mml:mi>T</mml:mi><mml:mi mathvariant="normal">v</mml:mi><mml:mtext>free</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Release time of the logistics vehicle upon completion of its previous task</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M60" display="inline"><mml:mrow><mml:msub><mml:mtext>Loc</mml:mtext><mml:mi mathvariant="normal">v</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Current geographical location of the vehicle</oasis:entry>

       </oasis:row>
       <oasis:row>

         <oasis:entry colname="col1" align="left"/>

         <oasis:entry colname="col2"><inline-formula><mml:math id="M61" display="inline"><mml:mrow><mml:msubsup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">v</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>i</mml:mi><mml:mo>∗</mml:mo></mml:msup></mml:mrow><mml:mtext>empty</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula></oasis:entry>

         <oasis:entry colname="col3" align="left">Interaction feature: estimated dispatching time for vehicle <inline-formula><mml:math id="M62" display="inline"><mml:mi>v</mml:mi></mml:math></inline-formula> to travel empty from its current location <inline-formula><mml:math id="M63" display="inline"><mml:mrow><mml:msub><mml:mtext>Loc</mml:mtext><mml:mi mathvariant="normal">v</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> to the location of the selected order <inline-formula><mml:math id="M64" display="inline"><mml:mrow><mml:msup><mml:mi>i</mml:mi><mml:mo>∗</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula></oasis:entry>

       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

<sec id="Ch1.S4.SS1.SSS1">
  <label>4.1.1</label><title>State space</title>
      <p id="d2e1695">Following the partially observable modeling framework, the state space is decomposed into three hierarchical local subspaces: the order agent state <inline-formula><mml:math id="M65" display="inline"><mml:mrow><mml:msubsup><mml:mi>S</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>, the manufacturer agent state <inline-formula><mml:math id="M66" display="inline"><mml:mrow><mml:msubsup><mml:mi>S</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>, and the logistics agent state <inline-formula><mml:math id="M67" display="inline"><mml:mrow><mml:msubsup><mml:mi>S</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">L</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>. The state features of the three agent layers are listed in Table <xref ref-type="table" rid="T1"/>, where all temporal variables are quantified in minutes to ensure dimensional consistency.</p>
</sec>
<sec id="Ch1.S4.SS1.SSS2">
  <label>4.1.2</label><title>Action space</title>
      <p id="d2e1748">The joint action space is decomposed as <inline-formula><mml:math id="M68" display="inline"><mml:mrow><mml:mi mathvariant="script">A</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="script">A</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:msup><mml:mo>×</mml:mo><mml:msup><mml:mi mathvariant="script">A</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:msup><mml:mo>×</mml:mo><mml:msup><mml:mi mathvariant="script">A</mml:mi><mml:mi mathvariant="normal">L</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>. A dynamic mask <inline-formula><mml:math id="M69" display="inline"><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is introduced to project action probability distributions onto feasible domains and enforce physical feasibility constraints.</p>
      <p id="d2e1791"><italic>Order assignment subspace (</italic><inline-formula><mml:math id="M70" display="inline"><mml:mrow><mml:msup><mml:mi mathvariant="script">A</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula><italic>).</italic> The action <inline-formula><mml:math id="M71" display="inline"><mml:mrow><mml:msubsup><mml:mi>a</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:msubsup><mml:mo>∈</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula> selects a high-priority order for processing. The mask <inline-formula><mml:math id="M72" display="inline"><mml:mrow><mml:msubsup><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:msubsup><mml:mo>∈</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">0</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:msup><mml:mo mathvariant="italic">}</mml:mo><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> marks an order as selectable only when it has not been delivered and satisfies the prerequisite readiness conditions:

              <disp-formula id="Ch1.E3" content-type="numbered"><label>3</label><mml:math id="M73" display="block"><mml:mrow><mml:msubsup><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">O</mml:mi></mml:msubsup><mml:mo>[</mml:mo><mml:mi>i</mml:mi><mml:mo>]</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="double-struck">I</mml:mi><mml:mo>(</mml:mo><mml:msubsup><mml:mi mathvariant="italic">δ</mml:mi><mml:mi>i</mml:mi><mml:mtext>done</mml:mtext></mml:msubsup><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0</mml:mn><mml:mo>)</mml:mo><mml:mo>⋅</mml:mo><mml:mi mathvariant="double-struck">I</mml:mi><mml:mo>(</mml:mo><mml:mtext>Ready</mml:mtext><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">t</mml:mi><mml:mo>)</mml:mo><mml:mo>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>

            where <inline-formula><mml:math id="M74" display="inline"><mml:mrow><mml:mi mathvariant="double-struck">I</mml:mi><mml:mo>(</mml:mo><mml:mo>⋅</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the indicator function, and <inline-formula><mml:math id="M75" display="inline"><mml:mrow><mml:mtext>Ready</mml:mtext><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> indicates material and preceding-operation readiness for order <inline-formula><mml:math id="M76" display="inline"><mml:mi>i</mml:mi></mml:math></inline-formula>.</p>
      <p id="d2e1966"><italic>Manufacturer selection subspace (</italic><inline-formula><mml:math id="M77" display="inline"><mml:mrow><mml:msup><mml:mi mathvariant="script">A</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula><italic>)</italic>. The action <inline-formula><mml:math id="M78" display="inline"><mml:mrow><mml:msubsup><mml:mi>a</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:msubsup><mml:mo>∈</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula> selects a manufacturer for the current operation. The mask <inline-formula><mml:math id="M79" display="inline"><mml:mrow><mml:msubsup><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:msubsup><mml:mo>∈</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">0</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:msup><mml:mo mathvariant="italic">}</mml:mo><mml:mi>m</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> enforces process qualification constraints based on the capability matrix <inline-formula><mml:math id="M80" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="bold">T</mml:mi><mml:mtext>cap</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula>, retaining only manufacturers with the required capability for operation <inline-formula><mml:math id="M81" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>:

              <disp-formula id="Ch1.E4" content-type="numbered"><label>4</label><mml:math id="M82" display="block"><mml:mrow><mml:msubsup><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">M</mml:mi></mml:msubsup><mml:mo>[</mml:mo><mml:mi>j</mml:mi><mml:mo>]</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="double-struck">I</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold">T</mml:mi><mml:mtext>cap</mml:mtext></mml:msub><mml:mo>[</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>]</mml:mo><mml:mo>≠</mml:mo><mml:mi mathvariant="normal">∅</mml:mi><mml:mo>)</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
      <p id="d2e2107"><italic>Logistics scheduling subspace (</italic><inline-formula><mml:math id="M83" display="inline"><mml:mrow><mml:msup><mml:mi mathvariant="script">A</mml:mi><mml:mi mathvariant="normal">L</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula><italic>)</italic>. The action <inline-formula><mml:math id="M84" display="inline"><mml:mrow><mml:msubsup><mml:mi>a</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">L</mml:mi></mml:msubsup><mml:mo>∈</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula> dispatches a logistics vehicle. The mask <inline-formula><mml:math id="M85" display="inline"><mml:mrow><mml:msubsup><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">L</mml:mi></mml:msubsup><mml:mo>∈</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mn mathvariant="normal">0</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:msup><mml:mo mathvariant="italic">}</mml:mo><mml:mi>l</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> reflects resource availability and excludes vehicles under failure or maintenance:

              <disp-formula id="Ch1.E5" content-type="numbered"><label>5</label><mml:math id="M86" display="block"><mml:mrow><mml:msubsup><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mi mathvariant="normal">L</mml:mi></mml:msubsup><mml:mo>[</mml:mo><mml:mi>v</mml:mi><mml:mo>]</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="double-struck">I</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mtext>Status</mml:mtext><mml:mi mathvariant="normal">v</mml:mi></mml:msub><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo><mml:mo>∈</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:mtext>Idle</mml:mtext><mml:mo>,</mml:mo><mml:mtext>Active</mml:mtext><mml:mo mathvariant="italic">}</mml:mo><mml:mo>)</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
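      <p>The three masks in Eqs. (3)–(5) and their use on a policy output can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the capability matrix is assumed to store <monospace>None</monospace> where a manufacturer lacks the required capability, vehicle status is assumed to be a string label, and the projection onto the feasible domain is assumed to be a masked softmax, which is one common choice. All function names are hypothetical.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def order_mask(done, ready):
    """Eq. (3): 1 if the order is undelivered and ready, else 0."""
    return [int((not d) and r) for d, r in zip(done, ready)]


def manufacturer_mask(cap, k):
    """Eq. (4): 1 if manufacturer j can perform operation k, else 0.

    Assumption: cap[k][j] is None when the capability is absent.
    """
    return [int(entry is not None) for entry in cap[k]]


def vehicle_mask(status):
    """Eq. (5): 1 if the vehicle status is Idle or Active, else 0."""
    return [int(s in ("Idle", "Active")) for s in status]


def masked_softmax(logits, mask):
    """Softmax restricted to feasible actions (assumed projection).

    Infeasible actions receive probability exactly zero.
    """
    feasible = [x for x, m in zip(logits, mask) if m]
    if not feasible:
        raise ValueError("mask leaves no feasible action")
    peak = max(feasible)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) if m else 0.0 for x, m in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]
```
]]></preformat>
      <p>Applying the mask before normalization, rather than penalizing infeasible choices through the reward, keeps infeasible actions out of the sampled trajectories altogether.</p>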
</sec>
<sec id="Ch1.S4.SS1.SSS3">
  <label>4.1.3</label><title>Reward function</title>
      <p id="d2e2241">For makespan minimization, a terminal sparse reward provides only limited learning signals, which often leads to temporal credit assignment difficulties and slow convergence in long-horizon scheduling tasks. To improve training efficiency, a potential-based dense reward is adopted, where the potential function is defined as <inline-formula><mml:math id="M87" display="inline"><mml:mrow><mml:mi mathvariant="normal">Φ</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. The immediate reward at step <inline-formula><mml:math id="M88" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> is therefore defined as follows:

              <disp-formula id="Ch1.E6" content-type="numbered"><label>6</label><mml:math id="M89" display="block"><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="normal">Φ</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:mi mathvariant="normal">Φ</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
      <p id="d2e2367">Since the makespan is monotonically non-decreasing during scheduling, i.e., <inline-formula><mml:math id="M90" display="inline"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>≥</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, the immediate reward always satisfies <inline-formula><mml:math id="M91" display="inline"><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>≤</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:math></inline-formula>. Actions that lengthen the critical path receive an immediate penalty, while actions that leave it unchanged receive zero reward. For an episode of length <inline-formula><mml:math id="M92" display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula> that starts from an empty schedule, so that <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mn mathvariant="normal">0</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mn mathvariant="normal">0</mml:mn></mml:mrow></mml:math></inline-formula>, the cumulative return can be written in telescoping form:

              <disp-formula id="Ch1.E7" content-type="numbered"><label>7</label><mml:math id="M93" display="block"><mml:mtable class="split" rowspacing="0.2ex" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mo>=</mml:mo><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>T</mml:mi></mml:munderover><mml:msub><mml:mi>r</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo movablelimits="false">∑</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mi>T</mml:mi></mml:munderover><mml:mo>[</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mn mathvariant="normal">0</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mi>T</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:msubsup><mml:mi>C</mml:mi><mml:mtext>max</mml:mtext><mml:mtext>final</mml:mtext></mml:msubsup><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
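      <p>The telescoping property of Eqs. (6) and (7) can be checked numerically with a short sketch. The makespan trace below is made-up illustrative data, not a result from the paper, and the function name is hypothetical. The trace starts at zero, as the final equality in Eq. (7) requires.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def shaped_rewards(makespans):
    """Per-step rewards r_t = C_max(S_{t-1}) - C_max(S_t), as in Eq. (6).

    makespans[t] holds C_max(S_t) for t = 0, ..., T.
    """
    return [prev - cur for prev, cur in zip(makespans, makespans[1:])]


# Hypothetical partial-schedule makespans (minutes) over a 4-step episode.
trace = [0, 40, 40, 65, 90]
rewards = shaped_rewards(trace)

# Every step reward is non-positive because the makespan never decreases.
assert all(r <= 0 for r in rewards)
# The undiscounted return telescopes to C_max(S_0) - C_max(S_T), as in Eq. (7).
assert sum(rewards) == trace[0] - trace[-1]
```
]]></preformat>
      <p>In this trace the second step leaves the makespan at 40 min and therefore earns zero reward, while the other steps are penalized by exactly the amount they extend the schedule.</p>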

      <fig id="F2" specific-use="star"><label>Figure 2</label><caption><p id="d2e2570">Constraint-progressive adaptive curriculum learning mechanism.</p></caption>
            <graphic xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026-f02.png"/>

          </fig>

</sec>
</sec>
<sec id="Ch1.S4.SS2">
  <label>4.2</label><title>Constraint-progressive adaptive curriculum learning mechanism</title>
      <p id="d2e2588">To improve learning in large-scale supply chain scheduling with dynamic task arrivals, sparse rewards, and high-dimensional constraints, a constraint-progressive adaptive curriculum is introduced. As shown in Fig. <xref ref-type="fig" rid="F2"/>, training starts with simple supply chain instances and gradually progresses to more complex ones by expanding network topology and task scale. Dynamic masking then releases feasible actions stage by stage, reducing invalid exploration in large decision spaces. Promotion to the next stage is allowed only after policy stabilization, which helps maintain stable training and reliable convergence.</p>
<sec id="Ch1.S4.SS2.SSS1">
  <label>4.2.1</label><title>Supply chain complexity manifold evolution</title>
      <p id="d2e2600">To reduce exploration difficulty in high-dimensional environments, the curriculum-learning process is defined as a discrete sequence of environment sets, <inline-formula><mml:math id="M94" display="inline"><mml:mrow><mml:mi mathvariant="script">C</mml:mi><mml:mo>=</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mi>K</mml:mi></mml:msub><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M95" display="inline"><mml:mrow><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">6</mml:mn></mml:mrow></mml:math></inline-formula>. At each stage <inline-formula><mml:math id="M96" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, the supply chain network is specified by the tuple <inline-formula><mml:math id="M97" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="normal">Ω</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo>〈</mml:mo><mml:msubsup><mml:mi>N</mml:mi><mml:mi>k</mml:mi><mml:mtext>ord</mml:mtext></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>N</mml:mi><mml:mi>k</mml:mi><mml:mtext>mfg</mml:mtext></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>N</mml:mi><mml:mi>k</mml:mi><mml:mtext>log</mml:mtext></mml:msubsup><mml:mo>〉</mml:mo></mml:mrow></mml:math></inline-formula>, whose entries denote the numbers of orders, manufacturers, and logistics vehicles, respectively. The six stages are grouped into three phases of increasing complexity.</p>
      <p id="d2e2705"><italic>Foundation logic construction stage (</italic><inline-formula><mml:math id="M98" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula><italic>).</italic> The initial scale is set to <inline-formula><mml:math id="M99" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="normal">Ω</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mo>〈</mml:mo><mml:mn mathvariant="normal">10</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">5</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">3</mml:mn><mml:mo>〉</mml:mo></mml:mrow></mml:math></inline-formula>. This stage is used to support rapid policy initialization in a low-dimensional setting and to learn basic coordination rules, including order assignment, resource availability, and task dependencies.</p>
      <p id="d2e2749"><italic>Scale continuous increment stage (</italic><inline-formula><mml:math id="M100" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>∼</mml:mo><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mn mathvariant="normal">5</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula><italic>).</italic> The network scale is increased step by step without changing the underlying scheduling mechanism so as to strengthen policy robustness and generalization under growing resource competition and spatiotemporal conflicts.</p>
      <p id="d2e2773"><italic>Target scale collaboration stage (</italic><inline-formula><mml:math id="M101" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mn mathvariant="normal">6</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula><italic>).</italic> The environment reaches the target scale <inline-formula><mml:math id="M102" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="normal">Ω</mml:mi><mml:mn mathvariant="normal">6</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mo>〈</mml:mo><mml:mn mathvariant="normal">20</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">10</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">5</mml:mn><mml:mo>〉</mml:mo></mml:mrow></mml:math></inline-formula>. At this stage, the policy is trained under a larger state space and stronger spatiotemporal coupling to form a stable collaborative scheduling strategy.</p>
</sec>
<sec id="Ch1.S4.SS2.SSS2">
  <label>4.2.2</label><title>Dynamic masking-driven progressive unlocking of action space</title>
      <p id="d2e2825">In multi-agent systems, training directly on target-scale configurations with the preset global maximum action space <inline-formula><mml:math id="M103" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="script">A</mml:mi><mml:mo>max⁡</mml:mo></mml:msub></mml:mrow></mml:math></inline-formula> (whose dimensions are determined by <inline-formula><mml:math id="M104" display="inline"><mml:mrow><mml:msubsup><mml:mi>N</mml:mi><mml:mo>max⁡</mml:mo><mml:mtext>ord</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M105" display="inline"><mml:mrow><mml:msubsup><mml:mi>N</mml:mi><mml:mo>max⁡</mml:mo><mml:mtext>mfg</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M106" display="inline"><mml:mrow><mml:msubsup><mml:mi>N</mml:mi><mml:mo>max⁡</mml:mo><mml:mtext>log</mml:mtext></mml:msubsup></mml:mrow></mml:math></inline-formula>) incurs a severe curse of dimensionality. To address this, this paper decouples the curriculum evolution mechanism from the underlying control logic and formalizes it as a progressive release of the feasible decision domain: at an early curriculum stage <inline-formula><mml:math id="M107" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, the dynamic masking mechanism <inline-formula><mml:math id="M108" display="inline"><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> imposes hard constraint boundaries by masking all action indices that exceed the current network scale, that is, <inline-formula><mml:math id="M109" display="inline"><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>[</mml:mo><mml:mi>a</mml:mi><mml:mo>]</mml:mo><mml:mo>≡</mml:mo><mml:mn mathvariant="normal">0</mml:mn><mml:mo>,</mml:mo><mml:mo>∀</mml:mo><mml:mi>a</mml:mi><mml:mo>&gt;</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is the scale of the corresponding dimension at stage <inline-formula><mml:math display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>. As the curriculum advances to later stages, the masking constraints are gradually released, and the effective decision boundaries of the agents are progressively unlocked.</p>
      <p id="d2e2935">This mechanism decomposes the global optimization problem into a sequence of nested subspace searches, which reduces aimless exploration and improves training stability.</p>
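      <p>The masking rule of this section, in which all indices beyond the current stage scale are set to zero, can be sketched as follows. This is a minimal illustration that assumes one action index per resource and 0-based indexing; the helper name <monospace>stage_mask</monospace> is hypothetical, and the actual action encoding of the method may differ.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def stage_mask(n_max, n_k):
    """Binary mask over a fixed-size action space of n_max indices.

    With 0-based indexing (an assumption of this sketch), indices
    0..n_k-1 are unlocked (1) and indices n_k..n_max-1 are masked (0).
    """
    if not 0 <= n_k <= n_max:
        raise ValueError("n_k must lie in [0, n_max]")
    return [1 if a < n_k else 0 for a in range(n_max)]


# Manufacturer dimension: 5 manufacturers in stage 1 and 10 at the target
# scale (values from the stage definitions in Sect. 4.2.1).
mask_stage1 = stage_mask(n_max=10, n_k=5)
mask_target = stage_mask(n_max=10, n_k=10)

print(mask_stage1)  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(mask_target)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```
]]></preformat>
      <p>Because the mask length stays fixed at the maximum scale, the network output dimension does not change between stages; only the set of unlocked indices grows.</p>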
</sec>
<sec id="Ch1.S4.SS2.SSS3">
  <label>4.2.3</label><title>Adaptive promotion mechanism based on dual-gating</title>
      <p id="d2e2946">This paper designs an adaptive promotion mechanism based on dual gating: promotion is triggered only when both indicators satisfy their conditions within a sliding window <inline-formula><mml:math id="M110" display="inline"><mml:mi>W</mml:mi></mml:math></inline-formula>. The sliding-window size is set to <inline-formula><mml:math id="M111" display="inline"><mml:mrow><mml:mi>W</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula> episodes, chosen through preliminary sensitivity experiments to balance promotion responsiveness against policy convergence stability. <list list-type="order"><list-item>
      <p id="d2e2970">Performance lower-bound gating: The average return <inline-formula><mml:math id="M112" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">μ</mml:mi><mml:mi mathvariant="normal">W</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> within the current window must exceed the preset baseline threshold <inline-formula><mml:math id="M113" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>perf</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>:<disp-formula id="Ch1.E8" content-type="numbered"><label>8</label><mml:math id="M114" display="block"><mml:mrow><mml:msub><mml:mi mathvariant="italic">μ</mml:mi><mml:mi mathvariant="normal">W</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="double-struck">E</mml:mi><mml:mo>[</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mi>W</mml:mi><mml:mo>:</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo><mml:mo>&gt;</mml:mo><mml:msubsup><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>perf</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
      <p id="d2e3050">The threshold <inline-formula><mml:math id="M115" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>perf</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> is determined from the best solutions obtained by heuristic algorithms at the corresponding scale. This criterion requires the DRL policy to outperform traditional scheduling rules before entering the next stage.</p></list-item><list-item>
      <p id="d2e3072">Policy stationarity detection: Due to the high variance of DRL exploration, mean-based indicators can be biased by outliers or occasional high returns. Therefore, a relative standard deviation criterion is introduced:<disp-formula id="Ch1.E9" content-type="numbered"><label>9</label><mml:math id="M116" display="block"><mml:mrow><mml:msub><mml:mtext>RSD</mml:mtext><mml:mi mathvariant="normal">W</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:mfrac style="display"><mml:mrow><mml:mi mathvariant="italic">σ</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mi>W</mml:mi><mml:mo>:</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mi mathvariant="italic">μ</mml:mi><mml:mi mathvariant="normal">W</mml:mi></mml:msub><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:mstyle><mml:mo>&lt;</mml:mo><mml:msub><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>stab</mml:mtext></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula><mml:math id="M117" display="inline"><mml:mrow><mml:mi mathvariant="italic">σ</mml:mi><mml:mo>(</mml:mo><mml:mo>⋅</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the standard deviation function, and <inline-formula><mml:math id="M118" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>stab</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> is the preset stationarity tolerance factor, which is set to 0.05 through preliminary tuning experiments to balance sensitivity to policy fluctuations and training efficiency. The dual-gating mechanism effectively avoids misjudgment risks from single indicators through joint constraints of performance and stationarity, ensuring robustness of curriculum stage transitions. 
The complete adaptive curriculum evolution process is detailed in Algorithm 1.</p></list-item></list></p><boxed-text content-type="algorithm" position="float" id="Ch1.Prog1"><label>Algorithm 1</label><caption><p id="d2e3156">Adaptive curriculum evolution algorithm based on stationarity detection.</p></caption><disp-quote content-type="algorithmic" specific-use="numbering{1}"><list>

    <list-item>

      <p id="d2e3163" specific-use="STATE"><bold>Input:</bold> Initial policy network parameters <inline-formula><mml:math id="M119" display="inline"><mml:mrow><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="italic">ϕ</mml:mi></mml:mrow></mml:math></inline-formula>; curriculum environment set <inline-formula><mml:math id="M120" display="inline"><mml:mrow><mml:mi mathvariant="script">C</mml:mi><mml:mo>=</mml:mo><mml:mo mathvariant="italic">{</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mi>K</mml:mi></mml:msub><mml:mo mathvariant="italic">}</mml:mo></mml:mrow></mml:math></inline-formula></p>
              </list-item>

    <list-item>

      <p id="d2e3219" specific-use="STATE"><bold>Output:</bold> Converged optimal parameters <inline-formula><mml:math id="M121" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="italic">θ</mml:mi><mml:mi>k</mml:mi><mml:mo>∗</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> for each stage; final global optimal policy parameters <inline-formula><mml:math id="M122" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="italic">θ</mml:mi><mml:mi>K</mml:mi><mml:mo>∗</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula></p>
              </list-item>

    <list-item>

      <p id="d2e3252" specific-use="STATE"><bold>Initialization:</bold> <inline-formula><mml:math id="M123" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula> – total number of curriculum stages; <inline-formula><mml:math id="M124" display="inline"><mml:mi>W</mml:mi></mml:math></inline-formula> – sliding-window size; <inline-formula><mml:math id="M125" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>perf</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> – performance promotion threshold for stage <inline-formula><mml:math id="M126" display="inline"><mml:mi>k</mml:mi></mml:math></inline-formula>; <inline-formula><mml:math id="M127" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>stab</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> – stationarity (RSD) threshold; <inline-formula><mml:math id="M128" display="inline"><mml:mi>G</mml:mi></mml:math></inline-formula> – sliding-window return sequence.</p>
              </list-item>

    <list-item>

      <p id="d2e3317" specific-use="FOR"><bold>for</bold> curriculum stage <inline-formula><mml:math id="M129" display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> <bold>to</bold> <inline-formula><mml:math id="M130" display="inline"><mml:mi>K</mml:mi></mml:math></inline-formula> <bold>do</bold> <list>
    <list-item>
      <p id="d2e3350" specific-use="STATE">Initialize environment configuration <inline-formula><mml:math id="M131" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="normal">Ω</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>←</mml:mo><mml:mtext>Config</mml:mtext><mml:mo>(</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>; <bold>if</bold> <inline-formula><mml:math id="M132" display="inline"><mml:mrow><mml:mi>k</mml:mi><mml:mo>&gt;</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:math></inline-formula> <bold>then</bold> <inline-formula><mml:math id="M133" display="inline"><mml:mrow><mml:mi mathvariant="italic">θ</mml:mi><mml:mo>←</mml:mo><mml:msubsup><mml:mi mathvariant="italic">θ</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow><mml:mo>∗</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula></p></list-item>
    <list-item>
      <p id="d2e3418" specific-use="WHILE"><bold>while</bold> True <bold>do</bold> <list>
    <list-item>
      <p id="d2e3429" specific-use="STATE">Perform one PPO training iteration; obtain current episode total return <inline-formula><mml:math id="M134" display="inline"><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula></p></list-item>
    <list-item>
      <p id="d2e3444" specific-use="STATE">Update sliding-window return sequence <inline-formula><mml:math id="M135" display="inline"><mml:mi>G</mml:mi></mml:math></inline-formula>; compute current window mean return <inline-formula><mml:math id="M136" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">μ</mml:mi><mml:mi mathvariant="normal">W</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="double-struck">E</mml:mi><mml:mo>[</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mi>W</mml:mi><mml:mo>:</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></p></list-item>
    <list-item>
      <p id="d2e3488" specific-use="STATE">Compute current policy relative standard deviation <inline-formula><mml:math id="M137" display="inline"><mml:mrow><mml:msub><mml:mtext>RSD</mml:mtext><mml:mi mathvariant="normal">W</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="italic">σ</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mi>W</mml:mi><mml:mo>:</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo><mml:mo>/</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mi mathvariant="italic">μ</mml:mi><mml:mi mathvariant="normal">W</mml:mi></mml:msub><mml:mo>|</mml:mo></mml:mrow></mml:math></inline-formula></p></list-item>
    <list-item>
      <p id="d2e3536" specific-use="IF"><bold>if</bold> <inline-formula><mml:math id="M138" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">μ</mml:mi><mml:mi mathvariant="normal">W</mml:mi></mml:msub><mml:mo>&gt;</mml:mo><mml:msub><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>perf</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> <bold>and</bold> <inline-formula><mml:math id="M139" display="inline"><mml:mrow><mml:msub><mml:mtext>RSD</mml:mtext><mml:mi mathvariant="normal">W</mml:mi></mml:msub><mml:mo>&lt;</mml:mo><mml:msub><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>stab</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula> <bold>then</bold> <list>
    <list-item>
      <p id="d2e3592" specific-use="STATE">Save current converged optimal policy parameters <inline-formula><mml:math id="M140" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="italic">θ</mml:mi><mml:mi>k</mml:mi><mml:mo>∗</mml:mo></mml:msubsup><mml:mo>←</mml:mo><mml:mi mathvariant="italic">θ</mml:mi></mml:mrow></mml:math></inline-formula></p></list-item>
    <list-item>
      <p id="d2e3613" specific-use="STATE"><bold>Break</bold> (promotion condition satisfied; proceed to next stage <inline-formula><mml:math id="M141" display="inline"><mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn mathvariant="normal">1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:math></inline-formula>)</p></list-item></list></p></list-item>
    <list-item>
      <p id="d2e3636" specific-use="ENDIF"><bold>end</bold> <bold>if</bold></p></list-item></list></p></list-item>
    <list-item>
      <p id="d2e3645" specific-use="ENDWHILE"><bold>end</bold> <bold>while</bold></p></list-item></list></p>
              </list-item>

    <list-item>

      <p id="d2e3655" specific-use="ENDFOR"><bold>end</bold> <bold>for</bold></p>
              </list-item>
            </list></disp-quote></boxed-text>
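      <p>The promotion test inside the loop of Algorithm 1 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the check is deferred until the window holds <monospace>W</monospace> returns, the population standard deviation is used, and the performance threshold and the return sequences are hypothetical values chosen for illustration. Returns are negative because, by Eq. (7), an episode return equals the negative final makespan.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import deque
from statistics import mean, pstdev


def should_promote(window, w, delta_perf, delta_stab):
    """Dual-gating check of Eqs. (8) and (9) on a sliding window of returns.

    Assumptions of this sketch: the check is deferred until the window holds
    w returns, and the population standard deviation is used for sigma.
    """
    if len(window) < w:
        return False
    mu = mean(window)
    if mu == 0:
        return False  # RSD is undefined for a zero mean
    rsd = pstdev(window) / abs(mu)
    return mu > delta_perf and rsd < delta_stab


W = 20               # sliding-window size from the text
DELTA_STAB = 0.05    # stationarity tolerance from the text
DELTA_PERF = -100.0  # hypothetical threshold (negative heuristic makespan)

window = deque(maxlen=W)  # the oldest return is dropped automatically

# Hypothetical returns: stable episodes around -90 pass both gates.
for g in [-90.0, -91.0, -89.0, -90.5] * 5:
    window.append(g)
promote_stable = should_promote(window, W, DELTA_PERF, DELTA_STAB)

# Noisy episodes with the same mean level fail the stationarity gate.
window.clear()
for g in [-60.0, -120.0] * 10:
    window.append(g)
promote_noisy = should_promote(window, W, DELTA_PERF, DELTA_STAB)

print(promote_stable, promote_noisy)  # True False
```
]]></preformat>
      <p>In the second case the window mean (-90) clears the performance gate, but the relative standard deviation (about 0.33) exceeds the tolerance, so promotion is withheld. Clearing the window at each stage transition is a further design choice of this sketch that Algorithm 1 does not specify.</p>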
</sec>
</sec>
<sec id="Ch1.S4.SS3">
  <label>4.3</label><title>Hierarchical proximal policy optimization algorithm</title>
      <p id="d2e3673">This paper adopts a hierarchical proximal policy optimization (H-PPO) algorithm based on the actor–critic architecture, with feasible-domain hard-constraint projection and a global collaborative loss design. At the network level, the agents at each level of the hierarchy use structurally symmetric actor–critic networks. The actor network employs a multilayer perceptron (MLP) to extract state features and adds a hard-constraint projection layer to ensure physical feasibility. Using the dynamic mask <inline-formula><mml:math id="M142" display="inline"><mml:mrow><mml:msub><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> defined in Sect. 4.1.2, log-domain renormalization is applied to the output logits <inline-formula><mml:math id="M143" display="inline"><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, with <inline-formula><mml:math id="M144" display="inline"><mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mi mathvariant="normal">t</mml:mi><mml:mo>′</mml:mo></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi>log⁡</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math display="inline"><mml:mi mathvariant="italic">ϵ</mml:mi></mml:math></inline-formula> is a small positive constant that keeps the logarithm finite. After the softmax, invalid actions receive negligible probability, which suppresses gradient propagation along infeasible paths and thereby improves search efficiency in constrained spaces.</p>
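      <p>The log-domain projection described above can be illustrated with a short, dependency-free sketch. The function name <monospace>masked_softmax</monospace>, the value of the small constant, and the example logits are assumptions made for illustration; an actual implementation would typically operate on batched tensors.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def masked_softmax(logits, mask, eps=1e-8):
    """Softmax over z' = z + log(mask + eps), with a binary mask.

    eps is a small constant (value assumed here) that keeps log() finite;
    masked entries receive a very large negative logit.
    """
    shifted = [z + math.log(m + eps) for z, m in zip(logits, mask)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four actions; the last two are infeasible.
probs = masked_softmax([1.0, 2.0, 3.0, 0.5], [1, 1, 0, 0])
print([round(p, 4) for p in probs])  # [0.2689, 0.7311, 0.0, 0.0]
```
]]></preformat>
      <p>Note that the third action has the largest raw logit but is removed by the mask. With a finite constant the masked probabilities are tiny rather than exactly zero; replacing the masked logits with negative infinity would make them exactly zero.</p>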
      <p id="d2e3735">The training process follows a paradigm of decentralized execution with centralized evaluation. All agents share the unified differential reward <inline-formula><mml:math id="M145" display="inline"><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> defined in Sect. 4.1.3, and local policies are jointly optimized for makespan minimization. Parameter updates use the following composite loss function:

            <disp-formula id="Ch1.E10" content-type="numbered"><label>10</label><mml:math id="M146" display="block"><mml:mtable rowspacing="0.2ex" class="split" displaystyle="true" columnalign="right left"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mtext>Total</mml:mtext></mml:msub><mml:mo>=</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mspace linebreak="nobreak" width="0.25em"/><mml:mo>-</mml:mo><mml:msub><mml:mi mathvariant="double-struck">E</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>[</mml:mo><mml:mo movablelimits="false">min⁡</mml:mo><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="italic">ρ</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:msub><mml:mi>A</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mtext>clip</mml:mtext><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="italic">ρ</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>-</mml:mo><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>,</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>+</mml:mo><mml:mi mathvariant="italic">ϵ</mml:mi><mml:mo>)</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd/><mml:mtd><mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub><mml:msub><mml:mi>L</mml:mi><mml:mtext>VF</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:mi mathvariant="italic">ϕ</mml:mi><mml:mo>)</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub><mml:mi>S</mml:mi><mml:mo>[</mml:mo><mml:msub><mml:mi mathvariant="italic">π</mml:mi><mml:mi mathvariant="italic">θ</mml:mi></mml:msub><mml:mo>]</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>

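          where <inline-formula><mml:math id="M147" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">ρ</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> denotes the probability ratio between the new and old policies, <inline-formula><mml:math id="M148" display="inline"><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi mathvariant="normal">t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> denotes the advantage function, <inline-formula><mml:math display="inline"><mml:mi mathvariant="italic">ϵ</mml:mi></mml:math></inline-formula> is the clipping parameter (distinct from the constant used in the logit projection), <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mtext>VF</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:mi mathvariant="italic">ϕ</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is the value-function loss, <inline-formula><mml:math id="M149" display="inline"><mml:mrow><mml:mi>S</mml:mi><mml:mo>[</mml:mo><mml:msub><mml:mi mathvariant="italic">π</mml:mi><mml:mi mathvariant="italic">θ</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> denotes policy entropy, and <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math display="inline"><mml:mrow><mml:msub><mml:mi>c</mml:mi><mml:mn mathvariant="normal">2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> are weighting coefficients.</p>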
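      <p>To make the three terms of Eq. (10) concrete, the following sketch evaluates the composite loss on a small batch in plain Python. The squared-error form of the value loss, the default coefficients, and the two-sample batch are common PPO choices assumed here for illustration; they are not values reported in this paper, and a PyTorch implementation would compute the same quantities on tensors.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def ppo_total_loss(ratios, advantages, value_preds, value_targets,
                   entropies, clip_eps=0.2, c1=0.5, c2=0.01):
    """Batch estimate of Eq. (10) from per-sample quantities.

    The squared-error form of L_VF and the default coefficients are
    common PPO choices assumed for this sketch, not values from the paper.
    """
    n = len(ratios)
    surrogate = 0.0
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate += min(rho * adv, clipped * adv)
    policy_loss = -surrogate / n
    value_loss = sum((v - g) ** 2
                     for v, g in zip(value_preds, value_targets)) / n
    entropy = sum(entropies) / n
    return policy_loss + c1 * value_loss - c2 * entropy


# Hypothetical two-sample batch.
loss = ppo_total_loss(
    ratios=[1.5, 0.5],
    advantages=[1.0, -1.0],
    value_preds=[0.0, 1.0],
    value_targets=[1.0, 1.0],
    entropies=[1.0, 1.0],
)
print(round(loss, 4))  # 0.04
```
]]></preformat>
      <p>In this batch the first ratio is clipped from 1.5 to 1.2 and the second sample contributes the clipped, more pessimistic term (-0.8), so the policy term is -0.2; adding 0.5 times the value loss (0.5) and subtracting 0.01 times the entropy (1.0) gives 0.04.</p>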
</sec>
</sec>
<sec id="Ch1.S5">
  <label>5</label><title>Experimental results and analysis</title>
      <p id="d2e3923">The proposed method is implemented with PyTorch and runs on a workstation equipped with an AMD Ryzen 7 8845H CPU, 16 <inline-formula><mml:math id="M150" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">GB</mml:mi></mml:mrow></mml:math></inline-formula> RAM, and an NVIDIA GeForce RTX 4060 Laptop GPU with 8 <inline-formula><mml:math id="M151" display="inline"><mml:mrow class="unit"><mml:mi mathvariant="normal">GB</mml:mi></mml:mrow></mml:math></inline-formula> VRAM. All random seeds are fixed to ensure reproducibility, and each reported result is averaged over 100 independently sampled test instances per scale configuration.</p>
<sec id="Ch1.S5.SS1">
  <label>5.1</label><title>Training efficiency and curriculum efficacy validation</title>
      <p id="d2e3949">To evaluate the effect of the curriculum mechanism, two training settings are considered: progressive training with CH-MADRL and direct training with standard DRL. CH-MADRL is trained from <inline-formula><mml:math id="M152" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="script">E</mml:mi><mml:mn mathvariant="normal">1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> (10 <inline-formula><mml:math id="M153" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 <inline-formula><mml:math id="M154" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 3) and promoted stage by stage under the dual-gating criteria defined by <inline-formula><mml:math id="M155" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>perf</mml:mtext></mml:msub><mml:mo>(</mml:mo><mml:mi>k</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M156" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">δ</mml:mi><mml:mtext>stab</mml:mtext></mml:msub></mml:mrow></mml:math></inline-formula>, with parameter warm start used at each stage transition. Standard DRL is trained end to end on the target scale 20 <inline-formula><mml:math id="M157" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M158" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5.</p>

      <fig id="F3" specific-use="star"><label>Figure 3</label><caption><p id="d2e4022">Training efficiency and curriculum efficacy validation: <bold>(a)</bold> convergence curve comparison between CH-MADRL and standard DRL; <bold>(b)</bold> training convergence comparison between mixed-dimension expansion and single-dimension expansion strategies.</p></caption>
          <graphic xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026-f03.png"/>

        </fig>

      <p id="d2e4037">Supply chain networks involve coupled order, resource, and logistics flows, making curriculum design an important factor in training efficiency. Two curriculum expansion strategies are compared: single-dimension expansion, which increases only the number of orders while keeping the manufacturer scale and logistics capacity fixed, and mixed-dimension expansion, which jointly increases order quantity, manufacturer scale, and logistics capacity.</p>
      <p id="d2e4041">As illustrated in Fig. <xref ref-type="fig" rid="F3"/>a, standard DRL shows pronounced oscillation in the early training stage and converges to a suboptimal makespan of about 1500, reflecting cold-start difficulty. CH-MADRL exhibits brief performance drops near stage transition points but recovers quickly after each transition and converges to a lower makespan. Figure <xref ref-type="fig" rid="F3"/>b shows that mixed-dimension expansion leads to smoother stage transitions and faster convergence, while single-dimension expansion produces noticeable performance drops when later stages introduce abrupt parameter changes.</p>
</sec>
<sec id="Ch1.S5.SS2">
  <label>5.2</label><title>Overall scheduling performance comparison</title>
      <p id="d2e4056">To evaluate the scalability of the framework, experiments are conducted at six progressively larger scales from 10 <inline-formula><mml:math id="M159" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M160" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 to 20 <inline-formula><mml:math id="M161" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M162" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5, with 100 independent random test instances generated for each configuration. To simulate dynamic disturbances and resource heterogeneity in supply chains, the key environment parameters are independently sampled from uniform distributions: <list list-type="custom"><list-item><label>1.</label>
      <p id="d2e4089">Dynamic order arrivals: Release times <inline-formula><mml:math id="M163" display="inline"><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>∼</mml:mo><mml:mi mathvariant="script">U</mml:mi><mml:mo>[</mml:mo><mml:mn mathvariant="normal">30</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">300</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>.</p></list-item><list-item><label>2.</label>
      <p id="d2e4118">Resource capability fluctuations: Manufacturer processing quality <inline-formula><mml:math id="M164" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> and logistics transportation quality <inline-formula><mml:math id="M165" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">v</mml:mi></mml:msub><mml:mo>∼</mml:mo><mml:mi mathvariant="script">U</mml:mi><mml:mo>[</mml:mo><mml:mn mathvariant="normal">0.1</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">1.0</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>.</p></list-item><list-item><label>3.</label>
      <p id="d2e4158">Logistics time-lag distribution: Inter-node transportation times <inline-formula><mml:math id="M166" display="inline"><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mtext>trans</mml:mtext></mml:msub><mml:mo>∼</mml:mo><mml:mi mathvariant="script">U</mml:mi><mml:mo>[</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">10</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>.</p></list-item></list></p>
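      <p>As an illustration of the sampling scheme above, the following Python sketch draws random test instances. It assumes that a scale such as 20 × 10 × 5 denotes orders × manufacturers × logistics vehicles, that the manufacturer quality is drawn from the same U[0.1, 1.0] range as the transportation quality, and that transportation times form a matrix between manufacturer nodes with a zero diagonal. The names <monospace>Instance</monospace> and <monospace>sample_instance</monospace> are hypothetical and are not taken from the authors' implementation.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    release_times: List[float]          # A_i, one per order
    manufacturer_quality: List[float]   # q_j, one per manufacturer
    vehicle_quality: List[float]        # q_v, one per logistics vehicle
    transport_times: List[List[float]]  # t_trans between manufacturer nodes


def sample_instance(n_orders: int, n_manufacturers: int, n_vehicles: int,
                    rng: random.Random) -> Instance:
    """Draw one test instance from the uniform ranges given in the text."""
    release = [rng.uniform(30, 300) for _ in range(n_orders)]
    q_j = [rng.uniform(0.1, 1.0) for _ in range(n_manufacturers)]
    q_v = [rng.uniform(0.1, 1.0) for _ in range(n_vehicles)]
    # Assumption: one independent draw per ordered node pair, zero on the diagonal.
    t_trans = [[0.0 if a == b else rng.uniform(1, 10)
                for b in range(n_manufacturers)]
               for a in range(n_manufacturers)]
    return Instance(release, q_j, q_v, t_trans)


# 100 independent instances at the largest scale, seeded for reproducibility.
rng = random.Random(0)
instances = [sample_instance(20, 10, 5, rng) for _ in range(100)]
```
]]></preformat>
      <p>Passing an explicit, seeded generator keeps the 100 instances per configuration reproducible, so that every algorithm can be evaluated on the same test set.</p>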
      <p id="d2e4186">The average makespan comparison of various algorithms across different scales is presented in Fig. <xref ref-type="fig" rid="F4"/>. Results demonstrate that CH-MADRL maintains performance advantages across all test scales. Particularly in the highest-complexity 20 <inline-formula><mml:math id="M167" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M168" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 large-scale scenario, CH-MADRL converges to an average makespan of 1764.82, representing a 2.3 % reduction compared with standard DRL (1806.67), validating the scalability of the proposed method in high-dimensional state spaces.</p>

      <fig id="F4"><label>Figure 4</label><caption><p id="d2e4207">Performance trend analysis of CH-MADRL versus standard DRL in multi-scale scalability tests.</p></caption>
          <graphic xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026-f04.png"/>

        </fig>

      <p id="d2e4217">The reliability of supply chain scheduling solutions is typically measured through statistical dispersion of the results. The boxplot distribution in Fig. <xref ref-type="fig" rid="F5"/> further reveals that CH-MADRL solutions exhibit more compact distribution patterns with fewer outliers. Specifically, at the 20 <inline-formula><mml:math id="M169" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M170" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 scale, CH-MADRL achieves a makespan standard deviation of 257.36, lower than that of the baseline (272.58), representing a 5.6 % reduction.</p>
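      <p>The dispersion measures discussed above can be computed from per-instance makespans as in the following minimal sketch. It assumes the sample standard deviation and the common boxplot convention of flagging points beyond 1.5 times the interquartile range; the paper does not state which conventions were used, and the function name and toy data are hypothetical.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import statistics


def dispersion_summary(makespans):
    """Return mean, sample standard deviation, and boxplot-style outlier count."""
    # Quartiles with the default 'exclusive' method of statistics.quantiles.
    q1, _, q3 = statistics.quantiles(makespans, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [m for m in makespans if m < low or m > high]
    return {
        "mean": statistics.fmean(makespans),
        "std": statistics.stdev(makespans),
        "n_outliers": len(outliers),
    }


# Toy data (not from the paper): eight similar makespans and one extreme value.
summary = dispersion_summary([10, 11, 12, 13, 14, 15, 16, 17, 100])
print(summary["n_outliers"])
```
]]></preformat>
      <p>Plotting libraries may interpolate quartiles differently, so outlier counts read from a figure can differ slightly from those given by this rule.</p>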

      <fig id="F5"><label>Figure 5</label><caption><p id="d2e4238">Boxplot comparison of makespan distributions across different scales for CH-MADRL versus standard DRL.</p></caption>
          <graphic xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026-f05.png"/>

        </fig>

</sec>
<sec id="Ch1.S5.SS3">
  <label>5.3</label><title>Horizontal comparison with baseline algorithms</title>
      <p id="d2e4255">To comprehensively evaluate the performance advantages of CH-MADRL, this subsection conducts comparative experiments with three representative algorithm categories:</p>
      <p id="d2e4258"><list list-type="custom">
            <list-item><label>1.</label>

      <p id="d2e4263">Composite heuristic rules (rules 1–4): Four composite greedy strategies are constructed to reflect the multi-stage collaborative characteristics of supply chains. Each of four classical order priority rules (EST, MOPNR, SPT, MTWR) is combined with manufacturer and logistics assignment rules based on shortest processing time (SPT), and the resulting strategies serve as locally optimal baselines.</p>
            </list-item>
            <list-item><label>2.</label>

      <p id="d2e4269">Metaheuristic algorithm (GA): Standard GA is introduced as a classical global search baseline to evaluate whether deep reinforcement learning methods can approach or surpass traditional evolutionary computation methods in solution quality.</p>
            </list-item>
            <list-item><label>3.</label>

      <p id="d2e4275">Deep reinforcement learning baselines (A2C, DRL <inline-formula><mml:math id="M171" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula> IL): The on-policy algorithm A2C is introduced to validate the advantages of the PPO architecture adopted in this paper regarding update stability. DRL <inline-formula><mml:math id="M172" display="inline"><mml:mo>+</mml:mo></mml:math></inline-formula> IL, which incorporates expert demonstrations, is introduced to test whether the proposed curriculum-learning mechanism helps agents move beyond the limits of expert experience and explore globally superior strategies more effectively than passive imitation learning.</p>
            </list-item>
          </list></p>
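      <p>The composite rules in item 1 can be illustrated with a deliberately simplified sketch. It pairs an SPT order priority with an SPT manufacturer assignment in a single-stage setting and ignores logistics vehicles and multi-operation orders. The function name, data layout, and toy numbers are assumptions for illustration, not the authors' rule definitions.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def spt_spt_schedule(release, proc):
    """Greedy single-stage schedule.

    release[i]: release time of order i.
    proc[i][j]: processing time of order i at manufacturer j.
    Returns (schedule, makespan), where schedule holds
    (order, manufacturer, start, end) tuples.
    """
    n_manufacturers = len(proc[0])
    free_at = [0.0] * n_manufacturers
    # Order priority rule: shortest best-case processing time first.
    order_sequence = sorted(range(len(release)), key=lambda i: min(proc[i]))
    schedule = []
    for i in order_sequence:
        # Assignment rule: manufacturer with the shortest processing time,
        # regardless of how long that manufacturer is already booked.
        j = min(range(n_manufacturers), key=lambda k: proc[i][k])
        start = max(release[i], free_at[j])
        end = start + proc[i][j]
        free_at[j] = end
        schedule.append((i, j, start, end))
    makespan = max(entry[3] for entry in schedule)
    return schedule, makespan


# Toy instance: three orders, two manufacturers.
schedule, makespan = spt_spt_schedule(
    release=[0, 0, 5],
    proc=[[4, 6], [3, 2], [5, 1]],
)
print(makespan)  # 8
```
]]></preformat>
      <p>In this toy case, order 1 waits for manufacturer 1 although manufacturer 0 is idle when the decision is made, which yields a makespan of 8 where 7 is attainable. This myopia is why such rules serve only as locally optimal baselines.</p>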
      <p id="d2e4294">The three-dimensional bar chart in Fig. <xref ref-type="fig" rid="F6"/> shows the makespan obtained by each algorithm on the first 20 instances at the 20 <inline-formula><mml:math id="M173" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M174" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 scale. CH-MADRL achieves the lowest makespan in the vast majority of instances, outperforming rule algorithms based on local greedy strategies.</p>

      <fig id="F6"><label>Figure 6</label><caption><p id="d2e4316">Makespan comparison of all methods for 20 test instances at 20 <inline-formula><mml:math id="M175" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M176" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 scale.</p></caption>
          <graphic xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026-f06.png"/>

        </fig>

      <p id="d2e4339">Figure <xref ref-type="fig" rid="F7"/> presents boxplots of the solution quality distributions of the compared algorithms across the test sets. The results indicate that CH-MADRL holds an advantage in the vast majority of test scenarios, with the lowest mean makespan, the most compact box range, and the fewest outliers.</p>

      <fig id="F7"><label>Figure 7</label><caption><p id="d2e4346">Statistical boxplot analysis of different algorithms at 20 <inline-formula><mml:math id="M177" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M178" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 scale.</p></caption>
          <graphic xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026-f07.png"/>

        </fig>

</sec>
<sec id="Ch1.S5.SS4">
  <label>5.4</label><title>Zero-shot generalization capability testing</title>
      <p id="d2e4378">To evaluate the model's zero-shot cross-domain generalization, the converged models from Sect. 5.1, trained at the 20 <inline-formula><mml:math id="M179" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M180" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 scale, are transferred directly, without fine-tuning, to four unseen problem scales ranging from 13 <inline-formula><mml:math id="M181" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 6 <inline-formula><mml:math id="M182" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 4 to 30 <inline-formula><mml:math id="M183" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 15 <inline-formula><mml:math id="M184" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 7. The tests cover two scenarios: a fixed-order scenario, in which the order quantity remains constant to assess scalability under expanded dimensions, and a random-order scenario, in which the order quantity fluctuates to assess robustness under dynamic loads.</p>

      <fig id="F8" specific-use="star"><label>Figure 8</label><caption><p id="d2e4426">Zero-shot generalization performance evaluation at different unseen problem scales: <bold>(a)</bold> fixed-order-quantity scenario; <bold>(b)</bold> random-order-quantity scenario.</p></caption>
          <graphic xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026-f08.png"/>

        </fig>

<table-wrap id="T2" specific-use="star"><label>Table 2</label><caption><p id="d2e4444">Zero-shot generalization performance comparison at different unseen problem scales.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="5">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:colspec colnum="3" colname="col3" align="left"/>
     <oasis:colspec colnum="4" colname="col4" align="left"/>
     <oasis:colspec colnum="5" colname="col5" align="left"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1">Problem scale</oasis:entry>
         <oasis:entry colname="col2">Scenario</oasis:entry>
         <oasis:entry colname="col3">Standard DRL</oasis:entry>
         <oasis:entry colname="col4">CH-MADRL (ours)</oasis:entry>
         <oasis:entry colname="col5">Improvement</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">13 <inline-formula><mml:math id="M185" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 6 <inline-formula><mml:math id="M186" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 4</oasis:entry>
         <oasis:entry colname="col2">Fixed</oasis:entry>
         <oasis:entry colname="col3">1205.00 <inline-formula><mml:math id="M187" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 258.48</oasis:entry>
         <oasis:entry colname="col4">1087.38 <inline-formula><mml:math id="M188" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 202.11</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M189" display="inline"><mml:mo>↑</mml:mo></mml:math></inline-formula> 9.76 %</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2">Random</oasis:entry>
         <oasis:entry colname="col3">1103.47 <inline-formula><mml:math id="M190" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 217.78</oasis:entry>
         <oasis:entry colname="col4">1001.62 <inline-formula><mml:math id="M191" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 205.49</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M192" display="inline"><mml:mo>↑</mml:mo></mml:math></inline-formula> 9.23 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">17 <inline-formula><mml:math id="M193" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 8 <inline-formula><mml:math id="M194" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5</oasis:entry>
         <oasis:entry colname="col2">Fixed</oasis:entry>
         <oasis:entry colname="col3">1551.18 <inline-formula><mml:math id="M195" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 321.50</oasis:entry>
         <oasis:entry colname="col4">1523.32 <inline-formula><mml:math id="M196" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 291.68</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M197" display="inline"><mml:mo>↑</mml:mo></mml:math></inline-formula> 1.80 %</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2">Random</oasis:entry>
         <oasis:entry colname="col3">1300.97 <inline-formula><mml:math id="M198" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 256.79</oasis:entry>
         <oasis:entry colname="col4">1259.36 <inline-formula><mml:math id="M199" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 255.15</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M200" display="inline"><mml:mo>↑</mml:mo></mml:math></inline-formula> 3.20 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">25 <inline-formula><mml:math id="M201" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 12 <inline-formula><mml:math id="M202" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 6</oasis:entry>
         <oasis:entry colname="col2">Fixed</oasis:entry>
         <oasis:entry colname="col3">2749.87 <inline-formula><mml:math id="M203" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 962.86</oasis:entry>
         <oasis:entry colname="col4">2710.55 <inline-formula><mml:math id="M204" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 466.19</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M205" display="inline"><mml:mo>↑</mml:mo></mml:math></inline-formula> 1.43 %</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2">Random</oasis:entry>
         <oasis:entry colname="col3">2270.23 <inline-formula><mml:math id="M206" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 831.72</oasis:entry>
         <oasis:entry colname="col4">2235.53 <inline-formula><mml:math id="M207" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 541.18</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M208" display="inline"><mml:mo>↑</mml:mo></mml:math></inline-formula> 1.53 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">30 <inline-formula><mml:math id="M209" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 15 <inline-formula><mml:math id="M210" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 7</oasis:entry>
         <oasis:entry colname="col2">Fixed</oasis:entry>
         <oasis:entry colname="col3">3674.75 <inline-formula><mml:math id="M211" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 1060.23</oasis:entry>
         <oasis:entry colname="col4">3615.21 <inline-formula><mml:math id="M212" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 582.97</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M213" display="inline"><mml:mo>↑</mml:mo></mml:math></inline-formula> 1.62 %</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1"/>
         <oasis:entry colname="col2">Random</oasis:entry>
         <oasis:entry colname="col3">2941.36 <inline-formula><mml:math id="M214" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 1004.26</oasis:entry>
         <oasis:entry colname="col4">2875.22 <inline-formula><mml:math id="M215" display="inline"><mml:mo>±</mml:mo></mml:math></inline-formula> 737.65</oasis:entry>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M216" display="inline"><mml:mo>↑</mml:mo></mml:math></inline-formula> 2.25 %</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p id="d2e4845">As illustrated in Fig. <xref ref-type="fig" rid="F8"/> and Table <xref ref-type="table" rid="T2"/>, CH-MADRL consistently outperforms standard DRL in both mean makespan and standard deviation as problem dimensionality expands stepwise. Specifically, for the in-distribution interpolation tests (exemplified by the 13 <inline-formula><mml:math id="M217" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 6 <inline-formula><mml:math id="M218" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 4 scale), the model achieves performance improvements of 9.76 % and 9.23 % under fixed- and random-order conditions, respectively, indicating that curriculum learning effectively drives agents to extract general representations of the underlying collaborative logic rather than overfitting to specific training distributions. Furthermore, in the out-of-distribution extrapolation tests (the 30 <inline-formula><mml:math id="M219" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 15 <inline-formula><mml:math id="M220" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 7 scenario, which exceeds the training scale), CH-MADRL maintains performance advantages of 1.62 % and 2.25 %, with standard deviations consistently below those of standard DRL. These results indicate that the proposed framework retains high policy stability and structural generalization capability at problem scales outside the training range.</p>
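      <p>The Improvement column of Table 2 is consistent with the relative reduction in mean makespan, 100 × (standard DRL − CH-MADRL) / standard DRL. The following sketch recomputes it from the reported means; the definition is inferred from the tabulated numbers rather than stated in the text, and the variable names are hypothetical.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
# (scale, scenario, standard DRL mean, CH-MADRL mean, reported improvement in %)
TABLE_2 = [
    ("13x6x4", "Fixed", 1205.00, 1087.38, 9.76),
    ("13x6x4", "Random", 1103.47, 1001.62, 9.23),
    ("17x8x5", "Fixed", 1551.18, 1523.32, 1.80),
    ("17x8x5", "Random", 1300.97, 1259.36, 3.20),
    ("25x12x6", "Fixed", 2749.87, 2710.55, 1.43),
    ("25x12x6", "Random", 2270.23, 2235.53, 1.53),
    ("30x15x7", "Fixed", 3674.75, 3615.21, 1.62),
    ("30x15x7", "Random", 2941.36, 2875.22, 2.25),
]


def improvement_pct(baseline: float, proposed: float) -> float:
    """Relative makespan reduction of `proposed` versus `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


computed = [improvement_pct(base, ours) for _, _, base, ours, _ in TABLE_2]
for (scale, scenario, _, _, reported), value in zip(TABLE_2, computed):
    print(f"{scale:8s} {scenario:6s} computed={value:5.2f} %  reported={reported:5.2f} %")
```
]]></preformat>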

      <fig id="F9" specific-use="star"><label>Figure 9</label><caption><p id="d2e4883">Microscopic scheduling behavior visualization comparison between CH-MADRL and rule methods at 20 <inline-formula><mml:math id="M221" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M222" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 scale: <bold>(a)</bold> rule-based method; <bold>(b)</bold> CH-MADRL.</p></caption>
          <graphic xlink:href="https://ms.copernicus.org/articles/17/671/2026/ms-17-671-2026-f09.png"/>

        </fig>

</sec>
<sec id="Ch1.S5.SS5">
  <label>5.5</label><title>Interpretability and microscopic behavior analysis</title>
      <p id="d2e4920">To examine the decision behavior of the agents at a finer scale, Fig. <xref ref-type="fig" rid="F9"/> compares the Gantt-chart schedules produced by the two methods for a 20 <inline-formula><mml:math id="M223" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 10 <inline-formula><mml:math id="M224" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 5 instance. The traditional rule-based method, shown in Fig. <xref ref-type="fig" rid="F9"/>a, is limited by locally greedy decisions and does not account for the cumulative effects of cross-node logistics delays. As a result, manufacturing and logistics flows become temporally misaligned, leading to task fragmentation and resource idleness. By contrast, CH-MADRL, shown in Fig. <xref ref-type="fig" rid="F9"/>b, produces a more compact spatiotemporal schedule. Through time window coordination and dynamic load balancing, the agents improve the alignment between processing, transportation, and subsequent processing stages, avoid node congestion, and reduce the overall makespan.</p>
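      <p>Resource idleness of the kind visible in a Gantt chart can be quantified as the part of the scheduling horizon during which a resource is not busy. The sketch below assumes a schedule given as (resource, start, end) tuples with no overlapping tasks on the same resource; this data layout, the function name, and the toy schedule are hypothetical and are not drawn from the paper.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from collections import defaultdict


def idle_time_report(schedule):
    """Return (makespan, idle time per resource) for a Gantt-style schedule.

    schedule: iterable of (resource, start, end) tuples.
    Idle time is measured over the horizon [0, makespan].
    """
    tasks = list(schedule)
    makespan = max(end for _, _, end in tasks)
    busy = defaultdict(float)
    for resource, start, end in tasks:
        busy[resource] += end - start
    idle = {resource: makespan - total for resource, total in busy.items()}
    return makespan, idle


# Toy schedule: manufacturer M1 waits for vehicle V1 between its two tasks.
makespan, idle = idle_time_report([
    ("M1", 0, 4),
    ("V1", 4, 6),
    ("M1", 6, 9),
])
print(makespan, idle)
```
]]></preformat>
      <p>Under these assumptions, a more compact schedule shows up as both a smaller makespan and smaller idle totals on the manufacturing and logistics resources.</p>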
</sec>
</sec>
<sec id="Ch1.S6" sec-type="conclusions">
  <label>6</label><title>Conclusions</title>
      <p id="d2e4952">This paper proposes CH-MADRL for collaborative scheduling in complex supply chain networks. A hierarchical multi-agent Markov decision process is constructed for the joint optimization of order assignment, heterogeneous manufacturer selection, and logistics vehicle scheduling. To improve training under dynamic constraints, a constraint-progressive adaptive curriculum is introduced, together with a stage-conditioned dynamic masking mechanism and a dual-gated promotion strategy. Experimental results show that CH-MADRL achieves better convergence, lower makespan, and stronger zero-shot generalization across different problem scales.</p>
      <p id="d2e4955">The current work still has several limitations. Experimental evaluation is conducted in simulated environments with predefined scale progression. Although the framework incorporates dynamic order arrivals and resource quality fluctuations, it does not fully capture all disruption patterns encountered in real supply chain systems, such as supplier defaults and logistics network interruptions. Extending the framework to handle such real-world disruptions remains an important direction for future work. In addition, the current framework considers only makespan minimization and does not address other objectives, such as operational cost, carbon emissions, and delivery reliability. Future research will extend CH-MADRL to multi-objective scheduling and explore graph neural-network-based representations to improve scalability in larger and more complex supply chain scenarios.</p>
</sec>

      
      </body>
    <back><notes notes-type="dataavailability"><title>Data availability</title>

      <p id="d2e4962">Data will be made available on reasonable request to the corresponding author.</p>
  </notes><notes notes-type="authorcontribution"><title>Author contributions</title>

      <p id="d2e4968">Jingya Dong conceptualized this study, developed the methodology, and wrote the original draft. Han Zhao implemented the software and contributed to validation. Suyi Zhao curated the data and contributed to visualization. Yijie Wang conducted the investigation and validation. Mengfan Guo conducted the investigation and revised the paper. Chunhe Song acquired the funding and provided supervision. Mingliang Xu provided project administration.</p>
  </notes><notes notes-type="competinginterests"><title>Competing interests</title>

      <p id="d2e4974">The contact author has declared that none of the authors has any competing interests.</p>
  </notes><notes notes-type="disclaimer"><title>Disclaimer</title>

      <p id="d2e4980">Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. The authors bear the ultimate responsibility for providing appropriate place names. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.</p>
  </notes><ack><title>Acknowledgements</title><p id="d2e4986">During the initial preparation of this paper, the authors used ChatGPT and Kimi for language polishing and readability improvement. The authors reviewed and revised all AI-assisted outputs and take full responsibility for the final content.</p></ack><notes notes-type="financialsupport"><title>Financial support</title>

      <p id="d2e4991">This work was supported in part by the National Key R&amp;D Program of China under grant no. 2024YFB3311600, in part by the Key Science and Technology Research Project of Henan Province under grant no. 252102211074, and in part by the Key Scientific Research Projects of Henan Higher Education Institutions under grant no. 25A520012.</p>
  </notes><notes notes-type="reviewstatement"><title>Review statement</title>

      <p id="d2e4997">This paper was edited by Pengyuan Zhao and reviewed by two anonymous referees.</p>
  </notes><ref-list>
    <title>References</title>

      <ref id="bib1.bibx1"><label>Amirteimoori et al.(2023)</label><mixed-citation>Amirteimoori, A., Tirkolaee, E. B., Simic, V., and Weber, G.-W.: A parallel heuristic for hybrid job shop scheduling problem considering conflict-free AGV routing, Swarm Evol. Comput., 79, 101312, <ext-link xlink:href="https://doi.org/10.1016/j.swevo.2023.101312" ext-link-type="DOI">10.1016/j.swevo.2023.101312</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx2"><label>Atasagun and Karaoğlan(2024)</label><mixed-citation>Atasagun, G. C. and Karaoğlan, İ.: Integrated production and outbound distribution scheduling problem with multiple facilities/vehicles and perishable items, Appl. Soft Comput., 166, 112144, <ext-link xlink:href="https://doi.org/10.1016/j.asoc.2024.112144" ext-link-type="DOI">10.1016/j.asoc.2024.112144</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx3"><label>Chang et al.(2025)</label><mixed-citation>Chang, X., Jia, X., and Hu, H.: Energy-efficient and self-adaptive AGV scheduling approach based on hierarchical reinforcement learning for flexible shop floor, Comput. Ind. Eng., 205, 111140, <ext-link xlink:href="https://doi.org/10.1016/j.cie.2025.111140" ext-link-type="DOI">10.1016/j.cie.2025.111140</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx4"><label>Chen et al.(2021)</label><mixed-citation> Chen, J., Zhang, Y., Xu, Y., Ma, H., Yang, H., Song, J., Wang, Y., and Wu, Y.: Variational automatic curriculum learning for sparse-reward cooperative multi-agent problems, in: Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx5"><label>Di et al.(2024)</label><mixed-citation>Di, Y., Deng, L., and Zhang, L.: A collaborative-learning multi-agent reinforcement learning method for distributed hybrid flow shop scheduling problem, Swarm Evol. Comput., 91, 101764, <ext-link xlink:href="https://doi.org/10.1016/j.swevo.2024.101764" ext-link-type="DOI">10.1016/j.swevo.2024.101764</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx6"><label>de Lima et al.(2022)</label><mixed-citation> de Lima, F. A., Seuring, S., and Sauer, P. C.: A systematic literature review exploring uncertainty management and sustainability outcomes in circular supply chains, Int. J. Prod. Res., 60, 6013–6046, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx7"><label>Hady et al.(2025)</label><mixed-citation>Hady, M. A., Hu, S., Pratama, M., Cao, Z., and Kowalczyk, R.: Multi-agent reinforcement learning for resources allocation optimization: a survey, Artif. Intell. Rev., 58, 354, <ext-link xlink:href="https://doi.org/10.1007/s10462-025-11340-5" ext-link-type="DOI">10.1007/s10462-025-11340-5</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx8"><label>Hamou et al.(2025)</label><mixed-citation>Hamou, K. A. B., Jarir, Z., and Elfirdoussi, S.: Using machine learning for production scheduling problems in the supply chain: A review, Comput. Ind. Eng., 206, 111243, <ext-link xlink:href="https://doi.org/10.1016/j.cie.2025.111243" ext-link-type="DOI">10.1016/j.cie.2025.111243</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx9"><label>Hao and Demir(2025)</label><mixed-citation> Hao, X. and Demir, E.: Artificial intelligence in supply chain management: enablers and constraints in pre-development, deployment, and post-development stages, Prod. Plan. Control, 36, 748–770, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx10"><label>Homayouni and Fontes(2021)</label><mixed-citation> Homayouni, S. M. and Fontes, D. B. M. M.: Production and transport scheduling in flexible job shop manufacturing systems, J. Global Optim., 79, 463–502, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx11"><label>Hu et al.(2025a)</label><mixed-citation>Hu, H., Liu, L., and Yang, X.: A deep reinforcement learning framework for real-time joint task assignment and storage allocation problems considering random tasks in automated container terminals, Comput. Ind. Eng., 111544, <ext-link xlink:href="https://doi.org/10.1016/j.cie.2025.111544" ext-link-type="DOI">10.1016/j.cie.2025.111544</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx12"><label>Hu et al.(2025b)</label><mixed-citation>Hu, Y., Wang, M., Min, R., Liu, J., Lukinykh, V. F., Tang, S., and Zhao, D.: Coordinated scheduling optimization of quay cranes and AGVs in automated container terminals, Comput. Oper. Res., 182, 107147, <ext-link xlink:href="https://doi.org/10.1016/j.cor.2025.107147" ext-link-type="DOI">10.1016/j.cor.2025.107147</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx13"><label>Huang and Ontañón(2022)</label><mixed-citation>Huang, S. and Ontañón, S.: A closer look at invalid action masking in policy gradient algorithms, in: International FLAIRS Conference Proceedings, 35, <ext-link xlink:href="https://doi.org/10.32473/flairs.v35i.130584" ext-link-type="DOI">10.32473/flairs.v35i.130584</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx14"><label>Iklassov et al.(2023)</label><mixed-citation>Iklassov, Z., Medvedev, D., Solozabal Ochoa de Retana, R., and Takac, M.: On the study of curriculum learning for inferring dispatching policies on the job shop scheduling, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 5350–5358, <ext-link xlink:href="https://doi.org/10.24963/ijcai.2023/594" ext-link-type="DOI">10.24963/ijcai.2023/594</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx15"><label>Karimi and Alinia(2025)</label><mixed-citation> Karimi, N. and Alinia, S.: Towards a sustainable future: Integrating energy efficiency in multi-factory supply chain scheduling, Process Integration and Optimization for Sustainability, 9, 1425–1443, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx16"><label>Kaven et al.(2024)</label><mixed-citation>Kaven, L., Huke, P., Göppert, A., and Schmitt, R. H.: Multi agent reinforcement learning for online layout planning and scheduling in flexible assembly systems, J. Intell. Manuf., 35, 3917–3936, <ext-link xlink:href="https://doi.org/10.1007/s10845-023-02309-8" ext-link-type="DOI">10.1007/s10845-023-02309-8</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx17"><label>Li et al.(2025b)</label><mixed-citation>Li, H., Gao, L., Fan, Q., Li, X., and Han, B.: An end-to-end decentralised scheduling framework based on deep reinforcement learning for dynamic distributed heterogeneous flowshop scheduling, Int. J. Prod. Res., 63, 4368–4388, <ext-link xlink:href="https://doi.org/10.1080/00207543.2024.2449240" ext-link-type="DOI">10.1080/00207543.2024.2449240</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx18"><label>Li et al.(2024)</label><mixed-citation>Li, S., Fan, L., and Jia, S.: A hierarchical solution framework for dynamic and conflict-free AGV scheduling in an automated container terminal, Transport. Res. C-Emer., 165, 104724, <ext-link xlink:href="https://doi.org/10.1016/j.trc.2024.104724" ext-link-type="DOI">10.1016/j.trc.2024.104724</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx19"><label>Li et al.(2025a)</label><mixed-citation>Li, Y., Li, X., and Gao, L.: Real-time scheduling for production-logistics collaborative environment using multi-agent deep reinforcement learning, Adv. Eng. Inform., 65, 103216, <ext-link xlink:href="https://doi.org/10.1016/j.aei.2025.103216" ext-link-type="DOI">10.1016/j.aei.2025.103216</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx20"><label>Liang et al.(2025)</label><mixed-citation> Liang, T., Zhou, L., and Jiang, Z.: Integrated scheduling of production and material delivery for the intelligent manufacturing system, Int. J. Prod. Res., 63, 882–903, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx21"><label>Lin et al.(2025)</label><mixed-citation>Lin, S., Mi, Q., and Gao, T.: A survey of curriculum learning in deep reinforcement learning, in: Proceedings of the 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), IEEE, 1141–1147, <ext-link xlink:href="https://doi.org/10.1109/CCWC62904.2025.10903795" ext-link-type="DOI">10.1109/CCWC62904.2025.10903795</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx22"><label>Liu and Huang(2023)</label><mixed-citation> Liu, C. L. and Huang, T. H.: Dynamic job-shop scheduling problems using graph neural network and deep reinforcement learning, IEEE T. Syst. Man. Cy.-S., 53, 6836–6848, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx23"><label>Liu et al.(2022)</label><mixed-citation> Liu, R., Piplani, R., and Toro, C.: Deep reinforcement learning for dynamic scheduling of a flexible job shop, Int. J. Prod. Res., 60, 4049–4069, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx24"><label>Liu et al.(2025)</label><mixed-citation>Liu, X., Hu, M., Peng, Y., and Yang, Y.: Multi-agent deep reinforcement learning for multi-echelon inventory management, Prod. Oper. Manag., 34, 1836–1856, <ext-link xlink:href="https://doi.org/10.1177/10591478241305863" ext-link-type="DOI">10.1177/10591478241305863</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx25"><label>Lu et al.(2025)</label><mixed-citation>Lu, C., Xiao, Y., Zhang, B., and Gao, L.: Curriculum reinforcement learning algorithm for flexible job shop scheduling problems, Journal of National University of Defense Technology, 47, 49–59, <ext-link xlink:href="https://doi.org/10.11887/j.cn.202502004" ext-link-type="DOI">10.11887/j.cn.202502004</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx26"><label>Narvekar et al.(2020)</label><mixed-citation>Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M. E., and Stone, P.: Curriculum learning for reinforcement learning domains: A framework and survey, J. Mach. Learn. Res., 21, 1–50, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx27"><label>Ngwu et al.(2026)</label><mixed-citation>Ngwu, C., Liu, Y., and Wu, R.: Reinforcement learning in dynamic job shop scheduling: a comprehensive review of AI-driven approaches in modern manufacturing, J. Intell. Manuf., 37, 1093–1108, 2026.</mixed-citation></ref>
      <ref id="bib1.bibx28"><label>Pérez et al.(2023)</label><mixed-citation>Pérez, C., Climent, L., Nicoló, G., Arbelaez, A., and Salido, M. A.: A hybrid metaheuristic with learning for a real supply chain scheduling problem, Eng. Appl. Artif. Intell., 126, 107188, <ext-link xlink:href="https://doi.org/10.1016/j.engappai.2023.107188" ext-link-type="DOI">10.1016/j.engappai.2023.107188</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx29"><label>Uzunoglu et al.(2023)</label><mixed-citation>Uzunoglu, A., Gahm, C., Wahl, S., and Tuma, A.: Learning-augmented heuristics for scheduling parallel serial-batch processing machines, Comput. Oper. Res., 151, 106122, <ext-link xlink:href="https://doi.org/10.1016/j.cor.2022.106122" ext-link-type="DOI">10.1016/j.cor.2022.106122</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx30"><label>Shi et al.(2025)</label><mixed-citation>Shi, J., Qiao, F., Liu, J., Ma, Y., Wang, D., and Ding, C.: Production-logistics collaborative scheduling in dynamic flexible job shops using nested-hierarchical deep reinforcement learning, Adv. Eng. Inform., 65, 103195, <ext-link xlink:href="https://doi.org/10.1016/j.aei.2025.103195" ext-link-type="DOI">10.1016/j.aei.2025.103195</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx31"><label>Sidki et al.(2025)</label><mixed-citation>Sidki, M., Tchernev, N., Féniès, P., and Ren, L.: A monolithic batch-centric MILP approach for a real-world integrated production and pipeline distribution scheduling problem, Comput. Ind. Eng., 203, 111028, <ext-link xlink:href="https://doi.org/10.1016/j.cie.2025.111028" ext-link-type="DOI">10.1016/j.cie.2025.111028</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx32"><label>Vié et al.(2025)</label><mixed-citation>Vié, M. S., Zufferey, N., and Coelho, L. C.: A production and distribution scheduling matheuristic for reducing supply chain variations, Transport. Res. E-Log., 194, 103905, <ext-link xlink:href="https://doi.org/10.1016/j.tre.2024.103905" ext-link-type="DOI">10.1016/j.tre.2024.103905</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx33"><label>Wang et al.(2025b)</label><mixed-citation>Wang, W., Zhang, Y., Wang, Y., Pan, G., and Feng, Y.: Hierarchical multi-agent deep reinforcement learning for dynamic flexible job-shop scheduling with transportation, Int. J. Prod. Res., 1–28, <ext-link xlink:href="https://doi.org/10.1080/00207543.2025.2511239" ext-link-type="DOI">10.1080/00207543.2025.2511239</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx34"><label>Wang et al.(2025a)</label><mixed-citation>Wang, Y., Wang, R., Sun, J., Deng, F., Wang, G., and Chen, J.: Attention enhanced reinforcement learning for flexible job shop scheduling with transportation constraints, Expert Syst. Appl., 282, 127671, <ext-link xlink:href="https://doi.org/10.1016/j.eswa.2025.127671" ext-link-type="DOI">10.1016/j.eswa.2025.127671</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx35"><label>Wu et al.(2024)</label><mixed-citation>Wu, C. C., Zhang, R. M., Zhao, P. Y., Li, L., and Zhang, D. G.: Curing simulation and data-driven curing curve prediction of thermoset composites, Sci. Rep., 14, 31860, <ext-link xlink:href="https://doi.org/10.1038/s41598-024-83379-3" ext-link-type="DOI">10.1038/s41598-024-83379-3</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx36"><label>Xu et al.(2025)</label><mixed-citation>Xu, W., Gu, J., Zhang, W., Gen, M., and Ohwada, H.: Multi-agent reinforcement learning for flexible shop scheduling problem: a survey, Front. Ind. Eng., 3, 1611512, <ext-link xlink:href="https://doi.org/10.3389/fieng.2025.1611512" ext-link-type="DOI">10.3389/fieng.2025.1611512</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx37"><label>Yang et al.(2026)</label><mixed-citation>Yang, L., Yang, Z., Bi, L., and Jiao, X.: Dynamic flexible job shop co-scheduling optimization based on graph neural network and deep reinforcement learning, Operations Research Perspectives, 16, 100379, <ext-link xlink:href="https://doi.org/10.1016/j.orp.2026.100379" ext-link-type="DOI">10.1016/j.orp.2026.100379</ext-link>, 2026.</mixed-citation></ref>
      <ref id="bib1.bibx38"><label>Yao et al.(2024)</label><mixed-citation>Yao, Y., Liu, Q., Fu, L., Li, X., Yu, Y., Gao, L., and Zhou, W.: A novel mathematical model for the flexible job-shop scheduling problem with limited automated guided vehicles, IEEE T. Autom. Sci. Eng., 22, 7449–7462, <ext-link xlink:href="https://doi.org/10.1109/TASE.2024.3356255" ext-link-type="DOI">10.1109/TASE.2024.3356255</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx39"><label>Yu et al.(2026)</label><mixed-citation>Yu, H., Lv, M., Hu, B., Zhang, Y., and Zhao, P.: Review article: A review of control technologies for soft robots: from structural design to intelligent control, Mech. Sci., 17, 313–332, <ext-link xlink:href="https://doi.org/10.5194/ms-17-313-2026" ext-link-type="DOI">10.5194/ms-17-313-2026</ext-link>, 2026.</mixed-citation></ref>
      <ref id="bib1.bibx40"><label>Zhang et al.(2024b)</label><mixed-citation>Zhang, C., Juraschek, M., and Herrmann, C.: Deep reinforcement learning-based dynamic scheduling for resilient and sustainable manufacturing: A systematic review, J. Manuf. Syst., 77, 962–989, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx41"><label>Zhang et al.(2024a)</label><mixed-citation>Zhang, L., Yan, Y., and Hu, Y.: Dynamic flexible scheduling with transportation constraints by multi-agent reinforcement learning, Eng. Appl. Artif. Intell., 134, 108699, <ext-link xlink:href="https://doi.org/10.1016/j.engappai.2024.108699" ext-link-type="DOI">10.1016/j.engappai.2024.108699</ext-link>, 2024.</mixed-citation></ref>

  </ref-list></back>
</article>
