-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 2:
Notations:
data parallel size (dp_size), model parallel size (mp_size),
number of parameters (params), number of multiply-accumulate operations (MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                   32
data parallel size:                                           32
model parallel size:                                          1
batch size per GPU:                                           16

params per GPU:                                               8.08 B
params of model = params per GPU * mp_size:                   8.08 B
fwd MACs per GPU:                                             26.2 TMACs
fwd flops per GPU:                                            52.41 T
fwd flops of model = fwd flops per GPU * mp_size:             52.41 T
fwd latency:                                                  427.35 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:          122.64 TFLOPS
bwd latency:                                                  1.01 s
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:      103.73 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):  109.35 TFLOPS
step latency:                                                 397.53 ms
iter latency:                                                 1.84 s
FLOPS per GPU = 3 * fwd flops per GPU / iter latency:         85.67 TFLOPS
samples/second:                                               278.96

----------------------------- Aggregated Profile per GPU -----------------------------
Top 1 modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'DiT': '8.08 B'}
    MACs        - {'DiT': '26.2 TMACs'}
    fwd latency - {'DiT': '427.17 ms'}
depth 1:
    params      - {'ModuleList': '8.05 B'}
    MACs        - {'ModuleList': '26.15 TMACs'}
    fwd latency - {'ModuleList': '406.48 ms'}
depth 2:
    params      - {'DiTLayer': '8.05 B'}
    MACs        - {'DiTLayer': '26.15 TMACs'}
    fwd latency - {'DiTLayer': '406.48 ms'}
depth 3:
    params      - {'GemmaMLP': '4.03 B'}
    MACs        - {'GemmaMLP': '16.49 TMACs'}
    fwd latency - {'DiTSelfAttention': '213.37 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs (or latency) and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.
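The derived throughput figures in the summary follow directly from the raw values it reports (fwd flops per GPU, the three latencies, world size, and per-GPU batch size). The short Python sketch below is not part of the profiler output; it simply recomputes those figures as a sanity check under the values printed above, and the small deviations from the reported numbers come from rounding of the displayed latencies.

```python
# Illustrative sanity check: recompute the derived metrics in the profile summary.
# All input values are taken verbatim from the "Profile Summary at step 2" above.

world_size = 32
batch_size_per_gpu = 16
fwd_flops_per_gpu = 52.41e12   # "fwd flops per GPU: 52.41 T"
fwd_latency = 427.35e-3        # seconds
bwd_latency = 1.01             # seconds
iter_latency = 1.84            # seconds (fwd + bwd + step)

# fwd FLOPS per GPU = fwd flops per GPU / fwd latency  (~122.6 TFLOPS)
print(f"fwd FLOPS/GPU:     {fwd_flops_per_gpu / fwd_latency / 1e12:.2f} TFLOPS")

# backward is counted as 2x the forward flops, so
# bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency  (~103.8 TFLOPS)
print(f"bwd FLOPS/GPU:     {2 * fwd_flops_per_gpu / bwd_latency / 1e12:.2f} TFLOPS")

# fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd + bwd latency)  (~109.4 TFLOPS)
print(f"fwd+bwd FLOPS/GPU: {3 * fwd_flops_per_gpu / (fwd_latency + bwd_latency) / 1e12:.2f} TFLOPS")

# FLOPS per GPU = 3 * fwd flops per GPU / iter latency, i.e. including step time  (~85.5 TFLOPS)
print(f"iter FLOPS/GPU:    {3 * fwd_flops_per_gpu / iter_latency / 1e12:.2f} TFLOPS")

# One iteration processes world_size * batch_size_per_gpu samples  (~278 samples/s)
print(f"samples/second:    {world_size * batch_size_per_gpu / iter_latency:.2f}")
```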
DiT( 8.08 B = 100% Params, 26.2 TMACs = 100% MACs, 427.17 ms = 100% latency, 122.7 TFLOPS (layers): ModuleList( (0): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.26 ms = 1.23% latency, 124.34 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 658.51 us = 0.15% latency, 1.22 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 40.77 us = 0.01% latency, 803.74 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 201.94 us = 0.05% latency, 3.99 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 247.24 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.74 ms = 0.64% latency, 87.64 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 467.3 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 276.8 us = 0.06% latency, 248.26 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 161.41 us = 0.04% latency, 106.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.53 us = 0.03% latency, 115.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.09 us = 0.03% latency, 125.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.38 us = 0.06% latency, 281.2 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.37 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.22 ms = 0.29% latency, 336.68 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 342.37 us = 0.08% latency, 401.44 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.34 us = 0.07% latency, 433.1 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.83 us = 0.07% latency, 463.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 89.65 us = 0.02% latency, 374.3 GFLOPS) ) ) (1): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.13 ms = 1.2% latency, 127.53 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 616.79 us = 0.14% latency, 1.31 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.75 us = 0.04% latency, 4.74 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.68 ms = 0.63% latency, 89.59 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 
0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 260.83 us = 0.06% latency, 263.46 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.93 us = 0.04% latency, 110.18 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.33 us = 0.03% latency, 125.1 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.74 us = 0.05% latency, 303.08 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.28 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.21 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.78 us = 0.08% latency, 418.03 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 305.18 us = 0.07% latency, 450.36 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 83.45 us = 0.02% latency, 402.11 GFLOPS) ) ) (2): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.73 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 608.92 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.62 us = 0.01% latency, 848.39 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 176.43 us = 0.04% latency, 4.56 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.48 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.58 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.82 us = 0.06% latency, 273.98 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.02 us = 0.04% latency, 111.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.34 us = 0.03% latency, 116.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.99 us = 0.03% latency, 128.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 228.88 us = 0.05% 
latency, 300.24 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.67 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.97 us = 0.08% latency, 416.52 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.06 us = 0.07% latency, 470.58 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (3): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.7 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 610.83 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.39 us = 0.01% latency, 853.66 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 174.52 us = 0.04% latency, 4.61 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.33 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.62% latency, 90.23 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.23 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.72 us = 0.06% latency, 277.41 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.78 us = 0.04% latency, 111.72 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 233.17 us = 0.05% latency, 294.71 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.37 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.02 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.06 us = 0.08% latency, 418.94 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (4): DiTLayer( 100.68 M = 1.25% Params, 326.82 
GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.76 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 594.85 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.99 us = 0.04% latency, 4.88 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.28 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.44 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.54 us = 0.04% latency, 111.89 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 138.28 us = 0.03% latency, 124.24 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.55 us = 0.05% latency, 298.07 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.28 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.25 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.12 us = 0.08% latency, 413.83 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 316.14 us = 0.07% latency, 434.74 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) (5): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.16 ms = 1.21% latency, 126.76 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 655.41 us = 0.15% latency, 1.23 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.23 us = 0.04% latency, 4.73 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.29 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.68 ms = 0.63% latency, 89.76 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.44 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.37 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 259.64 us = 0.06% latency, 264.67 
TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.96 us = 0.03% latency, 118.52 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.27 us = 0.05% latency, 297.14 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.18 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.64 us = 0.08% latency, 414.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.67 us = 0.07% latency, 435.39 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (6): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.88 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 601.05 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.67 us = 0.01% latency, 869.87 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.33 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.72 us = 0.06% latency, 277.41 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.02 us = 0.04% latency, 111.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.85 us = 0.03% latency, 125.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.84 us = 0.05% latency, 298.99 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.66 
us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.18 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.54 us = 0.08% latency, 418.33 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.24 us = 0.07% latency, 437.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (7): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.12 ms = 1.2% latency, 127.66 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 610.59 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 41.96 us = 0.01% latency, 780.9 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.23 us = 0.04% latency, 4.73 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 89.98 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.44 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.7 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 252.49 us = 0.06% latency, 272.17 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.21 us = 0.04% latency, 110.69 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.77 us = 0.03% latency, 115.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.09 us = 0.03% latency, 125.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.5 us = 0.05% latency, 303.4 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.22 ms = 0.29% latency, 338.26 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 335.93 us = 0.08% latency, 409.13 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 324.49 us = 0.08% latency, 423.56 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 298.26 us = 0.07% latency, 460.8 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.54 us = 0.02% latency, 411.51 GFLOPS) ) ) (8): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.13 ms = 1.2% latency, 127.36 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 587.46 us = 0.14% 
latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 32.66 us = 0.01% latency, 1 GFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 156.88 us = 0.04% latency, 5.13 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.7 ms = 0.63% latency, 89.09 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.15 us = 0.06% latency, 275.82 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.73 us = 0.04% latency, 111.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.3 us = 0.03% latency, 115.85 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 140.67 us = 0.03% latency, 122.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 138.52 us = 0.03% latency, 124.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.38 us = 0.06% latency, 281.2 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.22 ms = 0.29% latency, 338.26 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 341.65 us = 0.08% latency, 402.28 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 320.2 us = 0.07% latency, 429.23 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 298.26 us = 0.07% latency, 460.8 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.78 us = 0.02% latency, 410.31 GFLOPS) ) ) (9): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.17 ms = 1.21% latency, 126.52 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 603.68 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.28 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.66 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.72 ms = 0.64% latency, 88.39 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.68 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.15 us = 0.06% latency, 275.82 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 156.16 us = 0.04% latency, 110.01 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 141.14 us = 0.03% latency, 121.72 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 238.42 us = 0.06% latency, 288.23 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.21 ms = 0.28% latency, 339.45 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 337.36 us = 0.08% latency, 407.39 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 319.72 us = 0.07% latency, 429.87 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 300.65 us = 0.07% latency, 457.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.54 us = 0.02% latency, 411.51 GFLOPS) ) ) (10): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.13 ms = 1.2% latency, 127.47 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 622.51 us = 0.15% latency, 1.29 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 43.63 us = 0.01% latency, 751.03 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 181.44 us = 0.04% latency, 4.44 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.57 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.31 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.96 us = 0.06% latency, 277.14 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.54 us = 0.04% latency, 111.89 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.85 us = 0.03% latency, 125.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.22 ms = 0.29% latency, 337.47 TFLOPS (gate_proj): Linear(16.78 
M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 334.02 us = 0.08% latency, 411.46 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 320.43 us = 0.08% latency, 428.91 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 298.5 us = 0.07% latency, 460.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.3 us = 0.02% latency, 412.72 GFLOPS) ) ) (11): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.1 ms = 1.19% latency, 128.3 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 612.26 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 46.73 us = 0.01% latency, 701.22 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.8 us = 0.04% latency, 4.77 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 90 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.7 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.82 us = 0.06% latency, 273.98 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 156.88 us = 0.04% latency, 109.51 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.53 us = 0.03% latency, 115.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.36 us = 0.05% latency, 299.62 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.84 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.87 us = 0.08% latency, 420.47 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.4 us = 0.07% latency, 465.26 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.54 us = 0.02% latency, 411.51 GFLOPS) ) ) (12): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.09 ms = 1.19% latency, 128.3 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 614.88 us = 0.14% latency, 1.31 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 40.29 us = 0.01% latency, 813.25 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% 
MACs, 173.33 us = 0.04% latency, 4.65 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.62% latency, 90.11 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 248.91 us = 0.06% latency, 276.08 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.09 us = 0.03% latency, 125.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 143.29 us = 0.03% latency, 119.9 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.18 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.54 us = 0.08% latency, 418.33 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.76 us = 0.07% latency, 438.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (13): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.1 ms = 1.19% latency, 128.23 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 606.06 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 177.15 us = 0.04% latency, 4.55 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.62% latency, 90.21 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.58 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.08 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 252.49 us = 0.06% latency, 272.17 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.73 us = 0.04% latency, 111.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.79 us = 0.05% latency, 297.76 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.21 ms = 0.28% latency, 341.67 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 333.31 us = 0.08% latency, 412.35 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 319.24 us = 0.07% latency, 430.52 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.83 us = 0.07% latency, 463.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.3 us = 0.02% latency, 412.72 GFLOPS) ) ) (14): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.11 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 592.95 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.65 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.19 us = 0.06% latency, 282.58 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.74 us = 0.05% latency, 296.53 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.46 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): 
Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.19 us = 0.07% latency, 436.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.59 us = 0.07% latency, 463.39 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (15): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.86 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 587.7 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.65 us = 0.04% latency, 4.98 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 90.02 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 472.78 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.99 us = 0.06% latency, 283.97 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.26 us = 0.04% latency, 111.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.5 us = 0.05% latency, 296.84 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.76 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.63 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.21 us = 0.08% latency, 416.22 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.43 us = 0.07% latency, 435.72 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (16): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.13 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.18 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.89 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 
MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.31 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 259.64 us = 0.06% latency, 264.67 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.63 us = 0.04% latency, 113.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.63 us = 0.03% latency, 117.17 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.04 us = 0.03% latency, 129.14 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.78 us = 0.05% latency, 304.36 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.77 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.45 us = 0.08% latency, 415.92 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (17): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.11 ms = 1.2% latency, 127.96 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 593.42 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.27 us = 0.04% latency, 4.9 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.68 ms = 0.63% latency, 89.86 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 238.42 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.52 us = 0.06% latency, 284.53 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.83 us = 0.04% latency, 112.41 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.42 us = 0.03% latency, 126.86 
TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.21 ms = 0.28% latency, 340.52 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 337.6 us = 0.08% latency, 407.11 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.1 us = 0.07% latency, 433.43 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.12 us = 0.07% latency, 464.14 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 83.45 us = 0.02% latency, 402.11 GFLOPS) ) ) (18): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.91 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 601.53 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.2 us = 0.01% latency, 776.49 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.32 us = 0.04% latency, 4.93 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.65 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.49 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.86 us = 0.06% latency, 280.65 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.63 us = 0.04% latency, 113.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.42 us = 0.03% latency, 126.86 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.07 us = 0.05% latency, 298.68 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.21 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.73 us = 0.08% latency, 416.82 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 318.29 us = 0.07% latency, 431.81 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): 
Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (19): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.97 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 597.24 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.01% latency, 910.19 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.99 us = 0.04% latency, 4.88 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.63 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.58 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.52 us = 0.06% latency, 278.75 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 228.4 us = 0.05% latency, 300.87 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.09 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.12 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.26 us = 0.08% latency, 417.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.71 us = 0.07% latency, 436.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.12 us = 0.07% latency, 464.14 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (20): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.41 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.89 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.73 
TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.68 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.86 us = 0.06% latency, 280.65 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.78 us = 0.04% latency, 111.72 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.07 us = 0.05% latency, 305.33 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.97 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.26 us = 0.08% latency, 417.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.71 us = 0.07% latency, 436.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (21): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.1 ms = 1.19% latency, 128.21 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 604.15 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.91 us = 0.01% latency, 864.4 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.99 us = 0.04% latency, 4.74 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.86 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 90 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 465.15 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.05 us = 0.06% latency, 279.29 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.5 us = 0.04% latency, 111.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.36 us = 0.05% latency, 299.62 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.89 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.02 us = 0.08% latency, 417.73 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.4 us = 0.08% latency, 421.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.3 us = 0.02% latency, 412.72 GFLOPS) ) ) (22): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.6 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 600.81 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.61 us = 0.04% latency, 4.8 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.81 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.35 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.62 us = 0.06% latency, 275.29 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.93 us = 0.04% latency, 110.18 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.67 us = 0.03% latency, 117.93 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.56 us = 0.03% latency, 129.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 237.94 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 343.43 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.4 us = 0.08% latency, 421.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.76 us = 0.07% latency, 438.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 303.75 us = 0.07% latency, 452.48 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): 
PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (23): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.12 ms = 1.2% latency, 127.66 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 611.07 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.43 us = 0.01% latency, 875.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.28 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.81 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 89.97 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.34 us = 0.06% latency, 274.51 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.93 us = 0.04% latency, 110.18 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.3 us = 0.03% latency, 115.85 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.12 us = 0.05% latency, 299.93 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.68 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 333.55 us = 0.08% latency, 412.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 316.62 us = 0.07% latency, 434.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.12 us = 0.07% latency, 464.14 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (24): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.77 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 594.14 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.62% latency, 90.22 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.92 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% 
MACs, 234.13 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.86 us = 0.06% latency, 280.65 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.69 us = 0.04% latency, 110.35 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.34 us = 0.03% latency, 116.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.27 us = 0.05% latency, 297.14 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.83 us = 0.08% latency, 412.94 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) (25): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.9 ms = 1.38% latency, 110.71 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 592.71 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.75 us = 0.04% latency, 4.89 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 3.43 ms = 0.8% latency, 70.05 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 541.93 us = 0.13% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 307.56 us = 0.07% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 257.02 us = 0.06% latency, 267.38 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.21 us = 0.04% latency, 110.69 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.58 us = 0.03% latency, 116.41 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 385.28 us = 0.09% latency, 44.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 269.17 us = 0.06% latency, 63.82 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 252.72 us = 0.06% latency, 271.92 TFLOPS, 
in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.25 ms = 0.29% latency, 329.75 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 343.32 us = 0.08% latency, 400.32 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.35 us = 0.07% latency, 463.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 87.74 us = 0.02% latency, 382.44 GFLOPS) ) ) (26): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.12 ms = 1.2% latency, 127.59 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 639.92 us = 0.15% latency, 1.26 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.32 us = 0.04% latency, 4.78 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 281.33 us = 0.07% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.26 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.08 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 251.53 us = 0.06% latency, 273.2 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.61 us = 0.03% latency, 125.75 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.22 us = 0.05% latency, 295.92 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.09 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 343.7 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.87 us = 0.08% latency, 420.47 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.83 us = 0.08% latency, 419.24 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (27): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 
ms = 1.18% latency, 129.46 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.89 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.17 us = 0.04% latency, 5 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.14 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.52 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.05 us = 0.06% latency, 279.29 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 143.53 us = 0.03% latency, 119.7 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.99 us = 0.03% latency, 128.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.45 us = 0.05% latency, 302.13 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.87 us = 0.08% latency, 420.47 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.69 us = 0.07% latency, 466.39 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (28): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.65 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 617.98 us = 0.14% latency, 1.3 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 47.92 us = 0.01% latency, 683.78 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 173.57 us = 0.04% latency, 4.64 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.29 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.73 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.96 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.1 us = 0.06% latency, 274.77 TFLOPS, in_features=2048, 
out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.78 us = 0.04% latency, 111.72 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.63 us = 0.03% latency, 117.17 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 223.88 us = 0.05% latency, 306.95 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.53 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.26 us = 0.08% latency, 417.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 318.29 us = 0.07% latency, 431.81 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (29): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.06 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.18 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.65 us = 0.04% latency, 4.98 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.35 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.48 us = 0.06% latency, 277.68 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.11 us = 0.04% latency, 112.94 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.67 us = 0.03% latency, 117.93 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 237.46 us = 0.06% latency, 289.39 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 
FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.73 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.12 us = 0.08% latency, 413.83 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.57 us = 0.07% latency, 432.78 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) (30): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.78 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.66 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.57 us = 0.01% latency, 947.85 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.12 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.05 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.85 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.58 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 239.85 us = 0.06% latency, 286.51 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.63 us = 0.04% latency, 113.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 142.81 us = 0.03% latency, 120.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.71 us = 0.03% latency, 127.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 142.81 us = 0.03% latency, 120.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.79 us = 0.05% latency, 297.76 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.64 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.45 us = 0.08% latency, 415.92 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.33 us = 0.07% latency, 440.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (31): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.89 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 607.25 us = 0.14% latency, 1.33 TFLOPS 
(silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.79 us = 0.04% latency, 4.92 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.01 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 233.41 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144 us = 0.03% latency, 119.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.34 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.4 us = 0.08% latency, 421.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.69 us = 0.07% latency, 466.39 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (32): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.66 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 592.95 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 33.38 us = 0.01% latency, 981.71 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.66 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.79 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.92 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.81 us = 0.06% latency, 279.56 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.45 us = 0.04% latency, 110.52 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.42 us = 0.03% latency, 126.86 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 228.17 us = 0.05% latency, 301.18 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 347.99 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.83 us = 0.08% latency, 419.24 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.76 us = 0.07% latency, 438.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (33): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 129.94 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.81 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.92 us = 0.01% latency, 763.55 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.99 us = 0.04% latency, 4.88 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.28 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.09 us = 0.06% latency, 286.23 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.06 us = 0.04% latency, 112.24 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.78 us = 0.05% latency, 304.36 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.92 TFLOPS (gate_proj): Linear(16.78 M = 0.21% 
Params, 68.72 GMACs = 0.26% MACs, 326.87 us = 0.08% latency, 420.47 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.76 us = 0.07% latency, 438.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (34): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.94 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 605.34 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.43 us = 0.01% latency, 875.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.23 us = 0.04% latency, 4.73 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.52 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.61 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.82 us = 0.06% latency, 273.98 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.83 us = 0.04% latency, 112.41 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.66 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.3 us = 0.08% latency, 418.64 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.24 us = 0.07% latency, 437.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (35): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.26 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 600.58 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% 
MACs, 171.18 us = 0.04% latency, 4.7 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.81 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.78 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.29 us = 0.06% latency, 279.02 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.54 us = 0.04% latency, 111.89 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.67 us = 0.03% latency, 117.93 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.78 us = 0.05% latency, 304.36 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.28 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.36 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.35 us = 0.08% latency, 419.85 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314 us = 0.07% latency, 437.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.92 us = 0.07% latency, 466.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (36): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 129.83 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.33 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.01% latency, 910.19 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.36 us = 0.04% latency, 4.96 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.05 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.96 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.43 us = 0.06% latency, 282.3 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 149.73 us = 0.04% latency, 114.74 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 348.83 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 324.49 us = 0.08% latency, 423.56 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314 us = 0.07% latency, 437.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (37): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.11 ms = 1.2% latency, 128 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 600.58 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.01% latency, 892.46 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.85 us = 0.04% latency, 4.8 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.68 ms = 0.63% latency, 89.76 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.44 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.6 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 262.74 us = 0.06% latency, 261.55 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 156.88 us = 0.04% latency, 109.51 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.31 us = 0.05% latency, 298.38 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.34 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.49 us = 0.08% latency, 417.12 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 
68.72 GMACs = 0.26% MACs, 313.52 us = 0.07% latency, 438.37 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 306.84 us = 0.07% latency, 447.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (38): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.53 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 593.66 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.01% latency, 904.2 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.36 us = 0.04% latency, 4.96 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.33 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.06 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.84 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.71 us = 0.06% latency, 283.13 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 228.17 us = 0.05% latency, 301.18 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.35 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.59 us = 0.08% latency, 413.23 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.19 us = 0.07% latency, 436.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.39 us = 0.02% latency, 422.64 GFLOPS) ) ) (39): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.53 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.19 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.09 us = 0.04% latency, 4.79 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.52 us = 
0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.01 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.47 us = 0.06% latency, 283.41 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144 us = 0.03% latency, 119.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.25 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.45 us = 0.08% latency, 415.92 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 316.38 us = 0.07% latency, 434.41 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (40): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.07 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.6 us = 0.04% latency, 4.95 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.26 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 465.63 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 260.11 us = 0.06% latency, 264.19 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.26 us = 0.04% latency, 111.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.72 us = 0.03% latency, 118.71 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.21 us = 0.05% latency, 302.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 348.48 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.35 us = 0.08% latency, 419.85 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.28 us = 0.07% latency, 438.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (41): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.97 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.91 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.01% latency, 892.46 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.04 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.31 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.23 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 251.53 us = 0.06% latency, 273.2 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.97 us = 0.04% latency, 110.86 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.34 us = 0.03% latency, 116.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.57 us = 0.03% latency, 124.88 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.71 us = 0.03% latency, 127.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.74 us = 0.05% latency, 303.08 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.74 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.16 us = 0.08% latency, 415.02 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.81 us = 0.07% latency, 439.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% 
Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (42): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.06 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 601.05 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.01% latency, 904.2 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.52 us = 0.04% latency, 4.75 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.14 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.42 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.24 us = 0.06% latency, 277.95 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 130.65 us = 0.03% latency, 131.49 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.7 us = 0.05% latency, 295.32 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.29 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.88 us = 0.08% latency, 414.12 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 311.85 us = 0.07% latency, 440.72 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (43): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.11 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 604.63 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 44.82 us = 0.01% latency, 731.06 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.04 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.54 TFLOPS (q_norm): 
GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.14 us = 0.06% latency, 281.47 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.22 us = 0.05% latency, 295.92 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.5 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.83 us = 0.08% latency, 419.24 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.1 us = 0.07% latency, 433.43 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.06 us = 0.07% latency, 470.58 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (44): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.14 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.75 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.49 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.71 us = 0.06% latency, 283.13 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.68 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 341.65 us = 0.08% latency, 402.28 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (45): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.11 ms = 1.2% latency, 127.92 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 650.88 us = 0.15% latency, 1.24 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.57 us = 0.01% latency, 947.85 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.65 us = 0.04% latency, 4.98 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.29 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.79 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 238.18 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.05 us = 0.06% latency, 279.29 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.92 us = 0.04% latency, 113.84 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.72 us = 0.03% latency, 118.71 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.11 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.16 us = 0.08% latency, 415.02 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.43 us = 0.07% latency, 435.72 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): 
PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (46): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 129.87 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.66 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.33 us = 0.01% latency, 954.44 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.56 us = 0.04% latency, 4.92 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.87 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.71 us = 0.06% latency, 283.13 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.11 us = 0.04% latency, 112.94 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.07 us = 0.05% latency, 305.33 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 348.55 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 324.96 us = 0.08% latency, 422.94 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 310.42 us = 0.07% latency, 442.75 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (47): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.68 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 614.88 us = 0.14% latency, 1.31 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 50.54 us = 0.01% latency, 648.3 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.04 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.57 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% 
MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.52 us = 0.06% latency, 278.75 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.81 us = 0.03% latency, 124.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.49 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.59 us = 0.08% latency, 419.55 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.81 us = 0.07% latency, 439.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.69 us = 0.07% latency, 466.39 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (48): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.89 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.67 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.01% latency, 904.2 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.43 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.55 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.84 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.24 us = 0.06% latency, 277.95 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.59 us = 0.04% latency, 112.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.1 us = 0.03% latency, 116.79 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 130.18 us = 0.03% latency, 131.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.46 us = 0.05% latency, 295.62 
TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.99 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 345.01 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.95 us = 0.07% latency, 436.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.83 us = 0.07% latency, 463.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (49): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.55 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 590.8 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.01% latency, 910.19 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.03 us = 0.04% latency, 4.91 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.28 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.88 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.99 us = 0.06% latency, 283.97 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.92 us = 0.04% latency, 113.84 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 143.29 us = 0.03% latency, 119.9 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.71 us = 0.03% latency, 127.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 140.67 us = 0.03% latency, 122.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.83 us = 0.08% latency, 419.24 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.95 us = 0.07% latency, 436.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (50): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 
5.03 ms = 1.18% latency, 129.9 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 590.32 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.7 us = 0.04% latency, 4.86 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.63 ms = 0.62% latency, 91.3 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.43 us = 0.06% latency, 282.3 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.44 us = 0.04% latency, 114.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.96 us = 0.03% latency, 118.52 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.5 us = 0.05% latency, 303.4 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.15 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.02 us = 0.08% latency, 417.73 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 321.63 us = 0.08% latency, 427.32 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (51): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.63 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.57 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.2 us = 0.01% latency, 776.49 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.32 us = 0.04% latency, 4.93 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 90.96 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.56 us = 0.06% latency, 285.66 TFLOPS, in_features=2048, 
out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.96 us = 0.03% latency, 118.52 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.37 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.57 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.97 us = 0.08% latency, 416.52 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.24 us = 0.07% latency, 437.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (52): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.77 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 587.46 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.36 us = 0.04% latency, 4.96 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.92 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.49 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.19 us = 0.06% latency, 282.58 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.67 us = 0.03% latency, 117.93 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.28 us = 0.03% latency, 128.9 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.98 us = 0.05% latency, 296.23 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 
FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.29 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.64 us = 0.08% latency, 414.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.28 us = 0.07% latency, 438.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 291.11 us = 0.07% latency, 472.12 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (53): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.75 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 589.85 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.79 us = 0.04% latency, 4.92 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.05 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.21 us = 0.05% latency, 302.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.15 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.02 us = 0.08% latency, 417.73 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.28 us = 0.07% latency, 438.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (54): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.66 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.42 us = 0.14% latency, 1.37 
TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.17 us = 0.04% latency, 5 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.83 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.23 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.56 us = 0.06% latency, 285.66 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.63 us = 0.04% latency, 113.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.42 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.81 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.16 us = 0.08% latency, 415.02 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.71 us = 0.07% latency, 436.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (55): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.68 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 585.56 us = 0.14% latency, 1.38 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 961.11 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.36 us = 0.04% latency, 4.96 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.68 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.23 us = 0.06% latency, 283.69 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.34 us = 0.03% latency, 116.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.09 us = 0.03% latency, 125.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.55 us = 0.05% latency, 298.07 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 237.94 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.67 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.78 us = 0.08% latency, 418.03 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.64 us = 0.07% latency, 464.89 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (56): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.23 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 593.66 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 43.15 us = 0.01% latency, 759.33 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.12 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.5 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.68 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 253.92 us = 0.06% latency, 270.64 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.59 us = 0.04% latency, 112.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.7 us = 0.05% latency, 295.32 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.46 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 
68.72 GMACs = 0.26% MACs, 330.92 us = 0.08% latency, 415.32 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.95 us = 0.07% latency, 436.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (57): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.57 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.57 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 166.18 us = 0.04% latency, 4.85 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.14 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.92 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.7 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.96 us = 0.06% latency, 277.14 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.06 us = 0.04% latency, 112.24 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.37 us = 0.03% latency, 130.78 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 224.11 us = 0.05% latency, 306.63 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.74 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.54 us = 0.08% latency, 418.33 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.19 us = 0.07% latency, 436.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (58): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.57 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.43 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.95 us = 0.01% latency, 886.7 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 
166.65 us = 0.04% latency, 4.83 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.79 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.84 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.09 us = 0.06% latency, 280.38 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.28 us = 0.03% latency, 128.9 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.74 us = 0.05% latency, 303.08 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.85 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.52 us = 0.07% latency, 438.37 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.25 us = 0.07% latency, 468.67 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (59): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.13 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 600.34 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.01% latency, 892.46 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.32 us = 0.04% latency, 4.78 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.14 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.44 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.34 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.39 us = 0.06% latency, 275.55 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 164.03 us = 0.04% latency, 104.73 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.71 us = 0.03% latency, 127.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.15 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.3 us = 0.08% latency, 418.64 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 310.66 us = 0.07% latency, 442.41 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 298.5 us = 0.07% latency, 460.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (60): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.74 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 594.62 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.22 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.13 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 465.15 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.66 us = 0.06% latency, 282.03 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.26 us = 0.05% latency, 303.72 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.71 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.49 us = 0.08% latency, 417.12 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% 
Params, 68.72 GMACs = 0.26% MACs, 312.81 us = 0.07% latency, 439.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (61): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.56 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.09 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 961.11 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.47 us = 0.04% latency, 4.72 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.74 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.81 us = 0.06% latency, 279.56 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.5 us = 0.04% latency, 111.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.63 us = 0.03% latency, 117.17 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.26 us = 0.05% latency, 303.72 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 348.83 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 325.2 us = 0.08% latency, 422.63 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 310.66 us = 0.07% latency, 442.41 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.39 us = 0.02% latency, 422.64 GFLOPS) ) ) (62): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.94 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 601.53 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.81 us = 0.01% latency, 941.36 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 173.57 us = 0.04% latency, 4.64 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.57 
us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.28 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.7 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 248.19 us = 0.06% latency, 276.88 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.02 us = 0.04% latency, 111.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 156.64 us = 0.04% latency, 109.68 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 138.04 us = 0.03% latency, 124.45 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.99 us = 0.03% latency, 128.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.56 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.73 us = 0.08% latency, 416.82 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 291.11 us = 0.07% latency, 472.12 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (63): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.79 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 608.92 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.39 us = 0.01% latency, 853.66 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.95 us = 0.04% latency, 4.71 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.43 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.62 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.96 us = 0.06% latency, 277.14 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.11 us = 0.04% latency, 112.94 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.56 us = 0.03% latency, 129.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.21 us = 0.05% latency, 302.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.49 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.59 us = 0.08% latency, 413.23 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) (64): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.06 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 599.62 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.51 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.96 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.86 us = 0.06% latency, 275.03 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.73 us = 0.04% latency, 111.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.74 us = 0.05% latency, 303.08 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.56 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.26 us = 0.08% latency, 417.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): 
Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (65): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.2 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 593.9 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.33 us = 0.01% latency, 954.44 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.8 us = 0.04% latency, 4.77 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.46 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.48 us = 0.06% latency, 277.68 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.83 us = 0.04% latency, 112.41 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.67 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.06 us = 0.08% latency, 418.94 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.04 us = 0.07% latency, 439.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.88 us = 0.07% latency, 464.51 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (66): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 129.04 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 603.68 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.01% latency, 910.19 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 172.38 us = 0.04% latency, 4.67 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.76 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 
90.85 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.37 us = 0.03% latency, 130.78 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.27 us = 0.05% latency, 297.14 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.37 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.6 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.73 us = 0.08% latency, 416.82 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.34 us = 0.07% latency, 433.1 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.12 us = 0.07% latency, 464.14 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (67): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.79 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 589.13 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.89 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.11 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.09 us = 0.06% latency, 286.23 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.59 us = 0.04% latency, 112.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 
TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.6 us = 0.05% latency, 299.3 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.46 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.16 us = 0.08% latency, 415.02 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.02 us = 0.07% latency, 469.05 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (68): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.33 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 166.65 us = 0.04% latency, 4.83 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.92 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.76 us = 0.06% latency, 284.25 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 143.77 us = 0.03% latency, 119.5 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.33 us = 0.03% latency, 125.1 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 130.41 us = 0.03% latency, 131.73 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 234.84 us = 0.05% latency, 292.62 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.25 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 335.93 us = 0.08% latency, 409.13 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 316.62 us = 0.07% latency, 434.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.25 us = 0.07% latency, 468.67 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): 
PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (69): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.72 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 587.22 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.81 us = 0.01% latency, 941.36 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.08 us = 0.04% latency, 4.94 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 90.96 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.34 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.23 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.66 us = 0.06% latency, 282.03 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.1 us = 0.03% latency, 116.79 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.84 us = 0.05% latency, 298.99 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.04 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.21 us = 0.08% latency, 416.22 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314 us = 0.07% latency, 437.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.88 us = 0.07% latency, 464.51 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (70): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.9 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 612.26 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.01% latency, 904.2 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.04 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.29 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.64 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% 
MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.81 us = 0.06% latency, 279.56 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 234.6 us = 0.05% latency, 292.92 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.78 us = 0.08% latency, 418.03 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.71 us = 0.07% latency, 436.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (71): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.85 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 608.21 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.92 us = 0.01% latency, 763.55 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.52 us = 0.04% latency, 4.75 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.05 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.49 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.08 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.72 us = 0.06% latency, 277.41 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.59 us = 0.04% latency, 112.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.81 us = 0.03% latency, 124.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.6 us = 0.05% latency, 299.3 
TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.4 us = 0.08% latency, 414.72 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.52 us = 0.07% latency, 438.37 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (72): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.72 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 596.28 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.43 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.59 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 466.11 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.34 us = 0.06% latency, 274.51 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.26 us = 0.04% latency, 111.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.5 us = 0.05% latency, 303.4 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.99 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 337.12 us = 0.08% latency, 407.68 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (73): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% 
MACs, 5.04 ms = 1.18% latency, 129.67 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.79 us = 0.04% latency, 4.92 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.19 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.95 us = 0.06% latency, 282.86 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.24 us = 0.03% latency, 119.1 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.26 us = 0.05% latency, 303.72 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.5 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.45 us = 0.08% latency, 415.92 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.92 us = 0.07% latency, 466.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (74): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.74 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 586.99 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.81 us = 0.01% latency, 941.36 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.6 us = 0.04% latency, 4.95 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.43 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.87 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 460.62 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.04 us = 0.06% latency, 285.09 TFLOPS, 
in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.72 us = 0.03% latency, 118.71 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.78 us = 0.05% latency, 304.36 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.53 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.21 us = 0.08% latency, 416.22 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (75): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 129.83 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 583.65 us = 0.14% latency, 1.38 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 961.11 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.12 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.84 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.09 us = 0.06% latency, 286.23 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 129.46 us = 0.03% latency, 132.7 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.55 us = 0.05% latency, 298.07 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.09 us = 
0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.36 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.49 us = 0.08% latency, 417.12 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.95 us = 0.07% latency, 436.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.06 us = 0.07% latency, 470.58 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (76): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.42 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 589.37 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.81 us = 0.01% latency, 941.36 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.99 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.85 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.95 us = 0.06% latency, 282.86 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.92 us = 0.04% latency, 113.84 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.72 us = 0.03% latency, 118.71 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.56 us = 0.03% latency, 129.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.36 us = 0.05% latency, 299.62 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 343.98 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 333.31 us = 0.08% latency, 412.35 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.88 us = 0.07% latency, 464.51 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (77): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 130.02 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 586.99 us = 0.14% 
latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 961.11 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.08 us = 0.04% latency, 4.94 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.66 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.63 ms = 0.62% latency, 91.36 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 233.41 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.8 us = 0.06% latency, 285.38 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.37 us = 0.03% latency, 130.78 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.07 us = 0.05% latency, 298.68 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.76 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.25 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.64 us = 0.08% latency, 414.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.28 us = 0.07% latency, 438.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.39 us = 0.02% latency, 422.64 GFLOPS) ) ) (78): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.69 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 599.38 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 43.63 us = 0.01% latency, 751.03 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.32 us = 0.04% latency, 4.93 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 90.95 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.23 us = 0.06% latency, 283.69 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 
TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.3 us = 0.03% latency, 115.85 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 223.88 us = 0.05% latency, 306.95 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.09 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 349.54 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 325.44 us = 0.08% latency, 422.32 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 311.61 us = 0.07% latency, 441.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.02 us = 0.07% latency, 469.05 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (79): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.24 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.67 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.15 us = 0.01% latency, 858.99 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 90.97 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.96 us = 0.06% latency, 277.14 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 223.4 us = 0.05% latency, 307.61 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.18 TFLOPS (gate_proj): 
Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 339.27 us = 0.08% latency, 405.1 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.02 us = 0.07% latency, 469.05 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) ) (patch_embed): PatchEmbed( 133.12 K = 0% Params, 536.87 MMACs = 0% MACs, 528.57 us = 0.12% latency, 2.05 TFLOPS (proj): Conv2d(133.12 K = 0% Params, 536.87 MMACs = 0% MACs, 328.54 us = 0.08% latency, 3.29 TFLOPS, 16, 2048, kernel_size=(2, 2), stride=(2, 2)) ) (rotary_emb): GemmaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 s = 0% latency, 0 FLOPS) (time_proj): Timesteps(0 = 0% Params, 0 MACs = 0% MACs, 375.03 us = 0.09% latency, 0 FLOPS) (timestep_embedder): Sequential( 4.72 M = 0.06% Params, 75.5 MMACs = 0% MACs, 568.63 us = 0.13% latency, 265.6 GFLOPS (0): Linear(526.34 K = 0.01% Params, 8.39 MMACs = 0% MACs, 251.29 us = 0.06% latency, 66.76 GFLOPS, in_features=256, out_features=2048, bias=True) (1): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 51.5 us = 0.01% latency, 636.29 MFLOPS) (2): Linear(4.2 M = 0.05% Params, 67.11 MMACs = 0% MACs, 191.21 us = 0.04% latency, 701.93 GFLOPS, in_features=2048, out_features=2048, bias=True) ) (context_embedder): Sequential( 4.2 M = 0.05% Params, 17.18 GMACs = 0.07% MACs, 483.51 us = 0.11% latency, 71.06 TFLOPS (0): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 190.26 us = 0.04% latency, 0 FLOPS) (1): Linear(4.2 M = 0.05% Params, 17.18 GMACs = 0.07% MACs, 240.56 us = 0.06% latency, 142.83 TFLOPS, in_features=2048, out_features=2048, bias=True) ) (norm_out): AdaLayerNormOut( 8.39 M = 0.1% Params, 134.22 MMACs = 0% MACs, 700.47 us = 0.16% latency, 383.27 GFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(8.39 M = 0.1% Params, 134.22 MMACs = 0% MACs, 268.7 us = 0.06% latency, 999.02 GFLOPS, in_features=2048, out_features=4096, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) ) (proj_out): Linear(131.14 K = 0% Params, 536.87 MMACs = 0% MACs, 185.01 us = 0.04% latency, 5.8 TFLOPS, in_features=2048, out_features=64, bias=True) (repa_projector): Sequential( 9.97 M = 0.12% Params, 40.8 GMACs = 0.16% MACs, 710.01 us = 0.17% latency, 114.96 TFLOPS (0): Linear(4.2 M = 0.05% Params, 17.18 GMACs = 0.07% MACs, 214.58 us = 0.05% latency, 160.13 TFLOPS, in_features=2048, out_features=2048, bias=True) (1): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 246.04 GFLOPS) (2): Linear(4.2 M = 0.05% Params, 17.18 GMACs = 0.07% MACs, 175.95 us = 0.04% latency, 195.28 TFLOPS, in_features=2048, out_features=2048, bias=True) (3): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 29.56 us = 0.01% latency, 283.74 GFLOPS) (4): Linear(1.57 M = 0.02% Params, 6.44 GMACs = 0.02% MACs, 158.55 us = 0.04% latency, 81.27 TFLOPS, in_features=2048, out_features=768, bias=True) ) ) ------------------------------------------------------------------------------
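A note on reading the figures above: for every module the profiler reports MACs per forward pass, and the FLOPS column is simply 2 x MACs divided by that module's measured forward latency. Because a Linear layer's MACs equal tokens x in_features x out_features, the same numbers also reveal how many tokens each projection processed. The snippet below is a small sanity check against two of the entries printed above (the gate_proj of layer (65) and the final proj_out); the 2-FLOPs-per-MAC convention and the token-count inference are the only assumptions.

```python
# Sanity-check the profiler's FLOPS figures for two Linear layers listed
# above, assuming the usual convention of 2 FLOPs per MAC.

def linear_tflops(macs, latency_s):
    """FLOPS achieved by a Linear layer: 2 * MACs / latency, in TFLOPS."""
    return 2 * macs / latency_s / 1e12

# (65).mlp.gate_proj: 68.72 GMACs in 328.06 us -> printed as 418.94 TFLOPS
print(linear_tflops(68.72e9, 328.06e-6))   # ~418.9

# proj_out: 536.87 MMACs in 185.01 us -> printed as 5.8 TFLOPS
print(linear_tflops(536.87e6, 185.01e-6))  # ~5.8

# A Linear's MACs are tokens * in_features * out_features, so the token
# count per forward pass can be backed out from any projection:
tokens = 68.72e9 / (2048 * 8192)           # gate_proj maps 2048 -> 8192
print(round(tokens))                        # ~4096 tokens per forward pass
```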
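The per-layer shapes also pin down what each DiTLayer contains: an AdaLayerNormZero whose Linear maps 2048 -> 12288 (12288 = 6 x 2048, consistent with the usual shift/scale/gate modulation), a DiTSelfAttention with six bias-free projections plus 128-parameter RMS norms on the query and key heads, and a GemmaMLP with 2048 -> 8192 gate/up projections and an 8192 -> 2048 down projection around a tanh-approximated GELU. The sketch below reconstructs those containers in PyTorch from the printed shapes; it is an illustration, not the model's actual code. The gated-MLP forward is the standard Gemma/Llama formulation and is assumed, the attention forward (how image and text tokens are mixed, rotary embeddings, head count) is not visible in the profile and is therefore omitted, and nn.RMSNorm (PyTorch >= 2.4) stands in for GemmaRMSNorm.

```python
from torch import nn

class GemmaMLP(nn.Module):
    # Shapes read off the profile: 2048 -> 8192 gate/up, 8192 -> 2048 down,
    # no biases, tanh-approximated GELU ("PytorchGELUTanh"). The gated
    # forward below is the standard Gemma/Llama MLP and is an assumption.
    def __init__(self, hidden=2048, intermediate=8192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)
        self.act_fn = nn.GELU(approximate="tanh")

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))


class DiTSelfAttention(nn.Module):
    # Containers only: the attention math itself is not recoverable from the
    # profile, so no forward() is given. The 128-dim RMS norms match the
    # 128-parameter q_norm / k_norm entries above.
    def __init__(self, hidden=2048, q_dim=4096, kv_dim=1024, head_dim=128):
        super().__init__()
        self.q_norm = nn.RMSNorm(head_dim)
        self.k_norm = nn.RMSNorm(head_dim)
        self.q_proj = nn.Linear(hidden, q_dim, bias=False)
        self.k_proj = nn.Linear(hidden, kv_dim, bias=False)
        self.v_proj = nn.Linear(hidden, kv_dim, bias=False)
        self.text_k_proj = nn.Linear(hidden, kv_dim, bias=False)
        self.text_v_proj = nn.Linear(hidden, kv_dim, bias=False)
        self.o_proj = nn.Linear(q_dim, hidden, bias=False)


# Parameter counts match the profile: ~50.33 M for the MLP and ~25.17 M for
# the attention block.
print(sum(p.numel() for p in GemmaMLP().parameters()))          # 50331648
print(sum(p.numel() for p in DiTSelfAttention().parameters()))  # 25166080
```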
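The non-transformer modules listed at the end of the profile are fully specified by their printed shapes: a 2x2, stride-2 Conv2d patch embedding from 16 to 2048 channels, a two-layer timestep MLP (256 -> 2048 -> 2048), an RMS-normed context projection, an AdaLayerNormOut whose Linear maps 2048 -> 4096, a 2048 -> 64 output projection, and a three-layer 2048 -> 2048 -> 2048 -> 768 repa_projector. Below is a container-only PyTorch sketch of those pieces; how they are wired together in forward(), and the parameter-free Timesteps and rotary-embedding modules, are not shown in the profile and are left out. nn.RMSNorm again stands in for GemmaRMSNorm.

```python
from torch import nn

hidden = 2048

# Shapes are read directly off the detailed profile above; the wiring between
# these modules is not shown there and is not reproduced.
patch_embed_proj = nn.Conv2d(16, hidden, kernel_size=2, stride=2)  # 133.12 K params

timestep_embedder = nn.Sequential(                                  # 4.72 M params
    nn.Linear(256, hidden, bias=True),
    nn.SiLU(),
    nn.Linear(hidden, hidden, bias=True),
)

context_embedder = nn.Sequential(                                   # 4.2 M params
    nn.RMSNorm(hidden),
    nn.Linear(hidden, hidden, bias=True),
)

norm_out_linear = nn.Linear(hidden, 2 * hidden, bias=True)          # 8.39 M params
proj_out = nn.Linear(hidden, 64, bias=True)                         # 131.14 K params

repa_projector = nn.Sequential(                                     # 9.97 M params
    nn.Linear(hidden, hidden, bias=True), nn.SiLU(),
    nn.Linear(hidden, hidden, bias=True), nn.SiLU(),
    nn.Linear(hidden, 768, bias=True),
)

modules = (patch_embed_proj, timestep_embedder, context_embedder,
           norm_out_linear, proj_out, repa_projector)
total = sum(p.numel() for m in modules for p in m.parameters())
print(f"{total / 1e6:.2f} M")  # ~27.54 M, roughly the parameters outside the layer stack
```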
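Finally, a per-module breakdown like the one above can be generated with DeepSpeed's flops profiler, either by enabling the flops_profiler section of the DeepSpeed config when training through the engine, or manually as sketched below. The manual path follows the documented FlopsProfiler API (start_profile / stop_profile / print_model_profile / end_profile); `model` and `batch` are placeholders, and argument names may vary between DeepSpeed versions, so treat this as an outline rather than a drop-in recipe.

```python
# Outline of manually producing a per-module MACs/latency breakdown with
# DeepSpeed's flops profiler. `model` and `batch` are assumed to exist.
from deepspeed.profiling.flops_profiler import FlopsProfiler

prof = FlopsProfiler(model)

prof.start_profile()              # attach hooks that count MACs per module
outputs = model(**batch)          # the single forward pass being measured
prof.stop_profile()

prof.print_model_profile(
    profile_step=2,               # label for the step at which profiling ran
    module_depth=-1,              # print every depth of the module tree
    top_modules=1,                # modules to list per depth in the aggregated summary
    detailed=True,                # include the per-module detailed profile
)
prof.end_profile()                # remove hooks and free profiler state
```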